If we merge #182 with the suggested fix, we will be able to restart training from a checkpoint. However, we will always use the hparams as specified in the config .yaml file.
Would we want to have the option to use the hparams from the checkpoint?
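If we did want that option, one possible shape for it could be a config switch (the `use_checkpoint_hparams` flag below is hypothetical, invented purely for illustration; `FasterRCNN`, `config` and `checkpoint_path` are assumed to be in scope as in this repo's training script):

```python
# Hypothetical switch in the yaml config controlling where hparams come
# from when restarting from a checkpoint (flag name invented here).
if config.get("use_checkpoint_hparams", False):
    # Rely on the hparams stored inside the checkpoint.
    model = FasterRCNN.load_from_checkpoint(checkpoint_path)
else:
    # Current behaviour: the yaml config overrides the stored hparams.
    model = FasterRCNN.load_from_checkpoint(checkpoint_path, config=config)
```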
Checkpoints in PyTorch Lightning include hyperparameters, but it is not clear to me when these are loaded.
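For reference: when a LightningModule calls `self.save_hyperparameters()`, Lightning stores those values in the checkpoint under the `hyper_parameters` key, and `load_from_checkpoint` re-instantiates the module from them unless matching keyword arguments are passed to override them. A minimal sketch to inspect what a given checkpoint carries (the `last.ckpt` file name follows this issue):

```python
import torch

# Load the raw checkpoint dict and look at the stored hyperparameters.
# Lightning saves them under "hyper_parameters" when the module calls
# self.save_hyperparameters(); load_from_checkpoint reads this entry.
# (On recent torch versions you may need weights_only=False here.)
ckpt = torch.load("last.ckpt", map_location="cpu")
print(ckpt.keys())                   # e.g. state_dict, hyper_parameters, ...
print(ckpt.get("hyper_parameters"))  # the hparams the checkpoint would restore
```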
If we don't pass `config` to `load_from_checkpoint`, we would in principle use the hparams from the checkpoint. However, this leads to a mismatch between the hparams logged in MLflow and the actual hparams used.

To reproduce this bug:
1. Remove the `config` argument we pass to `FasterRCNN.load_from_checkpoint()`.
2. Train a model for one epoch (specifying `n_epochs=1` in the yaml file) and save a `weights_only` checkpoint. The checkpoint is saved at the location given by the `path_to_checkpoints` parameter logged in MLflow (the file name is `last.ckpt`).
3. Launch a training job that starts from that checkpoint. Before launching it, edit the config file to have `n_epochs=3`.

In MLflow, this second training job shows the same hyperparameters as the job that produced the checkpoint (so it has `n_epochs=1`, etc.), but in reality the job runs for as many epochs as specified in the yaml file. So it logs `n_epochs=1`, but runs for `n_epochs=3`.
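A rough sketch of why the logged and effective values diverge in this repro (the `FasterRCNN` class and `config` dict are this repo's names; how `n_epochs` is wired into the `Trainer` is an assumption about the training script, not confirmed code):

```python
import pytorch_lightning as pl

# With the config argument removed, the model (and hence the hparams that
# end up logged to MLflow) is built from the checkpoint's stored hparams,
# i.e. n_epochs=1 from the first run.
model = FasterRCNN.load_from_checkpoint("last.ckpt")

# The Trainer is still configured from the freshly read yaml file, so the
# job actually runs for n_epochs=3.
trainer = pl.Trainer(max_epochs=config["n_epochs"])
trainer.fit(model)
```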