Using hyperparameters from a checkpoint that is "weight-only" #194

Open
sfmig opened this issue Jun 25, 2024 · 0 comments
Labels
bug Something isn't working

Collaborator

sfmig commented Jun 25, 2024

If we merge #182 with the suggested fix, we will be able to restart training from a checkpoint. However, we will always use the hparams as specified in the config .yaml file.

Would we want to have the option to use the hparams from the checkpoint?
Checkpoints in PyTorch Lightning include hyperparameters, but it is not clear to me when these are loaded.

If we don't pass config to load_from_checkpoint, we would in principle use the hparams from the checkpoint. However, this leads to a mismatch between the hparams logged in MLflow and the hparams actually used. The call in question is:

lightning_model = FasterRCNN.load_from_checkpoint(
	self.checkpoint_path,
	config=self.config,
)
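
One way to make the choice explicit would be to resolve the two sources of hyperparameters before building the model, and log only the resolved values. The sketch below is a minimal illustration on plain dictionaries: `resolve_hparams` and its `prefer` flag are hypothetical names, not part of this codebase or of PyTorch Lightning, and the values are made up. It assumes the checkpoint's hyperparameters have already been read into a dict (Lightning checkpoints store them under the `hyper_parameters` key when they were saved).

```python
from typing import Any, Dict


def resolve_hparams(
    ckpt_hparams: Dict[str, Any],
    config: Dict[str, Any],
    prefer: str = "config",
) -> Dict[str, Any]:
    """Merge two hyperparameter dicts; the `prefer` source wins on conflicts."""
    if prefer not in ("config", "checkpoint"):
        raise ValueError(f"prefer must be 'config' or 'checkpoint', got {prefer!r}")
    if prefer == "config":
        # Later entries override earlier ones, so config values win.
        return {**ckpt_hparams, **config}
    return {**config, **ckpt_hparams}


# Illustrative values only (not taken from a real run).
ckpt_hparams = {"n_epochs": 1, "learning_rate": 0.001}
config = {"n_epochs": 3}

resolved = resolve_hparams(ckpt_hparams, config, prefer="config")
```

If something along these lines were adopted, the resolved dict could be the single source passed both to the model and to the MLflow logger, so the two cannot diverge.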

To reproduce this bug:

  1. Remove the config argument we pass to FasterRCNN.load_from_checkpoint()
  2. Train a model for one epoch (specifying n_epochs=1 in the yaml file) and save a weights_only checkpoint.
    • The checkpoint is at the path_to_checkpoints parameter logged in MLflow (the name is last.ckpt).
  3. Launch a training job that starts from that checkpoint, after editing the config file to set n_epochs=3.
  4. In MLflow, this second training job is logged with the same hyperparameters as the job that produced the checkpoint (n_epochs=1 etc.), but in reality it runs for as many epochs as specified in the yaml file. So it logs n_epochs=1, but runs for n_epochs=3.
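
The mismatch described in step 4 could be caught with a simple comparison before anything is logged. This is only a sketch: `diff_hparams` is a hypothetical helper, and the two dicts stand in for "what would be logged" and "what the job will actually use"; `batch_size` and its value are made up for illustration.

```python
from typing import Any, Dict, Tuple

_MISSING = "<missing>"  # placeholder for keys present on one side only


def diff_hparams(
    logged: Dict[str, Any], effective: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (logged_value, effective_value)} for every differing key."""
    keys = sorted(set(logged) | set(effective))
    return {
        key: (logged.get(key, _MISSING), effective.get(key, _MISSING))
        for key in keys
        if logged.get(key, _MISSING) != effective.get(key, _MISSING)
    }


# Illustrative values mirroring the reproduction steps above.
logged = {"n_epochs": 1, "batch_size": 4}      # hparams taken from the checkpoint
effective = {"n_epochs": 3, "batch_size": 4}   # hparams read from the yaml file

mismatches = diff_hparams(logged, effective)
if mismatches:
    print(f"Logged and effective hparams differ: {mismatches}")
```

A check like this could either raise an error or emit a warning, depending on whether we want overriding checkpoint hparams to be allowed.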

sfmig added the bug label Jun 25, 2024