
Adding checkpoint_path for resume training #182

Merged · 25 commits merged into main from nikkna/resume_training · Jun 26, 2024
Conversation

@nikk-nikaznan (Collaborator) commented Jun 7, 2024

closes #179

  • Checkpoint options
  • add checkpoint documentation in the guide

@codecov-commenter commented Jun 7, 2024

Codecov Report

Attention: Patch coverage is 15.78947% with 16 lines in your changes missing coverage. Please review.

Project coverage is 37.75%. Comparing base (d7a5cef) to head (7f425cc).

Files Patch % Lines
crabs/detection_tracking/train_model.py 15.78% 16 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #182      +/-   ##
==========================================
- Coverage   37.95%   37.75%   -0.20%     
==========================================
  Files          20       20              
  Lines        1333     1348      +15     
==========================================
+ Hits          506      509       +3     
- Misses        827      839      +12     

☔ View full report in Codecov by Sentry.

@nikk-nikaznan nikk-nikaznan changed the title Adding ckpt_path to fit to resume training Adding ckpt_path for resume training Jun 10, 2024
@nikk-nikaznan nikk-nikaznan changed the title Adding ckpt_path for resume training Adding checkpoint_path for resume training Jun 21, 2024
@nikk-nikaznan nikk-nikaznan marked this pull request as ready for review June 21, 2024 19:35
@nikk-nikaznan nikk-nikaznan requested a review from sfmig June 24, 2024 09:39
@sfmig (Collaborator) left a comment

thanks Nik 🙌

While reviewing this PR I found a super tricky bug - but I think I found a fix too.

the bug

Steps to reproduce:

  • Train a model for one epoch (specifying n_epochs=1 in the yaml file) and save a weights_only checkpoint.
    • The checkpoint is saved at the path_to_checkpoints parameter logged in MLflow (the file is named last.ckpt).
  • Then launch a training job that starts from that checkpoint. Before launching it, edit the config file to have n_epochs=3.
  • In MLflow, this second training job shows the same hyperparameters as the job that produced the checkpoint (so it has n_epochs=1 etc.), but in reality the job runs for as many epochs as the yaml file specifies. So it logs n_epochs=1 but runs for n_epochs=3.

This is a problem because the logged hyperparameters and the hyperparameters actually used don't match.

This does not happen if you restart training from a "full" checkpoint; in that case the .yaml parameters are both logged and used.

the fix

I am still not sure why this is happening, so I opened an issue to investigate further. But I found that we can override the hyperparameters stored in the weights-only checkpoint with the yaml ones, by passing the config to load_from_checkpoint:

lightning_model = FasterRCNN.load_from_checkpoint(
    self.checkpoint_path,
    config=self.config,
)
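As an illustration of the fix, here is a minimal, Lightning-free sketch of the precedence behaviour (a toy stand-in, not the real load_from_checkpoint implementation; the n_epochs values mirror the bug report above, batch_size is a made-up example): keyword arguments passed at load time override the hyperparameters stored in the checkpoint.

```python
# Toy stand-in for the override behaviour: hyperparameters saved in the
# checkpoint are the defaults, and anything passed explicitly wins.
def load_hparams(checkpoint_hparams, **overrides):
    """Merge checkpoint hyperparameters with explicitly passed overrides."""
    hparams = dict(checkpoint_hparams)  # what the weights-only ckpt saved
    hparams.update(overrides)           # explicit kwargs (the yaml config) win
    return hparams

# The weights-only checkpoint logged n_epochs=1 ...
ckpt_hparams = {"n_epochs": 1, "batch_size": 4}

# ... but passing the (edited) yaml values overrides it, so the logged
# and actually-used hyperparameters match again.
print(load_hparams(ckpt_hparams, n_epochs=3))  # {'n_epochs': 3, 'batch_size': 4}
```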

Question

If we make this change, whenever we restart training from a ckpt ("full" or "weights-only"), the hyperparameters used will be the ones from the .yaml config, not the ones from the ckpt.

The question is: would we want the opposite behaviour? (We can't have it right now, because I don't know how to properly fix this bug.) That would mean using the hyperparams from the ckpt when loading a ckpt. It seems convenient, but it sounds like something one would forget about. Let me know what you think.

I guess this also highlights that we need more detailed tests 😢

Review comments (resolved): crabs/detection_tracking/train_model.py (×3), guides/TrainingModelsHPC.md
@nikk-nikaznan (Collaborator, Author)
Good job spotting the bug, and thank you for the review.
Sorry, most of the time I only check the metrics in MLflow. It makes sense: in the original code, when we created the model instance with lightning_model = FasterRCNN(self.config), all the hyperparameters were saved to MLflow at that point. When we call lightning_model = FasterRCNN.load_from_checkpoint(self.checkpoint_path), we recreate the instance, but because we didn't pass any config, everything is taken from the checkpoint. And because save_hyperparameters is called there, it overwrites the config from the yaml. That's why this happened only with the weights-only checkpoint and not the full one.

I don't see anything wrong with the new implementation you proposed; it definitely makes sense. This only happens with the weights-only checkpoint, which is technically the fine-tuning/transfer-learning case, so we only care about the new parameters we use for the fine-tuning. The hyperparameters from the checkpoint can always be traced back.

For the full checkpoint, training will always be based on the checkpoint, since it is resumed training; that's why we need the full state. And if you trained for 10 epochs and don't change the yaml to more than 10, no training will happen, because max_epochs has already been reached.
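The max_epochs point can be sketched with a hypothetical helper (not part of the codebase): resuming a full checkpoint restores the epoch counter, so training only continues if max_epochs exceeds it.

```python
def epochs_remaining(ckpt_epoch: int, max_epochs: int) -> int:
    """Epochs a run resumed from a full checkpoint will actually execute."""
    return max(0, max_epochs - ckpt_epoch)

print(epochs_remaining(10, 10))  # 0: max_epochs already reached, no training
print(epochs_remaining(10, 15))  # 5: training continues for 5 more epochs
```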

Hope this makes sense, happy to chat offline.

@nikk-nikaznan nikk-nikaznan requested a review from sfmig June 26, 2024 09:58
@sfmig (Collaborator) commented Jun 26, 2024

thanks for clarifying Nik!

Some questions:

  • I think with the "full" checkpoint, training continues from the epoch count where it left off, but it uses the batch size from the yaml config. Is that correct, and is it what we expect?
  • The thing I found odd with the "weights-only" checkpoint was that MLflow logged the checkpoint's hyperparameters, but the config yaml's hyperparameters were actually used. I am not sure why that happened, do you know?

We can chat later today if it's easier yes!

@sfmig (Collaborator) left a comment

thanks Nik! 🚀
Just one question and small suggestions

@@ -77,16 +80,16 @@ def setup_trainer(self):
         )

         # Define checkpointing callback for trainer
-        config = self.config.get("checkpoint_saving")
-        if config:
+        config_ckpt = self.config.get("checkpoint_saving")

I like a different name 😁 ✨
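As a side note on the guard in the diff above, here is a minimal sketch of the pattern: dict.get returns None when the checkpoint_saving section is absent from the yaml config, so no checkpointing callback is added. (The tuple appended below is a hypothetical stand-in for constructing Lightning's ModelCheckpoint callback.)

```python
def build_callbacks(config: dict) -> list:
    """Collect trainer callbacks; checkpoint saving is optional in the config."""
    callbacks = []
    config_ckpt = config.get("checkpoint_saving")  # None if section absent
    if config_ckpt:
        # Hypothetical stand-in for ModelCheckpoint(**config_ckpt).
        callbacks.append(("ModelCheckpoint", config_ckpt))
    return callbacks

print(build_callbacks({}))  # []: no checkpoint_saving section, no callback
print(build_callbacks({"checkpoint_saving": {"every_n_epochs": 1}}))
```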

Review comments (resolved): crabs/detection_tracking/train_model.py (×3)
@nikk-nikaznan nikk-nikaznan merged commit 4e70c2c into main Jun 26, 2024
6 checks passed
@nikk-nikaznan nikk-nikaznan deleted the nikkna/resume_training branch June 26, 2024 15:22
sfmig added a commit that referenced this pull request Jul 8, 2024
* adding ckpt_path to fit to resume training

* option to resume or fine tunning

* small changes

* add checkpoint option in the guide

* cleaned up guide

* cleaned up

* tring rename the ckpt

* cleaned up after rebase

* some changes in the guide

* run pre-commit

* fixed test

* parsing the config to model instance during fine-tunning

* small changes on guide

* changes based on the review

* small changes

* Update crabs/detection_tracking/train_model.py

Co-authored-by: sfmig <[email protected]>
Signed-off-by: nikk-nikaznan <[email protected]>

* Update crabs/detection_tracking/train_model.py

Co-authored-by: sfmig <[email protected]>
Signed-off-by: nikk-nikaznan <[email protected]>

* Update crabs/detection_tracking/train_model.py

Co-authored-by: sfmig <[email protected]>
Signed-off-by: nikk-nikaznan <[email protected]>

* cleaned up pre-commit

---------

Signed-off-by: nikk-nikaznan <[email protected]>
Co-authored-by: sfmig <[email protected]>
Successfully merging this pull request may close these issues:

Ability to restart training from a checkpoint
3 participants