
Adding checkpoint_path for resume training #182

Merged · 25 commits merged into main from nikkna/resume_training · Jun 26, 2024
Conversation

@nikk-nikaznan (Collaborator) commented Jun 7, 2024

closes #179

  • Checkpoint options
  • add checkpoint documentation in the guide

@codecov-commenter commented Jun 7, 2024

Codecov Report

Attention: Patch coverage is 15.78947% with 16 lines in your changes missing coverage. Please review.

Project coverage is 37.75%. Comparing base (d7a5cef) to head (7f425cc).

Files Patch % Lines
crabs/detection_tracking/train_model.py 15.78% 16 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #182      +/-   ##
==========================================
- Coverage   37.95%   37.75%   -0.20%     
==========================================
  Files          20       20              
  Lines        1333     1348      +15     
==========================================
+ Hits          506      509       +3     
- Misses        827      839      +12     

☔ View full report in Codecov by Sentry.

@nikk-nikaznan nikk-nikaznan changed the title Adding ckpt_path to fit to resume training Adding ckpt_path for resume training Jun 10, 2024
@nikk-nikaznan nikk-nikaznan changed the title Adding ckpt_path for resume training Adding checkpoint_path for resume training Jun 21, 2024
@nikk-nikaznan nikk-nikaznan marked this pull request as ready for review June 21, 2024 19:35
@nikk-nikaznan nikk-nikaznan requested a review from sfmig June 24, 2024 09:39
@sfmig (Collaborator) left a comment

thanks Nik 🙌

While reviewing this PR I found a super tricky bug - but I think I found a fix too.

the bug

Steps to reproduce:

  • Train a model for one epoch (specifying n_epochs=1 in the yaml file) and save a weights_only checkpoint.
    • The checkpoint is saved at the path_to_checkpoints parameter logged in MLflow (the file is named last.ckpt).
  • Then launch a training job that starts from that checkpoint. Before launching it, edit the config file to have n_epochs=3.
  • In MLflow, this second training job shows the same hyperparameters as the job that produced the checkpoint (so it has n_epochs=1 etc.), but in reality the job runs for as many epochs as the yaml file specifies. So it logs n_epochs=1 but runs for n_epochs=3.

This is a problem because the logged hyperparameters and the hyperparameters actually used don't match.

This does not happen if you restart training from a "full" checkpoint; in that case the .yaml parameters are both logged and used.

the fix

I am still not sure why this is happening, so I opened an issue to investigate further. But I found that we can override the hyperparameters stored in the weights-only checkpoint with the yaml ones, by passing the config to load_from_checkpoint:

lightning_model = FasterRCNN.load_from_checkpoint(
    self.checkpoint_path,
    config=self.config,
)
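As an illustration of the fix, here is a minimal, Lightning-free sketch of the precedence behaviour (a toy stand-in, not the real load_from_checkpoint implementation; the n_epochs values mirror the bug report above, batch_size is a made-up example): keyword arguments passed at load time override the hyperparameters stored in the checkpoint.

```python
# Toy stand-in for the override behaviour: hyperparameters saved in the
# checkpoint are the defaults, and anything passed explicitly wins.
def load_hparams(checkpoint_hparams, **overrides):
    """Merge checkpoint hyperparameters with explicitly passed overrides."""
    hparams = dict(checkpoint_hparams)  # what the weights-only ckpt saved
    hparams.update(overrides)           # explicit kwargs (the yaml config) win
    return hparams

# The weights-only checkpoint logged n_epochs=1 ...
ckpt_hparams = {"n_epochs": 1, "batch_size": 4}

# ... but passing the (edited) yaml values overrides it, so the logged
# and actually-used hyperparameters match again.
print(load_hparams(ckpt_hparams, n_epochs=3))  # {'n_epochs': 3, 'batch_size': 4}
```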

Question

If we make this change, whenever we restart training from a ckpt ("full" or "weights-only"), the hyperparameters used will be the ones from the .yaml config, not the ones from the ckpt.

The question is: would we want the opposite behaviour? (We can't have it right now, because I don't know how to properly fix this bug.) That would mean using the hyperparams from the ckpt when loading a ckpt. It seems convenient, but it sounds like something one would forget about. Let me know what you think.

I guess this also highlights that we need more detailed tests 😢

Review comments (resolved): crabs/detection_tracking/train_model.py (×3), guides/TrainingModelsHPC.md
@nikk-nikaznan (Collaborator, Author)
Good job spotting the bug, and thank you for the review.
Sorry, most of the time I only check the metrics in MLflow. It makes sense: in the original code, when we created the model instance with lightning_model = FasterRCNN(self.config), all the hyperparameters were saved to MLflow at that point. When we call lightning_model = FasterRCNN.load_from_checkpoint(self.checkpoint_path), we recreate the instance, but because we didn't pass any config, everything is taken from the checkpoint. And because save_hyperparameters is called there, it overwrites the config from the yaml. That's why this happened only with the weights-only checkpoint and not the full one.

I don't see anything wrong with the new implementation you proposed; it definitely makes sense. This only happens with the weights-only checkpoint, which is technically the fine-tuning/transfer-learning case, so we only care about the new parameters we use for the fine-tuning. The hyperparameters from the checkpoint can always be traced back.

For the full checkpoint, training will always be based on the checkpoint, since it is resumed training; that's why we need the full state. And if you trained for 10 epochs and don't change the yaml to more than 10, no training will happen, because max_epochs has already been reached.
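The max_epochs point can be sketched with a hypothetical helper (not part of the codebase): resuming a full checkpoint restores the epoch counter, so training only continues if max_epochs exceeds it.

```python
def epochs_remaining(ckpt_epoch: int, max_epochs: int) -> int:
    """Epochs a run resumed from a full checkpoint will actually execute."""
    return max(0, max_epochs - ckpt_epoch)

print(epochs_remaining(10, 10))  # 0: max_epochs already reached, no training
print(epochs_remaining(10, 15))  # 5: training continues for 5 more epochs
```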

Hope this makes sense, happy to chat offline.

@nikk-nikaznan nikk-nikaznan requested a review from sfmig June 26, 2024 09:58
@sfmig (Collaborator) commented Jun 26, 2024

thanks for clarifying Nik!

Some questions:

  • I think with the "full" checkpoint, training continues from the epoch count where it left off, but it uses the batch size from the yaml config. Is that correct, and is it what we expect?
  • The thing I found odd with the "weights-only" checkpoint was that MLflow logged the checkpoint's hyperparameters, but the config yaml's hyperparameters were actually used. I am not sure why that happened, do you know?

We can chat later today if it's easier yes!

@sfmig (Collaborator) left a comment

thanks Nik! 🚀
Just one question and small suggestions

@@ -77,16 +80,16 @@ def setup_trainer(self):
         )

         # Define checkpointing callback for trainer
-        config = self.config.get("checkpoint_saving")
-        if config:
+        config_ckpt = self.config.get("checkpoint_saving")

I like a different name 😁 ✨
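As a side note on the guard in the diff above, here is a minimal sketch of the pattern: dict.get returns None when the checkpoint_saving section is absent from the yaml config, so no checkpointing callback is added. (The tuple appended below is a hypothetical stand-in for constructing Lightning's ModelCheckpoint callback.)

```python
def build_callbacks(config: dict) -> list:
    """Collect trainer callbacks; checkpoint saving is optional in the config."""
    callbacks = []
    config_ckpt = config.get("checkpoint_saving")  # None if section absent
    if config_ckpt:
        # Hypothetical stand-in for ModelCheckpoint(**config_ckpt).
        callbacks.append(("ModelCheckpoint", config_ckpt))
    return callbacks

print(build_callbacks({}))  # []: no checkpoint_saving section, no callback
print(build_callbacks({"checkpoint_saving": {"every_n_epochs": 1}}))
```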

Review comments (resolved): crabs/detection_tracking/train_model.py (×3)
@nikk-nikaznan nikk-nikaznan merged commit 4e70c2c into main Jun 26, 2024
6 checks passed
@nikk-nikaznan nikk-nikaznan deleted the nikkna/resume_training branch June 26, 2024 15:22
sfmig added a commit that referenced this pull request Jul 8, 2024
* adding ckpt_path to fit to resume training

* option to resume or fine tunning

* small changes

* add checkpoint option in the guide

* cleaned up guide

* cleaned up

* tring rename the ckpt

* cleaned up after rebase

* some changes in the guide

* run pre-commit

* fixed test

* parsing the config to model instance during fine-tunning

* small changes on guide

* changes based on the review

* small changes

* Update crabs/detection_tracking/train_model.py

Co-authored-by: sfmig <[email protected]>
Signed-off-by: nikk-nikaznan <[email protected]>

* Update crabs/detection_tracking/train_model.py

Co-authored-by: sfmig <[email protected]>
Signed-off-by: nikk-nikaznan <[email protected]>

* Update crabs/detection_tracking/train_model.py

Co-authored-by: sfmig <[email protected]>
Signed-off-by: nikk-nikaznan <[email protected]>

* cleaned up pre-commit

---------

Signed-off-by: nikk-nikaznan <[email protected]>
Co-authored-by: sfmig <[email protected]>
Successfully merging this pull request may close these issues:

Ability to restart training from a checkpoint
3 participants