Running experiments systematically in the cluster #145

Closed
11 tasks done
Tracked by #134
sfmig opened this issue Apr 5, 2024 · 0 comments
sfmig commented Apr 5, 2024

There are a few things to polish before we can run experiments systematically on the cluster.

Experiments = training a detector with different (hyper)parameters, for at least 3 dataset splits, and comparing them.

Some bits noted so far:

  • Write a bash script for launching training jobs on the cluster.
    • Fix out-of-memory (OOM) issues.
    • Agree on how software environments are handled on the cluster.
  • Save checkpoints every n epochs (right now we only save the final model); see the checkpointing sketch below.
  • Log loss vs. epoch so we can monitor training; see the MLflow logging sketch below.
  • Use the validation set during training (also covered in the logging sketch below).
  • Link the SLURM job ID, the MLflow run and the final model (right now it is tricky to connect them all); see the tagging sketch below.
  • Verify that we can overfit the training dataset; right now the loss still seems a bit high after ~100 epochs. See the overfitting sketch below.
    • Check first without data augmentation.
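
For the checkpointing item, a minimal sketch assuming the training loop is (or will be) built on PyTorch Lightning; the paths, filename pattern and checkpoint interval are placeholders, not decisions:

```python
# Sketch: save a checkpoint every n epochs instead of only the final model.
# Assumes a PyTorch Lightning training loop; names/values are placeholders.
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints/",           # where to write .ckpt files
    filename="detector-{epoch:02d}",  # one file per saved epoch
    every_n_epochs=5,                 # the "n" from the task list; value TBD
    save_top_k=-1,                    # keep all periodic checkpoints
    save_last=True,                   # also keep the final model, as now
)

trainer = Trainer(max_epochs=100, callbacks=[checkpoint_callback])
# trainer.fit(model, train_dataloader, val_dataloader)
```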
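For logging loss vs. epoch and using the validation set, a sketch of per-epoch metric logging with MLflow; `train_one_epoch` and `evaluate` are hypothetical helpers standing in for the project's training and validation code:

```python
# Sketch: log per-epoch train/validation loss to MLflow so training can be
# monitored. `train_one_epoch`/`evaluate` and the loaders are placeholders.
import mlflow

num_epochs = 100  # placeholder value

with mlflow.start_run():
    mlflow.log_param("max_epochs", num_epochs)
    for epoch in range(num_epochs):
        train_loss = train_one_epoch(model, train_loader, optimizer)  # hypothetical helper
        val_loss = evaluate(model, val_loader)                        # hypothetical helper
        # step=epoch gives loss-vs-epoch curves in the MLflow UI
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_loss", val_loss, step=epoch)
```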
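For linking the SLURM job ID, the MLflow run and the final model, one option is to tag the run with the job ID that SLURM exports in the environment and to record where the final model ends up; the tag/param names and path pattern below are assumptions:

```python
# Sketch: tag the MLflow run with the SLURM job ID so job, run and model
# can be cross-referenced later. Tag, param and path names are placeholders.
import os
import mlflow

with mlflow.start_run() as run:
    # SLURM sets SLURM_JOB_ID inside a job; fall back to "local" otherwise
    slurm_job_id = os.environ.get("SLURM_JOB_ID", "local")
    mlflow.set_tag("slurm_job_id", slurm_job_id)
    # embed both IDs in the final checkpoint filename
    ckpt_path = f"checkpoints/run_{run.info.run_id}_slurm_{slurm_job_id}.ckpt"
    mlflow.log_param("final_model_path", ckpt_path)
```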
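For the overfitting check, a sketch of the usual sanity test: train on a handful of samples with augmentations disabled and confirm the training loss goes to ~0; `DetectionDataset` and `train_annotations` are placeholders for the project's actual dataset class and annotation file:

```python
# Sketch: overfit a tiny fixed subset of the training data, with no data
# augmentation, as a sanity check. Dataset and annotation names are placeholders.
from torch.utils.data import DataLoader, Subset

plain_dataset = DetectionDataset(train_annotations, transforms=None)  # no augmentation
tiny_subset = Subset(plain_dataset, indices=list(range(8)))           # a handful of samples
tiny_loader = DataLoader(tiny_subset, batch_size=2, shuffle=True)

# Train for a few hundred epochs on `tiny_loader`; if the training loss does
# not approach zero, suspect a bug in the targets, the loss, or the optimiser
# settings rather than a lack of model capacity.
```
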
sfmig self-assigned this Apr 5, 2024
samcunliffe mentioned this issue Mar 22, 2024
sfmig closed this as completed Apr 12, 2024