-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix validation and test split not being reproducible #218
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #218 +/- ##
==========================================
+ Coverage 46.68% 47.75% +1.07%
==========================================
Files 24 24
Lines 1476 1493 +17
==========================================
+ Hits 689 713 +24
+ Misses 787 780 -7 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well done for spotting this. I leave a comment, as I am not sure myself if we need two generators.
Other option, can we just use seed_everything?
Cool, I didn't know about this! I think for now I'd prefer to constraint the seeding to the dataset creation, because that is the part I need to be reproducible. But good to have this in the radar. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! I think having separate generators is a good move. Nice test as well! Just some small comments below.
65d1084
to
fb1ad06
Compare
thanks for the help Nik! 🌟 |
Why is this PR needed?
Currently when we create the test and validation splits we don't pass a generator. We do pass one when we create the training split.
This means that given a seed, the splitting of the dataset into
train
andtest-val
sets is reproducible, but the subsequent splitting of thetest-val
set into atest
set and aval
set is not.What does this PR do?
This PR:
Smaller bits
Notes
I decided to pass a different generator for each call to
random_split
to try to make it a bit "future-proof". That way we guarantee the splits are repeatable even if some randomisation code is added in between the two calls torandom_split
.