Data hosting #100
Codecov Report
@@ Coverage Diff @@
## main #100 +/- ##
==========================================
+ Coverage 39.94% 42.64% +2.70%
==========================================
Files 12 13 +1
Lines 691 734 +43
==========================================
+ Hits 276 313 +37
- Misses 415 421 +6
🍉 OK, this all looks really good.
No comments about the structure or code quality or robustness. (It was left hanging for months and still works ✨.) I've checked and it all works perfectly for me in my existing months-old wazp-env conda environment.
I have one noncritical, somewhat major, comment:
The tools in scripts/generate_sample_projects are all untested. Is it worth, or feasible, to do a dry run of them as part of a testing job? Either tack it onto the end of the current workflow or add a new one.
Suuuper rough sketch:
name: test-tools
on:
  push:
    branches: main
  pull_request:
jobs:
  run_sample_project_gen:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout source
        uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
      - name: Install self
        run: python -m pip install .
      - name: Setup local data
        run: |
          mkdir ./Code/Data/WAZP
          gin clone https://gin.g-node.org/SainsburyWellcomeCentre/WAZP
      - name: Run sample generation
        run: python scripts/generate_sample_projects/main.py
      - name: Check generation was OK
        run: test -f /path/to/output/file.zip
Now that won't improve our test coverage as reported by pytest, but at least we will know that we don't break these tools.
... I also don't insist on this (hence approving). If you think it's worth shooting for right now, that's cool. If it's worth farming this into an issue for future Sam and Niko, that's also cool. If it's too much of a pain then we can chat in the next gemba.
Co-authored-by: Sam Cunliffe <[email protected]>
I thought about it, and this would actually be very hard to do as things stand. The main reason is that these samples are sourced from our non-public internal server storage, so they cannot "really" be tested without access to that source (which the GitHub runners can't have). Downloading the test data from GIN only gives you the output of this pipeline; the input is inaccessible. Of course we still have some options, we could:
Since all the above require considerable work, and this is not a priority right now, I opened an issue for future reference.
Apart from the above, I took care of all smaller comments, so I'm going ahead with the merge 🤞🏼 Thanks a ton @samcunliffe!
Co-authored-by: Sam Cunliffe <[email protected]>
Closes #33
Sample projects
I added some sample WAZP projects to be used for testing, examples and tutorials on an external data repository.
Our hosting platform of choice is called GIN and is maintained by the German Neuroinformatics Node.
GIN has a GitHub-like interface and git-like CLI functionalities.
Project organisation
The projects are stored in folders named after the species - e.g. jewel-wasp (Ampulex compressa).
Each species folder may contain various WAZP sample projects as zipped archives. For example, the jewel-wasp folder contains the following projects:
- short-clips_raw.zip - a project containing short ~10 second clips extracted from raw .avi files.
- short-clips_compressed.zip - same as above, but compressed using the H.264 codec and saved as .mp4 files.
- entire-video_raw.zip - a project containing the raw .avi file of an entire video, ~32 minutes long.
- entire-video_compressed.zip - same as above, but compressed using the H.264 codec and saved as an .mp4 file.
Each WAZP sample project has the following structure:
Fetching projects
To fetch the data from GIN, we use the pooch Python package, which can download data from pre-specified URLs and store them locally for all subsequent uses. It also provides some nice utilities, like verification of sha256 hashes and decompression of archives.
The relevant functionality is implemented in the wazp.datasets.py module. The most important parts of this module are:
- sample_projects registry, which contains a list of the zipped projects and their known hashes.
- find_sample_projects() function, which returns the names of available projects per species, in the form of a dictionary.
- get_sample_project() function, which downloads a project (if not already cached locally), unzips it, and returns the path to the unzipped folder.
Example usage:
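As a rough, self-contained sketch of how these three pieces could fit together with pooch (this is not the actual wazp.datasets.py code): the raw-file URL suffix, the None placeholder hashes, the function signatures and the returned path are all assumptions made for illustration.

```python
from pathlib import Path

# Default cache location, as described under "Local storage".
LOCAL_DATA_DIR = Path.home() / ".WAZP" / "sample_data"

# ASSUMPTION: the "/raw/master" suffix is a guess at how raw files are served.
DATA_URL = "https://gin.g-node.org/SainsburyWellcomeCentre/WAZP/raw/master"

# ASSUMPTION: placeholder registry. A real registry maps each archive to its
# sha256 hash; None tells pooch to skip hash verification.
REGISTRY = {
    "jewel-wasp/short-clips_raw.zip": None,
    "jewel-wasp/short-clips_compressed.zip": None,
    "jewel-wasp/entire-video_raw.zip": None,
    "jewel-wasp/entire-video_compressed.zip": None,
}


def find_sample_projects(registry=REGISTRY) -> dict[str, list[str]]:
    """Return the available project names per species."""
    projects: dict[str, list[str]] = {}
    for key in registry:
        species, archive = key.split("/", 1)
        # "short-clips_raw.zip" -> "short-clips_raw"
        projects.setdefault(species, []).append(Path(archive).stem)
    return projects


def get_sample_project(species: str, project: str) -> Path:
    """Download (if not cached), unzip, and return the unzipped folder."""
    import pooch  # third-party; imported here so listing works without it

    fetcher = pooch.create(
        path=LOCAL_DATA_DIR,
        base_url=f"{DATA_URL}/",
        registry=REGISTRY,
    )
    archive = f"{species}/{project}.zip"
    # By default, pooch.Unzip extracts next to the archive, into a folder
    # named after the archive with ".unzip" appended.
    fetcher.fetch(archive, processor=pooch.Unzip())
    return LOCAL_DATA_DIR / f"{archive}.unzip"


if __name__ == "__main__":
    # Listing needs no network access; fetching does.
    print(find_sample_projects())
```

Calling get_sample_project("jewel-wasp", "short-clips_compressed") in this sketch would trigger the download on first use and hit the local cache afterwards.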
Local storage
By default, the projects are stored in the ~/.WAZP/sample_data folder. This can be changed by setting the LOCAL_DATA_DIR variable in the wazp.datasets.py module.
Adding new projects
Only core WAZP developers may add new projects to the external data repository.
To add a new project, you will need to:
- Log in to GIN by running gin login in a terminal.
- Clone the data repository by running gin get SainsburyWellcomeCentre/WAZP in a terminal.
- Add your new projects, then git add, and git commit, just like you would with a GitHub repository. Make sure to follow the project organisation as described above. Don't forget to modify the README file accordingly.
- Upload the committed changes with gin upload. Latest changes to the repository can be pulled via gin download. gin sync will synchronise the latest changes bidirectionally.
- Determine the sha256 hash of each new project archive by running sha256sum {project-name.zip} in a terminal. Alternatively, you can use pooch to do this for you: python -c "import pooch; print(pooch.file_hash('/path/to/file.zip'))". If you wish to generate a text file containing the hashes of all the files in a given folder, you can use python -c "import pooch; pooch.make_registry('/path/to/folder', 'hash_registry.txt')".
- Update the wazp.datasets.py module on the WAZP GitHub repository by adding the new projects to the sample_projects registry. Make sure to include the correct sha256 hash, as determined in the previous step. Follow all the usual guidelines for contributing code. Additionally, you may want to update the scripts in scripts/generate_sample_projects, depending on how you generated the new projects. Make sure to test whether the new projects can be fetched successfully (see fetching projects above) before submitting your pull request.
You can also perform steps 3-6 via the GIN web interface, if you prefer to avoid using the CLI.
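If neither sha256sum nor pooch is at hand, the sha256 hashes for the registry can also be computed with the standard library alone. This is a minimal sketch; the helper names sha256_of_file and make_hash_registry are made up for illustration and are not part of WAZP or pooch. The output file uses a "name hash" line format, which I believe matches what pooch.make_registry writes.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=65536) -> str:
    """Return the sha256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large videos never sit fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_hash_registry(folder, output_file) -> None:
    """Write one "relative/path hash" line per file found under folder."""
    folder = Path(folder)
    lines = [
        f"{p.relative_to(folder).as_posix()} {sha256_of_file(p)}"
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    ]
    Path(output_file).write_text("\n".join(lines) + "\n")
```

For example, sha256_of_file("/path/to/file.zip") returns the string to paste into the sample_projects registry.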
Using sample projects in tests
I think the best way to do that is through pytest fixtures.
For example, I've added one in tests/test_unit/conftest.py:
This gets the smallest sample project and returns its local path, to be used in tests.
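A fixture along these lines might look roughly as follows. This is only a sketch: fetch_sample_project is a hypothetical stand-in for the real fetching function in wazp.datasets (whose signature I am assuming), and treating short-clips_compressed as the smallest project is also an assumption.

```python
from pathlib import Path

import pytest


def fetch_sample_project(species_name: str, project_name: str) -> Path:
    """Hypothetical stand-in for the real fetching function.

    In an actual conftest.py this would presumably be imported from
    wazp.datasets; here it only builds a path, so nothing is downloaded.
    """
    return Path.home() / ".WAZP" / "sample_data" / species_name / project_name


@pytest.fixture(scope="session")
def sample_project_path() -> Path:
    """Local path of the (assumed) smallest sample project."""
    # Session scope means the project is fetched once per test run.
    return fetch_sample_project("jewel-wasp", "short-clips_compressed")


def test_sample_project_name(sample_project_path):
    """Example test consuming the fixture by naming it as an argument."""
    assert sample_project_path.name == "short-clips_compressed"
```

Any test under tests/test_unit can then request sample_project_path as an argument without importing anything.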