Data hosting (#100)

Co-authored-by: Sam Cunliffe <[email protected]>
SainsburyWellcomeCentre · Nov 10, 2023 · 4993bbc
1 parent b329030
commit 4993bbc

Showing 136 changed files with 960 additions and 1,024 deletions.
diff --git a/.github/workflows/test_and_deploy.yml b/.github/workflows/test_and_deploy.yml
@@ -36,6 +36,14 @@ jobs:
         - os: windows-latest
           python-version: "3.10"
     steps:
+      # Cache the test data to avoid re-downloading
+      - name: Cache Test Data
+        uses: actions/cache@v3
+        with:
+          path: ${{ github.workspace }}/.WAZP/*
+          key: cached-test-data
+          enableCrossOsArchive: true
+
       # A hack because chrome isn't in the PATH on Windows
       - name: Fix Chrome application path on Windows
         if: matrix.os == 'windows-latest'

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -124,6 +124,12 @@ For Windows, be sure to download the ``chromedriver_win32.zip`` file, extract th
 
 It's a good idea to test locally before pushing. Pytest will run all tests and also report test coverage.
 
+#### Test data
+For some tests, you will need to use real experimental data.
+We store some sample projects in an external data repository.
+See [sample projects](#sample-projects) for more information.
+
+
 ### Continuous integration
 All pushes and pull requests will be built by [GitHub actions](https://docs.github.com/en/actions). This will usually include linting, testing and deployment.
 
@@ -139,7 +145,7 @@ We use [semantic versioning](https://semver.org/), which includes `MAJOR`.`MINOR
 * MINOR = new feature
 * MAJOR = breaking change
 
-We use [`setuptools_scm`](https://github.com/pypa/setuptools_scm) to automatically version WAZP. It has been pre-configured in the `pyproject.toml` file. [`setuptools_scm` will automatically infer the version using git](https://github.com/pypa/setuptools_scm#default-versioning-scheme). To manually set a new semantic version, create a tag and make sure the tag is pushed to GitHub. Make sure you commit any changes you wish to be included in this version. E.g. to bump the version to `1.0.0`:
+We use [`setuptools_scm`](https://github.com/pypa/setuptools_scm) to automatically version WAZP. It has been pre-configured in the `pyproject.toml` file. `setuptools_scm` will automatically infer the version using git. To manually set a new semantic version, create a tag and make sure the tag is pushed to GitHub. Make sure you commit any changes you wish to be included in this version. E.g. to bump the version to `1.0.0`:
 
 ```sh
 git add .
@@ -175,8 +181,6 @@ If you create a new documentation source file (e.g. `my_new_file.md` or `my_new_
    my_new_file
 ```
 
-
-
 ### Building the documentation locally
 We recommend that you build and view the documentation website locally, before you push it.
 To do so, first install the requirements for building the documentation:
@@ -197,5 +201,78 @@ rm -rf docs/build
 sphinx-build docs/source docs/build
 ```
 
+## Sample projects
+
+We maintain some sample WAZP projects on an [external data repository](https://gin.g-node.org/SainsburyWellcomeCentre/WAZP), to be used for testing, examples and tutorials.
+Our hosting platform of choice is called [GIN](https://gin.g-node.org/) and is maintained by the [German Neuroinformatics Node](https://www.g-node.org/).
+GIN has a GitHub-like interface and git-like [CLI](https://gin.g-node.org/G-Node/Info/wiki/GIN+CLI+Setup#quickstart) functionality.
+
+### Project organisation
+
+The projects are stored in folders named after the species - e.g. `jewel-wasp` (*Ampulex compressa*).
+Each species folder may contain various WAZP sample projects as zipped archives. For example, the `jewel-wasp` folder contains the following projects:
+- `short-clips_raw.zip` - a project containing short ~10 second clips extracted from raw .avi files.
+- `short-clips_compressed.zip` - same as above, but compressed using the H.264 codec and saved as .mp4 files.
+- `entire-video_raw.zip` - a project containing the raw .avi file of an entire video, ~32 minutes long.
+- `entire-video_compressed.zip` - same as above, but compressed using the H.264 codec and saved as an .mp4 file.
+
+Each WAZP sample project has the following structure:
+```
+{project-name}.zip
+    ├── videos
+    │   ├── {video1-name}.{ext}
+    │   ├── {video1-name}.metadata.yaml
+    │   ├── {video2-name}.{ext}
+    │   ├── {video2-name}.metadata.yaml
+    │   └── ...
+    ├── pose_estimation_results
+    │   ├── {video1-name}{model-name}.h5
+    │   ├── {video2-name}{model-name}.h5
+    │   └── ...
+    ├── WAZP_config.yaml
+    └── metadata_fields.yaml
+```
+To learn more about how the sample projects were generated, see `scripts/generate_sample_projects` in the [WAZP GitHub repository](https://github.com/SainsburyWellcomeCentre/WAZP).
+
+### Fetching projects
+To fetch the data from GIN, we use the [pooch](https://www.fatiando.org/pooch/latest/index.html) Python package, which can download data from pre-specified URLs and store them locally for all subsequent uses. It also provides some nice utilities, like verification of sha256 hashes and decompression of archives.
+
+The relevant functionality is implemented in the `wazp.datasets.py` module. The most important parts of this module are:
+
+1. The `sample_projects` registry, which contains a list of the zipped projects and their known hashes.
+2. The `find_sample_projects()` function, which returns the names of available projects per species, in the form of a dictionary.
+3. The `get_sample_project()` function, which downloads a project (if not already cached locally), unzips it, and returns the path to the unzipped folder.
+
+Example usage:
+```python
+>>> from wazp.datasets import find_sample_projects, get_sample_project
+
+>>> projects_per_species = find_sample_projects()
+>>> print(projects_per_species)
+{'jewel-wasp': ['short-clips_raw', 'short-clips_compressed', 'entire-video_raw', 'entire-video_compressed']}
+
+>>> project_path = get_sample_project('jewel-wasp', 'short-clips_raw')
+>>> print(project_path)
+/home/user/.WAZP/sample_data/jewel-wasp/short-clips_raw
+```
+
+### Local storage
+By default, the projects are stored in the `~/.WAZP/sample_data` folder. This can be changed by setting the `LOCAL_DATA_DIR` variable in the `wazp.datasets.py` module.
+
+### Adding new projects
+Only core WAZP developers may add new projects to the external data repository.
+To add a new project, you will need to:
+
+1. Create a [GIN](https://gin.g-node.org/) account
+2. Ask to be added as a collaborator on the [WAZP data repository](https://gin.g-node.org/SainsburyWellcomeCentre/WAZP) (if not already)
+3. Download the [GIN CLI](https://gin.g-node.org/G-Node/Info/wiki/GIN+CLI+Setup#quickstart) and set it up with your GIN credentials, by running `gin login` in a terminal.
+4. Clone the WAZP data repository to your local machine, by running `gin get SainsburyWellcomeCentre/WAZP` in a terminal.
+5. Add your new projects, then commit them by running `gin commit -m <message> <filename>`. Make sure to follow the [project organisation](#project-organisation) as described above. Don't forget to modify the README file accordingly.
+6. Upload the committed changes to the GIN repository, by running `gin upload`. The latest changes to the repository can be pulled via `gin download`. `gin sync` will synchronise the latest changes bidirectionally.
+7. Determine the sha256 checksum hash of each new project archive, by running `sha256sum {project-name}.zip` in a terminal. Alternatively, you can use `pooch` to do this for you: `python -c "import pooch; print(pooch.file_hash('/path/to/file.zip'))"`. If you wish to generate a text file containing the hashes of all the files in a given folder, you can use `python -c "import pooch; pooch.make_registry('/path/to/folder', 'hash_registry.txt')"`.
+8. Update the `wazp.datasets.py` module on the [WAZP GitHub repository](https://github.com/SainsburyWellcomeCentre/WAZP) by adding the new projects to the `sample_projects` registry. Make sure to include the correct sha256 hash, as determined in the previous step. Follow all the usual [guidelines for contributing code](#contributing-code). Additionally, you may want to update the scripts in `scripts/generate_sample_projects`, depending on how you generated the new projects. Make sure to test whether the new projects can be fetched successfully (see [fetching projects](#fetching-projects) above) before submitting your pull request.
+
+You can also perform steps 3-6 via the GIN web interface, if you prefer to avoid using the CLI.
+
 ## Template
 This package layout and configuration (including pre-commit hooks and GitHub actions) have been copied from the [python-cookiecutter](https://github.com/SainsburyWellcomeCentre/python-cookiecutter) template.
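
The hashing step in "Adding new projects" above can also be done with the Python standard library alone. The following is a minimal sketch, not code from this commit: `sha256_of_file` and `hash_archives_per_species` are hypothetical helpers, and the `{species}/{project-name}.zip` folder layout is assumed from the "Project organisation" section.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex sha256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large video archives need not fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_archives_per_species(data_dir: Path) -> dict[str, dict[str, str]]:
    """Map species -> {project name -> sha256} for every zipped project.

    Assumes (hypothetically) that archives live at
    data_dir/{species}/{project-name}.zip.
    """
    hashes: dict[str, dict[str, str]] = {}
    for archive in sorted(data_dir.glob("*/*.zip")):
        species = archive.parent.name
        hashes.setdefault(species, {})[archive.stem] = sha256_of_file(archive)
    return hashes
```

Calling `hash_archives_per_species(Path("/path/to/local/clone"))` on a local clone of the data repository would give one digest per archive, which should match the output of `sha256sum` for the same file.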
diff --git a/MANIFEST.in b/MANIFEST.in
@@ -4,9 +4,8 @@ include *.md
 recursive-include wazp/*.py
 recursive-include wazp/pages *.py
 
-recursive-exclude sample_project *.avi
-recursive-exclude sample_project *.h5
 recursive-exclude docs *
+recursive-exclude scripts *
 recursive-exclude * __pycache__
 recursive-exclude * *.py[co]
 

diff --git a/docs/requirements.txt b/docs/requirements.txt
@@ -3,5 +3,5 @@ myst-parser
 nbsphinx
 pydata-sphinx-theme
 setuptools-scm
-sphinx
+sphinx>=7.1
 sphinx-autodoc-typehints
diff --git a/docs/source/conf.py b/docs/source/conf.py
@@ -83,12 +83,17 @@
     "**/includes/**",
 ]
 
+# Don't check the anchors for the following URLs during linkcheck
+linkcheck_anchors_ignore_for_url = [
+    "https://gin.g-node.org/G-Node/Info/wiki/",
+]
+
 # -- Options for HTML output -------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
 html_theme = "pydata_sphinx_theme"
 html_title = "wazp"
 
-# Cutomize the theme
+# Customize the theme
 html_theme_options = {
     "icon_links": [
         {
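
To illustrate the `linkcheck_anchors_ignore_for_url` setting added to `docs/source/conf.py` above, here is a rough sketch of which links it would exempt from anchor checking. It assumes that Sphinx treats each entry as a regular expression matched against the start of the URL; `skips_anchor_check` is a hypothetical helper written for this illustration, not a Sphinx API.

```python
import re

# Same pattern list as in the conf.py change above
linkcheck_anchors_ignore_for_url = [
    "https://gin.g-node.org/G-Node/Info/wiki/",
]


def skips_anchor_check(url: str, patterns=linkcheck_anchors_ignore_for_url) -> bool:
    """Return True if the URL (ignoring its #anchor) matches any pattern.

    Assumption: patterns are regular expressions anchored at the start
    of the URL, as with re.match.
    """
    base_url = url.split("#", 1)[0]
    return any(re.match(pattern, base_url) for pattern in patterns)
```

Under that assumption, the GIN CLI wiki link used in CONTRIBUTING.md (`.../G-Node/Info/wiki/GIN+CLI+Setup#quickstart`) would have its `#quickstart` anchor skipped, while links to other GIN pages would still have their anchors checked.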

diff --git a/pyproject.toml b/pyproject.toml
@@ -25,7 +25,9 @@ dependencies = [
   "PyYAML",
   "shapely",
   "openpyxl",
-  "defusedxml"
+  "defusedxml",
+  "pooch",
+  "tqdm",
 ]
 
 classifiers = [