Improvements on "docker biocontainers" to bio.tools metadata sync #12

Open
hmenager opened this issue Jan 10, 2024 · 7 comments

Comments

@hmenager

(discussed with @mboudet today)
The CI process that updates the metadata in the RSEc each time a pull request is merged into the BioContainers containers repository (implemented in https://github.com/BioContainers/ci/blob/master/github-ci/src/biocontainersci/biotools.py) has a few flaws that need to be addressed:

Unique biocontainers filenames

We need to generate unique filenames for the generated biocontainers metadata files, e.g. https://github.com/research-software-ecosystem/content/blob/master/data/fastqc/fastqc.biocontainers.yaml instead of data/fastqc/biocontainers.yaml. The new filename pattern is data/[bio.tools ID]/[biocontainers ID].biocontainers.yaml. This avoids collisions when multiple containers refer to the same software in bio.tools: without it, any new container wrapping a bio.tools entry already packaged in another container would end up replacing the contents of the previous file.
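
As a rough illustration of the proposed pattern, here is a minimal sketch. The helper name `biocontainers_metadata_path` and the second container ID (`fastqc-other`) are hypothetical and only serve the example; they are not part of biotools.py.

```python
from pathlib import PurePosixPath


def biocontainers_metadata_path(biotools_id: str, container_id: str) -> PurePosixPath:
    """Build data/[bio.tools ID]/[biocontainers ID].biocontainers.yaml."""
    return PurePosixPath("data") / biotools_id / f"{container_id}.biocontainers.yaml"


# Two containers wrapping the same bio.tools entry get distinct files,
# so the second one no longer overwrites the first.
first = biocontainers_metadata_path("fastqc", "fastqc")
second = biocontainers_metadata_path("fastqc", "fastqc-other")  # hypothetical container
print(first)
print(second)
```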

Generate files locally

biocontainers metadata files should be generated, at least as an option, in a local copy of the git repository, instead of creating a pull request, for easier testing.

Batch files generation

It would be practical to enable generating/updating metadata files for all the containers available in the repository, instead of only one, crawling all Dockerfile files in a local checkout of BioContainers/containers, and generating/updating the *.biocontainers.yaml files of a local checkout of research-software-ecosystem/content.
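
A minimal sketch of what such a batch mode could look like, assuming every recipe in the local checkout is a file literally named `Dockerfile`. The function names and the `write_metadata` callback are hypothetical placeholders for whatever biotools.py ends up exposing.

```python
from pathlib import Path
from typing import Callable, Iterator


def find_dockerfiles(containers_checkout: Path) -> Iterator[Path]:
    """Yield every file named 'Dockerfile' under the checkout, in sorted order."""
    yield from sorted(containers_checkout.rglob("Dockerfile"))


def batch_update(
    containers_checkout: Path,
    content_checkout: Path,
    write_metadata: Callable[[Path, Path], None],
) -> int:
    """Call write_metadata(dockerfile, content_checkout) for each Dockerfile.

    write_metadata is a hypothetical callback standing in for the code that
    generates or updates one *.biocontainers.yaml file.
    Returns the number of Dockerfiles processed.
    """
    count = 0
    for dockerfile in find_dockerfiles(containers_checkout):
        write_metadata(dockerfile, content_checkout)
        count += 1
    return count
```

Passing the writer as a callback keeps the crawl independent from the YAML generation, which should also make it easier to test against a local copy.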

Review metadata mapping

Do an exhaustive metadata review, to check that all metadata (at least LABEL, FROM, MAINTAINER) are mapped to the yaml file.
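
To support that review, a small sketch that lists what a Dockerfile actually declares could help. It is a simplified line-based reader, not a full Dockerfile parser: it assumes instructions are separated from their arguments by a space and that LABEL uses the `key=value` form. The function name is hypothetical.

```python
import shlex


def extract_dockerfile_metadata(text: str) -> dict:
    """Collect FROM, MAINTAINER and LABEL values from Dockerfile text."""
    metadata = {"from": [], "maintainer": None, "labels": {}}
    # Join lines ending in a backslash so multi-line instructions become one.
    logical_lines = text.replace("\\\n", " ").splitlines()
    for line in logical_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        instruction, _, rest = line.partition(" ")
        instruction = instruction.upper()
        rest = rest.strip()
        if instruction == "FROM":
            metadata["from"].append(rest)
        elif instruction == "MAINTAINER":
            metadata["maintainer"] = rest
        elif instruction == "LABEL":
            # shlex handles quoted values such as software="fastqc".
            for token in shlex.split(rest):
                key, sep, value = token.partition("=")
                if sep:
                    metadata["labels"][key] = value
    return metadata
```

Comparing the keys returned here with the keys written to the yaml file would show which metadata are currently dropped.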

hmenager added a commit to research-software-ecosystem/content that referenced this issue Jan 17, 2024
Current files in the form *.biocontainers.yaml come from a parallel
repository, https://github.com/BioContainers/tools-metadata/, which
hasn't been updated for the last three years. Removing this.
biocontainers sync will be revisited (see
BioContainers/ci#12 and
BioContainers/ci#13).
@mboudet
Contributor

mboudet commented Feb 9, 2024

Regarding the new 'filepath' (data/[bio.tools ID]/[biocontainers ID].biocontainers.yaml):

The 'bio.tools ID' is an optional part of the submitted Dockerfile (as some tools do not have a bio.tools ID). How should we manage this situation?

(I believe we currently use the biotools.id in the path if available, and otherwise fall back to the software name taken from the path of the provided Dockerfile.)

@mboudet
Contributor

mboudet commented Feb 19, 2024

Also, regarding the 'biocontainers ID': what should we use? [Tool_name]_[version]?

(E.g. diann:1.8.1_cv2? Or should we remove the _cv2, to make sure we update the tool yaml rather than create a new one?)

The cv1 / cv2 suffix is linked to the 'biocontainer Dockerfile version', not to the tool version itself (ex here). Should we have separate files?

As an example, with the 'cadd-with-script' PR, using the cadd biotool id (cadd_phred), we would have:

cadd_phred/cadd-scripts-with-envs_1.6.post1_cv1.yaml
cadd_phred/cadd-scripts-with-envs_1.6_cv1.yaml

Each update to the Dockerfile (for the same version of cadd) would add another file:

cadd_phred/cadd-scripts-with-envs_1.6.post1_cv2.yaml
cadd_phred/cadd-scripts-with-envs_1.6.post1_cv1.yaml
cadd_phred/cadd-scripts-with-envs_1.6_cv1.yaml

And if we had a PR with cadd itself (instead of cadd-scripts-xxx), it would be

cadd_phred/cadd-scripts-with-envs_1.6.post1_cv2.yaml
cadd_phred/cadd-scripts-with-envs_1.6.post1_cv1.yaml
cadd_phred/cadd-scripts-with-envs_1.6_cv1.yaml
cadd_phred/cadd_1.6.post1_cv1.yaml
cadd_phred/cadd_1.6_cv1.yaml

@hmenager
Author

Regarding the new 'filepath' (data/[bio.tools ID]/[biocontainers ID].biocontainers.yaml):

The 'bio.tools ID' is an optional part of the submitted dockerfile (as some tools do not have a biotool ID). How should we manage this situation?

(Currently, we already use the biotools.id in the path if available, else we default to the software name in the path of the provided dockerfile I believe).

So, the way the import works now (it didn't use to) is:

  1. All containers are imported to imports/biocontainers/[biocontainers ID].biocontainers.yaml
  2. If they additionally have a bio.tools ID, they are also imported to data/[bio.tools ID]/[biocontainers ID].biocontainers.yaml
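
In code, the two rules above could be sketched roughly as follows. The function name `target_paths` is hypothetical; the paths follow the patterns given above.

```python
from typing import List, Optional


def target_paths(container_id: str, biotools_id: Optional[str] = None) -> List[str]:
    """Return every file a container's metadata should be written to."""
    # Rule 1: every container gets an entry under imports/biocontainers/.
    paths = [f"imports/biocontainers/{container_id}.biocontainers.yaml"]
    # Rule 2: containers with a bio.tools ID also get an entry under data/.
    if biotools_id:
        paths.append(f"data/{biotools_id}/{container_id}.biocontainers.yaml")
    return paths
```

A container without a bio.tools ID therefore still ends up in the repository, only under imports/.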

@hmenager
Author

Also, regarding the 'biocontainer ID': what should we use? [Tool_name]_[version] ?

(IE: diann:1.8.1_cv2 ? Or should we remove the _cv2, to make sure we update the tool yaml, and not create a new one?)

The cv1 / cv2 is linked to the 'biocontainer dockerfile version', and not the tool version itself (ex here). Should we have separate files?

As an example, with the 'cadd-with-script' PR, using the cadd biotool id (cadd_phred), we would have:

cadd_phred/cadd-scripts-with-envs_1.6.post1_cv1.yaml
cadd_phred/cadd-scripts-with-envs_1.6_cv1.yaml

Each update to the Dockerfile (for the same version of cadd), would add another file.

cadd_phred/cadd-scripts-with-envs_1.6.post1_cv2.yaml
cadd_phred/cadd-scripts-with-envs_1.6.post1_cv1.yaml
cadd_phred/cadd-scripts-with-envs_1.6_cv1.yaml

And if we had a PR with cadd itself (instead of cadd-scripts-xxx), it would be

cadd_phred/cadd-scripts-with-envs_1.6.post1_cv2.yaml
cadd_phred/cadd-scripts-with-envs_1.6.post1_cv1.yaml
cadd_phred/cadd-scripts-with-envs_1.6_cv1.yaml
cadd_phred/cadd_1.6.post1_cv1.yaml
cadd_phred/cadd_1.6_cv1.yaml

I would say that if cv2 replaces cv1 but keeps the same metadata and the same tool in the same version, we should use the same ID (e.g. cadd_phred/cadd_1.6.biocontainers.yaml).

@mboudet
Contributor

mboudet commented Sep 30, 2024

@hmenager
To sum up:

In all cases, we add a file in imports/biocontainers/[biocontainers ID].biocontainers.yaml
If there is a bio.tools ID, we also add a file in data/[bio.tools ID]/[biocontainers ID].biocontainers.yaml

Regarding the biocontainers ID, there are three ways we can do it:

  1. Just the software name
  2. Software name + software version
  3. Software name + software version + Dockerfile version (usually cv1 or cv2)
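
To make the three options concrete, here is a rough sketch that splits a tag such as diann:1.8.1_cv2 and builds each candidate ID. It assumes tags have the form name:version with an optional _cvN suffix; the names `ContainerTag`, `parse_container_tag` and `candidate_ids` are hypothetical.

```python
import re
from typing import Dict, NamedTuple, Optional


class ContainerTag(NamedTuple):
    name: str
    version: str
    recipe_version: Optional[str]  # e.g. "cv2"; None when the tag has no suffix


def parse_container_tag(tag: str) -> ContainerTag:
    """Split 'name:version[_cvN]' into its parts."""
    name, _, version_part = tag.partition(":")
    match = re.fullmatch(r"(?P<version>.+?)(?:_(?P<cv>cv\d+))?", version_part)
    if not name or match is None:
        raise ValueError(f"unexpected container tag: {tag!r}")
    return ContainerTag(name, match.group("version"), match.group("cv"))


def candidate_ids(tag: ContainerTag) -> Dict[str, str]:
    """Return the ID produced by each of the three options."""
    parts = [tag.name, tag.version, tag.recipe_version]
    return {
        "option_1": tag.name,
        "option_2": f"{tag.name}_{tag.version}",
        "option_3": "_".join(p for p in parts if p),
    }
```

With option 1, a cv2 rebuild and a new software version both map to the same file; with option 3, each Dockerfile revision creates a new one.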

The question is:

  • Do we need the history (either in the file itself, or as a separate version file)?
  • Or do we only need the last version?

GitHub already takes care of versioning, and all the different versions will be in https://github.com/BioContainers anyway.
If we do need the history, 1) and 2) are probably going to look weird if the software metadata change.

It might be good to have a look at what exactly we want in terms of metadata content and formatting.

@hmenager
Author

hmenager commented Oct 4, 2024

For the biocontainers ID, we need just the software name, and we only need the latest version!
For the metadata, anything which is available and maintained (e.g. not in https://github.com/BioContainers/tools-metadata) is relevant. If it contains information about the software, or how it can be accessed with BioContainers, then it's valuable.

@mboudet
Contributor

mboudet commented Oct 16, 2024

Just as a reminder for myself: if we only need the latest version, we need a way to skip the bio.tools part of the CI for some PRs, just in case someone makes a PR with an older version 🤔

(Since there are many ways of versioning, and they are difficult to parse.)
Maybe just setting a label skip-biotool-pr
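
A minimal sketch of that check, assuming the CI already has the list of label names attached to the PR (how it obtains them is out of scope here). The function name is hypothetical.

```python
from typing import Iterable

SKIP_LABEL = "skip-biotool-pr"


def should_sync_biotools(pr_labels: Iterable[str]) -> bool:
    """Return False when the PR carries the skip label, True otherwise."""
    return SKIP_LABEL not in set(pr_labels)
```

This leaves the decision to the person merging the PR, so the CI does not have to compare version strings.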
