Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ncbi egapx #6456

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

richard-burhans
Copy link
Contributor

Adding NCBI Eukaryotic Genome Annotation Pipeline - External (EGAPx)

FOR CONTRIBUTOR:

  • I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
  • License permits unrestricted use (educational + commercial)
  • This PR adds a new tool or tool collection
  • This PR updates an existing tool or tool collection
  • This PR does something else (explain below)

@richard-burhans
Copy link
Contributor Author

I start with NCBI's docker image from dockerhub and create a new one which includes an installation of Nexflow. The augmented container is uploaded to quay.io. The strategy is to run this augmented docker container on a single large node.

I'm working on a scheduled github action that will check for new versions of EGAPx on dockerhub, build new augmented containers, and upload them to quay.io. If someone already has something like this, I can just reuse it. If not, when it's ready I can submit the github action to the IUC as well.

tools/ncbi_egapx/macros.xml Outdated Show resolved Hide resolved
tools/ncbi_egapx/macros.xml Outdated Show resolved Hide resolved
tools/ncbi_egapx/ncbi_egapx.xml Outdated Show resolved Hide resolved
#else:
#set yamlconfig = $yamlin
#end if
source /galaxy/env.bash &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add please a comment what this one is doing?

</param>
</when>
<when value="history">
<param name="yamlin" type="data" format="yaml,txt" label="egapx configuration yaml file to pass to Nextflow"/>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<param name="yamlin" type="data" format="yaml,txt" label="egapx configuration yaml file to pass to Nextflow"/>
<param name="yamlin" type="data" format="yaml" label="egapx configuration yaml file to pass to Nextflow"/>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we should add here more red warnings ... if people specify here too many cores, most schedulers will just kill the job ... so people need to do what they are doing here.

tools/ncbi_egapx/ncbi_egapx.xml Outdated Show resolved Hide resolved
tools/ncbi_egapx/ncbi_egapx.xml Outdated Show resolved Hide resolved
help="Select a built in, history or remote URI for the reference genome fasta">
<option value="history" selected="True">Use a genome fasta file from the current history</option>
<option value="indexed">Use a Galaxy server built-in genome</option>
<option value="uri">Provide a remote web link URI ("https://...") pointing at the required genome reference fasta file</option>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<option value="uri">Provide a remote web link URI ("https://...") pointing at the required genome reference fasta file</option>
<option value="uri">Provide a remote web link URI ("https://...") pointing at the required genome reference FASTA file</option>

add a validator?
maybe a proper URI validator in Galaxy would be useful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I'm assuming there's already python code somewhere that validates strings based on RFC 3986. Maybe there should also be a "check" or similar option to the URL validator that if the input string passes the RFC 3986 check, it will do an HTTP HEAD to see if the resource actually exists?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The URI validator should also include RFC 3987, Internationalized Resource Identifiers (IRIs).

</collection>
</outputs>
<tests>
<test expect_test_failure="true">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No successful test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests both succeed. I set expect_test_failure="true" because I wasn't sure if the tests would succeed when being run by the repo's CI. I can give it a try on my personal repo and see how it goes.


.. class:: warningmark

**Proof of concept: a hack to run a NF workflow inside a specialised Galaxy tool wrapper**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really surprising that this works? We can execute anything that runs on a command line, or?


It is also very new and in rapid development. Investing developer effort and keeping updated as EGAPx changes rapidly may be *inefficient of developer resources*.

This wrapper is designed to allow measuring how *inefficient* it is in terms of computing resource utilisation, in comparison to the developer effort
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wrapper is designed to allow measuring

How will this be measured?

inefficient it is in terms of computing resource utilisation

This was my 1st thought, having one resource allocation for the whole pipeline will not work on many systems. On our system jobs that are not utilizing the allocated CPUs/memory efficiently are automatically killed.

We already have quite a few tools wrapping (non-NF) pipelines which all have this problem. On my system I can only assign 1 CPU to them, because they are killed otherwise.


**Proof of concept: a hack to run a NF workflow inside a specialised Galaxy tool wrapper**

EGAPx is a big, complicated Nextflow workflow, challenging and costly to re-implement **properly**, requiring dozens of new tools and replicating a lot of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My second thought was that, while it is certainly challenging and a lot of work to integrate the tools it would pay of for the Galaxy project as a whole.

EGAPx is a big, complicated Nextflow workflow,

Is it systematically tested (not as far as I see, at least not on github)

<param name="genome_type_select" value="uri"/>
<param name="uri" value="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/020/809/275/GCF_020809275.1_ASM2080927v1/GCF_020809275.1_ASM2080927v1_genomic.fna.gz"/>
<param name="rna_type_select" value="list"/>
<param name="rnaseq" value="https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/sample_data/Dermatophagoides_farinae_small/SRR8506572.1 https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/sample_data/Dermatophagoides_farinae_small/SRR8506572.2 https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/sample_data/Dermatophagoides_farinae_small/SRR9005248.1 https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/sample_data/Dermatophagoides_farinae_small/SRR9005248.2"/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be location instead of value and the URLs comma separated?

</when>
</conditional>
<param name="proteins" type="data" format="fasta,tasta.gz" optional="true" label="Select a protein set"/>
<param name="xtra" type="text" area="true" label="Additional yaml to append to the egapx.yaml configuration"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need a validator for free text entry (and disabling the sanitizer as definitely a bad idea). Maybe better allow yaml upload?

help="Select a built in, history or remote URI for the reference genome fasta">
<option value="history" selected="True">Use a genome fasta file from the current history</option>
<option value="indexed">Use a Galaxy server built-in genome</option>
<option value="uri">Provide a remote web link URI ("https://...") pointing at the required genome reference fasta file</option>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we should implement the URI download in a tool. Users can upload URIs and defer the upload until job execution.

</param>
</when>
<when value="history">
<param name="rnaseq" type="data" format="fastqsanger,fastqsanger.gz" multiple="true" label="Select multiple RNA-seq fastqsanger inputs from the current history"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

paired end is irrelevant?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants