Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow deferred datasets to behave as URIs #19077

Open
wants to merge 10 commits into
base: dev
Choose a base branch
from

Conversation

davelopez
Copy link
Contributor

@davelopez davelopez commented Oct 29, 2024

Closes #19065

Some tools might accept URIs as input. In those cases there is no need to materialize deferred datasets in Galaxy, we can just pass the source URI directly to the tool.

To do that, we can now use allow_uri_if_protocol when declaring the input param in the tool.

<param name="input" type="data" allow_uri_if_protocol="http,https,s3" />

With the above example, if you provide a deferred dataset as input which source URI starts with one of the listed protocols (http, https or s3), when you reference your $input it will contain the source URI, otherwise, it will be the path to the materialized file.

You can also use allow_uri_if_protocol="*" to treat all deferred datasets as URIs regardless of the protocol.

How to test the changes?

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    • Upload a deferred dataset
    • Run a tool that uses allow_uri_if_protocol in the input param definition with an appropriate value to filter the protocol or *.
    • When you reference the input param in your command line it will contain the source URI of the deferred dataset.

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@davelopez davelopez changed the title Allow deferred datasets to behave as URI Allow deferred datasets to behave as URIs Oct 29, 2024
@jmchilton
Copy link
Member

I recommended is_prefix because maybe like say the SRA tools could do something more optimal if the URI is an SRA URL, but this does not conflict with that, makes more sense conceptually for our current use cases, etc.. so I think I like your implementation better than my suggestion. Really nice work - this seems perfect to me.

@davelopez davelopez marked this pull request as ready for review October 30, 2024 15:28
@github-actions github-actions bot added this to the 24.2 milestone Oct 30, 2024
@davelopez davelopez marked this pull request as draft October 30, 2024 15:55
@davelopez davelopez marked this pull request as ready for review October 30, 2024 16:05
@davelopez
Copy link
Contributor Author

Should I use the new tool execution testing framework for a94a80d?

@jmchilton
Copy link
Member

I think the test is fine as is.

Copy link
Member

@bgruening bgruening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff @davelopez!

lib/galaxy/tool_util/xsd/galaxy.xsd Show resolved Hide resolved
deferred_input = self.tool.inputs.get(input_name)
if not deferred_input:
# TODO: Can this ever happen? It seems like it happens in the test suite.
# For example, in test_metadata_validator_on_deferred_input
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The BAM file in this test is a special case. BAM files do have a second file associated to it as metadata, a bai file.
The bai file is probably created during materialization. Can this trigger it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inputs in the pileup tool are:

<inputs>
    <param name="input1" type="data" format="unsorted.bam" multiple="true" min="1" label="BAM Inputs">
        <validator check="bam_index" message="Metadata missing, click the pencil icon in the history item and use the auto-detect feature to correct this issue." type="metadata" />
    </param>
    <param name="reference" type="data" format="fasta" label="Fasta Input"/>
</inputs>

But during evaluation, we got 2 inputs (apart from $reference) as deferred $input1 and $input11. Maybe that $input11 was the index file as you mentioned? I would have expected that to be under $input1.metadata.bam_index and not in its own input entry. Also at this point, nothing has been materialized yet, but maybe it was "staged" to be materialized... 🤔

Anyway, if this is expected I will just remove the TODO comment, I wanted to make sure this was a known case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea what $input11 is. @jmchilton can you please have a look? Is it ok to merge this PR or is this something we should look at?

@davelopez
Copy link
Contributor Author

The integration test failure is unrelated

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow tools accept URIs for deferred datasets.
4 participants