Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarification required: tesOutput.path_prefix #194

Open
vsmalladi opened this issue Aug 16, 2023 · 5 comments
Open

Clarification required: tesOutput.path_prefix #194

vsmalladi opened this issue Aug 16, 2023 · 5 comments
Milestone

Comments

@vsmalladi
Copy link
Contributor

In the spec it appears that tesOutput.path_prefix is ignored if tesOutput.path doesn't contain wildcards.

However, when there are no wildcards provided we are still pruning the paths of the parent to just include the child.

tesOutput.path: /path/on/container/output/
tesOutput.path_prefix: # ignored
tesOutput.url: https://my.storage.org/bucket/output/

prune:  /path/on/container/

This selective use of path_prefix seems confusing and what we be more consistnt to implement is always use the path_prefix such that you could do this:

tesOutput.path: /path/on/container/output/
tesOutput.path_prefix: /path/on
tesOutput.url: https://my.storage.org/bucket/container/output/

Thoughts @uniqueg, @kellrott , @wleepang ?

@vsmalladi vsmalladi added this to the 1.2 milestone Aug 16, 2023
@MattMcL4475
Copy link
Contributor

MattMcL4475 commented Aug 16, 2023

A few more clarifications to discuss:

  1. If tesOutput.type is not specified, do we assume it defaults to FILE ?

  2. File specified with trailing slash:

tesOutput.tesFileType = "FILE"
tesOutput.path: /path/on/container/output/

Ideally: TES should return HTTP status code 400 "Bad Request"
Alternatively: TES task state should be set to "EXECUTOR_ERROR"
Reasoning: a file can never end in a trailing slash

  1. tesFileType of "FILE" is misleading when it's a pattern:
tesOutput.tesFileType = "FILE"
tesOutput.path: /path/on/container/output/*.bam

It seems we should add a tesFileType of PATTERN for this use case. FILE doesn't make sense, because it's really a pattern, and it results in an array of files.

  1. "DRY" - two ways of doing the same thing, correct? Maybe this is fine but would like to confirm they are equivalent.
    Are these equivalent?
tesOutput.tesFileType = "FILE"
tesOutput.path: /path/on/container/output/*
tesOutput.tesFileType = "DIRECTORY"
tesOutput.path: /path/on/container/output/
  1. Does TES consider these as the same or different? For example, in rsync, the first example would include "output/" in the file names, while the second example would not.
tesOutput.tesFileType = "DIRECTORY"
tesOutput.path: /path/on/container/output
tesOutput.tesFileType = "DIRECTORY"
tesOutput.path: /path/on/container/output/
  1. Why does TES need path_prefix at all? As a tool developer, imagine my tool outputs all files to:

/home/matt/myprogram/outputs/

Logically, I just want to upload all children of that path, to the URL destination of my choice. As a tool developer, this is the most canonical thing I would want to do:

tesOutput.tesFileType = "DIRECTORY"
tesOutput.path: /path/on/container/output/
tesOutput.url: https://my.storage.org/bucket/3f257b56-84c6-4fcd-b223-9a7367c42a98/

After running the tool, here are the local files:

/home/matt/myprogram/outputs/matt.bam
/home/matt/myprogram/outputs/other/info.csv

And what I would expect in cloud storage:

https://my.storage.org/bucket/3f257b56-84c6-4fcd-b223-9a7367c42a98/matt.bam
https://my.storage.org/bucket/3f257b56-84c6-4fcd-b223-9a7367c42a98/other/info.csv
  1. No trailing slash in URL for type directory - are these equivalent, or should the first result in 400/Executor_error?
tesOutput.tesFileType = "DIRECTORY"
tesOutput.path: /path/on/container/output/
tesOutput.url: https://my.storage.org/bucket/3f257b56-84c6-4fcd-b223-9a7367c42a98
tesOutput.tesFileType = "DIRECTORY"
tesOutput.path: /path/on/container/output/
tesOutput.url: https://my.storage.org/bucket/3f257b56-84c6-4fcd-b223-9a7367c42a98/

@uniqueg
Copy link
Contributor

uniqueg commented Aug 17, 2023

All good questions. I think some of the answers to the questions (or at least the reasoning why things were changed/introduced as they were) may be available in the following PRs and corresponding issues:

I would also like to add @bentsherman & @pditommaso to this thread, as handling of wildcards as well as making the file type optional were changes introduced mostly with Nextflow in mind.

@bentsherman
Copy link

From Nextflow's perspective, we just need to be able to specify an output glob and send it to the task directory, which the 1.1 spec appears to provide.

path_prefix allows us to send the glob from the task container to the task directory. The name is a bit confusing because path is a container path while path_prefix is a remote path, maybe something like publish_prefix or remote_prefix would have been better, but it gets the job done.

Making the output type optional just makes sense because the implementation should be able to figure it out at runtime.

We are testing the Nextflow changes in this PR: nextflow-io/nextflow#4195 if anyone would like to follow along, particularly the changes to TesTaskHandler.

@MattMcL4475
Copy link
Contributor

@bentsherman we have a proposal to remove path_prefix , can you take a look and see if you agree? #195

@bentsherman
Copy link

I assume you're proposing to use the url instead of path_prefix to determine the final output path? If so, that's fine by me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants