Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add fastq_util tool fastq_pre_barcodes to qc dir #252

Open
wants to merge 21 commits into
base: develop
Choose a base branch
from

Conversation

irisdianauy
Copy link
Contributor

@irisdianauy irisdianauy commented Jan 27, 2022

Galaxy wrappers for qc tools in fastq_utils

Copy link
Member

@pcm32 pcm32 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in general, very good for a first trial! I would reformat as adviced the command part. I suspect that the linting is failing due to the WARNINGS of not having help and tests.

tools/qc/fastq_utils/fastq_pre_barcodes.xml Outdated Show resolved Hide resolved
tools/qc/fastq_utils/fastq_pre_barcodes.xml Outdated Show resolved Hide resolved
@pcm32
Copy link
Member

pcm32 commented Jan 27, 2022

I will fix the testing setup, then you can see the linting here. Also, try to find a small test dataset that can be used for this. Maybe the original tool as some.

@@ -0,0 +1,145 @@
<tool id="fastq_pre_barcodes" name="FASTQ barcodes preprocessor" profile="10" version="conda-package-version+galaxy0">
<description>Preprocesses the reads to move the barcodes (UMI, Cell, ...) to the respective readname, optionally discarding reads with bases in the barcode regions below a given threshold.</description>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing the requirements as well (the bioconda package that this will use to run)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 51b56db but I'm not sure if I referenced the correct version for samtools.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that that will be hard to know, @pinin4fjords might be able to point you to where IRAP is installed on Noah to check the version used. We could in principle try a few runs with this (I suspect most up to date) version and if results are equivalent maybe we keep the newest version. Although maybe for a start, might be better to go if possible with the currently used version in noah.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

irap has samtools samtools 1.9, fastq_utils 0.16.3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in 3896d3d, but the test log says it's still using fastq_utils 0.25.1. Any idea how I might force it to use the correct fastq_utils version?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not seeing what you mean in the logs right now, but it may be because that version isn't available in Conda- see https://anaconda.org/bioconda/fastq_utils/files. You could try picking the oldest version available for now, but since we can't easily match versions maybe we should bite the bullet and use the latest. Okay with you @pcm32 ?

Copy link
Contributor Author

@irisdianauy irisdianauy Feb 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The html output of the local planemo test that I ran says fastq_utils 0.25.1 in its report. Not sure how to view the html here, but maybe they're using the same version.

According to the fastq_utils repo, these are the dependencies:
samtools (version 0.1.19) and zlib (http://zlib.net/) version 1.2.11 or latest are required to compile fastq_utils. ... The bam_annotate.sh script requires samtools (version 1.5 or higher).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to latest version in c40fc6a

@pcm32
Copy link
Member

pcm32 commented Jan 27, 2022

I added the shed file to see if this explains why the tests are being skipped.

@irisdianauy
Copy link
Contributor Author

XML as of 51b56db still fails during linting, and the only ERROR that remains is an invalid profile version, while the only WARNINGS that remain are the tool version (not PEP compliant) and the lack of citation (for now).

Copy link
Member

@pcm32 pcm32 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done, a lot of improvement. A few more comments to help further improve this.

@@ -0,0 +1,21 @@
name: fastq_utils
owner: ebi-gxa
description: "scanpy-scripts, command-line wrapper scripts around Scanpy."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change to a suitable description and long description of the whole package (fastq_utils).

description_template: "Wrapper for the fastq tool suite: {{ tool_name }}"
suite:
name: "fastq_utils"
description: "Please add one"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here please.

@@ -0,0 +1,145 @@
<tool id="fastq_pre_barcodes" name="FASTQ barcodes preprocessor" profile="10" version="conda-package-version+galaxy0">
<description>Preprocesses the reads to move the barcodes (UMI, Cell, ...) to the respective readname, optionally discarding reads with bases in the barcode regions below a given threshold.</description>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that that will be hard to know, @pinin4fjords might be able to point you to where IRAP is installed on Noah to check the version used. We could in principle try a few runs with this (I suspect most up to date) version and if results are equivalent maybe we keep the newest version. Although maybe for a start, might be better to go if possible with the currently used version in noah.

@@ -0,0 +1,145 @@
<tool id="fastq_pre_barcodes" name="FASTQ barcodes preprocessor" profile="10" version="conda-package-version+galaxy0">
<description>Preprocesses the reads to move the barcodes (UMI, Cell, ...) to the respective readname, optionally discarding reads with bases in the barcode regions below a given threshold.</description>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#end if

#if $Cell_read:
--$Cell_read '$Cell_read'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have an extra $ here. Also, are the flags in the tool capitalised? that would be unusual.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected in 3896d3d.

Comment on lines 115 to 119
<param name="read1" label="Read1" argument="--read1" type="data" format='?' help="fastq (optional gzipped) file name"/>
<param name="read2" label="Read2" argument="--read2" type="data" format='?' help="fastq (optional gzipped) file name"/>
<param name="index1" label="Index1" argument="--index1" type="data" format='?' help="fastq (optional gzipped) file name"/>
<param name="index2" label="Index2" argument="--index2" type="data" format='?' help="fastq (optional gzipped) file name"/>
<param name="index3" label="Index3" argument="--index3" type="data" format='?' help="fastq (optional gzipped) file name"/>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need format set here. I suspect fastqsanger.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 3896d3d. I initially set format to fastq,fastqsanger because the param is supposed to accept a fastq file or a gzipped version, but this generates an error. Finally set it to just fastqsanger, but needs an edit to include fastq.

Comment on lines 143 to 145
<data label="${tool.name} on ${on_string}: Output file 1" name="outfile1" format='?' />
<data label="${tool.name} on ${on_string}: Output file 2" name="outfile2" format='?' />
<data label="${tool.name} on ${on_string}: Output file 3" name="outfile3" format='?' />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also format needs to be set here, please see galaxy datatypes in the Galaxy docs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 3896d3d.

<param name="read1_offset" value="0"/>
<param name="read1_size" value="-1"/>
<param name="read1" value="test-data/barcode_test_2.fastq.gz"/>
<output name="outfile1" file="test.fastq.gz"/>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, but I suspect that you might need some assertion logic here. See galaxy tools docs and other tools' tests in this repo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might explain why tests are skipped.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of comparing to a file (and have to upload/download the result file), I suggest that you assert the success through an estimate of the file size. Since you know the correct file, you can check that file size and add some delta. See my Galaxy tests docs or look at example tests on my Seurat 4 branch: https://github.com/ebi-gene-expression-group/container-galaxy-sc-tertiary/tree/feature/seurat_4/tools/tertiary-analysis/seurat

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently this is failing because diff won't compare files that are binary (gz in this case).

@pcm32
Copy link
Member

pcm32 commented Jan 28, 2022

This is why it skipped tests, it is missing the test-data https://github.com/ebi-gene-expression-group/container-galaxy-sc-tertiary/runs/4979342741?check_suite_focus=true#step:9:182

Please try adding a file like this one: https://github.com/ebi-gene-expression-group/container-galaxy-sc-tertiary/blob/feature/seurat_4/tools/tertiary-analysis/seurat/get_test_data.sh with logic to place all the tests datasets inside a test-data folder. If this file is present and executable, the CI will retrieve the data before the tests.

Copy link
Member

@pcm32 pcm32 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good progress, try to avoid pushing the test data files to the repo, that is the purpose of having get_test_data.sh

@pcm32
Copy link
Member

pcm32 commented Feb 2, 2022

Here you might need to request a container for the combination of conda packages in use at https://biocontainers.pro/multipackage

@pcm32
Copy link
Member

pcm32 commented Feb 2, 2022

Most likely the CI is failing due to a lack of a container for that set of conda packages together.

@irisdianauy
Copy link
Contributor Author

irisdianauy commented Feb 3, 2022

In my local planemo test, at least, all tests pass using the 0.25.1 version. I checked the planemo test here, 2 of the 3 tests failed. I'll check what's different.

As for requesting a container, the required fastq_utils version and zlib version are not available in BioContainers. Would this mean creating a container from scratch?

@pinin4fjords
Copy link
Member

pinin4fjords commented Feb 3, 2022

In my local planemo test, at least, all tests pass using the 0.25.1 version. I checked the planemo test here, 2 of the 3 tests failed. I'll check what's different.

As for requesting a container, the required fastq_utils version and zlib version are not available in BioContainers. Would this mean creating a container from scratch?

Since we can't match the version of fastq_utils currently used in our pipeline, I think we should just use the most recent. We can make any adjustments required to deal with new arguments etc.

Plus, I think you should be able to use a single dependency of fastq_utils. That package will automatically pull in samtools and zlib via its own dependencies- see https://github.com/bioconda/bioconda-recipes/blob/master/recipes/fastq_utils/meta.yaml.

@pcm32
Copy link
Member

pcm32 commented Feb 3, 2022

If you use a single package and that package is in biocontainers, then the container will exist before hand.

<inputs>
<param name="verbose" label="Verbose" optional="true" value='false' argument="--verbose" type="boolean" truevalue='--verbose' falsevalue='' checked="true" help="Increase level of messages printed to stderr"/>
<param name="brief" label="Brief" optional="true" value="true" argument="--brief" type="boolean" truevalue='--brief' falsevalue='' checked="true" help="Decrease level of messages printed to stderr"/>
<param name="read1" label="Read1" argument="--read1" type="data" format="fastqsanger" optional="false" help="fastq (optional gzipped) file name"/>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if the tool actually accepts fastq.gz, I would try adding fastqsanger.gz here instead of fastqsanger (sorry, I know it was my original suggestion). What happens here is that if the tool can accept .gz, then galaxy is decompressing this unnecesarily to pass it as fastqsanger instead of fastqsanger.gz. On The inputs I think that you can accept more than one format (so you could use both, comma separated within the field).

Copy link
Contributor Author

@irisdianauy irisdianauy Feb 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally I put "fastq,fastqsanger", incorrectly interpreting that one of them stood for the .gz version, and this gave me an error. Would you know where format values are documented? This page about data types does not list the actual format values that should be used in the xml.

Copy link
Contributor Author

@irisdianauy irisdianauy Feb 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 36546ac.

<param name="x" label="X" argument="-X" type="text" optional="true" help="No documentation"/>
</inputs>
<outputs>
<data label="${tool.name} on ${on_string}: Output file 1" name="outfile1" format="fastqsanger" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the tools is directly sending out fastq.gz and the next step uses fastq.gz, then please set the output formats as well to fastqsanger.gz

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can later add some conditionality on this, but not needed for now I guess.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 36546ac

@irisdianauy
Copy link
Contributor Author

irisdianauy commented Feb 4, 2022

In the most recent CI checks, the first test that failed had this message:
Test output file (test.fastq.gz) is missing. If you are using planemo, try adding --update_test_data to generate it.
I use this flag when I do the test locally; maybe that's why locally, it succeeds and the one in CI fails (can't compare the files)?

If I'm not able to add the --update_test_data flag in the CI planemo test, perhaps I should add the output file where planemo can find it so it can compare (?), but not sure exactly where. I'm not sure also if this is the solution.

I'll still try comparing output by size first to see if it works.

@irisdianauy
Copy link
Contributor Author

irisdianauy commented Feb 4, 2022

The other test failed becuase it cannot see one of the input files:

raise AssertionError(f"Test input file ({filename}) cannot be found.") AssertionError: Test input file (barcode_test2_1.fastq.gz) cannot be found.

I removed all files in test-data , but I made sure that the get_test_data.sh works. Also, the other tests succeeded and did not have to look for the inputs, which I assume means their inputs were successfully donwloaded.

In my branch, the directory test-data has disappeared, but I made sure that I did not remove it locally when I deleted all its contents. Is github not displaying empty directories? Could this be why the test can't find the input? But this can't explain the fact that the other tests aren't complaining about their input files.

@irisdianauy
Copy link
Contributor Author

In my local planemo test, at least, all tests pass using the 0.25.1 version. I checked the planemo test here, 2 of the 3 tests failed. I'll check what's different.
As for requesting a container, the required fastq_utils version and zlib version are not available in BioContainers. Would this mean creating a container from scratch?

Since we can't match the version of fastq_utils currently used in our pipeline, I think we should just use the most recent. We can make any adjustments required to deal with new arguments etc.

Plus, I think you should be able to use a single dependency of fastq_utils. That package will automatically pull in samtools and zlib via its own dependencies- see https://github.com/bioconda/bioconda-recipes/blob/master/recipes/fastq_utils/meta.yaml.

Done in c40fc6a

Copy link
Member

@pcm32 pcm32 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great improvements.

Comment on lines +147 to +149
<data label="${tool.name} on ${on_string}: Output file 1" name="outfile1" format="fastqsanger,fastqsanger.gz" />
<data label="${tool.name} on ${on_string}: Output file 2" name="outfile2" format="fastqsanger,fastqsanger.gz" />
<data label="${tool.name} on ${on_string}: Output file 3" name="outfile3" format="fastqsanger,fastqsanger.gz" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, outputs can only have one data type, please choose one based on what is more relevant for our process. In time we can add conditionality for the user to choose, but not pressing now.

<param name="read1_offset" value="0"/>
<param name="read1_size" value="-1"/>
<param name="read1" value="barcode_test_2.fastq.gz"/>
<output name="outfile1" file="test.fastq.gz" md5="e4ab7532f32c14ff5bce10c021422f3f"/>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most likely checksum won't be the same after software version changes compared to the files in Nuno's repo. I would go with file size comparison, with a 5 or 10 % delta. Also, when using md5 or size comparison, you don't need to add file=...

Suggested change
<output name="outfile1" file="test.fastq.gz" md5="e4ab7532f32c14ff5bce10c021422f3f"/>
<output name="outfile1">
<assert_contents>
<has_size value="176978" delta="17600"/>
</assert_contents>
</output>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once you run it and it fails, you can get a better estimate for the size.

<param name="read1" value="barcode_test2_2.fastq.gz"/>
<param name="read2" value="barcode_test2_2.fastq.gz"/>
<param name="sam" value="--sam"/>
<output name="outfile1" file="test_1.fastq.gz" md5="d41d8cd98f00b204e9800998ecf8427e"/>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise, I would use size comparison here, and drop the file field part, for both outputs.

@pcm32 pcm32 marked this pull request as ready for review February 7, 2022 15:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants