[WIP] Add fastq_util tool fastq_pre_barcodes to qc dir #252

Open
wants to merge 21 commits into base: develop
Changes from 15 commits
21 changes: 21 additions & 0 deletions tools/qc/fastq_utils/.shed.yml
@@ -0,0 +1,21 @@
name: fastq_utils
owner: ebi-gxa
description: "Set of tools for handling fastq files"
long_description: "fastq_utils is a set of Linux utilities to validate and manipulate fastq files.
  It also includes a set of programs to preprocess barcodes (namely UMIs,
  cells and samples), add the barcodes as tags in BAM files and count UMIs."
homepage_url: https://github.com/nunofonseca/fastq_utils
remote_repository_url: https://github.com/ebi-gene-expression-group/container-galaxy-sc-tertiary/tree/develop/tools/qc/fastq_utils
type: unrestricted
categories:
- Transcriptomics
- RNA
auto_tool_repositories:
  name_template: "{{ tool_id }}"
  description_template: "Set of tools for handling fastq files: {{ tool_name }}"
suite:
  name: "suite_fastq_utils"
  description: "Set of tools for handling fastq files"
  long_description: "fastq_utils is a set of Linux utilities to validate and manipulate fastq files.
    It also includes a set of programs to preprocess barcodes (namely UMIs,
    cells and samples), add the barcodes as tags in BAM files and count UMIs."
227 changes: 227 additions & 0 deletions tools/qc/fastq_utils/fastq_pre_barcodes.xml
@@ -0,0 +1,227 @@
<tool id="fastq_pre_barcodes" name="FASTQ barcodes preprocessor" profile="18.01" version="0.16.3+galaxy0">
<description>Preprocesses the reads to move the barcodes (UMI, Cell, ...) to the respective readname, optionally discarding reads with bases in the barcode regions below a given threshold.</description>
Member

The requirements are missing as well (the Bioconda package that this tool will use to run).

Contributor Author

Done in 51b56db but I'm not sure if I referenced the correct version for samtools.

Member

I suspect that will be hard to know. @pinin4fjords might be able to point you to where IRAP is installed on Noah so you can check the version used. In principle we could try a few runs with this version (which I suspect is the most up to date) and, if the results are equivalent, keep the newest version. For a start, though, it might be better to go with the version currently used on Noah, if possible.

Member

irap has samtools 1.9 and fastq_utils 0.16.3.
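For reference, a requirements block pinned to the versions reported above might look like the sketch below. It assumes both packages are resolved from Bioconda under these names; whether Bioconda actually provides builds of these exact versions is not confirmed in this thread.

```xml
<!-- Sketch only: versions taken from the irap installation mentioned above. -->
<!-- Availability of these exact versions in Bioconda is an assumption. -->
<requirements>
    <requirement type="package" version="0.16.3">fastq_utils</requirement>
    <requirement type="package" version="1.9">samtools</requirement>
</requirements>
```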

Contributor Author

Changed in 3896d3d, but the test log says it's still using fastq_utils 0.25.1. Any idea how I might force it to use the correct fastq_utils version?

Member

I'm not seeing what you mean in the logs right now, but it may be because that version isn't available in Conda; see https://anaconda.org/bioconda/fastq_utils/files. You could try picking the oldest version available for now, but since we can't easily match versions, maybe we should bite the bullet and use the latest. Okay with you, @pcm32?

Contributor Author

@irisdianauy Feb 3, 2022

The HTML output of the local planemo test that I ran reports fastq_utils 0.25.1. I'm not sure how to view the HTML here, but maybe they're using the same version.

According to the fastq_utils repo, these are the dependencies:
"samtools (version 0.1.19) and zlib (http://zlib.net/) version 1.2.11 or latest are required to compile fastq_utils. ... The bam_annotate.sh script requires samtools (version 1.5 or higher)."

Contributor Author

Changed to latest version in c40fc6a

<requirements>
<requirement type="package" version="0.25.1">fastq_utils</requirement>
</requirements>
<command detect_errors="exit_code"><![CDATA[
fastq_pre_barcodes --read1 '$read1' --outfile1 '$outfile1'

#if $read2:
--read2 '$read2'
#end if

#if $index1:
--index1 '$index1'
#end if

#if $index2:
--index2 '$index2'
#end if

#if $index3:
--index3 '$index3'
#end if

#if $phred_encoding:
--phred_encoding '$phred_encoding'
#end if

#if $min_qual:
--min_qual '$min_qual'
#end if

#if $outfile2:
--outfile2 '$outfile2'
#end if

#if $outfile3:
--outfile3 '$outfile3'
#end if

#if $interleaved:
--interleaved '$interleaved'
#end if

#if $umi_read:
--umi_read '$umi_read'
#end if

#if $umi_offset:
--umi_offset '$umi_offset'
#end if

#if $umi_size:
--umi_size '$umi_size'
#end if

#if $cell_read:
--cell_read '$cell_read'
#end if

#if $cell_offset:
--cell_offset '$cell_offset'
#end if

#if $cell_size:
--cell_size '$cell_size'
#end if

#if $sample_read:
--sample_read '$sample_read'
#end if

#if $sample_offset:
--sample_offset '$sample_offset'
#end if

#if $sample_size:
--sample_size '$sample_size'
#end if

#if $read1_offset:
--read1_offset '$read1_offset'
#end if

#if $read1_size:
--read1_size '$read1_size'
#end if

#if $read2_offset:
--read2_offset '$read2_offset'
#end if

#if $read2_size:
--read2_size '$read2_size'
#end if

#if $use_10x:
'$use_10x'
#end if

#if $sam:
'$sam'
#end if

#if $x:
'$x'
#end if

#if $brief:
'$brief'
#elif $verbose:
'$verbose'
#end if
]]></command>
<inputs>
<param name="verbose" label="Verbose" optional="true" value='false' argument="--verbose" type="boolean" truevalue='--verbose' falsevalue='' checked="true" help="Increase level of messages printed to stderr"/>
<param name="brief" label="Brief" optional="true" value="true" argument="--brief" type="boolean" truevalue='--brief' falsevalue='' checked="true" help="Decrease level of messages printed to stderr"/>
<param name="read1" label="Read1" argument="--read1" type="data" format="fastqsanger" optional="false" help="fastq (optional gzipped) file name"/>
Member

Also, if the tool actually accepts fastq.gz, I would try using fastqsanger.gz here instead of fastqsanger (sorry, I know fastqsanger was my original suggestion). If the tool can accept .gz, Galaxy is otherwise decompressing the file unnecessarily in order to pass it as fastqsanger rather than fastqsanger.gz. On the inputs I think you can accept more than one format, so you could list both, comma-separated, in the format attribute.
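As a minimal sketch of that suggestion, applied to the `read1` parameter only and assuming the tool reads gzipped input directly, the format attribute could list both datatypes:

```xml
<!-- Sketch: accept both plain and gzipped Sanger FASTQ so Galaxy does not decompress .gz inputs. -->
<param name="read1" label="Read1" argument="--read1" type="data" format="fastqsanger,fastqsanger.gz" optional="false" help="fastq (optional gzipped) file name"/>
```

The same change would apply to `read2` and the three index inputs.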

Contributor Author

@irisdianauy Feb 3, 2022

Originally I put "fastq,fastqsanger", incorrectly interpreting that one of them stood for the .gz version, and this gave me an error. Would you know where format values are documented? This page about data types does not list the actual format values that should be used in the xml.

Contributor Author

@irisdianauy Feb 4, 2022

Done in 36546ac.

<param name="read2" label="Read2" argument="--read2" type="data" format="fastqsanger" optional="true" help="fastq (optional gzipped) file name"/>
<param name="index1" label="Index1" argument="--index1" type="data" format="fastqsanger" optional="true" help="fastq (optional gzipped) file name"/>
<param name="index2" label="Index2" argument="--index2" type="data" format="fastqsanger" optional="true" help="fastq (optional gzipped) file name"/>
<param name="index3" label="Index3" argument="--index3" type="data" format="fastqsanger" optional="true" help="fastq (optional gzipped) file name"/>
<param name="phred_encoding" label="PHRED Encoding" argument="--phred_encoding" type="select" optional="true" help="PHRED encoding used in the input files">
<option value="33" selected="true">33</option>
<option value="64">64</option>
</param>
<param name="min_qual" label="Minimum Quality" optional="true" value='' argument="--min_qual" type="integer" min="0" max="40" help="[0-40]. Defines the minimum quality that all bases in the UMI, Cell or Sample should have (reads that do not pass the criteria are discarded). 0 disables the filter."/>
<param name="interleaved" label="Interleaved Data" argument="--interleaved" type="text" optional="true" help="Interleaved data, in this format: (read1|read2|index1|index2|index3),(read1|read2|index1|index2|index3)"/>
<param name="umi_read" label="UMI read" argument="--umi_read" type="text" optional="true" help="File in which UMI read can be found, in this format: (read1|read2|index1|index2|index3)"/>
<param name="umi_offset" label="UMI offset" argument="--umi_offset" type="integer" optional="true" help="Offset (integer)"/>
<param name="umi_size" label="UMI Size" argument="--umi_size" type="integer" optional="true" help="Number of bases after the offset"/>
<param name="cell_read" label="Cell Read" argument="--cell_read" type="text" optional="true" help="File in which Cell can be found, in this format: (read1|read2|index1|index2|index3)"/>
<param name="cell_offset" label="Cell Offset" argument="--cell_offset" type="integer" optional="true" help="Offset"/>
<param name="cell_size" label="Cell Size" argument="--cell_size" type="integer" optional="true" help="Number of bases after the offset"/>
<param name="sample_read" label="Sample Read" argument="--sample_read" type="text" optional="true" help="File in which sample barcode can be found, in this format: (read1|read2|index1|index2|index3)"/>
<param name="sample_offset" label="Sample Offset" argument="--sample_offset" type="integer" optional="true" help="Offset"/>
<param name="sample_size" label="Sample Size" argument="--sample_size" type="integer" optional="true" help="Number of bases after the offset"/>
<param name="read1_offset" label="read1 Offset" argument="--read1_offset" type="integer" optional="true" help="None"/>
<param name="read1_size" label="read1 Size" argument="--read1_size" type="integer" optional="true" help="None"/>
<param name="read2_offset" label="read2 Offset" argument="--read2_offset" type="integer" optional="true" help="None"/>
<param name="read2_size" label="read2 Size" argument="--read2_size" type="integer" optional="true" help="None"/>
<param name="use_10x" label="Use 10x tags" argument="--10x" type="text" optional="true" help="Use 10X UMI tags (UB and UY) instead of the default tags defined in the SAM specification"/>
<param name="sam" label="SAM" argument="--sam" type="text" optional="true" help="No documentation"/>
<param name="x" label="X" argument="-X" type="text" optional="true" help="No documentation"/>
</inputs>
<outputs>
<data label="${tool.name} on ${on_string}: Output file 1" name="outfile1" format="fastqsanger" />
Member

If the tool is directly writing fastq.gz and the next step uses fastq.gz, then please set the output formats to fastqsanger.gz as well.

Member

We can later add some conditionality on this, but not needed for now I guess.
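One possible form of that conditionality, as a sketch only: Galaxy's `format_source` attribute lets an output inherit the datatype of a named input. This assumes fastq_pre_barcodes writes its output with the same compression as `read1`, which this thread does not confirm.

```xml
<!-- Sketch: outfile1 takes its datatype (fastqsanger or fastqsanger.gz) from the read1 input. -->
<!-- Assumes the tool's output compression follows its input; not verified here. -->
<data label="${tool.name} on ${on_string}: Output file 1" name="outfile1" format_source="read1" />
```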

Contributor Author

Done in 36546ac

<data label="${tool.name} on ${on_string}: Output file 2" name="outfile2" format="fastqsanger" />
<data label="${tool.name} on ${on_string}: Output file 3" name="outfile3" format="fastqsanger" />
</outputs>
<tests>
<test>
<param name="index1" value="barcode_test_1.fastq.gz"/>
<param name="phred_encoding" value="33"/>
<param name="min_qual" value="10"/>
<param name="umi_read" value="index1"/>
<param name="umi_offset" value="0"/>
<param name="umi_size" value="16"/>
<param name="read1_offset" value="0"/>
<param name="read1_size" value="-1"/>
<param name="read1" value="barcode_test_2.fastq.gz"/>
<output name="outfile1" file="test.fastq.gz"/>
Member

I'm not sure, but I suspect that you might need some assertion logic here. See galaxy tools docs and other tools' tests in this repo.

Member

That might explain why tests are skipped.

Member

Instead of comparing to a file (and having to upload/download the result file), I suggest that you assert success through an estimate of the file size. Since you have the correct file, you can check its size and allow some delta. See my Galaxy tests docs or look at the example tests on my Seurat 4 branch: https://github.com/ebi-gene-expression-group/container-galaxy-sc-tertiary/tree/feature/seurat_4/tools/tertiary-analysis/seurat
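A size-based assertion for the first test might be sketched as follows. It uses Galaxy's `has_size` assertion; the `value` and `delta` numbers are placeholders, not measurements, and would need to be replaced with the byte size of the known-good output and an acceptable tolerance.

```xml
<!-- Sketch: check the output size instead of diffing against a stored file. -->
<!-- value and delta are placeholder byte counts, not measured from the real output. -->
<output name="outfile1">
    <assert_contents>
        <has_size value="1000" delta="100"/>
    </assert_contents>
</output>
```

With this form the expected output file no longer needs to be stored in test-data.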

Member

Currently this is failing because diff won't compare files that are binary (gz in this case).
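If comparing against the expected file is still preferred, one option is the `decompress` attribute on the test output, which asks the test framework to decompress gzipped content before the diff. Whether it is honoured depends on the Galaxy and planemo versions in use, so treat this as an assumption to verify rather than a confirmed fix.

```xml
<!-- Sketch: decompress gzipped content before the default diff comparison. -->
<output name="outfile1" file="test.fastq.gz" decompress="true"/>
```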

</test>
<test>
<param name="index1" value="barcode_test2_1.fastq.gz"/>
<param name="index2" value="barcode_test2_1.fastq.gz"/>
<param name="index3" value="barcode_test2_1.fastq.gz"/>
<param name="phred_encoding" value="33"/>
<param name="min_qual" value="1"/>
<param name="umi_read" value="index1"/>
<param name="umi_offset" value="0"/>
<param name="umi_size" value="16"/>
<param name="read1_offset" value="0"/>
<param name="read1_size" value="-1"/>
<param name="cell_read" value="index2"/>
<param name="cell_offset" value="0"/>
<param name="cell_size" value="8"/>
<param name="sample_read" value="index3"/>
<param name="sample_offset" value="0"/>
<param name="sample_size" value="4"/>
<param name="read1" value="barcode_test2_2.fastq.gz"/>
<param name="read2" value="barcode_test2_2.fastq.gz"/>
<param name="sam" value="--sam"/>
<output name="outfile1" file="test_1.fastq.gz"/>
<output name="outfile2" file="test_2.fastq.gz"/>
</test>
<test expect_failure="true">
<param name="interleaved" value="read1"/>
<param name="read1" value="inter.fastq.gz"/>
<param name="index1" value="inter.fastq.gz"/>
<param name="umi_read" value="index1"/>
<param name="umi_offset" value="0"/>
<param name="umi_size" value="16"/>
<param name="sam" value="--sam"/>
</test>
</tests>
<help><![CDATA[
============================================================
Preprocess barcodes of fastq files (fastq_pre_barcodes)
============================================================

Preprocess the reads to move the barcodes (UMI, Cell, ...) to the respective readname, optionally discarding reads with bases in the barcode regions below a given threshold.

Example:

fastq_pre_barcodes --read1 my.umi.fastq.gz --outfile1 tmp.fastq.gz --phred_encoding 33 --read1_offset 22 --read1_size -1 --umi_read read1 --umi_size=8 --umi_offset 12

In the above command, the UMIs (starting at base 12, with a length of 8 bases) are extracted from the sequences and inserted into the respective read names. The read sequences in the output file include the bases from position 22 to the end of the sequence. The modified read name will have the following format:

@STAGS_CELL=[cell]_UMI=[umi]_SAMPLE=[sample]_ETAGS_[ORIGINAL READ NAME]

where [cell], [umi], and [sample] will have the value of the corresponding barcode (if available) and [ORIGINAL READ NAME] is, as the name suggests, the read name found in the input fastq file.

]]></help>
<citations>
<citation type="bibtex"><![CDATA[
@ARTICLE{Fonseca2017,
author = {Fonseca, N.},
title = {fastq_utils},
year = {2017},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/nunofonseca/fastq_utils}},
commit = {c6cf3f954c5286e62fbe36bb9ffecd89d7823b07}
}]]></citation>
</citations>
</tool>
34 changes: 34 additions & 0 deletions tools/qc/fastq_utils/get_test_data.sh
@@ -0,0 +1,34 @@
#!/usr/bin/env bash

BASE_LINK="https://raw.githubusercontent.com/nunofonseca/fastq_utils/master/tests"

BAR11_FILE="barcode_test_1.fastq.gz"
BAR12_FILE="barcode_test_2.fastq.gz"
BAR21_FILE="barcode_test2_1.fastq.gz"
BAR22_FILE="barcode_test2_2.fastq.gz"
INTER_FILE="inter.fastq.gz"

BAR11_LINK=$BASE_LINK"/"$BAR11_FILE
BAR12_LINK=$BASE_LINK"/"$BAR12_FILE
BAR21_LINK=$BASE_LINK"/"$BAR21_FILE
BAR22_LINK=$BASE_LINK"/"$BAR22_FILE
INTER_LINK=$BASE_LINK"/"$INTER_FILE

function get_data {
  local link=$1
  local fname=$2

  # Download only when the file is not already present; quote paths to survive spaces.
  if [ ! -f "$fname" ]; then
    echo "$fname not available locally, downloading.."
    wget -O "$fname" --retry-connrefused --waitretry=1 --read-timeout=20 --timeout=15 -t 3 "$link"
  fi
}

# Get test data
pushd test-data

get_data $BAR11_LINK $BAR11_FILE
get_data $BAR12_LINK $BAR12_FILE
get_data $BAR21_LINK $BAR21_FILE
get_data $BAR22_LINK $BAR22_FILE
get_data $INTER_LINK $INTER_FILE
Binary file added tools/qc/fastq_utils/test-data/barcode_test_2.fastq.gz
Binary file added tools/qc/fastq_utils/test-data/inter.fastq.gz
Binary file added tools/qc/fastq_utils/test-data/test.fastq.gz