A command line tool for transferring Unique Molecular Identifiers (UMIs) provided as a separate FastQ file to the header of records in paired FastQ files.
- Background on Unique Molecular Identifiers
- Installing umi-transfer
- Using umi-transfer to integrate UMIs
- Benchmarks and parameter recommendations
- Chaining with other software
- Contributing bugfixes and new features
To increase the accuracy of quantitative DNA sequencing experiments, Unique Molecular Identifiers may be used. UMIs are short sequences used to uniquely tag each molecule in a sample library, enabling precise identification of read duplicates. They must be added during library preparation, prior to sequencing, and therefore require appropriate arrangements with your sequencing provider.
Most tools capable of taking UMIs into consideration during an analysis workflow expect the respective UMI sequence to be embedded into the read's ID. Please consult your tools' manuals regarding the exact specification.
For some library preparation kits and sequencing adapters, the UMI sequence needs to be read together with the index from the antisense strand. Consequently, it will be output as a separate FastQ file during the demultiplexing process.
This tool efficiently integrates these separate UMIs into the headers and can also correct divergent read numbers back to the canonical 1 and 2.
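To make the target format concrete, the following sketch mimics the header edit with standard tools. It is an illustration only and not how umi-transfer is implemented: the file names and read content are made up, and it assumes that the UMI is joined to the first space-delimited field of the header with the default delimiter `:`.

```shell
# Illustration only: toy files and a naive awk stand-in, not umi-transfer itself.
tmp=$(mktemp -d)

# One made-up read and its made-up UMI record (same read ID in both files).
printf '@read1 1:N:0:ACGT\nTTGACCA\n+\nIIIIIII\n' > "$tmp/reads.fq"
printf '@read1 2:N:0:ACGT\nGATTACA\n+\nIIIIIII\n' > "$tmp/umi.fq"

# paste puts matching lines of both files side by side, separated by a tab.
# For every record, the UMI sequence (second column of the sequence line)
# is appended to the first space-delimited field of the read header.
tagged=$(paste "$tmp/reads.fq" "$tmp/umi.fq" | awk -F '\t' '
    NR % 4 == 1 { header = $1; next }          # remember the read header
    NR % 4 == 2 {                              # $2 is the UMI sequence
        n = index(header, " ")
        if (n == 0) { print header ":" $2 }
        else { print substr(header, 1, n - 1) ":" $2 substr(header, n) }
        print $1                               # the read sequence itself
        next
    }
    { print $1 }                               # "+" and quality lines unchanged
')

printf '%s\n' "$tagged"
rm -r "$tmp"
```

With the toy input above, the header `@read1 1:N:0:ACGT` becomes `@read1:GATTACA 1:N:0:ACGT`, while the sequence and quality lines are left untouched.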
Binaries for umi-transfer are available for most platforms and can be obtained from the Releases page on GitHub. Simply navigate to the releases and download the appropriate binary for your operating system. Once downloaded, you can place it in a directory of your choice and optionally add the binary to your system's $PATH.
umi-transfer is also available on Bioconda. Please refer to the Bioconda documentation for comprehensive installation instructions. If you are already familiar with conda and Bioconda, here is a quick reference:
mamba install umi-transfer
If you wish to create a separate virtual environment for the tool, replace <myenvname> with a suitable environment name of your choice and run
mamba create --name <myenvname> umi-transfer
Docker provides a platform for packaging software into self-contained units called containers. Containers encapsulate all the dependencies and libraries needed to run an application, making it easy to deploy and run the software consistently across different environments.
To use umi-transfer with Docker, you can pull the pre-made Docker image from Docker Hub. Open a terminal or command prompt and run the following command:
docker pull mzscilifelab/umi-transfer:latest
Once the image is downloaded, you can run umi-transfer within a Docker container using:
docker run -t -v `pwd`:`pwd` -w `pwd` mzscilifelab/umi-transfer:latest umi-transfer --help
A complete command might look like the example below. The Docker options -t, -v and -w ensure that your current working directory is mapped to and available inside the container. Everything after the image name follows the standard command line syntax:
docker run -t -v `pwd`:`pwd` -w `pwd` mzscilifelab/umi-transfer:latest umi-transfer external --in=read1.fq --in2=read2.fq --umi=umi.fq
Optionally, you can create an alias for the Docker part of the command to be able to use the containerized version as if it were locally installed. Add the line below to your ~/.profile, ~/.bash_aliases, ~/.bashrc or ~/.zprofile (depending on the terminal and configuration being used). The single quotes matter: they ensure that `pwd` is evaluated each time the alias is used rather than once when it is defined.
alias umi-transfer='docker run -t -v `pwd`:`pwd` -w `pwd` mzscilifelab/umi-transfer:latest umi-transfer'
Provided that you have Rust installed on your computer, clone or download this repository and run
cargo build --release
That should create an executable target/release/umi-transfer that can be placed anywhere in your $PATH or be executed directly by specifying its path:
./target/release/umi-transfer --version
umi-transfer 1.5.0
The tool requires three FastQ files as input. You can manually specify the names and locations of the output files with --out and --out2; otherwise, the tool will automatically append a with_UMI suffix to your input file names. You can additionally choose a custom UMI delimiter with --delim and set the flags -f, -c and -z.
-c ensures the canonical read numbers 1 and 2 of paired files in the output, regardless of the read numbers of the input reads. -f / --force overwrites existing output files without prompting the user, and -z enables the internal compression of the output files. Alternatively, you can specify an output file name with a .gz suffix to obtain compressed output.
$ umi-transfer external --help
Integrate UMIs from a separate FastQ file
Usage: umi-transfer external [OPTIONS] --in <R1_IN> --in2 <R2_IN> --umi <RU_IN>
Options:
-c, --correct_numbers
Read numbers will be altered to ensure the canonical read numbers 1 and 2 in output file sequence headers.
-z, --gzip
Compress output files. Turned off by default.
-l, --compression_level <COMPRESSION_LEVEL>
Choose the compression level: Maximum 9, defaults to 3. Higher numbers result in smaller files but take longer to compress.
-t, --threads <NUM_THREADS>
Number of threads to use for processing. Defaults to the number of logical cores available.
-f, --force
Overwrite existing output files without further warnings or prompts.
-d, --delim <DELIM>
Delimiter to use when joining the UMIs to the read name. Defaults to `:`.
--in <R1_IN>
[REQUIRED] Input file 1 with reads.
--in2 <R2_IN>
[REQUIRED] Input file 2 with reads.
-u, --umi <RU_IN>
[REQUIRED] Input file with UMI.
--out <R1_OUT>
Path to FastQ output file for R1.
--out2 <R2_OUT>
Path to FastQ output file for R2.
-h, --help
Print help
-V, --version
Print version
A typical run may look like this:
umi-transfer external -fz -d '_' --in 'R1.fastq' --in2 'R3.fastq' --umi 'R2.fastq'
umi-transfer requires paired input files. To run on singletons, use the same input twice and redirect one output to /dev/null:
umi-transfer external --in read1.fastq --in2 read1.fastq --umi read2.fastq --out output1.fastq --out2 /dev/null
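When several samples need to be processed, the command can be generated once per sample. The sketch below only prints the commands (a dry run) so that they can be inspected before anything is executed. The sample names, the helper name `print_umi_transfer_cmd` and the `_R1`/`_R2`/`_R3` file naming scheme are assumptions made for illustration; only the options shown in the help text above are used.

```shell
# Dry run: print one umi-transfer command per sample instead of executing it.
# The helper name and the _R1/_R2/_R3 naming scheme are hypothetical.
print_umi_transfer_cmd() {
    prefix=$1
    printf 'umi-transfer external -z --in %s_R1.fastq --in2 %s_R3.fastq --umi %s_R2.fastq\n' \
        "$prefix" "$prefix" "$prefix"
}

# Made-up sample names; replace with your own.
for sample in sampleA sampleB; do
    print_umi_transfer_cmd "$sample"
done
```

Once the printed commands look right, the output of the loop can be piped to `sh` to run them one after another.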
With the release of version 1.5, umi-transfer features internal multi-threaded output compression. As a result, umi-transfer 1.5 now runs approximately 25 times faster than version 1.0 when using internal compression and about twice as fast compared to using an external compression tool. This improvement is enabled by the outstanding gzp crate, which abstracts a lot of the underlying complexity away from the main software.
In our first benchmark using 17 threads, version 1.5 of umi-transfer processed approximately 550,000 paired records per second with the default gzip compression level of 3. At the highest compression level of 9, the rate dropped to just below 200,000 records per second. While the exact numbers may vary depending on your storage, file system, and processors, we expect the relative performance rates to remain approximately constant.
In a subsequent benchmark, we tested the effect of increasing the number of threads. For the default compression level, the maximum speed was achieved with 9 to 11 threads. Since umi-transfer writes two output files simultaneously, this configuration allows for 4 to 5 threads per file to handle the output compression.
Adding more threads per file proved unhelpful, as other steps became the rate-limiting factors. These factors include file system I/O, input file decompression, and the actual editing of the file contents, which now determine the performance of umi-transfer. Only when increasing the compression level to higher settings did adding more threads continue to provide a performance benefit. For the highest compression setting, we did not reach the plateau phase during the benchmark, but it is likely to occur in the range of 53-55 total threads, or about 26 threads per output file.
In summary, we recommend running umi-transfer with 9 or 11 threads for compression. Odd numbers are favorable as they allow one dedicated main thread, while evenly splitting the remaining threads between the two output files. Note that specifying more threads than the available physical or logical cores on your machine will result in a severe performance loss, since umi-transfer operates synchronously.
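These recommendations can be turned into a small helper that derives a value for `--threads` from the number of available cores. This is a sketch under the assumptions stated in the comments: the function name `pick_threads` is made up, and the cap of 11 simply reflects the upper end of the recommendation for the default compression level.

```shell
# Hypothetical helper: pick an odd thread count that does not exceed the
# available cores and is capped at 11 (default compression level assumed).
pick_threads() {
    threads=$1
    if [ "$threads" -gt 11 ]; then
        threads=11
    fi
    # Make the count odd: one main thread plus an even split across two outputs.
    if [ $((threads % 2)) -eq 0 ]; then
        threads=$((threads - 1))
    fi
    echo "$threads"
}

# Number of online logical cores; falls back to 1 if getconf cannot report it.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "Suggested option: --threads $(pick_threads "$cores")"
```

For higher compression levels, the benchmark suggests that more threads can still help, so the cap would need to be raised accordingly.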
umi-transfer cannot be used with the pipe operator, because it neither supports writing output to stdout nor reading input from stdin. However, FIFOs (First In, First Out buffered pipes) can be used to elegantly combine umi-transfer with other software on GNU/Linux and macOS operating systems.
For example, we may want to use external compression software like Parallel Gzip together with umi-transfer. For this purpose, it would be unfavorable to write the data uncompressed to disk before compressing it. Instead, we create named pipes with mkfifo, which can be provided to umi-transfer as if they were regular output file paths. In reality, the data is directly passed on to pigz via a buffered stream.
First, the named pipes are created:
mkfifo output1
mkfifo output2
Then a multi-threaded pigz compression is tied to each FIFO. Note the trailing & to leave these processes running in the background.
$ pigz -p 10 -c > output1.fastq.gz < output1 &
[4] 233394
$ pigz -p 10 -c > output2.fastq.gz < output2 &
[5] 233395
The argument -p 10 specifies the number of threads that each pigz process may use. The optimal setting is hardware-specific and will require some testing.
Finally, we can run umi-transfer using the FIFOs as output paths:
umi-transfer external --in read1.fastq --in2 read3.fastq --umi read2.fastq --out output1 --out2 output2
It's good practice to remove the FIFOs after the program has finished:
rm output1 output2
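The mechanism itself can be tried out without umi-transfer or pigz being installed. In the stand-in demo below, `printf` plays the role of umi-transfer (the writer) and plain `gzip` the role of pigz (the reader); the temporary directory, file names and read content are made up for illustration.

```shell
# Stand-in demo of the FIFO mechanism: printf plays the role of umi-transfer,
# gzip the role of pigz. All file names and the read are made up.
workdir=$(mktemp -d)
mkfifo "$workdir/output1"

# The reader blocks in the background until a writer opens the FIFO.
gzip -c < "$workdir/output1" > "$workdir/output1.fastq.gz" &

# The writer treats the FIFO like a regular output file.
printf '@read1\nACGT\n+\nIIII\n' > "$workdir/output1"

# Wait for the background compressor to finish before using its output.
wait
first_line=$(gzip -dc "$workdir/output1.fastq.gz" | head -n 1)
echo "$first_line"

# Remove the FIFO together with the temporary directory.
rm -r "$workdir"
```

The `wait` call matters in scripts: without it, the compressed file might be read or moved before the background process has written its final bytes.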
umi-transfer is free and open-source software developed and maintained by scientists of the Swedish National Genomics Infrastructure. We gladly welcome suggestions for improvement, bug reports and code contributions.
If you'd like to contribute code, the best way to get started is to create a personal fork of the repository. Subsequently, use a new branch to develop your feature or contribute your bug fix. Ideally, use a code linter like rust-analyzer in your code editor and run the tests with cargo test.
Before developing a new feature, we recommend opening an issue on the main repository to discuss your proposal upfront. Once you're ready, simply open a pull request to the dev branch and we'll happily review your changes. Thanks for your interest in contributing to umi-transfer!