UMI FASTQ file #703

adamrtalbot · 2022-11-29T18:42:39Z

UMI FASTQ file ~~composed of random 9bp synthetic oligos, all with uniform quality.~~

Generated by stripping the UMI sequence from the existing FASTQ and turning it into a separate file. This will be a valid reference format for sequencing kits where the UMI is embedded in the index.

UMI FASTQ file composed of random 9bp synthetic oligos, all with uniform quality. Created synthetically to match the existing UMI FASTQ file(s).

lescai

Hi :)
Could you please describe this file in a little more detail?
If this is the use case where UMIs are present in a third FASTQ, then the test dataset should include 3 files: forward and reverse reads (without UMIs in the sequence), and a UMI file.

lescai · 2022-11-30T10:00:11Z

Also, the UMI read structure is needed in order to process the sequences.

adamrtalbot · 2022-11-30T10:12:50Z

Yep no problem.

The entire read is the UMI sequence, and it matches the existing FASTQs that are in the repository. Here are the existing FASTQ files:

```
# test-datasets/data/genomics/homo_sapiens/illumina/fastq$ zcat test.umi_1.fastq.gz | head -8
@922332/1
ATTTCAGAGAGAGGATCTCGTGTAGAAATTGCTTTGAGCTGTTCTTTGTCATTTTCCCTTAATTCATTGTCTCTAGCTAGTCTGTTACTCTGTAAAATAAAATAATAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTTAAGGTCAGTG
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEEEEEEEEEAEAEE6AAEEEEE/EAAAA<AEEEEAAEEAAAA<EEE/
@928177/1
ACATAAACAAAAGTATATAAGTAATACATATTTATAAATCTATTAAGAAAGCAAGTAATATGTACCTTAAGAATTTAATGGGAAAATAATTAGACTTACTTTAAATGCCAAAAGAAAAAGTGCCCAATCCTTTGATTAGTCAATGCTTTCT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEAEE<EE<EEEEEEEEEEEEEEEEEEEEEEEEEAEEE<EEEEEEEAEAAEEEE<EAEAAAAE<<AAEEAEEAEEE
```

```
# test-datasets/data/genomics/homo_sapiens/illumina/fastq$ zcat test.umi_2.fastq.gz | head -8
@922332/2
TATTATTTTATTTTACAGAGTAACAGACTAGCTAGAGACAATGAATTAAGGGAAAATGACAAAGAACAGCTCAAAGCAATTTCTACACGAGATCCTCTCTCTGAAATAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGAACCGCGAT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEA<AEEE<<<<AAEEEEEEEEEEA<EEEAEAE//A<AAE<6
@928177/2
TGAGATTTTTACTGAAGAAAGCATTGACTAATCAAAGGATTGGGCACTTTTTCTTTTGGCATTTAAAGTAAGTCTAATTATTTTCCCATTAAATTCTTAAGGTACATATTACTTGCTTTCTTAATAGATTTATAAATATGTATTACTTATA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEEEAEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEAEEEE/EEEE<EEEEEEEEEEEEE<AAEEEAEAEEEEAEE<AAAEA/AEEEEAEAEEEEEAEEAE/
```

and here is the new one:

```
# test-datasets/data/genomics/homo_sapiens/illumina/fastq$ zcat test.umi_umi.fastq.gz | head -8
@922332
TGACCATTT
+
FFFFFFFFF
@928177
TTTGAACAG
+
FFFFFFFFF
```

As you can see, the UMI FASTQ file matches the existing FASTQ files, saving us some storage. I generated the FASTQs by:

1. Aligning to the human genome
2. Grouping reads by position
3. Randomly assigning them to a UMI family (Poisson distribution, lambda = 2)
4. Creating a FASTQ file based on those families
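A rough sketch of steps 2 to 4 above, not the actual script. It assumes the alignment step has already produced a mapping from read name to position, and it reads "Poisson distribution, lambda 2" as "family sizes are drawn from Poisson(2)", with a draw of zero bumped to one so every family holds at least one read. The function names, the input shape, and the example positions are all hypothetical; the 9bp length and uniform `F` quality follow the file shown above.

```python
import math
import random
from collections import defaultdict


def draw_poisson(lam, rng):
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k = 0
    p = 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def random_umi(length, rng):
    """Return a random DNA oligo of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def assign_umi_families(read_positions, lam=2.0, umi_length=9, seed=0):
    """Map each read name to a UMI shared by its (hypothetical) family.

    read_positions: dict of read name -> (chrom, pos), assumed to come
    from a prior alignment step that is not shown here.
    """
    rng = random.Random(seed)

    # Step 2: group reads by alignment position.
    by_position = defaultdict(list)
    for name, position in read_positions.items():
        by_position[position].append(name)

    # Step 3: within each position, cut the shuffled reads into families
    # whose sizes are Poisson draws (assumption: minimum size of 1).
    umis = {}
    for position in sorted(by_position):
        names = by_position[position]
        rng.shuffle(names)
        start = 0
        while start < len(names):
            size = max(1, draw_poisson(lam, rng))
            umi = random_umi(umi_length, rng)
            for name in names[start:start + size]:
                umis[name] = umi
            start += size
    return umis


def to_fastq(umis, quality_char="F"):
    """Step 4: render one four-line FASTQ record per read, uniform quality."""
    return "".join(
        f"@{name}\n{umi}\n+\n{quality_char * len(umi)}\n"
        for name, umi in umis.items()
    )


# Example positions are made up for illustration.
positions = {"922332": ("chr22", 100), "928177": ("chr22", 100)}
print(to_fastq(assign_umi_families(positions)), end="")
```

Reads at the same position can land in the same family or in different ones, which is what gives a consensus caller something to group.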

I'll upload the script later today and update here. I've checked the method and it seems to work fine in our pipeline.

The read structure (bases mask) is `+T +T +M`, one entry per input file in the order `test.umi_1.fastq.gz test.umi_2.fastq.gz test.umi_umi.fastq.gz`. You could be more explicit with `150T 150T 9M`, or use the read structure to cut out the existing UMIs from those files.
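To make the notation concrete, here is a minimal sketch of how such a read structure could be interpreted. It assumes the fgbio-style convention, where each segment is a length (or `+` for "the rest of the read", allowed only on the last segment) followed by a type letter: `T` template, `M` molecular barcode (UMI), `S` skip, `B` sample barcode. The helper names are hypothetical and this is not the parser the pipeline itself uses.

```python
import re

# One segment: a length (digits) or '+', followed by a segment type.
SEGMENT = re.compile(r"(\d+|\+)([TBMS])")


def parse_read_structure(structure):
    """Parse e.g. '23S+T' into [(23, 'S'), (None, 'T')]; None means 'rest of read'."""
    segments = []
    pos = 0
    while pos < len(structure):
        match = SEGMENT.match(structure, pos)
        if match is None:
            raise ValueError(f"cannot parse {structure!r} at offset {pos}")
        length = None if match.group(1) == "+" else int(match.group(1))
        segments.append((length, match.group(2)))
        pos = match.end()
    if not segments:
        raise ValueError("empty read structure")
    if any(length is None for length, _ in segments[:-1]):
        raise ValueError("'+' is only allowed on the last segment")
    return segments


def split_read(bases, structure):
    """Cut a read into (segment type, bases) pieces according to the structure."""
    pieces = []
    offset = 0
    for length, kind in parse_read_structure(structure):
        end = len(bases) if length is None else offset + length
        if end > len(bases):
            raise ValueError("read is shorter than the read structure")
        pieces.append((kind, bases[offset:end]))
        offset = end
    return pieces


# One structure per input FASTQ, in file order.
for structure in "+T +T +M".split():
    print(structure, parse_read_structure(structure))
```

With this reading, `+M` applied to the third file marks the whole read as UMI, which matches the statement that the entire read is the UMI sequence.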

adamrtalbot · 2022-11-30T10:35:38Z

I've just checked your development branch, and I think the syntax would be:
```
ext.args = "--read-structures +T 23S+T +M"
```

This means it will have the same UMI sequences.

adamrtalbot · 2022-11-30T16:39:53Z

Slight change: I've extracted the first 12bp of each read in test.umi_2.fastq.gz and put them in the UMI FASTQ file. It should now have exactly the same UMI sequences as the existing FASTQ and should create almost identical consensus reads.

```
# test-datasets/data/genomics/homo_sapiens/illumina/fastq$ zcat test.umi_umi.fastq.gz | head -8
@922332
TATTATTTTATT
+
AAAAAEEEEEEE
@928177
TGAGATTTTTAC
+
AAAAAEEEEEEE
```
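For reference, the extraction can be sketched in a few lines of Python. This is an illustration of the idea rather than the command actually used: it assumes well-formed four-line FASTQ records, takes the first 12 bases and qualities of each read, and drops the `/2` suffix from the read name so the headers look like the output above. The function name and file paths are hypothetical.

```python
import gzip


def write_umi_fastq(read2_path, umi_path, umi_length=12):
    """Copy the first umi_length bases and qualities of each read into a new FASTQ."""
    with gzip.open(read2_path, "rt") as src, gzip.open(umi_path, "wt") as dst:
        while True:
            header = src.readline().rstrip("\n")
            if not header:
                break  # end of file
            bases = src.readline().rstrip("\n")
            src.readline()  # '+' separator line, not needed
            quals = src.readline().rstrip("\n")

            # Keep only the read name and strip the mate suffix, if present.
            name = header.split()[0]
            if name.endswith("/2"):
                name = name[:-2]

            dst.write(f"{name}\n{bases[:umi_length]}\n+\n{quals[:umi_length]}\n")


# Hypothetical usage:
# write_umi_fastq("test.umi_2.fastq.gz", "test.umi_umi.fastq.gz")
```

Because the records are written in input order, the read names in the UMI file stay in step with the two existing FASTQ files.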

@lescai I've checked your subworkflow in development and it already works with three FASTQ files nicely! We just have to add an additional test.

adamrtalbot · 2023-01-13T17:16:48Z

@lescai did you have a chance to check this?

UMI FASTQ file

b1dd553

UMI FASTQ file composed of random 9bp synthetic oligos, all with uniform quality. Created synthetically to match existing UMI fastq file(s)

adamrtalbot requested a review from lescai November 29, 2022 18:42

lescai requested changes Nov 30, 2022


UMI FASTQ file uses first 12bp from test.umi_2.fastq.gz

2a1dd97

This means it will have the same UMI sequences.

adamrtalbot mentioned this pull request May 14, 2024

Third umi read nf-core/fastquorum#11

Closed

10 tasks

maxulysse approved these changes May 15, 2024
