This is the Nextflow-based quantification component used for droplet-based studies in Single Cell Expression Atlas. It is essentially a wrapper around Alevin-fry, with some additional logic for sample handling.
For anyone starting out with droplet RNA-seq analysis, especially using Alevin, we highly recommend this tutorial, which covers a workflow similar to the one used here.
The workflow does the following:
- Downloads FASTQ files from specified URIs using our FASTQ provider. The FASTQ provider does some special things for EBI users, but by default simply fetches the specified files with wget.
- Interprets the provided samples table and configuration to determine the correct arguments for Alevin-fry, combining technical replicate groups where appropriate for analysis.
- Runs Alevin-fry.
- Makes a droplet barcode plot for each library.
- Runs emptyDrops to remove droplets that clearly do not contain cells.
The workflow requires as input:
- A pre-prepared Salmon splici index, built from a reference containing spliced transcripts and introns
- A three-column transcript-to-gene mapping file, produced when building the splici transcriptome, which Alevin-fry uses to summarise quantifications to the gene level
- A tabular samples file (SDRF) containing information about the libraries to be quantified
- A Nextflow configuration file describing the data in the tabular samples table
- A string specifying whether the experiment is snRNA or scRNA
- An output directory for results
- A whitelist for 10XV2 and 10XV3 to filter barcodes. (Note: the 10XV3 whitelist must be obtained separately, and its location needs to be set in the nextflow.config file under params.10xv3.whitelist. The original download link no longer works; the whitelist file comes with CellRanger.)
The transcript-to-gene map is simply a tab-delimited file with transcript and gene identifiers, plus a third column (S in the rows shown), like:
FBti0019092-RA FBti0019092 S
FBti0019093-RA FBti0019093 S
FBti0019100-RA FBti0019100 S
FBti0019102-RA FBti0019102 S
FBti0019104-RA FBti0019104 S
FBti0019106-RA FBti0019106 S
FBti0019111-RA FBti0019111 S
FBti0019113-RA FBti0019113 S
FBti0019115-RA FBti0019115 S
FBti0019116-RA FBti0019116 S
...
Note: it's important that this map contains a gene for every transcript in your reference.
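One way to check this before running the workflow is sketched below. It is a minimal illustration, not part of the workflow itself: the function name, the in-memory inputs, and the transcript FBti0019999-RA are all made up for the example, and it assumes you can list the transcript identifiers in your reference separately.

```python
import csv
import io


def find_unmapped_transcripts(t2g_lines, transcript_ids):
    """Return transcript IDs that have no gene in the transcript-to-gene map.

    t2g_lines: iterable of tab-delimited lines (transcript, gene, third column).
    transcript_ids: iterable of transcript IDs present in the reference.
    """
    mapped = set()
    for row in csv.reader(t2g_lines, delimiter="\t"):
        # Only count rows that have both a transcript and a non-empty gene.
        if len(row) >= 2 and row[0] and row[1]:
            mapped.add(row[0])
    return sorted(set(transcript_ids) - mapped)


# Hypothetical inputs: two rows from the example above, plus one transcript
# (FBti0019999-RA, invented for illustration) that the map does not cover.
example_map = io.StringIO(
    "FBti0019092-RA\tFBti0019092\tS\n"
    "FBti0019093-RA\tFBti0019093\tS\n"
)
reference_ids = ["FBti0019092-RA", "FBti0019093-RA", "FBti0019999-RA"]

missing = find_unmapped_transcripts(example_map, reference_ids)
print(missing)
```

An empty list means every transcript has a gene; anything else lists the transcripts that need adding to the map.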
Here's an example of a real tabular input to the workflow.
Comment[ENA_RUN] Comment[LIBRARY_LAYOUT] Comment[technical replicate group] Comment[LIBRARY_STRAND] cdna_uri cell_barcode_uri umi_barcode_uri Comment[cDNA read offset] Comment[cell barcode offset] Comment[umi barcode offset] Comment[cDNA read size] Comment[cell barcode size] Comment[umi barcode size] end Comment[cell count]
SRR6327113 SINGLE SAMN08105407 first strand sra/ftp://ftp.sra.ebi.ac.uk/vol1/srr/SRR632/003/SRR6327113/SRR6327113.fastq.gz/SRR6327113_2.fastq.gz sra/ftp://ftp.sra.ebi.ac.uk/vol1/srr/SRR632/003/SRR6327113/SRR6327113.fastq.gz/SRR6327113_1.fastq.gz sra/ftp://ftp.sra.ebi.ac.uk/vol1/srr/SRR632/003/SRR6327113/SRR6327113.fastq.gz/SRR6327113_1.fastq.gz 16 46 16 10 5 3158
SRR6327115 SINGLE SAMN08105406 first strand sra/ftp://ftp.sra.ebi.ac.uk/vol1/srr/SRR632/005/SRR6327115/SRR6327115.fastq.gz/SRR6327115_2.fastq.gz sra/ftp://ftp.sra.ebi.ac.uk/vol1/srr/SRR632/005/SRR6327115/SRR6327115.fastq.gz/SRR6327115_1.fastq.gz sra/ftp://ftp.sra.ebi.ac.uk/vol1/srr/SRR632/005/SRR6327115/SRR6327115.fastq.gz/SRR6327115_1.fastq.gz 16 55 16 10 5 1418
SRR6327103 SINGLE SAMN08105415 first strand sra/ftp://ftp.sra.ebi.ac.uk/vol1/srr/SRR632/003/SRR6327103/SRR6327103.fastq.gz/SRR6327103_2.fastq.gz sra/ftp://ftp.sra.ebi.ac.uk/vol1/srr/SRR632/003/SRR6327103/SRR6327103.fastq.gz/SRR6327103_1.fastq.gz sra/ftp://ftp.sra.ebi.ac.uk/vol1/srr/SRR632/003/SRR6327103/SRR6327103.fastq.gz/SRR6327103_1.fastq.gz 16 55 16 10 5 1661
SRR6327117 SINGLE SAMN08105405 first strand sra/ftp://ftp.sra.ebi.ac.uk/vol1/srr/SRR632/007/SRR6327117/SRR6327117.fastq.gz/SRR6327117_2.fastq.gz sra/ftp://ftp.sra.ebi.ac.uk/vol1/srr/SRR632/007/SRR6327117/SRR6327117.fastq.gz/SRR6327117_1.fastq.gz sra/ftp://ftp.sra.ebi.ac.uk/vol1/srr/SRR632/007/SRR6327117/SRR6327117.fastq.gz/SRR6327117_1.fastq.gz 16 86 16 10 5 4328
...
The links you see here have a special format which tells the FASTQ provider to download and unpack SRA files for those libraries, but yours will likely be simple URIs pointing directly to FASTQ files.
The meaning of the fields is covered in more detail below.
params {
    protocol = '10xv2'
    fields {
        run = 'Comment[ENA_RUN]'
        layout = 'Comment[LIBRARY_LAYOUT]'
        techrep = 'Comment[technical replicate group]'
        cdna_uri = 'cdna_uri'
        cell_barcode_uri = 'cell_barcode_uri'
        umi_barcode_uri = 'umi_barcode_uri'
        cdna_read_offset = 'Comment[cDNA read offset]'
        cell_barcode_offset = 'Comment[cell barcode offset]'
        umi_barcode_offset = 'Comment[umi barcode offset]'
        cdna_read_size = 'Comment[cDNA read size]'
        cell_barcode_size = 'Comment[cell barcode size]'
        umi_barcode_size = 'Comment[umi barcode size]'
        end = 'end'
        cell_count = 'Comment[cell count]'
    }
}
This configuration refers to the fields in the samples table and describes how they should be used to control how Alevin is run.
In this example the libraries are 10X v2, which means they have a 16bp cell barcode and a 10bp unique molecular identifier (UMI). The barcode configuration is illustrated in the samples file above, but you should change the configuration depending on your libraries. For example, 10Xv3 libraries have 12bp UMIs.
The allowable protocol names are here. Take special care to use the 5prime library type if necessary for the newer 10X 5-prime libraries, since if you fail to do so your mapping rate will be extremely low.
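A quick consistency check between the declared protocol and the barcode sizes in the samples table might look like the sketch below. It is illustrative only: the function name and the example row are hypothetical, the 16bp cell barcode for 10xv3 is an assumption not stated above, and only the two protocols discussed here are covered.

```python
# Barcode geometry per protocol. The 10xv2 values and the 12bp 10xv3 UMI come
# from the text above; the 16bp 10xv3 cell barcode is an assumption.
EXPECTED_SIZES = {
    "10xv2": {"cell_barcode_size": 16, "umi_barcode_size": 10},
    "10xv3": {"cell_barcode_size": 16, "umi_barcode_size": 12},
}

# Subset of the params.fields mapping from the configuration above.
FIELDS = {
    "run": "Comment[ENA_RUN]",
    "cell_barcode_size": "Comment[cell barcode size]",
    "umi_barcode_size": "Comment[umi barcode size]",
}


def check_row(row, protocol, fields=FIELDS):
    """Return a list of readable mismatches for one samples-table row."""
    problems = []
    run_id = row[fields["run"]]
    for key, expected in EXPECTED_SIZES[protocol].items():
        observed = int(row[fields[key]])
        if observed != expected:
            problems.append(
                f"{run_id}: {key} is {observed}, expected {expected} for {protocol}"
            )
    return problems


# Hypothetical row, keyed by samples-table column name.
row = {
    "Comment[ENA_RUN]": "RUN0001",
    "Comment[cell barcode size]": "16",
    "Comment[umi barcode size]": "10",
}

print(check_row(row, "10xv2"))  # consistent with 10xv2
print(check_row(row, "10xv3"))  # reports the 10bp UMI as a mismatch
```

The lookup goes through the field mapping rather than hard-coded column names, mirroring how the configuration decouples the workflow from the exact headers in your samples table.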
The workflow uses Conda to provide software dependencies for individual workflow steps. You must also have Nextflow itself installed. With those prerequisites in place the workflow is run like:
nextflow run \
    -config CONF_FILE \
    --sdrf SAMPLES_FILE \
    --protocol PROTOCOL \
    --resultsRoot OUTPUT_DIR \
    --transcriptToGene TRANSCRIPT_TO_GENE \
    --transcriptomeIndex TRANSCRIPTOME_INDEX \
    --mode scRNA \
    -resume \
    main.nf
If the workflow runs successfully you will see a folder for every sample under 'alevin' at your specified output directory. See the Alevin docs for more info on this output.
This directory is a copy of the raw Alevin output converted to a more useful 10X-style MTX that can be read by many tools (Scanpy, DropletUtils etc.).
This is the same as above, but with emptyDrops() applied to remove empty droplets.
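To make the MTX format concrete, here is a minimal standard-library sketch that parses a MatrixMarket coordinate file into a dictionary of non-zero entries. The function name and the tiny in-memory matrix are hypothetical, and the sketch says nothing about which axis holds genes and which holds barcodes in this workflow's output; check that against the files you get.

```python
import io


def read_mtx_coordinate(handle):
    """Parse a MatrixMarket coordinate file into (shape, {(row, col): value}).

    Indices in the file are 1-based; they are converted to 0-based here.
    """
    header = handle.readline()
    if not header.startswith("%%MatrixMarket matrix coordinate"):
        raise ValueError("not a MatrixMarket coordinate file")
    # Skip comment lines (starting with '%') and blank lines.
    line = handle.readline()
    while line and (line.startswith("%") or not line.strip()):
        line = handle.readline()
    # The first data line gives the dimensions and the number of entries.
    n_rows, n_cols, n_entries = (int(x) for x in line.split())
    entries = {}
    for _ in range(n_entries):
        i, j, value = handle.readline().split()
        entries[(int(i) - 1, int(j) - 1)] = float(value)
    return (n_rows, n_cols), entries


# Hypothetical 3 x 2 matrix with three non-zero counts.
example = io.StringIO(
    "%%MatrixMarket matrix coordinate integer general\n"
    "% made-up example for illustration\n"
    "3 2 3\n"
    "1 1 5\n"
    "3 1 2\n"
    "2 2 7\n"
)
shape, counts = read_mtx_coordinate(example)
print(shape, counts)
```

For real analyses you would normally use an existing reader instead, for example `scanpy.read_10x_mtx` in Python or `read10xCounts` from DropletUtils in R.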
This folder contains a QC plot like this one:
This shows a barcode plot for the library and illustrates the impact of running emptyDrops().
This plot should show a clear drop-off, as steep as possible, between populated and unpopulated droplets.
In this example you should be able to see a population of droplets to the right of the top plot that are absent in the bottom one, removed by emptyDrops(). Note that we have not simply applied one of the thresholds illustrated in the figure, since this can lead to issues with removal of certain genuine cell populations. See the emptyDrops paper to understand what that tool actually does. We would also again point you at this tutorial for more in-depth coverage of this.
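The data behind a barcode plot is just per-barcode totals sorted from highest to lowest. The sketch below builds that curve and locates the largest relative drop between neighbouring ranks. The barcodes and counts are invented, and the "largest drop" heuristic is only a crude illustration of the drop-off the plot should show; it is not what emptyDrops does, and it assumes all totals are positive.

```python
def barcode_rank_curve(umi_totals):
    """Return (rank, total) pairs, highest total first, for a barcode plot."""
    ordered = sorted(umi_totals.values(), reverse=True)
    return list(enumerate(ordered, start=1))


# Hypothetical per-barcode UMI totals: three cell-like, three near-empty.
totals = {
    "AAAC": 5200, "AAAG": 4800, "AAAT": 4100,
    "AACA": 12, "AACC": 9, "AACG": 3,
}
curve = barcode_rank_curve(totals)

# Ratio of each total to the next one down, paired with the rank before the
# drop. Assumes positive totals, so there is no division by zero.
drops = [
    (curve[i][1] / curve[i + 1][1], curve[i][0])
    for i in range(len(curve) - 1)
]
ratio, rank_before_drop = max(drops)

print(curve)
print(rank_before_drop)
```

Plotting rank against total on log-log axes gives the familiar barcode plot; in a good library the steepest part of the curve separates the populated droplets from the rest.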
With these results in hand we do the following (you may want to do something similar):