Welcome to the `nf-hello-gatk` repository! This is a demonstration pipeline built with Nextflow, designed to showcase a basic genomic analysis workflow using GATK (the Genome Analysis Toolkit). It is ideal for demonstrating how to handle input and output files via channels and pass them between processes effectively.
The nf-hello-gatk pipeline performs variant calling with GATK HaplotypeCaller on a set of BAM files. It ships with test data in `./data` that runs in seconds, so you can demonstrate the pipeline quickly. It also has built-in support for Docker, which simplifies dependency management and ensures consistent execution environments.
If you are following the hello-nextflow series on https://training.nextflow.io/, you will create a similar version of this pipeline. This one has a few small differences:
- It adds `bcftools stats` and MultiQC to demonstrate some basic quality control and reporting.
- It writes some outputs using the workflow output publishing syntax (a sketch of that syntax follows this list).
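The workflow output publishing syntax is a newer Nextflow feature whose exact form has changed across releases, so what this repository uses may differ from the sketch below; the channel and target names (`ch_vcf`, `'variants'`) and the process call are purely illustrative.

```nextflow
// Sketch of one preview form of the workflow output definition.
// Earlier Nextflow releases required enabling the preview flag first.
nextflow.preview.output = true

workflow {
    main:
    ch_vcf = SOME_VARIANT_CALLER(params.bams)   // hypothetical process call, for illustration only

    publish:
    ch_vcf >> 'variants'                        // route this channel to the 'variants' publish target
}

output {
    directory 'results'                         // base directory for published files
    mode 'copy'                                 // copy files rather than symlink them
}
```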
To run this pipeline locally, you need to have the following software installed:

- Nextflow
- Docker (used by default for all processes; see below if you want to disable it)
To run the pipeline, use the following command:
```bash
nextflow run seqeralabs/nf-hello-gatk
```
If you wish, you can supply your own parameters via command-line options. These are the defaults, with paths given relative to the root of the repository:
```bash
nextflow run seqeralabs/nf-hello-gatk \
    --bams "./data/bam/*.bam" \
    --reference ./data/ref/ref.fasta \
    --reference_index ./data/ref/ref.fasta.fai \
    --reference_dict ./data/ref/ref.dict \
    --calling_intervals ./data/ref/intervals.bed \
    --cohort_name my_cohort
```
This will run the pipeline using the supplied BAM files and reference data.
The pipeline allows for the following input parameters:
- `--bams`: A glob pattern specifying the input BAM files.
- `--reference`: Path to the reference genome FASTA file.
- `--reference_index`: Path to the index file (`.fai`) of the reference genome.
- `--reference_dict`: Path to the dictionary file (`.dict`) of the reference genome.
- `--calling_intervals`: Path to the intervals file used for variant calling.
- `--cohort_name`: A name for the cohort being analyzed (used in naming output files).
Example of running the pipeline:
```bash
nextflow run seqeralabs/nf-hello-gatk \
    --bams "./data/bams/*.bam" \
    --reference ./data/ref/hg38.fasta \
    --reference_index ./data/ref/hg38.fasta.fai \
    --reference_dict ./data/ref/hg38.dict \
    --calling_intervals ./data/ref/intervals.bed \
    --cohort_name sample_cohort
```
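Equivalently, the same values can be collected in a params file and passed with Nextflow's `-params-file` option; the file name and contents below are just an example.

```yaml
# params.yaml -- example values only
bams: "./data/bam/*.bam"
reference: "./data/ref/ref.fasta"
reference_index: "./data/ref/ref.fasta.fai"
reference_dict: "./data/ref/ref.dict"
calling_intervals: "./data/ref/intervals.bed"
cohort_name: "my_cohort"
```

Then run `nextflow run seqeralabs/nf-hello-gatk -params-file params.yaml`.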
By default, the pipeline uses Docker for every process. This is enabled via a configuration option in `nextflow.config`; Nextflow handles downloading the necessary Docker images and running each process within a container.
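The relevant setting looks something like this (the repository's actual `nextflow.config` may contain additional options):

```groovy
// nextflow.config -- container support switched on by default
docker.enabled = true
```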
If you wish to disable this, you can use the following configuration option:
```groovy
docker.enabled = false
```
Note: You will need to provide the software dependencies yourself or use an alternative method to manage them.
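For example, one alternative (not shipped with this repository, so the process names and package specs below are assumptions) is to let Nextflow manage per-process Conda environments:

```groovy
// Hypothetical nextflow.config overrides; adjust process names and versions
// to match the pipeline's actual process definitions.
docker.enabled = false
conda.enabled  = true

process {
    withName: 'GATK_HAPLOTYPECALLER' { conda = 'bioconda::gatk4' }
    withName: 'BCFTOOLS_STATS'       { conda = 'bioconda::bcftools' }
    withName: 'MULTIQC'              { conda = 'bioconda::multiqc' }
}
```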
For more advanced usage, such as customizing the workflow, modifying the process definitions, or integrating additional tools, you can edit the `main.nf` file or create custom configuration files.
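For instance, integrating another tool typically means adding a process definition to `main.nf` and calling it from the workflow block. The process below is a hypothetical sketch (name, container, and wiring are placeholders), not something that ships with this repository:

```nextflow
// Hypothetical extra QC step: count the records in each VCF.
process COUNT_VARIANTS {
    // container '<image providing bcftools>'   // choose and pin an image yourself

    input:
    path vcf

    output:
    path "${vcf.baseName}.count.txt"

    script:
    """
    bcftools view -H ${vcf} | wc -l > ${vcf.baseName}.count.txt
    """
}
```

It would then be invoked from the workflow block on the channel of VCFs emitted by the variant-calling step.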
Contributions are welcome! Please submit a pull request or open an issue if you have suggestions for improvements or find any bugs.
All training material was originally written by Seqera but has been made open-source (CC BY-NC-ND) for the community.
Copyright 2020-2023, Seqera. All examples and descriptions are licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.