Monroe v1.0.0
Bioinformatics pipeline for SARS-CoV-2 genome assembly and sample cluster detection.
- SARS-CoV-2 genome assembly can be performed from read data generated using the ARTIC PCR tiling protocols (V1, V2, or V3) with either an Illumina sequencing platform (e.g. MiSeq) or an Oxford Nanopore Technologies MinION device
- Cluster detection can be performed from input assembly files (fasta) generated from any sequencing protocol
Data workflow:
Monroe consists of three separate Nextflow pipelines: Illumina paired-end read assembly (pe_assembly), Oxford Nanopore Technologies read assembly (ont_assembly), and cluster analysis from assembled SARS-CoV-2 (SC2) genomes (cluster_analysis).
Monroe’s Nextflow pipelines can be executed using the following command format:
$ staphb-wf monroe <monroe_pipeline> [options]
<monroe_pipeline> must be pe_assembly, ont_assembly, or cluster_analysis.
Paired-End Read Assembly:
Monroe pe_assembly uses Trimmomatic and BBDuk to perform read trimming and adapter/PhiX removal prior to mapping read data to a reference SARS-CoV-2 genome (Wuhan-1; NCBI RefSeq NC_045512.2) with minimap2. Paired-fastq files are pulled from the alignment file with SAMtools. These filtered read data (i.e. paired reads that map to the reference genome) are stored as separate files that can be uploaded to public repositories such as NCBI SRA.
The minimap2 alignment file is also used to generate a consensus assembly after trimming ARTIC primers using iVar v1.2.1.
Quick Start
$ staphb-wf monroe pe_assembly <input_dir> -o <output_dir> --primers <ARTIC_primer_version>
- <input_dir> is a positional argument that designates the path to an input directory containing paired-end read data (fastq).
- -o specifies the directory to which Monroe will write all results; if an <output_dir> is not provided, results will be written to a monroe_run_<date> directory.
- --primers specifies the ARTIC primer set version used to generate the read data; it must be designated as V1, V2, or V3.
Other Options
- --profile: Nextflow profile, either Docker or Singularity. By default, Monroe tries Docker first, then Singularity if the docker executable cannot be found.
- --config, -c: Path to a custom Nextflow configuration file
- --resume: Resume a previous run
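When Monroe is launched from a wrapper script, the options above can be assembled and validated before the pipeline starts. The following is a minimal sketch and not part of Monroe: the helper name build_pe_assembly_cmd and the example paths are hypothetical, and only flags documented in this README are used.

```python
from typing import List, Optional

# ARTIC primer versions named in this README
VALID_PRIMERS = {"V1", "V2", "V3"}


def build_pe_assembly_cmd(
    input_dir: str,
    primers: str,
    output_dir: Optional[str] = None,
    config: Optional[str] = None,
    resume: bool = False,
) -> List[str]:
    """Return an argv list for `staphb-wf monroe pe_assembly` (hypothetical helper)."""
    if primers not in VALID_PRIMERS:
        raise ValueError(
            f"--primers must be one of {sorted(VALID_PRIMERS)}, got {primers!r}"
        )
    cmd = ["staphb-wf", "monroe", "pe_assembly", input_dir, "--primers", primers]
    if output_dir is not None:
        cmd += ["-o", output_dir]   # otherwise Monroe writes to monroe_run_<date>
    if config is not None:
        cmd += ["-c", config]       # custom Nextflow configuration file
    if resume:
        cmd.append("--resume")      # resume a previous run
    return cmd


if __name__ == "__main__":
    # Example paths are placeholders, not real data locations.
    print(" ".join(build_pe_assembly_cmd("reads/", "V3", output_dir="results/")))
```

The returned list can be handed to subprocess.run(cmd, check=True); building a list rather than a single string avoids shell-quoting problems with paths that contain spaces.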
Output:
Monroe pe_assembly will organize all output into four subdirectories under the specified <output_dir>:
- <output_dir>/alignments: Sorted BAM files after minimap2 mapping and iVar primer trimming
- <output_dir>/assemblies: Consensus genome assemblies in fasta format, as well as a <date>_quality_metrics.tsv file containing quality metrics for all genome assemblies
- <output_dir>/SC2_reads: Paired read data that have mapped to the Wuhan-1 reference genome
- <output_dir>/logs: Nextflow execution report (Monroe_execution_report.html), trace file (Monroe_trace.txt), and task work directories
Sample quality metrics file:
- sample: isolate ID pulled from the fastq file
- aligned_bases: number of bases mapped to the SARS-CoV-2 reference genome
- percent_cvg: percent of reference genome with mapped read data
- mean_depth: mean depth-of-coverage
- mean_base_q: average quality of basecalls for read data mapped to the reference genome
- mean_map_q: mean mapping quality
- status: “PASS” if percent_cvg >80%, mean_base_q >30, and mean_map_q >30; “WARNING” if any of these quality thresholds are not met
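The status rule above can be re-applied to a quality metrics file, for example to re-check a run or to experiment with different thresholds. This is a minimal sketch, not Monroe's own code: it assumes the TSV header uses exactly the column names listed above and that percent_cvg is stored as a plain number without a percent sign. The function names and the example rows are illustrative only.

```python
import csv
import io
from typing import Dict, TextIO


def assembly_status(percent_cvg: float, mean_base_q: float, mean_map_q: float) -> str:
    """Apply the documented thresholds: all three must be strictly exceeded."""
    passed = percent_cvg > 80 and mean_base_q > 30 and mean_map_q > 30
    return "PASS" if passed else "WARNING"


def statuses_from_tsv(handle: TextIO) -> Dict[str, str]:
    """Map each sample to its status (column names are an assumption)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["sample"]: assembly_status(
            float(row["percent_cvg"]),
            float(row["mean_base_q"]),
            float(row["mean_map_q"]),
        )
        for row in reader
    }


if __name__ == "__main__":
    # Made-up rows for illustration; not real sequencing results.
    example = (
        "sample\tpercent_cvg\tmean_base_q\tmean_map_q\n"
        "sampleA\t99.5\t36.2\t59.8\n"
        "sampleB\t72.0\t35.0\t60.0\n"
    )
    print(statuses_from_tsv(io.StringIO(example)))
```

Because the comparisons are strict, a sample with exactly 80% coverage is reported as WARNING.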
Docker Images
The base Nextflow configuration profiles (Docker, Singularity) for Monroe pe_assembly incorporate the following StaPH-B Docker Images:
Process | Function | Docker Image | Comment |
---|---|---|---|
preProcess | Renames input read files for downstream processing | staphb/fastqc_container | Light-weight container for quick text processing |
trim | Quality trimming of input read data with Trimmomatic | staphb/trimmomatic:0.39 | trimmomatic parameters set to: minlength=75, windowsize=4, & qualitytrimscore=30 |
cleanreads | Adapter and PhiX removal from input read data | staphb/bbtools:38.76 | bbduk default parameters used |
ivar | Read mapping and consensus genome assembly | staphb/ivar:1.2.1-SC2 | ivar consensus parameters set to: minimum frequency (-t)=0, minimum depth (-m)=1 |
samtools | Gathering alignment quality metrics | staphb/samtools:1.10 | samtools coverage default parameters used |
assembly_results | Curating assembly quality metrics | staphb/tiptoft:1.0.0 | Light-weight container with python3 |
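To illustrate what the consensus parameters in the table mean in practice (minimum frequency 0, minimum depth 1), the sketch below calls the most common base at each position and writes N where depth falls below the minimum. It is a deliberate simplification and not iVar's implementation: it ignores base qualities, indels, and IUPAC ambiguity codes, and the input format (one string of observed bases per reference position) is an assumption made for the example.

```python
from collections import Counter
from typing import Sequence


def call_consensus(
    pileup_columns: Sequence[str],
    min_depth: int = 1,
    min_freq: float = 0.0,
) -> str:
    """Call one consensus base per pileup column (simplified illustration)."""
    consensus = []
    for column in pileup_columns:
        counts = Counter(base.upper() for base in column)
        depth = sum(counts.values())
        # No call when there is no coverage or depth is below the minimum
        if depth == 0 or depth < min_depth:
            consensus.append("N")
            continue
        base, count = counts.most_common(1)[0]
        # With min_freq = 0 the most common base is always accepted
        consensus.append(base if count / depth >= min_freq else "N")
    return "".join(consensus)


if __name__ == "__main__":
    columns = ["AAAC", "GG", "", "T"]  # toy pileup, one string per position
    print(call_consensus(columns))               # permissive: depth >= 1
    print(call_consensus(columns, min_depth=3))  # stricter depth requirement
```

With a minimum depth of 1, a single read is enough to call a base, so low-coverage regions should be judged alongside the mean_depth and percent_cvg metrics.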
Default docker images and parameters listed above can be adjusted by:
- Copying the template pe_assembly config file ($ staphb-wf monroe --get_config pe_assembly)
- Using a text editor to change the <date>_pe_assembly.config file
- Specifying your custom config file (i.e. the edited <date>_pe_assembly.config file) when running the pipeline:
$ staphb-wf monroe pe_assembly <input_dir> -o <output_dir> --primers <ARTIC_primer_version> -c <custom_config_file> [options]
Oxford Nanopore Technologies (ONT) Read Assembly:
Monroe ont_assembly can accept ONT Fast5 or FastQ read data. If Fast5 files are provided, high accuracy basecalling will be performed using a GPU-optimized environment. ONT FastQ files, in accordance with the ARTIC bioinformatics protocols, undergo demultiplexing and read filtering prior to genome assembly with either NanoPolish or Medaka.
Quick Start
$ staphb-wf monroe ont_assembly <input_dir> <sequencing_summary> -o <output_dir> --primers <ARTIC_primer_version>
- <input_dir> is a positional argument that designates the path to an input directory containing Oxford Nanopore Technologies read data; input data can be in either Fast5 or FastQ format. If Fast5, the --ont_basecalling option must also be invoked.
- <sequencing_summary> is a positional argument that designates the path to the sequencing summary.
- -o specifies the directory to which Monroe will write all results; if an <output_dir> is not provided, results will be written to a monroe_run_<date> directory.
- --primers specifies the ARTIC primer set version used to generate the read data; it must be designated as V1, V2, or V3.
Other Options
- --run_prefix: Desired run prefix. Default = artic_ncov19
- --ont_basecalling: Perform high accuracy basecalling using GPU (only use if you have set up a GPU-compatible device); must be invoked if input data are in Fast5 format.
- --profile: Nextflow profile, either Docker or Singularity. By default, Monroe tries Docker first, then Singularity if the docker executable cannot be found.
- --config, -c: Path to a custom Nextflow configuration file
- --resume: Resume a previous run
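Because --ont_basecalling must accompany Fast5 input, a launcher script can check the input directory and add the flag automatically. The sketch below is hypothetical and not part of Monroe: it assumes Fast5 files carry a lowercase .fast5 extension and may sit in subdirectories, and the helper name build_ont_assembly_cmd is invented for illustration.

```python
from pathlib import Path
from typing import List, Optional


def build_ont_assembly_cmd(
    input_dir: str,
    sequencing_summary: str,
    primers: str,
    output_dir: Optional[str] = None,
    run_prefix: Optional[str] = None,
) -> List[str]:
    """Return an argv list for `staphb-wf monroe ont_assembly` (hypothetical helper)."""
    if primers not in {"V1", "V2", "V3"}:
        raise ValueError(f"--primers must be V1, V2, or V3, got {primers!r}")
    cmd = [
        "staphb-wf", "monroe", "ont_assembly",
        input_dir, sequencing_summary,
        "--primers", primers,
    ]
    if output_dir is not None:
        cmd += ["-o", output_dir]
    if run_prefix is not None:
        cmd += ["--run_prefix", run_prefix]
    # Assumption: any *.fast5 file under input_dir means the input is Fast5,
    # in which case the README requires --ont_basecalling (and a GPU setup).
    if any(Path(input_dir).rglob("*.fast5")):
        cmd.append("--ont_basecalling")
    return cmd


if __name__ == "__main__":
    # Placeholder paths; the flag is added only if Fast5 files are found.
    print(" ".join(build_ont_assembly_cmd("run_data/", "sequencing_summary.txt", "V3")))
```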
Output:
Monroe ont_assembly will organize all output into five subdirectories under the specified <output_dir>:
- <output_dir>/demultiplexing: Demultiplexed read data
- <output_dir>/guppylex: Read data filtered by length
- <output_dir>/pipeline_nanopolish: Nanopolish consensus assembly, if the nanopolish pipeline is specified
- <output_dir>/pipeline_medaka: Medaka consensus assembly, if the medaka pipeline is specified
- <output_dir>/logs: Nextflow execution report (Monroe_execution_report.html), trace file (Monroe_trace.txt), and task work directories
Docker Images
The base Nextflow configuration profiles (Docker, Singularity) for Monroe ont_assembly incorporate the following Docker Images:
Process | Function | Docker Image | Comments |
---|---|---|---|
guppy_basecalling | Performing high accuracy basecalling using GPU | genomicpariscentre/guppy-gpu | Used if --ont_basecalling is invoked; a GPU-compatible device must be set up |
guppy_demultiplexing | Demultiplexing samples by barcodes present | genomicpariscentre/guppy | guppy_barcoder default parameters used |
artic_guppyplex | Filtering read data by read-length to remove chimeric reads | genomicpariscentre/guppy | artic guppyplex parameters set to: --min-length=400, --max-length=700 |
artic_nanopolish_pipeline | Performing genome assembly with NanoPolish | staphb/artic-ncov2019-nanopolish | artic minion parameters set to: --normalize=200 |
artic_medaka_pipeline | Performing genome assembly with Medaka | staphb/artic-ncov2019-nanopolish | artic medaka parameters set to: --normalize=200 |
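The length filter in the artic_guppyplex row keeps reads close to the expected amplicon size and discards likely chimeras. The sketch below shows the idea on in-memory reads; it is not the artic guppyplex code. Treating both bounds as inclusive is an assumption, and the (read_id, sequence) tuple format is chosen only for the example.

```python
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str]  # (read_id, sequence)


def filter_by_length(
    reads: Iterable[Read],
    min_length: int = 400,
    max_length: int = 700,
) -> Iterator[Read]:
    """Yield reads whose length lies within [min_length, max_length]."""
    for read_id, sequence in reads:
        if min_length <= len(sequence) <= max_length:
            yield read_id, sequence


if __name__ == "__main__":
    # Synthetic reads for illustration only.
    reads = [
        ("too_short", "A" * 150),
        ("amplicon", "A" * 520),
        ("chimera", "A" * 1040),
    ]
    print([read_id for read_id, _ in filter_by_length(reads)])
```

A read formed by two amplicons joined together is roughly twice the expected length, which is why an upper bound removes most chimeric reads.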
Default docker images and parameters listed above can be adjusted by:
- Copying the template ont_assembly config file ($ staphb-wf monroe --get_config ont_assembly)
- Using a text editor to change the <date>_ont_assembly.config file
- Specifying your custom config file (i.e. the edited <date>_ont_assembly.config file) when running the pipeline:
$ staphb-wf monroe ont_assembly <input_dir> <sequencing_summary> -o <output_dir> --primers <ARTIC_primer_version> -c <custom_config_file> [options]
Cluster Analysis:
Monroe cluster_analysis uses Mafft to perform multiple-sequence alignment of all SARS-CoV-2 genomes provided. The resulting alignment fasta file is used to generate a pairwise SNP-distance matrix with snp-dists and a maximum-likelihood phylogenetic tree with IQ-Tree.
Output from snp-dists and IQ-Tree is curated into a single pdf report using the StaPH-B cluster-report-env.
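To make the pairwise SNP-distance step concrete, the sketch below counts differing positions between every pair of sequences in an alignment. It illustrates the concept and is not the snp-dists implementation: it counts a site only when both sequences carry an unambiguous A, C, G, or T, which is an assumption about how gaps and ambiguous bases should be handled, and the toy alignment is invented for the example.

```python
from itertools import combinations
from typing import Dict, Tuple

UNAMBIGUOUS = set("ACGT")


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count sites where two aligned sequences differ at unambiguous bases."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment (equal length)")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a in UNAMBIGUOUS and b in UNAMBIGUOUS
    )


def pairwise_snp_matrix(alignment: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Return distances for every unordered pair of sample names."""
    return {
        (x, y): snp_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


if __name__ == "__main__":
    # Toy alignment; real input would be the Mafft alignment fasta.
    toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACNTAC-A"}
    for pair, dist in pairwise_snp_matrix(toy).items():
        print(pair, dist)
```

Ignoring N and gap positions keeps low-coverage regions of an assembly from inflating the distance between otherwise identical genomes.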
Quick Start
$ staphb-wf monroe cluster_analysis <input_dir> -o <output_dir>
- <input_dir> is a positional argument that designates the path to an input directory containing the SARS-CoV-2 assembly files (fasta).
- -o specifies the directory to which Monroe will write all results; if an <output_dir> is not provided, results will be written to a monroe_run_<date> directory.
Other Options
- --profile: Nextflow profile, either Docker or Singularity. By default, Monroe tries Docker first, then Singularity if the docker executable cannot be found.
- --config, -c: Path to a custom Nextflow configuration file
- --resume: Resume a previous run
- --get_rtemplate: Create a report template file for pipeline customization; <date>_cluster_analysis_report.Rmd will be created in the user’s current working directory
- --report, -r: Path to a custom report template file
To customize the final report format:
- Copy the report template file ($ staphb-wf monroe cluster_analysis --get_rtemplate)
- Use a text editor to change the <date>_cluster_analysis_report.Rmd file
- Specify your custom report template file (i.e. the edited <date>_cluster_analysis_report.Rmd file) when running the pipeline:
$ staphb-wf monroe cluster_analysis <input_dir> -o <output_dir> -r <custom_report_template_file> [options]
Output:
Monroe cluster_analysis will write the final pdf report to the specified <output_dir>. All other output will be organized into three subdirectories:
- <output_dir>/images: PNG files of the maximum-likelihood tree and color-coded SNP-distance matrix
- <output_dir>/msa: Mafft alignment file (fasta), IQ-Tree newick file, and snp-dists pairwise-distance matrix
- <output_dir>/logs: Nextflow execution report (Monroe_execution_report.html), trace file (Monroe_trace.txt), and task work directories
Docker Images
The base Nextflow configuration profiles (Docker, Singularity) for Monroe cluster_analysis incorporate the following StaPH-B Docker Images:
Process | Function | Docker Image | Comments |
---|---|---|---|
msa | Performing multi-sequence alignment with Mafft | staphb/mafft:7.450 | mafft default parameters used |
snp_matrix | Generating pairwise snp-distance matrix from Mafft msa | staphb/snp-dists:0.6.2 | snp-dists default parameters used |
iqtree | Generating maximum-likelihood phylogenetic tree from Mafft msa | staphb/iqtree:1.6.7 | iqtree set to: substitution model (-m)=GTR+G4, bootstrap replicates (-bb)=1000 |
render | Curating all output into a single pdf report | staphb/cluster-report-env:1.0 | |
Default docker images and parameters listed above can be adjusted by:
- Copying the template cluster_analysis config file ($ staphb-wf monroe --get_config cluster_analysis)
- Using a text editor to change the <date>_cluster_analysis.config file
- Specifying your custom config file (i.e. the edited <date>_cluster_analysis.config file) when running the pipeline:
$ staphb-wf monroe cluster_analysis <input_dir> -o <output_dir> -c <custom_config_file> [options]
Version History
Current version: v1.0.0 April 29, 2020
Version 1.0.0 is the first stable version of Monroe
Authors
Kevin G. Libuit, DCLS Bioinformatics Lead Scientist
Kelsey R Florek, WSLH Bioinformatics Scientist
Abigail Shockey, WSLH Bioinformatics Fellow