SARS-CoV-2 Pipeline
ARTICplus method
ARTIC protocol uses a multiplexed PCR approach with two primer pools tiling the entire genome. The primer sequences
are not trimmed but masked during variant calling.
The pipeline takes as input single end or paired-end fastq files. The primer sequences for the v1-v4 ARTIC protocol
are already
integrated in the pipeline therefore the user simply needs to indicate which version of primers should the pipeline
use.
Optionally, the user can import a BED file with the primer scheme (see example in this
link to BED file).
The pipeline will run as indicated in the diagram shown below to produce metrics files, alignment files (BAM
format), alignment coverage diagram and tables with variant calls.
SGS method
SARS-CoV-2 pipeline was written originally by the
Elodie Ghedin's Lab using the
Nextflow workflow manager.
The pipeline uses as input raw data files and proceeds to trim, aligns reads and calls variants against a SARS-CoV-2
reference sequence.
The pipeline was designed to work with data generated from adapted versions of the
ARTIC consortium protocol.
We specifically recommend the protocol developed by the Ghedin Lab, which is available for download
here.
Briefly, the main difference between the ARTIC protocol and the adapted version is that the ARTIC protocol is a
multiplexed
polymerase chain reaction (PCR) for targeted whole-genome amplification of SARS-CoV-2, where as the adapted version
keeps the pools
separate from library prep through sequencing.
Amplicons should be generated in two pools, A and B, from separate preparations from amplification through library
generation and sequencing
The pipeline takes single or paired-end fastq files and trims both adapters and amplification primers and aligns
them to the provided
reference file, then merges the A and B files to created merged BAM files.
The pipeline additionally calls variants using haplotypeCaller from GATK and creates several consensus fasta files
based on read
depth thresholds of 6, 10 and 20. We use the Wuhan-Hu-1 sequence as the reference sequence for SARS-CoV-2 data
(GenBank: MN908947.3)
User Options
Dependencies
- TRIMMOMATIC 0.39
- BWA 0.7.17
- PICARD 2.23.8
- GATK 4.1.9.0
- SAMTOOLS 1.11
- HTSLIB 1.11
- BCFTOOLS 1.11
- DEEPTOOLS 3.5.1
- PILON 1.23
- BEDTOOLS 2.30.0
- PYSAM 0.19.1
- PYPAIRIX 0.3.7
- SNPEFF 5.1
- IVAR 1.3.1 (Only in the ARTICplus method)
Pipeline Major Steps
ARTICplus method
-
Trim: Trims and removes reads based on the following settings:
- ILLUMINACLIP:adapter.fa:2:30:10:8:true Trims reads of adapters
- ILLUMINACLIP:primer_{A,B}.fa:2:30:10:8:true Trims reads of primers
- LEADING:20 removes leading bps below quality threshold of 20
- TRAILING:20 removes trailing bps below quality threshold of 20
- SLIDINGWINDOW:4:20 trims read at the left most bp when the average quality of 4 bps falls below 20
- MINLEN:20 removes reads below 20bp in length
- Align: Quality trimmed single or paired-end reads are mapped to reference genome Wuhan-Hu-1 (Genbank:
NC_045512.2) using bwa mem
- Primer Trim: Primer sequences in BAM alignment file are masked using iVar
- Downsample Bam: BAM alignment file is downsampled using the jvarkit biostar154220.jar downsample tool. A
region’s coverage is downsampled to 200X coverage if that regions coverage is above 200X
- Call Variants: Variants are called using GATK HaplotypeCaller
-
Filter Variants: Raw variant call file is split in to an individual SNP and Indel VCF file using GATK
SelectVariants for filtering. The filtered individual files are then merged using Picard tools MergeVcfs to a
single filtered variants file
- SNP filter thresholds: QD < 2.0, FS > 100.0, MQ < 40.0, SOR > 4.0, ReadPosRankSum < -8.0
- Indel filter thresholds: DP < 20.0, QD < 2.0, FS > 200.0, SOR > 10.0
- Annotate Variant File: Variant file is annotated using snpEff
- Consensus Generation: A raw consensus genome is first generated using GATK FastaAlternateReferenceMaker
from the merged filtered variants VCF file. Reads are then mapped to the raw consensus sequence to generate a
BAM file used for masking regions of the consensus genome where coverage is less than 20X or 20X
- QC Metrics: Produces a report of each sample's total reads, aligned reads, percent reads aligned, average
read length, percent of paired reads, number of snps, number of indels, mean coverage, and the percent of the
genome that is covered at at least 50X
SGS method
-
Trim Reads: Trims and removes reads based on the following settings:
- ILLUMINACLIP:adapter.fa:2:30:10:8:true Trims reads of adapters
- ILLUMINACLIP:primer_{A,B}.fa:2:30:10:8:true Trims reads of primers
- LEADING:20 removes leading bps below quality threshold of 20
- TRAILING:20 removes trailing bps below quality threshold of 20
- SLIDINGWINDOW:4:20 trims read at the left most bp when the average quality of 4 bps falls below 20
- MINLEN:20 removes reads below 20bp in length
- Align Reads: Aligns forward and reverse reads of A or B to reference genome Wuhan-Hu-1 (Genbank:
MN908947.3) using bwa mem
- Sort SAM Files: Sorts the aligned SAM files by coordinates and converts to BAM using PICARDs SortSam
- Merge Primer A and B BAM Files: Merges aligned reads from A and B BAM files into a single BAM file and
replaces read groups to match sample ID using PICARDs MergeSamFile and AddOrReplaceReadGroups
- Remove Duplicates: Marks and removes duplicate reads using GATKs MarkDuplicate. Produces a metrics file
and deduplicated BAM file with an index
- Call Variants: Produces a vcf file that contains raw snps and indel information from GATKs
HaplotypeCaller. Optionally produces a BAM file of variants. Sepatates snps and indels into their own vcf file
with GATKs SelectVariants
- Filter Variants: Produces a vcf file by filtering variants. Snps are filtered out if the have a QD <
2.0, FS > 60.0, MQ < 40.0, SOR > 4.0, and a ReadPosRankSum < -8.0. Indels are filtered out if they
have a DP < 20.0, QD <2.0, FS > 200.0, and SOR > 10.0
- Consensus Generation: Produces a consensus sampleID.fasta file containing snps with GATKs
IndexFeatureFile and FastaAlternateReferenceMaker. Additional consensus fasta files are generated which mask
regions below 6, 10, and 20 read depth *important: please be aware that this consensus
is only reporting snps and not indels.
- Annotation: An annotation vcf file is generated using snpEFF
- QC metrics: Produces a report of each samples total reads, aligned reads, percent reads aligned, average
read length, percent of paired reads, percent of duplicated reads, insert size, number of snps, number of snps
filtered, and average coverage
Output Files/Directories
ARTICplus method
-
consensus
- sampleID_below_10_masked.fasta: consensus sequence with regions below a depth of 10 reads masked
- sampleID_below_20_masked.fasta: consensus sequence with regions below a depth of 20 reads masked
-
metrics
- sampleID_alignment_metrics.txt: Full report of alignment summary metrics. To learn the meaning of all
columns, see http://broadinstitute.github.io/picard/picard-metric-definitions.html#AlignmentSummaryMetrics
- sampleID_coverage_plot.pdf: image describing coverage of genome by reads. The coverage gets downsampled
to 200x in the lower pane
- sampleID_depth_out.txt: histogram data of depth along the genome coordinates
- sampleID_ TX0018_insert_metrics.txt: report of the insert size identified using paired data
-
reports
- nephele2_covid19_report.csv: report of each sample’s total reads, aligned reads, % aligned, average read
length, % paired, mean coverage, % genome covered at 50x, # of snps, # indels
- nephele2_covid19_snpeff_variant_report.csv: variant effect summary for all samples in run
-
reports_ind
- reports for each individual sample
- sampleID_variant_effect.csv: individual sample variant effect summary
-
variant_files
- sampleID.filt.vars.vcf.gz: filtered variant file
- sampleID.filt.vars.ann.vcf.gz: snpEff annotated variant file
- sampleID_snpEff_summary.genes.txt: snpEff generated summary file of variant effects per gene
- sampleID_snpEff_summary.html: snpEff generated output summary that can be viewed in a browser. For a
detailed description of the two snpEff summary files see http://pcingola.github.io/SnpEff/se_outputsummary/
-
bam_files (optional)
SGS method
-
consensus
- sampleID_below_1_masked.fasta: consensus sequence with regions below a depth of 1 reads masked
- sampleID_below_6_masked.fasta: consensus sequence with regions below a depth of 6 reads masked
- sampleID_below_10_masked.fasta: consensus sequence with regions below a depth of 10 reads masked
- sampleID_below_20_masked.fasta: consensus sequence with regions below a depth of 20 reads masked
-
filtered_indels
- sampleID_filtered_indels.vcf: VCF files of filtered GATK: haplotypeCaller outputs specific to indels
-
filtered_snps
- sampleID_filtered_snps.vcf: VCF files of filtered GATK: haplotypeCaller outputs specific to snps
-
reports
- report.csv: report of each samples total reads, aligned reads, % aligned, average read length, % paired,
% duplicated, main insert size, # of snps, # of snps filtered, and average coverage
-
bam_files (optional)
- sampleID_merged.bam: BAM file of A and B aligned outputs merged together.
- sampleID_merged.bam.bai: Index file of sampleID_merged.bam.
- Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data.
Bioinformatics, 30(15), 2114-2120. doi:10.1093/bioinformatics/btu170
- Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform.
bioinformatics, 25(14), 1754-1760. doi:10.1093/bioinformatics/btp324
- Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., ... & Durbin, R. (2009). The sequence
alignment/map format and SAMtools. Bioinformatics, 25(16), 2078-2079. doi:10.1093/bioinformatics/btp352
- “Picard Toolkit.” 2019. Broad Institute, GitHub Repository. http://broadinstitute.github.io/picard/; Broad Institute
- Poplin, R., Ruano-Rubio, V., DePristo, M. A., Fennell, T. J., Carneiro, M. O., Van der Auwera, G. A., ... &
Banks, E. (2018). Scaling accurate genetic variant discovery to tens of thousands of samples. BioRxiv, 201178.
doi:10.1101/201178
- Grubaugh, N.D., Gangavarapu, K., Quick, J. et al. An amplicon-based sequencing framework for accurately
measuring intrahost virus diversity using PrimalSeq and iVar. Genome Biol 20, 8 (2019).
https://doi.org/10.1186/s13059-018-1618-7
Pipeline design & author
-
ARTICplus method: Brendan Jeffrey and the Nephele team
-
SGS method: