SARS-CoV-2 Pipeline

ARTICplus method

ARTIC protocol uses a multiplexed PCR approach with two primer pools tiling the entire genome. The primer sequences are not trimmed but masked during variant calling.

The pipeline takes as input single end or paired-end fastq files. The primer sequences for the v1-v4 ARTIC protocol are already integrated in the pipeline therefore the user simply needs to indicate which version of primers should the pipeline use. Optionally, the user can import a BED file with the primer scheme (see example in this link to BED file). The pipeline will run as indicated in the diagram shown below to produce metrics files, alignment files (BAM format), alignment coverage diagram and tables with variant calls.

SGS method

SARS-CoV-2 pipeline was written originally by the Elodie Ghedin's Lab using the Nextflow workflow manager. The pipeline uses as input raw data files and proceeds to trim, aligns reads and calls variants against a SARS-CoV-2 reference sequence. The pipeline was designed to work with data generated from adapted versions of the ARTIC consortium protocol. We specifically recommend the protocol developed by the Ghedin Lab, which is available for download here. Briefly, the main difference between the ARTIC protocol and the adapted version is that the ARTIC protocol is a multiplexed polymerase chain reaction (PCR) for targeted whole-genome amplification of SARS-CoV-2, where as the adapted version keeps the pools separate from library prep through sequencing.

Amplicons should be generated in two pools, A and B, from separate preparations from amplification through library generation and sequencing

The pipeline takes single or paired-end fastq files and trims both adapters and amplification primers and aligns them to the provided reference file, then merges the A and B files to created merged BAM files. The pipeline additionally calls variants using haplotypeCaller from GATK and creates several consensus fasta files based on read depth thresholds of 6, 10 and 20. We use the Wuhan-Hu-1 sequence as the reference sequence for SARS-CoV-2 data (GenBank: MN908947.3)

User Options

Nephele QC Pipeline: We recommend running all sample files through Nephele's QC pipeline before running samples in the SARS-CoV-2 pipeline. It is always a good idea to view the quality of your data before analysis

Input FASTA/Q files:

ARTICplus method: The pipeline expects fastq files (single or paired) per samples and a simple mapping file to map the sample name with the fastq files (see example below).
SGS method: The pipeline will work on any number of samples, provided that each sample has 2 (single end) or 4 (paired end) input fastq files (read files from both A and B amplicon preparations). Two primer fasta files are required as inputs for samples. A metadata.xlsx file is needed to map the sample files to their respective primer files (arbitrarily labelled A and B for this pipeline)

Example:

ARTICplus method mapping file

#SampleID	ForwardFastqFile	ReverseFastqFile
N1	N1_L001_R1_001.fastq.gz	N1_L001_R2_001.fastq.gz
N2	N2_L001_R1_001.fastq.gz	N2_L001_R1_001.fastq.gz

SGS method mapping file

#SampleID	ForwardFastqFile_A	ReverseFastqFile_A	ForwardFastqFile_B	ReverseFastqFile_B	PrimerFile_A	PrimerFile_B
N1	N1-A_S7_L001_R1_001.fastq.gz	N1-A_S7_L001_R2_001.fastq.gz	N1-B_S8_L001_R1_001.fastq.gz	N1-B_S8_L001_R2_001.fastq.gz	primer_A.fa	primer_B.fa
N2	N2-A_S1_L001_R1_001.fastq.gz	N2-A_S1_L001_R1_001.fastq.gz	N2-A_S1_L001_R1_001.fastq.gz	N2-A_S1_L001_R1_001.fastq.gz	primer_A.fa	primer_B.fa

Dependencies

TRIMMOMATIC 0.39
BWA 0.7.17
PICARD 2.23.8
GATK 4.1.9.0
SAMTOOLS 1.11
HTSLIB 1.11
BCFTOOLS 1.11
DEEPTOOLS 3.5.1
PILON 1.23
BEDTOOLS 2.30.0
PYSAM 0.19.1
PYPAIRIX 0.3.7
SNPEFF 5.1
IVAR 1.3.1 (Only in the ARTICplus method)

Pipeline Major Steps

ARTICplus method

Trim: Trims and removes reads based on the following settings:
- ILLUMINACLIP:adapter.fa:2:30:10:8:true Trims reads of adapters
- ILLUMINACLIP:primer_{A,B}.fa:2:30:10:8:true Trims reads of primers
- LEADING:20 removes leading bps below quality threshold of 20
- TRAILING:20 removes trailing bps below quality threshold of 20
- SLIDINGWINDOW:4:20 trims read at the left most bp when the average quality of 4 bps falls below 20
- MINLEN:20 removes reads below 20bp in length
Align: Quality trimmed single or paired-end reads are mapped to reference genome Wuhan-Hu-1 (Genbank: NC_045512.2) using bwa mem
Primer Trim: Primer sequences in BAM alignment file are masked using iVar
Downsample Bam: BAM alignment file is downsampled using the jvarkit biostar154220.jar downsample tool. A region’s coverage is downsampled to 200X coverage if that regions coverage is above 200X
Call Variants: Variants are called using GATK HaplotypeCaller
Filter Variants: Raw variant call file is split in to an individual SNP and Indel VCF file using GATK SelectVariants for filtering. The filtered individual files are then merged using Picard tools MergeVcfs to a single filtered variants file
- SNP filter thresholds: QD < 2.0, FS > 100.0, MQ < 40.0, SOR > 4.0, ReadPosRankSum < -8.0
- Indel filter thresholds: DP < 20.0, QD < 2.0, FS > 200.0, SOR > 10.0
Annotate Variant File: Variant file is annotated using snpEff
Consensus Generation: A raw consensus genome is first generated using GATK FastaAlternateReferenceMaker from the merged filtered variants VCF file. Reads are then mapped to the raw consensus sequence to generate a BAM file used for masking regions of the consensus genome where coverage is less than 20X or 20X
QC Metrics: Produces a report of each sample's total reads, aligned reads, percent reads aligned, average read length, percent of paired reads, number of snps, number of indels, mean coverage, and the percent of the genome that is covered at at least 50X

SGS method

Trim Reads: Trims and removes reads based on the following settings:
- ILLUMINACLIP:adapter.fa:2:30:10:8:true Trims reads of adapters
- ILLUMINACLIP:primer_{A,B}.fa:2:30:10:8:true Trims reads of primers
- LEADING:20 removes leading bps below quality threshold of 20
- TRAILING:20 removes trailing bps below quality threshold of 20
- SLIDINGWINDOW:4:20 trims read at the left most bp when the average quality of 4 bps falls below 20
- MINLEN:20 removes reads below 20bp in length
Align Reads: Aligns forward and reverse reads of A or B to reference genome Wuhan-Hu-1 (Genbank: MN908947.3) using bwa mem
Sort SAM Files: Sorts the aligned SAM files by coordinates and converts to BAM using PICARDs SortSam
Merge Primer A and B BAM Files: Merges aligned reads from A and B BAM files into a single BAM file and replaces read groups to match sample ID using PICARDs MergeSamFile and AddOrReplaceReadGroups
Remove Duplicates: Marks and removes duplicate reads using GATKs MarkDuplicate. Produces a metrics file and deduplicated BAM file with an index
Call Variants: Produces a vcf file that contains raw snps and indel information from GATKs HaplotypeCaller. Optionally produces a BAM file of variants. Sepatates snps and indels into their own vcf file with GATKs SelectVariants
Filter Variants: Produces a vcf file by filtering variants. Snps are filtered out if the have a QD < 2.0, FS > 60.0, MQ < 40.0, SOR > 4.0, and a ReadPosRankSum < -8.0. Indels are filtered out if they have a DP < 20.0, QD <2.0, FS > 200.0, and SOR > 10.0
Consensus Generation: Produces a consensus sampleID.fasta file containing snps with GATKs IndexFeatureFile and FastaAlternateReferenceMaker. Additional consensus fasta files are generated which mask regions below 6, 10, and 20 read depth *important: please be aware that this consensus is only reporting snps and not indels.
Annotation: An annotation vcf file is generated using snpEFF
QC metrics: Produces a report of each samples total reads, aligned reads, percent reads aligned, average read length, percent of paired reads, percent of duplicated reads, insert size, number of snps, number of snps filtered, and average coverage

Output Files/Directories