SARS-CoV-2 Pipeline

ARTICplus method

The ARTIC protocol uses a multiplexed PCR approach with two primer pools that tile the entire genome. The primer sequences are not trimmed but masked during variant calling.

The pipeline takes single-end or paired-end FASTQ files as input. The primer sequences for ARTIC protocol versions v1-v4 are already integrated in the pipeline, so the user only needs to indicate which primer version the pipeline should use. Optionally, the user can import a BED file with the primer scheme (see example in this link to BED file). The pipeline runs as indicated in the diagram shown below and produces metrics files, alignment files (BAM format), an alignment coverage diagram, and tables with variant calls.

User Options

Nephele QC Pipeline: We recommend running all sample files through Nephele's QC pipeline before running them in the SARS-CoV-2 pipeline. It is always a good idea to check the quality of your data before analysis.

Input FASTA/Q files:

ARTICplus method: The pipeline expects FASTQ files (single-end or paired-end) for each sample and a simple mapping file that maps each sample name to its FASTQ files (see example below).

Example:

ARTICplus method mapping file

#SampleID	ForwardFastqFile	ReverseFastqFile
N1	N1_L001_R1_001.fastq.gz	N1_L001_R2_001.fastq.gz
N2	N2_L001_R1_001.fastq.gz	N2_L001_R2_001.fastq.gz
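
A mapping file in this layout can be checked before submission. The following is a minimal sketch, not part of the pipeline: the function name `read_mapping` and the validation rules (no duplicate sample IDs, forward and reverse files must differ) are assumptions made for illustration.

```python
import csv
import io


def read_mapping(handle):
    """Parse a tab-separated mapping file into {sample_id: (forward, reverse_or_None)}.

    Hypothetical helper: the checks below are illustrative assumptions,
    not rules documented by the pipeline.
    """
    samples = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the '#SampleID' header line
        if len(row) < 2 or not row[1]:
            raise ValueError(f"{row[0]}: missing forward FASTQ file")
        sample_id, forward = row[0], row[1]
        # The third column is absent or empty for single-end samples.
        reverse = row[2] if len(row) > 2 and row[2] else None
        if sample_id in samples:
            raise ValueError(f"duplicate sample ID: {sample_id}")
        if reverse == forward:
            raise ValueError(f"{sample_id}: forward and reverse files are identical")
        samples[sample_id] = (forward, reverse)
    return samples


example = (
    "#SampleID\tForwardFastqFile\tReverseFastqFile\n"
    "N1\tN1_L001_R1_001.fastq.gz\tN1_L001_R2_001.fastq.gz\n"
    "N2\tN2_L001_R1_001.fastq.gz\tN2_L001_R2_001.fastq.gz\n"
)
mapping = read_mapping(io.StringIO(example))
```

For a file on disk, the same function can be called as `read_mapping(open("mapping.txt", newline=""))`, where `mapping.txt` is a placeholder name.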

Dependencies

TRIMMOMATIC 0.39
BWA 0.7.17
PICARD 2.23.8
GATK 4.1.9.0
SAMTOOLS 1.11
HTSLIB 1.11
BCFTOOLS 1.11
DEEPTOOLS 3.5.1
PILON 1.23
BEDTOOLS 2.30.0
PYSAM 0.19.1
PYPAIRIX 0.3.7
SNPEFF 5.1
IVAR 1.3.1 (Only in the ARTICplus method)

Pipeline Major Steps

ARTICplus method

Trim: Trims and removes reads based on the following settings:
- ILLUMINACLIP:adapter.fa:2:30:10:8:true Trims reads of adapters
- ILLUMINACLIP:primer_{A,B}.fa:2:30:10:8:true Trims reads of primers
- LEADING:20 removes leading bps below quality threshold of 20
- TRAILING:20 removes trailing bps below quality threshold of 20
- SLIDINGWINDOW:4:20 scans the read with a 4 bp window and trims it at the leftmost bp where the window's average quality falls below 20
- MINLEN:20 removes reads below 20bp in length
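
The settings above map onto a Trimmomatic command line. The sketch below only assembles the argument list and does not run anything. It rests on several assumptions: `primer_{A,B}.fa` is read as two files named `primer_A.fa` and `primer_B.fa`, the jar name and all output file names are placeholders, and the pipeline's actual invocation may differ.

```python
def trimmomatic_steps(adapter_fasta="adapter.fa",
                      primer_fastas=("primer_A.fa", "primer_B.fa")):
    """Return the trimming steps in the order listed in this document."""
    clip = "2:30:10:8:true"  # ILLUMINACLIP parameters as given above
    steps = [f"ILLUMINACLIP:{adapter_fasta}:{clip}"]
    steps += [f"ILLUMINACLIP:{primer}:{clip}" for primer in primer_fastas]
    steps += ["LEADING:20", "TRAILING:20", "SLIDINGWINDOW:4:20", "MINLEN:20"]
    return steps


def trimmomatic_command(forward, reverse=None, jar="trimmomatic-0.39.jar"):
    """Build an argv list for single-end (SE) or paired-end (PE) mode.

    Output file names are hypothetical placeholders.
    """
    if reverse is None:
        io_args = ["SE", forward, "trimmed.fastq.gz"]
    else:
        io_args = [
            "PE", forward, reverse,
            "R1.paired.fastq.gz", "R1.unpaired.fastq.gz",
            "R2.paired.fastq.gz", "R2.unpaired.fastq.gz",
        ]
    return ["java", "-jar", jar, *io_args, *trimmomatic_steps()]


paired_cmd = trimmomatic_command("N1_L001_R1_001.fastq.gz", "N1_L001_R2_001.fastq.gz")
single_cmd = trimmomatic_command("N1_L001_R1_001.fastq.gz")
```

The resulting list can be passed to `subprocess.run(paired_cmd, check=True)` on a machine where Java and the Trimmomatic jar are available.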
Align: Quality-trimmed single-end or paired-end reads are mapped to the reference genome Wuhan-Hu-1 (GenBank: NC_045512.2) using bwa mem
Primer Trim: Primer sequences in BAM alignment file are masked using iVar
Downsample BAM: The BAM alignment file is downsampled using the jvarkit biostar154220.jar downsample tool. Any region with coverage above 200X is downsampled to 200X
Call Variants: Variants are called using GATK HaplotypeCaller
Filter Variants: The raw variant call file is split into separate SNP and indel VCF files using GATK SelectVariants for filtering. The filtered files are then merged into a single filtered variants file using Picard MergeVcfs
- SNP filter thresholds: QD < 2.0, FS > 100.0, MQ < 40.0, SOR > 4.0, ReadPosRankSum < -8.0
- Indel filter thresholds: DP < 20.0, QD < 2.0, FS > 200.0, SOR > 10.0
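
To make the thresholds concrete, the sketch below applies them to a dictionary of INFO annotations for one variant. It assumes each threshold is a condition that marks a variant as failing, which is how GATK hard-filter expressions are conventionally written. The function name `failed_filters` and the choice to skip annotations missing from a record are illustrative assumptions, not the pipeline's code.

```python
import operator

# (annotation, comparison, threshold): a variant fails when the comparison is true.
SNP_FILTERS = [
    ("QD", operator.lt, 2.0),
    ("FS", operator.gt, 100.0),
    ("MQ", operator.lt, 40.0),
    ("SOR", operator.gt, 4.0),
    ("ReadPosRankSum", operator.lt, -8.0),
]
INDEL_FILTERS = [
    ("DP", operator.lt, 20.0),
    ("QD", operator.lt, 2.0),
    ("FS", operator.gt, 200.0),
    ("SOR", operator.gt, 10.0),
]


def failed_filters(info, is_snp):
    """Return the names of the annotations on which a variant fails.

    `info` maps annotation names to numeric values. Annotations absent
    from `info` are skipped (an assumption of this sketch).
    """
    rules = SNP_FILTERS if is_snp else INDEL_FILTERS
    return [
        name
        for name, compare, threshold in rules
        if name in info and compare(info[name], threshold)
    ]


snp_result = failed_filters({"QD": 1.5, "FS": 150.0, "MQ": 60.0, "SOR": 0.9}, is_snp=True)
indel_result = failed_filters({"DP": 15.0, "QD": 12.0, "FS": 150.0, "SOR": 0.9}, is_snp=False)
```

The same FS value of 150.0 fails the SNP rule but passes the indel rule, since the two variant types use different limits.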
Annotate Variant File: Variant file is annotated using snpEff
Consensus Generation: A raw consensus genome is first generated using GATK FastaAlternateReferenceMaker from the merged filtered variants VCF file. Reads are then mapped to the raw consensus sequence to generate a BAM file, which is used to mask regions of the consensus genome where coverage is less than 20X
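
The masking idea can be illustrated on a toy sequence. This is a minimal sketch under stated assumptions: per-base depths are already available as a list, the threshold is 20X, and masked positions are written as `N`. The function name `mask_low_coverage` is hypothetical, and the pipeline's own masking tool is not shown here.

```python
def mask_low_coverage(consensus, depths, min_depth=20, mask_char="N"):
    """Replace every base whose read depth is below min_depth with mask_char."""
    if len(consensus) != len(depths):
        raise ValueError("consensus and depths must have the same length")
    return "".join(
        base if depth >= min_depth else mask_char
        for base, depth in zip(consensus, depths)
    )


masked = mask_low_coverage("ACGTAC", [25, 20, 19, 0, 40, 5])
```

In this toy case the third, fourth, and sixth positions fall below 20X and are masked, while the position with exactly 20X is kept.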
QC Metrics: Produces a report of each sample's total reads, aligned reads, percent of reads aligned, average read length, percent of paired reads, number of SNPs, number of indels, mean coverage, and the percent of the genome covered at a depth of at least 50X
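
Two of the reported metrics, mean coverage and the percent of the genome covered at 50X or more, follow directly from per-base depths. The sketch below shows the arithmetic only. The function names and the list-of-depths input are assumptions for illustration, and the pipeline may compute these values with different tools.

```python
def coverage_metrics(depths, min_depth=50):
    """Return (mean coverage, percent of positions with depth >= min_depth)."""
    if not depths:
        raise ValueError("depths must not be empty")
    mean_coverage = sum(depths) / len(depths)
    covered = sum(1 for depth in depths if depth >= min_depth)
    return mean_coverage, 100.0 * covered / len(depths)


def percent_aligned(total_reads, aligned_reads):
    """Return aligned reads as a percentage of total reads (0.0 if no reads)."""
    if total_reads == 0:
        return 0.0
    return 100.0 * aligned_reads / total_reads


mean_cov, pct_50x = coverage_metrics([0, 50, 100, 50])
pct_aligned = percent_aligned(total_reads=200, aligned_reads=150)
```

With the four toy depths above, three of the four positions reach 50X, so 75 percent of the positions count as covered.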

Output Files/Directories