SARS-CoV-2 Pipeline

ARTICplus method

ARTIC protocol uses a multiplexed PCR approach with two primer pools tiling the entire genome. The primer sequences are not trimmed but masked during variant calling.

SARS-CoV-2 ARTIC Primer Pools

The pipeline takes as input single end or paired-end fastq files. The primer sequences for the v1-v4 ARTIC protocol are already integrated in the pipeline therefore the user simply needs to indicate which version of primers should the pipeline use. Optionally, the user can import a BED file with the primer scheme (see example in this link to BED file). The pipeline will run as indicated in the diagram shown below to produce metrics files, alignment files (BAM format), alignment coverage diagram and tables with variant calls.

SARS-CoV-2 ARTIC Flow
SGS method

SARS-CoV-2 pipeline was written originally by the Elodie Ghedin's Lab using the Nextflow workflow manager. The pipeline uses as input raw data files and proceeds to trim, aligns reads and calls variants against a SARS-CoV-2 reference sequence. The pipeline was designed to work with data generated from adapted versions of the ARTIC consortium protocol. We specifically recommend the protocol developed by the Ghedin Lab, which is available for download here. Briefly, the main difference between the ARTIC protocol and the adapted version is that the ARTIC protocol is a multiplexed polymerase chain reaction (PCR) for targeted whole-genome amplification of SARS-CoV-2, where as the adapted version keeps the pools separate from library prep through sequencing.

Amplicons should be generated in two pools, A and B, from separate preparations from amplification through library generation and sequencing

SARS-CoV-2 Primer Pools

The pipeline takes single or paired-end fastq files and trims both adapters and amplification primers and aligns them to the provided reference file, then merges the A and B files to created merged BAM files. The pipeline additionally calls variants using haplotypeCaller from GATK and creates several consensus fasta files based on read depth thresholds of 6, 10 and 20. We use the Wuhan-Hu-1 sequence as the reference sequence for SARS-CoV-2 data (GenBank: MN908947.3)

SARS-CoV-2 Flow

User Options

  • Nephele QC Pipeline: We recommend running all sample files through Nephele's QC pipeline before running samples in the SARS-CoV-2 pipeline. It is always a good idea to view the quality of your data before analysis
  • Input FASTA/Q files:
    • ARTICplus method: The pipeline expects fastq files (single or paired) per samples and a simple mapping file to map the sample name with the fastq files (see example below).
    • SGS method: The pipeline will work on any number of samples, provided that each sample has 2 (single end) or 4 (paired end) input fastq files (read files from both A and B amplicon preparations). Two primer fasta files are required as inputs for samples. A metadata.xlsx file is needed to map the sample files to their respective primer files (arbitrarily labelled A and B for this pipeline)

    Example:

    ARTICplus method mapping file

    #SampleID ForwardFastqFile ReverseFastqFile
    N1 N1_L001_R1_001.fastq.gz N1_L001_R2_001.fastq.gz
    N2 N2_L001_R1_001.fastq.gz N2_L001_R1_001.fastq.gz

    SGS method mapping file

    #SampleID ForwardFastqFile_A ReverseFastqFile_A ForwardFastqFile_B ReverseFastqFile_B PrimerFile_A PrimerFile_B
    N1 N1-A_S7_L001_R1_001.fastq.gz N1-A_S7_L001_R2_001.fastq.gz N1-B_S8_L001_R1_001.fastq.gz N1-B_S8_L001_R2_001.fastq.gz primer_A.fa primer_B.fa
    N2 N2-A_S1_L001_R1_001.fastq.gz N2-A_S1_L001_R1_001.fastq.gz N2-A_S1_L001_R1_001.fastq.gz N2-A_S1_L001_R1_001.fastq.gz primer_A.fa primer_B.fa

Dependencies

  • TRIMMOMATIC 0.39
  • BWA 0.7.17
  • PICARD 2.23.8
  • GATK 4.1.9.0
  • SAMTOOLS 1.11
  • HTSLIB 1.11
  • BCFTOOLS 1.11
  • DEEPTOOLS 3.5.1
  • PILON 1.23
  • BEDTOOLS 2.30.0
  • PYSAM 0.19.1
  • PYPAIRIX 0.3.7
  • SNPEFF 5.1
  • IVAR 1.3.1 (Only in the ARTICplus method)

Pipeline Major Steps

ARTICplus method

  • Trim: Trims and removes reads based on the following settings:
    • ILLUMINACLIP:adapter.fa:2:30:10:8:true Trims reads of adapters
    • ILLUMINACLIP:primer_{A,B}.fa:2:30:10:8:true Trims reads of primers
    • LEADING:20 removes leading bps below quality threshold of 20
    • TRAILING:20 removes trailing bps below quality threshold of 20
    • SLIDINGWINDOW:4:20 trims read at the left most bp when the average quality of 4 bps falls below 20
    • MINLEN:20 removes reads below 20bp in length
  • Align: Quality trimmed single or paired-end reads are mapped to reference genome Wuhan-Hu-1 (Genbank: NC_045512.2) using bwa mem
  • Primer Trim: Primer sequences in BAM alignment file are masked using iVar
  • Downsample Bam: BAM alignment file is downsampled using the jvarkit biostar154220.jar downsample tool. A region’s coverage is downsampled to 200X coverage if that regions coverage is above 200X
  • Call Variants: Variants are called using GATK HaplotypeCaller
  • Filter Variants: Raw variant call file is split in to an individual SNP and Indel VCF file using GATK SelectVariants for filtering. The filtered individual files are then merged using Picard tools MergeVcfs to a single filtered variants file
    • SNP filter thresholds: QD < 2.0, FS > 100.0, MQ < 40.0, SOR > 4.0, ReadPosRankSum < -8.0
    • Indel filter thresholds: DP < 20.0, QD < 2.0, FS > 200.0, SOR > 10.0
  • Annotate Variant File: Variant file is annotated using snpEff
  • Consensus Generation: A raw consensus genome is first generated using GATK FastaAlternateReferenceMaker from the merged filtered variants VCF file. Reads are then mapped to the raw consensus sequence to generate a BAM file used for masking regions of the consensus genome where coverage is less than 20X or 20X
  • QC Metrics: Produces a report of each sample's total reads, aligned reads, percent reads aligned, average read length, percent of paired reads, number of snps, number of indels, mean coverage, and the percent of the genome that is covered at at least 50X

SGS method

  • Trim Reads: Trims and removes reads based on the following settings:
    • ILLUMINACLIP:adapter.fa:2:30:10:8:true Trims reads of adapters
    • ILLUMINACLIP:primer_{A,B}.fa:2:30:10:8:true Trims reads of primers
    • LEADING:20 removes leading bps below quality threshold of 20
    • TRAILING:20 removes trailing bps below quality threshold of 20
    • SLIDINGWINDOW:4:20 trims read at the left most bp when the average quality of 4 bps falls below 20
    • MINLEN:20 removes reads below 20bp in length
  • Align Reads: Aligns forward and reverse reads of A or B to reference genome Wuhan-Hu-1 (Genbank: MN908947.3) using bwa mem
  • Sort SAM Files: Sorts the aligned SAM files by coordinates and converts to BAM using PICARDs SortSam
  • Merge Primer A and B BAM Files: Merges aligned reads from A and B BAM files into a single BAM file and replaces read groups to match sample ID using PICARDs MergeSamFile and AddOrReplaceReadGroups
  • Remove Duplicates: Marks and removes duplicate reads using GATKs MarkDuplicate. Produces a metrics file and deduplicated BAM file with an index
  • Call Variants: Produces a vcf file that contains raw snps and indel information from GATKs HaplotypeCaller. Optionally produces a BAM file of variants. Sepatates snps and indels into their own vcf file with GATKs SelectVariants
  • Filter Variants: Produces a vcf file by filtering variants. Snps are filtered out if the have a QD < 2.0, FS > 60.0, MQ < 40.0, SOR > 4.0, and a ReadPosRankSum < -8.0. Indels are filtered out if they have a DP < 20.0, QD <2.0, FS > 200.0, and SOR > 10.0
  • Consensus Generation: Produces a consensus sampleID.fasta file containing snps with GATKs IndexFeatureFile and FastaAlternateReferenceMaker. Additional consensus fasta files are generated which mask regions below 6, 10, and 20 read depth *important: please be aware that this consensus is only reporting snps and not indels.
  • Annotation: An annotation vcf file is generated using snpEFF
  • QC metrics: Produces a report of each samples total reads, aligned reads, percent reads aligned, average read length, percent of paired reads, percent of duplicated reads, insert size, number of snps, number of snps filtered, and average coverage

Output Files/Directories

ARTICplus method

  • consensus
    • sampleID_below_10_masked.fasta: consensus sequence with regions below a depth of 10 reads masked
    • sampleID_below_20_masked.fasta: consensus sequence with regions below a depth of 20 reads masked
  • metrics
    • sampleID_alignment_metrics.txt: Full report of alignment summary metrics. To learn the meaning of all columns, see http://broadinstitute.github.io/picard/picard-metric-definitions.html#AlignmentSummaryMetrics
    • sampleID_coverage_plot.pdf: image describing coverage of genome by reads. The coverage gets downsampled to 200x in the lower pane
    • sampleID_depth_out.txt: histogram data of depth along the genome coordinates
    • sampleID_ TX0018_insert_metrics.txt: report of the insert size identified using paired data
  • reports
    • nephele2_covid19_report.csv: report of each sample’s total reads, aligned reads, % aligned, average read length, % paired, mean coverage, % genome covered at 50x, # of snps, # indels
  • reports_ind
    • reports for each individual sample
  • variant_files
    • sampleID.filt.vars.vcf.gz: filtered variant file
    • sampleID.filt.vars.ann.vcf.gz: snpEff annotated variant file
    • sampleID_snpEff_summary.genes.txt: snpEff generated summary file of variant effects per gene
    • sampleID_snpEff_summary.html: snpEff generated output summary that can be viewed in a browser. For a detailed description of the two snpEff summary files see http://pcingola.github.io/SnpEff/se_outputsummary/
  • bam_files (optional)

SGS method

  • consensus
    • sampleID_below_1_masked.fasta: consensus sequence with regions below a depth of 1 reads masked
    • sampleID_below_6_masked.fasta: consensus sequence with regions below a depth of 6 reads masked
    • sampleID_below_10_masked.fasta: consensus sequence with regions below a depth of 10 reads masked
    • sampleID_below_20_masked.fasta: consensus sequence with regions below a depth of 20 reads masked
  • filtered_indels
    • sampleID_filtered_indels.vcf: VCF files of filtered GATK: haplotypeCaller outputs specific to indels
  • filtered_snps
    • sampleID_filtered_snps.vcf: VCF files of filtered GATK: haplotypeCaller outputs specific to snps
  • reports
    • report.csv: report of each samples total reads, aligned reads, % aligned, average read length, % paired, % duplicated, main insert size, # of snps, # of snps filtered, and average coverage
  • bam_files (optional)
    • sampleID_merged.bam: BAM file of A and B aligned outputs merged together.
    • sampleID_merged.bam.bai: Index file of sampleID_merged.bam.

Tools and References

  1. Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30(15), 2114-2120. doi:10.1093/bioinformatics/btu170
  2. Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. bioinformatics, 25(14), 1754-1760. doi:10.1093/bioinformatics/btp324
  3. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., ... & Durbin, R. (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 25(16), 2078-2079. doi:10.1093/bioinformatics/btp352
  4. “Picard Toolkit.” 2019. Broad Institute, GitHub Repository. http://broadinstitute.github.io/picard/; Broad Institute
  5. Poplin, R., Ruano-Rubio, V., DePristo, M. A., Fennell, T. J., Carneiro, M. O., Van der Auwera, G. A., ... & Banks, E. (2018). Scaling accurate genetic variant discovery to tens of thousands of samples. BioRxiv, 201178. doi:10.1101/201178
  6. Grubaugh, N.D., Gangavarapu, K., Quick, J. et al. An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar. Genome Biol 20, 8 (2019). https://doi.org/10.1186/s13059-018-1618-7

Pipeline design & author