WGSA2 Pipeline

Introduction

Many researchers lack the computational training or resources to process large metagenomic datasets to the level optimal for extracting biological data. To address this problem, our Nephele2 team at NIAID, developed and deployed a new user-friendly command line-free pipeline, Whole metaGenome Sequence Assembly pipeline, version 2 (WGSA2) designed to bridge the processing and analysis gap between short shotgun reads and metagenomic assemblies. Nephele’s WGSA2 pipeline can process not only metagenomic datasets from human or model animals, but also environmental biomes (e.g. marine), and other complex communities inclusive of prokaryotic, micro-eukaryotic and viral organisms. Overall, the WGSA2 pipeline allows the user to easily gain understanding of their metagenomic samples, without investment in time, computational requirements or effort.

The Nephele team is working hard to improve and extend this pipeline to provide users with a better experience and analyses. Please contact us with feedback and keep an eye on Nephele.

WGSA2 pipeline workflow & user options

Note: Click to the diagram to zoom in/out

Step 1: Trimming and error-correcting the raw reads (TEreads)

The WGSA2 pipeline begins by using fastp to perform basic or additional trimming and filtering the dataset, if the option is chosen (in the figure above, Step labeled "T"; fastp), correcting sequencing errors (Step "E"; fastp)

  • User option: Run additional trim & filter
    We recommend first running our QC pipeline on your dataset to check sequence quality and do basic adapter and quality trimming and filtering. However, users can also specify additional trimming and filtering be performed in the WGSA2 pipeline itself. Choice of quality trim threshold for the 5' and 3' end (argument for fastp --cut_front/back_mean_quality), as well as average read quality and minimum read length filtering can be specified.

Step 2: Decontaminating trimmed reads (TEDreads)

The pipeline then uses Kraken 2 to decontaminate (removing host/background sequences) the TEreads against the decontamination DB of choice (Step "D”) producing a cleaned set of FASTQ files (called "TEDreads"). Reports of the contamination levels are provided per sample.

Step 3: TAX profiling of TEDreads

The pipeline uses Kraken 2 to produce taxonomic profiles from the TEDreads. This TEDreads-based taxonomic profiling method was chosen as default for WGSA2 because in our testing, it produced more accurate and taxonomically detailed classification of metagenomic datasets (while also being the quickest, most computationally conservative strategy). The taxonomic profiles are visualized per-sample as a Krona html report and eventually summarized (Step 9).

  • Database: The Kraken 2 database used in the pipeline was custom-built on Nov 8, 2021 from the following Kraken 2 downloadable libraries: bacteria, archaea, protozoa, fungi and viral, and was manually supplemented with SARS-CoV2 genome (TAXID: 2697049). The resulting index is taxonomically diverse containing the genomes of over 15K microbial organisms to ensure optimal classification of phylogenetically complex microbial communities. A detailed database content file can be accessed here.

  • User options: other taxonomic profiling strategies
    The user may also choose additional taxonomic profiling be performed on their dataset, further down the pipeline. See user options for gene-based taxonomic profiling and MAG-based taxonomic profiling detailed below. All taxonomic profiling strategies have strengths and limitations, but have proven sufficiently accurate and are generally in agreement. Performing multiple TAX profiling is not required for reliable taxonomic representation of samples in a dataset.

Step 4: de novo assembly

In this step, the pipeline performs de novo assembly of the TEDreads independently for each sample using the metaSPAdes assembler.

  • User option: Produce taxonomic annotation on scaffolds
    Users can optionally assign taxonomic classifications to the produced scaffolds using the same procedure as in Step 3. Due to the variable length of the scaffolds and some caveats in calculating scaffold depths and coverage statistics, scaffold-based taxonomic profiles computed this way have shown to be less accurate for abundance estimation than other profiling strategies, so they are not provided by default.

Step 5: Read mapping & abundance scoring

Bowtie 2 and SAMtools are used to map each sample's TEDreads to the corresponding assembly and de-replicate the read alignments to only the most appropriate pair match in the entire assembly. This provides the scaffold depth and coverage statistics which are used to score organism abundances in the downstream analyses.

Step 6: Gene prediction

WGSA2 utilizes Prodigal for performing genetic feature predictions, and eggNOG-mapper2 against EggNOG-v5 database for annotation of the predicted features (pregenes). Prodigal was chosen for its effectiveness in predicting genes of not only prokaryotic, but also unicellular eukaryotic organisms with primarily continuous gene structure (most unicellular eukaryotes). Within the context of phylogenetically complex communities (meta-omics datasets), this feature has proven more advantageous than using more kingdom-specific tools.

Step 7: Gene annotation

eggNOG-mapper2 (accompanied by EggNOG-v5 database) was also chosen for its ability to discover and properly annotate prokaryotic and unicellular eukaryotic genes.

Gene abundance scores are computed using VERSE (default method "featureCounts", `-z 0`) by enumerating the number of reads aligned to the gene coordinates of the scaffold. TPM values (transcripts per million) are calculated based on the abundance scores per gene.

  • User Option: Metabolic Pathways Database
    The user can choose to have the functional profiles made from the following databases:
    1. KEGG Database (default): based on KO annotations and mapped against KEGG metabolic pathways (mapping files from Dec 2021).
    2. MetaCyc Database: based on EC annotations and mapped against the MetaCyc metabolic pathways (MetaCyc database from Dec 2021)
  • User option: Gene-based taxonomic profile
    Optionally, the pipeline can produce a taxonomic profile based on the predicted genes from their samples (gene-based taxonomic profiles). In this case the predicted gene sequences are classified taxonomically (in the same way as Step 3) and their abundance scores used to quantify the observed taxonomies. Taxonomic profiles created this way showed to be accurate, but taxonomic ranking may not go as deep as TEDread-based TAX profiles. The taxonomic profiles are summarized per-sample and visualized as an html file (KronaTools). Generally these taxonomic profiles are found to coincide with the TEDread-based TAX profiles.
  • User Option: Run AMRFinder
    A user may wish to have not only predicted coding regions (genes), but also predicted antimicrobial resistance peptides (AMRs). Selecting the option to run AMRFinder during job submission, will add this extra Step for the pipeline.

Step 8: Pathway Inference

The annotated genes are used to produce functional profiles of each sample by mapping the annotation to the pathways to which they belong (MinPath). The pathway abundance is calculated using the average TPM values from the abundance scores of each gene in the pathway (genes.to.kronaTable.py from here) producing functional profiles per sample (metabolic pathway profiles). These are summarized per sample as an html file (KronaTools).

Step 9: Dataset-wide summaries

These sets are conditional to submitting datasets with multiple samples (best >3 samples). First, the taxonomic and functional profiles from each samples are collated to create 1 abundance matrix per profile type to represent the dataset and provided to user as final output.

Step 10: Stats & visualizations

These abundance matrixes are also used to produce simple community characterizing visualization plots (e.g. alpha and beta diversity plots, abundance heatmaps etc.).

Optional Step: Creation and evaluation of MAGs

  • User option: Creation and evaluation of MAGs (prokaryotic communities only)

Finally, users may select this option to produce metagenomic-assembled genomes (MAGs) from their samples. The WGSA2 pipeline will bin scaffolds of >1500bp length (MetaBAT 2) based on GC content and read coverage in an attempt to reconstruct the genomes of the most abundant organisms from each sample. Quality assessments of the produced MAGs are then performed providing information about the size, abundance estimation, completeness, contamination level, number of genes, taxonomic affiliation and more for each MAG (CheckM). A biom file collating all the samples is produced.

This option is restricted only to communities that are primarily prokaryotic (eukaryotic single cell organisms may not be properly binned or classified in draft genomes). Taxonomic profiles are again produced and summarized and visualized per sample (KronaTools). These taxonomic profiles are also highly accurate (and concurs with TEDread-based and gene-based taxonomic profiles), but lack sensitivity to less abundant organisms or those with incomplete draft genomes. This is why this strategy for taxonomic profiling is provided as only optional choice for the WGSA2 pipeline.

Software, scripts, tools utilized by the pipeline

Tool versions used by the pipeline are recorded in log file

WGSA2 job submission requirements

Pre-processing through Nephele's QC pipeline
We strongly advise running all sample files through Nephele's QC pipeline, before running the samples through the WGSA2 pipeline. This allows user to familiarize themselves with the quality and quantity of their data, as well as perform basic quality trimming and filtering of their dataset.
Mapping file (template)
A tab-delimited or Excel file containing submitted sample IDs and their respective FASTQ files. Required by the WGSA2 pipeline.
Sequence data source
The WGSA2 pipeline is optimized for metagenomic datasets. For metatranscriptomic datasets, please be advised that some inaccuracies may be inherently produced depending on phylogenetic composition and genetic architecture of some microbial organisms (e.g. organisms with high ITS content/fungi or non-continuous genes).
Sequence data file formats
The input sequence data files must be paired-end (PE) whole genome shotgun reads (not merged) in FASTQ format (gzipped ok), see our FAQ on PE vs SE reads. Due to the scaffold-based approach of the assemblies, single-end (SE) reads are not acceptable for WGSA2 pipeline
Community phylogenetic composition / complexity
The WGSA2 pipeline can handle phylogenetically diverse microbial communities including prokaryotic, viral and unicellular eukaryotic organisms (e.g. protozoa). Currently, it is not optimized for identifying all fungal genes due to their complex structure.

Pipeline output details

  • WGSA2_output_details.pdf: contains these pipeline output details.
  • logfile.txt: A file log from the pipeline. If the pipeline halts in error, this is where the information will be logged for tracing the error.
  • for_analyze_with_microbiomedb.biom: TAX community abundance matrix of the dataset, which can be uploaded to MicrobiomeDB.
  • _corrected.txt: a copy of the mapping file submitted with the pipeline.
TEDreads folder

Output of Steps 1 & 2. These log files track the stats of samples through the trimming, filtering (T), error correction (E) and decontamination (D) Steps. If you chose the Output TEDread fastq files user option, we recommend downloading the TEDreads to this folder.

  • <SampleName>_fastpog.html: log file with filtering, trimming and error-correction stats
  • <SampleName>_kr2_decontamREPORT.txt: a log file reporting abundance and classification of the host reads (removed from the dataset).
TAXprofiles folder

Contents of this folder will depend on the options elected for the pipeline.

  • TEDreadsTAX folder (default): This folder contains logs and reports from the taxonomic classification in Step 3. These file are organized in sub-folders:
    • bin folder: files used by KronaTools for creating the TAXplots html chart. These are the text versions of the TEDread-based taxonomic profiles visualized in the TAXplots_TEDreads.html report.
    • bioms folder: per-sample _json.biom files of the TAX profiles. These can be manually uploaded to MicrobiomeDB or any other analytical platform, if needed.
    • reports folder: contain Kraken 2 report files created during taxonomic classification
    • TAXplots_TEDreads.html: interactive Krona chart of the TEDread-based taxonomic profiles for each sample (created by KronaTools)
  • merged_tables folder (conditional of ≥ 2 samples): Final result of the TAX profiling Step 3 - a dataset-wide TAX abundance matrix. It is produced by merging all the TEDread-based TAX classification and abundance profiles for each sample after all the samples are processed. Since this folder contains dataset-wide collated TAX information, it is only made if a dataset contains ≥ 2 samples. Per-sample results are in the TEDreadsTAX folder.
    • The final TAX abundance matrix is summarized in various ways (Lineage included, Lineage not included, Counts only, Counts + taxonomic ranks, etc.).
    • merged_Counts+Lineage_json.biom TAX abundance matrix in biom format (same as the for_analyze_with_microbiomedb.biom in main folder).
  • DiversityPlots (≥ 2 samples): Diversity plots and exploratory statistics for the dataset. Created based on tables in the merged_tables folder. The pipeline provides alpha diversity indices and boxplots, beta diversity PCoA and nMDS ordination plots, TAX profile heatmap, and rankabundance and rarefication curves. In some cases (e.g. ordination plots), some exploratory statistics will fail due to the content of the dataset. Exploratory statistic options are also not available for single-sample datasets. Expected files:
    • TAX_AlphaDiv_mapping.txt (alpha diversity indices per-sample)
    • TAX_AlphaDiv.pdf
    • TAX_BetaDiv_nMDS.pdf (requires >=3 samples)
    • TAX_BetaDiv_PCoA.pdf (requires >=3 samples)
    • TAX_Profile_Heatmap.pdf
    • TAX_RankAbundanceCurve.pdf
    • TAX_RareficationCurve.pdf
  • geneTAX folder (optional): Output of the gene-based taxonomic profiling user option from Step 6. It contains the TAX classification and abundances of the individual predicted gene sequences.
    • bin folder: contain text versions of the gene-based taxonomic profiles visualized by KronaTools in the TAXplots_genetax.html report.
    • TAXplots_genetax.html: an html interactive report of the gene-based taxonomic profiles for each sample
  • MAGs_TAX folder (optional): This folder is generated only if MAGs creation and taxonomic profiling is enabled. It contains the TAX classification and abundances of the MAGs.
    • bin folder: contain text versions of the MAG-based taxonomic profiles visualized by KronaTools in the TAXplots_MAGx.html report.
    • MAG-based_Counts+TAX.biom: a biom file summarizing the MAG-based taxonomic profiles of all samples.
    • TAXplots_MAGx.html: an html interactive report of the MAG-based taxonomic profiles for each sample
PWYprofiles folder

Summary output (Step 10) of functional annotation and pathway inference (Steps 7 & 8). Contents of this folder will depend on the options elected for the pipeline.

  • keggPWY folder (default) or metacycPWY folder (optional):
    • PWYplots_KEGG.html or PWYplots_METACYC.html (depending on Metabolic Pathways Database user option in Step 7): html interactive chart of the PWY functional profiles for each sample (created by KronaTools)
    • bin folder: per-sample KronaTools text files for creating the PWYplots chart.
    • bioms folder: per-sample _json.biom files of the PWY profiles, created for each sample.
  • merged_tables folder (≥ 2 samples): Dataset-wide PWY abundance tables representing the final results of the PWY functional annotation and abundance scoring. Tables are produced by merging all the PWY profiles for each sample. The final PWY abundance matrix is represented in various ways (Tiers included, Tiers not included, Counts only, PWY Tier rank descriptions). merged_Counts+Tiers_json.biom biom file for this final PWY abundance matrix is also created. Per-sample summaries are in keggPWY or metacycPWY folders.

  • DiversityPlots (conditional of ≥ 2 samples): Simple exploratory plots of PWY functional profiles for the dataset. Created based on PWYprofiles/merged_tables. Diversity plots are not available for single-sample datasets. Expected files:
    • PWY_BetaDiv_nMDS.pdf & PWY_BetaDiv_PCoA.pdf (>=3 samples): PCoA & nMDS ordination plots are created from the PWY abundance
    • PWY_Profile_Heatmap.pdf (>=2 samples): PWY profile heatmap
asmbMetaSpades/Per-Sample Assembly Results folder

Each sample will have its own sub-folder, named <SampleName>_asmb. Content of each <SampleName>_asmb may vary depending on options chosen for the WGSA2 pipeline, but should contain the sample assemblies, and the per-sample analysis results (functional profiling, output of optional choices) of each assembly.

  • Assembly files (default): Output of Step 4.
    • contigs.fasta & final.assembly.fasta : FASTA files of the contiguous consensus sequences (contigs). The independent contigs are in contigs.fasta . The PE-arranged-contigs (scaffolds) are in final.assembly.fasta which is used for all subsequent assembly-based analyses.
    • final.assembly_scaffCoverage.txt: per-scaffold stats and abundance scores such as average read count, % scaffold covered by reads, number covered bases, GC content, etc.
    • final.assembly_stats.txt: assembly stats such as size of assembly, N50 an L50 stats, number of scaffolds created, etc. File also contains TEDread mapping stats and alignment de-replication stats.
    • final.assembly.bam & .bam.bai: assembly alignment and index of alignment files
    • spades.log: log file of the assembly software SPAdes (also captured by the main WGSA logfile.txt)
  • genes folder (default): Information about predicted genes.
    • predicted gene (pregene) files: Output of Step 6.
      • pregenes.{faa, fna, gff, gtf}: predicted genes described as amino acid sequences (.faa), nucleotide sequences (.fna), or scaffold names and coordinates {.gff & .gtf} files. Please note that pregenes.faa file can be provided to the GhostKoala online annotation & pathway mapping tool for additional sample-specific metabolic and taxonomic annotation.
      • pregenes_ABUND.txt: per predicted gene stats such as gene length, read coverage of gene and estimated transcript per million (TPM) value for each gene.
      • pregenes_stats.txt: overall gene prediction & gene annotation stats, such as number of genes predicted and annotated, etc.
      • AMR.{faa & \_info.txt} (optional): Output of Run AMRFinder user option. .faa contains the peptide amino-acid sequences and \_info file contains information about the predicted peptides, such as name, gene symbol, Element Type, some coverage stats, etc.
    • annotations folder (default): Output of Step 7:
      • annots.emapper.{annotations, hits, seed_orthologs}: files output by eggNOG-mapper2 from which annotations are extracted
      • annots.{COG, EC, KEGGmap, KO}.txt: COG, EC, KEGGmap and KO annotations extracted from the emapper output files.
    • pathways folder (default): Output of Step 8. See: How to read MinPath output and the MinPATH publication for more information. Files will be prefixed based on the Gene Annotation Type selected. Information from the MinPATH report is used to create a <SampleName>_4krona.txt file, describing each inferred pathway and corresponding calculated abundance score . The \_4krona.txt file from each assembled sample is used to create the functional profiles n the PWYprofiles folder.

    • pregenesTAX folder (optional): Per-sample output of the user option gene-based taxonomic profiles. taxREPORT.txt is used to make the Krona charts in the TAXprofiles/geneTAX folder. Also contains the nucleotide sequences of classified and unclassified predicted genes.
  • scaffTAX folder (optional): This folder contains the per-sample output of the user option Produce taxonomic annotation on scaffolds in Step 4. It contains predicted taxonomic affiliations of assembled scaffolds, including FASTA (.fna) sequences of classified and unclassified scaffolds and the Kraken 2 report.

  • MAGs folder (optional): Output of user option Creation and evaluation of MAGs. Contains binned scaffolds and draft genomes and quality assessments. Files will vary based on the community composition, organismal abundance and sequence representation.
    • mags fasta files (draft genomes):
      • mag.{number}.fa: fasta file containing scaffolds assumed to be representing the same organism or lineage
      • mag.{lowDepth, tooShort, unbinned}.fa: fasta files containing scaffolds that do not meet the binning tool's (MetaBAT 2) criteria for binning (see MetaBAT 2 publication). These files are also likely to contain scaffolds affiliated with unicellular eukaryotic organisms (if such are present in the sample).
    • magsQA folder: Output of CheckM's bin quality assessment tool. These files and folders are: bins/, QCplots/, storage/, checkm.log, lineage.ms. More information can be found on CheckM's wiki. Of note are:
      • magsQA/bins/: aa and nt sequences for the predicted genes from that bin, along with gene coordinates, annotations (.gff) and other files
      • magsQA/QCplots: plots describing QC stats for each bin (more info here).
    • Summaries of CheckM output:
      • mags_SUMMARY.txt: taxonomic, abundance, coverage and other information and stats about each MAG. This file is also used to create the MAGs-based TAX profiles reported in TAXprofiles/MAGs_TAX folder.
      • mags_coverage.txt: per scaffold coverage and read mapping stats for each MAG
      • mags_profiles.txt: coverage & abundance assessment stats for each MAG. More information here
      • mags_qa.txt: quality assessment stats for each MAG
      • mags_tax.txt: more detailed taxonomic information about each MAG

Accuracy of WGSA2 profiling

We sought to assess and benchmark the accuracy of the WGSA2 taxonomic assignments by evaluating results when running the pipeline with known microbial community. For this, we used the 2nd CAMI Toy Human Microbiome Project Dataset (Sczyrba et al. 2017). The links below, show the TAX profiles from 1 of CAMI dataset sample: S6 (Oral cavity microbiome), produced by the various WGSA2 profiling strategies.

The theoretical taxonomic profiles for the CAMI oral cavity sample S6 can be seen in a Krona Plot in the first link below. The subsequent links contain the taxonomic profiles as produced by WGSA2 with the same sample full set of reads (_allReads), and a subset of 1 million (_sub1Mreads; also part of the WGSA2 example dataset). The links correspond to the taxonomy assignments produced by the clean reads (TEDreads), the assembled genes (GENE-based) and draft genomic assemblies (MAGs) as Krona Plots.

If you wish to test the WGSA2 pipeline, you can find and download a small example dataset (3 CAMI samples, subsampled at 1 million reads) and corresponding mapping file specific for the WGSA2 pipeline, from Nephele's User Guide page.

Tools and References

  • Joint Genome Institute. BBtools. 2014
  • Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu, fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, 01 September 2018, Pages i884–i890, https://doi.org/10.1093/bioinformatics/bty560
  • Bankevich et al. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. J Comput Biol. 2012 May; 19(5): 455–477.doi: 10.1089/cmb.2012.0021
  • Langmead, B., Trapnell, C., Pop, M. et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol, (2009). https://doi.org/10.1186/gb-2009-10-3-r25
  • Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PMID: 19505943; PMCID: PMC2723002.
  • Kang, Dongwan D et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies, PeerJ. 2019, doi:10.7717/peerj.7359
  • Cantalapiedra Carlos P., Hernandez-Plaza Ana, Letunic Ivica, Bork Peer, Huerta-Cepas Jaime. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. biorxiv. (2021) doi: https://doi.org/10.1101/2021.06.03.446934
  • Hyatt, Doug et al., Prodigal: prokaryotic gene recognition and translation initiation site identification., BMC bioinformatics vol. 11 119. 8 Mar. 2010, doi:10.1186/1471-2105-11-119
  • Parks, Donovan H et al., CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes., Genome research vol. 25,7 (2015): 1043-55. doi:10.1101/gr.18607214
  • Ondov, B.D., Bergman, N.H. Phillippy, A.M. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics, 385 (2011). https://doi.org/10.1186/1471-2105-12-385
  • Ye Y, Doak TG (2009) A Parsimony Approach to Biological Pathway Reconstruction/Inference for Genomes and Metagenomes. PLOS Computational Biology 5(8): e1000465. https://doi.org/10.1371/journal.pcbi.1000465
  • Caspi et al 2020, The MetaCyc database of metabolic pathways and enzymes - a 2019 update, Nucleic Acids Research 48(D1):D445-D453
  • Zhu, Q., Fisher, S.A., Shallcross, J., Kim, J. (Preprint). VERSE: a versatile and efficient RNA-Seq read counting tool. bioRxiv 053306. doi: http://dx.doi.org/10.1101/053306
  • H. Li, Seqtk: a fast and lightweight tool for processing FASTA or FASTQ sequences, 2013.
  • Carlos P. Cantalapiedra, Ana Hernandez-Plaza, Ivica Letunic, Peer Bork, and Jaime Huerta-Cepas., eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. 2021, Molecular Biology and Evolution, https://doi.org/10.1093/molbev/msab293
  • Minoru Kanehisa, Yoko Sato, Masayuki Kawashima, Miho Furumichi, Mao Tanabe, KEGG as a reference resource for gene and protein annotation, Nucleic Acids Research, 2016, https://doi.org/10.1093/nar/gkv1070
  • Wood, D.E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol, 2019. https://doi.org/10.1186/s13059-019-1891-0
  • R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL

Pipeline design & author

Angelina Angelova PhD, Bioinformatics and Computational Biosciences (BCBB), NIH/NIAID/OCICB