Metagenomics Inference PICRUSt2 Pipeline

You can use the biom and fasta files generated by the 16S amplicon pipelines (DADA2, QIIME2) as input for this Metagenomics Inference pipeline. The Nephele implementation leverages the PICRUSt2 (version 2.4.1) code and documentation released by the Huttenhower lab; to learn more, see the full pipeline script documentation at https://github.com/picrust/picrust2/wiki/Full-pipeline-script and the tutorial at https://github.com/picrust/picrust2/wiki/PICRUSt2-Tutorial-(v2.4.1). This implementation is only meant to be used with the outputs of 16S analysis.

PICRUSt2 performs the four key steps outlined on its wiki: (1) sequence placement, (2) hidden-state prediction of genomes, (3) metagenome prediction, and (4) pathway-level prediction. The outputs are further annotated with descriptions of the functional categories.

Finally, the predicted pathways are also plotted in an interactive heatmap using Morpheus.

Input Files

  • BIOM File: The biom file contains the OTU and taxonomy tables to be analyzed. This pipeline accepts the biom file produced by the Nephele Analysis pipelines QIIME2 (BIOM v2.1.0 formatted file) and DADA2 (BIOM V1 format). These files are typically named "feature-table.biom" or "taxa.biom".
  • FASTA File: Amplicon sequence variants (ASVs) or OTU representative sequences (e.g. "dna-sequences.fasta", "seq.fasta"). The fasta file must correspond to the biom file; if sequences are missing from the fasta file, PICRUSt2 will proceed with the sequences that are available and a warning message will be written to the log file.
  • Mapping File: Spreadsheet or text file (download sample)
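
Before submitting, it can help to confirm that the feature IDs in the biom file all appear in the fasta file. The following is a minimal sketch of such a check, not part of the Nephele pipeline: the function names and example data are hypothetical, and the feature IDs are assumed to have already been extracted from the biom file into a plain list.

```python
from typing import Iterable, Set


def fasta_ids(lines: Iterable[str]) -> Set[str]:
    """Collect sequence IDs (the first token after '>') from FASTA lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                ids.add(header.split()[0])
    return ids


def missing_from_fasta(table_ids: Iterable[str], fasta_lines: Iterable[str]) -> Set[str]:
    """Return feature IDs that are in the table but absent from the FASTA."""
    return set(table_ids) - fasta_ids(fasta_lines)


# Hypothetical data standing in for a real biom table and fasta file.
example_table_ids = ["ASV1", "ASV2", "ASV3"]
example_fasta = [">ASV1 some description", "ACGT", ">ASV3", "GGCC"]
print(sorted(missing_from_fasta(example_table_ids, example_fasta)))  # ['ASV2']
```

For a real file, pass an open file handle as `fasta_lines`; any IDs reported here correspond to the sequences that would trigger the warning described above.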

User Options/Parameters

  • Max NSTI: Sequences with NSTI values above this value will be excluded. NSTI stands for Nearest Sequenced Taxon Index and represents how close the placed ASV is to the nearest reference 16S sequence. Smaller values mean shorter distances and more accurate predictions.
  • Min Reads: Minimum number of reads across all samples for each input ASV. ASVs below this cut-off will be counted as part of the RARE category in the stratified output.
  • Min Samples: Minimum number of samples in which an ASV must be identified. ASVs below this cut-off will be counted as part of the RARE category in the stratified output.
  • Stratified:
    • None: the output will not be stratified per ASV, instead the output tables will report annotations per sample. For most purposes, you should use this option.
    • Stratified only: select this option to generate stratified tables at all steps (will increase run-time).
    • Stratified & Per Sequence Contrib: select to specify that MinPath is run on the genes contributed by each sequence (i.e. a predicted genome) individually. Note this will greatly increase the runtime. The output will be the predicted pathway abundance contributed by each individual sequence. This is in contrast to the default stratified output, which is the contribution to the community-wide pathway abundances.
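
For readers who want to reproduce a run outside Nephele, the options above correspond roughly to arguments of the PICRUSt2 full pipeline script. The sketch below only builds the command line and does not run it. The function name, the three stratification labels, and the file names are hypothetical, and this is an assumption about how the options map, not a description of Nephele's internal code.

```python
import shlex
from typing import List, Optional


def build_picrust2_command(
    fasta: str,
    table: str,
    out_dir: str,
    max_nsti: Optional[float] = None,
    min_reads: Optional[int] = None,
    min_samples: Optional[int] = None,
    stratified: str = "none",
    processes: int = 1,
) -> List[str]:
    """Build an argument list for picrust2_pipeline.py (assumed option mapping)."""
    choices = ("none", "stratified", "per_sequence_contrib")
    if stratified not in choices:
        raise ValueError(f"stratified must be one of {choices}")

    cmd = ["picrust2_pipeline.py", "-s", fasta, "-i", table,
           "-o", out_dir, "-p", str(processes)]
    # Only pass a filter when the user set it, so PICRUSt2 defaults apply otherwise.
    if max_nsti is not None:
        cmd += ["--max_nsti", str(max_nsti)]
    if min_reads is not None:
        cmd += ["--min_reads", str(min_reads)]
    if min_samples is not None:
        cmd += ["--min_samples", str(min_samples)]
    # Per-sequence contributions are requested on top of stratified output.
    if stratified in ("stratified", "per_sequence_contrib"):
        cmd.append("--stratified")
    if stratified == "per_sequence_contrib":
        cmd.append("--per_sequence_contrib")
    return cmd


example = build_picrust2_command(
    "seq.fasta", "taxa.biom", "picrust2_out", max_nsti=2.0, stratified="stratified"
)
print(" ".join(shlex.quote(arg) for arg in example))
```

Returning a list instead of a single string keeps the arguments safe to hand to `subprocess.run` without shell quoting problems.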

Output Files/Directories

The Nephele implementation of PICRUSt2 produces the following key output files:
  • EC_metagenome_out
    • pred_metagenome_unstrat.tsv.gz: unstratified EC number metagenome predictions
    • seqtab_norm.tsv.gz: the ASV abundance table normalized by predicted 16S copy number
    • weighted_nsti.tsv.gz: the per-sample NSTI values weighted by the abundance of each ASV
    • pred_metagenome_unstrat_descrip.tsv.gz: the same as "pred_metagenome_unstrat.tsv.gz" but with an additional column of descriptions for the annotations
    • pred_metagenome_contrib.tsv.gz: stratified output, i.e. the metagenome contributions per ASV. This file can be very large and takes longer to compute
  • KO_metagenome_out: as EC_metagenome_out above, but for KO metagenomes
  • pathways_out: folder containing predicted pathway abundances based on predicted EC number abundances
    • path_abun_predictions.tsv.gz: table of pathway abundances within each predicted genome
    • path_abun_contrib.tsv.gz: table of stratified MetaCyc pathway abundances
    • path_abun_unstrat.tsv.gz: table of unstratified pathway abundances which are based on the community-wide pathway abundances
    • path_abun_unstrat_descrip.tsv.gz: the same file listed above but annotated with descriptions corresponding to the pathway IDs
    • path_abun_unstrat_per_seq.tsv.gz: table of unstratified pathway abundances based on the per-sequence pathway abundances. This file is produced when the "--per_sequence_contrib" option is used, in which case pathway abundances and coverages are calculated for each predicted genome individually
  • pre_aligned_sequences.fasta: output of pre-alignment step. PICRUSt2 only takes sequences which align to the positive strand (see their FAQ), so the pipeline pre-aligns the input FASTA to the reference database and reverse complements those sequences that align poorly (in case mis-orientation is the cause) before running the full analysis.
  • heatmap.html: interactive heatmap of unstratified pathway abundances (from path_abun_unstrat.tsv.gz). See our visualization documentation for more information on this heatmap. Note: due to the interactive features of the heatmap, it must be in the same directory as the heatmap_files folder in order to work.
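
The gzipped tables can also be inspected outside the heatmap. The sketch below ranks the rows of an unstratified table by total abundance across samples, using only the Python standard library. It assumes a tab-separated layout with a header row, an ID in the first column, and one numeric column per sample, so it is not suitable for the "_descrip" files, which carry an extra text column. The function name and the demo pathway IDs and values are hypothetical.

```python
import csv
import gzip
import os
import tempfile
from typing import List, Tuple


def top_features(path: str, n: int = 5) -> List[Tuple[str, float]]:
    """Return the n row IDs with the largest summed abundance across samples."""
    totals = []
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader)  # skip the header row (ID column name + sample names)
        for row in reader:
            if not row:
                continue
            totals.append((row[0], sum(float(value) for value in row[1:])))
    totals.sort(key=lambda item: item[1], reverse=True)
    return totals[:n]


# Hypothetical demo table written to a temporary directory.
demo_rows = [
    ["pathway", "sampleA", "sampleB"],
    ["PWY-A", "1.5", "2.5"],
    ["PWY-B", "10.0", "0.0"],
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "path_abun_unstrat.tsv.gz")
    with gzip.open(demo_path, "wt", newline="") as handle:
        csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(demo_rows)
    print(top_features(demo_path, n=1))  # [('PWY-B', 10.0)]
```

To use it on a real run, point `top_features` at the downloaded path_abun_unstrat.tsv.gz or pred_metagenome_unstrat.tsv.gz file.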

Tools and References

  • PICRUSt2: Douglas, G.M., Maffei, V.J., Zaneveld, J.R. et al. PICRUSt2 for prediction of metagenome functions. Nat Biotechnol 38, 685–688 (2020). https://doi.org/10.1038/s41587-020-0548-6
  • Morpheus, https://software.broadinstitute.org/morpheus
  • Data source: Peluso G, Tian E, Abusleme L, Munemasa T, Mukaibo T, Ten Hagen KG. Loss of the disease-associated glycosyltransferase Galnt3 alters Muc10 glycosylation and the composition of the oral microbiome. J Biol Chem. 2020 Jan 31;295(5):1411-1425. doi: 10.1074/jbc.RA119.009807. Epub 2019 Dec 27. PMID: 31882545; PMCID: PMC6996895.