DADA2 Pipeline

Packages

Nephele runs the DADA2 R package v1.28 following the steps in the package authors' Big Data workflow, including optional use of the DECIPHER package v2.28.0. We make some minor modifications to the parameters used. Additionally, we construct a phylogenetic tree using MAFFT v7.520 (2023/Mar/22) and FastTree v2.1.11. Our pipeline is outlined below. If you are new to DADA2, it might be helpful to read through the DADA2 Tutorial.

User Options

Ion Torrent Data - Beta: By default, DADA2 is trained to work on Illumina data. Checking this option sets the denoising parameters according to DADA2's suggested values for Ion Torrent data. The DADA2 authors also suggest increasing the trim left parameter by 15 bp (on top of any primer lengths). This option is in beta and has not been extensively tested. If you have Ion Torrent data, we are interested in your feedback - please email us!

Filter and Trim

Trim left: The number of nucleotides to remove from the start of each read, forward and reverse. The values should be chosen based on the lengths of the primers used for sequencing. If your data are untrimmed, this parameter is very important for the DADA2 pipeline. See this FAQ. (Default: 0).
Truncation quality score: Truncate reads at the first instance of a quality score less than or equal to this value. (Default: 4).
Truncation length: The length at which to truncate reads, forward and reverse. Reads shorter than these lengths are discarded. If set to 0, reads are not truncated. If both trim left and truncation length are set, the filtered reads will have length = truncation length - trim left. (Default: 0).
Maximum expected errors (maxEE): After truncation, reads with higher than this many "expected errors" will be discarded. Expected errors are calculated from the nominal definition of the quality score: EE = sum(10^(-Q/10)). (Default: 5).
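As a worked illustration of the two formulas in this list, the following Python sketch computes the expected errors of a read from its quality scores and the read length left when both trim left and truncation length are set. The function names expected_errors and filtered_length, and the example values, are hypothetical and not part of DADA2; this is a minimal sketch of the arithmetic, not the package's implementation.

```python
def expected_errors(quality_scores):
    """Expected errors of a read: EE = sum(10^(-Q/10)) over its quality scores."""
    return sum(10 ** (-q / 10) for q in quality_scores)


def filtered_length(truncation_length, trim_left):
    """Read length after filtering when both options are set (truncation length > 0)."""
    return truncation_length - trim_left


# Hypothetical read: four bases at Q30 and one base at Q10.
quals = [30, 30, 30, 30, 10]
ee = expected_errors(quals)  # 4 * 0.001 + 0.1

# With the default maxEE of 5, reads with more than 5 expected errors are discarded.
keep = ee <= 5

# Hypothetical settings: truncation length 250 and trim left 20.
length = filtered_length(250, 20)

print(f"expected errors: {ee:.3f}, keep read: {keep}, filtered length: {length}")
```

A single low-quality base dominates the sum here: the Q10 base contributes 0.1 expected errors, while the four Q30 bases together contribute only 0.004.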

Merge Pairs

For paired-end data only.

Just concatenate: Concatenate paired reads instead of merging. (Default: FALSE)
Maximum mismatches: The maximum number of mismatches allowed in the overlap region when merging read pairs. (Default: 0).
Trim overhanging sequence: After merging paired-end reads, trim any sequence that overhangs the start of the opposite read. If amplicons are shorter than the read length, e.g. the 16S V4 region, we suggest checking this option. (Default: FALSE).
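To illustrate how the overlap and mismatch settings interact, here is a simplified Python sketch that joins a forward read to the reverse complement of its mate. DADA2's mergePairs aligns the two reads; the sketch swaps in a plain ungapped overlap search instead, so it is an approximation for intuition only. The function names, the min_overlap, max_mismatch and just_concatenate arguments, and the example sequences are assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(forward, reverse, min_overlap=12, max_mismatch=0,
               just_concatenate=False):
    """Merge a read pair by ungapped overlap; return None if no overlap qualifies."""
    rev_rc = reverse_complement(reverse)
    if just_concatenate:
        # DADA2 documents a spacer of 10 Ns between concatenated reads.
        return forward + "N" * 10 + rev_rc
    # Try the longest overlap first: end of the forward read vs. start of rev_rc.
    for overlap in range(min(len(forward), len(rev_rc)), min_overlap - 1, -1):
        tail = forward[-overlap:]
        head = rev_rc[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch:
            # Simplification: keep the forward read's bases in the overlap.
            return forward + rev_rc[overlap:]
    return None


# Hypothetical 35 bp amplicon sequenced with 25 bp reads (15 bp overlap).
amplicon = "ACGTTGCAAGCTTGGATCCATGCAGGTACCTTAGC"
forward = amplicon[:25]
reverse = reverse_complement(amplicon[-25:])

merged = merge_pair(forward, reverse)

# Introduce one mismatch inside the overlap region of the reverse read.
mutated = reverse_complement("CTTGGCTCCATGCAGGTACCTTAGC")
rejected = merge_pair(forward, mutated, max_mismatch=0)

concatenated = merge_pair(forward, reverse, just_concatenate=True)
```

With the default of 0 maximum mismatches, a single disagreement in the overlap is enough to reject the pair in this sketch, which is why heavily error-prone read ends can lower the merge rate.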

Analysis

Pseudo-pooling: Pseudo-pool samples for sample inference. This is useful for identifying rare variants, but significantly increases processing time. For more info, see the DADA2 documentation on pseudo-pooling. (Default: False).
Chimera removal: Remove chimeric sequences. If primers are not trimmed (either prior to submission or using the trim left option), then we suggest unchecking this option. (Default: True).
Taxonomic assignment: Method to be used for taxonomic assignment, either rdp or IDTAXA. (Default: rdp)
Reference database: Reference database to be used for taxonomic assignment. IDTAXA will use its own SILVA database. See Databases below.
Minimum bootstrap value for rdp: If rdp is chosen for the taxonomic assignment method, this is the minimum bootstrap confidence value required for a taxonomic assignment to be made. (Default: 40).
Multiple species ID: For species assignment, if there are multiple exact matches, by default, DADA2 will not make any assignment. If this option is checked, then all exact matches will be listed as the species in the ASV table. (Default: False).
Sampling depth: The number of counts for filtering and subsampling the OTU table for downstream analysis. Samples which have counts below this value will be removed from the downstream analysis. The counts of the remaining samples will be subsampled to this value. If not specified, it will be calculated automatically. (See this FAQ)
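The effect of the sampling depth option can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name rarefy, the dictionary layout of the count table, the fixed seed, and the choice of subsampling without replacement are all illustrative and are not taken from Nephele's implementation.

```python
import random


def rarefy(counts_by_sample, depth, seed=0):
    """Drop samples with fewer than `depth` counts; subsample the rest to `depth`."""
    rng = random.Random(seed)
    rarefied = {}
    for sample, asv_counts in counts_by_sample.items():
        if sum(asv_counts.values()) < depth:
            continue  # below the sampling depth: removed from downstream analysis
        # Expand counts into one entry per read, then draw without replacement.
        pool = [asv for asv, n in asv_counts.items() for _ in range(n)]
        drawn = rng.sample(pool, depth)
        rarefied[sample] = {asv: drawn.count(asv) for asv in asv_counts}
    return rarefied


# Hypothetical count table: sample -> {ASV: count}.
table = {
    "sampleA": {"ASV1": 60, "ASV2": 40},  # 100 counts: subsampled to 50
    "sampleB": {"ASV1": 5, "ASV2": 10},   # 15 counts: removed
    "sampleC": {"ASV1": 30, "ASV2": 20},  # exactly 50 counts: kept unchanged
}
result = rarefy(table, depth=50)
```

Choosing a higher depth retains more reads per sample but removes more samples, so the value is a trade-off between the two.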

Databases

SILVA v138.1 database (also older version v132)
Human Oral Microbiome Database (eHOMD) v15.22 formatted for DADA2 (also older version v15.1)
Greengenes v13.8
For IDTAXA, we use the authors' modified SILVA v132/v138 SSU trained classifier. More information in the DECIPHER FAQ.

Pipeline steps

This is the pipeline workflow along with the outputs given at each step. We link to the specific DADA2 R functions that are used.

Preprocess sequence data with filterAndTrim. The maximum expected errors (maxEE), trim left (trimLeft), truncation quality score (truncQ), and truncation length (truncLen) parameters can be set by user options. The filtered sequence files, *_trim.fastq.gz, are output to the filtered_data directory.
Learn the error rates with learnErrors. The error rate graphs made with plotErrors are saved as errorRate_R1.pdf and errorRate_R2.pdf. The error profiles, err, are also saved as a list in an R binary object in the intermediate_files directory.
Dereplicate reads with derepFastq and run the dada sequence-variant inference algorithm. If the pseudo-pooling user option is checked, then inference is run twice, the second time with the inferred ASVs used as prior information.
For paired-end data, merge the overlapping denoised reads with mergePairs. The default minimum read overlap, minOverlap, parameter is set to 12. Trim overhanging sequence (trimOverhang), just concatenate (justConcatenate), and maximum mismatches (maxMismatch) can be set as user options.
Filter out ASVs of length less than 75 bp. Then, the sequence table, seqtab, containing the final amplicon sequence variants (ASVs), is saved as an R binary object (seqtab.rds) to the intermediate_files directory. Also, filter out chimeras with removeBimeraDenovo, if the option is chosen, and save that result as seqtab_nochimera.rds in the intermediate_files folder.
Depending on the user options for taxonomic assignment and reference database, classify the remaining ASVs taxonomically with
- rdp using assignTaxonomy (default). The minBoot parameter for minimum bootstrap confidence is set as a user option and tryRC is set to TRUE, so the best match from each sequence or its reverse-complement is used. Add species annotation to the taxonomic identification using addSpecies where ambiguous matches will be included if the Multiple species ID user option is checked. This final result is saved as a biom file taxa.biom.
  
  For PE data, if the mergePairs justConcatenate option is checked, species annotation will only be done using the forward reads (R1).
- IDTAXA using IdTaxa from the DECIPHER R package. The final result is saved as taxa.biom.
The final results are also saved as a tab-separated text file OTU_table.txt. The final sequence variants used for taxonomic classification are output as seq.fasta. A summary of the counts in the OTU table is saved to otu_summary_table.txt.
Construct a phylogenetic tree from ASVs using MAFFT and FastTree with default parameters. The tree is then rooted at the midpoint with skbio.tree. This produces tree files in Newick format in the phylo directory: unrooted_tree.nwk and rooted_tree.nwk.
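Two of the filtering rules in the steps above, the 75 bp minimum ASV length and the minimum bootstrap threshold for rdp assignments, can be sketched in Python as follows. The function names, the data layouts, and the example lineage with its bootstrap values are hypothetical; the pipeline itself performs these steps in R, so treat this as an illustration of the rules rather than the pipeline's code.

```python
def drop_short_asvs(asv_counts, min_length=75):
    """Remove ASVs shorter than min_length bp from a {sequence: count} mapping."""
    return {seq: n for seq, n in asv_counts.items() if len(seq) >= min_length}


def apply_min_boot(lineage, bootstraps, min_boot=40):
    """Keep ranks whose bootstrap meets the threshold; leave the rest unassigned."""
    return [taxon if boot >= min_boot else None
            for taxon, boot in zip(lineage, bootstraps)]


# Hypothetical ASVs of length 74, 75 and 250 bp.
asvs = {"A" * 74: 10, "C" * 75: 5, "G" * 250: 100}
kept = drop_short_asvs(asvs)

# Hypothetical lineage (kingdom to genus) with made-up bootstrap values.
lineage = ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
           "Lactobacillaceae", "Lactobacillus"]
bootstraps = [100, 98, 95, 80, 62, 35]
assigned = apply_min_boot(lineage, bootstraps, min_boot=40)
```

In this example the genus bootstrap of 35 falls below the default threshold of 40, so the ASV would be classified only down to the family level.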

Output Files

See Pipeline Steps above for more details on how these files were made.

OTU_table.txt: tab-separated text file of ASV counts and taxonomic assignment
seq.fasta: FASTA file of amplicon sequence variants
taxa.biom: taxonomic assignment in BIOM V1 format, at the genus or species level depending on the database and assignment method chosen
otu_summary_table.txt: summary of the sequence variant counts by sample
taxonomy_table.txt: tab-separated taxonomy file suitable for importing into QIIME2
errorRate_R1/2.pdf: error profile plots
phylo: phylogenetic trees.
- rooted_tree.nwk: rooted tree in Newick format which can be used with Nephele's Downstream/Diversity Analysis pipeline for further exploration.
- unrooted_tree.nwk: unrooted tree
filtered_data: trimmed sequence files
intermediate_files: intermediate files produced by the pipeline; useful for debugging
graphs: output of the visualization pipeline

Tools and References

Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA and Holmes SP (2016). "DADA2: High-resolution sample inference from Illumina amplicon data." Nature Methods, 13, pp. 581-583. doi: 10.1038/nmeth.3869.
Murali, A., Bhargava, A., and Wright, E. S. (2018). "IDTAXA: a novel approach for accurate taxonomic classification of microbiome sequences." Microbiome, 6(1). doi: 10.1186/s40168-018-0521-5.
McMurdie PJ and Paulson JN (2016). biomformat: An interface package for the BIOM file format. https://github.com/joey711/biomformat/.
Microsoft and Weston S (2017). foreach: Provides Foreach Looping Construct for R. R package version 1.4.4, https://CRAN.R-project.org/package=foreach.

Databases

Quast C., Pruesse E., Yilmaz P., Gerken, J., Schweer T., Yarza P., Peplies, J., Glöckner, F. O. (2013). "The SILVA ribosomal RNA gene database project: Improved data processing and web-based tools." Nucleic Acids Research, 41(D1), D590-D596. doi: 10.1093/nar/gks1219.
Escapa, I. F., Chen, T., Huang, Y., Gajare, P., Dewhirst, F. E., and Lemon, K. P. (2018). "New Insights into Human Nostril Microbiome from the Expanded Human Oral Microbiome Database (eHOMD): a Resource for the Microbiome of the Human Aerodigestive Tract." MSystems, 3(6), e00187-18. doi: 10.1128/mSystems.00187-18.
DeSantis, T. Z., et al. (2006). "Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB." Applied and Environmental Microbiology, 72(7), 5069–5072. doi: 10.1128/AEM.03006-05.

Phylogenetic tree

Katoh, K., and Standley, D. M. (2013). "MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability." Molecular Biology and Evolution, 30(4), 772–780. doi: 10.1093/molbev/mst010.
Price, M. N., Dehal, P. S., and Arkin, A. P. (2010). "FastTree 2–approximately maximum-likelihood trees for large alignments." PloS One, 5(3), e9490. doi: 10.1371/journal.pone.0009490.
scikit-bio. Retrieved July 24, 2023, from http://scikit-bio.org/.
