Nephele runs the DADA2 R package v1.10, following the steps in the package authors' Big Data workflow, with optional use of the DECIPHER package v2.10. We make some minor modifications to the parameters used. Our pipeline is outlined below. If you are new to DADA2, it may be helpful to read through the DADA2 Tutorial.
trim left parameter be increased by 15 bp (on top of any primer lengths). This option is in beta and has not been extensively tested. If you have Ion Torrent data, we are interested in your feedback; please email us!
For paired-end data only.
IDTAXA. (Default: rdp)
IDTAXA will use its own SILVA v132 database. See Databases below.
IDTAXA, we use the authors' modified SILVA v132 SSU trained classifier. More information in the DECIPHER FAQ.
Plot quality profiles of forward and reverse reads. These graphs are saved as qualityProfile_R1.pdf and qualityProfile_R2.pdf.
pqp1 <- plotQualityProfile(file.path(datadir, r1))
pqp2 <- plotQualityProfile(file.path(datadir, r2))
Preprocess sequence data with filterAndTrim. The maxEE, trimLeft, truncQ, and truncLen parameters can be set by the user (defaults used below as example). The filtered sequence files, *_trim.fastq.gz, are output to the filtered_data directory.
filterAndTrim(fwd = file.path(datadir, readslist$R1), filt = file.path(filt.dir, trimlist$R1),
    rev = file.path(datadir, readslist$R2), filt.rev = file.path(filt.dir, trimlist$R2),
    maxEE = 5, trimLeft = 20, truncQ = 4, truncLen = 0, rm.phix = TRUE,
    compress = TRUE, verbose = TRUE, multithread = nthread, minLen = 50)
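To make the maxEE filter more concrete: DADA2 defines a read's expected errors from its Phred quality scores as EE = sum(10^(-Q/10)), and reads whose expected errors exceed maxEE are discarded. The base R sketch below works through that arithmetic. The helper name expected_errors and the quality vectors are hypothetical, chosen only for illustration; they are not part of the Nephele pipeline.

```r
# Expected errors of a read, computed from its Phred quality scores.
# `expected_errors` is a hypothetical helper used only for this illustration.
expected_errors <- function(quals) {
  sum(10^(-quals / 10))
}

# Hypothetical reads: one uniformly high quality, one with a poor tail.
good_read <- rep(30, 100)                  # 100 bases at Q30
poor_read <- c(rep(30, 50), rep(10, 50))   # 50 bases at Q30, then 50 at Q10

ee_good <- expected_errors(good_read)
ee_poor <- expected_errors(poor_read)

# A read is kept when its expected errors do not exceed maxEE.
maxEE <- 5
passes <- c(good = ee_good <= maxEE, poor = ee_poor <= maxEE)
print(passes)
```

With these made-up scores, the Q30 read has about 0.1 expected errors and is kept, while the read with the Q10 tail has about 5.05 and would be removed at maxEE = 5.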
Learn the error rates with learnErrors. The nbases parameter is set to 1e+08. The error rate graphs made with plotErrors are saved as errorRate_R1.pdf and errorRate_R2.pdf. The error profiles, err, are also saved as a list in an R binary object in the intermediate_files directory.
errR1 <- learnErrors(r1, multithread = nthread, nbases = nbases, randomize = TRUE)
pe1 <- plotErrors(errR1, nominalQ = TRUE)
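As a rough illustration of what nbases controls: per the DADA2 documentation, learnErrors reads samples into memory until at least nbases total bases have been accumulated, or until all samples have been read. The base R sketch below mimics only that stopping rule. The per-sample base counts and sample names are made up, and it ignores the shuffling that randomize = TRUE applies to the sample order.

```r
# Target number of bases for error learning (the value used by the pipeline).
nbases <- 1e8

# Hypothetical per-sample base counts (number of reads x read length).
bases_per_sample <- c(s1 = 4e7, s2 = 3.5e7, s3 = 5e7, s4 = 2e7)

# Running total of bases as samples are read in order.
cum_bases <- cumsum(bases_per_sample)

# Stop at the first sample where the running total reaches nbases;
# if the target is never reached, every sample is used.
reached <- which(cum_bases >= nbases)
n_used <- if (length(reached) > 0) unname(reached[1]) else length(cum_bases)

samples_used <- names(bases_per_sample)[seq_len(n_used)]
print(samples_used)
```

In this toy case the first three samples are enough to pass 1e+08 bases, so the fourth would not be read.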
derepR1 <- derepFastq(r1[sample], verbose = TRUE)
ddR1 <- dada(derepR1, err = errR1, multithread = nthread, verbose = 1)
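For readers unfamiliar with dereplication, the idea behind derepFastq is to collapse identical reads into unique sequences with abundance counts, so that dada works on each unique sequence once. The base R sketch below shows that idea on a handful of made-up reads. It is a simplification: the real function also summarizes quality scores per unique sequence, which is omitted here.

```r
# Hypothetical reads from one sample (real reads would come from a FASTQ file).
reads <- c("ACGT", "ACGT", "TTGA", "ACGT", "TTGA", "GGCC")

# Count how many times each distinct sequence occurs,
# then order the unique sequences from most to least abundant.
uniques <- sort(table(reads), decreasing = TRUE)
print(uniques)
```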
For paired-end data, merge the overlapping denoised reads with mergePairs. The minOverlap parameter is set to 12. The trimOverhang, justConcatenate, and maxMismatch parameters are set by the user. The sequence table, seqtab, containing the final amplicon sequence variants (ASVs), is saved as an R binary object to the intermediate_files directory.
mergedReads <- mergePairs(dd$R1, derep$R1, dd$R2, derep$R2, verbose = TRUE, minOverlap = 12, trimOverhang = FALSE, justConcatenate = FALSE, maxMismatch = 0)
seqtab <- makeSequenceTable(mergedReads)
Filter out ASVs of length less than 75 bp. The sequence table is saved as seqtab_min75.rds. Also, filter out chimeras with removeBimeraDenovo, if the option is chosen.
seqtab <- seqtab[, which(seqlengths >= 75)]
seqtabnochimera <- removeBimeraDenovo(seqtab, verbose = TRUE, multithread = nthread)
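The length filter can be sketched on a toy sequence table in base R. In a DADA2 sequence table, rows are samples and the column names are the ASV sequences themselves, so lengths can be taken from the column names. How the pipeline actually computes seqlengths is not shown in this document; using nchar on the column names is an assumption made for this example, and the counts and sequences are invented.

```r
# Toy ASV sequences of different lengths (hypothetical data).
asvs <- c(
  strrep("A", 60),    # 60 bp, below the threshold
  strrep("C", 75),    # exactly 75 bp, kept
  strrep("G", 250)    # 250 bp, kept
)

# Toy sequence table: rows are samples, columns are ASVs, values are read counts.
seqtab <- matrix(
  c(10L, 0L, 5L,
     3L, 7L, 0L),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("sampleA", "sampleB"), asvs)
)

# Assumed derivation of seqlengths: the number of characters in each column name.
seqlengths <- nchar(colnames(seqtab))

# drop = FALSE keeps the result a matrix even if only one ASV survives.
seqtab_min75 <- seqtab[, seqlengths >= 75, drop = FALSE]
print(dim(seqtab_min75))
```

The drop = FALSE argument is a defensive choice for this sketch: without it, a table reduced to a single ASV would collapse to a plain vector and lose its sample and sequence names.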
Classify the remaining ASVs taxonomically with rdp using assignTaxonomy (default). The minBoot parameter for minimum bootstrap confidence is set to 80 and tryRC is set to TRUE. This genus-level result is saved as taxa.biom. Add species annotation to the taxonomic identification using addSpecies. This final result is saved as the biom file taxa_species.biom.
taxa <- assignTaxonomy(seqtab, refdb, multithread = nthread, minBoot = 80, tryRC = TRUE, verbose = TRUE)
taxa.species <- addSpecies(taxa, refdb_species, verbose = TRUE, tryRC = TRUE)
Alternatively, classify with IDTAXA using IdTaxa from the DECIPHER R package. The final result is saved as taxa.biom.
dna <- DNAStringSet(getSequences(seqtab))
ids <- IdTaxa(dna, trainingSet, strand = "both", processors = nthread, verbose = TRUE)
The final results are also saved as a tab-separated text file OTU_table.txt. The final sequence variants used for taxonomic classification are output as seq.fasta. A summary of the counts in the OTU table is saved to otu_summary_table.txt.
Complete descriptions of the intermediate and final output files can be found in the Pipeline Steps above.
Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA and Holmes SP (2016). "DADA2: High-resolution sample inference from Illumina amplicon data." Nature Methods, 13, pp. 581-583. doi: 10.1038/nmeth.3869.
McMurdie PJ and Paulson JN (2016). biomformat: An interface package for the BIOM file format. https://github.com/joey711/biomformat/.
Microsoft and Weston S (2017). foreach: Provides Foreach Looping Construct for R. R package version 1.4.4, https://CRAN.R-project.org/package=foreach.
Murali, A., Bhargava, A., and Wright, E. S. (2018). IDTAXA: a novel approach for accurate taxonomic classification of microbiome sequences. Microbiome, 6(1). doi: 10.1186/s40168-018-0521-5.
Quast C., Pruesse E., Yilmaz P., Gerken, J., Schweer T., Yarza P., Peplies, J., Glöckner, F. O. (2013). "The SILVA ribosomal RNA gene database project: Improved data processing and web-based tools." Nucleic Acids Research, 41(D1), D590-D596. doi: 10.1093/nar/gks1219.
Escapa, I. F., Chen, T., Huang, Y., Gajare, P., Dewhirst, F. E., & Lemon, K. P. (2018). "New Insights into Human Nostril Microbiome from the Expanded Human Oral Microbiome Database (eHOMD): a Resource for the Microbiome of the Human Aerodigestive Tract." MSystems, 3(6), e00187-18. doi: 10.1128/mSystems.00187-18.
DeSantis, T. Z., et al. (2006). "Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB." Applied and Environmental Microbiology, 72(7), 5069-5072. doi: 10.1128/AEM.03006-05.