QIIME1 Pipeline

QIIME1 is an open-source bioinformatics pipeline for microbiome analysis of raw DNA sequencing data. It is designed to take users from raw sequencing data generated on Illumina or other platforms through to publication-quality graphics and statistics. This includes demultiplexing and quality filtering, OTU picking, taxonomic assignment, phylogenetic reconstruction, and diversity analyses and visualizations. QIIME1 has been applied to studies based on billions of sequences from tens of thousands of samples.

User Options

Preprocessing

  • Minimum Phred quality score: The Phred quality score measures the confidence of each base call made by sequencing platforms such as Illumina and 454. The value set here is the highest score still treated as low quality, so a setting of 19 keeps bases of Q20 or better (at most 1 incorrect base call in 100), which is the recommended minimum. The default is 19.
  • Phred offset: Phred Q scores are usually encoded as ASCII characters with an offset of either 33 or 64. Offset 33 is the most common encoding on modern sequencing platforms, while offset 64 was used on 454 and older Illumina platforms. The default is 33.
  • Maximum ambiguous: Maximum number of degenerate bases (N) allowed in a sequence to retain it. This is applied after quality trimming, and is total over combined paired end reads if applicable. The default is 0.
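
The relationship between the quality string, the Phred offset, and the error rate can be illustrated with a short Python sketch. The function names below are hypothetical and are not part of QIIME1; the sketch only shows the standard Phred arithmetic (Q = ASCII code minus offset, error probability = 10^(-Q/10)) and a simple count of N bases.

```python
def decode_phred(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def passes_max_n(sequence, max_n=0):
    """True if the sequence has no more than max_n ambiguous (N) bases."""
    return sequence.upper().count("N") <= max_n


if __name__ == "__main__":
    scores = decode_phred("5I", offset=33)
    print(scores)                          # Phred scores for '5' and 'I'
    print(error_probability(scores[0]))    # error rate of the first base
    print(passes_max_n("ACGTN", max_n=0))  # one N, so this read is rejected
```

With offset 33, the character '5' decodes to Q20, which corresponds to an error rate of 1 in 100. The same character read with offset 64 would give a negative score, which is one way a wrong offset setting shows up.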

Join Reads

  • Max bad run length: Maximum number of consecutive low-quality base calls permitted before a read is truncated. For V1-V3, a value of at least 10 is recommended. The default is 3.
  • Minimum overlap length: Minimum number of overlapping bases required to join paired-end reads. Must be an integer. The default is 10.
  • Percent difference with overlap: Maximum percentage of differences allowed within the overlapping region. Must be an integer between 1 and 100. The default is 25 (i.e., 25%).
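
The three settings above can be illustrated with a simplified Python sketch. This is not the code used by QIIME1 or fastq-join: the function names are hypothetical, the truncation rule is an approximation of the behavior described above, and the overlap check looks at a single candidate overlap that is assumed to be already aligned, whereas a real joiner searches for the best overlap.

```python
def truncate_at_bad_run(scores, threshold=19, max_bad_run_length=3):
    """Return the number of leading bases to keep.

    A base counts as low quality when its score is <= threshold. The read
    is cut at the start of the first low-quality run that is longer than
    max_bad_run_length.
    """
    run_start, run_len = 0, 0
    for i, q in enumerate(scores):
        if q <= threshold:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > max_bad_run_length:
                return run_start
        else:
            run_len = 0
    return len(scores)


def overlap_is_acceptable(fwd_overlap, rev_overlap, min_overlap=10, perc_max_diff=25):
    """Check one candidate overlap against the two join thresholds.

    rev_overlap is assumed to be reverse-complemented already, so the two
    strings can be compared position by position.
    """
    if len(fwd_overlap) != len(rev_overlap):
        raise ValueError("overlap strings must have the same length")
    length = len(fwd_overlap)
    if length < min_overlap:
        return False
    mismatches = sum(a != b for a, b in zip(fwd_overlap, rev_overlap))
    return 100 * mismatches / length <= perc_max_diff


if __name__ == "__main__":
    print(truncate_at_bad_run([30, 30, 10, 10, 10, 10, 30]))
    print(overlap_is_acceptable("ACGTACGTACGT", "TTTTACGTACGT"))
```

Raising the max bad run length makes truncation less aggressive, which is why a larger value is suggested for longer amplicons such as V1-V3, where quality drops towards the read ends but the overlap is still needed for joining.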

Analysis

Taxonomic Assignment

All of the sequences from each sample are clustered into Operational Taxonomic Units (OTUs) based on their sequence similarity. Seven databases are available for taxonomic assignment: HOMD (Human Oral Microbiome Database; v15.1), SILVA97 and SILVA99 (v132), Greengenes 97 and Greengenes 99 (v13.8), and ITS97 and ITS99 (UNITE+INSDC 18.11.2018). The ITS databases have been updated to the latest version, which contains 817,130 sequences. The sequences in the UNITE databases target the formal fungal barcode, the nuclear ribosomal internal transcribed spacer (ITS) region. UNITE uses the NCBI taxonomy classification with modifications from Index Fungorum. The sequences in these databases are clustered at 97% and 99% identity. There are three main strategies for OTU picking:

  • De novo: Reads are clustered against one another without an external reference sequence collection. Useful for studying populations that are poorly characterized in existing reference data. In addition to clustering, the pipeline also performs taxonomy assignment, sequence alignment, and tree-building steps.
  • Closed reference: Reads are clustered against a reference sequence collection, and any reads that do not hit a sequence in the reference collection are excluded. Useful when existing taxa are well characterized and one is not interested in novel ones. The taxonomy of each reference sequence is carried over to its OTU.
  • Open reference: Reads are clustered against a reference sequence collection and any reads which do not hit the reference collection are clustered de novo. Applicable when existing taxa are well characterized and one also wishes to discover novel ones.
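
The difference between the three strategies can be sketched in Python. This is a toy illustration only, under loud assumptions: sequence identity is computed as the fraction of matching positions between equal-length strings, and the clustering is a simple greedy pass. The clustering tools that QIIME1 actually calls use sequence alignment and more careful heuristics, and all function names here are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("toy identity needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference(reads, reference, threshold=0.97):
    """Assign each read to its best reference hit; return OTUs and misses."""
    otus, failures = {}, []
    for read in reads:
        best_id, best_score = None, 0.0
        for ref_id, ref_seq in reference.items():
            score = identity(read, ref_seq)
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_id is not None and best_score >= threshold:
            otus.setdefault(best_id, []).append(read)
        else:
            failures.append(read)  # closed reference discards these reads
    return otus, failures


def de_novo(reads, threshold=0.97, prefix="denovo"):
    """Greedy clustering: join the first matching seed, else start a new OTU."""
    seeds, otus = [], {}
    for read in reads:
        for otu_id, seed in seeds:
            if identity(read, seed) >= threshold:
                otus[otu_id].append(read)
                break
        else:
            otu_id = f"{prefix}{len(seeds)}"
            seeds.append((otu_id, read))
            otus[otu_id] = [read]
    return otus


def open_reference(reads, reference, threshold=0.97):
    """Closed-reference step first, then de novo clustering of the misses."""
    otus, failures = closed_reference(reads, reference, threshold)
    otus.update(de_novo(failures, threshold))
    return otus


if __name__ == "__main__":
    reference = {"ref1": "AAAAAAAAAA"}
    reads = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC", "CCCCCCCCCG"]
    print(open_reference(reads, reference, threshold=0.9))
```

In this toy run the two C-rich reads have no reference hit, so closed reference picking would drop them, while open reference picking keeps them in a new de novo OTU.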

Pipeline Major Steps

  1. Join forward and reverse short reads as contigs: Contigs are generated using the command join_paired_ends.py. Several parameters are used to determine the overlap and quality of reads to create a contig.
    • pe_join_method: method to use for joining paired-ends, default fastq-join.
    • min_overlap: minimum allowed overlap in base-pairs required to join pairs, default 10 (must be integer).
    • perc_max_diff: maximum allowed percentage of differences within region of overlap, default 25 (between 1 and 100; only applied to the fastq-join method).
  2. Demultiplex FASTQ sequences: The split_libraries_fastq.py command performs demultiplexing of FASTQ sequence data.
    • max_bad_run_length: max number of consecutive low quality base calls allowed before truncating a read, default 3.
    • sequence_max_n: maximum number of degenerate bases (N) allowed in a sequence to retain it. This is applied after quality trimming, and is total over combined paired end reads if applicable, default 0.
    • phred_quality_threshold: default 19.
    • phred_offset: the ASCII offset to use when decoding Phred scores (either 33 or 64), default 33.
  3. OTU picking: The OTU picking step (pick_open_reference_otus.py, pick_closed_reference_otus.py, or pick_de_novo_otus.py) assigns similar sequences to operational taxonomic units, or OTUs, by clustering sequences based on a user-defined similarity threshold. Sequences which are similar at or above the threshold level are taken to represent the presence of a taxonomic unit (e.g., a genus, when the similarity threshold is set at 0.94) in the sequence collection.
  4. Analysis of core diversity: core_diversity_analyses.py is a workflow for running a core set of QIIME1 diversity analyses, beginning with a BIOM table, a mapping file, and an optional phylogenetic tree. The commands include alpha_rarefaction.py, beta_diversity_through_plots.py, summarize_taxa_through_plots.py, plus the (non-workflow) scripts make_distance_boxplots.py, compare_alpha_diversity.py, and group_significance.py.
  5. PICRUSt (Optional): PICRUSt is designed to estimate the gene families contributed to a metagenome by bacteria or archaea identified using 16S rRNA sequencing. Intermediate steps in this pipeline may also be of independent interest, as they allow for phylogenetic prediction of organismal traits using reference examples (here applied to the problem of gene content prediction), and correction for variable marker gene copy number.
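
As a rough illustration of how the parameters in steps 1 and 2 map onto the two scripts, the Python sketch below assembles the command lines as argument lists without running them. The long option names mirror the parameter names listed above; the short input/output flags (-f, -r, -i, -m, -o) follow the QIIME1 script help as far as we are aware and should be checked with each script's -h option. All file paths are hypothetical, barcode-related options are omitted, and this is not the pipeline's own wrapper code.

```python
import shlex


def join_paired_ends_cmd(forward_fp, reverse_fp, output_dir,
                         pe_join_method="fastq-join",
                         min_overlap=10, perc_max_diff=25):
    """Build the argument list for step 1 (joining paired-end reads)."""
    if not isinstance(min_overlap, int):
        raise ValueError("min_overlap must be an integer")
    if not 1 <= perc_max_diff <= 100:
        raise ValueError("perc_max_diff must be between 1 and 100")
    return ["join_paired_ends.py",
            "-f", forward_fp, "-r", reverse_fp, "-o", output_dir,
            "--pe_join_method", pe_join_method,
            "--min_overlap", str(min_overlap),
            "--perc_max_diff", str(perc_max_diff)]


def split_libraries_cmd(reads_fp, mapping_fp, output_dir,
                        max_bad_run_length=3, sequence_max_n=0,
                        phred_quality_threshold=19, phred_offset=33):
    """Build the argument list for step 2 (demultiplexing and filtering)."""
    if phred_offset not in (33, 64):
        raise ValueError("phred_offset must be 33 or 64")
    return ["split_libraries_fastq.py",
            "-i", reads_fp, "-m", mapping_fp, "-o", output_dir,
            "--max_bad_run_length", str(max_bad_run_length),
            "--sequence_max_n", str(sequence_max_n),
            "--phred_quality_threshold", str(phred_quality_threshold),
            "--phred_offset", str(phred_offset)]


def as_shell(args):
    """Render an argument list as a single shell-quoted string."""
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(as_shell(join_paired_ends_cmd("R1.fastq", "R2.fastq", "joined")))
    print(as_shell(split_libraries_cmd("joined/reads.fastq", "map.txt", "split_lib_out")))
```

Building argument lists rather than one long string keeps file names with spaces intact if the list is later passed to a process launcher.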

Output Files/Directories

  • split_lib_out: Demultiplexed and quality-filtered (usually at Q20 or better) paired Illumina reads. For more information, please visit QIIME1 online documentation.
  • otus: The OTU picking step assigns similar sequences to operational taxonomic units, or OTUs, by clustering sequences based on a user-defined similarity threshold. Sequences which are similar at or above the threshold level are taken to represent the presence of a taxonomic unit (e.g., a genus, when the similarity threshold is set at 0.94) in the sequence collection. The final OTU table is summarized in a BIOM file, e.g., otu_table_mc2_w_tax_no_pynast_failures.biom for the open reference OTU picking strategy and otu_table.biom for the closed reference and de novo strategies. These BIOM files are used for the downstream analysis. For more information, please visit QIIME1 online documentation.
  • core_diversity: The diversity analysis generates figures and tables according to the metadata in the mapping file. The analysis includes alpha diversity, beta diversity, distance boxplots, comparison of alpha diversity, and quantitative analysis of group differences. The output from core diversity is summarized in an HTML file (index.html), which lists all the quantitative analyses, such as alpha and beta diversity and group significance, together with the corresponding figures and summaries. For more information, please visit QIIME1 online documentation.
    Note: The QIIME1 pipeline no longer generates the taxonomy summary plots (plot_taxa_summary.py). For microbial communities with complex profiles, it can take a few weeks to generate the taxonomy summary plots. Please consider the Downstream Analysis pipeline for additional analysis.
  • graphs: The visualization pipeline generates Plotly graphs and an interactive heatmap. Please visit the 16S visualization tutorial page to learn how to use them. For more information about the tools used, please visit the 16S visualization pipeline page.
  • otu_summary_table.txt: The file lists the number of taxonomically assigned reads for each sample in the dataset.
  • otu_table.txt: A tab-separated text file of sequence variant counts and taxonomic assignment at the genus level.
  • otu_picrust (optional): Output of closed reference OTU picking using QIIME1 with the Greengenes 99 database.
  • PICRUST_data (optional): This directory contains the final metagenome functional predictions with figures and bar plots. For more information, please visit PICRUSt online documentation.

Tools and References

  1. JG Caporaso, J Kuczynski, J Stombaugh, K Bittinger, FD Bushman, EK Costello, N Fierer, A Gonzalez Pena, JK Goodrich, JI Gordon, GA Huttley, ST Kelley, D Knights, JE Koenig, RE Ley, CA Lozupone, D McDonald, BD Muegge, M Pirrung, J Reeder, JR Sevinsky, PJ Turnbaugh, WA Walters, J Widmann, T Yatsunenko, J Zaneveld and R Knight (2010) QIIME allows analysis of high-throughput community sequencing data. Nature Methods, doi: 10.1038/nmeth.f.303.
  2. MGI Langille, J Zaneveld, JG Caporaso, D McDonald, D Knights, JA Reyes, JC Clemente, DE Burkepile, RL Vega Thurber, R Knight, RG Beiko and C Huttenhower (2013) Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nature Biotechnology, doi: 10.1038/nbt.2676.
  3. RH Nilsson, KH Larsson, AFS Taylor, J Bengtsson-Palme, TS Jeppesen, D Schigel, P Kennedy, K Picard, FO Glöckner, L Tedersoo, I Saar, U Kõljalg, K Abarenkov (2019) The UNITE database for molecular identification of fungi: handling dark taxa and parallel taxonomic classifications. Nucleic Acids Research, doi: 10.1093/nar/gky1022.
  4. UNITE Community (2019): UNITE QIIME release for Fungi. Version 18.11.2018. UNITE Community. https://doi.org/10.15156/BIO/786334.