mothur is an open source software package for bioinformatics data processing initiated by Patrick Schloss and his development team in the Department of Microbiology and Immunology at the University of Michigan. The mothur project aims to develop a single piece of open source, expandable software to fill the bioinformatics needs of the microbial ecology community. mothur incorporates the functionality of DOTUR, SONS, TreeClimber, s-libshuff and others. Nephele uses mothur v1.46.1.
The mothur pipeline only operates on paired-end FASTQ data. Additionally, due to resource limitations, the input data size is limited to 10 Gb unzipped or 2 Gb zipped.
mothur provides a number of commands for sequence processing. These commands allow you to run a fast and flexible sequence analysis pipeline for OTU-based analysis. As an example, the Max length option determines the maximum length of sequences after reads are joined. If you leave the default value of 0, we will infer the correct maximum length from your data. If you wish to override the inferred value, you can set this field to a non-zero value.
The Optimize and Criteria settings allow you to set the start and/or end of your aligned sequences. The start and end positions are determined based on the median start and end positions of sequences. The default value will remove any sequence that starts after the position that 90% of the sequences, or ends before the position that 90% of the sequences. This parameter ensures that sequences align to the same region.
The Diffs setting: Maximum difference between sequences in a cluster. Generally, if your sequence length is 200, you might only want to allow up to 2 differences between this sequence and another to be considered noise and be pre-clustered in the same cluster. A diff=2 means that there will be a maximum of 4 differences between the two sequences that are compared which is 2% divergence. If your dataset has longer reads, you can increase the value of this parameter. Learn more here: https://mothur.org/wiki/pre.cluster/
Contigs are generated using the command
make.contigs, and then renamed using the command
rename.seqs in order to reduce the memory usage. The
make.contigs command extracts the sequence and quality score data from the FASTQ files, creates the reverse complement of the reverse read and then joins the reads into contigs. This command implements a simple algorithm. Briefly, the algorithm first aligns the pairs of sequences, and then searches across the alignment and identifies any positions where the two reads disagree. If one sequence has a base and the other has a gap, the quality score of the base must be over 25 to be considered real. If both sequences have a base at that position, then one of the bases is required to have a quality score 6 or higher than the other. If it is less than 6 points, then the consensus bases is set to an N. mothur only supports samples with pair-end reads.
This step utilizes the commands
screen.seqs. Specifically, the
screen.seqs command preserves sequences that satisfies certain criteria. Several parameters are used to specify the criteria. In the Nephele's mothur pipeline, 10% of sequences are randomly selected to estimate optimal values for these parameters. The mothur commands
summary.seqs is used to estimate the parameters
end, and a custom algorithm
end: remove sequences that do not start and end at given positions based on the alignment. It is common that some sequences do not align in the same region as most of others. These sequences likely are PCR artifacts or came from the contamination in samples. The
endpositions are determined based on the median start and end positions of sequences. The median start and end positions are generated from the command
maxlength: remove sequences that exceed a certain length, default 0. Note: this value will be determined automatically based on the summary statistics of sequence lengths, if it is set to 0. It is recommended not to set this value manually.
maxlength largely works as quality assurance on the sequences. This option removes sequences that are either too long or too short. Excessive long sequences are likely due to incorrect pairing while short sequences indicate low quality reads. This parameter, however, is generally difficult to determine precisely because it depends on the 16S region and PCR primers being used. The Nephele's mothur pipeline includes a custom algorithm that automatically determines most probable length. The algorithm constructs a histogram based on the sequence lengths and determine the most appropriate cutoff for analysis. The algorithm, although rudimentary, seems relative robust.
This step returns only the unique sequences and reduces the computing time in the downstream analysis. The
unique.seqs command returns the unique sequences found in a FASTA-formatted sequence file and a file that indicates those sequences that are identical to the reference sequences. Often times a collection of sequences will have a significant number of identical sequences. It consumes considersable processing time and have to align, calculate distances, and cluster each of these sequences individually. This command significantly reduces processing time by eliminating and recording duplicate sequences.
This step utilizes the command
align.seqs. This command aligns a user-supplied FASTA-formatted candidate sequence file to a user-supplied FASTA-formatted template alignment. Taxonomic assignments in the reference database are then assigned to the OTUs. The reads that failed to align to the reference sequences are clustered against one another without an external reference sequences. This approach, de novo OTU picking, is particularly useful for studying populations where there is poor characterization of existing data. The mothur pipeline currently aligns sequences to either the HOMD or mothur's SILVA v128 databases.
This step removes columns from alignments based on given criteria with the command
filter.seqs. Alignments generated against reference alignments, such as HOMD or SILVA, often have columns where every character is either a
. or a
-. These columns are not included in calculating distances because they have no information in them. By removing these columns, the calculation of a large number of distances is accelerated.
This step remove sequences that are likely due to sequencing errors with the command
pre.cluster. This command implements a pseudo-single linkage algorithm originally developed by Sue Huse. The basic idea is that abundant sequences are more likely to generate erroneous sequences than rare sequences. The algorithm proceeds by ranking sequences in order of their abundance. It then walks through the list of sequences looking for rarer sequences that are within some threshold of the original sequences. Those that are within the threshold are merged with the larger sequence. The original Huse method performs this task on a distance matrix, whereas mothur performs the same task based on the original sequences. The advantage is that the algorithm works on aligned sequences instead of a distance matrix. This is advantageous because pre-clustering removes a large number of sequences, which makes the distance calculation much faster.
chimera.vsearch reads a FASTA file and reference file, and outputs potentially chimeric sequences. This command utilizes the VSEARCH program. VSEARCH stands for vectorized search, as the tool takes advantage of parallelism in the form of SIMD vectorization as well as multiple threads to perform accurate alignments at high speed. VSEARCH uses an optimal global aligner (full dynamic programming Needleman-Wunsch), in contrast to USEARCH which by default uses a heuristic seed and extend aligner. This usually results in more accurate alignments and overall improved sensitivity (recall) with VSEARCH, especially for alignments with gaps. The VSEARCH program supports de novo and reference based chimera detection, clustering, full-length and prefix dereplication, rereplication, reverse complementation, masking, all-vs-all pairwise global alignment, exact and global alignment searching, shuffling, subsampling and sorting. It also supports FASTQ file analysis, filtering, conversion and merging of paired-end reads.
Sequences are clustered into OTUs using
cluster.split first computes the pairwise distances between sequences and saves them to a distance matrix. Nephele checks the size of this distance matrix to ensure it will fit in memory (see this FAQ) before the clustering takes place.
This step generates a consensus taxonomy for an OTU. From each OTU, a single sequence is selected as a representative. This representative sequence is annotated, and that annotation is applied to all remaining sequences within that OTU. The command
classify.otu requires parameters such as list, taxonomy, count, cutoff, label and probabilities. The default is the Wang method. The Wang method looks at all taxonomies represented in the template, and calculates the probability a sequence from a given taxonomy would contain a specific kmer. Then calculates the probability a query sequence would be in a given taxonomy based on the kmers it contains, and assign the query sequence to the taxonomy with the highest probability. This method also runs a bootstrapping algorithm to find the confidence limit of the assignment by randomly choosing with replacement 1/8 of the kmers in the query and then finding the taxonomy.
remove.lineage command reads a taxonomy file and a taxon and generates a new file that contains only the sequences not containing that taxon. The default option is to remove the lineages chloroplast, mitochondria, eukaryote, and unknown.
This step generates a distance-based phylogenetic tree based on the relaxed neighbor joining (RNJ) algorithm with the command
clearcut. RNJ modifies the traditional neighbor joining algorithm and achieves typical asymptotic runtime of O(n2logN) without a significant reduction in the quality of the inferred trees. As with traditional neighbor joining, RNJ will reconstruct the true tree given a set of additive distances. Relaxed neighbor joining was developed by Jason Evans, Luke Sheneman, and James Foster from the Initiative for Bioinformatics and Evolutionary STudies (IBEST) at the University of Idaho. No custom parameters are available.
combo.*.biom: BIOM file of sequence variant counts with taxonomic assignment at the genus level.
combo.*.phylip.dist: Phylogenetic distance calculated using the commands
dist.seqsusing all unique sequences from pre-clustering.
combo.*.fasta: The final FASTA file for taxonomic assignment after a series of quality control steps.
combo.*.opti_mcc.*: Consensus taxonomy and summmary with the optimal Matthews correlation coefficient (mcc) value.
otu_summary_table.txt: Summary of the numbers of taxonomically assigned reads for each sample in the dataset.
rep-seqs.fasta: File containing sequences that correspond to each OTU identified after clustering at 0.03 identity. This file can be used as input to the PICRUSt pipeline along with the biom file.
rep-seqs-unrooted-tree.nwk: This tree file is produced after calculating distances between the OTU representative sequences and generating a neighbor joining tree. This can be used for further analysis with tools such as phyloseq.
rep-seqs-rooted-tree.nwk: This file is derived from the rep-seqs-unrooted-tree.nwk but it is now rooted. This file can be used as input to the Downstream Analysis pipeline along with the biom file.