mothur is an open source software package for bioinformatics data processing initiated by Patrick Schloss and his development team in the Department of Microbiology and Immunology at the University of Michigan. The mothur project aims to develop a single piece of open source, expandable software to fill the bioinformatics needs of the microbial ecology community. mothur incorporates the functionality of DOTUR, SONS, TreeClimber, s-libshuff and others. Nephele uses mothur v1.46.1.
The mothur pipeline only operates on paired-end FASTQ data. Additionally, due to resource limitations, the input data size is limited to 10 Gb unzipped or 2 Gb zipped.
mothur provides a number of commands for sequence processing. These commands allow you to run a fast and flexible sequence analysis pipeline for OTU-based analysis. As an example, the Max length option determines the maximum length of sequences after reads are joined. If you leave the default value of 0, we will infer the correct maximum length from your data. If you wish to override the inferred value, you can set this field to a non-zero value.
The Optimize and Criteria settings allow you to set the start and/or end of your aligned sequences. The start and end positions are determined based on the median start and end positions of sequences. The default value will remove any sequence that starts after the position that 90% of the sequences, or ends before the position that 90% of the sequences. This parameter ensures that sequences align to the same region.
The Diffs setting: Maximum difference between sequences in a cluster. Generally, if your sequence length is 200, you might only want to allow up to 2 differences between this sequence and another to be considered noise and be pre-clustered in the same cluster. A diff=2 means that there will be a maximum of 4 differences between the two sequences that are compared which is 2% divergence. If your dataset has longer reads, you can increase the value of this parameter. Learn more here: https://mothur.org/wiki/pre.cluster/
Contigs are generated using the command make.contigs
, and then renamed using the command
rename.seqs
in order to reduce the memory usage. The make.contigs
command extracts
the sequence and quality score data from the FASTQ files, creates the reverse complement of the reverse read
and then joins the reads into contigs. This command implements a simple algorithm. Briefly, the algorithm
first aligns the pairs of sequences, and then searches across the alignment and identifies any positions
where the two reads disagree. If one sequence has a base and the other has a gap, the quality score of the
base must be over 25 to be considered real. If both sequences have a base at that position, then one of the
bases is required to have a quality score 6 or higher than the other. If it is less than 6 points, then the
consensus bases is set to an N. mothur only supports samples with pair-end reads.
This step utilizes the commands screen.seqs
. Specifically, the screen.seqs
command
preserves sequences that satisfies certain criteria. Several parameters are used to specify the criteria. In
the Nephele's mothur pipeline, 10% of sequences are randomly selected to estimate optimal values for these
parameters. The mothur commands summary.seqs
is used to estimate the parameters
start
and end
, and a custom algorithm maxlength
.
start
and end
: remove sequences that do not start and end at given positions
based on the alignment. It is common that some sequences do not align in the same region as most of
others. These sequences likely are PCR artifacts or came from the contamination in samples. The
start
and end
positions are determined based on the median start and end
positions of sequences. The median start and end positions are generated from the command
summary.seqs
.
maxlength
: remove sequences that exceed a certain length, default 0. Note: this value will
be determined automatically based on the summary statistics of sequence lengths, if it is set to 0. It
is recommended not to set this value manually.The maxlength
largely works as quality assurance on the sequences. This option removes sequences
that are either too long or too short. Excessive long sequences are likely due to incorrect pairing while
short sequences indicate low quality reads. This parameter, however, is generally difficult to determine
precisely because it depends on the 16S region and PCR primers being used. The Nephele's mothur pipeline
includes a custom algorithm that automatically determines most probable length. The algorithm constructs a
histogram based on the sequence lengths and determine the most appropriate cutoff for analysis. The
algorithm, although rudimentary, seems relative robust.
This step returns only the unique sequences and reduces the computing time in the downstream analysis. The
unique.seqs
command returns the unique sequences found in a FASTA-formatted sequence file and a
file that indicates those sequences that are identical to the reference sequences. Often times a collection
of sequences will have a significant number of identical sequences. It consumes considersable processing
time and have to align, calculate distances, and cluster each of these sequences individually. This command
significantly reduces processing time by eliminating and recording duplicate sequences.
This step utilizes the command align.seqs
. This command aligns a user-supplied FASTA-formatted
candidate sequence file to a user-supplied FASTA-formatted template alignment. Taxonomic assignments in the
reference database are then assigned to the OTUs. The reads that failed to align to the reference sequences
are clustered against one another without an external reference sequences. This approach, de novo
OTU picking, is particularly useful for studying populations where there is poor characterization of
existing data. The mothur pipeline currently aligns sequences to either the HOMD or mothur's SILVA v128
databases.
This step removes columns from alignments based on given criteria with the command filter.seqs
.
Alignments generated against reference alignments, such as HOMD or SILVA, often have columns where every
character is either a .
or a -
. These columns are not included in calculating distances
because they have no information in them. By removing these columns, the calculation of a large number of
distances is accelerated.
This step remove sequences that are likely due to sequencing errors with the command
pre.cluster
. This command implements a pseudo-single linkage algorithm originally developed by
Sue
Huse. The basic idea is that abundant sequences are more likely to
generate erroneous sequences than rare sequences. The algorithm proceeds by ranking sequences in order of
their abundance. It then walks through the list of sequences looking for rarer sequences that are within
some threshold of the original sequences. Those that are within the threshold are merged with the larger
sequence. The original Huse method performs this task on a distance matrix, whereas mothur performs the same
task based on the original sequences. The advantage is that the algorithm works on aligned sequences instead
of a distance matrix. This is advantageous because pre-clustering removes a large number of sequences, which
makes the distance calculation much faster.
The command chimera.vsearch
reads a FASTA file and reference file, and outputs potentially
chimeric sequences. This command utilizes the VSEARCH program. VSEARCH stands for vectorized search, as the tool takes
advantage of parallelism in the form of SIMD vectorization as well as multiple threads to perform accurate
alignments at high speed. VSEARCH uses an optimal global aligner (full dynamic programming Needleman-Wunsch), in contrast to USEARCH which by
default uses a heuristic seed and extend aligner. This usually results in more accurate alignments and
overall improved sensitivity (recall) with VSEARCH, especially for alignments with gaps. The VSEARCH program
supports de novo and reference based chimera detection, clustering, full-length and prefix
dereplication, rereplication, reverse complementation, masking, all-vs-all pairwise global alignment, exact
and global alignment searching, shuffling, subsampling and sorting. It also supports FASTQ file analysis,
filtering, conversion and merging of paired-end reads.
Sequences are clustered into OTUs using cluster.split
. cluster.split
first computes the
pairwise distances between sequences and saves them to a distance matrix. Nephele checks the size of this
distance matrix to ensure it will fit in memory (see this FAQ) before the clustering takes
place.
This step generates a consensus taxonomy for an OTU. From each OTU, a single sequence is selected as a
representative. This representative sequence is annotated, and that annotation is applied to all remaining
sequences within that OTU. The command classify.otu
requires parameters such as list, taxonomy,
count, cutoff, label and probabilities. The default is the Wang
method. The Wang method looks at all taxonomies represented in the
template, and calculates the probability a sequence from a given taxonomy would contain a specific
kmer. Then calculates the probability a query sequence would be in a given taxonomy based on the
kmers it contains, and assign the query sequence to the taxonomy with the highest probability. This
method also runs a bootstrapping algorithm to find the confidence limit of the assignment by randomly
choosing with replacement 1/8 of the kmers in the query and then finding the taxonomy.
The remove.lineage
command reads a taxonomy file and a taxon and generates a new file that
contains only the sequences not containing that taxon. The default option is to remove the lineages
chloroplast, mitochondria, eukaryote, and unknown.
This step generates a distance-based phylogenetic tree based on the relaxed neighbor joining (RNJ) algorithm
with the command clearcut
. RNJ modifies the traditional neighbor joining algorithm and achieves
typical asymptotic runtime of O(n2logN) without a significant reduction in the
quality of the inferred trees. As with traditional neighbor joining, RNJ will reconstruct the true tree
given a set of additive distances. Relaxed neighbor joining was developed by Jason Evans, Luke Sheneman, and
James Foster from the Initiative for Bioinformatics and Evolutionary STudies (IBEST) at the
University of Idaho. No custom parameters are available.
combo.*.biom
: BIOM
file of sequence
variant counts with taxonomic assignment at the genus level.combo.*.phylip.dist
: Phylogenetic distance calculated using the commands
dist.seqs
using all unique sequences from pre-clustering.
combo.*.fasta
: The final FASTA file for taxonomic assignment after a series of quality
control steps.combo.*.opti_mcc.*
: Consensus taxonomy and summmary with the optimal Matthews
correlation coefficient (mcc) value.otu_summary_table.txt
: Summary of the numbers of taxonomically assigned reads for each
sample in the dataset.rep-seqs.fasta
: File containing sequences that correspond to each OTU identified after
clustering at 0.03 identity. This file can be used as input to the PICRUSt pipeline along with the biom
file.rep-seqs-unrooted-tree.nwk
: This tree file is produced after calculating distances between
the OTU representative sequences and generating a neighbor joining tree. This can be used for further
analysis with tools such as phyloseq.rep-seqs-rooted-tree.nwk
: This file is derived from the rep-seqs-unrooted-tree.nwk but it
is now rooted. This file can be used as input to the Downstream Analysis pipeline along with the biom
file.