The LoRA pipeline is a complete automated workflow for long metagenomic reads processing, which runs through steps for host removal, long read assembly, microbial taxonomic content classifications, feature predictions, annotations, and abundance scoring, as well as (if elected) functional inference, community stats and visualizations. The pipeline is also equipped with user electable databases specific for host, taxonomic or functional inference, as well as options for antimicrobial resistance (AMR) gene finding, scaffold binning and quality assessment of metagenome assembled genomes (MAGs).
In tandem with our recently released long read NanoporeQC pipeline, the LoRA pipeline provides an automated, flexible and reproducible workflow through the fundamental, computationally and time demanding sequence processing steps of metagenomic analysis. LoRA provides researchers directly with the information they need to dive into their customized biological analyses, skipping through time and resource consuming bioinformatics processing steps.
Since both PacBio (PB) and Oxford Nanopore (ONT) sequencing platform produce only single-end data, the input sequences must be single-end (SE), long reads from metagenomic libraries in FASTQ format.
ONT data: Despite LoRA being able to handle unprocessed (RAW) reads, it is recommended that ONT datasets are run through Nephele's nanoporeQC pipeline, prior to submission into LoRA. This will ensure the removal of lingering sequence adaptors and truncated or low quality reads and ends, which may lead to inaccuracies in the biological results.
PB data: For PacBio datasets, QC pre-processing can be omitted as data tends to be high quality and properly trimmed by the platform. However users may choose to run PB datasets through NanoporeQC, with default settings, only to produce QC report.
A mapping file is required during submission of any dataset to a Nephele pipeline. The LoRA-specific mapping file template can be downloaded here. It is good practice that such file is prepared for each dataset before submission. Mapping files can be an .xlsx or a .txt file with the following structure:
#SampleID | ForwardFastqFile | TreatmentGroup | Description |
---|---|---|---|
Sample1 | Sample1.fastq.gz | Group1 | |
Sample2 | Sample2.fastq.gz | Group2 | my sub sample |
Sample3 | Sample3.fastq.gz | Group2 | |
Sample4 | Sample4.fastq.gz | Group1 |
There are multiple ways to upload large datasets to Nephele. It is recommended to prepare a Globus end point prior to submission
LoRA allows for customization of its workflow during job submission, using the following options:
This value reflects Kraken's scoring scheme [0-1] which adjusts the stringency for the reported taxonomic affiliations. Recommended confidence levels for host detection and removal are between 0.00-0.05 (more info here). Increasing this value leads to more stringent host detection but may omit more distant host-affiliated reads. Decreased confidence threshold captures more of the ambiguous and distant host affiliations.
User may choose a database of host organisms which best suits their dataset’s needs for decontamination. Each host database is a collection of related host genomes. A detailed description of the content of each custom database is provided below.
LoRA can handle shotgun metagenomic datasets produced from both major long read platforms currently: PacBio (PB) and Oxford Nanopore (ONT). In this section, the user needs to choose the sequencing platform used to produce their dataset, so that LoRA may adjust its sequence error-detection algorithms accordingly, during assembly and other steps.
User options are:
The user may choose how many assembly polishing steps they would like their samples to undergo in the last stages of assembly. Metagenomic datasets from complex communities often do not require many polishing steps (0-2) for assessing taxonomic and functional content. However, if the community is composed of more closely related organisms, more polishing steps may be beneficial. The pipeline is limited to 5 maximum polishing steps.
Users may choose which taxonomic classification database will fit their dataset best. Choices are:
This value reflects Kraken's scoring scheme [0-1] which adjusts the stringency for the reported taxonomic affiliations. Recommended confidence levels for taxonomic assignments are between 0.05-0.15 (more info here). Increasing this value produces more accurate taxonomic reports, but also increases the fraction of unassigned reads. Decreased confidence threshold allows for higher level taxonomic assignments to be reported, but also increases false positives. These can be later addressed by applying more stringent filtering of the final community matrix.
User may choose to have assembled contigs binned into putative MAGs and assess their quality (completeness, contamination levels, etc) and taxonomic affiliations with CheckM. Once this option is elected, user may choose to further customize the MAGs with a few more options.
MAGs taxonomy with GTDB: User may choose to have taxonomy assigned to MAGs with GTDB-Tk, in addition to the default CheckM taxonomy. Usually the associated GTDB provides better taxonomic resolution of MAGs, than CheckM. This is recommended for first time explorations of datasets but can be omitted in re-processing.
BlobTools MAG quality plots: Users may wish to explore the quality, coverage, contamination levels and other basic stats for the produced MAGs, through blob visualizations. Recommended for first time explorations of datasets, but can be omitted in re-processing.
This is recommended for initial processing of datasets but the user may choose to omit this time and resource consuming step in re-processing. The step includes gene prediction, annotation and metabolic pathway inference of assembled contigs.
If pathway inference is chosen, user may choose the pathway database, by which the metabolic pathways of their datasets is organized. Choices are:
User may choose to have Antimicrobial Resistance (AMR) gene identification performed against the CARD database. This option can only be performed if metabolic pathways inference is elected.
Functionality: The first step of LoRA is to screen each read against host genomic DNA (Kraken2 tool), based on the user-chosen host DB. By removing such reads, computational time for downstream processes is reduced, and accuracy of community results is improved.
Output: Folder named decontam/
containing decontaminated FASTQ files (_decontam.fastq.gz
) and contamination stat files (_contam
files). User is advised to use the decontaminated reads for all downstream custom analysis. The _contamLOG.txt
file shows how many reads (and their % of total reads) were assigned to host genome and removed from FASTQ files. The _contamREPORT.txt
file gives a detailed report on how many reads were assigned to which host genome.
Functionality: This module involves steps for assembly (metaFlye), read mapping to assembly and abundance scoring (miniMap2), SAM/BAM file processing (samtools), and getting assembly stats (bbtools). These steps are performed for each sample separately.
Output: Folder named assemblies/
with assemblies produced from each sample (assemblies/{sampName}_asmb/
), including _assembly.fasta
, _assembly.bam
, and _assemblySUMMARY.txt
. Other assembly logs are also available per sample within that folder.
_assemblySUMMARY.txt
file. Additional _assemblySUMMARY
files are produced with subsequent customization-based processing steps. For example, a ${sampName}_assemblySUMMARY_RefSeq_wBINqc.txt
is the original assembly summary file, extended with results from taxonomic assignment based on user elected database (module 3) and user-selected binning step (module 8).assemblies/
folder will also contain a few sub-folders from subsequent steps of the workflowFunctionality: Assembled contigs are annotated against genomes from the taxonomic DB chosen by user (Kraken2). Thus taxonomic assignments are produced. Abundance scoring for each ID is based on contig coverage statistics from assemblies. The taxonomic information and abundance scores of each sample are summarized in interactive plots for each sample (KronaTools).
Output: Taxonomic assignments are provided within each sample's assembly folder under assemblies/${sampName}_asmb/asmbTAX/
Content of folder includes a _taxREPORT.txt
file with details on reads mapped to genomes (from database of choice) and a _classLOG.txt
file with general stats on the number of reads with taxonomic assignments. The taxonomic information from each contig of the sample is also added to the _assemblySUMMARY
file, producing _assemblySUMMARY_RefSeq.txt
file.
Functionality: The abundance scores and taxonomic information from all samples is collated into one abundance matrix, reformatted as needed (e.g. biom-format) and visualized with various diversity plots (custom scripts in R language).
Output: The dataset-wide summaries produced by this module are in the profilesTAX
folder, organized by taxonomic database of choice (profilesTAX/${taxDB_name}/
). Thus, if jobs are run with different database options, the pipeline will not overprint taxonomic results, but will accumulate folders within the profilesTAX
. The folder contains:
profilesTAX/${taxDB_name}/kronabin/
folder containing copies of all the _taxREPORT.txt
files from each sampleasmbTAX_mergedTAB_wLIN.txt
)asmbTAX_plots.html
, which visualizing the taxonomic profiles from each sample.profilesTAX/${taxDB_name}/DivPlots
folder containing diversity and abundance visualizations produced from the abundance tableFunctionality: Assembled contigs are scanned for genes (Prodigal), which are then extracted (seqtk) and annotated (eggNOG-mapper2) against the eggNOG database (v5). Annotated genes are used to infer metabolic pathways (MinPath) based on metabolic database of user's choice. Abundance scores for the genes are extracted from the contig coverage stats and used to estimate the abundance of each pathway by averaging abundance for each complete pathway (MinPath default setting).
Output: If user has chosen functional processing of their dataset, there will be asmbPWY/
folders within each samples assembly folder (e.g. assemblies/${sampName}_asmb/asmbPWY/
). Folder will contain:
genes{KO,EC}.{fna,faa}
)genesSUMMARY.txt
file containing information & stats about each gene in the sample, including gene length, reads mapped, coverage, annotation, transcripts per million (TPM), value, etc._aveGEN.{ko,ec}.txt
file containing information about each KO or EC annotation for the sample, average TPM of the annotation and a short description of the gene, depending on the database of its discoveryfeatures/
(produced from gene scans), annotations/
and pathways
(produced in pathway inference)Functionality: The inferred pathway information from all samples, is consolidated and merged into one functional abundance matrix, reformatted as needed (e.g. biom-format) and visualized with various plots (custom scripts in R language). Gene abundance and information is similarly processed and provided as gene abundance matrix but not visualized in the output. Functional summary module only runs if Module 5 is elected to run, and if dataset contains more than 3 samples (stats & visualizations cannot be produced with smaller datasets)
Output: The functional summary folder (profilesPWY/
) is produced and structured similarly to the taxonomic summary folder. The main subfolder profilesPWY/${pwyDB_name}
is named after the metabolic database of choice (e.g. KEGG).
profilesPWY/${pwyDB_name}/pwyBIN
folder, and their content is merged in one summary pathway abundance file, presented in 2 formats: merged_avePWY.${pwyDB_name}_{table.txt,json.biom}
geneBINko
or geneBINec
and their content is merged in one summary gene abundance file for each annotation system: merged_aveGEN.{ko,ec}_table.txt
asmbPWYs_${pwyDB_name}_plot.html
is produced summarizing the metabolic profile of each sample.DivPlots/
folder is outputted containing diversity and abundance visualization plots from the pathway abundance table returned.Functionality: Contigs of each sample are scanned for antimicrobial resistance (AMR) genes (RGI). Output is provided per sample. Module only runs if elected and if Module 5 is elected to run.
Output: Output of the AMR gene identification algorithm can be found in assemblies/${sampName}_asmb/asmbPWY/RGI/
, and includes the .json and .txt files associated with the raw output from RGI tool (FNAs of each gene included) and the main summary stats file rgiFNA_main.txt
, with AMR gene names, location and details.
Functionality:
Output: Contained within assemblies/${sampName}_asmb/MAGs/
folder, this location contains fasta files of binned and unbanned contigs, log files, taxonomy magsTAX/
and quality check magsQA/
associated with each bin.
assemblies/${sampName}_asmb/MAGs/magsQA/magQA_SUMMARY.txt
file, and is extended with GTDB information (if option chosen) in assemblies/${sampName}_asmb/MAGs/magsQA/magQA_SUMMARY_wGTdb.txt
fileassemblies/${sampName}_asmb/MAGs/magsQA/blobPlots
, if user has elected the quality visualizations optionmagsTAX/
folder is produced, containing GTDB-based taxonomic information, summary and a .html interactive plot for each sample.assemblies/${sampName}_asmb/${sampName}_SUMMARY.txt
fileLoRA was benchmarked against theoretical taxonomic and functional profiles of ZymoBiomics HMW DNA standard (D6322) sequenced by Simon et. al in 2023 on long read ONT platform (Figure 1). Three technical replicates from three libraries of varying DNA amounts (3x 1000ng, 3x 500ng, 3x 350ng) were processed through LoRA. Results showed community composition assessment to genus level was comparable to the theoretical profiles with diversity deviations due to higher common ancestor affiliations (Figure 1). Gene and metabolic profiles of various sample sizes also showed comparable to theoretical estimations through KEGG database (Table 1).
Figure 1. LoRA taxonomic and functional benchmarking results, showing theoretical vs LoRA-produced results from ONT-sequenced ZymoBiomics D6322 HMW DNA standard. Taxonomic content is shown in the right panel (krona plots), alpha and beta diversity plots are presented in the left upper panel, and functional profile heat map in the lower left panel.
Table 1. LoRA MAG benchmarking results, showing theoretical vs LoRA-produced MAGs produced from ONT-sequenced ZymoBiomics D633 HMW DNA standard, sequenced at different read depths.
Read Depth | Species | MAGs | Genes (KEGG) | PWYs (KEGG) |
---|---|---|---|---|
Zymo D6322 | 8 | 8 | ~35.1K | ~130 |
>4.0G | 5 | 8 | 26.3K | ~190 |
~2.3G | 6 | 8 | 23.8K | ~170 |
~1.7G | 6 | 9 | 20.9K | ~100 |
~1.3G | 4 | 6 | 21.9K | ~130 |
<1.0G | 4 | 7 | 21.5K | ~118 |
Tool references