LongRead Assembly pipeline (LoRA)

Introduction

The LoRA pipeline is a complete automated workflow for long metagenomic reads processing, which runs through steps for host removal, long read assembly, microbial taxonomic content classifications, feature predictions, annotations, and abundance scoring, as well as (if elected) functional inference, community stats and visualizations. The pipeline is also equipped with user electable databases specific for host, taxonomic or functional inference, as well as options for antimicrobial resistance (AMR) gene finding, scaffold binning and quality assessment of metagenome assembled genomes (MAGs).

In tandem with our recently released long read NanoporeQC pipeline, the LoRA pipeline provides an automated, flexible and reproducible workflow through the fundamental, computationally and time demanding sequence processing steps of metagenomic analysis. LoRA provides researchers directly with the information they need to dive into their customized biological analyses, skipping through time and resource consuming bioinformatics processing steps.

List of abbreviations in text

  • ONT: Oxford Nanopore Technology
  • PB: PacBio
  • SampName: Sample Name
  • DB: Database
  • AA: Amino acid
  • NT: Nucleotide
  • bp: base pairs
  • PWY: pathway / pathway inference
  • TAX: taxonomy / taxonomic
  • AMR: antimicrobial resistance
  • TPM: Transcripts per million
  • MAGs: metagenome assembled genomes

Prerequisites for LoRA submission

1. Sequencing data

Since both PacBio (PB) and Oxford Nanopore (ONT) sequencing platform produce only single-end data, the input sequences must be single-end (SE), long reads from metagenomic libraries in FASTQ format.

2. File formats and dataset size limitations
  • LoRA can accept both gzipped (.fastq.gz) and unzipped FASTQ files. Files in .BAM or .FASTA formats are not accepted
  • File names should not contain special characters, but may include dashes "-", underscores "_" and dots "."
  • Due to the huge resource and time requirements of LoRA, which grows with dataset size, the entire submitted dataset must not exceed 150Gb gzipped. If your dataset exceeds this size, please consider splitting it into separate submissions.
3. QC and pre-processing

ONT data: Despite LoRA being able to handle unprocessed (RAW) reads, it is recommended that ONT datasets are run through Nephele's nanoporeQC pipeline, prior to submission into LoRA. This will ensure the removal of lingering sequence adaptors and truncated or low quality reads and ends, which may lead to inaccuracies in the biological results.

PB data: For PacBio datasets, QC pre-processing can be omitted as data tends to be high quality and properly trimmed by the platform. However users may choose to run PB datasets through NanoporeQC, with default settings, only to produce QC report.

4. Mapping file / Metadata file

A mapping file is required during submission of any dataset to a Nephele pipeline. The LoRA-specific mapping file template can be downloaded here. It is good practice that such file is prepared for each dataset before submission. Mapping files can be an .xlsx or a .txt file with the following structure:

#SampleID ForwardFastqFile TreatmentGroup Description
Sample1 Sample1.fastq.gz Group1
Sample2 Sample2.fastq.gz Group2 my sub sample
Sample3 Sample3.fastq.gz Group2
Sample4 Sample4.fastq.gz Group1
5. Data upload

There are multiple ways to upload large datasets to Nephele. It is recommended to prepare a Globus end point prior to submission

User options in job submission

LoRA allows for customization of its workflow during job submission, using the following options:

1. Host detection confidence levels

This value reflects Kraken's scoring scheme [0-1] which adjusts the stringency for the reported taxonomic affiliations. Recommended confidence levels for host detection and removal are between 0.00-0.05 (more info here). Increasing this value leads to more stringent host detection but may omit more distant host-affiliated reads. Decreased confidence threshold captures more of the ambiguous and distant host affiliations.

2. Host detection database

User may choose a database of host organisms which best suits their dataset’s needs for decontamination. Each host database is a collection of related host genomes. A detailed description of the content of each custom database is provided below.

  1. Human & Mouse decontamination DB (default choice)
    • Description: Human and Mouse hosts
    • Content report
    • collected assemblies:
      • Homo sapiens (assembly version GRCh38.p13),
      • Mus musculus (assembly version GRCm39)
  2. Marine decontamination DB
  3. Mosquito decontamination DB
  4. Nematode decontamination DB
  5. Rodent decontamination DB
  6. Primates decontamination DB
  7. Plants decontamination DB
3. Sequence data type

LoRA can handle shotgun metagenomic datasets produced from both major long read platforms currently: PacBio (PB) and Oxford Nanopore (ONT). In this section, the user needs to choose the sequencing platform used to produce their dataset, so that LoRA may adjust its sequence error-detection algorithms accordingly, during assembly and other steps.

4. Data quality type

User options are:

  • RAW: the long read data is raw and has not undergone any QC or pre-processing. This option is recommended for most PB data
  • HQ: the long read data has been pre-processed through a quality-control pipeline (such as Nephele's Nanopore QC), which removes low quality bases from ends of reads, as well as shorter and adapter-containing reads. This pre-processing step is recommended for ONT data.
  • CORR: some long read seq data may require specialized error-correction pre-processing steps, such as CANU. This pre-processing step is recommended for older long read sequence data.
5. Assembly polishing steps

The user may choose how many assembly polishing steps they would like their samples to undergo in the last stages of assembly. Metagenomic datasets from complex communities often do not require many polishing steps (0-2) for assessing taxonomic and functional content. However, if the community is composed of more closely related organisms, more polishing steps may be beneficial. The pipeline is limited to 5 maximum polishing steps.

6. Tax classification DB

Users may choose which taxonomic classification database will fit their dataset best. Choices are:

  • RefSeq: composed of all reference sequence genomes, this common database, is best used for global overview of any microbial community.
  • MGBC: The Mouse Gastrointestinal Bacterial Catalogue provides better bacterial taxonomic assignments for samples that of mouse gut origin.
7. Tax assignment confidence level

This value reflects Kraken's scoring scheme [0-1] which adjusts the stringency for the reported taxonomic affiliations. Recommended confidence levels for taxonomic assignments are between 0.05-0.15 (more info here). Increasing this value produces more accurate taxonomic reports, but also increases the fraction of unassigned reads. Decreased confidence threshold allows for higher level taxonomic assignments to be reported, but also increases false positives. These can be later addressed by applying more stringent filtering of the final community matrix.

8. Produce MAGs

User may choose to have assembled contigs binned into putative MAGs and assess their quality (completeness, contamination levels, etc) and taxonomic affiliations with CheckM. Once this option is elected, user may choose to further customize the MAGs with a few more options.

MAGs taxonomy with GTDB: User may choose to have taxonomy assigned to MAGs with GTDB-Tk, in addition to the default CheckM taxonomy. Usually the associated GTDB provides better taxonomic resolution of MAGs, than CheckM. This is recommended for first time explorations of datasets but can be omitted in re-processing.

BlobTools MAG quality plots: Users may wish to explore the quality, coverage, contamination levels and other basic stats for the produced MAGs, through blob visualizations. Recommended for first time explorations of datasets, but can be omitted in re-processing.

9. Infer metabolic pathways

This is recommended for initial processing of datasets but the user may choose to omit this time and resource consuming step in re-processing. The step includes gene prediction, annotation and metabolic pathway inference of assembled contigs.

10. Metabolic pathway database

If pathway inference is chosen, user may choose the pathway database, by which the metabolic pathways of their datasets is organized. Choices are:

  • KEGG metabolic database + BRITE hierarchies: excellently curated metabolic maps organized in hierarchal classification system.
  • MetaCyc metabolic database: excellently curated metabolic pathways database
11. AMR prediction

User may choose to have Antimicrobial Resistance (AMR) gene identification performed against the CARD database. This option can only be performed if metabolic pathways inference is elected.

Workflow and Outputs

Nephele LoRA workflow
Module 1. Decontamination

Functionality: The first step of LoRA is to screen each read against host genomic DNA (Kraken2 tool), based on the user-chosen host DB. By removing such reads, computational time for downstream processes is reduced, and accuracy of community results is improved.

Output: Folder named decontam/ containing decontaminated FASTQ files (_decontam.fastq.gz) and contamination stat files (_contam files). User is advised to use the decontaminated reads for all downstream custom analysis. The _contamLOG.txt file shows how many reads (and their % of total reads) were assigned to host genome and removed from FASTQ files. The _contamREPORT.txt file gives a detailed report on how many reads were assigned to which host genome.

Module 2. Assembly & Coverage stats

Functionality: This module involves steps for assembly (metaFlye), read mapping to assembly and abundance scoring (miniMap2), SAM/BAM file processing (samtools), and getting assembly stats (bbtools). These steps are performed for each sample separately.

Output: Folder named assemblies/ with assemblies produced from each sample (assemblies/{sampName}_asmb/), including _assembly.fasta, _assembly.bam, and _assemblySUMMARY.txt. Other assembly logs are also available per sample within that folder.

  • Assembly information about the coverage, lengths and number of contigs in sample, is contained within the _assemblySUMMARY.txt file. Additional _assemblySUMMARY files are produced with subsequent customization-based processing steps. For example, a ${sampName}_assemblySUMMARY_RefSeq_wBINqc.txt is the original assembly summary file, extended with results from taxonomic assignment based on user elected database (module 3) and user-selected binning step (module 8).
  • Each assembled sample in the assemblies/ folder will also contain a few sub-folders from subsequent steps of the workflow
Module 3. Taxonomic assignments

Functionality: Assembled contigs are annotated against genomes from the taxonomic DB chosen by user (Kraken2). Thus taxonomic assignments are produced. Abundance scoring for each ID is based on contig coverage statistics from assemblies. The taxonomic information and abundance scores of each sample are summarized in interactive plots for each sample (KronaTools).

Output: Taxonomic assignments are provided within each sample's assembly folder under assemblies/${sampName}_asmb/asmbTAX/ Content of folder includes a _taxREPORT.txt file with details on reads mapped to genomes (from database of choice) and a _classLOG.txt file with general stats on the number of reads with taxonomic assignments. The taxonomic information from each contig of the sample is also added to the _assemblySUMMARY file, producing _assemblySUMMARY_RefSeq.txt file.

Module 4. Taxonomic summary

Functionality: The abundance scores and taxonomic information from all samples is collated into one abundance matrix, reformatted as needed (e.g. biom-format) and visualized with various diversity plots (custom scripts in R language).

Output: The dataset-wide summaries produced by this module are in the profilesTAX folder, organized by taxonomic database of choice (profilesTAX/${taxDB_name}/). Thus, if jobs are run with different database options, the pipeline will not overprint taxonomic results, but will accumulate folders within the profilesTAX. The folder contains:

  • a profilesTAX/${taxDB_name}/kronabin/ folder containing copies of all the _taxREPORT.txt files from each sample
  • a merged taxonomic abundance table from all samples, provided in 2 formats: .TXT and .BIOM (e.g. asmbTAX_mergedTAB_wLIN.txt)
  • an interactive .HTML taxonomic Krona plot, asmbTAX_plots.html, which visualizing the taxonomic profiles from each sample.
  • a profilesTAX/${taxDB_name}/DivPlots folder containing diversity and abundance visualizations produced from the abundance table
Module 5. Functional assignments (optional)

Functionality: Assembled contigs are scanned for genes (Prodigal), which are then extracted (seqtk) and annotated (eggNOG-mapper2) against the eggNOG database (v5). Annotated genes are used to infer metabolic pathways (MinPath) based on metabolic database of user's choice. Abundance scores for the genes are extracted from the contig coverage stats and used to estimate the abundance of each pathway by averaging abundance for each complete pathway (MinPath default setting).

Output: If user has chosen functional processing of their dataset, there will be asmbPWY/ folders within each samples assembly folder (e.g. assemblies/${sampName}_asmb/asmbPWY/). Folder will contain:

  • .FNA & .FAA files of sequences identified as genes via the KO or EC annotation systems (e.g. genes{KO,EC}.{fna,faa})
  • a genesSUMMARY.txt file containing information & stats about each gene in the sample, including gene length, reads mapped, coverage, annotation, transcripts per million (TPM), value, etc.
  • an _aveGEN.{ko,ec}.txt file containing information about each KO or EC annotation for the sample, average TPM of the annotation and a short description of the gene, depending on the database of its discovery
  • folders of functional processing details including features/ (produced from gene scans), annotations/ and pathways (produced in pathway inference)
Module 6. Functional summary (optional & conditional)

Functionality: The inferred pathway information from all samples, is consolidated and merged into one functional abundance matrix, reformatted as needed (e.g. biom-format) and visualized with various plots (custom scripts in R language). Gene abundance and information is similarly processed and provided as gene abundance matrix but not visualized in the output. Functional summary module only runs if Module 5 is elected to run, and if dataset contains more than 3 samples (stats & visualizations cannot be produced with smaller datasets)

Output: The functional summary folder (profilesPWY/) is produced and structured similarly to the taxonomic summary folder. The main subfolder profilesPWY/${pwyDB_name} is named after the metabolic database of choice (e.g. KEGG).

  • the pathways inferred from each sample are consolidated in the profilesPWY/${pwyDB_name}/pwyBIN folder, and their content is merged in one summary pathway abundance file, presented in 2 formats: merged_avePWY.${pwyDB_name}_{table.txt,json.biom}
  • the annotated genes from each sample are consolidated in individual folders based on the annotation system applied geneBINko or geneBINec and their content is merged in one summary gene abundance file for each annotation system: merged_aveGEN.{ko,ec}_table.txt
  • an interactive .html krona plot asmbPWYs_${pwyDB_name}_plot.html is produced summarizing the metabolic profile of each sample.
  • a DivPlots/ folder is outputted containing diversity and abundance visualization plots from the pathway abundance table returned.
Module 7. AMR finding (optional & conditional)

Functionality: Contigs of each sample are scanned for antimicrobial resistance (AMR) genes (RGI). Output is provided per sample. Module only runs if elected and if Module 5 is elected to run.

Output: Output of the AMR gene identification algorithm can be found in assemblies/${sampName}_asmb/asmbPWY/RGI/, and includes the .json and .txt files associated with the raw output from RGI tool (FNAs of each gene included) and the main summary stats file rgiFNA_main.txt, with AMR gene names, location and details.

Module 8. MAGs production, QA & taxonomy (optional & conditional)

Functionality:

  • Binning into MAGs, QC and stats: Contigs of each sample are binned together or apart based on common features (e.g. GC content)(MetaBAT2). Produced bins (in each sample), are ran through the CheckM workflow to assess abundance (using coverage), completeness, contamination levels, number of genes and other stats. The tool also scans for bacterial marker genes to assess taxonomic affiliation of each bin (MAG). Results are provided per bin per sample.
  • GTDB on MAGs: GTdb-Tk is applied to produced bins to further refine taxonomic affiliation of each MAG. Results are provided per bin in each sample.
  • MAG tax visualization BlobPlots: If elected, quality plots for each bin are produced by visualizing the taxonomic affiliations from each contig in each bin (blobTools). Plots are produced per bin per sample.

Output: Contained within assemblies/${sampName}_asmb/MAGs/ folder, this location contains fasta files of binned and unbanned contigs, log files, taxonomy magsTAX/ and quality check magsQA/ associated with each bin.

  • the final quality information of each bin is summarized in the assemblies/${sampName}_asmb/MAGs/magsQA/magQA_SUMMARY.txt file, and is extended with GTDB information (if option chosen) in assemblies/${sampName}_asmb/MAGs/magsQA/magQA_SUMMARY_wGTdb.txt file
  • visualizations of the taxonomic affiliation of each contig in each bin is outputted in assemblies/${sampName}_asmb/MAGs/magsQA/blobPlots, if user has elected the quality visualizations option
  • If user had the GTDB option elected, the magsTAX/ folder is produced, containing GTDB-based taxonomic information, summary and a .html interactive plot for each sample.
  • The binning taxonomic and GTDB information is also added to the assemblies/${sampName}_asmb/${sampName}_SUMMARY.txt file

Considerations

  • consider your data type: the LoRA pipeline is optimized for long read metagenomic datasets only from PacBio or ONT platforms. Submitting other datasets (single cell organisms, cDNA, alternative sequencing platforms or assembled sequences, etc) may produce pipeline failures or inaccuracies in the final results.
  • consider phylogenic content: Due to the nature of long read sequencing, eukaryotic organisms in complex communities are likely to get better taxonomic representation and/or reduce signal from lower-abundance or smaller-genome organisms. In eukaryote-rich communities, this may lead to underrepresentation or lack of prokaryote detection.
  • consider QC: despite being adaptable to raw read processing, it is highly advised that users run ONT data through QC pipeline, prior to submission in LoRA. This ensures removal of low quality reads/bases and adaptor sequences, which may lead to misassembles or misannotations.

Accuracy of LoRA

LoRA was benchmarked against theoretical taxonomic and functional profiles of ZymoBiomics HMW DNA standard (D6322) sequenced by Simon et. al in 2023 on long read ONT platform (Figure 1). Three technical replicates from three libraries of varying DNA amounts (3x 1000ng, 3x 500ng, 3x 350ng) were processed through LoRA. Results showed community composition assessment to genus level was comparable to the theoretical profiles with diversity deviations due to higher common ancestor affiliations (Figure 1). Gene and metabolic profiles of various sample sizes also showed comparable to theoretical estimations through KEGG database (Table 1).

Figure 1. LoRA taxonomic and functional benchmarking results, showing theoretical vs LoRA-produced results from ONT-sequenced ZymoBiomics D6322 HMW DNA standard. Taxonomic content is shown in the right panel (krona plots), alpha and beta diversity plots are presented in the left upper panel, and functional profile heat map in the lower left panel.

Nephele LoRA Figure

Table 1. LoRA MAG benchmarking results, showing theoretical vs LoRA-produced MAGs produced from ONT-sequenced ZymoBiomics D633 HMW DNA standard, sequenced at different read depths.

Read Depth Species MAGs Genes (KEGG) PWYs (KEGG)
Zymo D6322 8 8 ~35.1K ~130
>4.0G 5 8 26.3K ~190
~2.3G 6 8 23.8K ~170
~1.7G 6 9 20.9K ~100
~1.3G 4 6 21.9K ~130
<1.0G 4 7 21.5K ~118

References

    Tool references

  • eggNOG-mapper2 - Carlos P Cantalapiedra, Ana Hernández-Plaza, Ivica Letunic, Peer Bork, Jaime Huerta-Cepas, eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale, Molecular Biology and Evolution, Volume 38, Issue 12, December 2021, Pages 5825-5829, https://doi.org/10.1093/molbev/msab293
  • Prodigal - Hyatt, D., Chen, GL., LoCascio, P.F. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010). https://doi.org/10.1186/1471-2105-11-119
  • CheckM - Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015 Jul;25(7):1043-55. doi: 10.1101/gr.186072.114. Epub 2015 May 14. PMID: 25977477; PMCID: PMC4484387.
  • CheckM2 - Chklovski, A., Parks, D.H., Woodcroft, B.J. et al. CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nat Methods 20, 1203-1212 (2023). https://doi.org/10.1038/s41592-023-01940-w
  • GTDB-tk - Pierre-Alain Chaumeil, Aaron J Mussig, Philip Hugenholtz, Donovan H Parks, GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database, Bioinformatics, Volume 36, Issue 6, March 2020, Pages 1925-1927, https://doi.org/10.1093/bioinformatics/btz848
  • Kraken2 - Wood, D.E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol 20, 257 (2019). https://doi.org/10.1186/s13059-019-1891-0
  • R language - R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.URL https://www.R-project.org.
  • Minimap2 - Heng Li, Minimap2: pairwise alignment for nucleotide sequences, Bioinformatics, Volume 34, Issue 18, September 2018, Pages 3094-3100, https://doi.org/10.1093/bioinformatics/bty191
  • BlobTools - Laetsch DR and Blaxter ML. BlobTools: Interrogation of genome assemblies [version 1; peer review: 2 approved with reservations]. F1000Research 2017, 6:1287 (https://doi.org/10.12688/f1000research.12232.1)
  • fastp - Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu, fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, September 2018, Pages i884-i890, https://doi.org/10.1093/bioinformatics/bty560
  • MetaFlye - Kolmogorov, M., Yuan, J., Lin, Y. et al. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol 37, 540-546 (2019). https://doi.org/10.1038/s41587-019-0072-8