DiscoVir

Introduction

DiscoVir is a bioinformatics pipeline for exploring viruses (ssDNA, dsDNA phage, and giant DNA viruses) and viral diversity in metagenomes. DiscoVir integrates the most up-to-date and comprehensive tools available for analyzing viruses from metagenomics data. The pipeline code can be found in our GitHub repository.

Inputs

The pipeline accepts metagenomic assembly sequences (.fasta) and a corresponding binary alignment map (.bam) file of the reads mapped back to the assembly as input. These files are produced from the WGSA2 pipeline¹ in Nephele or can be generated elsewhere. Both files are needed for each sample being submitted to the pipeline and included in the mapping file.

User Options

geNomad filter

GeNomad² produces scores that indicate the confidence of the viral genome predictions. These scores undergo a calibration to adjust those scores to compute false discovery rates and improve performance of the tool. The user has the option to change the filtering stringency using three preset options that take into account these scores, false discovery rates, and other parameters that are used in their predictions. Please see the geNomad preset help for more details. Filtering options:

Default: Follows the default parameters set for post-classification filtering.
Conservative: Selecting this option will make the filtering process more aggressive.
Relaxed: Disables all filters.

CheckV filter

The user has the option to filter and remove low-quality viral genomes for downstream processing and analysis using CheckV³. Filtering options are:

all: This option will not filter the viral genomes and all genomes will be included in the rest of the pipeline, including those with undetermined quality (default).
medium: Only allow complete genomes and all genomes of completeness >50% for the rest of the pipeline.
high: Only allow complete genomes and genomes of completeness >90% for the rest of the pipeline.
complete: Only allow complete genomes. Additionally, any genomes with "non-determined" quality will be removed for the rest of the pipeline.

Minimum sequence length

Minimum length of input sequences for vOTU clustering, DRAM-v⁴ annotation, and, if AMGs are selected, VirSorter2.0⁵. Range: 100-100000. Default is 5000.

Host prediction (optional)

Run host prediction for phage using iPHoP⁶. Host prediction will substantially increase the time for the analysis to be completed, so it is optional.
Minimum iPHoP confidence score: Range: 75-100. Default is 90, which is the equivalent to approximately 10% false discovery rate (FDR). Increasing this value will decrease the FDR (e.g. 95 = ~5 FDR). Please see iPHoP's documentation for more details on how to interpret these confidence scores.

Functional Annotation

AMGs: The user has the option to screen viral genomes for auxiliary metabolic genes (AMGs) and produce an abundance matrix. To find AMGs, DRAM-v requires inputs from VirSorter2.0. Thus, viral genomes are passed through VirSorter2.0 prior to DRAM-v for this specific purpose. This may cause the loss of some viral genomes initially discovered by geNomad, therefore not all viral genomes may be annotated for AMGs. Note: By default, functional gene annotation of all viral genomes following CheckV is performed using DRAM-v in DiscoVir even if this option is not selected. However, without selecting this option, annotations will not include specific information about AMGs.
DIAMOND⁷ NCBI nr annotation: The user has the option to annotate viral function genes identified from geNomad using DIAMOND against NCBI's nr database.

Pipeline Steps

Metagenomic assemblies are searched for viral genomes using geNomad. All viral genomes detected within the user-selected parameters are annotated taxonomically and functionally. Prophage are also trimmed during this step to remove any host sequence.
Read counts of viral genomes and viral genes identified by geNomad in each sample are extracted using the original input .bam file with VERSE⁸. Read per kilobase pairs (RPK) and copies per million (CPM, equivalent to transcripts per million in RNA-seq) are generated for each viral genome and geNomad gene.
Viral genomes are passed through CheckV to obtain quality information. Depending on the selected user option, genomes are filtered based on completeness and/or length with seqtk⁹.
The final set of per-sample viral genomes are additionally annotated for function:
- Viral genomes passing quality and length filters are functionally annotated with DRAM-v by default.
- RPK and CPM values are calculated for each gene in each sample using Liftoff¹⁰ and VERSE. Gene abundance matrices with CPM values for each sample are produced for each KOfam ID, Pfam, and VOG ID. A heatmap with VOGIDs is produced with plotnine¹¹.
- If selected by the user, the viral genomes will also be annotated with DIAMOND against NCBI's nr database.
- If selected by the user, the viral genomes will be passed through VirSorter2.0 and then DRAM-v again to look for AMGs. RPK and CPM values are calculated for each AMG in each sample. An AMGs abundance matrix with CPM values for each sample will be produced.
vOTUs and their abundances are generated.
- The viral genomes from all samples are first deduplicated with BBMap's dedupe¹².
- Deduplicated genomes are then clustered at 95% identity and 0.85 coverage following MUIViG guidelines using MMseqs2¹³. The representative sequence of each cluster is the longest sequence and serves as the vOTU.
- RPK and CPM values are calculated for each viral genome in each sample. A vOTU abundance matrix is produced with CPM values per viral genome per sample and taxonomy plots are generated with krona¹⁴.
If host prediction is selected, phage host are identified from vOTUs.
- Host calls are produced based on the user-defined confidence score (75-100).
- RPK and CPM values are calculated for phage hosts at the genus level for each sample. A host abundance matrix is produced with CPM values per genus per sample and taxonomy plots are generated with krona.

Output Files

Here we will highlight the final output of each pipeline step as well as other important files. For a complete list of the outputs of each tool (including any intermediate files), please see the individual tool's documentation linked below.

The main outputs folder contains the log file for the job:

logfile.txt main log file for the pipeline. DiscoVir uses Snakemake for the workflow, and logfile.txt has the Snakemake output for each step in the pipeline. It will list any errors (we recommend searching the document for the word "Error" or "Error in rule"), and direct users to the step's individual log file for that step in the pipeline for more information.

There are two main folder types in the output of the DiscoVir pipeline:

Sample folders: per-sample output.
Combined folders: outputs of analysis on sequences combined from across the entire dataset and tables merged from the individual sample results.

Sample folders

Each sample has an individual folder which contains subfolders for each step in the pipeline that analyzes the individual sample's sequences.

The subfolders inside the sample folder are:

genomad: outputs of geNomad.

{sample}.genomad.log: log file to check for all (STDOUT and STDERR) messages from this pipeline step.
{sample}_summary: final output of geNomad.
- {sample}_summary/{sample}_virus_summary.tsv: summary table of viral sequences identified.
- {sample}_summary/{sample}_virus.fna: FASTA of viral sequences (with host trimmed, if necessary).
- {sample}_summary/{sample}_virus_genes.tsv: summary table of genes predicted from viral sequences.
- {sample}_summary/{sample}_virus_proteins.faa: amino acid/protein FASTA of viral genes.
abund_genomad/{sample}_virus.count.CDS.cpm.txt: abundance estimates of the viral sequences.
- read counts are estimated using the ht-seq algorithm from the tool VERSE, along with estimated CPM and RPK.
- CPM is calculated the same way as TPM, but here we use as a generic term in case of DNA-seq.
abund_genomad/{sample}_virus_genes.count.CDS.cpm.txt: abundance estimates of viral genes.

CheckV: outputs of CheckV.

{sample}.checkv.log: log file.
quality_summary.tsv: summary of quality of all viral sequences.
combined.fna: viral sequences identified by CheckV (with headers given by CheckV).
checkv_filtered_genomad_viruses.fna: viral sequences filtered based on the `checkv quality` user option. This file uses the original geNomad headers which allows the pipeline to more easily track the sequences through analysis.

dramv: outputs of running DRAM-v annotate step only using the viral sequences predicted by geNomad and optionally filtered by CheckV (checkv/checkv_filtered_genomad_viruses.fna). DRAM-v finds genes and annotates them.

{sample}.dramv.log: log file.
dramv-annotate
- annotations.tsv: table of DRAM-v gene annotations.
- genes.{faa,fna,gff}: AA and nucleotide FASTA files as well as associated gff file for the genes. Coordinates in the gff are vis-a-vis the original geNomad sequences found in checkv/checkv_filtered_genomad_viruses.fna.
abund_dramv
- {sample}_dramv.count.gene.cpm.txt: abundance estimates of genes found by DRAM-v.

amgs (optional): if the run_amgs user option is chosen, this folder has the DRAM-v predicted AMGs (auxiliary metabolic genes). To predict AMGs, the pipeline first runs VirSorter2 on checkv_filtered_genomad_viruses.fna to generate the table needed by DRAM-v to predict AMGs. VirSorter2 filters out many sequences identified as viral by other tools, so this is an optional step.

vs2: output of VirSorter2.
- {sample}.vs2.log: log file.
- for-dramv/final-viral-combined-for-dramv.fna & vs2/for-dramv/viral-affi-contigs-for-dramv.tab: files used by DRAM-v to predict AMGs.
{sample}.dramv_amgs.log: DRAM-v log file.
dramv-annotate: directory of DRAM-v gene annotations - using all DRAM-v databases.
dramv-distill: directory containing the output of DRAM-v's distill step which identifies AMGs.
- amg_summary.tsv: table of potential AMGs.
abund_amgs
- {sample}_amgs.count.gene.cpm.txt: abundance estimates of genes.

diamond (optional): if the DIAMOND option is chosen, this folder contains the output of annotating the geNomad-predicted genes by aligning sequences with DIAMOND to NCBI's nr database.

{sample}.nr.diamond.tsv: table of top alignments for gene sequences with NCBI nr accession number. For full explanation of all columns see the NCBI BLAST format table under outfmt (DIAMOND uses the BLAST output format).

Combined folders

Combined folders contain outputs of analyses performed on the combined results from all samples produced earlier in the pipeline.

vOTUs: Contains outputs from vOTU clustering, summaries, and abundances.

vOTU_sequences.fasta: FASTA of final vOTU sequences.
vOTU_table_cpm.tsv: Matrix of abundances (CPM) of vOTUs for each sample.
vOTU.krona.html: Krona plots of vOTU taxonomy.
vOTU_genomad_virus_summary.tsv: geNomad summary information for vOTUs.
vOTU_clustering:
- bbtools_dedupe: outputs of deduplication step with bbtools dedupe.sh.
  - all_input_contigs.fasta: A FASTA file containing all viral genomes combined from all samples.
  - unique_seqs.fasta: A FASTA file containing all unique sequences (deduplicated sequences) from all samples.
  - bbtools_dedupe.log: log file to check for all (STDOUT and STDERR) messages from this pipeline step.
- mmseqs:
  - cluster_seqs.fasta: Viral genome FASTA sequences grouped by cluster.
  - DB_clu.tsv: Tab separated file displaying IDs of sequences within each cluster.
  - flat_DB_clu.tsv: Tab separated file displaying IDs of sequences within each cluster.
  - representative_sequences.fasta: FASTA sequences of viral genome representing each cluster, which becomes the vOTU. FASTA names include names of all viral genomes within each cluster.
  - representative_sequences.renamed.fasta: FASTA sequences of viral genome. representing each cluster, which becomes the vOTU. FASTA names are changed so that they are only the representative sequence.
  - mmseqs2_log: Log file to check for all (STDOUT and STDERR) messages from this pipeline.
  - DB: Directory containing MMseqs2 outputs and indexes for database.

vOTU_Host_predictions_iphop: outputs of iPHoP.

Host_prediction_to_genome_m##.csv: Files containing summary information of host predictions. Host predictions are made at the genome level.
Host_prediction_to_genus_m##.csv: Files containing summary information of host predictions. Host predictions are made at the genus level.
Host_cpm_table.tsv: Matrix of abundances (CPM) of phage hosts of vOTUs for each sample predicted from iphop.
Host.krona.html: Krona plots of host taxonomy.

gene_tables: Contains functional gene and AMG abundances and summaries.

dramv_kofam_hits_cpm.tsv: A matrix of abundances (CPM) of kofam hits from DRAM-v.
dramv_pfam_hits_cpm.tsv: A matrix of abundances (CPM) of Pfam hits from DRAM-v.
dramv_vogdb_hits_cpm.tsv: A matrix of abundances (CPM) of VOGID hits from DRAM-v.
dramv_heatmap.pdf: Heatmap of abundances (CPM) of top VOGID hits from DRAM-v.
amg_cpm.tsv: If the AMG option is selected, this file will be produced containing a matrix of CPM abundances of AMGs for each sample.
amg_heatmap.pdf: Heatmap of abundances (CPM) of AMGs.

Packages and Versions

geNomad v1.7
VERSE v0.1.5
CheckV v0.8.1
seqtk v1.3
BBMap v38.90
MMseqs2 v13.45111
VirSorter2 v2.2.3
DRAM v1.4.6
iPHoP v1.3.0
DIAMOND v2.0.15
Liftoff v1.6.3
krona v2.8.1
plotnine v0.12.4

Databases

geNomad: v1.7
CheckV: v1.5
DRAM: v1.4.6
VirSorter2.0: v2.2.3
iPHoP: Sept_2021_pub_rw
nr: updated as of November 15, 2023

References

https://www.protocols.io/view/wgsa2-workflow-a-tutorial-n92ldm98xl5b/v1
Camargo, A. P., Roux, S., Schulz, F., Babinski, M., Xu, Y., Hu, B., ... & Kyrpides, N. C. (2023). Identification of mobile genetic elements with geNomad. Nature Biotechnology, 1-10. doi: 10.1038/s41587-023-01953-y.
Nayfach, S., Camargo, A. P., Schulz, F., Eloe-Fadrosh, E., Roux, S., & Kyrpides, N. C. (2021). CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nature biotechnology, 39(5), 578-585. doi: 10.1038/s41587-020-00774-7.
Shaffer, M., Borton, M. A., McGivern, B. B., Zayed, A. A., La Rosa, S. L., Solden, L. M., ... & Wrighton, K. C. (2020). DRAM for distilling microbial metabolism to automate the curation of microbiome function. Nucleic acids research, 48(16), 8883-8900. doi: 10.1093/nar/gkaa621.
Guo, J., Bolduc, B., Zayed, A. A., Varsani, A., Dominguez-Huerta, G., Delmont, T. O., ... & Roux, S. (2021). VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome, 9, 1-13.
Roux, S., Camargo, A. P., Coutinho, F. H., Dabdoub, S. M., Dutilh, B. E., Nayfach, S., & Tritt, A. (2023). iPHoP: An integrated machine learning framework to maximize host prediction for metagenome-derived viruses of archaea and bacteria. PLoS biology, 21(4), e3002083.
Buchfink, B., Reuter, K., & Drost, H. G. (2021). Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature methods, 18(4), 366-368. doi: 10.1038/s41592-021-01101-x.
Zhu, Q., Fisher, S. A., Shallcross, J., & Kim, J. (2016). VERSE: a versatile and efficient RNA-Seq read counting tool. bioRxiv, 053306. doi: 10.1101/053306.
https://github.com/lh3/seqtk
Shumate, A., & Salzberg, S. L. (2021). Liftoff: Accurate mapping of gene annotations. Bioinformatics, 37(12), 1639–1643. doi: 10.1093/bioinformatics/btaa1016.
https://github.com/has2k1/plotnine
https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/dedupe-guide/
Steinegger, M., & Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature biotechnology, 35(11), 1026-1028. doi:10.1038/nbt.3988.
Ondov, B. D., Bergman, N. H., & Phillippy, A. M. (2011). Interactive metagenomic visualization in a Web browser. BMC bioinformatics, 12, 1-10. doi: 10.1186/1471-2105-12-385

Quick links