DiscoVir is a bioinformatics pipeline for exploring viruses (ssDNA, dsDNA phage, and giant DNA viruses) and viral diversity in metagenomes. DiscoVir integrates the most up-to-date and comprehensive tools available for analyzing viruses from metagenomics data. The pipeline code can be found in our GitHub repository.
The pipeline accepts metagenomic assembly sequences (.fasta) and a corresponding binary alignment map (.bam) file of the reads mapped back to the assembly as input. These files are produced by the WGSA2 pipeline [1] in Nephele or can be generated elsewhere. Both files are required for each sample submitted to the pipeline and must be listed in the mapping file.
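Before submitting a job, it can help to confirm that every sample really has both files. The sketch below is only an illustration and is not part of DiscoVir: the `check_inputs` helper and the `{sample: (fasta, bam)}` dictionary are hypothetical, and the layout of the actual mapping file is not assumed here.

```python
from pathlib import Path


def check_inputs(samples):
    """Return a list of problems found in a {sample: (fasta, bam)} mapping.

    The mapping layout is hypothetical; it only mirrors the rule that every
    sample needs both an assembly FASTA and a BAM of reads mapped to it.
    """
    problems = []
    for sample, (fasta, bam) in samples.items():
        for path, suffix in ((Path(fasta), ".fasta"), (Path(bam), ".bam")):
            if path.suffix != suffix:
                # Wrong extension: report it and skip the existence check.
                problems.append(f"{sample}: expected a {suffix} file, got {path.name}")
            elif not path.is_file():
                problems.append(f"{sample}: missing file {path}")
    return problems
```

An empty list means every sample has a .fasta and a .bam file on disk; any other result lists what to fix before submission.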
geNomad [2] produces scores that indicate the confidence of its viral genome predictions. These scores are calibrated so that false discovery rates can be computed, which improves the performance of the tool. The user can change the filtering stringency with one of three presets that take into account these scores, the false discovery rates, and other parameters used in the predictions. Please see the geNomad preset help for more details. Filtering options:
The user can use CheckV [3] to filter out low-quality viral genomes before downstream processing and analysis. Filtering options are:
Minimum length of input sequences for vOTU clustering, DRAM-v [4] annotation, and, if AMGs are selected, VirSorter2.0 [5]. Range: 100-100000. Default is 5000.
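To make the effect of the minimum length option concrete, here is a minimal sketch of a length filter over FASTA records. It is an illustration under stated assumptions, not the pipeline's own code: the function names are hypothetical, and we assume that sequences exactly at the threshold are kept.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length=5000):
    """Keep records whose sequence is at least min_length bases long.

    The allowed range (100-100000) and the default (5000) come from the
    option description; keeping sequences equal to the threshold is an
    assumption.
    """
    if not 100 <= min_length <= 100000:
        raise ValueError("min_length must be between 100 and 100000")
    return [(h, s) for h, s in records if len(s) >= min_length]
```

With the default of 5000, a 4999-base sequence would be dropped and a 5000-base sequence kept.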
Host prediction (optional)
Here we highlight the final output of each pipeline step as well as other important files. For a complete list of the outputs of each tool (including any intermediate files), please see the individual tool's documentation linked below.
The main outputs folder contains the log file for the job:
logfile.txt: main log file for the pipeline. DiscoVir uses Snakemake for the workflow, and logfile.txt has the Snakemake output for each step in the pipeline. It lists any errors (we recommend searching the document for the word "Error" or "Error in rule") and directs users to the individual log file of the failing step for more information.
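The recommended search can also be scripted. The following is a minimal sketch and not part of DiscoVir; `find_errors` is a hypothetical helper name. Because "Error in rule" contains the word "Error", a single search term covers both.

```python
def find_errors(log_path, pattern="Error"):
    """Return (line_number, text) for each log line containing the pattern."""
    hits = []
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if pattern in line:
                hits.append((number, line.rstrip("\n")))
    return hits
```

Calling `find_errors("logfile.txt")` returns the matching lines with their line numbers, which makes it easier to locate the step whose individual log file should be inspected next.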
There are two main folder types in the output of the DiscoVir pipeline:
Each sample has an individual folder which contains subfolders for each step in the pipeline that analyzes the individual sample's sequences.
The subfolders inside the sample folder are:
genomad: outputs of geNomad.
  {sample}.genomad.log: log file to check for all (STDOUT and STDERR) messages from this pipeline step.
  {sample}_summary: final output of geNomad.
  {sample}_summary/{sample}_virus_summary.tsv: summary table of viral sequences identified.
  {sample}_summary/{sample}_virus.fna: FASTA of viral sequences (with host trimmed, if necessary).
  {sample}_summary/{sample}_virus_genes.tsv: summary table of genes predicted from viral sequences.
  {sample}_summary/{sample}_virus_proteins.faa: amino acid/protein FASTA of viral genes.
  abund_genomad/{sample}_virus.count.CDS.cpm.txt: abundance estimates of the viral sequences.
  abund_genomad/{sample}_virus_genes.count.CDS.cpm.txt: abundance estimates of viral genes.
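The geNomad summary table is tab-separated and can be explored with a few lines of Python. This is a minimal sketch: it assumes the table has a header row with a `virus_score` column, as in geNomad's documented output, and `load_virus_summary` is a hypothetical helper name. Please check the header of your own file before relying on it.

```python
import csv


def load_virus_summary(path, min_score=0.0):
    """Read a geNomad virus summary and keep rows at or above min_score.

    Assumes a tab-separated file with a header row that includes a
    'virus_score' column (an assumption; verify against your file).
    """
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    return [row for row in rows if float(row["virus_score"]) >= min_score]
```

Each returned row is a dictionary keyed by column name, so other columns in the table can be read in the same way.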
CheckV: outputs of CheckV.
  {sample}.checkv.log: log file.
  quality_summary.tsv: summary of quality of all viral sequences.
  combined.fna: viral sequences identified by CheckV (with headers given by CheckV).
  checkv_filtered_genomad_viruses.fna: viral sequences filtered based on the `checkv quality` user option. This file uses the original geNomad headers, which allows the pipeline to more easily track the sequences through analysis.
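To get a quick overview of a sample, the quality tiers in quality_summary.tsv can be tallied. The sketch below assumes a tab-separated file with a header row containing a `checkv_quality` column, as in CheckV's documented output; `tally_quality` is a hypothetical helper name.

```python
import csv
from collections import Counter


def tally_quality(path):
    """Count sequences per CheckV quality tier.

    Assumes a tab-separated file with a 'checkv_quality' column
    (an assumption; verify against your file's header).
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row["checkv_quality"] for row in reader)
```

The result maps each tier name found in the file to the number of sequences assigned to it.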
dramv: outputs of running only the DRAM-v annotate step on the viral sequences predicted by geNomad and optionally filtered by CheckV (checkv/checkv_filtered_genomad_viruses.fna). DRAM-v finds genes and annotates them.
  {sample}.dramv.log: log file.
  dramv-annotate
    annotations.tsv: table of DRAM-v gene annotations.
    genes.{faa,fna,gff}: AA and nucleotide FASTA files as well as the associated gff file for the genes. Coordinates in the gff are relative to the original geNomad sequences found in checkv/checkv_filtered_genomad_viruses.fna.
  abund_dramv
    {sample}_dramv.count.gene.cpm.txt: abundance estimates of genes found by DRAM-v.
amgs (optional): if the run_amgs user option is chosen, this folder has the DRAM-v predicted AMGs (auxiliary metabolic genes). To predict AMGs, the pipeline first runs VirSorter2 on checkv_filtered_genomad_viruses.fna to generate the table that DRAM-v needs. VirSorter2 filters out many sequences identified as viral by other tools, so this is an optional step.
  vs2: output of VirSorter2.
    {sample}.vs2.log: log file.
    for-dramv/final-viral-combined-for-dramv.fna & vs2/for-dramv/viral-affi-contigs-for-dramv.tab: files used by DRAM-v to predict AMGs.
  {sample}.dramv_amgs.log: DRAM-v log file.
  dramv-annotate: directory of DRAM-v gene annotations, using all DRAM-v databases.
  dramv-distill: directory containing the output of DRAM-v's distill step, which identifies AMGs.
    amg_summary.tsv: table of potential AMGs.
  abund_amgs
    {sample}_amgs.count.gene.cpm.txt: abundance estimates of genes.
diamond (optional): if the DIAMOND option is chosen, this folder contains the output of annotating the geNomad-predicted genes by aligning sequences with DIAMOND to NCBI's nr database.
  {sample}.nr.diamond.tsv: table of top alignments for gene sequences with NCBI nr accession number. For a full explanation of all columns, see the NCBI BLAST format table under outfmt (DIAMOND uses the BLAST output format).
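As an illustration of working with the alignment table, the sketch below keeps the highest-scoring hit for each gene. It assumes the file has no header row and uses the 12 default BLAST tabular columns; the pipeline may write a different set of columns, in which case the `columns` argument should be changed to match. `best_hits` is a hypothetical helper name.

```python
import csv

# Default BLAST tabular (outfmt 6) columns. Whether the pipeline uses
# exactly these columns is an assumption; adjust the list if it does not.
BLAST_TAB_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(path, columns=BLAST_TAB_COLUMNS):
    """Return {query: row}, keeping the highest-bitscore hit per query."""
    best = {}
    with open(path, newline="") as handle:
        for fields in csv.reader(handle, delimiter="\t"):
            row = dict(zip(columns, fields))
            query = row["qseqid"]
            current = best.get(query)
            if current is None or float(row["bitscore"]) > float(current["bitscore"]):
                best[query] = row
    return best
```

Each value in the returned dictionary is one alignment row keyed by column name, so `best_hits(path)[gene]["sseqid"]` gives the subject identifier of the best hit for that gene.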
Combined folders contain outputs of analyses performed on the combined results from all samples produced earlier in the pipeline.
vOTUs: Contains outputs from vOTU clustering, summaries, and abundances.
  vOTU_sequences.fasta: FASTA of final vOTU sequences.
  vOTU_table_cpm.tsv: Matrix of abundances (CPM) of vOTUs for each sample.
  vOTU.krona.html: Krona plots of vOTU taxonomy.
  vOTU_genomad_virus_summary.tsv: geNomad summary information for vOTUs.
  vOTU_clustering:
    bbtools_dedupe: outputs of deduplication step with bbtools dedupe.sh.
      all_input_contigs.fasta: A FASTA file containing all viral genomes combined from all samples.
      unique_seqs.fasta: A FASTA file containing all unique sequences (deduplicated sequences) from all samples.
      bbtools_dedupe.log: log file to check for all (STDOUT and STDERR) messages from this pipeline step.
    mmseqs:
      cluster_seqs.fasta: Viral genome FASTA sequences grouped by cluster.
      DB_clu.tsv: Tab-separated file displaying IDs of sequences within each cluster.
      flat_DB_clu.tsv: Tab-separated file displaying IDs of sequences within each cluster.
      representative_sequences.fasta: FASTA sequences of the viral genome representing each cluster, which becomes the vOTU. FASTA names include the names of all viral genomes within each cluster.
      representative_sequences.renamed.fasta: FASTA sequences of the viral genome representing each cluster, which becomes the vOTU. FASTA names are changed so that they contain only the name of the representative sequence.
      mmseqs2_log: Log file to check for all (STDOUT and STDERR) messages from this pipeline step.
      DB: Directory containing MMseqs2 outputs and indexes for database.
vOTU_Host_predictions_iphop: outputs of iPHoP.
  Host_prediction_to_genome_m##.csv: Files containing summary information of host predictions. Host predictions are made at the genome level.
  Host_prediction_to_genus_m##.csv: Files containing summary information of host predictions. Host predictions are made at the genus level.
  Host_cpm_table.tsv: Matrix of abundances (CPM) of the phage hosts of vOTUs predicted by iPHoP, for each sample.
  Host.krona.html: Krona plots of host taxonomy.
gene_tables: Contains functional gene and AMG abundances and summaries.
  dramv_kofam_hits_cpm.tsv: A matrix of abundances (CPM) of kofam hits from DRAM-v.
  dramv_pfam_hits_cpm.tsv: A matrix of abundances (CPM) of Pfam hits from DRAM-v.
  dramv_vogdb_hits_cpm.tsv: A matrix of abundances (CPM) of VOGID hits from DRAM-v.
  dramv_heatmap.pdf: Heatmap of abundances (CPM) of top VOGID hits from DRAM-v.
  amg_cpm.tsv: If the AMG option is selected, this file will be produced containing a matrix of CPM abundances of AMGs for each sample.
  amg_heatmap.pdf: Heatmap of abundances (CPM) of AMGs.
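Several of the tables above report abundances in CPM (counts per million). As a reminder of what that unit means, the sketch below applies the plain definition to raw counts from one sample. It is an illustration only: `counts_to_cpm` is a hypothetical helper name, and the pipeline's own abundance estimates may include further normalization that is not reproduced here.

```python
def counts_to_cpm(counts):
    """Scale raw counts from one sample so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        # Avoid dividing by zero when a sample has no counts at all.
        return {name: 0.0 for name in counts}
    return {name: value / total * 1_000_000 for name, value in counts.items()}
```

Because every sample is scaled to the same total, CPM values can be compared between samples with different sequencing depths.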