Getting Started About Nephele Pipelines Pipeline Descriptions Test Datasets FAQ

Frequently Asked Questions

Pipeline Submission Tips

How do I know if my data is paired-end or single-end?

Paired End: Paired-end sequencing involves sequencing DNA from both ends of a fragment. Choose this option if your sequence files are paired and have been demultiplexed. For example, Illumina MiSeq Paired End FASTQ files consist of two files for each sample with names ending on "R1_001.fastq" and "R2_001.fastq"
Single End: Single-end sequencing involves sequencing DNA from only one end. Choose this option if your samples were only sequenced from one end, if you don't want to use the reverse FASTQ file, or if your reads are merged.
If you need further information to help you decide what type of file you have, see the SRA file type information page.

How do I know if my data is demultiplexed?

Each sequence file should only contain data from a single sample. For paired-end data, you should have 2 files per sample, and for single-end, just one file per sample. Nephele only accepts demultiplexed data at this time.

Is it possible to submit 16S amplicon data to a shotgun metagenomics pipeline?

The shotgun metagenomics pipelines use whole genome shotgun data for taxonomic and functional characterization. They are not designed to work with amplicon data from a single region, but rather make use of marker sequences from along the entire genome. The pipelines may give errors or unexpected output:

bioBakery: The QC step of the bioBakery pipeline removes many 16S (and other amplicon) sequences as they are often over-abundant in samples due to contamination. For more information, see the bioBakery kneadData QC and MetaPhlAn2 tutorial pages.
WGSA: The whole genome shotgun assembly pipeline should process amplicon sequences through assembly, but the taxonomic classification may not be as accurate and the gene prediction and annotation step may fail since functional genes will not be present in the data. See the CheckM wiki to learn more about the taxonomic lineage assignment, and the Prokka README for more on the functional annotation methods used by MetaProkka.

Can I submit already merged read pairs to Nephele?

If you have reads that are already joined, submit the dataset as single-end samples.

How should I choose the values for the optional parameters?

The optional parameters were carefully chosen based on (1) the most common scenarios of NGS data analysis, (2) the suggestions from the developers, and (3) published results. The different pipelines available on Nephele target different kinds of NGS studies, such as whole genome shotgun sequencing, 16S microbiome survey, and functional annotation of microbial community.

Most users submit their jobs with default values of the optional parameters. In our experience, more experienced bioinformaticians change the parameters to optimize their input data. We also have received feedback from novice microbiome researchers or students that they study the optional parameters (reading help text and testing different values-even if it fails) as a part of learning microbiome analysis.

How can I upload data to Nephele more efficiently?

Rather than uploading large sequence files from your computer (and potentially over long distances or slow/wireless networks), you can likely save time by first uploading your files to Globus, Google Drive or BaseSpace. Nephele will verify that it can access the files, and then it will retrieve them after you click submit. Please see Select upload method in Step 3 of the user guide. Google Drive users may find the command line tool rclone useful, especially for uploading from a server or HPC (Google Drive-specific instructions).

Note: regardless of the storage mechanism you choose, we will only retrieve the files that are specified in your mapping file.

How long does a typical analysis take? Is this normal?

On average, it takes about 1.4 hours to process 1G of compressed data. Nephele uses multiple processor cores to speed up the analysis pipelines.

Why would you want to run Nephele's Pre-processing QC pipeline before you run a microbiome analysis?

Studies show that quality filtering can greatly improve microbiome analysis results. Best practices on working with sequencing data include doing a series of QC steps to verify and even improve the quality of the data. Our Pre-processing QC pipeline was designed to run a quality check by default, so the user can run it without choosing any options and receive FastQC tables and graphs providing information on the quality of individual samples. After evaluating these results, the user can submit their files to an analysis pipeline or return to the QC pipeline to trim reads and merge read pairs as needed.

Even though our 16S and WGS pipelines include quality filtering, trimming and merging steps, it may be best to run those processing steps separately ahead of time. We have incorporated the tools cutadapt and Trimmomatic in our Pre-processing QC pipeline steps to give users more control for modifying parameters, which can be helpful for some datasets, especially if the amplicon region is variable length. For the read merging step, we have integrated the FLASH merger, which some results show might provide better precision and recall than the native tools used by QIIME1 and mothur. For longer amplicon regions with a short overlap between paired reads, FLASH may perform better than the DADA2 merger. So, we designed the QC pipeline to provide these programs for our users as well to help them get better results. For, more information about the tools we use see the details page.

Some usage examples:

Run paired-end data through this pipeline, choosing to merge the reads, and then submit the resulting FASTQ files to the DADA2 or QIIME1 Single End pipelines.
Examine the average per-base quality scores from the FastQC results of the pipeline, and use that information to set the Truncation length parameter in DADA2 or the Minimum Phred quality score parameter in QIIME1.

How long do we keep your uploaded data and result files?

Nephele system keeps your uploaded data and result files for 90 days from the time you submit a job. During the 90 days, you can download the result file and resubmit your job using the jobID. After the 90 days, both your uploaded data and result files will be automatically deleted.

Why has Nephele decided to retire QIIME 1.9 16S pipeline now? And what does it mean to you?

The developers of QIIME released QIIME 2.0 originally in 2017 and announced they would discontinue the support for QIIME (version 1.9). A manuscript was published in July of 2019 to describe the new plugin-based architecture of QIIME 2.0. The original version of QIIME offered clustering tools such as uclust and usearch for closed, open, and de novo OTU clustering. It was common practice to use the open-reference clustering at 97% similarity. The new version includes plugins such as DADA2 (running the DADA2 R package ) and Deblur that improve quality control, perform denoising and return sequence variants. The documentation of QIIME 2.0 recommends the use of these denoising algorithms over the clustering methods used previously. For researchers who still found it useful to cluster the reads into OTUs, the QIIME team later included plugins for clustering using vsearch into their QIIME 2.0 architecture.

The Nephele team has adopted QIIME 2.0 for the clustering steps (QIIME 2.0 16S pipeline), the Deblur denoising algorithm (QIIME 2.0 16S pipeline), and several visualization options in the Downstream Analysis Pipeline (Explore tab). Even though QIIME 2.0 also offers a plugin for DADA2, the Nephele team decided to implement a separate pipeline using the native DADA2 R package.

Suppose you are a user of the QIIME pipeline and are wondering which pipeline to use after Nephele retires QIIME 1.9 pipeline (OTU clustering method). In that case, we recommend that you adopt the denoising method available in the DADA2 pipeline (for paired-end or single-end) or in the QIIME 2.0 pipeline when you use the Deblur option for single-end reads. Alternatively, you could decide to continue using clustering-based methods available in the new QIIME 2.0 16S pipeline (vsearch option) or the clustering method available on the mothur pipeline if you have a short amplicon design such as the V4 16S region and good quality data.

Why has Nephele decided to retire QIIME 1.9 ITS pipeline now? And what does it mean to you?

The developers of QIIME released QIIME 2.0 originally in 2017 and announced they would discontinue the support for QIIME (version 1.9). The current ITS pipeline on Nephele is based on QIIME 1.9, therefore a better supported method was needed. The Nephele team has decided to use DADA2 which improves quality control, performs denoising and returns sequence variants. The pipeline is based on the DADA2 ITS Tutorial.

Where is QIIME2?

QIIME2 is a framework which runs other third-party tools for analysis - including VSEARCH and DADA2. For computational reasons, we sometimes use the QIIME2 framework, as in our VSEARCH and Downstream Analysis pipelines, but for our DADA2 pipeline, we run the DADA2 package directly. This allows us to provide more detailed output and have more flexibility in user options. See pipelines descriptions for more information.

What are "recommended" pipelines?

These are the pipelines we recommend for most datasets. If you are new to metagenomics/genomics, we suggest using the recommended pipeline, as these are the most robust, efficient, and generally more accurate based on the literature and our testing. Our other pipelines are for more advanced users who want to try other tools.

How do I prepare my WGSA2 job outputs for analysis with DiscoVir, Nephele's virome analysis pipeline?

DiscoVir can use the metagenomic assembly scaffolds and bam (made from mapping reads back to the scaffolds) files from the WGSA2 pipeline. You can find the FASTA files and bam files in the asmb_files directory of the WGSA2 outputs folder.

Troubleshooting Nephele Errors

I received an error email; now what?

We recommend that you start by examining the logfile.txt file, which can be found directly on the results download page as well as in the PipelineResults.JOBID.tar.gz directories. Specifically, you can do a text search for ERROR to see some common errors that can arise with data analyses on Nephele. Many of these errors are described further in additional FAQs here, which provide detailed suggestions or solutions. If you continue to have issues, please do not hesitate to send us a support request.

The logfile shows that some files are missing but, in fact, I did upload them.

This is most likely due to corrupted sequence data files. File corruption is not uncommon on slow or unreliable network connections. If the file is gzipped and cannot be uncompressed successfully, Nephele will report an error of missing files. If this happens to your submission, there are possible remedies: (1) ensure that the file is not already corrupted on your computer, (2) try to find a faster, reliable connection, and (3) use software with file integrity check option such as FileZilla with FTP upload.

The logfile is empty or it only contains a single line?

This is most likely the result of having a bad or corrupt .gz files. You'll need to recreate your .gz files before resubmitting your pipeline. The .gz files should only contain a single sequence data file, and NO FOLDERS. MacOS users should create your .gz file using the command line, (Terminal.app instructions).

I received a Downstream Analysis Pipeline error from core_metrics or alpha_group_significance.

Errors from QIIME 2's diversity plugin visualizers core metrics and alpha group significance are usually because the sampling depth chosen filters out more samples than you intended. In particular, see the requirements for alpha group significance to run. You can often diagnose these errors by looking at the summary.qzv file on QIIME 2's view page. The summary visualization gives you the ability to modify the sampling depth and see which samples and metadata groups would remain after filtering.

For the mothur pipeline, what does it mean that my inputs or distance matrix are too large?

Pat Schloss, the author of mothur, has written this very informative blog post: Why do I have such a large distance matrix? about this issue. After reading this post, you may want to run our QC pipeline to check and quality filter your input data. Nephele has limited resources and cannot handle such large data files for our mothur pipeline, so we check the size of the input data, and if it is above 10Gb, we do not start the pipeline. Additionally, we check the size of the split distance matrices from the first part of mothur's cluster.split command, and if any are larger than 36Gb (or 60% of available memory), we do not run the rest of the pipeline. In this case, OTU clustering and visualizations will not be made.

My job experienced a validation failure due to constant quality scores; what does this mean?

Before starting the DADA2 pipeline, we validate your FASTQ files to confirm that the quality scores are not constant. Constant quality scores make it impossible for DADA2 to denoise sequences before identifying an amplicon sequence variant. To confirm that your data will process successfully through DADA2, we suggest running the FASTQ files through our QC pipeline first and reviewing the Sequence Quality Histograms. If they look something like this:

With straight lines throughout the plot, it indicates that you have constant quality scores and you will be unable to process your data through DADA2.
There are a few potential remedies. First, if you downloaded your FASTQs from SRA, ensure that you use the SRA Normalized Format and not the SRALite files. Additional information can be found here and here. Alternatively, if you are unable to download or recover the SRA Normalized Format, you can consider running your FASTQ files through the QIIME2 pipeline and choose vsearch clustering. With this pipeline, you can obtain clustered operational taxonomic units, which may still be relevant for your question.

Problems with Output Files

Why didn't I get the expected images from core diversity analysis for the QIIME1 or mothur pipelines? Why didn't I get the expected html report for the bioBakery WGS pipeline?

The QIIME1 core diversity analysis and the bioBakery wmgx_vis workflow each require a minimum of 3 samples. If you submitted a mapping file with more than three samples, check the contents of logfile.txt. It is possible that one or more of your samples did not have the minimum number of OTUs or reads and was excluded from further analysis. This will be indicated in the logfile.txt output.

Why are only some of my samples appearing in the downstream analysis for the amplicon pipelines?

Missing samples are typically the result of poor or low OTU or sequence variant counts for those samples. To identify which samples have been excluded from your final analysis look at the samples_being_ignored.txt file. You may also look at the logfile.txt to see why those samples have been excluded. Samples that have low OTU or sequence variant counts are sometimes removed because of the Sampling depth cutoff parameter. If you do not specify the parameter, please see FAQ:How is the sampling depth calculated? for more information. If you open the otu_summary_table.txt file, you can see OTU counts for all of your samples. Adjusting the Sampling depth parameter accordingly (i.e., entering a value that will include all of your samples) in a new run with the same data will resolve this issue. The parameter can be set under the Analysis tab of the job submission page, and you can use the job resubmission feature of Nephele to more easily resubmit your data with a different value.

Why do some of the samples have so few counts in the DADA2 pipeline results?

The DADA2 pipeline is highly sensitive to sequence quality and primer trimming. It is very important to specify the correct primer lengths at job submission (or remove the primers from the data before submitting), as these sequences may interfere with the denoising of the reads as well as with chimera removal (if you are in doubt of the primer lengths, we advise you not to choose the chimera removal option). See this DADA2 FAQ for more information.

The DADA2 pipeline produces quality profile plots that you can look at to gauge the quality of your data (qualityProfile_R1/2.pdf). If the data is poor quality, the reads may be filtered out during the filterAndTrim step. You can also see a table in the log file of how many reads pass this step. Additionally, if the data is poor quality, reads that pass the filter may be trimmed too much in the filterAndTrim step, and may not merge properly in the mergePairs step. You can search the log file for paired-reads for how many reads successfully merged for each sample. Sometimes, it is helpful to use a trimming program such as cutadapt, Trimmomatic, or BBDuk to trim for quality (and/or primers) prior to running DADA2. You can use Nephele's QC pipeline to do this pre-processing of your data; see here for more information.

I'm having problems with the heatmaps from the 16S visualizations.

The interactive heatmaps are produced using Morpheus by the Broad Institute. They work best in Google Chrome or Firefox. You may have problems with Microsoft Edge, Internet Explorer, or Safari.

How do I open the xxx.tar.gz file on Windows?

Computers running Windows (7, 8, or 10) requires a third-party program like WinZip (commericial) or 7-Zip (open source). A file with a .tgz or .tar.gz extension is a Gnu Zip compressed Tar Archive file. They are made up of files that have been placed in a TAR archive and then compressed using Gnu Zip.

I have results from several independent WGSA2 runs that I would like to combine so that I can analyze my samples together. How can I do that?

We have written a custom script that can perform this task for you, which you can download here.

Next, gather the files you plan to combine into a single directory. When you download and unzip the results from your job, you will have a directory called outputs/. Navigate to TAXprofiles or PWYprofiles and find the bin directory. It should contain a number of files ending in 4krona.txt. Move these files from each Nephele job into a single, combined folder. The file names themselves can change as desired, as long as TAX and PWY files are not combined.

Then, from the command line, run the Rscript command. Rstudio also includes a "Terminal" tab next to the "Console" tab, that can run Rscript commands.

Let's say I have gathered all 4krona files into combine_my_files. I would next run:

Rscript WGSA2_MergeKronaFiles.R --binDIR combine_my_files --outTYPE text --outFILE analysis_table

This would result in an output file called analysis_table.txt, which would be a complete table from all files contained in combine_my_files.

We currently support output into text, phyloseq, or biom formats for easy transfer to your preferred analysis software.

Miscellaneous

How do I cite Nephele?

See the Citing Nephele page.

How do I know which software versions were run at the time of my analysis?

Please refer to the Release Notes to see when Nephele updates were made. Also, in the initial email you receive for each job, you will find the version of Nephele that corresponds to the Release Notes, as well as a copy of all the parameters that were selected for that job. Software package versions for the pipelines are also listed in the log files and current versions can be found on the pipeline details pages.

How is sampling depth calculated?

Choosing a sampling depth is generally arbitrary. Generally, it's recommended to choose a value high enough that you are able to capture the diversity present in samples with high read counts, but low enough to include the majority of your samples. For a simple community with only a handful of abundant members, for example, a sampling depth of 5,000 or less may suffice for an accurate estimate of diversity. For a more complex community with many low abundant members, however, a much higher value for sampling depth, 10,000 or higher, is generally necessary.

Nephele specifies the sampling depth of 10,000 reads as the minimum requirement for all downstream analysis. The pipelines use the following logics to determine the sampling depth:

apply the user-specified sampling depth, if available
set the sampling depth based on the sample with the least number of reads if it is greater than or equal to 10,000
otherwise, no downstream analysis is performed.

Note: Users are encouraged to specify the sampling depth that is most appropriate for their studies. There is really no formula that can precisely determine the most appropriate value simply based on the distribution of read counts and the number of samples. If the pipeline does not generate any downstream analysis for you samples, it is most likely that the sample with the least number of reads is below 10,000. You will need to lower the sampling depth in order to run the downstream analysis.

How do I download job results and other files from Nephele via command line?

You can use a Unix utility like wget (FAQ) to transfer files to your computer using a terminal program (e.g. a Linux terminal, macOS Terminal.app, or Windows Command Prompt/PowerShell). Right click on any button or link for downloads on the Nephele website, and copy the link address to use with wget on the command line. Here is an example command for downloading job results:

wget -O results.tar.gz "https://nephele.niaid.nih.gov/result_link/1bee6ca12909"

Downloading via command line is useful if you would like to transfer job results to an HPC or other remote machine or if the file to download is large and it may take a while for the transfer to complete.

WGSA2 TEDreads can be downloaded similarly, but will have a .tar-only extension.

Is it still free to use Nephele? Will it continue to be free?

Nephele is currently provided to the research community free of charge (this may change in the future).

What are the educational scenarios for usage of Nephele?

Since the public launch in 2016, Georgetown University, North Carolina State University, and University of Florida have used Nephele in their microbiology classes. Nephele has mainly served to process FASTQ files, and then students have studied the result files by visualizing the processed data and making sense of what it means biologically. Besides the classroom, Ph.D. candidates or graduate students who do not have access to a high computing environment use Nephele for their research. If you are interested in using Nephele in your class, please let us know to see how we can help.

Can I add my own pipelines to Nephele?

We also like learning about new and different pipelines that could better serve your research and educational needs! If you have a suggestion of a tool or analysis for Nephele, please fill out this form. We are interested in hearing more about the research needs of our users!

How can I get insight into Nephele pipeline performance?

We have built a metrics dashboard to monitor the performance of Nephele workers that run the pipelines. The dashboard is intended for internal use and the data presented there are not part of the results that should be inspected by users. Nontheless, if you are interested in the resources used towards analyzing your data, please access the dashbard at https://nephele.niaid.nih.gov/metrics/<job_id>

Quick links

Nephele User Guide

Frequently Asked Questions

Pipeline Submission Tips

Troubleshooting Nephele Errors

Problems with Output Files

Miscellaneous