HTSQualC Quick Start¶

Goal¶

HTSQualC is an automated quality control analysis tool for a single and paired-end high-throughput sequencing data (HTS) generated from Illumina sequencing platforms.

Features¶

Simultaneously filter and/or trim reads for adapter or primer contamination, uncalled bases (N), and low-quality reads
Supports single and paired-end reads
Analyze multiple samples simultaneously
Parallel computation for accelerating the speed of analysis
Visualization and statistics
No dependency on an external open-source tool

Prerequisites¶

Downloads, access, and services¶

To complete this tutorial, you will need access to the following services/software

Prerequisite	Preparation/Notes	Link/Download
CyVerse account	You will need a CyVerse account to complete this exercise	Register: CyVerse User Portal

Platform(s)¶

We will use the following CyVerse platform(s):

Platform	Interface	Link	Platform Documentation	Quick Start
Data Store	GUI/Command line	Data Store	Data Store Manual	Data Store Guide
Discovery Environment	Web/Point-and-click	Discovery Environment	DE Manual	Discovery Environment Guide

Input and example data¶

To complete this tutorial, you will need to have the following inputs prepared

Input File(s)	Format	Preparation/Notes	Example Data
HTS generated Sequence reads	FASTQ	Single and Paired-end sequence reads generated from Illumina sequencing platform	Test data for this is available in the Discovery Environment in the Data window under Community Data -> iplantcollaborative -> example_data -> htseqqc

Get started:¶

Tip

If you have not already imported your own sequence read files to CyVerse, you can follow the instructions for uploading data, for example using Cyberduck, in our Data Store Guide.

Login to the Discovery Environment.
Click on the Apps in the DE workspace and search for HTSQualC. Click on HTSQualC to open the app or click the button below:
Under “Analysis Name”, you can use default name or change as appropriate.
Under “Input”, provide singe (Input file 1 only) or paired-end (both Input file 1 and 2) data by clicking “Add” button. Here, you can provide single or multiple files for the analysis.
Under “Parameter”, select appropriate parameters to run the analysis. Check below for detailed parameters.
Click Launch Analysis. You will receive a notification and may close the Apps window.
Click on Analyses from the DE workspace and monitor the status of your submitted job (You may have to click refresh to view updated status).
In the Analysis console, once your status appears as ‘Completed,’ click on the name of your analysis to navigate you to the results.

HTSQualC optional parameters:¶

-a INPUT_FILES_1, --p1 INPUT_FILES_1
                      Single end input files or left files for paired-end
                      data (.fastq, .fq). Multiple sample files must be
                      separated by comma or space
-b INPUT_FILES_2, --p2 INPUT_FILES_2
                      Right files for paired-end data (.fastq, .fq).
                      Multiple files must be separated by comma or space
-c QUAL_FMT, --qfmt QUAL_FMT
                      Quality value format [1= Illumina 1.8, 2= Illumina
                      1.3,3= Sanger]. If quality format not provided, it
                      will automatically detect based on sequence data
-e N_CONT, --nb N_CONT
                      Filter the reads containing given % of uncalled bases
                      (N)
-f ADPT_SEQS, --adp ADPT_SEQS
                      Trim the adapter and truncate the read sequence
                      (multiple adapter sequences must be separated by
                      comma)
-d MIN_SIZE, --msz MIN_SIZE
                      Filter the reads which are lesser than minimum size
-g ADPT_MATCH, --per ADPT_MATCH
                      Truncate the read sequence if it matches to adapter
                      sequence equal or more than given percent (0.0-1.0)
                      [default=0.9]
-i QUAL_THRESH, --qthr QUAL_THRESH
                      Filter the read sequence if average quality of bases
                      in reads is lower than threshold (1-40) [default:20]
-n TRIM_OPT, --trim TRIM_OPT
                      If trim option set to True, the reads with low quality
                      (as defined by option --qthr) will be trimmed instead
                      of discarding [True|False] [default: False]
-p WIND_SIZE, --wsz WIND_SIZE
                      The window size for trimming (5->3) the reads. This
                      option should always set when -trim option is defined
                      [default: 5]
-r MIN_LEN_FILT, --mlk MIN_LEN_FILT
                      Minimum length of the reads to retain after trimming
-q CPU, --cpu CPU     Number of CPU [default:2]
-m OUT_FMT, --ofmt OUT_FMT
                      Output file format (fastq/fasta) [default:fastq]
-v VIS_OPT, --no-vis VIS_OPT
                      No figures will be produced [True|False]
                      [default:False]
--version             show program's version number and exit

Output:¶

HTSQualC produces the filtered cleaned HTS data as FASTQ/FASTA files, and statistics and visualization of filtered cleaned HTS datasets. The output will be saved in folder with name ending as filtering_out.

Following output files and graphs are produced by HTSQualC with default commands,

Files	Description
Statistics.txt	Detailed statistics of quality control evaluations for provided parameters
1_S1_L001_R1_001_Clean.fastq	Cleaned sequence data in FASTQ format (left file for paired-end data)
1_S1_L001_R2_001_Clean.fastq	Cleaned sequence data in FASTQ format (right file for paired-end data)
1_S1_L001_R1_001_Qualdist.png	Comparative sequence PHRED quality distribution for raw and cleaned data (for left file)
1_S1_L001_R2_001_Qualdist.png	Comparative sequence PHRED quality distribution for raw and cleaned data (for right file)
1_S1_L001_R1_001_GCdist.png	Comparative percentage of GC content distribution for raw and cleaned data (for left file)
1_S1_L001_R2_001_GCdist.png	Comparative percentage of GC content distribution for raw and cleaned data (for right file)
1_S1_L001_R1_001_QualGroup.png	Group-wise sequence PHRED quality distribution for cleaned data (for left file)
1_S1_L001_R2_001_QualGroup.png	Group-wise for sequence PHRED quality distribution for cleaned data (for right file)
1_S1_L001_R1_001_Basedist.png	Nucleotide base content for raw and cleaned data (for left file)
1_S1_L001_R2_001_Basedist.png	Nucleotide base content for raw and cleaned data (for right file)
Command.log	Commands used for the HTSQualC analysis

Detailed statistics of quality control evaluations (Statistics.txt)

Comparative sequence PHRED quality distribution for raw and cleaned data (1_S1_L001_R1_001_Qualdist.png and 1_S1_L001_R2_001_Qualdist.png)

Comparative percentage of GC content distribution for raw and cleaned data (1_S1_L001_R1_001_GCdist.png and 1_S1_L001_R2_001_GCdist.png)

Nucleotide base content for raw and cleaned data (1_S1_L001_R1_001_Basedist.png and 1_S1_L001_R2_001_Basedist.png)

Group-wise sequence PHRED quality distribution for cleaned data (1_S1_L001_R1_001_QualGroup.png and 1_S1_L001_R2_001_QualGroup.png)

If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com

Fix or improve this documentation

Search for an answer: CyVerse Learning Center
Ask us for help: click on the lower right-hand side of the page
Report an issue or submit a change: Github Repo Link
Send feedback: Tutorials@CyVerse.org

Learning Center Home