HTSeqQC: A Flexible and One-Step Quality Control Software for High-throughput Sequence Data Analysis

Abstract

Motivation: Use of high-throughput sequencing (HTS) has become indispensable in life science research. Raw HTS data contain several sequencing artifacts, and as a first step it is imperative to remove these artifacts for reliable downstream bioinformatics analysis. Although multiple stand-alone tools can perform the various quality control steps separately, an integrated tool that allows one-step, automated quality control analysis of HTS datasets would significantly improve the handling of large numbers of samples in parallel.

Results: Here, we developed HTSeqQC, a stand-alone, flexible, and easy-to-use software for one-step quality control analysis of raw HTS data. HTSeqQC can evaluate HTS data quality and perform filtering and trimming analysis in a single run. We evaluated the performance of HTSeqQC for batch analysis of HTS datasets using 322 sample datasets with an average of ~1M (paired-end) sequence reads per sample. HTSeqQC accomplished the QC analysis in ~3 hours in distributed mode and ~31 hours in shared mode, underscoring its utility and robust performance.

Availability and implementation: HTSeqQC software, Docker image and Nextflow template are available for download at https://github.com/reneshbedre/HTSeqQC and a graphical user interface (GUI) is available at CyVerse Discovery Environment (DE) (https://cyverse.org/). Documentation is available at https://reneshbedre.github.io/blog/htseqqc.html and https://cyverse-htseqqc-cyverse-tutorial.readthedocs-hosted.com/en/latest/ (for CyVerse).

Contact: Kranthi Mandadi ([email protected])

Supplementary information: Supplementary information is provided in Supplementary File 1.


HTSQualC is a flexible and one-step quality control software for high-throughput sequencing data analysis

Scientific Reports ◽

10.1038/s41598-021-98124-3 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Renesh Bedre ◽

Carlos Avila ◽

Kranthi Mandadi

Keyword(s):

Quality Control ◽

High Throughput ◽

High Throughput Sequencing ◽

Science Research ◽

Control Analysis ◽

Sequencing Data ◽

Quality Control Analysis ◽

High Throughput Sequencing Data ◽

One Step ◽

Automated Quality Control

Abstract: Use of high-throughput sequencing (HTS) has become indispensable in life science research. Raw HTS data contain several sequencing artifacts, and as a first step it is imperative to remove these artifacts for reliable downstream bioinformatics analysis. Although multiple stand-alone tools can perform the various quality control steps separately, an integrated tool that allows one-step, automated quality control analysis of HTS datasets would significantly improve the handling of large numbers of samples in parallel. Here, we developed HTSQualC, a stand-alone, flexible, and easy-to-use software for one-step quality control analysis of raw HTS data. HTSQualC can evaluate HTS data quality and perform filtering and trimming analysis in a single run. We evaluated the performance of HTSQualC for batch analysis of HTS datasets using 322 samples with an average of ~1 M (paired-end) sequence reads per sample. HTSQualC accomplished the QC analysis in ~3 h in distributed mode and ~31 h in shared mode, underscoring its utility and robust performance. In addition to command-line execution, we integrated HTSQualC into the free, open-source CyVerse cyberinfrastructure resource as a GUI, for wider access by experimental biologists who have limited computational resources and/or programming abilities.
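
To illustrate what "filtering and trimming in a single run" can involve, the following is a minimal Python sketch. It is not HTSQualC's implementation: the function names, the Phred+33 quality encoding, and every threshold (`min_q`, `min_mean_q`, `min_len`, `max_n_frac`) are assumptions chosen for illustration only.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(seq, qual, min_q=20):
    """Drop bases from the 3' end until one with score >= min_q is reached."""
    scores = phred_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def passes_filter(seq, qual, min_mean_q=20, min_len=4, max_n_frac=0.1):
    """Keep a read only if it is long enough, has few Ns, and a good mean quality."""
    if len(seq) < min_len:  # also guards against empty reads after trimming
        return False
    if seq.upper().count("N") / len(seq) > max_n_frac:
        return False
    scores = phred_scores(qual)
    return sum(scores) / len(scores) >= min_mean_q


def qc_reads(records):
    """Trim, then filter, each (name, sequence, quality) record in one pass."""
    kept = []
    for name, seq, qual in records:
        seq, qual = trim_3prime(seq, qual)
        if passes_filter(seq, qual):
            kept.append((name, seq, qual))
    return kept


# Hypothetical toy reads: 'I' encodes Phred 40, '#' encodes Phred 2.
records = [
    ("r1", "ACGTACGT", "IIIIII##"),  # low-quality tail is trimmed, read kept
    ("r2", "NNNNACGT", "IIIIIIII"),  # too many uncalled bases, read dropped
    ("r3", "ACGT", "####"),          # trimmed to nothing, read dropped
]
print(qc_reads(records))
```

Trimming before filtering means that a read with a poor 3' tail but a usable core is retained rather than discarded outright; a real tool would also handle adapters and paired-end mates, which this sketch omits.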


kataegis: an R package for identification and visualization of the genomic localized hypermutation regions using high-throughput sequencing

BMC Genomics ◽

10.1186/s12864-021-07696-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Xue Lin ◽

Yingying Hua ◽

Shuanglin Gu ◽

Li Lv ◽

Xingyu Li ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Somatic Mutations ◽

R Package ◽

Frequency Of Occurrence ◽

Link Type ◽

Genomic Landscape ◽

One Step ◽

Flanking Regions

Abstract Background: Genomic localized hypermutation regions have been found in cancers and have been reported to be related to cancer prognosis. Genomic localized hypermutation differs markedly from the usual somatic mutations in frequency of occurrence and genomic density. It resembles a “violent storm” of mutations, which is what the Greek word “kataegis” means. Results: There is a need for a lightweight and simple-to-use toolkit to identify and visualize localized hypermutation regions in the genome. We therefore developed the R package “kataegis” to meet this need. The package uses only three steps to identify genomic hypermutation regions, i.e., i) read in the variation files in standard formats; ii) calculate the inter-mutational distances; iii) identify the hypermutation regions with appropriate parameters, followed by one step to visualize the nucleotide contents and spectra of both the foci and flanking regions, and the genomic landscape of these regions. Conclusions: The kataegis package is available on Bioconductor/GitHub (https://github.com/flosalbizziae/kataegis), and provides a lightweight and simple-to-use toolkit for quickly identifying and visualizing genomic hypermutation regions.
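
Steps ii) and iii) above can be sketched in a few lines of Python. This is an illustrative sketch only, not the kataegis package's algorithm or API (the package itself is written in R): the function names are hypothetical, variants are assumed to be already parsed into `(chromosome, position)` pairs, and the `max_dist` and `min_mutations` defaults are assumed values, not the package's parameters.

```python
from collections import defaultdict


def sorted_positions(variants):
    """Group (chrom, pos) pairs by chromosome and sort positions within each."""
    by_chrom = defaultdict(list)
    for chrom, pos in variants:
        by_chrom[chrom].append(pos)
    return {chrom: sorted(positions) for chrom, positions in by_chrom.items()}


def intermutation_distances(variants):
    """For each chromosome, pair every mutation with its distance to the previous one."""
    return {
        chrom: [(cur, cur - prev) for prev, cur in zip(positions, positions[1:])]
        for chrom, positions in sorted_positions(variants).items()
    }


def find_foci(variants, max_dist=1000, min_mutations=6):
    """Report runs of >= min_mutations mutations whose consecutive gaps are <= max_dist."""
    foci = []
    by_chrom = sorted_positions(variants)
    for chrom in sorted(by_chrom):
        positions = by_chrom[chrom]
        run = positions[:1]
        for pos in positions[1:]:
            if pos - run[-1] <= max_dist:
                run.append(pos)
                continue
            if len(run) >= min_mutations:
                foci.append((chrom, run[0], run[-1], len(run)))
            run = [pos]
        if len(run) >= min_mutations:  # flush the final run on this chromosome
            foci.append((chrom, run[0], run[-1], len(run)))
    return foci


# Hypothetical toy input: a dense cluster on chr1 plus scattered mutations.
variants = [("chr1", p) for p in (100, 300, 450, 700, 900, 1200, 50000, 90000)]
variants += [("chr2", 10), ("chr2", 500000)]
print(find_foci(variants))
```

Plotting the distances returned by `intermutation_distances` on a log scale against genomic position gives the familiar "rainfall" view, in which hypermutated foci appear as vertical drops.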


Quality control analysis of the 1000 Genomes Project Omni2.5 genotypes

10.1101/078600 ◽

2016 ◽

Cited By ~ 7

Author(s):

Nicole M. Roslin ◽

Li Weili ◽

Andrew D. Paterson ◽

Lisa J. Strug

Keyword(s):

Quality Control ◽

Sex Chromosome ◽

Geographic Region ◽

Control Analysis ◽

Initial Number ◽

1000 Genomes Project ◽

High Quality ◽

1000 Genomes ◽

Link Type ◽

Quality Control Analysis

Citation: For any use of the 1000 Genomes Project data, please use the citation as noted here: http://www.1000genomes.org/faq/how-do-i-cite-1000-genomes-project. To cite this report or the lists described here, please use the following: Roslin NM, Li W, Paterson AD, Strug LJ. Quality control analysis of the 1000 Genomes Project Omni2.5 genotypes (Abstract/Program #576/F). Presented at the 66th Annual Meeting of The American Society of Human Genetics, October 18-22, 2016, Vancouver, Canada.

Data Summary

Chips: Illumina HumanOmni2.5-4v1_B and Illumina HumanOmni25M-8v1-1_B

Initial number of SNPs: 2 458 861

Initial number of samples: 2318

Number of SNPs passing QC: 1 989 184 (80.9%)

Number of samples passing QC: 2318 (100%)

Number of quasi-unrelated samples with consistent ethnicity and well inferred sex: 1736

Abstract: The 1000 Genomes Project genotyped 2318 individuals (48.1% male) from 19 populations in 5 continental groups on the Illumina Omni2.5 platform. The data are publicly available, and will prove a valuable resource for obtaining ethnic-specific allele frequencies, as well as for exploring population histories through principal components analysis (PCA), estimation of inbreeding coefficients, and admixture analysis. As in any study, the data should be cleaned prior to analysis, to remove individuals or markers of questionable quality. Furthermore, a thorough understanding of the relationships between individuals must be established. Here we report our findings after comprehensive examination of the data for quality control.

The basic quality of the genotypes was assessed using standard procedures. KING version 1.4 was used to confirm the relationships in the provided pedigrees, and also to detect undeclared relationships. PCA was used to examine the similarities and differences between individuals among and between population groups.

In general, the data were found to be of high quality. No samples were removed due to low call rate (<97%) or excess heterozygosity. Sex chromosome genotypes showed two individuals with discrepancies between reported and inferred sex, and sex could not be determined in an additional 20 individuals; the sex for these was changed to unknown. Relationship checking found discrepancies between first-degree relationships in the provided pedigrees and the genotypes in 9 families, including one instance where a reported parent/child pair was unrelated, two instances where full sibs were unrelated, and one set of three individuals who formed a newly defined trio. A set of 1756 individuals who were inferred to be more distant than 3rd degree relatives was extracted and used in PCA. These individuals clustered in a pattern that is consistent with other published reports of global populations. We identified 4 individuals whose genotypes clustered more closely with a different geographic region than the one in the provided data.

Although the genotype data are of high quality, errors exist in the publicly available dataset that require attention prior to using the genotypes. PLINK-format files including SNPs with good quality metrics and revised pedigree structures are available at http://tcag.ca. Files with distantly related or unrelated individuals, with sex inference consistent with provided gender, and with PCA consistent with continental group are also available.
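
The sample call-rate check mentioned above (samples below 97% would be removed) amounts to simple arithmetic, sketched here in Python. Only the 97% threshold and the SNP counts come from the report; the function names, the genotype coding (0/1/2 with `None` for a missing call), and the toy samples are assumptions for illustration and do not reflect the authors' actual pipeline.

```python
def call_rate(genotypes):
    """Fraction of markers with a non-missing genotype call for one sample."""
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)


def low_call_rate_samples(samples, threshold=0.97):
    """Return the IDs of samples whose call rate falls below the threshold."""
    return sorted(sid for sid, g in samples.items() if call_rate(g) < threshold)


# Hypothetical toy data: S1 is fully called, S2 is missing one call in four.
samples = {
    "S1": [0, 1, 2, 1] * 25,
    "S2": [0, None, 2, 1] * 25,
}
print(low_call_rate_samples(samples))

# Share of SNPs passing QC, using the counts given in the Data Summary.
snp_pass_percent = round(100 * 1_989_184 / 2_458_861, 1)
print(snp_pass_percent)
```

In the reported dataset no sample fell below the threshold, so a check of this kind would return an empty list there.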
