seqQscorer: automated quality control of next-generation sequencing data using machine learning

Abstract Controlling quality of next-generation sequencing (NGS) data files is a necessary but complex task. To address this problem, we statistically characterize common NGS quality features and develop a novel quality control procedure involving tree-based and deep learning classification algorithms. Predictive models, validated on internal and external functional genomics datasets, are to some extent generalizable to data from unseen species. The derived statistical guidelines and predictive models represent a valuable resource for users of NGS data to better understand quality issues and perform automatic quality control. Our guidelines and software are available at https://github.com/salbrec/seqQscorer.

Automated quality control of next generation sequencing data using machine learning

10.1101/768713 ◽

2019 ◽

Author(s):

Steffen Albrecht ◽

Miguel A. Andrade-Navarro ◽

Jean-Fred Fontaine

Keyword(s):

Quality Control ◽

Next Generation Sequencing ◽

Predictive Models ◽

Next Generation Sequencing Data ◽

Control Procedure ◽

Next Generation ◽

Sequencing Data ◽

Statistical Guidelines ◽

Ngs Data ◽

Generation Sequencing

Abstract Controlling quality of next generation sequencing (NGS) data files is a necessary but complex task. To address this problem, we statistically characterized common NGS quality features and developed a novel quality control procedure involving tree-based and deep learning classification algorithms. Predictive models, validated on internal data and external disease diagnostic datasets, are to some extent generalizable to data from unseen species. The derived statistical guidelines and predictive models represent a valuable resource for users of NGS data to better understand quality issues and perform automatic quality control. Our guidelines and software are available at the following URL: https://github.com/salbrec/seqQscorer.

NGSremix: A software tool for estimating pairwise relatedness between admixed individuals from next-generation sequencing data

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab174 ◽

2021 ◽

Author(s):

Anne Krogh Nøhr ◽

Kristian Hanghøj ◽

Genis Garcia Erill ◽

Zilong Li ◽

Ida Moltke ◽

...

Keyword(s):

Next Generation Sequencing ◽

Genetic Research ◽

Likelihood Estimation ◽

Software Tool ◽

Estimation Methods ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Ngs Data ◽

Generation Sequencing

Abstract Estimation of relatedness between pairs of individuals is important in many genetic research areas. When estimating relatedness, it is important to account for admixture if it is present. However, the methods that can account for admixture all take genotype data as input, which is a problem for low-depth next-generation sequencing (NGS) data, from which genotypes are called with high uncertainty. Here we present a software tool, NGSremix, for maximum likelihood estimation of relatedness between pairs of admixed individuals from low-depth NGS data, which takes the uncertainty of the genotypes into account via genotype likelihoods. Using both simulated and real NGS data for admixed individuals with an average depth of 4x or below, we show that our method works well and clearly outperforms the commonly used state-of-the-art relatedness estimation methods PLINK, KING, relateAdmix, and ngsRelate, all of which perform quite poorly. Hence, NGSremix is a useful new tool for estimating relatedness in admixed populations from low-depth NGS data. NGSremix is implemented in C/C++ as multi-threaded software and is freely available on GitHub at https://github.com/KHanghoj/NGSremix.

QC-Chain: Fast and Holistic Quality Control Method for Next-Generation Sequencing Data

PLoS ONE ◽

10.1371/journal.pone.0060234 ◽

2013 ◽

Vol 8 (4) ◽

pp. e60234 ◽

Cited By ~ 46

Author(s):

Qian Zhou ◽

Xiaoquan Su ◽

Anhui Wang ◽

Jian Xu ◽

Kang Ning

Keyword(s):

Quality Control ◽

Next Generation Sequencing ◽

Control Method ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Quality Control Method ◽

Generation Sequencing

Quality control of next-generation sequencing data without a reference

Frontiers in Genetics ◽

10.3389/fgene.2014.00111 ◽

2014 ◽

Vol 5 ◽

Cited By ~ 18

Author(s):

Urmi H. Trivedi ◽

Timothée Cézard ◽

Stephen Bridgett ◽

Anna Montazam ◽

Jenna Nichols ◽

...

Keyword(s):

Quality Control ◽

Next Generation Sequencing ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing

WBFQC: A new approach for compressing next-generation sequencing data splitting into homogeneous streams

Journal of Bioinformatics and Computational Biology ◽

10.1142/s021972001850018x ◽

2018 ◽

Vol 16 (05) ◽

pp. 1850018 ◽

Cited By ~ 1

Author(s):

Sanjeev Kumar ◽

Suneeta Agarwal ◽

Ranvijay

Keyword(s):

Next Generation Sequencing ◽

Genomic Data ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Compression Technique ◽

Compression Algorithms ◽

Ngs Data ◽

And Storage ◽

Generation Sequencing

Genomic data now play a vital role in a number of fields, such as personalized medicine, forensics, drug discovery, sequence alignment and agriculture. With the advancements in and falling cost of next-generation sequencing (NGS) technology, these data are growing exponentially. NGS data are being generated more rapidly than they can be meaningfully analyzed. Thus, there is much scope for developing novel data compression algorithms that facilitate data analysis as well as data transfer and storage. An innovative compression technique is proposed here to address the problem of transmitting and storing large NGS data. This paper presents a lossless, non-reference-based FastQ file compression approach that segregates the data into three different streams and then applies an appropriate and efficient compression algorithm to each. Experiments show that the proposed approach (WBFQC) outperforms other state-of-the-art approaches for compressing NGS data in terms of compression ratio (CR) and compression and decompression time. It also provides random access over compressed genomic data. An open source FastQ compression tool is also provided (http://www.algorithm-skg.com/wbfqc/home.html).
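
The stream-splitting idea described in this abstract can be illustrated with a minimal Python sketch. It is not the WBFQC algorithm: the abstract does not say which three streams or which compressors the tool uses, so the split into identifiers, sequences and quality strings, the use of zlib for every stream, and the function names `split_fastq_streams`, `compress_streams` and `decompress_streams` are all assumptions made for illustration. The sketch also assumes four-line records with a bare `+` separator line.

```python
import zlib


def split_fastq_streams(fastq_text):
    """Split FASTQ text into identifier, sequence and quality streams.

    Assumes four lines per record and a bare '+' separator line.
    """
    lines = fastq_text.strip().split("\n")
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    ids, seqs, quals = [], [], []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or plus != "+":
            raise ValueError(f"malformed record starting at line {i + 1}")
        ids.append(header[1:])
        seqs.append(seq)
        quals.append(qual)
    return ids, seqs, quals


def compress_streams(fastq_text):
    """Compress each stream separately (zlib is a stand-in compressor)."""
    streams = split_fastq_streams(fastq_text)
    return tuple(zlib.compress("\n".join(s).encode("ascii"), 9) for s in streams)


def decompress_streams(blobs):
    """Rebuild the original FASTQ text from the three compressed streams."""
    ids, seqs, quals = (zlib.decompress(b).decode("ascii").split("\n") for b in blobs)
    records = [f"@{h}\n{s}\n+\n{q}" for h, s, q in zip(ids, seqs, quals)]
    return "\n".join(records) + "\n"


example = "@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nFFF#\n"
blobs = compress_streams(example)
restored = decompress_streams(blobs)
print(restored == example)
```

Separating the streams lets each one be handled by a compressor suited to its statistics (a four-letter alphabet for sequences, highly repetitive text for identifiers), which is the general motivation for this kind of design.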

CoverView: a sequence quality evaluation tool for next generation sequencing data

Wellcome Open Research ◽

10.12688/wellcomeopenres.14306.1 ◽

2018 ◽

Vol 3 ◽

pp. 36 ◽

Cited By ~ 5

Author(s):

Márton Münz ◽

Shazia Mahamdallie ◽

Shawn Yost ◽

Andrew Rimmer ◽

Emma Poyastro-Pearson ◽

...

Keyword(s):

Quality Control ◽

Next Generation Sequencing ◽

Quality Evaluation ◽

Reference Sample ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Evaluation Tool ◽

Link Type ◽

Generation Sequencing

Quality assurance and quality control are essential for robust next generation sequencing (NGS). Here we present CoverView, a fast, flexible, user-friendly quality evaluation tool for NGS data. CoverView processes mapped sequencing reads and user-specified regions to report depth of coverage, base and mapping quality metrics with increasing levels of detail from a chromosome-level summary to per-base profiles. CoverView can flag regions that do not fulfil user-specified quality requirements, allowing suboptimal data to be systematically and automatically presented for review. It also provides an interactive graphical user interface (GUI) that can be opened in a web browser and allows intuitive exploration of results. We have integrated CoverView into our accredited clinical cancer predisposition gene testing laboratory that uses the TruSight Cancer Panel (TSCP). CoverView has been invaluable for optimisation and quality control of our testing pipeline, providing transparent, consistent quality metric information and automatic flagging of regions that fall below quality thresholds. We demonstrate this utility with TSCP data from the Genome in a Bottle reference sample, which CoverView analysed in 13 seconds. CoverView uses data routinely generated by NGS pipelines, reads standard input formats, and rapidly creates easy-to-parse output text (.txt) files that are customised by a simple configuration file. CoverView can therefore be easily integrated into any NGS pipeline. CoverView and detailed documentation for its use are freely available at github.com/RahmanTeamDevelopment/CoverView/releases and www.icr.ac.uk/CoverView
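
The region-flagging step described in this abstract can be sketched in a few lines of Python. This does not use CoverView's API or configuration format: the `Region` class, the `flag_regions` function and the two threshold values are hypothetical and chosen only to show how user-specified requirements might be checked against per-base depth.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    name: str
    depths: List[int]  # per-base depth of coverage across the region


def flag_regions(regions, min_depth=15, min_mean_depth=30.0):
    """Return names of regions that fail either (hypothetical) depth requirement."""
    flagged = []
    for region in regions:
        lowest = min(region.depths)
        mean = sum(region.depths) / len(region.depths)
        if lowest < min_depth or mean < min_mean_depth:
            flagged.append(region.name)
    return flagged


regions = [
    Region("geneA_exon1", [40, 42, 38]),
    Region("geneB_exon2", [40, 10, 45]),  # one base dips below min_depth
]
print(flag_regions(regions))
```

Here only `geneB_exon2` is flagged, because a single base falls below the minimum depth even though the mean depth is acceptable.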

NGS_SNPAnalyzer: a desktop software supporting genome projects by identifying and visualizing sequence variations from next-generation sequencing data

Genes & Genomics ◽

10.1007/s13258-020-00997-7 ◽

2020 ◽

Vol 42 (11) ◽

pp. 1311-1317

Author(s):

Dong-Jun Lee ◽

Taesoo Kwon ◽

Chang-Kug Kim ◽

Young-Joo Seol ◽

Dong-Suk Park ◽

...

Keyword(s):

Next Generation Sequencing ◽

Sequence Variation ◽

Detection Methods ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Sequence Variations ◽

Ngs Data ◽

Generation Sequencing ◽

Genome Projects

Abstract Background: Sequence variations such as single nucleotide polymorphisms are markers for genetic diseases and breeding. Therefore, identifying sequence variations is one of the main objectives of several genome projects. Although most genome project consortiums provide standard operating procedures for sequence variation detection methods, results may differ because of human selection or error. Objective: To standardize the procedure for sequence variation detection and help researchers who are not formally trained in bioinformatics, we developed the NGS_SNPAnalyzer, a desktop software and fully automated graphical pipeline. Methods: The NGS_SNPAnalyzer is implemented using JavaFX (version 1.8); therefore, it is not limited to any operating system (OS). The tools employed in the NGS_SNPAnalyzer were compiled on Microsoft Windows (version 7, 10) and Ubuntu Linux (version 16.04, 17.0.4). Results: The NGS_SNPAnalyzer not only includes the functionalities for variant calling and annotation but also provides quality control, mapping, and filtering details to support all procedures from next-generation sequencing (NGS) data to variant visualization. It can be executed using pre-set pipelines and options and customized via user-specified options. Additionally, the NGS_SNPAnalyzer provides a user-friendly graphical interface and can be installed on any OS that supports Java. Conclusions: Although there are several pipelines and visualization tools available for NGS data analysis, we developed the NGS_SNPAnalyzer to provide the user with an easy-to-use interface. The benchmark test results indicate that the NGS_SNPAnalyzer achieves better performance than other open source tools.

Rapid evaluation and quality control of next generation sequencing data with FaQCs

BMC Bioinformatics ◽

10.1186/s12859-014-0366-2 ◽

2014 ◽

Vol 15 (1) ◽

Cited By ~ 88

Author(s):

Chien-Chi Lo ◽

Patrick S G Chain

Keyword(s):

Quality Control ◽

Next Generation Sequencing ◽

Next Generation Sequencing Data ◽

Rapid Evaluation ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing

Masking as an effective quality control method for next-generation sequencing data analysis

BMC Bioinformatics ◽

10.1186/s12859-014-0382-2 ◽

2014 ◽

Vol 15 (1) ◽

Cited By ~ 4

Author(s):

Sajung Yun ◽

Sijung Yun

Keyword(s):

Quality Control ◽

Data Analysis ◽

Next Generation Sequencing ◽

Control Method ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Quality Control Method ◽

Generation Sequencing ◽

Sequencing Data Analysis

Detection of somatic structural variants from short-read next-generation sequencing data

10.1101/840751 ◽

2019 ◽

Author(s):

Tingting Gong ◽

Vanessa M Hayes ◽

Eva KF Chan

Keyword(s):

Next Generation Sequencing ◽

Cancer Genomics ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Structural Variants ◽

Sequencing Data ◽

Short Read ◽

Factors Affecting ◽

Ngs Data ◽

Generation Sequencing

Abstract Somatic structural variants (SVs) play a significant role in cancer development and evolution, but are notoriously more difficult to detect than small variants from short-read next-generation sequencing (NGS) data. This is due to a combination of challenges attributed to the purity of tumour samples, tumour heterogeneity, limitations of short-read information from NGS, and sequence alignment ambiguities. In spite of active development of SV detection tools (callers) over the past few years, each method has inherent advantages and limitations. In this review, we highlight some of the important factors affecting somatic SV detection and compare the performance of eight commonly used SV callers. In particular, we focus on the extent of change in sensitivity and precision for detecting different SV types and size ranges from samples with differing variant allele frequencies and sequencing depths of coverage. We highlight the reasons why some SV callers perform well in some settings but not others, allowing our evaluation findings to be extended beyond the eight SV callers examined in this paper. As the importance of large structural variants becomes increasingly recognised in cancer genomics, this paper provides a timely review of some of the most impactful factors influencing somatic SV detection and guidance on selecting an appropriate SV caller.
