ForestQC: quality control on genetic variants from next-generation sequencing data using random forest

2018 ◽  
Author(s):  
Jiajin Li ◽  
Brandon Jew ◽  
Lingyu Zhan ◽  
Sungoo Hwang ◽  
Giovanni Coppola ◽  
...  

Abstract

Next-generation sequencing technology (NGS) enables discovery of nearly all genetic variants present in a genome. A subset of these variants, however, may have poor sequencing quality due to limitations in sequencing technology or in variant calling algorithms. In genetic studies that analyze a large number of sequenced individuals, it is critical to detect and remove those variants with poor quality as they may cause spurious findings. In this paper, we present a statistical approach for performing quality control on variants identified from NGS data by combining a traditional filtering approach and a machine learning approach. Our method uses information on sequencing quality such as sequencing depth, genotyping quality, and GC contents to predict whether a certain variant is likely to contain errors. To evaluate our method, we applied it to two whole-genome sequencing datasets where one dataset consists of related individuals from families while the other consists of unrelated individuals. Results indicate that our method outperforms widely used methods for performing quality control on variants such as VQSR of GATK by considerably improving the quality of variants to be included in the analysis. Our approach is also very efficient, and hence can be applied to large sequencing datasets. We conclude that combining a machine learning algorithm trained with sequencing quality information and the filtering approach is an effective approach to perform quality control on genetic variants from sequencing data.

Author Summary

Genetic disorders can be caused by many types of genetic mutations, including common and rare single nucleotide variants, structural variants, insertions and deletions. Nowadays, next generation sequencing (NGS) technology allows us to identify various genetic variants that are associated with diseases. However, variants detected by NGS might have poor sequencing quality due to biases and errors in sequencing technologies and analysis tools. Therefore, it is critical to remove variants with low quality, which could cause spurious findings in follow-up analyses. Previously, people applied either hard filters or machine learning models for variant quality control (QC), which failed to filter out those variants accurately. Here, we developed a statistical tool, ForestQC, for variant QC by combining a filtering approach and a machine learning approach. We applied ForestQC to one family-based whole genome sequencing (WGS) dataset and one general case-control WGS dataset, to evaluate our method. Results show that ForestQC outperforms widely used methods for variant QC by considerably improving the quality of variants. Also, ForestQC is very efficient and scalable to large-scale sequencing datasets. Our study indicates that combining filtering approaches and machine learning approaches enables effective variant QC.
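The combination of a filtering step and a learned classifier described above can be illustrated with a minimal sketch. It is not ForestQC's implementation: the `Variant` fields, the `missing_rate` statistic, the thresholds, and every function name are assumptions made for illustration, and the learner is a nearest-centroid stand-in rather than a random forest. The idea shown is only the overall flow: a hard filter labels the clear cases, a model is trained on those labels using sequencing-quality features (depth, genotyping quality, GC content), and the model then decides the remaining variants.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Features = List[float]
Predictor = Callable[[Features], bool]


@dataclass
class Variant:
    vid: str
    missing_rate: float      # hypothetical statistic used only by the hard filter
    depth: float             # mean sequencing depth
    genotype_quality: float  # mean genotyping quality
    gc_content: float        # GC fraction around the variant


def hard_filter(v: Variant, good_max: float = 0.01, bad_min: float = 0.10) -> str:
    """Filtering step: label clear cases, leave the rest undetermined.
    Thresholds are illustrative assumptions, not values from the paper."""
    if v.missing_rate <= good_max:
        return "good"
    if v.missing_rate >= bad_min:
        return "bad"
    return "undetermined"


def features(v: Variant) -> Features:
    """Sequencing-quality features handed to the learner."""
    return [v.depth, v.genotype_quality, v.gc_content]


def nearest_centroid_trainer(X: Sequence[Features], y: Sequence[bool]) -> Predictor:
    """Stand-in learner (NOT a random forest): picks the class with the closer centroid."""
    def centroid(rows: List[Features]) -> Features:
        return [sum(col) / len(rows) for col in zip(*rows)]

    good = centroid([x for x, label in zip(X, y) if label])
    bad = centroid([x for x, label in zip(X, y) if not label])

    def dist2(a: Features, b: Features) -> float:
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return lambda x: dist2(x, good) <= dist2(x, bad)


def quality_control(
    variants: Sequence[Variant],
    train: Callable[[Sequence[Features], Sequence[bool]], Predictor] = nearest_centroid_trainer,
) -> Dict[str, str]:
    """Label every variant 'good' or 'bad' using the filter first, then the model."""
    labels = {v.vid: hard_filter(v) for v in variants}
    labelled = [v for v in variants if labels[v.vid] != "undetermined"]
    y = [labels[v.vid] == "good" for v in labelled]
    if not any(y) or all(y):
        raise ValueError("need both good and bad variants to train the model")
    predict = train([features(v) for v in labelled], y)
    for v in variants:
        if labels[v.vid] == "undetermined":
            labels[v.vid] = "good" if predict(features(v)) else "bad"
    return labels
```

The `train` argument is deliberately pluggable: a wrapper around a random forest implementation could be passed in its place, which would bring the sketch closer to the approach the abstract names.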

2019 ◽  
Vol 15 (12) ◽  
pp. e1007556
Author(s):  
Jiajin Li ◽  
Brandon Jew ◽  
Lingyu Zhan ◽  
Sungoo Hwang ◽  
Giovanni Coppola ◽  
...  

2014 ◽  
Vol 5 ◽  
Author(s):  
Urmi H. Trivedi ◽  
Timothée Cézard ◽  
Stephen Bridgett ◽  
Anna Montazam ◽  
Jenna Nichols ◽  
...  

2018 ◽  
Vol 3 ◽  
pp. 36 ◽  
Author(s):  
Márton Münz ◽  
Shazia Mahamdallie ◽  
Shawn Yost ◽  
Andrew Rimmer ◽  
Emma Poyastro-Pearson ◽  
...  

Quality assurance and quality control are essential for robust next generation sequencing (NGS). Here we present CoverView, a fast, flexible, user-friendly quality evaluation tool for NGS data. CoverView processes mapped sequencing reads and user-specified regions to report depth of coverage, base and mapping quality metrics with increasing levels of detail from a chromosome-level summary to per-base profiles. CoverView can flag regions that do not fulfil user-specified quality requirements, allowing suboptimal data to be systematically and automatically presented for review. It also provides an interactive graphical user interface (GUI) that can be opened in a web browser and allows intuitive exploration of results. We have integrated CoverView into our accredited clinical cancer predisposition gene testing laboratory that uses the TruSight Cancer Panel (TSCP). CoverView has been invaluable for optimisation and quality control of our testing pipeline, providing transparent, consistent quality metric information and automatic flagging of regions that fall below quality thresholds. We demonstrate this utility with TSCP data from the Genome in a Bottle reference sample, which CoverView analysed in 13 seconds. CoverView uses data routinely generated by NGS pipelines, reads standard input formats, and rapidly creates easy-to-parse output text (.txt) files that are customised by a simple configuration file. CoverView can therefore be easily integrated into any NGS pipeline. CoverView and detailed documentation for its use are freely available at github.com/RahmanTeamDevelopment/CoverView/releases and www.icr.ac.uk/CoverView
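The flagging behaviour described above, where regions that fall below user-specified quality requirements are reported for review, can be sketched as follows. This is not CoverView's code or its configuration format: the metric names, the `regions` layout, and both function names are hypothetical, and the input is assumed to be per-base depths and base qualities already extracted for each region.

```python
from typing import Dict, List, Sequence, Tuple

RegionData = Tuple[Sequence[int], Sequence[float]]  # (per-base depths, per-base qualities)


def region_metrics(depths: Sequence[int], base_quals: Sequence[float]) -> Dict[str, float]:
    """Summarise one region; the metric names are illustrative assumptions."""
    return {
        "min_depth": float(min(depths)),
        "mean_depth": sum(depths) / len(depths),
        "mean_base_quality": sum(base_quals) / len(base_quals),
    }


def flag_regions(
    regions: Dict[str, RegionData], minimums: Dict[str, float]
) -> Dict[str, List[str]]:
    """Return, for each failing region, the metrics that fell below their minimum."""
    flagged: Dict[str, List[str]] = {}
    for name, (depths, quals) in regions.items():
        metrics = region_metrics(depths, quals)
        failed = [m for m, floor in minimums.items() if metrics[m] < floor]
        if failed:
            flagged[name] = failed
    return flagged
```

For example, with a minimum depth requirement of 50, a region containing a single base covered by only 12 reads would be flagged on `min_depth` even if its mean depth looks acceptable, which is why per-base profiles matter alongside region-level summaries.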


2021 ◽  
Author(s):  
King Wai Lau ◽  
Michelle Kleeman ◽  
Caroline Reuter ◽  
Attila Lorincz

Abstract

Summary: Extremely large datasets are impossible or very difficult for humans to comprehend by standard mental approaches. Intuitive visualization of genetic variants in genomic sequencing data could help in the review and confirmation process of variants called by automated variant calling programs. To help facilitate interpretation of genetic variant next-generation sequencing (NGS) data we developed VisVariant, a customizable visualization tool that creates a figure showing the overlapping sequence information of thousands of individual reads including the variant and flanking regions.

Availability and implementation: Detailed information on how to download, install and run VisVariant together with an example is available on our github website [https://github.com/hugging-biorxiv/visvariant].

