Next Generation Sequencing and Machine Learning Technologies Are Painting the Epigenetic Portrait of Glioblastoma

2020 ◽  
Vol 10 ◽  
Author(s):  
Ivana Jovčevska
2019 ◽  
Vol 66 (1) ◽  
pp. 239-246 ◽  
Author(s):  
Chao Wu ◽  
Xiaonan Zhao ◽  
Mark Welsh ◽  
Kellianne Costello ◽  
Kajia Cao ◽  
...  

Abstract BACKGROUND Molecular profiling has become essential for tumor risk stratification and treatment selection. However, cancer genome complexity and technical artifacts make identification of real variants a challenge. Currently, clinical laboratories rely on manual screening, which is costly, subjective, and not scalable. We present a machine learning–based method to distinguish artifacts from bona fide single-nucleotide variants (SNVs) detected by next-generation sequencing from nonformalin-fixed paraffin-embedded tumor specimens. METHODS A cohort of 11278 SNVs identified through clinical sequencing of tumor specimens was collected and divided into training, validation, and test sets. Each SNV was manually inspected and labeled as either real or artifact as part of clinical laboratory workflow. A 3-class (real, artifact, and uncertain) model was developed on the training set, fine-tuned with the validation set, and then evaluated on the test set. Prediction intervals reflecting the certainty of the classifications were derived during the process to label “uncertain” variants. RESULTS The optimized classifier demonstrated 100% specificity and 97% sensitivity over 5587 SNVs of the test set. Overall, 1252 of 1341 true-positive variants were identified as real, 4143 of 4246 false-positive calls were deemed artifacts, whereas only 192 (3.4%) SNVs were labeled as “uncertain,” with zero misclassification between the true positives and artifacts in the test set. CONCLUSIONS We presented a computational classifier to identify variant artifacts detected from tumor sequencing. Overall, 96.6% of the SNVs received definitive labels and thus were exempt from manual review. This framework could improve quality and efficiency of the variant review process in clinical laboratories.
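The three-way labeling described in this abstract can be illustrated with a minimal sketch. The abstract derives "uncertain" labels from prediction intervals. The fixed probability thresholds below (`low`, `high`) and the function name `triage` are hypothetical simplifications, not the authors' method. The second half only re-derives the shares quoted in the abstract from its own counts.

```python
def triage(p_real: float, low: float = 0.05, high: float = 0.95) -> str:
    """Map a predicted probability that a variant is real to one of three labels.

    The thresholds are illustrative assumptions; the abstract does not state them.
    """
    if not 0.0 <= p_real <= 1.0:
        raise ValueError("p_real must be in [0, 1]")
    if p_real >= high:
        return "real"
    if p_real <= low:
        return "artifact"
    return "uncertain"  # would be routed to manual review


# Counts reported in the abstract for the test set.
TEST_SET_SIZE = 5587
CALLED_REAL = 1252
CALLED_ARTIFACT = 4143

# Variants that received neither definitive label.
uncertain = TEST_SET_SIZE - CALLED_REAL - CALLED_ARTIFACT
uncertain_share = uncertain / TEST_SET_SIZE
definitive_share = 1 - uncertain_share

print(f"uncertain: {uncertain} ({uncertain_share:.1%})")
print(f"definitive: {definitive_share:.1%}")
```

The arithmetic agrees with the abstract: 192 variants (3.4%) remain uncertain and 96.6% receive a definitive label.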


Genes ◽  
2018 ◽  
Vol 9 (10) ◽  
pp. 505
Author(s):  
Manfred Grabherr ◽  
Bozena Kaminska ◽  
Jan Komorowski

The massive increase in computational power over the recent years and wider applications of machine learning methods, coincidental or not, were paralleled by remarkable advances in high-throughput DNA sequencing technologies.[...]


BMC Genomics ◽  
2016 ◽  
Vol 17 (1) ◽  
Author(s):  
Jean-François Spinella ◽  
Pamela Mehanna ◽  
Ramon Vidal ◽  
Virginie Saillour ◽  
Pauline Cassart ◽  
...  

2019 ◽  
Vol 19 (7) ◽  
Author(s):  
Gang Li ◽  
Boyang Ji ◽  
Jens Nielsen

ABSTRACT Understanding the genotype–phenotype relationship is fundamental in biology. Thanks to next-generation sequencing and high-throughput phenotyping methodologies, a large amount of genome and phenome data has been generated for Saccharomyces cerevisiae. This makes it an excellent model system for understanding the genotype–phenotype relationship. In this paper, we present the reconstruction of the yeast pan-genome and its application to resolving the genotype–phenotype relationship using a machine learning-assisted approach.


2019 ◽  
Author(s):  
Tom Hill ◽  
Robert L. Unckless

Abstract Copy number variants (CNV) are associated with phenotypic variation in several species. However, properly detecting changes in copy numbers of sequences remains a difficult problem, especially in lower quality or lower coverage next-generation sequencing data. Here, inspired by recent applications of machine learning in genomics, we describe a method to detect duplications and deletions in short-read sequencing data. In low coverage data, machine learning appears to be more powerful in the detection of CNVs than the gold-standard methods or coverage estimation alone, and of equal power in high coverage data. We also demonstrate how replicating training sets allows a more precise detection of CNVs, even identifying novel CNVs in two genomes previously surveyed thoroughly for CNVs using long read data. Available at: https://github.com/tomh1lll/dudeml
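For orientation, the "coverage estimation alone" baseline that this abstract compares against can be sketched as follows. This is not the dudeML method and not the authors' code. The function name `call_cnv_windows`, the per-window depth input, and the ratio cutoffs of 1.5 and 0.5 are all assumptions chosen for illustration.

```python
from statistics import median
from typing import List


def call_cnv_windows(
    coverage: List[float],
    dup_ratio: float = 1.5,  # hypothetical cutoff for a duplication
    del_ratio: float = 0.5,  # hypothetical cutoff for a deletion
) -> List[str]:
    """Label each window by its depth relative to the median depth."""
    baseline = median(coverage)
    if baseline <= 0:
        raise ValueError("median coverage must be positive")
    labels = []
    for depth in coverage:
        ratio = depth / baseline
        if ratio >= dup_ratio:
            labels.append("dup")
        elif ratio <= del_ratio:
            labels.append("del")
        else:
            labels.append("normal")
    return labels


# Toy per-window read depths: one window at about 2x depth, one nearly absent.
windows = [30, 31, 29, 62, 30, 3, 30]
print(call_cnv_windows(windows))
```

A fixed-ratio rule like this becomes noisy when depth is low, which is the regime where the abstract reports that a learned classifier performs better.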


2020 ◽  
Vol 2020 ◽  
pp. 1-6
Author(s):  
Erdal Cosgun ◽  
Min Oh

Background. Next-generation sequencing enables massively parallel processing, allowing lower cost than the other sequencing technologies. In the subsequent analysis with the NGS data, one of the major concerns is the reliability of variant calls. Although researchers can utilize raw quality scores of variant calling, they are forced to start the further analysis without any preevaluation of the quality scores. Method. We presented a machine learning approach for estimating quality scores of variant calls derived from BWA+GATK. We analyzed correlations between the quality score and these annotations, specifying informative annotations which were used as features to predict variant quality scores. To test the predictive models, we simulated 24 paired-end Illumina sequencing reads with 30x coverage base. Also, twenty-four human genome sequencing reads resulting from Illumina paired-end sequencing with at least 30x coverage were secured from the Sequence Read Archive. Results. Using BWA+GATK, VCFs were derived from simulated and real sequencing reads. We observed that the prediction models learned by RFR outperformed other algorithms in both simulated and real data. The quality scores of variant calls were highly predictable from informative features of GATK Annotation Modules in the simulated human genome VCF data (R²: 96.7%, 94.4%, and 89.8% for RFR, MLR, and NNR, respectively). The robustness of the proposed data-driven models was consistently maintained in the real human genome VCF data (R²: 97.8% and 96.5% for RFR and MLR, respectively).
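The R² values quoted in this abstract are coefficients of determination. The abstract does not say how they were computed, so the sketch below simply uses the standard definition, 1 minus the ratio of residual to total sum of squares. The function name `r_squared` and the toy numbers are illustrative assumptions, not data from the study.

```python
from typing import Sequence


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1 - ss_res / ss_tot


# Toy quality scores (made up for illustration only).
observed_qual = [10.0, 20.0, 30.0, 40.0]
predicted_qual = [12.0, 18.0, 33.0, 39.0]
print(f"R^2 = {r_squared(observed_qual, predicted_qual):.3f}")
```

An R² of 96.7% therefore means the model's predictions account for 96.7% of the variance in the observed quality scores.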

