Next Generation Sequencing and Machine Learning Technologies Are Painting the Epigenetic Portrait of Glioblastoma

Abstract. BACKGROUND: Molecular profiling has become essential for tumor risk stratification and treatment selection. However, cancer genome complexity and technical artifacts make identification of real variants a challenge. Currently, clinical laboratories rely on manual screening, which is costly, subjective, and not scalable. We present a machine learning–based method to distinguish artifacts from bona fide single-nucleotide variants (SNVs) detected by next-generation sequencing from nonformalin-fixed paraffin-embedded tumor specimens. METHODS: A cohort of 11278 SNVs identified through clinical sequencing of tumor specimens was collected and divided into training, validation, and test sets. Each SNV was manually inspected and labeled as either real or artifact as part of clinical laboratory workflow. A 3-class (real, artifact, and uncertain) model was developed on the training set, fine-tuned with the validation set, and then evaluated on the test set. Prediction intervals reflecting the certainty of the classifications were derived during the process to label “uncertain” variants. RESULTS: The optimized classifier demonstrated 100% specificity and 97% sensitivity over 5587 SNVs of the test set. Overall, 1252 of 1341 true-positive variants were identified as real and 4143 of 4246 false-positive calls were deemed artifacts, whereas only 192 (3.4%) SNVs were labeled as “uncertain,” with zero misclassification between the true positives and artifacts in the test set. CONCLUSIONS: We presented a computational classifier to identify variant artifacts detected from tumor sequencing. Overall, 96.6% of the SNVs received definitive labels and thus were exempt from manual review. This framework could improve quality and efficiency of the variant review process in clinical laboratories.
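
The test-set figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch below is an illustration only: the counts are taken from the abstract, while `label_variant`, its interval bounds, and the 0.5 threshold are hypothetical stand-ins for a three-way rule and are not the authors' published method.

```python
def label_variant(lower: float, upper: float, threshold: float = 0.5) -> str:
    """Hypothetical 3-way rule: give a definitive label only when the whole
    prediction interval for P(real) lies on one side of the threshold."""
    if lower > threshold:
        return "real"
    if upper < threshold:
        return "artifact"
    return "uncertain"


# Counts reported in the abstract for the test set.
true_variants = 1341     # SNVs manually labeled as real
artifacts = 4246         # SNVs manually labeled as artifacts
called_real = 1252       # real SNVs the classifier labeled "real"
called_artifact = 4143   # artifacts the classifier labeled "artifact"

total = true_variants + artifacts
# With zero misclassification, every remaining SNV must be "uncertain".
uncertain = (true_variants - called_real) + (artifacts - called_artifact)
definitive = called_real + called_artifact

print(total)                                # 5587
print(uncertain)                            # 192
print(round(100 * uncertain / total, 1))    # 3.4
print(round(100 * definitive / total, 1))   # 96.6
print(label_variant(0.35, 0.70))            # uncertain
```

The totals reproduce the reported 5587 SNVs, 192 uncertain calls (3.4%), and 96.6% definitive labels.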

Special Issue Introduction: The Wonders and Mysteries Next Generation Sequencing Technologies Help Reveal

Genes ◽

10.3390/genes9100505 ◽

2018 ◽

Vol 9 (10) ◽

pp. 505

Author(s):

Manfred Grabherr ◽

Bozena Kaminska ◽

Jan Komorowski

Keyword(s):

Machine Learning ◽

Next Generation Sequencing ◽

Dna Sequencing ◽

Computational Power ◽

Next Generation ◽

Special Issue ◽

Learning Methods ◽

Machine Learning Methods ◽

Sequencing Technologies ◽

Generation Sequencing

The massive increase in computational power in recent years and the wider application of machine learning methods, coincidental or not, were paralleled by remarkable advances in high-throughput DNA sequencing technologies.[...]

A new evolutionary rough fuzzy integrated machine learning technique for microRNA selection using next-generation sequencing data of breast cancer

Proceedings of the Genetic and Evolutionary Computation Conference Companion on - GECCO '19 ◽

10.1145/3319619.3326836 ◽

2019 ◽

Author(s):

Jnanendra Prasad Sarkar ◽

Indrajit Saha ◽

Somnath Rakshit ◽

Monalisa Pal ◽

Michal Wlasnowolski ◽

...

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Next Generation Sequencing ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Machine Learning Technique ◽

Learning Technique ◽

Generation Sequencing

SNooPer: a machine learning-based method for somatic variant identification from low-pass next-generation sequencing

BMC Genomics ◽

10.1186/s12864-016-3281-2 ◽

2016 ◽

Vol 17 (1) ◽

Cited By ~ 32

Author(s):

Jean-François Spinella ◽

Pamela Mehanna ◽

Ramon Vidal ◽

Virginie Saillour ◽

Pauline Cassart ◽

...

Keyword(s):

Machine Learning ◽

Next Generation Sequencing ◽

Next Generation ◽

Somatic Variant ◽

Low Pass ◽

Generation Sequencing ◽

Variant Identification

Comparing machine learning and logistic regression methods for predicting hypertension using a combination of gene expression and next-generation sequencing data

BMC Proceedings ◽

10.1186/s12919-016-0020-2 ◽

2016 ◽

Vol 10 (S7) ◽

Cited By ~ 10

Author(s):

Elizabeth Held ◽

Joshua Cape ◽

Nathan Tintle

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Logistic Regression ◽

Next Generation Sequencing ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Regression Methods ◽

Generation Sequencing

The pan-genome of Saccharomyces cerevisiae

FEMS Yeast Research ◽

10.1093/femsyr/foz064 ◽

2019 ◽

Vol 19 (7) ◽

Cited By ~ 2

Author(s):

Gang Li ◽

Boyang Ji ◽

Jens Nielsen

Keyword(s):

Machine Learning ◽

Saccharomyces Cerevisiae ◽

Next Generation Sequencing ◽

High Throughput ◽

Model System ◽

Next Generation ◽

Pan Genome ◽

Excellent Model ◽

High Throughput Phenotyping ◽

Generation Sequencing

ABSTRACT Understanding the genotype–phenotype relationship is fundamental in biology. Thanks to next-generation sequencing and high-throughput phenotyping methodologies, a large amount of genome and phenome data has been generated for Saccharomyces cerevisiae. This makes it an excellent model system for understanding the genotype–phenotype relationship. In this paper, we present the reconstruction of the yeast pan-genome and its application in resolving the genotype–phenotype relationship with a machine learning-assisted approach.

The use of a next-generation sequencing-derived machine-learning risk-prediction model (OncoCast-MPM) for malignant pleural mesothelioma: a retrospective study

The Lancet Digital Health ◽

10.1016/s2589-7500(21)00104-7 ◽

2021 ◽

Author(s):

Marjorie G Zauderer ◽

Axel Martin ◽

Jacklynn Egger ◽

Hira Rizvi ◽

Michael Offin ◽

...

Keyword(s):

Machine Learning ◽

Retrospective Study ◽

Next Generation Sequencing ◽

Prediction Model ◽

Risk Prediction ◽

Malignant Pleural Mesothelioma ◽

Risk Prediction Model ◽

Pleural Mesothelioma ◽

Next Generation ◽

Generation Sequencing

A Simple Deep Learning Approach for Detecting Duplications and Deletions in Next-Generation Sequencing Data

10.1101/657361 ◽

2019 ◽

Author(s):

Tom Hill ◽

Robert L. Unckless

Keyword(s):

Machine Learning ◽

Next Generation Sequencing ◽

Copy Number Variants ◽

Difficult Problem ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

High Coverage ◽

Long Read ◽

Generation Sequencing

Abstract Copy number variants (CNV) are associated with phenotypic variation in several species. However, properly detecting changes in copy numbers of sequences remains a difficult problem, especially in lower quality or lower coverage next-generation sequencing data. Here, inspired by recent applications of machine learning in genomics, we describe a method to detect duplications and deletions in short-read sequencing data. In low coverage data, machine learning appears to be more powerful in the detection of CNVs than the gold-standard methods or coverage estimation alone, and of equal power in high coverage data. We also demonstrate how replicating training sets allows a more precise detection of CNVs, even identifying novel CNVs in two genomes previously surveyed thoroughly for CNVs using long read data. Available at: https://github.com/tomh1lll/dudeml

Exploring the Consistency of the Quality Scores with Machine Learning for Next-Generation Sequencing Experiments

BioMed Research International ◽

10.1155/2020/8531502 ◽

2020 ◽

Vol 2020 ◽

pp. 1-6

Author(s):

Erdal Cosgun ◽

Min Oh

Keyword(s):

Machine Learning ◽

Next Generation Sequencing ◽

Human Genome ◽

Prediction Models ◽

Variant Calling ◽

Real Data ◽

Next Generation ◽

Sequencing Technologies ◽

Massively Parallel Processing ◽

Generation Sequencing

Background. Next-generation sequencing (NGS) enables massively parallel processing, allowing lower cost than other sequencing technologies. In the subsequent analysis of NGS data, one of the major concerns is the reliability of variant calls. Although researchers can utilize the raw quality scores of variant calling, they are forced to start further analysis without any pre-evaluation of the quality scores. Method. We present a machine learning approach for estimating quality scores of variant calls derived from BWA+GATK. We analyzed correlations between the quality score and the GATK annotations, identifying informative annotations that were used as features to predict variant quality scores. To test the predictive models, we simulated 24 sets of paired-end Illumina sequencing reads at 30x coverage. In addition, 24 sets of human genome sequencing reads from Illumina paired-end sequencing with at least 30x coverage were obtained from the Sequence Read Archive. Results. Using BWA+GATK, VCFs were derived from the simulated and real sequencing reads. We observed that the prediction models learned by RFR outperformed other algorithms in both simulated and real data. The quality scores of variant calls were highly predictable from informative features of GATK Annotation Modules in the simulated human genome VCF data (R²: 96.7%, 94.4%, and 89.8% for RFR, MLR, and NNR, respectively). The robustness of the proposed data-driven models was consistently maintained in the real human genome VCF data (R²: 97.8% and 96.5% for RFR and MLR, respectively).
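
The R2 values quoted in this abstract are coefficients of determination between observed and predicted quality scores. As a minimal sketch of how such a figure is computed, the function below uses the standard definition; the observed and predicted values are made-up illustration numbers, not data from the study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Hypothetical variant quality scores (illustration only).
observed_qual = [10.0, 20.0, 30.0, 40.0]
predicted_qual = [12.0, 18.0, 33.0, 39.0]

score = r_squared(observed_qual, predicted_qual)
print(round(score, 3))  # 0.964
```

A score of 0.964 would be reported as 96.4% in the abstract's notation.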
