Using Machine Learning to Identify True Somatic Variants from Next-Generation Sequencing

Abstract. Background: Molecular profiling has become essential for tumor risk stratification and treatment selection. However, cancer genome complexity and technical artifacts make identification of real variants a challenge. Currently, clinical laboratories rely on manual screening, which is costly, subjective, and not scalable. Here we present a machine learning-based method to distinguish artifacts from bona fide single nucleotide variants (SNVs) detected by NGS from tumor specimens. Methods: A cohort of 11,278 SNVs identified through clinical sequencing of tumor specimens was collected and divided into training, validation, and test sets. Each SNV was manually inspected and labeled as either real or artifact as part of clinical laboratory workflow. A three-class (real, artifact, and uncertain) model was developed on the training set, fine-tuned using the validation set, and then evaluated on the test set. Prediction intervals reflecting the certainty of the classifications were derived during the process to label “uncertain” variants. Results: The optimized classifier demonstrated 100% specificity and 97% sensitivity over 5,587 SNVs of the test set. Of 1,341 true positive variants, 1,252 were identified as real; of 4,246 false positive calls, 4,143 were deemed artifacts; and only 192 (3.4%) SNVs were labeled as “uncertain”, with zero misclassification between the true positives and artifacts in the test set. Conclusions: We presented a computational classifier to identify variant artifacts detected from tumor sequencing. Overall, 96.6% of the SNVs received a definitive label and thus were exempt from manual review. This framework could improve the quality and efficiency of the variant review process in clinical labs.

Using Machine Learning to Identify True Somatic Variants from Next-Generation Sequencing

Clinical Chemistry ◽

10.1373/clinchem.2019.308213 ◽

2019 ◽

Vol 66 (1) ◽

pp. 239-246 ◽

Cited By ~ 1

Author(s):

Chao Wu ◽

Xiaonan Zhao ◽

Mark Welsh ◽

Kellianne Costello ◽

Kajia Cao ◽

...

Keyword(s):

Machine Learning ◽

Next Generation Sequencing ◽

Clinical Laboratory ◽

Next Generation ◽

Single Nucleotide Variants ◽

Test Set ◽

Clinical Laboratories ◽

Bona Fide ◽

Validation Set ◽

Generation Sequencing

Abstract BACKGROUND Molecular profiling has become essential for tumor risk stratification and treatment selection. However, cancer genome complexity and technical artifacts make identification of real variants a challenge. Currently, clinical laboratories rely on manual screening, which is costly, subjective, and not scalable. We present a machine learning–based method to distinguish artifacts from bona fide single-nucleotide variants (SNVs) detected by next-generation sequencing from nonformalin-fixed paraffin-embedded tumor specimens. METHODS A cohort of 11278 SNVs identified through clinical sequencing of tumor specimens was collected and divided into training, validation, and test sets. Each SNV was manually inspected and labeled as either real or artifact as part of clinical laboratory workflow. A 3-class (real, artifact, and uncertain) model was developed on the training set, fine-tuned with the validation set, and then evaluated on the test set. Prediction intervals reflecting the certainty of the classifications were derived during the process to label “uncertain” variants. RESULTS The optimized classifier demonstrated 100% specificity and 97% sensitivity over 5587 SNVs of the test set. Overall, 1252 of 1341 true-positive variants were identified as real, 4143 of 4246 false-positive calls were deemed artifacts, whereas only 192 (3.4%) SNVs were labeled as “uncertain,” with zero misclassification between the true positives and artifacts in the test set. CONCLUSIONS We presented a computational classifier to identify variant artifacts detected from tumor sequencing. Overall, 96.6% of the SNVs received definitive labels and thus were exempt from manual review. This framework could improve quality and efficiency of the variant review process in clinical laboratories.
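The three-class idea and the test-set arithmetic in this abstract can be sketched in a few lines of Python. This is only an illustration: the abstract derives the "uncertain" label from prediction intervals, whereas the sketch below swaps in two plain probability cut-offs, and the function name `label_variant` and the cut-off values 0.1 and 0.9 are assumptions, not values from the paper. The counts are the ones quoted in the abstract.

```python
def label_variant(p_real: float, lower: float = 0.1, upper: float = 0.9) -> str:
    """Map a predicted probability of 'real' to one of three labels.

    The cut-offs are hypothetical; the paper uses prediction intervals,
    which are not reproduced here.
    """
    if not 0.0 <= p_real <= 1.0:
        raise ValueError("p_real must lie in [0, 1]")
    if p_real >= upper:
        return "real"
    if p_real <= lower:
        return "artifact"
    return "uncertain"  # left for manual review


# Test-set counts quoted in the abstract.
true_positives_total = 1341   # manually confirmed real variants
called_real = 1252            # of those, labeled "real" by the classifier
false_positives_total = 4246  # manually confirmed artifacts
called_artifact = 4143        # of those, labeled "artifact" by the classifier

total = true_positives_total + false_positives_total
uncertain = total - called_real - called_artifact
uncertain_pct = round(100 * uncertain / total, 1)
definitive_pct = round(100 * (total - uncertain) / total, 1)

print(total, uncertain, uncertain_pct, definitive_pct)
```

The totals reproduce the figures in the abstract: 5587 test SNVs, 192 of them (3.4%) labeled "uncertain", and 96.6% receiving a definitive label.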

Basic principles of genetic disease

ESC CardioMed ◽

10.1093/med/9780198784906.003.0148 ◽

2018 ◽

pp. 669-671

Author(s):

Eric Schulze-Bahr

Keyword(s):

Genetic Disease ◽

Copy Number ◽

Copy Number Variants ◽

Single Nucleotide Variants ◽

Individual Genome ◽

Base Pairs ◽

Single Nucleotide ◽

Basic Principles ◽

Bona Fide ◽

Genomic Regions

The human genome consists of approximately 3 billion (3 × 10⁹) base pairs of DNA (around 20,000 genes), organized as 23 chromosomes (diploid parental set), and a small mitochondrial genome (37 genes, including 13 proteins; 16,589 base pairs) of maternal origin. Most human genetic variation is natural, that is, common or rare (minor allele frequency >0.1%), and does not cause disease. Apart from any true disease-causing (bona fide) mutation, each individual genome harbours more than 3.5 million single nucleotide variants (including >10,000 non-synonymous changes causing amino acid substitutions) and 200–300 large structural or copy number variants (insertions/deletions of up to several thousand base pairs) that do not cause disease and are scattered throughout coding and non-coding genomic regions.

Applying machine learning to detect early stages of cardiac remodelling and dysfunction

European Heart Journal - Cardiovascular Imaging ◽

10.1093/ehjci/jeaa135 ◽

2020 ◽

Cited By ~ 1

Author(s):

František Sabovčik ◽

Nicholas Cauwenberghs ◽

Dmitry Kouznetsov ◽

Francois Haddad ◽

Amparo Alonso-Betanzos ◽

...

Keyword(s):

Machine Learning ◽

Clinical Laboratory ◽

Characteristic Curve ◽

Left Ventricular ◽

Support Vector ◽

High Area ◽

Echocardiographic Examination ◽

Validation Set ◽

Set Up ◽

Lv Diastolic Dysfunction

Abstract Aims Both left ventricular (LV) diastolic dysfunction (LVDD) and hypertrophy (LVH) as assessed by echocardiography are independent prognostic markers of future cardiovascular events in the community. However, selective screening strategies to identify individuals at risk who would benefit most from cardiac phenotyping are lacking. We, therefore, assessed the utility of several machine learning (ML) classifiers built on routinely measured clinical, biochemical, and electrocardiographic features for detecting subclinical LV abnormalities. Methods and results We included 1407 participants (mean age, 51 years, 51% women) randomly recruited from the general population. We used echocardiographic parameters reflecting LV diastolic function and structure to define LV abnormalities (LVDD, n = 252; LVH, n = 272). Next, four supervised ML algorithms (XGBoost, AdaBoost, Random Forest (RF), Support Vector Machines, and Logistic regression) were used to build classifiers based on clinical data (67 features) to categorize LVDD and LVH. We applied a nested 10-fold cross-validation set-up. XGBoost and RF classifiers exhibited a high area under the receiver operating characteristic curve with values between 86.2% and 88.1% for predicting LVDD and between 77.7% and 78.5% for predicting LVH. Age, body mass index, different components of blood pressure, history of hypertension, antihypertensive treatment, and various electrocardiographic variables were the top selected features for predicting LVDD and LVH. Conclusion XGBoost and RF classifiers combining routinely measured clinical, laboratory, and electrocardiographic data predicted LVDD and LVH with high accuracy. These ML classifiers might be useful to pre-select individuals in whom further echocardiographic examination, monitoring, and preventive measures are warranted.
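The nested cross-validation set-up mentioned in this abstract can be illustrated with a short, dependency-free Python sketch. It only generates the index splits: each outer fold holds out a test portion, and the remaining samples are split again into inner folds that would be used for model selection. The helper names `k_fold` and `nested_k_fold`, the inner fold count of 10, and the random seed are assumptions for illustration; the abstract states only that a nested 10-fold set-up was used on 1407 participants.

```python
import random
from typing import Iterator, List, Tuple


def k_fold(indices: List[int], k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, held_out) index lists for k roughly equal folds."""
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out


def nested_k_fold(n_samples: int, outer_k: int = 10, inner_k: int = 10, seed: int = 0):
    """Yield (outer_train, outer_test, inner_splits) for nested cross-validation.

    inner_splits are built from outer_train only, so the outer test fold
    never influences hyperparameter tuning.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for outer_train, outer_test in k_fold(indices, outer_k):
        inner_splits = list(k_fold(outer_train, inner_k))
        yield outer_train, outer_test, inner_splits


# 1407 is the number of participants reported in the abstract.
splits = list(nested_k_fold(1407))
print(len(splits), [len(test) for _, test, _ in splits])
```

In practice a library routine would usually replace these helpers; the point of the sketch is that tuning happens inside the inner folds while performance is reported on the outer test folds.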

A Combined Static and Dynamic Analysis Approach to Detect Malicious Browser Extensions

Security and Communication Networks ◽

10.1155/2018/7087239 ◽

2018 ◽

Vol 2018 ◽

pp. 1-16

Author(s):

Yao Wang ◽

Wandong Cai ◽

Pin Lyu ◽

Wei Shao

Keyword(s):

Machine Learning ◽

False Positive Rate ◽

Feature Selection Method ◽

Machine Learning Techniques ◽

Security Risk ◽

Test Set ◽

Static And Dynamic Analysis ◽

Detection Model ◽

Validation Set ◽

Browser Extensions

Ill-intentioned browser extensions pose an emergent security risk and have become one of the most common attack vectors on the Internet due to their wide popularity and high privilege. Once installed, malicious extensions are executed and attempt to compromise a victim’s browser. To detect malicious browser extensions, security researchers have put forward several techniques. These techniques primarily concentrate on the usage of API calls by malicious extensions, imposing restricted policies for extensions, and monitoring extensions’ activities. In this paper, we propose a machine-learning-based approach to detect malicious extensions. We apply static and dynamic techniques to analyse an extension and extract features. The analysis process extracts features from the source code, including JavaScript code, HTML pages, and CSS files, and from the execution activities of an extension. To guarantee the robustness of the features, a feature selection method is then applied to retain the most relevant features while discarding low-correlated features. The detection models based on machine-learning techniques are subsequently constructed by leveraging these features. As the evaluation results show, our detection model, built on a dataset of over 4,600 labelled extension samples, is able to detect malicious extensions with an accuracy of 96.52% on the validation set and 95.18% on the test set, with a false positive rate of 2.38% on the validation set and 3.66% on the test set.

Reducing Sanger confirmation testing through false positive prediction algorithms

Genetics in Medicine ◽

10.1038/s41436-021-01148-3 ◽

2021 ◽

Author(s):

James M. Holt ◽

Melissa Kelly ◽

Brett Sundlof ◽

Ghunwa Nakouzi ◽

David Bick ◽

...

Keyword(s):

Machine Learning ◽

False Positive ◽

Turnaround Time ◽

Learning Models ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

False Positive Prediction ◽

Confirmatory Testing ◽

Reference Human Genome ◽

Machine Learning Models

Abstract Purpose Clinical genome sequencing (cGS) followed by orthogonal confirmatory testing is standard practice. While orthogonal testing significantly improves specificity, it also results in increased turnaround time and cost of testing. The purpose of this study is to evaluate machine learning models trained to identify false positive variants in cGS data to reduce the need for orthogonal testing. Methods We sequenced five reference human genome samples characterized by the Genome in a Bottle Consortium (GIAB) and compared the results with an established set of variants for each genome referred to as a truth set. We then trained machine learning models to identify variants that were labeled as false positives. Results After training, the models identified 99.5% of the false positive heterozygous single-nucleotide variants (SNVs) and heterozygous insertions/deletions variants (indels) while reducing confirmatory testing of nonactionable, nonprimary SNVs by 85% and indels by 75%. Employing the algorithm in clinical practice reduced overall orthogonal testing using dideoxynucleotide (Sanger) sequencing by 71%. Conclusion Our results indicate that a low false positive call rate can be maintained while significantly reducing the need for confirmatory testing. The framework that generated our models and results is publicly available at https://github.com/HudsonAlpha/STEVE.

Machine learning methods applied to genotyping data capture interactions between single nucleotide variants in late onset Alzheimer's disease

10.1101/2021.08.30.21262815 ◽

2021 ◽

Author(s):

Magdalena Arnal Segura ◽

Dietmar Fernandez ◽

Claudia Giambartolomei ◽

Giorgio Bini ◽

Eleftherios Samaras ◽

...

Keyword(s):

Machine Learning ◽

Alzheimer's Disease ◽

Late Onset ◽

Hot Spot ◽

Association Studies ◽

Machine Learning Algorithms ◽

Genome Wide Association Studies ◽

Single Nucleotide Variants ◽

Single Nucleotide

INTRODUCTION Genome-wide association studies (GWAS) in late onset Alzheimer's disease (LOAD) provide lists of individual genetic determinants. However, GWAS are not good at capturing the synergistic effects among multiple genetic variants and lack good specificity. METHODS We applied tree-based machine learning algorithms (MLs) to discriminate LOAD (> 700 individuals) and age-matched unaffected subjects using single nucleotide variants (SNVs) from AD studies, obtaining specific genomic profiles with the prioritized SNVs. RESULTS The MLs prioritized a set of SNVs located in close proximity genes PVRL2, TOMM40, APOE and APOC1. The captured genomic profiles in this region showed a clear interaction between rs405509 and rs1160985. Additionally, rs405509 located in APOE promoter interacts with rs429358 among others, seemingly neutralizing their predisposing effect. Interactions are characterized by their association with specific comorbidities and the presence of eQTL and sQTLs. DISCUSSION Our approach efficiently discriminates LOAD from controls, capturing genomic profiles defined by interactions among SNVs in a hot-spot region.

Margin-Based Pareto Ensemble Pruning: An Ensemble Pruning Algorithm That Learns to Search Optimized Ensembles

Computational Intelligence and Neuroscience ◽

10.1155/2019/7560872 ◽

2019 ◽

Vol 2019 ◽

pp. 1-12 ◽

Cited By ~ 2

Author(s):

Ruihan Hu ◽

Songbin Zhou ◽

Yisen Liu ◽

Zhiri Tang

Keyword(s):

Machine Learning ◽

State Of The Art ◽

Classification Performance ◽

Test Set ◽

Pruning Algorithm ◽

Ensemble Pruning ◽

Learning Framework ◽

Classification Tasks ◽

Validation Set ◽

Definition Of

The ensemble pruning system is an effective machine learning framework that combines several learners as experts to classify a test set. Generally, ensemble pruning systems aim to define a region of competence based on the validation set to select the most competent ensembles from the ensemble pool with respect to the test set. However, the size of the ensemble pool is usually fixed, and the performance of an ensemble pool heavily depends on the definition of the region of competence. In this paper, a dynamic pruning framework called margin-based Pareto ensemble pruning is proposed for ensemble pruning systems. The framework explores the optimized ensemble pool size during the overproduction stage and finetunes the experts during the pruning stage. The Pareto optimization algorithm is used to explore the size of the overproduction ensemble pool that can result in better performance. Considering the information entropy of the learners in the indecision region, the marginal criterion for each learner in the ensemble pool is calculated using margin criterion pruning, which prunes the experts with respect to the test set. The effectiveness of the proposed method for classification tasks is assessed using datasets. The results show that margin-based Pareto ensemble pruning can achieve smaller ensemble sizes and better classification performance in most datasets when compared with state-of-the-art models.

Estimating the Frequency of Single Point Driver Mutations across Common Solid Tumours

Scientific Reports ◽

10.1038/s41598-019-48765-2 ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 3

Author(s):

Madeleine Darbyshire ◽

Zachary du Toit ◽

Mark F. Rogers ◽

Tom R. Gaunt ◽

Colin Campbell

Keyword(s):

Machine Learning ◽

Human Cancer ◽

Single Point ◽

Cancer Genome ◽

Solid Tumours ◽

Driver Mutations ◽

Cancer Type ◽

Single Nucleotide Variants ◽

Driver Genes ◽

Single Nucleotide

Abstract For cancers, such as common solid tumours, variants in the genome give a selective growth advantage to certain cells. It has recently been argued that the mean count of coding single nucleotide variants acting as disease-drivers in common solid tumours is frequently small in size, but significantly variable by cancer type (hypermutation is excluded from this study). In this paper we investigate this proposal through the use of integrative machine-learning-based classifiers we have proposed recently for predicting the disease-driver status of single nucleotide variants (SNVs) in the human cancer genome. We find that predicted driver counts are compatible with this proposal, have similar variabilities by cancer type and, to a certain extent, the drivers are identifiable by these machine learning methods. We further discuss predicted driver counts stratified by stage of disease and driver counts in non-coding regions of the cancer genome, in addition to driver-genes.

A machine learning model for screening of body fluid cytology smears

10.1101/2021.07.20.453010 ◽

2021 ◽

Author(s):

Parikshit Sanyal ◽

Sayak Paul ◽

Vandana Rana ◽

Kanchan Kulhari

Keyword(s):

Machine Learning ◽

Body Fluid ◽

Learning Model ◽

Malignant Cells ◽

Training Set ◽

Test Set ◽

White Balance ◽

Machine Learning Model ◽

Validation Set ◽

Fluid Cytology

Introduction: Body fluid cytology is one of the commonest investigations performed in inpatients, both for diagnosis of suspected carcinoma and for staging of known carcinoma. Carcinoma is diagnosed in body fluids by the pathologist through microscopic examination and searching for malignant epithelial cell clusters. Screening body fluid smears is time-consuming and error-prone. Aim: We have attempted to construct a machine learning model which can screen body fluid cytology smears for malignant cells. Materials and methods: MGG-stained ascitic/pleural fluid cytology smears from 21 cases (14 malignant, 7 benign) were included in this study. A total of 693 microphotographs were taken at 40x magnification at the same illumination and after correction of white balance. A Magnus Microphotography system was used for photography. The images were split into the training set (195 images), test set (120 images) and validation set (378 images). A machine learning model, a convolutional neural network, was developed in the Python programming language using the Keras deep learning library. The model was trained with the images of the training set. After completion of training, the model was evaluated on the test set of images. Results: Evaluation of the model on the test set produced a sensitivity of 97.87%, specificity of 85.26%, PPV of 95.18%, and NPV of 93.10%. In 6 images, the model failed to detect singly scattered malignant cells or clusters. The model reported 14 (3.7%) false positives. The machine learning model shows potential utility as a screening tool. However, it needs improvement in detecting singly scattered malignant cells and filtering inflammatory infiltrate.
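For readers less familiar with the four screening metrics quoted in this abstract, the following Python sketch shows how each is computed from a confusion matrix. The function name `screening_metrics` and the counts passed to it are hypothetical round numbers chosen for illustration; they are not the study's data, which the abstract does not report in full.

```python
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute sensitivity, specificity, PPV and NPV from confusion-matrix counts."""
    if min(tp + fn, tn + fp, tp + fp, tn + fn) == 0:
        raise ValueError("each denominator must be non-zero")
    return {
        "sensitivity": tp / (tp + fn),  # malignant images correctly flagged
        "specificity": tn / (tn + fp),  # benign images correctly cleared
        "ppv": tp / (tp + fp),          # flagged images that are truly malignant
        "npv": tn / (tn + fn),          # cleared images that are truly benign
    }


# Hypothetical counts, NOT taken from the study.
metrics = screening_metrics(tp=90, fp=20, tn=80, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

For a screening tool, sensitivity and NPV matter most, because a missed malignant smear (a false negative) is costlier than a benign smear sent for unnecessary review.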

Reducing Sanger Confirmation Testing through False Positive Prediction Algorithms

10.1101/2020.04.30.066159 ◽

2020 ◽

Author(s):

James M. Holt ◽

Melissa Wilk ◽

Brett Sundlof ◽

Ghunwa Nakouzi ◽

David Bick ◽

...

Keyword(s):

Machine Learning ◽

False Positive ◽

Learning Models ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

False Positive Prediction ◽

Confirmatory Testing ◽

Clinical Genome Sequencing ◽

Reference Human Genome ◽

Machine Learning Models

Abstract. Purpose: Clinical genome sequencing (cGS) followed by orthogonal confirmatory testing is standard practice. While orthogonal testing significantly improves specificity it also results in increased turn-around-time and cost of testing. The purpose of this study is to evaluate machine learning models trained to identify false positive variants in cGS data to reduce the need for orthogonal testing. Methods: We sequenced five reference human genome samples characterized by the Genome in a Bottle Consortium (GIAB) and compared the results to an established set of variants for each genome referred to as a ‘truth-set’. We then trained machine learning models to identify variants that were labeled as false positives. Results: After training, the models identified 99.5% of the false positive heterozygous single nucleotide variants (SNVs) and heterozygous insertions/deletions variants (indels) while reducing confirmatory testing of true positive SNVs to 1.67% and indels to 20.29%. Employing the algorithm in clinical practice reduced orthogonal testing using dideoxynucleotide (Sanger) sequencing by 78.22%. Conclusion: Our results indicate that a low false positive call rate can be maintained while significantly reducing the need for confirmatory testing. The framework that generated our models and results is publicly available at https://github.com/HudsonAlpha/STEVE.
