VarSight: Prioritizing Clinically Reported Variants with Binary Classification Algorithms

2019 ◽  
Author(s):  
James M. Holt ◽  
Brandon Wilk ◽  
Camille L. Birch ◽  
Donna M. Brown ◽  
Manavalan Gajapathy ◽  
...  

Abstract
Motivation: In genomic medicine for rare disease patients, the primary goal is to identify one or more variants that cause their disease. Typically, this is done through filtering and then prioritization of variants for manual curation. However, prioritization of variants in rare disease patients remains a challenging task due to the high degree of variability in phenotype presentation and molecular source of disease. Thus, methods that can identify and/or prioritize variants to be clinically reported in the presence of such variability are of critical importance.
Results: We tested the application of classification algorithms that ingest variant predictions along with phenotype information for predicting whether a variant will ultimately be clinically reported and returned to a patient. To test the classifiers, we performed a retrospective study on variants that were clinically reported to 237 patients in the Undiagnosed Diseases Network. We treated the classifiers as variant prioritization systems and compared them to another variant prioritization algorithm and two single-measure controls. We showed that these classifiers outperformed the other methods, with the best classifier ranking 73% of all reported variants and 97% of reported pathogenic variants in the top 20.
Availability: The scripts used to generate results presented in this paper are available at https://github.com/HudsonAlpha/VarSight. Contact: [email protected]
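As a rough illustration of the approach described in this abstract (not the authors' actual VarSight code), a binary classifier can be trained on variant annotation and phenotype-overlap features, and its predicted probability of "reported" used as a ranking score. The feature names and toy data below are hypothetical placeholders.

```python
# Minimal sketch of classifier-based variant prioritization (not the VarSight pipeline).
# Feature and column names are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Each row is one candidate variant; "reported" marks variants that were clinically returned.
variants = pd.DataFrame({
    "cadd_score":        [32.0, 5.1, 24.3, 0.8],
    "allele_frequency":  [0.0001, 0.20, 0.0005, 0.05],
    "phenotype_overlap": [0.9, 0.1, 0.7, 0.2],   # e.g. HPO term overlap with the patient
    "reported":          [1, 0, 1, 0],
})

X = variants.drop(columns="reported")
y = variants["reported"]

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Use the predicted probability of being reported as the prioritization score.
variants["score"] = clf.predict_proba(X)[:, 1]
ranked = variants.sort_values("score", ascending=False)
print(ranked[["score", "reported"]])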

2019 ◽  
Vol 20 (1) ◽  
Author(s):  
James M. Holt ◽  
Brandon Wilk ◽  
Camille L. Birch ◽  
Donna M. Brown ◽  
...  

Abstract
Background: When applying genomic medicine to a rare disease patient, the primary goal is to identify one or more genomic variants that may explain the patient's phenotypes. Typically, this is done through annotation, filtering, and then prioritization of variants for manual curation. However, prioritization of variants in rare disease patients remains a challenging task due to the high degree of variability in phenotype presentation and molecular source of disease. Thus, methods that can identify and/or prioritize variants to be clinically reported in the presence of such variability are of critical importance.
Methods: We tested the application of classification algorithms that ingest variant annotations along with phenotype information for predicting whether a variant will ultimately be clinically reported and returned to a patient. To test the classifiers, we performed a retrospective study on variants that were clinically reported to 237 patients in the Undiagnosed Diseases Network.
Results: We treated the classifiers as variant prioritization systems and compared them to four variant prioritization algorithms and two single-measure controls. We showed that the trained classifiers outperformed all other tested methods, with the best classifier ranking 72% of all reported variants and 94% of reported pathogenic variants in the top 20.
Conclusions: We demonstrated how freely available binary classification algorithms can be used to prioritize variants even in the presence of real-world variability. Furthermore, these classifiers outperformed all other tested methods, suggesting that they may be well suited for working with real rare disease patient datasets.
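The top-20 figures quoted above correspond to a simple rank-based recall metric. A hedged sketch of how such a metric could be computed per case is shown below; the exact evaluation code is in the authors' repository, and the names here are illustrative.

```python
# Sketch of a "fraction of reported variants ranked in the top N" metric.
# `case_rankings` maps a case ID to an ordered list of candidate variant IDs
# (best first); `reported` maps a case ID to the set of clinically reported variants.
def top_n_recall(case_rankings, reported, n=20):
    hits, total = 0, 0
    for case_id, ranking in case_rankings.items():
        top = set(ranking[:n])
        for variant in reported.get(case_id, set()):
            total += 1
            if variant in top:
                hits += 1
    return hits / total if total else 0.0

# Toy example: one reported variant lands in the top 20, one does not.
rankings = {"case1": [f"v{i}" for i in range(100)]}
reported = {"case1": {"v3", "v57"}}
print(top_n_recall(rankings, reported, n=20))  # 0.5
```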


2020 ◽  
Author(s):  
Phillip A. Richmond ◽  
Tamar V. Av-Shalom ◽  
Oriol Fornes ◽  
Bhavi Modi ◽  
Alison M. Elliott ◽  
...  

Abstract
Mendelian rare genetic diseases affect 5–10% of the population, and with over 5,300 genes responsible for ~7,000 different diseases, they are challenging to diagnose. The use of whole genome sequencing (WGS) has bolstered the diagnosis rate significantly. Effective use of WGS relies upon the ability to identify the disrupted gene responsible for disease phenotypes. This process involves genomic variant calling and prioritization, and it benefits from improvements to sequencing technology, variant calling approaches, and increased capacity to prioritize genomic variants with potential pathogenicity. As analysis pipelines continue to improve, careful testing of their efficacy is paramount. However, real-life cases typically emerge anecdotally, and the use of clinically sensitive and identifiable data for testing pipeline improvements is regulated and therefore limited. We identified the need for a gene-based variant simulation framework that can create mock rare disease scenarios, either from known pathogenic variants or through the creation of novel gene-disrupting variants. To fill this need, we present GeneBreaker, a tool that creates synthetic rare disease cases with utility for benchmarking variant calling approaches, testing the efficacy of variant prioritization, and serving as an educational mechanism for training diagnostic practitioners in the expanding field of genomic medicine. GeneBreaker is freely available at http://GeneBreaker.cmmt.ubc.ca.
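As a highly simplified illustration of what such a simulation involves (this is not GeneBreaker's implementation), a known or invented pathogenic variant can be spiked into a background VCF as an extra record. The coordinates, alleles, and file names below are made up.

```python
# Toy sketch of spiking a pathogenic variant into a background VCF
# (not GeneBreaker's implementation; coordinates and alleles are invented).
background_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tPROBAND",
    "chr1\t10177\t.\tA\tAC\t50\tPASS\t.\tGT\t0/1",
]

# Hypothetical heterozygous gene-disrupting deletion used to simulate the case.
pathogenic_record = "chr7\t117559593\tsim_variant\tATCT\tA\t60\tPASS\tSIMULATED=1\tGT\t0/1"

# A real pipeline would insert the record in coordinate-sorted order and re-index;
# here we simply append it and write the simulated case out.
simulated_case = background_vcf + [pathogenic_record]
with open("simulated_case.vcf", "w") as out:
    out.write("\n".join(simulated_case) + "\n")
```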


2021 ◽  
Author(s):  
Chenjie Zeng ◽  
Lisa A Bastarache ◽  
Ran Tao ◽  
Eric Venner ◽  
Scott Hebbring ◽  
...  

Knowledge of the clinical spectrum of rare genetic disorders helps in disease management and variant pathogenicity interpretation. Leveraging electronic health record (EHR)-linked genetic testing data from the eMERGE network, we determined the associations between a set of 23 hereditary cancer genes and 3,017 phenotypes in 23,544 individuals. This phenome-wide association study replicated 45% (184/406) of known gene-phenotype associations (P = 5.1 × 10⁻¹²⁵). Meta-analysis with an independent EHR-derived cohort of 3,242 patients confirmed 14 novel associations with phenotypes in the neoplastic, genitourinary, digestive, congenital, metabolic, mental and neurologic categories. Phenotype risk scores (PheRS) based on weighted aggregations of EHR phenotypes accurately predicted variant pathogenicity for at least 50% of pathogenic variants for 8/23 genes. We generated a catalog of PheRS for 7,800 variants, including 5,217 variants of uncertain significance, to provide empirical evidence of potential pathogenicity. This study highlights the potential of EHR data in genomic medicine.
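Phenotype risk scores of this kind are typically computed as a weighted sum over disease-relevant phenotypes observed in a patient's EHR, with rarer (more specific) phenotypes weighted more heavily. The sketch below assumes a common weighting choice (negative log of population prevalence); it is illustrative, not the eMERGE implementation, and the phecodes and prevalences are hypothetical.

```python
# Sketch of a phenotype risk score (PheRS) as a weighted sum of disease-relevant
# phenotypes observed in a patient's EHR. Codes and prevalences are illustrative;
# a common weight choice is -log10 of the phenotype's population prevalence.
import math

disease_phenotypes = {          # phecode -> assumed population prevalence
    "185":   0.02,              # e.g. prostate cancer
    "153":   0.03,              # e.g. colorectal cancer
    "211":   0.005,             # e.g. benign digestive neoplasm
}

def phers(patient_phecodes, disease_phenotypes):
    score = 0.0
    for code, prevalence in disease_phenotypes.items():
        if code in patient_phecodes:
            score += -math.log10(prevalence)   # rarer phenotypes contribute more
    return score

print(phers({"185", "401"}, disease_phenotypes))  # only code "185" contributes here
```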


1994 ◽  
Vol 05 (01) ◽  
pp. 95-112 ◽  
Author(s):  
R. CAMPANINI ◽  
G. DI CARO ◽  
M. VILLANI ◽  
I. D’ANTONE ◽  
G. GIUSTI

Genetic algorithms are search or classification algorithms based on natural models. They present a high degree of internal parallelism. We developed two versions, differing in the way the population is organized, and we studied and compared their characteristics and performance when applied to the optimization of multidimensional functions. All the implementations are realized on transputer networks.
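For readers unfamiliar with the basic mechanics, a minimal serial genetic algorithm for minimizing a multidimensional function might look like the sketch below. It only illustrates the standard operators (selection, crossover, mutation) on a toy test function and does not reproduce the paper's parallel transputer-based population schemes.

```python
# Minimal serial genetic algorithm for minimizing a multidimensional test function.
# Illustrative only; the paper's parallel transputer versions are not reproduced.
import random

def sphere(x):                      # simple test function: sum of squares
    return sum(v * v for v in x)

def evolve(dim=5, pop_size=40, generations=100):
    pop = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=sphere)                       # rank by fitness (lower is better)
        survivors = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(survivors):
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, dim)         # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.2:              # occasional Gaussian mutation
                i = random.randrange(dim)
                child[i] += random.gauss(0, 0.5)
            children.append(child)
        pop = survivors + children
    return min(pop, key=sphere)

best = evolve()
print(best, sphere(best))
```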


Linguistics ◽  
2018 ◽  
Vol 56 (2) ◽  
pp. 361-400 ◽  
Author(s):  
Jack Hoeksema

Abstract This paper presents Dutch and English predicates that behave as positive polarity items and provides a partial, semantically-grounded classification of this group of PPIs. The items are studied from the perspective of anti-licensing behavior (by negation, either locally or long-distance, in questions, and by weakly negative quantifiers such as little and few). Predicates, unlike quantifiers, do not have wide scope readings (which allow quantificational PPIs such as somebody to appear in the syntactic scope of negation). Using a mixture of corpus data and introspective judgments, we show that anti-licensing among PPIs is not uniform (mirroring earlier results on NPIs which likewise show considerable variation). Rescuing contexts are likewise shown to differ among PPIs. Some of the PPI predicates show complex interaction with illocutionary force (especially mandative force), and others with differences between presupposed and asserted propositions. High degree predicates, finally, point toward the existence of connections between the marking of degree and positive polarity. PPI status is argued to be the result of a complex interaction between the effects of negation and other nonveridical operators, and other semantic factors, which differ among subclasses of PPIs. Anti-licensing by weak negation correlates fairly well with anti-licensing by long-distance negation, a finding which is (partly) in line with a recent proposal by Spector (2014, Global positive polarity items and obligatory exhaustivity. Semantics and Pragmatics 7(11). 1–61) concerning global PPIs. However, we find there to be more variation among the PPIs studied here than the classification of Spector (2014) or any binary classification stipulates.


2021 ◽  
Author(s):  
Jason Meil

The data preparation process generally consumes up to 80% of a data scientist's time, with 60% of that being attributed to cleaning and labeling data [1]. Our solution is to use automated pipelines to prepare, annotate, and catalog data. The first step upon ingestion, especially in the case of real-world unstructured and unlabeled datasets, is to leverage Snorkel, a tool specifically designed around a paradigm to rapidly create, manage, and model training data. Configured properly, Snorkel can be leveraged to temper this labeling bottleneck through a process called weak supervision. Weak supervision uses programmatic labeling functions (heuristics, distant supervision, SME input, or a knowledge base) scripted in Python to generate "noisy labels". Each function traverses the entirety of the dataset, and the labeled data are fed into a generative (conditionally probabilistic) model. The function of this model is to output the distribution of each response variable and predict the conditional probability based on a joint probability distribution algorithm. This is done by comparing the various labeling functions and the degree to which their outputs are congruent with each other. A single labeling function that has a high degree of congruence with other labeling functions will have a high learned accuracy, that is, the fraction of predictions that the model got right. Conversely, labeling functions that have a low degree of congruence with other functions will have low learned accuracy. The predictions are then combined by estimated weighted accuracy, whereby the predictions of functions with higher learned accuracy count more. The result is a transformation from a binary classification of 0 or 1 to a fuzzy label between 0 and 1: there is probability "x" that, based on heuristic "n", the response variable is "y". As data are added to this generative model, multi-class inference is made over the response variables positive, negative, or abstain, assigning probabilistic labels to potentially millions of data points. Thus, we have generated a discriminative ground truth for all further labeling efforts and have improved the scalability of our models. The labeling functions can then be applied to new unlabeled data to further machine learning efforts.

Once our datasets are labeled and a ground truth is established, we need to persist the data into our delta lake, since it combines the most performant aspects of a warehouse with the low-cost storage of a data lake. In addition, the lake can accept unstructured, semi-structured, or structured data sources, and those sources can be further aggregated into raw ingestion, cleaned, and feature-engineered data layers. By sectioning off the data sources into these "layers", the data engineering portion is abstracted away from the data scientist, who can access model-ready data at any time. Data can be ingested via batch or stream.

The design of the entire ecosystem is to eliminate as much technical debt in machine learning paradigms as possible in terms of configuration, data collection, verification, governance, extraction, analytics, process management, resource management, infrastructure, monitoring, and post-verification.
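A minimal sketch of the weak-supervision step described above, using Snorkel's public labeling API (labeling functions, a Pandas applier, and the generative LabelModel). The heuristics and the tiny dataset here are invented; a real deployment would use many more labeling functions over the ingested data.

```python
# Minimal weak-supervision sketch with Snorkel (0.9.x-style API).
# The heuristics and the tiny dataset are invented for illustration.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NEG, POS = -1, 0, 1

@labeling_function()
def lf_contains_great(x):
    return POS if "great" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_contains_terrible(x):
    return NEG if "terrible" in x.text.lower() else ABSTAIN

df = pd.DataFrame({"text": ["great product", "terrible service", "it was great", "okay"]})

# Apply every labeling function to every example to build the noisy label matrix.
applier = PandasLFApplier(lfs=[lf_contains_great, lf_contains_terrible])
L_train = applier.apply(df=df)

# The generative LabelModel estimates each labeling function's accuracy from their
# agreements and outputs probabilistic ("fuzzy") labels between 0 and 1.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=200, seed=0)
probs = label_model.predict_proba(L=L_train)
print(probs)
```

These probabilistic labels are what would then be persisted into the lake's cleaned layer and used to train downstream discriminative models.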


2019 ◽  
Vol 11 (1) ◽  
Author(s):  
Elias L. Salfati ◽  
Emily G. Spencer ◽  
Sarah E. Topol ◽  
Evan D. Muse ◽  
Manuel Rueda ◽  
...  

Abstract
Background: Whole-exome sequencing (WES) has become an efficient diagnostic test for patients with likely monogenic conditions such as rare idiopathic diseases or sudden unexplained death. Yet, many cases remain undiagnosed. Here, we report the added diagnostic yield achieved for 101 WES cases re-analyzed 1 to 7 years after initial analysis.
Methods: Of the 101 WES cases, 51 were rare idiopathic disease cases and 50 were postmortem "molecular autopsy" cases of early sudden unexplained death. Variants considered for reporting were prioritized and classified into three groups: (1) diagnostic variants, i.e., pathogenic and likely pathogenic variants in genes known to cause the phenotype of interest; (2) possibly diagnostic variants, i.e., possibly pathogenic variants in genes known to cause the phenotype of interest or pathogenic variants in genes possibly causing the phenotype of interest; and (3) variants of uncertain diagnostic significance, i.e., potentially deleterious variants in genes possibly causing the phenotype of interest.
Results: Initial analysis revealed diagnostic variants in 13 rare disease cases (25.4%) and 5 sudden death cases (10%). Re-analysis resulted in the identification of additional diagnostic variants in 3 rare disease cases (5.9%) and 1 sudden unexplained death case (2%), which increased our molecular diagnostic yield to 31.4% and 12%, respectively.
Conclusions: The new findings stemmed from improvements in variant classification tools, updated genetic databases, and updated clinical phenotypes. Our findings highlight the potential for re-analysis to reveal diagnostic variants in cases that remain undiagnosed after initial WES.
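The three reporting groups in the Methods can be read as a simple decision rule over a variant's pathogenicity assessment and its gene-phenotype match. The sketch below expresses that rule as a lookup; the labels and flags are inputs that would come from curation, and this is illustrative rather than the authors' code.

```python
# Illustrative encoding of the three reporting tiers described in the Methods
# (not the authors' code); inputs would come from variant and gene curation.
def reporting_tier(pathogenicity, gene_known_for_phenotype, gene_possibly_causal):
    if pathogenicity in ("pathogenic", "likely pathogenic") and gene_known_for_phenotype:
        return "diagnostic"
    if (pathogenicity == "possibly pathogenic" and gene_known_for_phenotype) or \
       (pathogenicity == "pathogenic" and gene_possibly_causal):
        return "possibly diagnostic"
    if pathogenicity == "potentially deleterious" and gene_possibly_causal:
        return "uncertain diagnostic significance"
    return "not reportable"

print(reporting_tier("likely pathogenic", True, False))   # diagnostic
```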


Stats ◽  
2020 ◽  
Vol 3 (4) ◽  
pp. 427-443
Author(s):  
Gildas Tagny-Ngompé ◽  
Stéphane Mussard ◽  
Guillaume Zambrano ◽  
Sébastien Harispe ◽  
Jacky Montmain

This paper presents and compares several text classification models that can be used to extract the outcome of a judgment from justice decisions, i.e., legal documents summarizing the different rulings made by a judge. Such models can be used to gather important statistics about cases, e.g., the success rate based on specific characteristics of the cases' parties or jurisdiction, and are therefore important for the development of judicial prediction, not to mention the study of law enforcement in general. We propose in particular the generalized Gini-PLS, which better considers the information in the distribution tails while attenuating, as in the simple Gini-PLS, the influence exerted by outliers. Modeling the studied task as supervised binary classification, we also introduce the LOGIT-Gini-PLS, suited to the explanation of a binary target variable. In addition, various technical aspects of the evaluated text classification approaches, which consist of combinations of representations of judgments and classification algorithms, are studied using an annotated corpus of French justice decisions.
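The generalized Gini-PLS and LOGIT-Gini-PLS models are the paper's own contribution and are not reproduced here. As context for the task setup (binary classification of judgment outcomes from text), a conventional baseline might look like the following sketch; the example snippets and labels are invented.

```python
# Baseline sketch of the task setup (binary outcome classification from judgment text).
# The paper's Gini-PLS variants are not implemented here; this is a standard
# TF-IDF + logistic regression pipeline on invented French snippets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "la cour accueille la demande du requérant",
    "la cour rejette la demande",
    "il est fait droit à la demande",
    "la demande est rejetée comme irrecevable",
]
outcomes = [1, 0, 1, 0]   # 1 = claim accepted, 0 = claim rejected

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, outcomes)
print(model.predict(["la cour rejette la demande du requérant"]))
```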


2020 ◽  
Vol 5 (1) ◽  
Author(s):  
Timo Lassmann ◽  
Richard W. Francis ◽  
Alexia Weeks ◽  
Dave Tang ◽  
Sarra E. Jamieson ◽  
...  

Abstract
Exome sequencing has enabled molecular diagnoses for rare disease patients, but often with initial diagnostic rates of ~25−30%. Here we develop a robust computational pipeline to rank variants for reassessment of unsolved rare disease patients. A comprehensive web-based patient report is generated in which all deleterious variants can be filtered by gene, variant characteristics, OMIM disease, and Phenolyzer scores, and all are annotated with an ACMG classification and links to ClinVar. The pipeline ranked 21/34 previously diagnosed variants in the top position, with 26 in total ranked ≤7th and 3 ranked ≥13th; 5 failed the pipeline filters. Pathogenic/likely pathogenic variants by ACMG criteria were identified for 22/145 unsolved cases, and a previously undefined candidate disease variant for 27/145. This open access pipeline supports the partnership between clinical and research laboratories to improve the diagnosis of unsolved exomes. It provides a flexible framework for iterative developments to further improve diagnosis.
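A hedged sketch of the kind of filter-and-rank step such a report supports (filtering on annotations like OMIM membership and ranking by a gene-phenotype score) is shown below. The column names are hypothetical, not the pipeline's actual schema.

```python
# Illustrative filter-and-rank over an annotated variant table
# (column names are hypothetical; the actual report exposes similar fields interactively).
import pandas as pd

variants = pd.DataFrame({
    "gene":             ["GENE1", "GENE2", "GENE3"],
    "acmg_class":       ["Pathogenic", "VUS", "Likely pathogenic"],
    "phenolyzer_score": [0.92, 0.10, 0.55],
    "in_omim_disease":  [True, False, True],
})

# Keep variants in OMIM disease genes, then rank by the gene-phenotype score.
candidates = (
    variants[variants["in_omim_disease"]]
    .sort_values("phenolyzer_score", ascending=False)
)
print(candidates[["gene", "acmg_class", "phenolyzer_score"]])
```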

