CADD-Splice—improving genome-wide variant effect prediction using deep learning-derived splice scores

2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Philipp Rentzsch ◽  
Max Schubach ◽  
Jay Shendure ◽  
Martin Kircher

Background: Splicing of genomic exons into mRNAs is a critical prerequisite for the accurate synthesis of human proteins. Genetic variants impacting splicing underlie a substantial proportion of genetic disease, but are challenging to identify beyond those occurring at donor and acceptor dinucleotides. To address this, various methods aim to predict variant effects on splicing. Recently, deep neural networks (DNNs) have been shown to achieve better results in predicting splice variants than other strategies. Methods: It has been unclear how best to integrate such process-specific scores into genome-wide variant effect predictors. Here, we use a recently published experimental data set to compare several machine learning methods that score variant effects on splicing. We integrate the best of those approaches into general variant effect prediction models and observe the effect on classification of known pathogenic variants. Results: We integrate two specialized splicing scores into CADD (Combined Annotation Dependent Depletion; cadd.gs.washington.edu), a widely used tool for genome-wide variant effect prediction that we previously developed to weight and integrate diverse collections of genomic annotations. With this new model, CADD-Splice, we show that inclusion of splicing DNN effect scores substantially improves predictions across multiple variant categories, without compromising overall performance. Conclusions: While splice effect scores show superior performance on splice variants, specialized predictors cannot compete with other variant scores in general variant interpretation, as the latter account for nonsense and missense effects that do not alter splicing. Although only shown here for splice scores, we believe that the applied approach will generalize to other specific molecular processes, providing a path for the further improvement of genome-wide variant effect prediction.

2019 ◽  
Vol 21 (9) ◽  
pp. 662-669 ◽  
Author(s):  
Junnan Zhao ◽  
Lu Zhu ◽  
Weineng Zhou ◽  
Lingfeng Yin ◽  
Yuchen Wang ◽  
...  

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade, which is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study was carried out to predict Ki values of thrombin inhibitors from a large data set using machine learning methods. Because machine learning can find non-intuitive regularities in high-dimensional datasets, it can be used to build effective predictive models. A total of 6554 descriptors were collected for each compound, and an efficient descriptor selection method was chosen to find the appropriate descriptors. Four methods, namely multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT) and Support Vector Machine (SVM), were implemented to build prediction models with the selected descriptors. Results: The SVM model performed best among these methods, with R² = 0.84 and MSE = 0.55 for the training set and R² = 0.83 and MSE = 0.56 for the test set. Several validation methods, such as the y-randomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is helpful for designing novel thrombin inhibitors.
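The two figures of merit quoted above, R² and MSE, can be reproduced for any set of predictions with a few lines of arithmetic. The sketch below is illustrative only: it assumes R² means the coefficient of determination (the abstract does not say whether it is that or a squared correlation), and the `observed` and `predicted` values are made-up toy numbers, not data from the study.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy values (not taken from the paper).
observed = [6.0, 7.0, 8.0, 9.0]
predicted = [6.5, 6.5, 8.5, 8.5]

print("MSE:", mse(observed, predicted))
print("R^2:", r_squared(observed, predicted))
```

Computing both metrics separately on the training and test sets, as the abstract reports, is what allows the small gap between them to be read as evidence against overfitting.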


Cancers ◽  
2021 ◽  
Vol 13 (7) ◽  
pp. 1495
Author(s):  
Tú Nguyen-Dumont ◽  
James G. Dowty ◽  
Robert J. MacInnis ◽  
Jason A. Steen ◽  
Moeen Riaz ◽  
...  

While gene panel sequencing is becoming widely used for cancer risk prediction, its clinical utility with respect to predicting aggressive prostate cancer (PrCa) is limited by our current understanding of the genetic risk factors associated with predisposition to this potentially lethal disease phenotype. This study included 837 men diagnosed with aggressive PrCa and 7261 controls (unaffected men and men who did not meet criteria for aggressive PrCa). Rare germline pathogenic variants (including likely pathogenic variants) were identified by targeted sequencing of 26 known or putative cancer predisposition genes. We found that 85 (10%) men with aggressive PrCa and 265 (4%) controls carried a pathogenic variant (p < 0.0001). Aggressive PrCa odds ratios (ORs) were estimated using unconditional logistic regression. Increased risk of aggressive PrCa (OR (95% confidence interval)) was identified for pathogenic variants in BRCA2 (5.8 (2.7–12.4)), BRCA1 (5.5 (1.8–16.6)), and ATM (3.8 (1.6–9.1)). Our study provides further evidence that rare germline pathogenic variants in these genes are associated with increased risk of this aggressive, clinically relevant subset of PrCa. These rare genetic variants could be incorporated into risk prediction models to improve their precision to identify men at highest risk of aggressive prostate cancer and be used to identify men with newly diagnosed prostate cancer who require urgent treatment.
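As a worked illustration of the odds-ratio arithmetic, the carrier counts quoted above (85 of 837 cases, 265 of 7261 controls) can be turned into a crude, unadjusted odds ratio with a Wald 95% confidence interval. This is a sketch under stated assumptions: the abstract reports gene-specific ORs from unconditional logistic regression and does not report this pooled figure, and the study's models may include covariates that this calculation ignores. The function name `odds_ratio_wald_ci` is hypothetical.

```python
import math


def odds_ratio_wald_ci(exposed_cases, total_cases,
                       exposed_controls, total_controls, z=1.96):
    """Crude odds ratio from a 2x2 table with a Wald confidence interval."""
    a = exposed_cases                      # carriers among cases
    b = total_cases - exposed_cases        # non-carriers among cases
    c = exposed_controls                   # carriers among controls
    d = total_controls - exposed_controls  # non-carriers among controls
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Counts as stated in the abstract; the pooled OR itself is NOT from the paper.
crude_or, lower, upper = odds_ratio_wald_ci(85, 837, 265, 7261)
print(f"crude OR = {crude_or:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

With a single binary predictor and no covariates, unconditional logistic regression yields this same cross-product ratio, which is why the 2x2 table is a reasonable sanity check on the direction and rough size of the effect.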


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Adriano dos Santos ◽  
Erina Vitório Rodrigues ◽  
Bruno Galvêas Laviola ◽  
Larissa Pereira Ribeiro Teodoro ◽  
Paulo Eduardo Teodoro ◽  
...  

Genome-wide selection (GWS) is becoming an essential tool in the genetic breeding of long-lived species, as it increases the gain per unit time. This study hypothesized that GWS can shorten the breeding cycle in Jatropha. Our objective was to compare GWS with phenotypic selection in terms of accuracy and efficiency over three harvests. Models were developed throughout the harvests to evaluate their applicability in predicting genetic values in later harvests. For this purpose, 386 individuals of the breeding population obtained from crossings between 42 parents were evaluated. The population was evaluated in a randomized block design, with six replicates over three harvests. The genetic effects of markers were predicted in the population using 811 SNP markers with call rate = 95% and minor allele frequency (MAF) > 4%. GWS enables gains of 108 to 346% over phenotypic selection, with a 50% reduction in the selection cycle. This technique has potential for Jatropha breeding, since it allows accurate estimation of genomic estimated breeding values (GEBV) and is more efficient than phenotypic selection because it reduces the time needed to complete the selection cycle. To apply GWS in the first harvests, a large number of individuals in the breeding population is needed. When the population contains few individuals, a larger number of harvests is recommended.
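The marker quality-control thresholds mentioned above (call rate and minor allele frequency) can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline: it assumes diploid genotypes coded as 0, 1 or 2 copies of the alternate allele with `None` for a missing call, it reads the stated call-rate figure as a lower bound of 0.95 (an assumption, since the abstract writes "= 95%"), and the marker names and genotypes are invented.

```python
def call_rate(genotypes):
    """Fraction of individuals with a non-missing genotype call."""
    called = [g for g in genotypes if g is not None]
    return len(called) / len(genotypes)


def minor_allele_frequency(genotypes):
    """MAF from 0/1/2 alternate-allele counts, ignoring missing calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def passes_qc(genotypes, min_call_rate=0.95, min_maf=0.04):
    """Keep a marker if call rate >= min_call_rate and MAF > min_maf."""
    return (call_rate(genotypes) >= min_call_rate
            and minor_allele_frequency(genotypes) > min_maf)


# Hypothetical markers scored on 20 individuals.
markers = {
    "snp_a": [0, 1, 2, 1] * 5,     # fully called, common variant
    "snp_b": [0, 1, None, 1] * 5,  # 25% missing calls
    "snp_c": [0] * 19 + [1],       # very rare alternate allele
}

kept = [name for name, g in markers.items() if passes_qc(g)]
print(kept)
```

Only `snp_a` survives here: `snp_b` fails on call rate and `snp_c` fails on MAF.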


2020 ◽  
Vol 70 (5) ◽  
pp. 1211-1230
Author(s):  
Abdus Saboor ◽  
Hassan S. Bakouch ◽  
Fernando A. Moala ◽  
Sheraz Hussain

In this paper, a bivariate extension of the exponentiated Fréchet distribution is introduced, namely the bivariate exponentiated Fréchet (BvEF) distribution, whose marginals are univariate exponentiated Fréchet distributions. Several properties of the proposed distribution are discussed, such as the joint survival function, joint probability density function, marginal probability density function, conditional probability density function, moments, and marginal and bivariate moment generating functions. Moreover, the proposed distribution is obtained by the Marshall-Olkin survival copula. Estimation of the parameters is investigated by maximum likelihood with the observed information matrix. In addition to maximum likelihood estimation, we consider Bayesian inference and least squares estimation and compare these three methodologies for the BvEF. A simulation study is carried out to compare the performance of the estimators obtained by the presented estimation methods. The proposed bivariate distribution and other related bivariate distributions are fitted to a real-life paired data set. It is shown that the BvEF distribution performs better than the compared distributions on several goodness-of-fit tests.


2020 ◽  
Vol 10 (24) ◽  
pp. 9151
Author(s):  
Yun-Chia Liang ◽  
Yona Maimury ◽  
Angela Hsiang-Ling Chen ◽  
Josue Rodolfo Cuevas Juarez

Air, an essential natural resource, has been compromised in terms of quality by economic activities. Considerable research has been devoted to predicting instances of poor air quality, but most studies are limited by insufficient longitudinal data, making it difficult to account for seasonal and other factors. Several prediction models have been developed using an 11-year dataset collected by Taiwan’s Environmental Protection Administration (EPA). Machine learning methods, including adaptive boosting (AdaBoost), artificial neural network (ANN), random forest, stacking ensemble, and support vector machine (SVM), produce promising results for air quality index (AQI) level predictions. A series of experiments using datasets for three different regions, run to obtain the best prediction performance from the stacking ensemble, AdaBoost, and random forest, found that the stacking ensemble delivers consistently superior performance for R² and RMSE, while AdaBoost provides the best results for MAE.


2021 ◽  
Author(s):  
Asieh Amousoltani Arani ◽  
Mohammadreza Sehhati ◽  
Mohammad Amin Tabatabaiefar

A new feature space, which can discriminate deleterious variants, was constructed by the integration of various input data using the proposed supervised nonnegative matrix tri-factorization (sNMTF) algorithm.


Heredity ◽  
2021 ◽  
Author(s):  
Iván Galván-Femenía ◽  
Carles Barceló-Vidal ◽  
Lauro Sumoy ◽  
Victor Moreno ◽  
Rafael de Cid ◽  
...  

The detection of family relationships in genetic databases is of interest in various scientific disciplines such as genetic epidemiology, population and conservation genetics, forensic science, and genealogical research. Nowadays, screening genetic databases for related individuals forms an important aspect of standard quality control procedures. Relatedness research is usually based on an allele sharing analysis of identity by state (IBS) or identity by descent (IBD) alleles. Existing IBS/IBD methods mainly aim to identify first-degree (parent–offspring or full sibling) and second-degree (half-sibling, avuncular, or grandparent–grandchild) pairs. Little attention has been paid to detecting relationships that fall between the first and second degree, such as three-quarter siblings (3/4S), who share fewer alleles than first-degree relatives but more alleles than second-degree relatives. With the progressively increasing sample sizes used in genetic research, it becomes more likely that such relationships are present in the database under study. In this paper, we extend existing likelihood ratio (LR) methodology to accurately infer the existence of 3/4S, distinguishing them from full siblings and second-degree relatives. We use bootstrap confidence intervals to express uncertainty in the LRs. Our proposal accounts for linkage disequilibrium (LD) by using marker pruning, and we validate our methodology with a pedigree-based simulation study accounting for both LD and recombination. An empirical genome-wide array data set from the GCAT Genomes for Life cohort project is used to illustrate the method.
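The statement that three-quarter siblings sit between first- and second-degree relatives can be checked with standard pedigree arithmetic. The sketch below uses the textbook IBD-sharing probabilities (k0, k1, k2) for non-inbred pedigrees, where kj is the probability that a pair shares j alleles identical by descent at a locus. It does not implement the likelihood-ratio method described in the abstract; the dictionary and function names are hypothetical.

```python
from fractions import Fraction as F

# (k0, k1, k2): probabilities of sharing 0, 1 or 2 alleles IBD at a locus,
# assuming no inbreeding. Three-quarter siblings share one parent, and their
# other parents are first-degree relatives of each other.
IBD_PROBS = {
    "full siblings": (F(1, 4), F(1, 2), F(1, 4)),
    "three-quarter siblings": (F(3, 8), F(1, 2), F(1, 8)),
    "half-siblings": (F(1, 2), F(1, 2), F(0)),
}


def expected_ibd_fraction(k0, k1, k2):
    """Expected proportion of alleles shared IBD: k1/2 + k2."""
    return k1 / 2 + k2


def kinship(k0, k1, k2):
    """Kinship coefficient: k1/4 + k2/2."""
    return k1 / 4 + k2 / 2


for name, ks in IBD_PROBS.items():
    print(name, expected_ibd_fraction(*ks), kinship(*ks))
```

The expected sharing comes out as 1/2, 3/8 and 1/4 respectively, so 3/4S pairs fall exactly halfway between full and half-siblings. All three relationships have the same k1, which means the distinction rests on k0 and k2 alone.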


2021 ◽  
Vol 4 (1) ◽  
Author(s):  
Daniel J. Panyard ◽  
Kyeong Mo Kim ◽  
Burcu F. Darst ◽  
Yuetiva K. Deming ◽  
Xiaoyuan Zhong ◽  
...  

The study of metabolomics and disease has enabled the discovery of new risk factors, diagnostic markers, and drug targets. For neurological and psychiatric phenotypes, the cerebrospinal fluid (CSF) is of particular importance. However, the CSF metabolome is difficult to study on a large scale due to the relative complexity of the procedure needed to collect the fluid. Here, we present a metabolome-wide association study (MWAS), which uses genetic and metabolomic data to impute metabolites into large samples with genome-wide association summary statistics. We conduct a metabolome-wide, genome-wide association analysis with 338 CSF metabolites, identifying 16 genotype-metabolite associations (metabolite quantitative trait loci, or mQTLs). We then build prediction models for all available CSF metabolites and test for associations with 27 neurological and psychiatric phenotypes, identifying 19 significant CSF metabolite-phenotype associations. Our results demonstrate the feasibility of MWAS to study omic data in scarce sample types.


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Olusola Olawoye ◽  
Chimdi Chuka-Okosa ◽  
Onoja Akpa ◽  
Tony Realini ◽  
Michael Hauser ◽  
...  

Background: This report describes the design and methodology of “Eyes of Africa: The Genetics of Blindness,” a collaborative study funded through the Human Heredity and Health in Africa (H3Africa) program of the National Institutes of Health. Methods: This is a case-control study that is collecting a large, well-phenotyped data set from glaucoma patients and controls for a genome-wide association study (GWAS). Multiplex families segregating Mendelian forms of early-onset glaucoma will also be collected for exome sequencing. Discussion: A total of 4500 cases/controls had been recruited into the study by the end of its third funded year. All of these participants have been appropriately phenotyped, and blood samples have been received from them. Recent GWAS of primary open-angle glaucoma (POAG) in African individuals demonstrated a genome-wide significant association with the APBB2 locus, an association that is unique to individuals of African ancestry. This study will add to the existing knowledge and understanding of POAG in the African population.


2015 ◽  
Vol 17 (5) ◽  
pp. 719-732
Author(s):  
Dulakshi Santhusitha Kumari Karunasingha ◽  
Shie-Yui Liong

A simple clustering method is proposed for extracting representative subsets from lengthy data sets. The extracted subset is intended to be used in place of the entire large data set when building prediction models (in the form of approximating functional relationships). Such smaller subsets of data are often required in the exploratory analysis stages of studies that involve resource-consuming investigations. A few recent studies have used a subtractive clustering method (SCM) for such data extraction, in the absence of clustering methods designed for function approximation. SCM, however, requires several parameters to be specified. This study proposes a clustering method that requires only a single parameter to be specified, yet is shown to be as effective as SCM. A method to find suitable values for the parameter is also proposed. Because it has only a single parameter, the proposed clustering method is shown to be orders of magnitude more efficient to use than SCM. The effectiveness of the proposed method is demonstrated on phase space prediction of three univariate time series and prediction of two multivariate data sets. Some drawbacks of SCM when applied to data extraction are identified, and the proposed method is shown to be a solution for them.
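The abstract does not describe the proposed algorithm, so the following is only a generic illustration of how a single distance parameter can extract a representative subset from a long data set. It is a greedy leader-style thinning, which is neither SCM nor the authors' method; the function name `extract_representatives`, the `radius` parameter and the sample points are all hypothetical.

```python
import math


def extract_representatives(points, radius):
    """Greedy thinning: keep a point only if it lies farther than
    `radius` from every representative kept so far."""
    representatives = []
    for p in points:
        if all(math.dist(p, r) > radius for r in representatives):
            representatives.append(p)
    return representatives


# Hypothetical 2-D points, e.g. delay-embedded values of a time series.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0), (2.0, 0.0)]

subset = extract_representatives(points, radius=0.5)
print(subset)
```

A larger `radius` yields a smaller subset, so this one parameter directly trades data reduction against coverage of the input space. That is the kind of single tuning decision the abstract contrasts with SCM's several parameters.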

