Gene pathogenicity prediction of Mendelian diseases via the Random Forest algorithm

2019 ◽  
Author(s):  
Sijie He ◽  
Weiwei Chen ◽  
Hankui Liu ◽  
Shengting Li ◽  
Dongzhu Lei ◽  
...  

Abstract The study of Mendelian diseases and the identification of their causative genes are of great significance in the field of genetics. Two open questions are how to evaluate the pathogenicity of genes and how many Mendelian disease genes exist in total. Very few studies have addressed these questions to date, so we attempt to answer them in this study. We calculated a gene pathogenicity prediction (GPP) score using a machine learning approach (the random forest algorithm) to evaluate the pathogenicity of genes. When we applied the GPP score to the testing gene set, we obtained an accuracy of 80%, a recall of 93%, and an area under the curve (AUC) of 0.87. Our results estimated that a total of 10,399 protein-coding genes are Mendelian disease genes. Furthermore, we found that the GPP score was positively correlated with disease severity. Our results indicate that the GPP score may provide a robust and reliable guideline for predicting the pathogenicity of protein-coding genes. To our knowledge, this is the first attempt to estimate the total number of Mendelian disease genes.
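The three metrics quoted in this abstract (accuracy, recall, AUC) can be illustrated with a minimal sketch. The labels, scores, and the 0.5 decision threshold below are toy assumptions for illustration only; they are not the GPP score, the study's data, or its evaluation code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """True positives divided by all actual positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as half a win (rank-based definition)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = disease gene, 0 = non-disease gene.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
# Assumed threshold of 0.5 to turn scores into class predictions.
preds = [1 if s >= 0.5 else 0 for s in scores]

print(accuracy(labels, preds), recall(labels, preds), auc(labels, scores))
```

Accuracy and recall depend on the chosen threshold, whereas AUC is computed from the score ranking alone, which is why the three numbers can diverge as they do in the abstract.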

2019 ◽  
Vol 138 (6) ◽  
pp. 673-679

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Sofia Kapsiani ◽  
Brendan J. Howlin

Abstract Ageing is a major risk factor for many conditions including cancer, cardiovascular and neurodegenerative diseases. Pharmaceutical interventions that slow down ageing and delay the onset of age-related diseases are a growing research area. The aim of this study was to build a machine learning model based on the data of the DrugAge database to predict whether a chemical compound will extend the lifespan of Caenorhabditis elegans. Five predictive models were built using the random forest algorithm with molecular fingerprints and/or molecular descriptors as features. The best performing classifier, built using molecular descriptors, achieved an area under the curve (AUC) score of 0.815 for classifying the compounds in the test set. The features of the model were ranked using the Gini importance measure of the random forest algorithm. The top 30 features included descriptors related to atom and bond counts, topological and partial charge properties. The model was applied to predict the class of compounds in an external database, consisting of 1738 small-molecules. The chemical compounds of the screening database with a predictive probability of ≥ 0.80 for increasing the lifespan of Caenorhabditis elegans were broadly separated into (1) flavonoids, (2) fatty acids and conjugates, and (3) organooxygen compounds.
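Gini importance in a random forest is commonly built from the decrease in Gini impurity that each split achieves, accumulated per feature over all trees. The sketch below shows only that building block on toy labels; the function names and data are illustrative assumptions, not the authors' pipeline.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted


# Hypothetical node with three lifespan-extending (1) and three other (0) compounds.
parent = [1, 1, 1, 0, 0, 0]

# A split that separates the classes perfectly removes all impurity.
perfect = impurity_decrease(parent, [1, 1, 1], [0, 0, 0])

# A split that barely separates the classes removes very little.
weak = impurity_decrease(parent, [1, 1, 0], [1, 0, 0])

print(perfect, weak)
```

A feature that repeatedly produces splits like the first one accumulates a high importance, while one that only produces splits like the second ranks low.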


PLoS ONE ◽  
2021 ◽  
Vol 16 (11) ◽  
pp. e0260195
Author(s):  
Marcelo Dantas Tavares de Melo ◽  
Jose de Arimatéia Batista Araujo-Filho ◽  
José Raimundo Barbosa ◽  
Camila Rocon ◽  
Carlos Danilo Miranda Regis ◽  
...  

Aims: Noncompaction cardiomyopathy (NCC) is considered a genetic cardiomyopathy with unknown pathophysiological mechanisms. We propose to evaluate echocardiographic predictors of rigid body rotation (RBR) in NCC using a machine learning (ML) based model. Methods and results: Forty-nine outpatients with an NCC diagnosis by echocardiography and magnetic resonance imaging (21 men, 42.8±14.8 years) were included. A comprehensive echocardiogram was performed. The layer-specific strain was analyzed from the apical two-, three-, and four-chamber views, short axis, and focused right ventricle views using 2D echocardiography (2DE) software. RBR was present in 44.9% of patients, and this group presented increased indexed LV mass (118±43.4 vs. 94.1±27.1 g/m², P = 0.034), LV end-diastolic and end-systolic volumes (P < 0.001), and E/e’ (12.2±8.68 vs. 7.69±3.13, P = 0.034), and decreased LV ejection fraction (40.7±8.71 vs. 58.9±8.76%, P < 0.001) when compared to patients without RBR. Patients with RBR also presented a significant decrease in global longitudinal, radial, and circumferential strain. When an ML model based on a random forest algorithm and a neural network model was applied, twist, NC/C, torsion, LV ejection fraction, and diastolic dysfunction were the strongest predictors of RBR, with accuracy, sensitivity, specificity, and area under the curve of 0.93, 0.99, 0.80, and 0.88, respectively. Conclusion: In this study, a random forest algorithm was capable of selecting the best echocardiographic predictors of the RBR pattern in NCC patients, which was consistent with worse systolic, diastolic, and myocardial deformation indices. Prospective studies are warranted to evaluate the role of this tool for NCC risk stratification.


2016 ◽  
Author(s):  
Roddy Walsh ◽  
Kate Thomson ◽  
James S Ware ◽  
Birgit H Funke ◽  
Jessica Woodley ◽  
...  

The accurate interpretation of variation in Mendelian disease genes has lagged behind data generation as sequencing has become increasingly accessible. Ongoing large sequencing efforts present huge interpretive challenges, but also provide an invaluable opportunity to characterize the spectrum and importance of rare variation. Here we analyze sequence data from 7,855 clinical cardiomyopathy cases and 60,706 ExAC reference samples to better understand genetic variation in a representative autosomal dominant disorder. We show that in some genes previously reported as important causes of a given cardiomyopathy, rare variation is not clinically informative and there is a high likelihood of false positive interpretation. By contrast, in other genes, we find that diagnostic laboratories may be overly conservative when assessing variant pathogenicity. We outline improved interpretation approaches for specific genes and variant classes and propose that these will increase the clinical utility of testing across a range of Mendelian diseases.


Cancers ◽  
2019 ◽  
Vol 11 (12) ◽  
pp. 2007 ◽  
Author(s):  
Pushpanjali Gupta ◽  
Sum-Fu Chiang ◽  
Prasan Kumar Sahoo ◽  
Suvendu Kumar Mohapatra ◽  
Jeng-Fu You ◽  
...  

This study uses machine learning (ML) to predict the tumor stage of colon cancer under TNM (tumor, node, and metastasis) staging from the most influential histopathology parameters, and to predict five-year disease-free survival (DFS). From the colorectal cancer (CRC) registry of Chang Gung Memorial Hospital, Linkou, Taiwan, 4021 patients were selected for the analysis. Various ML algorithms were applied to predict the tumor stage of colon cancer, with the Tumor Aggression Score (TAS) considered as a prognostic factor. The performance of the different ML algorithms was evaluated using five-fold cross-validation, an effective method of model validation. The accuracy of the algorithms was determined both for standard TNM staging and for TNM staging with the Tumor Aggression Score. The Random Forest model achieved an F-measure of 0.89 when the Tumor Aggression Score was included as an attribute alongside the standard attributes normally used for TNM stage prediction. We also found that the Random Forest algorithm outperformed all other algorithms, with an accuracy of approximately 84% and an area under the curve (AUC) of 0.82 ± 0.10 for predicting five-year DFS.
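Five-fold cross-validation and the F-measure, both named in this abstract, can be sketched in a few lines. This is a simplified illustration under stated assumptions: folds are contiguous and unshuffled, and the F-measure is the binary F1 form, whereas multi-class stage prediction would need some form of per-class averaging that the abstract does not specify.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


def f_measure(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 12 samples split into 5 folds.
folds = list(k_fold_indices(12, k=5))
for train, test in folds:
    print(len(train), test)

# Hypothetical labels and predictions for one held-out fold.
print(f_measure([1, 1, 0, 0], [1, 0, 1, 0]))
```

Each sample appears in exactly one test fold, so every model is scored on data it did not see during training, and the per-fold scores can then be averaged.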


2020 ◽  
Author(s):  
Sofia Kapsiani ◽  
Brendan James Howlin

Abstract Ageing is a major risk factor for many conditions including cancer, cardiovascular and neurodegenerative diseases. Pharmaceutical interventions that slow down ageing and delay the onset of age-related diseases are a growing research area. The aim of this study was to build a machine learning model based on the data of the DrugAge database to predict whether a chemical compound will extend the lifespan of the worm species Caenorhabditis elegans. Five predictive models were built using the random forest algorithm with molecular fingerprints and/or molecular descriptors as features. Feature selection was achieved using variation and mutual information-based methods. The best performing classifier, built using molecular descriptors, achieved an area under the curve (AUC) score of 0.815 for classifying the compounds in the test set. The features of the model were ranked using the Gini importance measure of the random forest algorithm. The top 30 most important features included descriptors related to atom and bond counts, topological and partial charge properties. The model was applied to predict the class of compounds in an external database, consisting of 1,738 small-molecules. The chemical compounds of the screening database with a predictive probability of ≥ 0.80 for increasing the lifespan of Caenorhabditis elegans were broadly separated into (i) flavonoids, (ii) fatty acids and conjugates, and (iii) organooxygen compounds.


2021 ◽  
Author(s):  
Sofia Kapsiani ◽  
Brendan J. Howlin

Abstract Ageing is a major risk factor for many conditions including cancer, cardiovascular and neurodegenerative diseases. Pharmaceutical interventions that slow down ageing and delay the onset of age-related diseases are a growing research area. The aim of this study was to build a machine learning model based on the data of the DrugAge database to predict whether a chemical compound will extend the lifespan of Caenorhabditis elegans. Five predictive models were built using the random forest algorithm with molecular fingerprints and/or molecular descriptors as features. The best performing classifier, built using molecular descriptors, achieved an area under the curve (AUC) score of 0.815 for classifying the compounds in the test set. The features of the model were ranked using the Gini importance measure of the random forest algorithm. The top 30 features included descriptors related to atom and bond counts, topological and partial charge properties. The model was applied to predict the class of compounds in an external database, consisting of 1,738 small-molecules. The chemical compounds of the screening database with a predictive probability of ≥ 0.80 for increasing the lifespan of Caenorhabditis elegans were broadly separated into (i) flavonoids, (ii) fatty acids and conjugates, and (iii) organooxygen compounds.


2021 ◽  
Vol 12 ◽  
Author(s):  
Yaozhong Liu ◽  
Na Liu ◽  
Fan Bai ◽  
Qiming Liu

Background: Atrial fibrillation (AF) is the most common arrhythmia. We aimed to construct competing endogenous RNA (ceRNA) networks associated with the susceptibility and persistence of AF by applying the weighted gene co-expression network analysis (WGCNA) and prioritize key genes using the random walk with restart on multiplex networks (RWR-M) algorithm. Methods: RNA sequencing results from 235 left atrial appendage samples were downloaded from the GEO database. The top 5,000 lncRNAs/mRNAs with the highest variance were used to construct a gene co-expression network using the WGCNA method. AF susceptibility- or persistence-associated modules were identified by correlating the module eigengene with the atrial rhythm phenotype. ceRNA pairs of lncRNA–mRNA were predicted in a module-specific manner. The RWR-M algorithm was applied to calculate the proximity between lncRNAs and known AF protein-coding genes. Random forest classifiers, based on the expression value of key lncRNA-associated ceRNA pairs, were constructed and validated against an independent data set. Results: From the 21 identified modules, magenta and tan modules were associated with AF susceptibility, whereas turquoise and yellow modules were associated with AF persistence. ceRNA networks in magenta and tan modules were primarily involved in the inflammatory process, whereas ceRNA networks in turquoise and yellow modules were primarily associated with electrical remodeling. A total of 106 previously identified AF-associated protein-coding genes were found in the ceRNA networks, including 16 that were previously implicated in the genome-wide association study. Myocardial infarction–associated transcript (MIAT) and LINC00964 were prioritized as key lncRNAs through RWR-M. The classifiers based on their associated ceRNA pairs were able to distinguish AF from sinus rhythm with respective AUC values of 0.810 and 0.940 in the training set and 0.870 and 0.922 in the independent test set. The AF-related single-nucleotide polymorphism rs35006907 was found in the intronic region of LINC00964 and negatively regulated the LINC00964 expression. Conclusion: Our study constructed AF susceptibility- and persistence-associated ceRNA networks, linked genetics with epigenetics, identified MIAT and LINC00964 as key lncRNAs, and constructed random forest classifiers based on their associated ceRNA pairs. These results will help us to better understand the mechanisms underlying AF from the ceRNA perspective and provide candidate therapeutic and diagnostic tools.
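The idea behind random walk with restart can be sketched on a single-layer toy graph. This is a deliberate simplification: the study uses the multiplex variant (RWR-M), which is not reproduced here. The node names, the restart probability of 0.7, and the assumption that every node has at least one neighbor are all illustrative choices.

```python
def random_walk_with_restart(adj, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Iterate p <- restart * p0 + (1 - restart) * W p until convergence.

    adj maps each node to a list of neighbors (undirected, every node
    assumed to have at least one neighbor). Returns steady-state scores,
    read as proximity of each node to the seed set.
    """
    nodes = sorted(adj)
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {v: restart * p0[v] for v in nodes}
        for u in nodes:
            # Each node spreads its non-restarting mass evenly over its neighbors.
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        change = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if change < tol:
            break
    return p


# Hypothetical path graph A - B - C - D with A as the only seed node.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
proximity = random_walk_with_restart(graph, seeds={"A"})
for node in sorted(proximity, key=proximity.get, reverse=True):
    print(node, round(proximity[node], 4))
```

Nodes closer to the seed receive higher scores, which is the property used to rank candidates by their proximity to known disease genes.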

