A maximum flow-based network approach for identification of stable noncoding biomarkers associated with the multigenic neurological condition, autism

2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Maya Varma ◽  
Kelley M. Paskov ◽  
Brianna S. Chrisman ◽  
Min Woo Sun ◽  
Jae-Yoon Jung ◽  
...  

Abstract Background Machine learning approaches for predicting disease risk from high-dimensional whole genome sequence (WGS) data often result in unstable models that can be difficult to interpret, limiting the identification of putative sets of biomarkers. Here, we design and validate a graph-based methodology based on maximum flow, which leverages the presence of linkage disequilibrium (LD) to identify stable sets of variants associated with complex multigenic disorders. Results We apply our method to a previously published logistic regression model trained to identify variants in simple repeat sequences associated with autism spectrum disorder (ASD); this L1-regularized model exhibits high predictive accuracy yet demonstrates great variability in the features selected from over 230,000 possible variants. In order to improve model stability, we extract the variants assigned non-zero weights in each of 5 cross-validation folds and then assemble the five sets of features into a flow network subject to LD constraints. The maximum flow formulation allowed us to identify 55 variants, which we show to be more stable than the features identified by the original classifier. Conclusion Our method allows for the creation of machine learning models that can identify predictive variants. Our results help pave the way towards biomarker-based diagnosis methods for complex genetic disorders.
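The notion of feature stability across cross-validation folds in the abstract above can be made concrete with a small sketch. The abstract does not say which stability metric the authors used, so the mean pairwise Jaccard index below is an assumption chosen for illustration. The sketch does not implement the maximum flow formulation, and the variant identifiers are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two feature sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(union)


def mean_pairwise_jaccard(fold_features):
    """Average Jaccard index over all pairs of folds (1.0 = identical sets)."""
    pairs = list(combinations(fold_features, 2))
    if not pairs:
        raise ValueError("need at least two folds")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical variant IDs with non-zero weight in each of three folds.
folds = [
    {"v1", "v2", "v3"},
    {"v2", "v3", "v4"},
    {"v1", "v2", "v3"},
]
score = mean_pairwise_jaccard(folds)
print(f"mean pairwise Jaccard stability: {score:.3f}")
```

A score near 1 means the folds select nearly the same variants. A score near 0 corresponds to the fold-to-fold variability the abstract describes for the original L1-regularized model.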

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Margot Gunning ◽  
Paul Pavlidis

Abstract Discovering genes involved in complex human genetic disorders is a major challenge. Many have suggested that machine learning (ML) algorithms using gene networks can be used to supplement traditional genetic association-based approaches to predict or prioritize disease genes. However, questions have been raised about the utility of ML methods for this type of task due to biases within the data, and poor real-world performance. Using autism spectrum disorder (ASD) as a test case, we sought to investigate the question: can machine learning aid in the discovery of disease genes? We collected 13 published ASD gene prioritization studies and evaluated their performance using known and novel high-confidence ASD genes. We also investigated their biases towards generic gene annotations, like number of association publications. We found that ML methods which do not incorporate genetics information have limited utility for prioritization of ASD risk genes. These studies perform at a comparable level to generic measures of likelihood for the involvement of genes in any condition, and do not out-perform genetic association studies. Future efforts to discover disease genes should be focused on developing and validating statistical models for genetic association, specifically for association between rare variants and disease, rather than developing complex machine learning methods using complex heterogeneous biological data with unknown reliability.


10.2196/24246 ◽  
2021 ◽  
Vol 23 (2) ◽  
pp. e24246 ◽  
Author(s):  
Siavash Bolourani ◽  
Max Brenner ◽  
Ping Wang ◽  
Thomas McGinn ◽  
Jamie S Hirsch ◽  
...  

Background Predicting early respiratory failure due to COVID-19 can help triage patients to higher levels of care, allocate scarce resources, and reduce morbidity and mortality by appropriately monitoring and treating the patients at greatest risk for deterioration. Given the complexity of COVID-19, machine learning approaches may support clinical decision making for patients with this disease. Objective Our objective is to derive a machine learning model that predicts respiratory failure within 48 hours of admission based on data from the emergency department. Methods Data were collected from patients with COVID-19 who were admitted to Northwell Health acute care hospitals and were discharged, died, or spent a minimum of 48 hours in the hospital between March 1 and May 11, 2020. Of 11,525 patients, 933 (8.1%) were placed on invasive mechanical ventilation within 48 hours of admission. Variables used by the models included clinical and laboratory data commonly collected in the emergency department. We trained and validated three predictive models (two based on XGBoost and one that used logistic regression) using cross-hospital validation. We compared model performance among all three models as well as an established early warning score (Modified Early Warning Score) using receiver operating characteristic curves, precision-recall curves, and other metrics. Results The XGBoost model had the highest mean accuracy (0.919; area under the curve=0.77), outperforming the other two models as well as the Modified Early Warning Score. Important predictor variables included the type of oxygen delivery used in the emergency department, patient age, Emergency Severity Index level, respiratory rate, serum lactate, and demographic characteristics. Conclusions The XGBoost model had high predictive accuracy, outperforming other early warning scores. 
The clinical plausibility and predictive ability of XGBoost suggest that the model could be used to predict 48-hour respiratory failure in admitted patients with COVID-19.


Author(s):  
Brian Carnahan ◽  
Gérard Meyer ◽  
Lois-Ann Kuntz

Multivariate classification models play an increasingly important role in human factors research. In the past, these models have been based primarily on discriminant analysis and logistic regression. Models developed from machine learning research offer the human factors professional a viable alternative to these traditional statistical classification methods. To illustrate this point, two machine learning approaches - genetic programming and decision tree induction - were used to construct classification models designed to predict whether or not a student truck driver would pass his or her commercial driver license (CDL) examination. The models were developed and validated using the curriculum scores and CDL exam performances of 37 student truck drivers who had completed a 320-hr driver training course. Results indicated that the machine learning classification models were superior to discriminant analysis and logistic regression in terms of predictive accuracy. Actual or potential applications of this research include the creation of models that more accurately predict human performance outcomes.


Author(s):  
Subhendu Kumar Pani ◽  
Bikram Kesari Ratha ◽  
Ajay Kumar Mishra

DNA microarray technology permits simultaneous monitoring of the expression levels of thousands of genes in a single experiment. Data mining techniques such as classification are extensively used on microarray data for medical diagnosis and gene analysis. However, the high dimensionality of the data degrades the performance of classification and prediction. Consequently, a key issue with microarray data is feature selection and dimensionality reduction to achieve better classification and predictive accuracy. Several machine learning approaches are available for feature selection. In this study, the authors use Particle Swarm Optimization (PSO) and a Genetic Algorithm (GA) for feature selection and compare the performance of several popular classifiers on a set of microarray datasets. Experimental results show that feature selection affects classifier performance.


Genes ◽  
2021 ◽  
Vol 12 (2) ◽  
pp. 137
Author(s):  
Supatcha Lertampaiporn ◽  
Tayvich Vorapreeda ◽  
Apiradee Hongsthong ◽  
Chinae Thammarongtham

Antimicrobial peptides (AMPs) are natural peptides possessing antimicrobial activities. These peptides are important components of the innate immune system. They are found in various organisms. AMP screening and identification by experimental techniques are laborious and time-consuming tasks. Alternatively, computational methods based on machine learning have been developed to screen potential AMP candidates prior to experimental verification. Although various AMP prediction programs are available, there is still a need for improvement to reduce false positives (FPs) and to increase the predictive accuracy. In this work, several well-known single and ensemble machine learning approaches have been explored and evaluated based on balanced training datasets and two large testing datasets. We have demonstrated that the developed program with various predictive models has high performance in differentiating between AMPs and non-AMPs. Thus, we describe the development of a program for the prediction and recognition of AMPs using MaxProbVote, which is an ensemble model. Moreover, to increase prediction efficiency, the ensemble model was integrated with a new hybrid feature based on logistic regression. The ensemble model integrated with the hybrid feature can effectively increase the prediction sensitivity of the developed program called Ensemble-AMPPred, resulting in overall improvements in terms of both sensitivity and specificity compared to those of currently available programs.


Author(s):  
Nurul Amirah Mashudi ◽  
Norulhusna Ahmad ◽  
Norliza Mohd Noor

Autism spectrum disorder (ASD) is a neurological disorder. Patients with ASD have poor social interaction and a lack of communication that lead to restricted activities. Thus, early diagnosis with a reliable system is crucial, as the symptoms may affect the patient’s entire lifetime. Machine learning approaches are an effective and efficient method for the prediction of ASD. The study mainly aims to evaluate the accuracy of ASD classification using a variety of machine learning approaches. The dataset comprises 16 selected attributes for 703 patients and non-patients. The experiments are performed within a simulation environment and analyzed using the Waikato Environment for Knowledge Analysis (WEKA) platform. Linear support vector machine (SVM), k-nearest neighbours (k-NN), J48, Bagging, Stacking, AdaBoost, and naïve Bayes are the methods used to predict the ASD status of each subject using 3-, 5-, and 10-fold cross-validation. The results are then analyzed to evaluate the accuracy, sensitivity, and specificity of the proposed methods. The comparison between the machine learning approaches shows that linear SVM, J48, Bagging, Stacking, and naïve Bayes produce the highest accuracy, at 100%, with the lowest error rate.
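As a rough illustration of the k-fold cross-validation mentioned above, the following sketch partitions sample indices into k contiguous folds. It is a minimal plain-Python sketch, not the WEKA procedure used in the study: it does no shuffling or stratification, and the function name `k_fold_indices` is hypothetical. Only the sample count (703) and the fold count (10) come from the abstract.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k (train, test) index pairs.

    Folds are contiguous and differ in size by at most one sample.
    No shuffling or stratification is applied in this sketch.
    """
    if k < 2 or k > n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # first `extra` folds get one more
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


# 703 subjects and 10 folds, matching the counts reported in the abstract.
folds = k_fold_indices(703, 10)
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)
```

Each sample appears in exactly one test fold, so every subject is used for evaluation once and for training in the remaining k - 1 rounds.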


2020 ◽  
Vol 3 (1) ◽  
Author(s):  
Ralph K. Akyea ◽  
Nadeem Qureshi ◽  
Joe Kai ◽  
Stephen F. Weng

Abstract Familial hypercholesterolaemia (FH) is a common inherited disorder, causing lifelong elevated low-density lipoprotein cholesterol (LDL-C). Most individuals with FH remain undiagnosed, precluding opportunities to prevent premature heart disease and death. Some machine-learning approaches improve detection of FH in electronic health records, though clinical impact is under-explored. We assessed performance of an array of machine-learning approaches for enhancing detection of FH, and their clinical utility, within a large primary care population. A retrospective cohort study was done using routine primary care clinical records of 4,027,775 individuals from the United Kingdom with total cholesterol measured from 1 January 1999 to 25 June 2019. Predictive accuracy of five common machine-learning algorithms (logistic regression, random forest, gradient boosting machines, neural networks and ensemble learning) was assessed for detecting FH. Predictive accuracy was assessed by area under the receiver operating curves (AUC) and expected vs observed calibration slope; with clinical utility assessed by expected case-review workload and likelihood ratios. There were 7928 incident diagnoses of FH. In addition to known clinical features of FH (raised total cholesterol or LDL-C and family history of premature coronary heart disease), machine-learning (ML) algorithms identified features such as raised triglycerides which reduced the likelihood of FH. Apart from logistic regression (AUC, 0.81), all four other ML approaches had similarly high predictive accuracy (AUC > 0.89). Calibration slope ranged from 0.997 for gradient boosting machines to 1.857 for logistic regression. Among those screened, high probability cases requiring clinical review varied from 0.73% using ensemble learning to 10.16% using deep learning, but with positive predictive values of 15.5% and 2.8% respectively. 
Ensemble learning exhibited a dominant positive likelihood ratio (45.5) compared to all other ML models (7.0–14.4). Machine-learning models show similar high accuracy in detecting FH, offering opportunities to increase diagnosis. However, the clinical case-finding workload required for yield of cases will differ substantially between models.
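The clinical-utility measures named in this abstract, the positive likelihood ratio and the positive predictive value (PPV), follow from sensitivity, specificity and prevalence through standard formulas. The sketch below applies them under stated assumptions: the sensitivity and specificity are hypothetical, since the abstract does not report them, and the ratio of incident diagnoses to cohort size (7928 / 4,027,775) is used only as a rough stand-in for prevalence.

```python
def positive_likelihood_ratio(sensitivity, specificity):
    """LR+ = sensitivity / (1 - specificity)."""
    if not (0.0 <= sensitivity <= 1.0 and 0.0 <= specificity < 1.0):
        raise ValueError("need 0 <= sensitivity <= 1 and 0 <= specificity < 1")
    return sensitivity / (1.0 - specificity)


def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = TP / (TP + FP), expressed as population fractions."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    if true_pos + false_pos == 0.0:
        raise ValueError("no positive predictions, PPV is undefined")
    return true_pos / (true_pos + false_pos)


# Rough prevalence proxy from the counts reported in the abstract.
prevalence = 7928 / 4_027_775

# Hypothetical operating point, not taken from the study.
sensitivity, specificity = 0.80, 0.99

lr_plus = positive_likelihood_ratio(sensitivity, specificity)
ppv = positive_predictive_value(sensitivity, specificity, prevalence)
print(f"LR+ = {lr_plus:.1f}, PPV = {ppv:.1%}")
```

With a condition this rare, even a high likelihood ratio yields a modest PPV, which is consistent with the abstract's point that case-review workload differs substantially between models of similar AUC.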


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Alexandre Hild Aono ◽  
Estela Araujo Costa ◽  
Hugo Vianna Silva Rody ◽  
James Shiniti Nagai ◽  
Ricardo José Gonzaga Pimenta ◽  
...  

Abstract Sugarcane is an economically important crop, but its genomic complexity has hindered advances in molecular approaches for genetic breeding. New cultivars are released based on the identification of interesting traits, and for sugarcane, brown rust resistance is a desirable characteristic due to the large economic impact of the disease. Although marker-assisted selection for rust resistance has been successful, the genes involved are still unknown, and the associated regions vary among cultivars, thus restricting methodological generalization. We used genotyping by sequencing of full-sib progeny to relate genomic regions with brown rust phenotypes. We established a pipeline to identify reliable SNPs in complex polyploid data, which were used for phenotypic prediction via machine learning. We identified 14,540 SNPs, which led to a mean prediction accuracy of 50% when using different models. We also tested feature selection algorithms to increase predictive accuracy, resulting in a reduced dataset with more explanatory power for rust phenotypes. As a result of this approach, we achieved an accuracy of up to 95% with a dataset of 131 SNPs related to brown rust QTL regions and auxiliary genes. Therefore, our novel strategy has the potential to assist studies of the genomic organization of brown rust resistance in sugarcane.


2020 ◽  
Vol 41 (Supplement_2) ◽  
Author(s):  
J Tung ◽  
A.J Rogers ◽  
N Ravi ◽  
N.K Bhatia ◽  
R.L Shah ◽  
...  

Abstract Background Detection of myocardial infarction (MI) traditionally requires ECG Q waves, which have poor sensitivity, or imaging, which is time consuming. We hypothesized that machine learning (ML) of the ECG could identify prior MI, but its accuracy may depend highly upon the architecture and parameters chosen. Purpose To compare ML architectures that predict prior MI from the ECG. Methods We curated ECGs in 608 patients seen in cardiology clinics at 2 centers. We transformed 12-lead ECGs to median beats in Frank (X, Y, Z) planes (fig. A). We tested 3 architectures: a 1D deep neural network (DNN), a 3D DNN, and a support vector machine (SVM). The 1D DNN used only temporal convolutions (fig. B) while the 3D DNN used a spatial convolution (fig. C) prior to the fully-connected layer (fig. C). Predictive accuracy for history of MI was compared for all architectures (fig. D). Results Patients (61.4±14.5 years, 31.2% female) had a 28.7% (175/608) prevalence of prior MI. Optimized SVM of 6 features provided accuracy of 66.1% for identifying prior MI, similar to ECG Q wave analysis. 1D DNN had accuracy of 63.6% with an area under curve (AUC) of 0.625. 3D DNN outperformed 1D DNN and SVM, providing an accuracy of 71±5% (using k=5-fold cross validation), with an AUC of 0.730. Conclusion ECG machine learning can identify prior MI better than Q wave analysis, but is sensitive to technical parameters and specific computational architecture. It is important to develop a framework to enable robust comparisons of different ML studies and future refinements. Funding Acknowledgement Type of funding source: Public grant(s) – National budget only. Main funding source(s): National Institutes of Health - United States


2020 ◽  
Vol 109 (11) ◽  
pp. 2195-2212
Author(s):  
Oghenejokpeme I. Orhobor ◽  
Nickolai N. Alexandrov ◽  
Ross D. King

Abstract The features in some machine learning datasets can naturally be divided into groups. This is the case with genomic data, where features can be grouped by chromosome. In many applications it is common for these groupings to be ignored, as interactions may exist between features belonging to different groups. However, including a group that does not influence a response introduces noise when fitting a model, leading to suboptimal predictive accuracy. Here we present two general frameworks for the generation and combination of meta-features when feature groupings are present. Furthermore, we make comparisons to multi-target learning, given that one is typically interested in predicting multiple phenotypes. We evaluated the frameworks and multi-target learning approaches on a genomic rice dataset where the regression task is to predict plant phenotype. Our results demonstrate that there are use cases for both the meta and multi-target approaches, given that overall, they significantly outperform the base case.

