Sampling Defective Pathways in Phenotype Prediction Problems via the Fisher’s Ratio Sampler

Author(s):  
Ana Cernea ◽  
Juan Luis Fernández-Martínez ◽  
Enrique J. deAndrés-Galiana ◽  
Francisco Javier Fernández-Ovies ◽  
Zulima Fernández-Muñiz ◽  
...  
Author(s):  
Juan Luis Fernández-Martínez ◽  
Enrique J. deAndrés-Galiana ◽  
Ana Cernea ◽  
Francisco Javier Fernández-Ovies ◽  
...  

Discrimination of case-control status based on gene-expression differences has the potential to identify novel pathways relevant to neurodegenerative diseases, including Parkinson’s disease (PD). In this paper we applied two novel algorithms to predict dysregulated pathways of gene expression across several regions of the brain in PD patients and controls. The Fisher’s ratio sampler uses the Fisher’s ratio of the most discriminatory genes as the prior probability distribution to sample genetic networks, whose likelihood (accuracy) is established via Leave-One-Out Cross-Validation (LOOCV). The holdout sampler finds the minimum-scale signatures corresponding to different random holdouts, establishing their likelihood using the validation dataset of each holdout. Phenotype prediction problems are, by their nature, highly underdetermined. We used both approaches to sample different lists of genes that optimally discriminate PD from controls, and subsequently used gene ontology to identify pathways affected by the disease. Both algorithms identified common pathways: Insulin Signaling, the FOXA1 Transcription Factor Network, HIF-1 Signaling, p53 Signaling, and Chromatin Regulation/Acetylation. This analysis provides new therapeutic targets for treating PD.
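A minimal sketch of the Fisher’s ratio sampling idea described above, assuming NumPy: each gene is weighted by its Fisher’s ratio, gene subsets (“networks”) are drawn from that prior, and each subset is scored by LOOCV. The nearest-centroid classifier and all parameter values are illustrative assumptions, not the authors’ implementation.

```python
import numpy as np

def fisher_ratio(X, y):
    """Per-gene Fisher's ratio between two classes: (mu1 - mu2)^2 / (s1^2 + s2^2)."""
    X1, X2 = X[y == 0], X[y == 1]
    num = (X1.mean(0) - X2.mean(0)) ** 2
    den = X1.var(0) + X2.var(0) + 1e-12  # guard against zero variance
    return num / den

def loocv_accuracy(X, y):
    """Leave-one-out accuracy of a simple nearest-centroid classifier."""
    hits = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        Xt, yt = X[mask], y[mask]
        c0, c1 = Xt[yt == 0].mean(0), Xt[yt == 1].mean(0)
        pred = 0 if np.linalg.norm(X[i] - c0) <= np.linalg.norm(X[i] - c1) else 1
        hits += pred == y[i]
    return hits / len(y)

def fr_sampler(X, y, n_networks=50, size=5, seed=0):
    """Sample gene subsets with probability proportional to each gene's
    Fisher's ratio; score each subset by its LOOCV accuracy (likelihood)."""
    rng = np.random.default_rng(seed)
    p = fisher_ratio(X, y)
    p = p / p.sum()  # turn Fisher's ratios into a prior distribution
    results = []
    for _ in range(n_networks):
        genes = rng.choice(X.shape[1], size=size, replace=False, p=p)
        results.append((loocv_accuracy(X[:, genes], y), tuple(sorted(genes))))
    return sorted(results, reverse=True)  # best-discriminating networks first
```

On synthetic data with one strongly shifted gene, the top-ranked subsets almost always contain that gene, mirroring how the sampler concentrates on the most discriminatory features.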


2019 ◽  
Vol 109 (2) ◽  
pp. 251-277 ◽  
Author(s):  
Nastasiya F. Grinberg ◽  
Oghenejokpeme I. Orhobor ◽  
Ross D. King

Abstract In phenotype prediction the physical characteristics of an organism are predicted from knowledge of its genotype and environment. Such studies, often called genome-wide association studies, are of the highest societal importance, as they are of central importance to medicine, crop-breeding, etc. We investigated three phenotype prediction problems: one simple and clean (yeast), and the other two complex and real-world (rice and wheat). We compared standard machine learning methods (elastic net, ridge regression, lasso regression, random forest, gradient boosting machines (GBM), and support vector machines (SVM)) with two state-of-the-art classical statistical genetics methods (genomic BLUP and a two-step sequential method based on linear regression). Additionally, using the clean yeast data, we investigated how performance varied with the complexity of the biological mechanism, the amount of observational noise, the number of examples, the amount of missing data, and the use of different data representations. We found that for almost all the phenotypes considered, standard machine learning methods outperformed the methods from classical statistical genetics. On the yeast problem, the most successful method was GBM, followed by lasso regression, and then the two statistical genetics methods; with greater mechanistic complexity GBM was best, while in simpler cases lasso was superior. In the wheat and rice studies the best two methods were SVM and BLUP. The most robust method in the presence of noise, missing data, etc. was random forests. The classical statistical genetics method of genomic BLUP was found to perform well on problems where there was population structure. This suggests that standard machine learning methods need to be refined to include population structure information when it is present.
We conclude that the application of machine learning methods to phenotype prediction problems holds great promise, but that determining which method is likely to perform well on any given problem remains elusive and non-trivial.
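The kind of method comparison described above can be sketched with scikit-learn (assumed available) on synthetic genotype-like data: markers are 0/1/2 allele counts, the phenotype is additive over a few causal loci plus noise, and each model is scored by cross-validated R². The data generator and hyper-parameters are illustrative, not those of the study.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 100)).astype(float)       # marker matrix (allele counts)
beta = np.zeros(100)
beta[:5] = [1.0, -0.8, 0.6, 0.5, -0.4]                      # a few causal loci
yphen = X @ beta + rng.normal(scale=0.5, size=200)          # additive phenotype + noise

models = {
    "lasso": Lasso(alpha=0.05),
    "ridge": Ridge(alpha=1.0),
    "gbm": GradientBoostingRegressor(random_state=0),
    "rf": RandomForestRegressor(n_estimators=100, random_state=0),
}
scores = {name: cross_val_score(m, X, yphen, cv=5, scoring="r2").mean()
          for name, m in models.items()}
for name, r2 in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:6s} mean CV R^2 = {r2:.3f}")
```

On a sparse additive signal like this, lasso tends to do well, echoing the paper’s finding that simpler mechanisms favor lasso while more complex ones favor GBM.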


Author(s):  
Juan Luis Fernández-Martínez ◽  
Ana Cernea ◽  
Enrique J. deAndrés-Galiana ◽  
Francisco Javier Fernández-Ovies ◽  
Zulima Fernández-Muñiz ◽  
...  

2016 ◽  
Vol 23 (8) ◽  
pp. 678-692 ◽  
Author(s):  
Enrique J. deAndrés-Galiana ◽  
Juan Luis Fernández-Martínez ◽  
Stephen T. Sonis


1970 ◽  
Vol 1 (3) ◽  
pp. 181-205 ◽  
Author(s):  
ERIK ERIKSSON

The term “stochastic hydrology” implies a statistical approach to hydrologic problems, as opposed to classic hydrology, which can be considered deterministic in its approach. During the International Hydrology Symposium, held 6-8 September 1967 at Fort Collins, a number of hydrology papers were presented consisting to a large extent of studies on long records of hydrological elements such as river run-off, these being treated as time series in the statistical sense. This approach is, no doubt, of importance for future work, especially in relation to prediction problems, and there seems to be no fundamental difficulty in introducing stochastic concepts into various hydrologic models. There is, however, some developmental work required – not to speak of educational work with respect to hydrologists – before the full benefit of the technique is obtained. The present paper is to some extent an exercise in the statistical study of hydrological time series – far from complete – and to some extent an effort to interpret certain features of such time series from a physical point of view. The material used is 30 years of groundwater level observations in an esker south of Uppsala, the observations being discussed recently by Hallgren & Sandsborg (1968).


2021 ◽  
Vol 20 (1) ◽  
Author(s):  
Jingru Zhou ◽  
Yingping Zhuang ◽  
Jianye Xia

Abstract Background The genome-scale metabolic model (GSMM) is a powerful tool for the study of cellular metabolic characteristics. With the development of multi-omics measurement techniques in recent years, new methods that integrate multi-omics data into the GSMM have shown promising effects on prediction. Such integration not only improves the accuracy of phenotype prediction but also enhances the reliability of the model for simulating complex biochemical phenomena, which can promote theoretical breakthroughs in specific gene-target identification or a better understanding of cell metabolism at the system level. Results Based on the basic GSMM iHL1210 of Aspergillus niger, we integrated large-scale enzyme kinetics and proteomics data to establish a GSMM based on enzyme constraints, termed a GEM with Enzymatic Constraints using Kinetic and Omics data (GECKO). The results show that enzyme constraints effectively improve the model’s phenotype prediction ability and extend the model’s potential to guide target-gene identification by predicting metabolic phenotype changes of A. niger under simulated gene knockouts. In addition, enzyme constraints significantly reduced the solution space of the model: the flux variability of over 40.10% of metabolic reactions was significantly reduced. The new model also showed versatility in other respects, such as estimating large-scale $k_{cat}$ values and predicting the differential expression of enzymes under different growth conditions. Conclusions This study shows that incorporating enzyme-abundance information into a GSMM is very effective for improving model performance with A. niger. The enzyme-constrained model can be used as a powerful tool for predicting the metabolic phenotype of A. niger by incorporating proteome data. In the foreseeable future, with the fast development of measurement techniques and more precise and rich quantitative proteomics data for A. niger, the enzyme-constrained GSMM will find even broader application at the system level.
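A toy enzyme-constrained flux-balance problem in the spirit of the GECKO formulation described above, assuming SciPy is available: fluxes are coupled to enzyme usage through $k_{cat}$ values, and total enzyme mass is capped by a protein pool. The network, turnover numbers, and pool size here are made-up illustrations; the real model (iHL1210) has thousands of reactions.

```python
import numpy as np
from scipy.optimize import linprog

# Reactions: v0 uptake -> A; v1: A -> B (slow enzyme); v2: A -> B (fast enzyme);
# v3: B -> biomass (the objective flux).
S = np.array([[1, -1, -1,  0],    # metabolite A mass balance
              [0,  1,  1, -1]])   # metabolite B mass balance
kcat = {1: 1.0, 2: 5.0}           # 1/h, hypothetical turnover numbers
mw = {1: 1.0, 2: 1.0}             # g/mmol, hypothetical molecular weights
pool = 1.0                        # hypothetical total enzyme budget (g/gDW)

# Enzyme constraint in GECKO style: sum_i (mw_i / kcat_i) * v_i <= pool
A_ub = np.array([[0.0, mw[1] / kcat[1], mw[2] / kcat[2], 0.0]])
res = linprog(c=[0, 0, 0, -1.0],           # minimize -v3, i.e. maximize biomass
              A_eq=S, b_eq=[0.0, 0.0],     # steady-state: S v = 0
              A_ub=A_ub, b_ub=[pool],
              bounds=[(0, 10), (0, None), (0, None), (0, None)],
              method="highs")
print("max growth flux:", -res.fun)        # limited by the enzyme pool, not uptake
print("flux through slow/fast enzyme:", res.x[1], res.x[2])
```

The optimum routes all flux through the fast enzyme, illustrating how enzyme constraints shrink the solution space: without the pool constraint, growth would be limited only by the uptake bound.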


Author(s):  
Andrew Jacobsen ◽  
Matthew Schlegel ◽  
Cameron Linke ◽  
Thomas Degris ◽  
Adam White ◽  
...  

This paper investigates different vector step-size adaptation approaches for non-stationary, online, continual prediction problems. Vanilla stochastic gradient descent can be considerably improved by scaling the update with a vector of appropriately chosen step-sizes. Many methods, including AdaGrad, RMSProp, and AMSGrad, keep statistics about the learning process to approximate a second-order update—a vector approximation of the inverse Hessian. Another family of approaches uses meta-gradient descent to adapt the step-size parameters to minimize prediction error. These meta-descent strategies are promising for non-stationary problems, but have not been as extensively explored as quasi-second-order methods. We first derive a general, incremental meta-descent algorithm, called AdaGain, designed to be applicable to a much broader range of algorithms, including those with semi-gradient updates or even those with accelerations, such as RMSProp. We provide an empirical comparison of methods from both families. We conclude that methods from both families can perform well, but in non-stationary prediction problems the meta-descent methods exhibit advantages. Our method is particularly robust across several prediction problems, and is competitive with the state-of-the-art method on a large-scale, time-series prediction problem on real data from a mobile robot.
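The vector step-size idea can be illustrated with Sutton’s IDBD, a classic meta-descent baseline for incremental linear prediction; note this is an illustration of the family of methods discussed above, not the paper’s AdaGain algorithm.

```python
import numpy as np

class IDBD:
    """Sutton's IDBD: per-weight (vector) step-sizes adapted by meta-gradient
    descent for an incremental linear predictor."""
    def __init__(self, n, init_alpha=0.05, meta_rate=0.01):
        self.w = np.zeros(n)                          # linear weights
        self.beta = np.full(n, np.log(init_alpha))    # log step-sizes
        self.h = np.zeros(n)                          # decaying gradient trace
        self.theta = meta_rate                        # meta step-size

    def update(self, x, target):
        delta = target - self.w @ x                   # prediction error
        self.beta += self.theta * delta * x * self.h  # meta-gradient step on log alpha
        alpha = np.exp(self.beta)                     # per-weight step-sizes
        self.w += alpha * delta * x                   # scaled SGD update
        # trace update: decay by (1 - alpha * x^2), clipped at zero
        self.h = self.h * np.clip(1 - alpha * x * x, 0, None) + alpha * delta * x
        return delta
```

Features whose errors are consistently correlated across updates have their step-sizes grown, and irrelevant features have theirs shrunk—the same mechanism that makes meta-descent attractive for non-stationary problems.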

