Prediction performance of linear models and gradient boosting machine on complex phenotypes in outbred mice

2021
Author(s):
Bruno C. Perez
Marco C.A.M. Bink
Gary A. Churchill
Karen L. Svenson
Mario P.L. Calus

Recent literature suggests that machine learning methods can capture interactions between loci and could therefore outperform linear models when predicting traits with relevant epistatic effects. However, investigating this empirically requires data with high mapping resolution and phenotypes for traits with known non-additive gene action. The objective of the present study was to compare the performance of linear methods (GBLUP, BayesB and elastic net [ENET]) with that of a non-parametric tree-based ensemble method (gradient boosting machine [GBM]) for genomic prediction of complex traits in mice. The dataset contained phenotypic and genotypic information for 835 animals from 6 non-overlapping generations. Traits analyzed were bone mineral density (BMD), body weight at 10, 15 and 20 weeks (BW10, BW15 and BW20), fat percentage (FAT%), circulating cholesterol (CHOL), glucose (GLUC), insulin (INS) and triglycerides (TGL), and urine creatinine (UCRT). After quality control, the genotype dataset contained 50,112 SNP markers. Animals from older generations formed the reference subset, while animals in the latest generation were candidates for the validation subset. We also evaluated the impact of different levels of connectedness between reference and validation sets. Model performance was measured as the Pearson correlation coefficient and the mean squared error (MSE) between adjusted phenotypes and the model's predictions for animals in the validation subset. Outcomes were also compared across models by checking the overlap in top-ranked markers and animals. Linear models outperformed GBM for seven out of ten traits. For these models, accuracy was proportional to the trait's heritability. For traits BMD, CHOL and GLUC, the GBM model showed better prediction accuracy and lower MSE. Interestingly, for these three traits there is evidence in the literature of a relevant portion of phenotypic variance being explained by epistatic effects.
We noticed that for lower connectedness, i.e., when imposing a gap of one to two generations between reference and validation populations, the superior performance of GBM was only maintained for GLUC. For some traits, fitting a subset of top markers selected from a GBM model into linear and GBM models improved prediction accuracy. The GBM model consistently showed fewer markers and animals in common among the top-ranked than the linear models did. Our results indicate that GBM is more strongly affected by data size and decreased connectedness between reference and validation sets than the linear models. Nevertheless, our results indicate that GBM is a competitive method for predicting complex traits in an outbred mouse population, especially for traits with assumed epistatic effects.
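
The two performance measures named in the abstract, the Pearson correlation coefficient and the MSE between adjusted phenotypes and predictions, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the four toy values are hypothetical, and the standard textbook definitions are assumed.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_squared_error(observed, predicted):
    """Average squared difference between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical adjusted phenotypes and model predictions for validation animals.
adjusted_phenotypes = [1.0, 2.0, 3.0, 4.0]
predictions = [1.5, 1.5, 3.5, 3.5]

accuracy = pearson_r(adjusted_phenotypes, predictions)
mse = mean_squared_error(adjusted_phenotypes, predictions)
print(f"accuracy (Pearson r) = {accuracy:.3f}, MSE = {mse:.3f}")
```

Reporting both measures is useful because the correlation reflects how well animals are ranked, while the MSE also penalizes predictions that are biased or wrongly scaled.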

2020
Author(s):
Zhaotong Lin
Souvik Seal
Saonli Basu

Abstract
SNP heritability of a trait is measured by the proportion of total variance explained by the additive effects of genome-wide single nucleotide polymorphisms (SNPs). Linear mixed models are routinely used to estimate SNP heritability for many complex traits. The basic concept behind this approach is to model the genetic contribution as a random effect, where the variance of this genetic contribution contributes to the heritability of the trait. This linear mixed model approach requires estimation of 'relatedness' among individuals in the sample, which is usually captured by estimating a genetic relationship matrix (GRM). Heritability is estimated by the restricted maximum likelihood (REML) or method of moments (MOM) approaches, and this estimation relies heavily on the GRM computed from the genetic data on individuals. The presence of population substructure in the data could significantly impact the GRM estimation and may introduce bias in heritability estimation. The common practice for accounting for such population substructure is to adjust for the top few principal components of the GRM as covariates in the linear mixed model. Here we propose an alternative way of estimating heritability in multi-ethnic studies. Our proposed approach is a MOM estimator derived from the Haseman-Elston regression and gives an asymptotically unbiased estimate of heritability in the presence of population stratification. It introduces adjustments for the population stratification in a second-order estimating equation and allows the total phenotypic variance to vary by ethnicity. We study the performance of different MOM and REML approaches in the presence of population stratification through extensive simulation studies. We estimate the heritability of height, weight and other anthropometric traits in the UK Biobank cohort to investigate the impact of subtle population substructure on SNP heritability estimation.
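
For orientation, the classic Haseman-Elston regression that the proposed estimator builds on can be sketched as below. This is only the unadjusted, through-the-origin textbook form, regressing products of standardized phenotypes on off-diagonal GRM entries; it is not the stratification-adjusted estimator proposed by the authors. The function name and the 4-individual toy data are hypothetical.

```python
import math


def he_regression_h2(grm, phenotypes):
    """Method-of-moments heritability estimate from Haseman-Elston regression.

    Regresses the cross-products z_i * z_j of standardized phenotypes on the
    off-diagonal GRM entries A_ij (i < j), with no intercept:
        h2 = sum(A_ij * z_i * z_j) / sum(A_ij ** 2)
    """
    n = len(phenotypes)
    mean = sum(phenotypes) / n
    sd = math.sqrt(sum((p - mean) ** 2 for p in phenotypes) / n)
    z = [(p - mean) / sd for p in phenotypes]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            numerator += grm[i][j] * z[i] * z[j]
            denominator += grm[i][j] ** 2
    return numerator / denominator


# Hypothetical toy data: two pairs of close relatives, weak relatedness across pairs.
toy_grm = [
    [1.0, 0.5, 0.125, 0.125],
    [0.5, 1.0, 0.125, 0.125],
    [0.125, 0.125, 1.0, 0.5],
    [0.125, 0.125, 0.5, 1.0],
]
toy_phenotypes = [1.0, 1.0, -1.0, -1.0]
h2_estimate = he_regression_h2(toy_grm, toy_phenotypes)
print(f"HE regression estimate: {h2_estimate:.3f}")
```

Because this is a moment estimator, nothing constrains it to the interval [0, 1], and with a sample this small the value carries no biological meaning. The sketch also makes no correction for population structure, which is the source of bias the abstract addresses.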


2021
Vol 11 (1)
Author(s):
Zia U. Ahmed
Kang Sun
Michael Shelly
Lina Mu

Abstract
Machine learning (ML) has demonstrated promise in predicting mortality; however, understanding spatial variation in risk factor contributions to mortality rate requires explainability. We applied explainable artificial intelligence (XAI) to a stack-ensemble machine learning model framework to explore and visualize the spatial distribution of the contributions of known risk factors to lung and bronchus cancer (LBC) mortality rates in the conterminous United States. We used five base learners to develop stack-ensemble models: generalized linear model (GLM), random forest (RF), gradient boosting machine (GBM), extreme gradient boosting machine (XGBoost), and deep neural network (DNN). Then we applied several model-agnostic approaches to interpret and visualize the stack-ensemble model's output at global and local scales (at the county level). The stack ensemble generally performs better than all the base learners and three spatial regression models. A permutation-based feature importance technique ranked smoking prevalence as the most important predictor, followed by poverty and elevation. However, the impact of these risk factors on LBC mortality rates varies spatially. This is the first study to use ensemble machine learning with explainable algorithms to explore and visualize the spatial heterogeneity of the relationships between LBC mortality and risk factors in the contiguous USA.
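
The permutation-based feature importance mentioned in the abstract can be illustrated with a generic sketch. The abstract does not specify the implementation used, so the version below assumes the common definition (mean increase in MSE after shuffling one feature column) and works with any prediction function. The function names, the toy model, and the data are hypothetical.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each feature column is shuffled in turn.

    `predict` maps one feature row (a list) to a prediction, so the procedure
    is model-agnostic: it never looks inside the model.
    """
    rng = random.Random(seed)
    baseline = mse(y, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the target
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(mse(y, [predict(r) for r in permuted]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical model that depends only on the first of two features.
def toy_predict(row):
    return 2.0 * row[0]


toy_rows = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0], [5.0, 9.0], [6.0, 2.0]]
toy_y = [toy_predict(r) for r in toy_rows]
importances = permutation_importance(toy_predict, toy_rows, toy_y)
print(importances)
```

In this toy case the second feature is never used by the model, so shuffling it leaves the error unchanged and its importance is zero, while shuffling the first feature increases the error.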


2021
Author(s):
Sofyan Sbahi
Naaila Ouazzani
Abderrahmane Lahrouni
Abdessamed Hejjaj
Laila Mandi

The quality of effluents from wastewater treatment plants remains a challenge, especially in underprivileged rural areas where water resources are mostly affected by pollution, depletion and excessive exploitation. Thus, the prediction of phosphorus removal is one of the most important tasks in the management of wastewater effluent. Predictive model accuracy is crucial for the safe reuse of treated water with respect to public health and the environment. However, linear models that use a high-dimensional dataset may be unable to produce accurate and interpretable predictions. To address this complexity, the current study evaluates the effect of hydraulic retention time (HRT) on the removal of orthophosphates (PO₄–P) and total phosphorus (TP) by the eco-friendly multi-soil-layering (MSL) technology. In addition, it attempts to predict this removal from domestic wastewater using a combined approach based on a feature selection technique and the gradient boosting machine (GBM) algorithm. Sixteen physicochemical and bacterial indicators were monitored for a one-year period. The results show that the HRT significantly affects (p < 0.01) the removal of phosphorus content by the MSL system. The HRT, pH, PO₄–P and TP were identified as relevant for predicting the removal of TP, while HRT and PO₄–P were sufficient for predicting the removal rate of PO₄–P. The accuracy analysis using the validation dataset demonstrates that the GBM models have high credibility, as they achieve an R² > 0.92, while the sensitivity analysis reveals that the HRT was the most important factor affecting phosphorus removal in the MSL system. In addition, the modeling results show that the GBM model is useful for predicting pollutant removal in the MSL technology and investigating its behavior.


2020
Author(s):
Mohamed Abdulkadir
Christopher Hübel
Moritz Herle
Ruth J.F. Loos
Gerome Breen
...

Abstract
Background: Deviating growth from the norm during childhood has been associated with anorexia nervosa (AN) and obesity later in life. In this study, we examined whether polygenic scores (PGS) for AN and obesity are associated, individually or combined, with a range of anthropometric trajectories spanning the first two decades of life.
Methods: AN-PGS and obesity-PGS were calculated for participants of the Avon Longitudinal Study of Parents and Children (ALSPAC; N = 8,654 participants with genotype data and at least one outcome measure). Using generalized (mixed) linear models, we associated PGS with trajectories of weight, height, body mass index (BMI), fat mass index (FMI), lean mass index (LMI), and bone mineral density (BMD). Growth trajectories were derived using spline modeling or mixed effects modeling.
Results: Between ages 5 and 24 years, females with one SD higher AN-PGS had on average a 0.01% lower BMI trajectory, and between ages 10 and 24 years a 0.01% lower FMI trajectory and a 0.05% lower weight trajectory. Higher obesity-PGS was associated with higher BMI, FMI, LMI, BMD, weight, and lower height trajectories in both sexes. The average growth trajectories of females with high AN-PGS/low obesity-PGS remained consistently lower than those with low AN-PGS/low obesity-PGS; this difference did not reach statistical significance. However, post-hoc comparisons suggest that females with high AN-PGS/low obesity-PGS did follow lower growth trajectories compared to those with high PGS for both traits.
Conclusion: AN-PGS and obesity-PGS have detectable sex-dependent effects on a range of anthropometry trajectories. These findings encourage further research into how the AN-PGS and the obesity-PGS co-influence growth during childhood, in which the obesity-PGS can mitigate the effects of the AN-PGS.
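
As background for the "one SD higher AN-PGS" effect sizes, a polygenic score is conventionally a weighted sum of allele dosages, standardized across individuals. The sketch below assumes that generic definition only; the abstract does not describe the scoring pipeline, and the function names, SNP weights, and dosages are hypothetical.

```python
import math


def polygenic_scores(dosages, weights):
    """Weighted sum of allele dosages (0, 1 or 2 copies) for each individual."""
    return [sum(w * d for w, d in zip(weights, row)) for row in dosages]


def standardize(scores):
    """Rescale scores to mean 0 and SD 1 so effects read as 'per SD of PGS'."""
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    return [(s - mean) / sd for s in scores]


# Hypothetical per-SNP effect weights and dosages for three individuals.
weights = [0.5, -0.25, 1.0]
dosages = [
    [2, 0, 1],
    [0, 2, 0],
    [1, 1, 1],
]
raw_scores = polygenic_scores(dosages, weights)
z_scores = standardize(raw_scores)
print(raw_scores, z_scores)
```

Standardizing is what allows a regression coefficient on the score to be read as the change in outcome per one SD of PGS.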


2018
Author(s):
Chenyong Miao
Jinliang Yang
James C. Schnable

Abstract
Background: Association studies use statistical links between genetic markers and variation in a phenotype's value across many individuals to identify genes controlling variation in the target phenotype. However, this approach, particularly when conducted on a genome-wide scale (GWAS), has limited power to identify the genes responsible for variation in traits controlled by complex genetic architectures.
Results: Here we employ simulation studies utilizing real-world genotype datasets from association populations in four species with distinct minor allele frequency distributions, population structures, and patterns of linkage disequilibrium to evaluate the impact of variation in both heritability and trait complexity on both conventional mixed linear model based GWAS and two new approaches specifically developed for complex traits. Mixed linear model based GWAS rapidly loses power for more complex traits. FarmCPU, a method based on multi-locus mixed linear models, provides the greatest statistical power for moderately complex traits. A Bayesian approach adopted from genomic prediction provides the greatest statistical power to identify causal genetic loci for extremely complex traits.
Conclusions: Using estimates of the complexity of the genetic architecture of target traits can guide the selection of appropriate statistical methods and improve the overall accuracy and power of GWAS.


2020
Vol 39 (5)
pp. 6579-6590
Author(s):
Sandy Çağlıyor
Başar Öztayşi
Selime Sezgin

The motion picture industry is one of the largest industries worldwide and has significant importance in the global economy. Considering the high stakes and high risks in the industry, forecast models and decision support systems are gaining importance. Several attempts have been made to estimate the theatrical performance of a movie before or at the early stages of its release. Nevertheless, these models are mostly used for predicting domestic performance, and the industry still struggles to predict box office performance in overseas markets. In this study, the aim is to design a forecast model using different machine learning algorithms to estimate the theatrical success of US movies in Turkey. A dataset of 1559 movies is constructed from various sources. First, the independent variables are grouped as pre-release, distributor type, and international distribution based on their characteristics. The number of attendances is discretized into three classes. Four popular machine learning algorithms (artificial neural networks, decision tree regression, gradient boosting tree and random forest) are employed, and the impact of each variable group is assessed by comparing the performance of the resulting models. Then the number of target classes is increased to five and eight, and the results are compared with previously developed models in the literature.


2021
Vol 21 (1)
Author(s):
Gerhard Müller
Manuela Bombana
Monika Heinzel-Gutenbrenner
Nikolaus Kleindienst
Martin Bohus
...

Abstract
Background: Mental disorders are related to high individual suffering and significant socio-economic burdens. However, it remains unclear to what extent self-reported mental distress is related to individuals' days of incapacity to work and their medical costs. This study aims to investigate the impact of self-reported mental distress on specific and non-specific days of incapacity to work and specific and non-specific medical costs over a two-year span.
Method: Within a longitudinal research design, 2287 study participants' mental distress was assessed using the Hospital Anxiety and Depression Scale (HADS). HADS scores were included as predictors in generalized linear models with a Tweedie distribution and a log link function to predict participants' days of incapacity to work and medical costs, retrieved from their health insurance routine data, during the following two-year period.
Results: Current mental distress was found to be significantly related to the number of specific days absent from work and medical costs. Compared to participants classified as no cases by the HADS (2.6 days), severe case participants showed 27.3-times as many specific days of incapacity to work in the first year (72 days) and 10.3-times as many days in the second year (44 days), and incurred 11.4-times higher medical costs in the first year (2272 EUR) and 6.2-times higher costs in the second year (1319 EUR). The relationship of mental distress to non-specific days of incapacity to work and non-specific medical costs was also significant, but mainly driven by specific absent days and specific medical costs. Our results also indicate that the prevalence of presenteeism is considerably high: 42% of individuals continued to go to work despite severe mental distress.
Conclusions: Our results show that self-reported mental distress, assessed by the HADS, is highly related to the days of incapacity to work and medical costs in the two-year period. Reducing mental distress by improving preventive structures for at-risk populations and increasing access to evidence-based treatments for individuals with mental disorders might, therefore, pay for itself and could help to reduce public costs.

