Elements for Building Supervised Statistical Machine Learning Models

Author(s):  
Osval Antonio Montesinos López ◽  
Abelardo Montesinos López ◽  
Jose Crossa

This chapter gives details of the linear multiple regression model, including its assumptions and some pros and cons. Maximum likelihood and gradient descent methods are described for learning the parameters under this model. Penalized linear multiple regression is derived under Ridge and Lasso penalties, with emphasis on estimating the regularization parameter, which is important for successful implementation. Examples are given for both penalties (Ridge and Lasso) and for the non-penalized multiple regression framework, to illustrate the circumstances in which the penalized versions should be preferred. Finally, the fundamentals of penalized and non-penalized logistic regression are provided under a gradient descent framework. We give examples of logistic regression. Each example comes with the corresponding R code to facilitate quick understanding and use.
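As a rough illustration of the gradient descent idea for Ridge-penalized linear regression mentioned in this abstract, the following is a minimal sketch in plain Python, not the chapter's R code. The function name `ridge_gradient_descent`, the learning rate, the iteration count and the toy data are all assumptions made for this example; the objective is assumed to be mean squared error plus `lam` times the sum of squared slopes, with the intercept left unpenalized.

```python
def ridge_gradient_descent(X, y, lam=0.1, lr=0.01, n_iter=5000):
    """Minimize mean((b0 + X.b - y)^2) + lam * sum(b_j^2) by gradient descent.

    X is a list of rows, y a list of responses. The intercept b0 is not
    penalized. All names and defaults here are illustrative assumptions.
    """
    n, p = len(X), len(X[0])
    b0, b = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residuals of the current fit.
        resid = [b0 + sum(bj * xj for bj, xj in zip(b, row)) - yi
                 for row, yi in zip(X, y)]
        # Gradient of the penalized loss with respect to b0 and each b_j.
        grad0 = 2.0 / n * sum(resid)
        grad = [2.0 / n * sum(r * row[j] for r, row in zip(resid, X))
                + 2.0 * lam * b[j]
                for j in range(p)]
        b0 -= lr * grad0
        b = [bj - lr * gj for bj, gj in zip(b, grad)]
    return b0, b


# Toy data generated from y = 1 + 2x (hypothetical, for illustration only).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]

b0_ols, b_ols = ridge_gradient_descent(X, y, lam=0.0)      # no penalty
b0_ridge, b_ridge = ridge_gradient_descent(X, y, lam=1.0)  # Ridge penalty
```

With `lam=0.0` the fit should approach the least-squares line, while a positive `lam` shrinks the slope toward zero. In practice the regularization parameter would be chosen by cross-validation, which this sketch omits.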

PLoS ONE ◽  
2021 ◽  
Vol 16 (8) ◽  
pp. e0256592
Author(s):  
Mark N. Warden ◽  
Susan Searles Nielsen ◽  
Alejandra Camacho-Soto ◽  
Roman Garnett ◽  
Brad A. Racette

Identifying people with Parkinson disease during the prodromal period, including via algorithms in administrative claims data, is an important research and clinical priority. We sought to improve upon an existing penalized logistic regression model, based on diagnosis and procedure codes, by adding prescription medication data or using machine learning. Using Medicare Part D beneficiaries age 66–90 from a population-based case-control study of incident Parkinson disease, we fit a penalized logistic regression both with and without Part D data. We also built a predictive algorithm using a random forest classifier for comparison. In a combined approach, we introduced the probability of Parkinson disease from the random forest as a predictor in the penalized regression model. We calculated the receiver operating characteristic area under the curve (AUC) for each model. All models performed well, with AUCs ranging from 0.824 (simplest model) to 0.835 (combined approach). We conclude that medication data and random forests improve Parkinson disease prediction, but are not essential.
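The AUC used to compare the models above can be read as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control. The sketch below computes it directly from that pairwise definition. The function name `roc_auc` and the toy labels and scores are assumptions for illustration and are not data from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    labels: 1 for cases, 0 for controls. Ties count as half a correct pair.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities for two controls and two cases.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)  # 3 of 4 case/control pairs are ordered correctly
```

The pairwise loop is quadratic in sample size, so for claims-scale data a rank-based implementation from a statistics library would be the practical choice.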


2018 ◽  
Author(s):  
Florian Privé ◽  
Hugues Aschard ◽  
Michael G.B. Blum

Polygenic Risk Scores (PRS) consist in combining the information across many single-nucleotide polymorphisms (SNPs) in a score reflecting the genetic risk of developing a disease. PRS might have a major impact on public health, possibly allowing for screening campaigns to identify high-genetic risk individuals for a given disease. The “Clumping+Thresholding” (C+T) approach is the most common method to derive PRS. C+T uses only univariate genome-wide association studies (GWAS) summary statistics, which makes it fast and easy to use. However, previous work showed that jointly estimating SNP effects for computing PRS has the potential to significantly improve the predictive performance of PRS as compared to C+T. In this paper, we present an efficient method to jointly estimate SNP effects, allowing for practical application of penalized logistic regression (PLR) on modern datasets including hundreds of thousands of individuals. Moreover, our implementation of PLR directly includes automatic choices for hyper-parameters. The choice of hyper-parameters for a predictive model is very important since it can dramatically impact its predictive performance. As an example, AUC values range from less than 60% to 90% in a model with 30 causal SNPs, depending on the p-value threshold in C+T. We compare the performance of PLR, C+T and a derivation of random forests using both real and simulated data. PLR consistently achieves higher predictive performance than the two other methods while being as fast as C+T. We find that improvement in predictive performance is more pronounced when there are few effects located in nearby genomic regions with correlated SNPs; for instance, AUC values increase from 83% with the best prediction of C+T to 92.5% with PLR. We confirm these results in a data analysis of a case-control study for celiac disease where PLR and the standard C+T method achieve AUC of 89% and of 82.5%. In conclusion, our study demonstrates that penalized logistic regression can achieve more discriminative polygenic risk scores, while being applicable to large-scale individual-level data thanks to the implementation we provide in the R package bigstatsr.
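To make the C+T baseline described above concrete, here is a simplified sketch in plain Python. It is not the authors' bigstatsr implementation and it omits the genomic windows that real clumping tools use. The function names, the thresholds `p_max` and `r2_max`, and the toy genotypes, effect sizes and p-values are all hypothetical. SNPs are visited in order of increasing p-value, a SNP is kept only if it passes the p-value threshold and is weakly correlated with every SNP already kept, and the score is the effect-weighted sum of allele counts.

```python
def pearson_r2(a, b):
    """Squared Pearson correlation between two genotype columns."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov * cov / (va * vb)


def clump_and_threshold(genotypes, pvalues, p_max=0.05, r2_max=0.2):
    """Return indices of SNPs kept after thresholding and greedy clumping.

    genotypes[j] is the list of allele counts of SNP j across individuals.
    """
    order = sorted(range(len(pvalues)), key=lambda j: pvalues[j])
    kept = []
    for j in order:
        if pvalues[j] > p_max:  # thresholding: remaining SNPs are weaker
            break
        # Clumping: drop SNPs correlated with a more significant kept SNP.
        if all(pearson_r2(genotypes[j], genotypes[k]) < r2_max for k in kept):
            kept.append(j)
    return kept


def polygenic_score(genotypes, betas, kept):
    """Effect-weighted sum of allele counts over the kept SNPs."""
    n = len(genotypes[0])
    return [sum(betas[j] * genotypes[j][i] for j in kept) for i in range(n)]


# Hypothetical data: 4 SNPs genotyped in 6 individuals.
genotypes = [
    [0, 1, 2, 0, 1, 2],  # SNP 0: most significant
    [0, 1, 2, 0, 1, 2],  # SNP 1: in perfect correlation with SNP 0
    [2, 0, 1, 1, 0, 2],  # SNP 2: uncorrelated with SNP 0
    [1, 1, 0, 2, 0, 1],  # SNP 3: fails the p-value threshold
]
betas = [0.5, 0.4, -0.3, 0.2]
pvalues = [1e-6, 1e-4, 0.01, 0.5]

kept = clump_and_threshold(genotypes, pvalues)
scores = polygenic_score(genotypes, betas, kept)
```

The sensitivity to `p_max` is the hyper-parameter issue the abstract points out: each threshold gives a different set of kept SNPs and therefore a different score.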


2020 ◽  
Vol 8 (3) ◽  
pp. 214-219
Author(s):  
Patrick Bezerra Fernandes ◽  
Rodrigo Amorim Barbosa ◽  
Maria Da Graça Morais ◽  
Cauby De Medeiros-Neto ◽  
Antonio Leandro Chaves Gurgel ◽  
...  

The aim of this study was to verify the precision and accuracy of 5 models for leaf area prediction using length and width of leaf blades of Megathyrsus maximus cv. BRS Zuri and to reparametrize the models. Data for the predictor variables, length (L) and width (W) of leaf blades of BRS Zuri grass tillers, were collected in May 2018 in the experimental area of Embrapa Gado de Corte, Mato Grosso do Sul, Brazil. The predictor variables had high correlation values (P<0.001). In the analysis of adequacy of the models, the first-degree models that use leaf blade length (Model A), leaf width × leaf length (Model B) and linear multiple regression (Model C) produced estimated values similar to the observed leaf area values (P>0.05), with high values for the coefficient of determination (>80%) and the concordance correlation coefficient (>90%). Among the 5 models evaluated, the linear multiple regression (Model C: β0 = -5.97, β1 = 0.489, β2 = 1.11 and β3 = 0.351; R² = 89.64; P<0.001), with width, length and length × width of the leaf blade as predictor variables, is the most adequate to generate precise and accurate estimates of the leaf area of BRS Zuri grass.
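For readers who want to evaluate Model C numerically, the sketch below plugs the reported coefficients into a linear predictor. The pairing of coefficients with predictors (β1 with width, β2 with length, β3 with length × width) is an assumption based only on the order in which the abstract lists them, and the abstract does not state measurement units, so the output should be treated as illustrative. The function name `leaf_area_model_c` and the example blade dimensions are hypothetical.

```python
def leaf_area_model_c(length, width,
                      b0=-5.97, b_width=0.489, b_length=1.11, b_lw=0.351):
    """Linear predictor b0 + b_width*W + b_length*L + b_lw*(L*W).

    ASSUMPTION: the coefficient-to-predictor mapping follows the order in
    which the abstract lists the predictors; the source does not confirm it.
    """
    return b0 + b_width * width + b_length * length + b_lw * length * width


# Hypothetical leaf blade: length 10 and width 2, in the study's (unstated) units.
estimate = leaf_area_model_c(length=10.0, width=2.0)
```

Because the intercept is negative, very small blades would get a negative predicted area, so a fitted model of this form is only meaningful within the range of blade sizes used to estimate it.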


1979 ◽  
Vol 49 (2) ◽  
pp. 583-590 ◽  
Author(s):  
Lars Nystedt ◽  
Kevin R. Murphy

The accuracy of multiple regression models, models employing subjective weights and models employing relative subjective weights in reproducing judgments was studied. Multiple regression models were most accurate. When subjects were divided into two groups according to the degree of configurality shown in their matrix of subjective weights, striking differences were found in the degree of overlap of the multiple regression models and the models employing subjective weights. In particular, when subjective policies were essentially linear, the predicted judgments produced by these policies were highly correlated with the predicted judgments of the multiple regression models. When subjective policies were highly configural, the subjective models accounted for variance in judgments not accounted for by the linear multiple regression model.


Author(s):  
Himanshu Rajput

Smartphone-based messaging applications have shown phenomenal growth with the proliferation of the internet, coupled with the high penetration of smartphones among the masses. The current study is an attempt to understand the relationship between individuals' personality and their use of WhatsApp, a popular smartphone-based messaging application, in the Indian context. For personality assessment, the study uses the Big Five Inventory. A questionnaire consisting of items on individual WhatsApp use and the Big Five Inventory was administered to students at an Indian university. Multiple regression and logistic regression revealed significant relationships between personality and WhatsApp usage, as well as use of its different inbuilt functions.


Author(s):  
Keisuke Kokubun ◽  
Yoshinori Yamakawa

The coronavirus disease (COVID-19) continues to spread globally. While social distancing has attracted attention as a measure to prevent the spread of infection, some occupations find it difficult to implement. Therefore, this study aims to investigate the relationship between work characteristics and social distancing using data available on O*NET, an occupational information site. A total of eight factors were extracted by performing an exploratory factor analysis: work conditions, supervisory work, information processing, response to aggression, specialization, autonomy, interaction outside the organization, and interdependence. A multiple regression analysis showed that interdependence, response to aggression, and interaction outside the organization, which are categorized as “social characteristics,” and information processing and specialization, which are categorized as “knowledge characteristics,” were associated with physical proximity. Furthermore, we added customer, which represents contact with the customer, and remote working, which represents a small amount of outdoor activity, to our multiple regression model, and confirmed that they increased the explanatory power of the model. This suggests that those who work under interdependence, face aggression, and engage in outside activities, and/or have frequent contact with customers, little interaction outside the organization, and little information processing will have the most difficulty in maintaining social distancing.

