Elements for Building Supervised Statistical Machine Learning Models

Multivariate Statistical Machine Learning Methods for Genomic Prediction ◽

10.1007/978-3-030-89010-0_3 ◽

2022 ◽

pp. 71-108

Author(s):

Osval Antonio Montesinos López ◽

Abelardo Montesinos López ◽

Jose Crossa

Keyword(s):

Logistic Regression ◽

Multiple Regression ◽

Regularization Parameter ◽

Penalized Regression ◽

Multiple Regression Model ◽

Successful Implementation ◽

Statistical Machine Learning ◽

Linear Multiple Regression ◽

Pros And Cons ◽

Penalized Logistic Regression

This chapter gives details of the linear multiple regression model, including its assumptions, some pros and cons, and maximum likelihood estimation. Gradient descent methods are described for learning the parameters under this model. Penalized linear multiple regression is derived under Ridge and Lasso penalties, with emphasis on estimating the regularization parameter, which is important for successful implementation. Examples are given for both the penalized (Ridge and Lasso) and the non-penalized multiple regression frameworks to illustrate the circumstances in which the penalized versions should be preferred. Finally, the fundamentals of penalized and non-penalized logistic regression are provided under a gradient descent framework, with examples of logistic regression. Each example comes with the corresponding R code to facilitate quick understanding and use.
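As a rough illustration of the ideas in this abstract, the following is a minimal sketch of Ridge-penalized linear regression fitted by batch gradient descent. It is not the chapter's R code: the function name `ridge_gradient_descent`, the scaling of the objective, the step size, and the toy data are all assumptions made for this example.

```python
def ridge_gradient_descent(X, y, lam=0.0, lr=0.1, n_iter=5000):
    """Fit y ~ b0 + X.b by batch gradient descent.

    Objective (an assumed scaling, not taken from the chapter):
        (1 / (2n)) * sum_i (y_i - b0 - x_i.b)^2 + (lam / 2) * sum_j b_j^2
    The intercept b0 is left unpenalized.
    """
    n, p = len(X), len(X[0])
    b0, b = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residuals of the current fit: prediction minus observation.
        resid = [b0 + sum(bj * xj for bj, xj in zip(b, row)) - yi
                 for row, yi in zip(X, y)]
        # Gradient of the squared-error term, plus the Ridge term for slopes.
        grad_b0 = sum(resid) / n
        grad_b = [sum(r * row[j] for r, row in zip(resid, X)) / n + lam * b[j]
                  for j in range(p)]
        b0 -= lr * grad_b0
        b = [bj - lr * gj for bj, gj in zip(b, grad_b)]
    return b0, b


# Toy data generated from y = 1 + 2x (hypothetical, for illustration only).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]
ols_b0, ols_b = ridge_gradient_descent(X, y, lam=0.0)    # no penalty
ridge_b0, ridge_b = ridge_gradient_descent(X, y, lam=1.0)  # Ridge penalty
```

With `lam=0.0` the fit recovers the unpenalized least-squares line; with `lam=1.0` the slope is shrunk toward zero, which is the behaviour the regularization parameter controls. In practice that parameter would be chosen by a procedure such as cross-validation rather than fixed by hand.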

A comparison of prediction approaches for identifying prodromal Parkinson disease

PLoS ONE ◽

10.1371/journal.pone.0256592 ◽

2021 ◽

Vol 16 (8) ◽

pp. e0256592

Author(s):

Mark N. Warden ◽

Susan Searles Nielsen ◽

Alejandra Camacho-Soto ◽

Roman Garnett ◽

Brad A. Racette

Keyword(s):

Logistic Regression ◽

Random Forest ◽

Parkinson Disease ◽

Regression Model ◽

Area Under The Curve ◽

Penalized Regression ◽

Administrative Claims ◽

Combined Approach ◽

Part D ◽

Penalized Logistic Regression

Identifying people with Parkinson disease during the prodromal period, including via algorithms in administrative claims data, is an important research and clinical priority. We sought to improve upon an existing penalized logistic regression model, based on diagnosis and procedure codes, by adding prescription medication data or using machine learning. Using Medicare Part D beneficiaries aged 66–90 from a population-based case-control study of incident Parkinson disease, we fit a penalized logistic regression both with and without Part D data. We also built a predictive algorithm using a random forest classifier for comparison. In a combined approach, we introduced the probability of Parkinson disease from the random forest as a predictor in the penalized regression model. We calculated the receiver operating characteristic area under the curve (AUC) for each model. All models performed well, with AUCs ranging from 0.824 (simplest model) to 0.835 (combined approach). We conclude that medication data and random forests improve Parkinson disease prediction, but are not essential.
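The AUC used to compare these models can be sketched as follows. This is a minimal pure-Python version based on the pairwise (Mann-Whitney) interpretation of the AUC, not the authors' code; the function name `roc_auc` and the toy labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = case, 0 = control).

    Computed as the fraction of case/control pairs in which the case has
    the higher score, counting ties as half a win.
    """
    cases = [s for l, s in zip(labels, scores) if l == 1]
    controls = [s for l, s in zip(labels, scores) if l == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for s_case in cases:
        for s_control in controls:
            if s_case > s_control:
                wins += 1.0
            elif s_case == s_control:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical predicted probabilities for four people.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in the sample size, so it suits small illustrations only; a rank-based computation would be preferable for claims-scale data.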

Efficient implementation of penalized regression for genetic risk prediction

10.1101/403337 ◽

2018 ◽

Cited By ~ 1

Author(s):

Florian Privé ◽

Hugues Aschard ◽

Michael G.B. Blum

Keyword(s):

Logistic Regression ◽

Genetic Risk ◽

Association Studies ◽

Predictive Performance ◽

Penalized Regression ◽

Risk Scores ◽

P Value ◽

Genome Wide Association Studies ◽

Polygenic Risk ◽

Penalized Logistic Regression

Polygenic Risk Scores (PRS) consist of combining the information across many single-nucleotide polymorphisms (SNPs) in a score reflecting the genetic risk of developing a disease. PRS might have a major impact on public health, possibly allowing for screening campaigns to identify high-genetic risk individuals for a given disease. The “Clumping+Thresholding” (C+T) approach is the most common method to derive PRS. C+T uses only univariate genome-wide association studies (GWAS) summary statistics, which makes it fast and easy to use. However, previous work showed that jointly estimating SNP effects for computing PRS has the potential to significantly improve the predictive performance of PRS as compared to C+T. In this paper, we present an efficient method to jointly estimate SNP effects, allowing for practical application of penalized logistic regression (PLR) on modern datasets including hundreds of thousands of individuals. Moreover, our implementation of PLR directly includes automatic choices for hyper-parameters. The choice of hyper-parameters for a predictive model is very important since it can dramatically impact its predictive performance. As an example, AUC values range from less than 60% to 90% in a model with 30 causal SNPs, depending on the p-value threshold in C+T. We compare the performance of PLR, C+T and a derivation of random forests using both real and simulated data. PLR consistently achieves higher predictive performance than the two other methods while being as fast as C+T. We find that improvement in predictive performance is more pronounced when there are few effects located in nearby genomic regions with correlated SNPs; for instance, AUC values increase from 83% with the best prediction of C+T to 92.5% with PLR. We confirm these results in a data analysis of a case-control study for celiac disease, where PLR and the standard C+T method achieve AUCs of 89% and 82.5%, respectively. In conclusion, our study demonstrates that penalized logistic regression can achieve more discriminative polygenic risk scores, while being applicable to large-scale individual-level data thanks to the implementation we provide in the R package bigstatsr.
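To make the role of the p-value threshold concrete, here is a minimal sketch of the thresholding half of C+T: a PRS computed as a weighted sum of allele counts over the SNPs whose GWAS p-value passes a cutoff. The clumping step is deliberately omitted, this is not the bigstatsr implementation, and the function name `threshold_prs` and all numbers are hypothetical.

```python
def threshold_prs(genotypes, effects, pvalues, threshold):
    """Polygenic risk score per individual using p-value thresholding only.

    genotypes: one list of allele counts (0, 1 or 2) per individual
    effects:   univariate GWAS effect size per SNP
    pvalues:   univariate GWAS p-value per SNP
    Clumping (pruning correlated SNPs) is NOT performed in this sketch.
    """
    kept = [j for j, p in enumerate(pvalues) if p <= threshold]
    return [sum(effects[j] * person[j] for j in kept) for person in genotypes]


# Hypothetical data: two individuals, three SNPs.
genotypes = [[0, 1, 2], [2, 0, 1]]
effects = [0.5, -0.2, 0.1]
pvalues = [1e-8, 0.03, 0.5]

lenient = threshold_prs(genotypes, effects, pvalues, threshold=0.05)
strict = threshold_prs(genotypes, effects, pvalues, threshold=1e-5)
```

Changing the threshold changes which SNPs enter the score, and therefore the scores themselves, which is why the abstract treats it as a hyper-parameter that can strongly affect predictive performance.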

Evaluation and reparametrization of mathematical models for prediction of the leaf area of Megathyrsus maximus cv. BRS Zuri

Tropical Grasslands - Forrajes Tropicales ◽

10.17138/tgft(8)214-219 ◽

2020 ◽

Vol 8 (3) ◽

pp. 214-219

Author(s):

Patrick Bezerra Fernandes ◽

Rodrigo Amorim Barbosa ◽

Maria Da Graça Morais ◽

Cauby De Medeiros-Neto ◽

Antonio Leandro Chaves Gurgel ◽

...

Keyword(s):

Regression Model ◽

Leaf Area ◽

Multiple Regression ◽

Leaf Blade ◽

Multiple Regression Model ◽

Leaf Length ◽

Predictor Variables ◽

Mato Grosso ◽

Linear Multiple Regression ◽

Model C

The aim of this study was to verify the precision and accuracy of 5 models for leaf area prediction using length and width of leaf blades of Megathyrsus maximus cv. BRS Zuri and to reparametrize the models. Data for the predictor variables, length (L) and width (W) of leaf blades of BRS Zuri grass tillers, were collected in May 2018 in the experimental area of Embrapa Gado de Corte, Mato Grosso do Sul, Brazil. The predictor variables had high correlation values (P<0.001). In the analysis of adequacy of the models, the first-degree models that use leaf blade length (Model A), leaf width × leaf length (Model B) and linear multiple regression (Model C) produced estimated values similar to the observed leaf area values (P>0.05), with high values for the determination coefficient (>80%) and the concordance correlation coefficient (>90%). Among the 5 models evaluated, the linear multiple regression (Model C: β0 = -5.97, β1 = 0.489, β2 = 1.11 and β3 = 0.351; R² = 89.64%; P<0.001), with width, length and length × width of the leaf blade as predictor variables, is the most adequate to generate precise and accurate estimates of the leaf area of BRS Zuri grass.
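The structure of Model C can be sketched as a linear predictor with an interaction term. The abstract does not state which coefficient belongs to which predictor, nor the measurement units, so the mapping below (β1 to width, β2 to length, β3 to length × width, following the order in which the predictors are listed) is an assumption, and the function name `leaf_area_model_c` is hypothetical.

```python
# Coefficients reported in the abstract for Model C. The assignment of
# b1, b2, b3 to width, length and length x width is ASSUMED from the order
# in which the predictors are listed; it is not confirmed by the abstract.
MODEL_C_ASSUMED = {"b0": -5.97, "b_width": 0.489, "b_length": 1.11,
                   "b_length_width": 0.351}


def leaf_area_model_c(length, width, coef=MODEL_C_ASSUMED):
    """Predicted leaf area from blade length and width (units as measured)."""
    return (coef["b0"]
            + coef["b_width"] * width
            + coef["b_length"] * length
            + coef["b_length_width"] * length * width)


# Hypothetical blade measurements, for illustration only.
predicted_area = leaf_area_model_c(length=10.0, width=1.0)
```

Passing the coefficients as a dictionary keeps the assumed mapping in one place, so it can be corrected against the full paper without touching the formula.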

Automatic Correction Method of Soft-Sensor Function that Employs Linear Multiple Regression Model for Pulp Bleaching Process

JAPAN TAPPI JOURNAL ◽

10.2524/jtappij.68.1245 ◽

2014 ◽

Vol 68 (11) ◽

pp. 1245-1251

Author(s):

Yoshitatsu Mori

Keyword(s):

Regression Model ◽

Multiple Regression ◽

Correction Method ◽

Multiple Regression Model ◽

Soft Sensor ◽

Automatic Correction ◽

Pulp Bleaching ◽

Bleaching Process ◽

Linear Multiple Regression

Some Conditions Affecting the Utility of Subjectively Weighted Models in Decision Making

Perceptual and Motor Skills ◽

10.2466/pms.1979.49.2.583 ◽

1979 ◽

Vol 49 (2) ◽

pp. 583-590 ◽

Cited By ~ 2

Author(s):

Lars Nystedt ◽

Kevin R. Murphy

Keyword(s):

Decision Making ◽

Regression Model ◽

Multiple Regression ◽

Regression Models ◽

Multiple Regression Model ◽

Linear Multiple Regression ◽

Multiple Regression Models ◽

Highly Correlated ◽

Degree Of Overlap

The accuracy of multiple regression models, models employing subjective weights and models employing relative subjective weights in reproducing judgments was studied. Multiple regression models were most accurate. When subjects were divided into two groups according to the degree of configurality shown in their matrix of subjective weights, striking differences were found in the degree of overlap of the multiple regression models and the models employing subjective weights. In particular, when subjective policies were essentially linear, the predicted judgments produced by these policies were highly correlated with the predicted judgments of the multiple regression models. When subjective policies were highly configural, the subjective models accounted for variance in judgments not accounted for by the linear multiple regression model.

The Improvement of Marine Traffic Survey Using Radar and the Way of its Analysis-II : Estimation of the Vessel's Overall Length by the Linear Multiple Regression Model

The Journal of Japan Institute of Navigation ◽

10.9749/jin.83.57 ◽

1990 ◽

Vol 83 (0) ◽

pp. 57-64

Author(s):

Naoto SATOH ◽

Yasumitsu MIYAZAKI ◽

Yasuo TAKENAKA ◽

Keisuke TSUJI

Keyword(s):

Regression Model ◽

Multiple Regression ◽

Multiple Regression Model ◽

Linear Multiple Regression ◽

Its Analysis ◽

Marine Traffic ◽

The Way

Manipulation of a robot by EMG signals using linear multiple regression model

2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No.04CH37566) ◽

10.1109/iros.2004.1389690 ◽

2005 ◽

Cited By ~ 2

Author(s):

N. Tsujiuchi ◽

K. Takayuki ◽

M. Yoneda

Keyword(s):

Regression Model ◽

Multiple Regression ◽

Multiple Regression Model ◽

Linear Multiple Regression

Motion Estimation from EMG Signals using Linear Multiple Regression Model

The Proceedings of JSME annual Conference on Robotics and Mechatronics (Robomec) ◽

10.1299/jsmermd.2004.79_3 ◽

2004 ◽

Vol 2004 (0) ◽

pp. 79-80

Author(s):

N Tsujiuchi ◽

T Koizumi ◽

M Yoneda

Keyword(s):

Motion Estimation ◽

Regression Model ◽

Multiple Regression ◽

Multiple Regression Model ◽

Linear Multiple Regression

Who's Chatting?: Interplay between Personality and WhatsApp Use

International Journal of Marketing and Business Communication ◽

10.21863/ijmbc/2015.4.4.022 ◽

2015 ◽

Vol 4 (4) ◽

Author(s):

Himanshu Rajput

Keyword(s):

Logistic Regression ◽

Multiple Regression ◽

Big Five ◽

Personality Assessment ◽

The Internet ◽

Big Five Inventory ◽

Significant Relationships ◽

Indian Context ◽

High Penetration ◽

The Relationship

Smartphone-based messaging applications have shown phenomenal growth with the proliferation of the internet coupled with the high penetration of smartphones among the masses. The current study is an attempt to understand the relationship between individuals' personality and their use of WhatsApp, a popular smartphone-based messaging application, in the Indian context. For personality assessment, the study uses the Big Five Inventory. A questionnaire consisting of items on individual WhatsApp use and the Big Five Inventory was administered to students at an Indian university. Multiple regression and logistic regression revealed significant relationships between personality and WhatsApp usage, as well as use of its different inbuilt functions.

The Impact of Work Characteristics on Social Distancing: Implications at the Time of COVID-19

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18105074 ◽

2021 ◽

Vol 18 (10) ◽

pp. 5074

Author(s):

Keisuke Kokubun ◽

Yoshinori Yamakawa

Keyword(s):

Information Processing ◽

Multiple Regression ◽

Explanatory Power ◽

Work Conditions ◽

Multiple Regression Model ◽

Social Characteristics ◽

Work Characteristics ◽

Social Distancing ◽

Using Data ◽

The Impact

The coronavirus disease (COVID-19) continues to spread globally. While social distancing has attracted attention as a measure to prevent the spread of infection, some occupations find it difficult to implement. Therefore, this study aims to investigate the relationship between work characteristics and social distancing using data available on O*NET, an occupational information site. A total of eight factors were extracted by performing an exploratory factor analysis: work conditions, supervisory work, information processing, response to aggression, specialization, autonomy, interaction outside the organization, and interdependence. A multiple regression analysis showed that interdependence, response to aggression, and interaction outside the organization, which are categorized as “social characteristics,” and information processing and specialization, which are categorized as “knowledge characteristics,” were associated with physical proximity. Furthermore, we added customer, which represents contact with the customer, and remote working, which represents a small amount of outdoor activity, to our multiple regression model, and confirmed that they increased the explanatory power of the model. This suggests that those who work under interdependence, face aggression, and engage in outside activities, and/or have frequent contact with customers, little interaction outside the organization, and little information processing will have the most difficulty in maintaining social distancing.
