Lasso and Group Lasso with Categorical Predictors: Impact of Coding Strategy on Variable Selection and Prediction

Mapping Intimacies ◽

10.31234/osf.io/wc45u ◽

2020 ◽

Author(s):

Yihuan Huang ◽

Amanda Kay Montoya

Keyword(s):

Machine Learning ◽

Linear Regression ◽

Variable Selection ◽

Prediction Accuracy ◽

Group Lasso ◽

Categorical Variables ◽

Learning Approaches ◽

Dominant Group ◽

Data Set ◽

Coding Strategy

Machine learning methods are being increasingly adopted in psychological research. Lasso performs variable selection and regularization, and is particularly appealing to psychology researchers because of its connection to linear regression. Researchers conflate properties of linear regression with properties of lasso; however, we demonstrate that this is not the case for models with categorical predictors. Specifically, the coding strategy used for categorical predictors impacts lasso’s performance but not linear regression. Group lasso is an alternative to lasso for models with categorical predictors. We demonstrate the inconsistency of lasso and group lasso models using a real data set: lasso performs different variable selection and has different prediction accuracy depending on the coding strategy, and group lasso performs consistent variable selection but has different prediction accuracy. Additionally, group lasso may include many predictors when very few are needed, leading to overfitting. Using Monte Carlo simulation, we show that categorical variables with one group mean differing from all others (one dominant group) are more likely to be included in the model by group lasso than lasso, leading to overfitting. This effect is strongest when the mean difference is large and there are many categories. Researchers primarily focus on the similarity between linear regression and lasso, but pay little attention to their different properties. This project demonstrates that when using lasso and group lasso, the effect of coding strategies should be considered. We conclude with recommended solutions to this issue and future directions of exploration to improve implementation of machine learning approaches in psychological science.

Download Full-text

Teasing Apart Silvopasture System Components Using Machine Learning for Optimization

Soil Systems ◽

10.3390/soilsystems5030041 ◽

2021 ◽

Vol 5 (3) ◽

pp. 41

Author(s):

Tulsi P. Kharel ◽

Amanda J. Ashworth ◽

Phillip R. Owens ◽

Dirk Philipp ◽

Andrew L. Thomas ◽

...

Keyword(s):

Machine Learning ◽

Linear Regression ◽

Regression Tree ◽

Plant Production ◽

System Level ◽

Learning Approaches ◽

Nutrient Distribution ◽

Wetness Index ◽

Variable Approach ◽

Animal Grazing

Silvopasture systems combine tree and livestock production to minimize market risk and enhance ecological services. Our objective was to explore and develop a method for identifying driving factors linked to productivity in a silvopastoral system using machine learning. A multi-variable approach was used to detect factors that affect system-level output (i.e., plant production (tree and forage), soil factors, and animal response based on grazing preference). Variables from a three-year (2017–2019) grazing study, including forage, tree, soil, and terrain attribute parameters, were analyzed. Hierarchical variable clustering and random forest model selected 10 important variables for each of four major clusters. A stepwise multiple linear regression and regression tree approach was used to predict cattle grazing hours per animal unit (h ha−1 AU−1) using 40 variables (10 per cluster) selected from 130 total variables. Overall, the variable ranking method selected more weighted variables for systems-level analysis. The regression tree performed better than stepwise linear regression for interpreting factor-level effects on animal grazing preference. Cattle were more likely to graze forage on soils with Cd levels <0.04 mg kg−1 (126% greater grazing hours per AU), soil Cr <0.098 mg kg−1 (108%), and a SAGA wetness index of <2.7 (57%). Cattle also preferred grazing (88%) native grasses compared to orchardgrass (Dactylis glomerata L.). The result shows water flow within the landscape position (wetness index), and associated metals distribution may be used as an indicator of animal grazing preference. Overall, soil nutrient distribution patterns drove grazing response, although animal grazing preference was also influenced by aboveground (forage and tree), soil, and landscape attributes. Machine learning approaches helped explain pasture use and overall drivers of grazing preference in a multifunctional system.

Download Full-text

Machine learning identifies an immunological pattern associated with multiple juvenile idiopathic arthritis subtypes

Annals of the Rheumatic Diseases ◽

10.1136/annrheumdis-2018-214354 ◽

2019 ◽

Vol 78 (5) ◽

pp. 617-628 ◽

Cited By ~ 5

Author(s):

Erika Van Nieuwenhove ◽

Vasiliki Lagou ◽

Lien Van Eyck ◽

James Dooley ◽

Ulrich Bodenhofer ◽

...

Keyword(s):

Machine Learning ◽

Juvenile Idiopathic Arthritis ◽

Large Scale ◽

Inflammatory Diseases ◽

Adaptive Immune System ◽

Healthy Children ◽

Learning Approaches ◽

Data Set ◽

Immune Signature ◽

Systemic Jia

ObjectivesJuvenile idiopathic arthritis (JIA) is the most common class of childhood rheumatic diseases, with distinct disease subsets that may have diverging pathophysiological origins. Both adaptive and innate immune processes have been proposed as primary drivers, which may account for the observed clinical heterogeneity, but few high-depth studies have been performed.MethodsHere we profiled the adaptive immune system of 85 patients with JIA and 43 age-matched controls with indepth flow cytometry and machine learning approaches.ResultsImmune profiling identified immunological changes in patients with JIA. This immune signature was shared across a broad spectrum of childhood inflammatory diseases. The immune signature was identified in clinically distinct subsets of JIA, but was accentuated in patients with systemic JIA and those patients with active disease. Despite the extensive overlap in the immunological spectrum exhibited by healthy children and patients with JIA, machine learning analysis of the data set proved capable of discriminating patients with JIA from healthy controls with ~90% accuracy.ConclusionsThese results pave the way for large-scale immune phenotyping longitudinal studies of JIA. The ability to discriminate between patients with JIA and healthy individuals provides proof of principle for the use of machine learning to identify immune signatures that are predictive to treatment response group.

Download Full-text

Application of Machine Learning Techniques to Predict Mechanical Properties for Polyamide 2200 (PA12) in Additive Manufacturing

10.20944/preprints201903.0051.v1 ◽

2019 ◽

Author(s):

Ivanna Baturynska

Keyword(s):

Machine Learning ◽

Mechanical Properties ◽

Additive Manufacturing ◽

Linear Regression ◽

Prediction Accuracy ◽

Regression Models ◽

Tensile Modulus ◽

Machine Learning Techniques ◽

Linear Regression Models ◽

Elongation At Break

Additive manufacturing (AM) is an attractive technology for manufacturing industry due to flexibility in design and functionality, but inconsistency in quality is one of the major limitations that does not allow utilizing this technology for production of end-use parts. Prediction of mechanical properties can be one of the possible ways to improve the repeatability of the results. The part placement, part orientation, and STL model properties (number of mesh triangles, surface, and volume) are used to predict tensile modulus, nominal stress and elongation at break for polyamide 2200 (also known as PA12). EOS P395 polymer powder bed fusion system was used to fabricate 217 specimens in two identical builds (434 specimens in total). Prediction is performed for XYZ, XZY, ZYX, and Angle orientations separately, and all orientations together. The different non-linear models based on machine learning methods have higher prediction accuracy compared with linear regression models. Linear regression models have prediction accuracy higher than 80% only for Tensile Modulus and Elongation at break in Angle orientation. Since orientation-based modeling has low prediction accuracy due to a small number of data points and lack of information about material properties, these models need to be improved in the future based on additional experimental work.

Download Full-text

Comparison of Machine Learning Approaches to Improve Diagnosis of Optic Neuropathy Using Photopic Negative Response Measured Using a Handheld Device

Frontiers in Medicine ◽

10.3389/fmed.2021.771713 ◽

2021 ◽

Vol 8 ◽

Author(s):

Tina Diao ◽

Fareshta Kushzad ◽

Megh D. Patel ◽

Megha P. Bindiganavale ◽

Munam Wasi ◽

...

Keyword(s):

Machine Learning ◽

Time Series ◽

Optic Neuropathy ◽

Negative Response ◽

Photopic Negative Response ◽

Learning Approaches ◽

Data Set ◽

Full Field ◽

Handheld Device ◽

Technical Requirements

The photopic negative response of the full-field electroretinogram (ERG) is reduced in optic neuropathies. However, technical requirements for measurement and poor classification performance have limited widespread clinical application. Recent advances in hardware facilitate efficient clinic-based recording of the full-field ERG. Time series classification, a machine learning approach, may improve classification by using the entire ERG waveform as the input. In this study, full-field ERGs were recorded in 217 eyes (109 optic neuropathy and 108 controls) of 155 subjects. User-defined ERG features including photopic negative response were reduced in optic neuropathy eyes (p < 0.0005, generalized estimating equation models accounting for age). However, classification of optic neuropathy based on user-defined features was only fair with receiver operating characteristic area under the curve ranging between 0.62 and 0.68 and F1 score at the optimal cutoff ranging between 0.30 and 0.33. In comparison, machine learning classifiers using a variety of time series analysis approaches had F1 scores of 0.58–0.76 on a test data set. Time series classifications are promising for improving optic neuropathy diagnosis using ERG waveforms. Larger sample sizes will be important to refine the models.

Download Full-text

Abstract TP458: High Accuracy of Predictive Models for SAH Using Different Machine Learning Approaches

Stroke ◽

10.1161/str.51.suppl_1.tp458 ◽

2020 ◽

Vol 51 (Suppl_1) ◽

Author(s):

Paul Litvak ◽

Jeevan Medikonda ◽

Girish Menon ◽

Pitchaiah Mandava

Keyword(s):

Machine Learning ◽

Predictive Models ◽

Prediction Accuracy ◽

Support Vector ◽

World Federation ◽

Learning Approaches ◽

Flow Models ◽

Multi Stage ◽

Stage 1 ◽

Categorical Scale

Background: Patients suffering from subarachnoid hemorrhage (SAH) have poor long-term outcomes. There are predictive models for ischemic and hemorrhagic stroke. However, there is paucity of models for SAH. Machine learning concepts were applied to build multi-stage Neural Networks (NN), Support Vector Machines (SVM) and Keras/Tensor Flow models to predict SAH outcomes. Methods: A database of ~800 aneurysmal SAH patients from Kasturba Medical College was utilized. Baseline variables of World Federation of Neurosurgeons 5-point scale (WFNS 1-5), age, gender, and presence/absence of hypertension and diabetes were considered in Stage 1. Stage 2 included all Stage 1 variables along with presence/absence of radiologic signs vasospasm and ischemia. Stage 3 includes earlier 2 stages and discharge Glasgow Outcome Scale (GOS 1-5). GOS at 3 months was predicted using 2-layer NN/SVM/Keras-TensorFlow models on the five point categorical scale as well as dichotomized to dead/alive and favorable (GOS 4-5) or unfavorable (GOS 1-3). Prediction accuracy of models was compared to the recorded GOS. Results: Prediction accuracy shown as percentages (See Table) for all three stages was similar for SVM, NN and Keras/TensorFlow models. Accuracy was remarkably higher with dichotomization compared to the complete five point GOS categorical scale. Conclusions: SVM, NN, and Keras-TensorFlow based machine learning models can be used to predict SAH outcomes to a high degree of accuracy. These powerful predictive models can be used to prognosticate and select patients into trials.

Download Full-text

Analysis of Kinase Inhibitors and Druggability of Kinase-Targets Using Machine Learning Techniques

Pattern Discovery Using Sequence Data Mining ◽

10.4018/978-1-61350-056-9.ch009 ◽

2012 ◽

pp. 155-165

Author(s):

S. Prasanthi ◽

S.Durga Bhavani ◽

T. Sobha Rani ◽

Raju S. Bapi

Keyword(s):

Machine Learning ◽

Decision Tree ◽

Kinase Inhibitors ◽

Kinase Inhibitor ◽

Classification Problem ◽

Machine Learning Techniques ◽

Learning Approaches ◽

Decision Tree Classifier ◽

Data Set ◽

Learning Techniques

Vast majority of successful drugs or inhibitors achieve their activity by binding to, and modifying the activity of a protein leading to the concept of druggability. A target protein is druggable if it has the potential to bind the drug-like molecules. Hence kinase inhibitors need to be studied to understand the specificity of a kinase inhibitor in choosing a particular kinase target. In this paper we focus on human kinase drug target sequences since kinases are known to be potential drug targets. Also we do a preliminary analysis of kinase inhibitors in order to study the problem in the protein-ligand space in future. The identification of druggable kinases is treated as a classification problem in which druggable kinases are taken as positive data set and non-druggable kinases are chosen as negative data set. The classification problem is addressed using machine learning techniques like support vector machine (SVM) and decision tree (DT) and using sequence-specific features. One of the challenges of this classification problem is due to the unbalanced data with only 48 druggable kinases available against 509 non-drugggable kinases present at Uniprot. The accuracy of the decision tree classifier obtained is 57.65 which is not satisfactory. A two-tier architecture of decision trees is carefully designed such that recognition on the non-druggable dataset also gets improved. Thus the overall model is shown to achieve a final performance accuracy of 88.37. To the best of our knowledge, kinase druggability prediction using machine learning approaches has not been reported in literature.

Download Full-text

Performance analysis on least absolute shrinkage selection operator, elastic net and correlation adjusted elastic net regression methods

International Journal of Advanced Statistics and Probability ◽

10.14419/ijasp.v3i1.4364 ◽

2015 ◽

Vol 3 (1) ◽

pp. 93

Author(s):

Pascalis Kadaro Matthew ◽

Abubakar Yahaya

Keyword(s):

Linear Regression ◽

Prediction Accuracy ◽

Penalized Regression ◽

Ordinary Least Squares ◽

Complex Model ◽

Elastic Net ◽

Data Set ◽

Regression Methods ◽

Regression Techniques ◽

Selection Operator

<p>Some few decades ago, penalized regression techniques for linear regression have been developed specifically to reduce the flaws inherent in the prediction accuracy of the classical ordinary least squares (OLS) regression technique. In this paper, we used a diabetes data set obtained from previous literature to compare three of these well-known techniques, namely: Least Absolute Shrinkage Selection Operator (LASSO), Elastic Net and Correlation Adjusted Elastic Net (CAEN). After thorough analysis, it was observed that CAEN generated a less complex model.</p>

Download Full-text

To use or not to use: Feature selection for sentiment analysis of highly imbalanced data

Natural Language Engineering ◽

10.1017/s1351324917000298 ◽

2017 ◽

Vol 24 (1) ◽

pp. 3-37 ◽

Cited By ~ 5

Author(s):

SANDRA KÜBLER ◽

CAN LIU ◽

ZEESHAN ALI SAYYED

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Sentiment Analysis ◽

Information Gain ◽

Binary Classification ◽

Small Subset ◽

Large Set ◽

Learning Approaches ◽

Selection Methods ◽

Data Set

AbstractWe investigate feature selection methods for machine learning approaches in sentiment analysis. More specifically, we use data from the cooking platform Epicurious and attempt to predict ratings for recipes based on user reviews. In machine learning approaches to such tasks, it is a common approach to use word or part-of-speech n-grams. This results in a large set of features, out of which only a small subset may be good indicators for the sentiment. One of the questions we investigate concerns the extension of feature selection methods from a binary classification setting to a multi-class problem. We show that an inherently multi-class approach, multi-class information gain, outperforms ensembles of binary methods. We also investigate how to mitigate the effects of extreme skewing in our data set by making our features more robust and by using review and recipe sampling. We show that over-sampling is the best method for boosting performance on the minority classes, but it also results in a severe drop in overall accuracy of at least 6 per cent points.

Download Full-text

Application of Machine Learning Techniques to Predict the Mechanical Properties of Polyamide 2200 (PA12) in Additive Manufacturing

Applied Sciences ◽

10.3390/app9061060 ◽

2019 ◽

Vol 9 (6) ◽

pp. 1060

Author(s):

Ivanna Baturynska

Keyword(s):

Machine Learning ◽

Mechanical Properties ◽

Additive Manufacturing ◽

Linear Regression ◽

Prediction Accuracy ◽

Regression Models ◽

Tensile Modulus ◽

Machine Learning Techniques ◽

Linear Regression Models ◽

Elongation At Break

Additive manufacturing (AM) is an attractive technology for the manufacturing industry due to flexibility in its design and functionality, but inconsistency in quality is one of the major limitations preventing utilizing this technology for the production of end-use parts. The prediction of mechanical properties can be one of the possible ways to improve the repeatability of results. The part placement, part orientation, and STL model properties (number of mesh triangles, surface, and volume) are used to predict tensile modulus, nominal stress, and elongation at break for polyamide 2200 (also known as PA12). An EOS P395 polymer powder bed fusion system was used to fabricate 217 specimens in two identical builds (434 specimens in total). Prediction is performed for XYZ, XZY, ZYX, and Angle orientations separately, and all orientations together. The different non-linear models based on machine learning methods have higher prediction accuracy compared with linear regression models. Linear regression models only have prediction accuracy higher than 80% for Tensile Modulus and Elongation at break in Angle orientation. Since orientation-based modeling has low prediction accuracy due to a small number of data points and lack of information about the material properties, these models need to be improved in the future based on additional experimental work.

Download Full-text

Evaluation of Machine Learning Approaches to Predict Soil Organic Matter and pH Using vis-NIR Spectra

Sensors ◽

10.3390/s19020263 ◽

2019 ◽

Vol 19 (2) ◽

pp. 263 ◽

Cited By ~ 16

Author(s):

Meihua Yang ◽

Dongyun Xu ◽

Songchao Chen ◽

Hongyi Li ◽

Zhou Shi

Keyword(s):

Machine Learning ◽

Organic Matter ◽

Soil Organic Matter ◽

Least Squares ◽

Paddy Soil ◽

Prediction Accuracy ◽

Accurate Determination ◽

Support Vector ◽

Learning Approaches ◽

Lower Yangtze

Soil organic matter (SOM) and pH are essential soil fertility indictors of paddy soil in the middle-lower Yangtze Plain. Rapid, non-destructive and accurate determination of SOM and pH is vital to preventing soil degradation caused by inappropriate land management practices. Visible-near infrared (vis-NIR) spectroscopy with multivariate calibration can be used to effectively estimate soil properties. In this study, 523 soil samples were collected from paddy fields in the Yangtze Plain, China. Four machine learning approaches—partial least squares regression (PLSR), least squares-support vector machines (LS-SVM), extreme learning machines (ELM) and the Cubist regression model (Cubist)—were used to compare the prediction accuracy based on vis-NIR full bands and bands reduced using the genetic algorithm (GA). The coefficient of determination (R2), root mean square error (RMSE), and ratio of performance to inter-quartile distance (RPIQ) were used to assess the prediction accuracy. The ELM with GA reduced bands was the best model for SOM (SOM: R2 = 0.81, RMSE = 5.17, RPIQ = 2.87) and pH (R2 = 0.76, RMSE = 0.43, RPIQ = 2.15). The performance of the LS-SVM for pH prediction did not differ significantly between the model with GA (R2 = 0.75, RMSE = 0.44, RPIQ = 2.08) and without GA (R2 = 0.74, RMSE = 0.45, RPIQ = 2.07). Although a slight increase was observed when ELM were used for prediction of SOM and pH using reduced bands (SOM: R2 = 0.81, RMSE = 5.17, RPIQ = 2.87; pH: R2 = 0.76, RMSE = 0.43, RPIQ = 2.15) compared with full bands (R2 = 0.81, RMSE = 5.18, RPIQ = 2.83; pH: R2 = 0.76, RMSE = 0.45, RPIQ = 2.07), the number of wavelengths was greatly reduced (SOM: 201 to 44; pH: 201 to 32). Thus, the ELM coupled with reduced bands by GA is recommended for prediction of properties of paddy soil (SOM and pH) in the middle-lower Yangtze Plain.

Download Full-text