Predictive and interpretable models via the stacked elastic net

Bioinformatics ◽

10.1093/bioinformatics/btaa535 ◽

2020 ◽

Author(s):

Armin Rauschenberger ◽

Enrico Glaab ◽

Mark van de Wiel

Keyword(s):

Machine Learning ◽

R Package ◽

Elastic Net ◽

Machine Learning Techniques ◽

Supplementary Information ◽

Biomedical Sciences ◽

Molecular Features ◽

Learning Techniques ◽

Meta Learning ◽

Interpretable Models

Abstract. Motivation: Machine learning in the biomedical sciences should ideally provide predictive and interpretable models. When predicting outcomes from clinical or molecular features, applied researchers often want to know which features have effects, whether these effects are positive or negative, and how strong these effects are. Regression analysis includes this information in the coefficients but typically renders less predictive models than more advanced machine learning techniques. Results: Here we propose an interpretable meta-learning approach for high-dimensional regression. The elastic net provides a compromise between estimating weak effects for many features and strong effects for some features. It has a mixing parameter to weight between ridge and lasso regularisation. Instead of selecting one weighting by tuning, we combine multiple weightings by stacking. We do this in a way that increases predictivity without sacrificing interpretability. Availability and implementation: The R package starnet is available on GitHub: https://github.com/rauschenberger/starnet. Supplementary information: Supplementary data are available at Bioinformatics online.
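
As a hedged illustration of the idea in this abstract, the elastic net objective in its common glmnet-style parameterisation and a linear stacking step can be written as below. The symbols (\alpha for the mixing parameter, \lambda for the penalty strength, \omega_k for the stacking weights) and the assumption of a linear meta-learner are introduced here for illustration and are not taken from the abstract; the starnet package may differ in detail.

```latex
% Elastic net for a fixed mixing parameter \alpha \in [0, 1]
% (\alpha = 0 gives ridge, \alpha = 1 gives lasso), common glmnet-style form:
\left( \hat{\beta}_0^{(\alpha)}, \hat{\beta}^{(\alpha)} \right)
  = \arg\min_{\beta_0, \beta}
    \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^\top \beta \right)^2
    + \lambda \left[ \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2
                     + \alpha \lVert \beta \rVert_1 \right]

% Stacking K base models fitted at \alpha_1, \dots, \alpha_K
% with weights \omega_k (assumed linear meta-learner):
\hat{y}(x)
  = \omega_0 + \sum_{k=1}^{K} \omega_k
      \left( \hat{\beta}_0^{(\alpha_k)} + x^\top \hat{\beta}^{(\alpha_k)} \right)
  = \underbrace{\omega_0 + \sum_{k=1}^{K} \omega_k \hat{\beta}_0^{(\alpha_k)}}_{\text{pooled intercept}}
    + x^\top \underbrace{\sum_{k=1}^{K} \omega_k \hat{\beta}^{(\alpha_k)}}_{\text{pooled coefficients}}
```

Under this assumption the stacked prediction is again linear in x, so the pooled coefficients can be read like ordinary regression coefficients, which is consistent with the abstract's claim that interpretability is retained.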

Basic polar and hydrophobic properties are the main characteristics that affect the binding of transcription factors to methylation sites

Bioinformatics ◽

10.1093/bioinformatics/btaa492 ◽

2020 ◽

Vol 36 (15) ◽

pp. 4263-4268

Author(s):

Zijie Shen ◽

Quan Zou

Keyword(s):

Machine Learning ◽

Amino Acids ◽

Transcription Factors ◽

Machine Learning Techniques ◽

Supplementary Information ◽

Analysis Process ◽

Learning Techniques ◽

Key Factor ◽

Learning Analysis

Abstract. Motivation: Methylation and transcription factors (TFs) are part of the mechanisms regulating gene expression. However, the numerous mechanisms regulating the interactions between methylation and TFs remain unknown. We employ machine-learning techniques to discover the characteristics of TFs that bind to methylation sites. Results: The classical machine-learning analysis process focuses on improving the performance of the analysis method. Conversely, we focus on the functional properties of the TF sequences. We obtain the principal properties of TFs, namely, the basic polar and hydrophobic Ile amino acids affecting the interaction between TFs and methylated DNA. The recall of the positive instances is 0.878 when their basic polar value is >0.1743. Both basic polar and hydrophobic Ile amino acids distinguish 74% of TFs bound to methylation sites. Therefore, we infer that basic polar amino acids affect the interactions of TFs with methylation sites. Based on our results, the role of the hydrophobic Ile residue is consistent with that described in previous studies, and the basic polar amino acids may also be a key factor modulating the interactions between TFs and methylation. Supplementary information: Supplementary data are available at Bioinformatics online.

Prediction of Lung Function in Adolescence Using Epigenetic Aging: A Machine Learning Approach

10.20944/preprints202009.0168.v1 ◽

2020 ◽

Author(s):

Md. Adnan Arefeen ◽

Sumaiya Tabassum Nimi ◽

M. Sohel Rahman ◽

S. Hasan Arshad ◽

John W. Holloway ◽

...

Keyword(s):

Machine Learning ◽

Lung Function ◽

Forced Expiratory Volume ◽

Elastic Net ◽

Machine Learning Techniques ◽

Time Points ◽

Learning Techniques ◽

Epigenetic Aging ◽

Epigenetic Age ◽

Lung Health

Epigenetic aging has been found to be associated with a number of phenotypes and diseases. A few studies have investigated its effect on lung function in relatively older people. However, this effect has not been explored in the younger population. This study examines whether lung function in adolescence can be predicted with epigenetic age accelerations (AAs) using machine learning techniques. DNA methylation based AAs were estimated in 326 matched samples at two time points (at 10 years and 18 years) from the Isle of Wight Birth Cohort. Five machine learning regression models (linear, lasso, ridge, elastic net, and Bayesian ridge) were used to predict FEV1 (Forced Expiratory Volume in one second) and FVC (Forced Vital Capacity) at 18 years from feature selected predictor variables (based on mutual information) and AA changes between the two time points. The best models were ridge regression (R2 = 75.21% ± 7.42%; RMSE = 0.3768 ± 0.0653) and elastic net regression (R2 = 75.38% ± 6.98%; RMSE = 0.445 ± 0.069) for FEV1 and FVC, respectively. This study suggests that the application of machine learning in conjunction with tracking changes in AA over the life span can be beneficial to assess the lung health in adolescence.

Prediction of Lung Function in Adolescence Using Epigenetic Aging: A Machine Learning Approach

Methods and Protocols ◽

10.3390/mps3040077 ◽

2020 ◽

Vol 3 (4) ◽

pp. 77

Author(s):

Md Adnan Arefeen ◽

Sumaiya Tabassum Nimi ◽

M. Sohel Rahman ◽

S. Hasan Arshad ◽

John W. Holloway ◽

...

Keyword(s):

Machine Learning ◽

Lung Function ◽

Forced Expiratory Volume ◽

Elastic Net ◽

Machine Learning Techniques ◽

Time Points ◽

Learning Techniques ◽

Epigenetic Aging ◽

Epigenetic Age ◽

Lung Health

Epigenetic aging has been found to be associated with a number of phenotypes and diseases. A few studies have investigated its effect on lung function in relatively older people. However, this effect has not been explored in the younger population. This study examines whether lung function in adolescence can be predicted with epigenetic age accelerations (AAs) using machine learning techniques. DNA methylation based AAs were estimated in 326 matched samples at two time points (at 10 years and 18 years) from the Isle of Wight Birth Cohort. Five machine learning regression models (linear, lasso, ridge, elastic net, and Bayesian ridge) were used to predict FEV1 (forced expiratory volume in one second) and FVC (forced vital capacity) at 18 years from feature selected predictor variables (based on mutual information) and AA changes between the two time points. The best models were ridge regression (R2 = 75.21% ± 7.42%; RMSE = 0.3768 ± 0.0653) and elastic net regression (R2 = 75.38% ± 6.98%; RMSE = 0.445 ± 0.069) for FEV1 and FVC, respectively. This study suggests that the application of machine learning in conjunction with tracking changes in AA over the life span can be beneficial to assess the lung health in adolescence.
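
The two metrics reported in this abstract, R2 and RMSE, can be computed as in the minimal sketch below. The function names and the sample values are hypothetical and chosen only for illustration; they are not data from the study, and the abstract does not say how the authors computed the metrics.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical FEV1 values in litres, for illustration only.
observed = [3.0, 4.0, 5.0]
predicted = [3.0, 4.0, 4.0]

print(f"R2 = {r_squared(observed, predicted):.2f}")
print(f"RMSE = {rmse(observed, predicted):.4f}")
```

R2 is unitless and is often quoted as a percentage, as in the abstract, while RMSE is in the units of the outcome.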

TaxoNN: ensemble of neural networks on stratified microbiome data for disease prediction

Bioinformatics ◽

10.1093/bioinformatics/btaa542 ◽

2020 ◽

Vol 36 (17) ◽

pp. 4544-4550 ◽

Cited By ~ 1

Author(s):

Divya Sharma ◽

Andrew D Paterson ◽

Wei Xu

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Human Microbiome ◽

Machine Learning Techniques ◽

Supplementary Information ◽

Operational Taxonomic Units ◽

Learning Techniques ◽

Conventional Machine ◽

Risk Of Disease ◽

Microbiome Data

Abstract. Motivation: Research supports the potential use of microbiome as a predictor of some diseases. Motivated by the findings that microbiome data is complex in nature, and there is an inherent correlation due to hierarchical taxonomy of microbial Operational Taxonomic Units (OTUs), we propose a novel machine learning method incorporating a stratified approach to group OTUs into phylum clusters. Convolutional Neural Networks (CNNs) were used to train within each of the clusters individually. Further, through an ensemble learning approach, features obtained from each cluster were then concatenated to improve prediction accuracy. Our two-step approach, comprising stratification prior to combining multiple CNNs, aided in capturing the relationships between OTUs sharing a phylum efficiently, as compared to using a single CNN ignoring OTU correlations. Results: We used simulated datasets containing 168 OTUs in 200 cases and 200 controls for model testing. Thirty-two OTUs potentially associated with risk of disease were randomly selected, and interactions between three OTUs were used to introduce non-linearity. We also implemented this novel method in two human microbiome studies, (i) cirrhosis with 118 cases and 114 controls and (ii) type 2 diabetes (T2D) with 170 cases and 174 controls, to demonstrate the model’s effectiveness. Extensive experimentation and comparison against conventional machine learning techniques yielded encouraging results. We obtained mean AUC values of 0.88, 0.92, 0.75, showing a consistent increment (5%, 3%, 7%) in simulations, cirrhosis and T2D data, respectively, against the next best performing method, Random Forest. Availability and implementation: https://github.com/divya031090/TaxoNN_OTU. Supplementary information: Supplementary data are available at Bioinformatics online.

Prediction of depression cases, incidence, and chronicity in a large occupational cohort using machine learning techniques: an analysis of the ELSA-Brasil study

Psychological Medicine ◽

10.1017/s0033291720001579 ◽

2020 ◽

pp. 1-9

Author(s):

Diego Librenza-Garcia ◽

Ives Cavalcante Passos ◽

Jacson Gabriel Feiten ◽

Paulo A. Lotufo ◽

Alessandra C. Goulart ◽

...

Keyword(s):

Machine Learning ◽

Elastic Net ◽

Machine Learning Techniques ◽

Adult Health ◽

Recurrent Depression ◽

Elastic Net Regularization ◽

Diagnosis And Prognosis ◽

Learning Techniques ◽

Occupational Cohort

Abstract. Background: Depression is highly prevalent and marked by a chronic and recurrent course. Despite being a major cause of disability worldwide, little is known regarding the determinants of its heterogeneous course. Machine learning techniques present an opportunity to develop tools to predict diagnosis and prognosis at an individual level. Methods: We examined baseline (2008–2010) and follow-up (2012–2014) data of the Brazilian Longitudinal Study of Adult Health (ELSA-Brasil), a large occupational cohort study. We implemented an elastic net regularization analysis with a 10-fold cross-validation procedure using socioeconomic and clinical factors as predictors to distinguish at follow-up: (1) depressed from non-depressed participants, (2) participants with incident depression from those who did not develop depression, and (3) participants with chronic (persistent or recurrent) depression from those without depression. Results: We assessed 15 105 and 13 922 participants at waves 1 and 2, respectively. The elastic net regularization model distinguished outcome levels in the test dataset with an area under the curve of 0.79 (95% CI 0.76–0.82), 0.71 (95% CI 0.66–0.77), 0.90 (95% CI 0.86–0.95) for analyses 1, 2, and 3, respectively. Conclusions: Diagnosis and prognosis related to depression can be predicted at an individual subject level by integrating low-cost variables, such as demographic and clinical data. Future studies should assess longer follow-up periods and combine biological predictors, such as genetics and blood biomarkers, to build more accurate tools to predict depression course.
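
The area under the curve (AUC) reported in this abstract can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The sketch below computes it by direct pairwise comparison; the function name, labels and scores are hypothetical and are not taken from the ELSA-Brasil analysis.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")

    correct = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical labels (1 = depressed at follow-up) and model scores.
example_labels = [1, 1, 0, 0]
example_scores = [0.9, 0.4, 0.5, 0.1]

print(auc(example_labels, example_scores))
```

The pairwise loop is quadratic in the number of cases, so for a cohort of this size a rank-based implementation would be the more practical choice; the version here favours readability.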

Inflation forecasting in an emerging economy: selecting variables with machine learning algorithms

International Journal of Emerging Markets ◽

10.1108/ijoem-05-2020-0577 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Önder Özgür ◽

Uğur Akkoç

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Elastic Net ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Content Type ◽

Shrinkage Methods ◽

Inflation Forecasting ◽

Turkish Economy ◽

Learning Techniques

Purpose: The main purpose of this study is to forecast inflation rates in the Turkish economy using shrinkage methods from machine learning. Design/methodology/approach: This paper compares the predictive ability of a set of machine learning techniques (ridge, lasso, ada lasso and elastic net) and a group of benchmark specifications (autoregressive integrated moving average (ARIMA) and multivariate vector autoregression (VAR) models) on an extensive dataset. Findings: Results suggest that shrinkage methods perform better for variable selection. Lasso and elastic net algorithms also outperform conventional econometric methods in the case of Turkish inflation. These algorithms choose the energy production variables, construction-sector measure, real effective exchange rate and money market indicators as the most relevant variables for inflation forecasting. Originality/value: The Turkish economy, a typical emerging economy, has experienced a double-digit and highly volatile inflation regime since 2017. This study contributes to the literature by introducing machine learning techniques to forecast inflation in the Turkish economy. It also compares the relative performance of machine learning techniques and conventional methods in predicting Turkish inflation and identifies the empirical methodology with the best predictive performance among them.

Fiscore Package: Effective Protein Structural Data Visualisation and Exploration

10.1101/2021.08.25.457640 ◽

2021 ◽

Author(s):

Auste Kanapeckaite

Keyword(s):

Machine Learning ◽

Protein Function ◽

Mathematical Formulation ◽

Structural Data ◽

R Package ◽

Gaussian Mixture ◽

Structural Features ◽

Machine Learning Techniques ◽

Topological Features ◽

Learning Techniques

The lack of bioinformatics tools to quickly assess protein conformational and topological features motivated the creation of an integrative and user-friendly R package. Moreover, the Fiscore package implements a pipeline for Gaussian mixture modelling, making such machine learning techniques readily accessible to non-experts. This is especially important since probabilistic machine learning techniques can help with a better interpretation of complex biological phenomena when it is necessary to elucidate various structural features that might play a role in protein function. Thus, the Fiscore package builds on the mathematical formulation of protein physicochemical properties that can aid in drug discovery, target evaluation, or relational database building. Moreover, the package provides interactive environments to explore various features of interest. Finally, one of the goals of this package was to engage structural bioinformaticians and develop more R tools that could help researchers not necessarily specialising in this field. The Fiscore package (v.0.1.2) is distributed via CRAN and GitHub.

Noise detection in classification problems

10.5753/ctd.2017.3469 ◽

2017 ◽

Cited By ~ 1

Author(s):

Luís P. F. Garcia ◽

Ana C. Lorena ◽

André C. P. L. F. De Carvalho

Keyword(s):

Machine Learning ◽

Data Quality ◽

Recommendation System ◽

Predictive Performance ◽

Real Data ◽

Machine Learning Techniques ◽

Noise Detection ◽

Classification Problems ◽

Learning Techniques ◽

Meta Learning

Large volumes of data have been produced in many application domains. Nonetheless, when data quality is low, the performance of Machine Learning techniques is harmed. Real data are frequently affected by the presence of noise, which, when used in the training of Machine Learning techniques for predictive tasks, can result in complex models, with high induction time and low predictive performance. Identification and removal of noise can improve data quality and, as a result, the induced model. This thesis proposes new techniques for noise detection and the development of a recommendation system based on meta-learning to recommend the most suitable filter for new tasks. Experiments using artificial and real datasets show the relevance of this research.

Identification of molecular features necessary for selective inhibition of B cell lymphoma proteins using machine learning techniques

Molecular Diversity ◽

10.1007/s11030-018-9856-x ◽

2018 ◽

Vol 23 (1) ◽

pp. 55-73 ◽

Cited By ~ 3

Author(s):

Ahmad Mani-Varnosfaderani ◽

Marzieh Sadat Neiband ◽

Ali Benvidi

Keyword(s):

Machine Learning ◽

B Cell ◽

Cell Lymphoma ◽

B Cell Lymphoma ◽

Selective Inhibition ◽

Machine Learning Techniques ◽

Molecular Features ◽

Learning Techniques

Using machine learning techniques to reduce data annotation time

PsycEXTRA Dataset ◽

10.1037/e577762012-020 ◽

2006 ◽

Author(s):

Christopher Schreiner ◽

Kari Torkkola ◽

Mike Gardner ◽

Keshu Zhang

Keyword(s):

Machine Learning ◽

Machine Learning Techniques ◽

Data Annotation ◽

Learning Techniques
