Incorporating Prior Knowledge into Regularized Regression

Motivation. Genomic features such as gene expression, methylation and genotypes, which are used in statistical modeling of health outcomes, come with a rich set of meta-features such as functional annotations, pathway information and knowledge from previous studies. These meta-features are typically used post hoc to facilitate the interpretation of a model. However, using this meta-feature information a priori rather than post hoc can yield improved prediction performance as well as enhanced model interpretation. Results. We propose a new penalized regression approach that allows a priori integration of external meta-features. The method extends LASSO regression by incorporating individualized penalty parameters for each regression coefficient. The penalty parameters are, in turn, modeled as a log-linear function of the meta-features and are estimated from the data using an approximate empirical Bayes approach. The marginal likelihood on which the empirical Bayes estimation is based is optimized using a fast and stable majorization–minimization procedure. Through simulations, we show that the proposed regression with individualized penalties can outperform the standard LASSO in terms of both parameter estimation and prediction performance when the external data is informative. We further demonstrate our approach with applications to gene expression studies of bone density and breast cancer. Availability and implementation. The methods have been implemented in the R package xtune, freely available for download from https://cran.r-project.org/web/packages/xtune/index.html.
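The idea of individualized penalties can be illustrated with a minimal sketch. It assumes the penalty for coefficient j is exp(z_j · alpha), where z_j is that feature's meta-feature row, and it takes alpha as given rather than estimating it by empirical Bayes. The function names, the toy data and the coordinate-descent solver are illustrative assumptions and are not taken from the xtune package.

```python
import math


def soft_threshold(value, threshold):
    """Soft-thresholding operator used by LASSO coordinate descent."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def penalties_from_meta_features(Z, alpha):
    """Log-linear penalty model (assumed form): lambda_j = exp(z_j . alpha)."""
    return [math.exp(sum(z * a for z, a in zip(row, alpha))) for row in Z]


def weighted_lasso(X, y, penalties, n_iter=200):
    """Minimize (1/2n)||y - X beta||^2 + sum_j penalties[j] * |beta_j|."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # residual = y - X beta, with beta starting at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of feature j with the partial residual.
            rho = sum(
                X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)
            ) / n
            new_beta = soft_threshold(rho, penalties[j]) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


# Toy data (hypothetical): the outcome depends on the first feature only.
X = [[1.0, 0.5], [-1.0, 0.3], [2.0, -0.7], [-2.0, -0.1], [0.5, 1.0], [-0.5, -1.0]]
y = [2.0 * row[0] for row in X]
# Meta-features: an intercept column plus a flag marking feature 2 as
# "annotated as unlikely to matter" (hypothetical annotation).
Z = [[1.0, 0.0], [1.0, 1.0]]
alpha = [math.log(0.01), math.log(1000.0)]  # assumed, not estimated
penalties = penalties_from_meta_features(Z, alpha)
beta = weighted_lasso(X, y, penalties)
```

With a single shared penalty this reduces to the standard LASSO; the meta-features only change how strongly each coefficient is shrunk.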

Predicting gene expression using DNA methylation in three human populations

PeerJ ◽

10.7717/peerj.6757 ◽

2019 ◽

Vol 7 ◽

pp. e6757 ◽

Cited By ~ 8

Author(s):

Huan Zhong ◽

Soyeon Kim ◽

Degui Zhi ◽

Xiangqin Cui

Keyword(s):

Gene Expression ◽

DNA Methylation ◽

Linear Regression ◽

Human Population ◽

Penalized Regression ◽

Superior Performance ◽

Gene Region ◽

Human Populations ◽

Epigenetic Mark ◽

Lasso Regression

Background. DNA methylation, an important epigenetic mark, is well known for its regulatory role in gene expression, especially its negative correlation with expression in the promoter region. However, its correlation with gene expression across the genome at the human population level has not been well studied. In particular, it is unclear whether the genome-wide DNA methylation profile of an individual can predict her/his gene expression profile. Previous studies were mostly limited to association analyses between single CpG site methylation and gene expression. It is not known whether DNA methylation of a gene has enough prediction power to serve as a surrogate for gene expression in existing human study cohorts that have DNA samples but not RNA samples. Results. We examined DNA methylation in the gene region for predicting gene expression across individuals in non-cancer tissues of three human population datasets: adipose tissue from the Multiple Tissue Human Expression Resource Projects (MuTHER), peripheral blood mononuclear cells (PBMC) from asthma and normal control study participants, and lymphoblastoid cell lines (LCL) from healthy individuals. Three prediction models were investigated: single linear regression, multiple linear regression, and least absolute shrinkage and selection operator (LASSO) penalized regression. Our results showed that LASSO regression has superior performance among these methods. However, the prediction power is generally low and varies across datasets. Only 30 and 42 genes were found to have cross-validation R2 greater than 0.3 in the PBMC and adipose datasets, respectively. A substantially larger number of genes (258) were identified in the LCL dataset, which was generated from a more homogeneous cell line sample source. We also found that prediction power is better when no CpG probe is excluded because of cross-hybridization or SNP effects. Conclusion. In our analyses of three populations, DNA methylation of CpG sites in the gene region has limited prediction power for gene expression across individuals with linear regression models. The prediction power potentially varies depending on tissue, cell type, and data source. In our analyses, the combination of LASSO regression and all probes on the methylation array, without excluding any, provides the best prediction of gene expression.
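As a rough illustration of how a cross-validation R2 for one gene might be computed, the sketch below fits the simplest of the three models (single linear regression on one CpG site) inside each fold and scores the held-out predictions. The abstract does not state how R2 was defined or how folds were formed, so the 1 - SSE/SST definition, the interleaved folds and the toy data are all assumptions.

```python
def fit_simple_linear(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y - slope * mean_x, slope


def cross_validated_r2(x, y, k=5):
    """R2 of out-of-fold predictions, defined here as 1 - SSE / SST."""
    n = len(x)
    predictions = [0.0] * n
    for fold in range(k):
        # Interleaved folds keep the sketch deterministic; real analyses
        # would usually shuffle individuals before splitting.
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        intercept, slope = fit_simple_linear(
            [x[i] for i in train_idx], [y[i] for i in train_idx]
        )
        for i in test_idx:
            predictions[i] = intercept + slope * x[i]
    mean_y = sum(y) / n
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst


# Hypothetical data: methylation level of one CpG site and the expression
# of one gene across 20 individuals, with small deterministic noise.
methylation = [0.1 * i for i in range(20)]
expression = [
    5.0 - 2.0 * m + 0.05 * ((7 * i) % 5 - 2) for i, m in enumerate(methylation)
]
r2 = cross_validated_r2(methylation, expression, k=5)
```

Under this definition a gene would count as predictable at the 0.3 threshold when `r2 > 0.3`; the value can be negative when the model predicts worse than the overall mean.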

Predicting gene expression using DNA methylation in two human populations

10.7287/peerj.preprints.27055v1 ◽

2018 ◽

Author(s):

Huan Zhong ◽

Soyeon Kim ◽

Degui Zhi ◽

Xiangqin Cui

Keyword(s):

Gene Expression ◽

DNA Methylation ◽

Linear Regression ◽

Penalized Regression ◽

Superior Performance ◽

Gene Region ◽

Human Populations ◽

Epigenetic Mark ◽

Lasso Regression ◽

CpG Sites

Background. DNA methylation, an important epigenetic mark, is well known for its regulatory role in gene expression, especially negative regulation in the promoter region. However, its correlation with gene expression at the population level has not been well studied. In particular, it is unclear whether the genome-wide DNA methylation profile of an individual can predict her/his gene expression profile. Previous studies were mostly limited to association analyses between single CpG site methylation and gene expression. It is not known whether DNA methylation of a gene has enough prediction power to serve as a surrogate for gene expression in existing human study cohorts that have DNA samples but not RNA samples. Results. We studied two human population datasets, adipose tissue from the Multiple Tissue Human Expression Resource Projects (MuTHER) and peripheral blood mononuclear cells (PBMC) from asthma patients and normal controls, for predicting gene expression using methylation of all CpG sites in the gene region. Three prediction models were investigated: single linear regression, multiple linear regression, and least absolute shrinkage and selection operator (LASSO) penalized regression. Our results showed that LASSO regression has superior performance among these methods. However, even with LASSO regression, a very small prediction R2 was obtained for the majority of genes, and only about one thousand genes had a prediction R2 greater than 0.1. GO term and pathway analyses of these more predictable genes showed that they are enriched for immune and defense genes. Conclusion. In human populations, DNA methylation of CpG sites in the gene region has weak prediction power for gene expression. The relatively more predictable genes tend to be defense and immune genes.

THE CONVERGENCE RATES OF EMPIRICAL BAYES ESTIMATION FOR PARAMETERS OF TWO-SIDED TRUNCATION DISTRIBUTION FAMILIES

Acta Mathematica Scientia ◽

10.1016/s0252-9602(18)30366-7 ◽

1989 ◽

Vol 9 (4) ◽

pp. 403-413

Author(s):

Laisheng Wei

Keyword(s):

Empirical Bayes ◽

Convergence Rates ◽

Bayes Estimation ◽

Empirical Bayes Estimation

A Note on Bayes Empirical Bayes Estimation by Means of Dirichlet Processes.

10.21236/ada170039 ◽

1985 ◽

Author(s):

Lynn Kuo

Keyword(s):

Empirical Bayes ◽

Bayes Estimation ◽

Dirichlet Processes ◽

Empirical Bayes Estimation

Empirical Bayes Estimation With Kernel Sequence Method

10.21236/ada396449 ◽

2001 ◽

Author(s):

Shanti S. Gupta ◽

Jinjun Lu

Keyword(s):

Empirical Bayes ◽

Bayes Estimation ◽

Empirical Bayes Estimation ◽

Sequence Method

Modeling In-Match Sports Dynamics Using the Evolving Probability Method

Applied Sciences ◽

10.3390/app11104429 ◽

2021 ◽

Vol 11 (10) ◽

pp. 4429

Author(s):

Ana Šarčević ◽

Damir Pintar ◽

Mihaela Vranić ◽

Ante Gojsalić

Keyword(s):

Monte Carlo Simulation ◽

Statistical Models ◽

Empirical Bayes ◽

Hybrid Approach ◽

Real Life ◽

Bayes Estimation ◽

Probability Method ◽

Specific Nature ◽

Sport Event ◽

Insight Into

The prediction of sport event results has always drawn attention from a wide variety of groups, such as club managers, coaches, betting companies, and the general population. The specific nature of each sport plays an important role in the adaptation of predictive techniques founded on different mathematical and statistical models. This paper challenges a common approach to modeling sports with a strongly defined structure and a rigid scoring system, namely the assumption that points are independent and identically distributed. It is demonstrated that such models can be improved by introducing dynamics into the match models in the form of sport momentums. Formal mathematical models for implementing these momentums, based on conditional probability and empirical Bayes estimation, are proposed and ultimately combined through a unifying hybrid approach based on Monte Carlo simulation. Finally, the method is applied to real-life volleyball data, demonstrating noticeable improvements over previous approaches in predicting match outcomes. The method can be implemented in an expert system to obtain insight into the performance of players at different stages of the match or to study field scenarios that may arise under different circumstances.
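For orientation, the baseline that the paper challenges can be sketched as a Monte Carlo simulation in which every point is won by team A with the same fixed probability, independently of earlier points. The abstract does not specify the momentum models, so none is implemented here. The scoring rule (first to 25 points with a two-point lead), the function names and the parameter values are assumptions, and serving effects are ignored.

```python
import random


def simulate_set(p_point, rng, target=25):
    """Simulate one set with i.i.d. points; return True if team A wins."""
    score_a = score_b = 0
    while True:
        if rng.random() < p_point:
            score_a += 1
        else:
            score_b += 1
        # A set ends once a team reaches the target with a two-point lead.
        if score_a >= target and score_a - score_b >= 2:
            return True
        if score_b >= target and score_b - score_a >= 2:
            return False


def estimate_set_win_probability(p_point, n_sims=5000, seed=0):
    """Monte Carlo estimate of the probability that team A wins a set."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    wins = sum(simulate_set(p_point, rng) for _ in range(n_sims))
    return wins / n_sims


even_match = estimate_set_win_probability(0.50)
slight_edge = estimate_set_win_probability(0.55)
```

A momentum model would replace the constant `p_point` with a probability that depends on the recent history of the set, which is the kind of dynamics the abstract describes adding.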

An Algorithm for Nonparametric Estimation of a Multivariate Mixing Distribution with Applications to Population Pharmacokinetics

Pharmaceutics ◽

10.3390/pharmaceutics13010042 ◽

2020 ◽

Vol 13 (1) ◽

pp. 42

Author(s):

Walter M. Yamada ◽

Michael N. Neely ◽

Jay Bartroff ◽

David S. Bayard ◽

James V. Burke ◽

...

Keyword(s):

Drug Development ◽

Population Pharmacokinetics ◽

Empirical Bayes ◽

Applied Mathematics ◽

Grid Method ◽

Bayes Estimation ◽

Population Pharmacokinetic ◽

Support Points ◽

Primal Dual ◽

Log Normal

Population pharmacokinetic (PK) modeling has become a cornerstone of drug development and optimal patient dosing. This approach offers great benefits for datasets with sparse sampling, such as in pediatric patients, and can describe between-patient variability. While most current algorithms assume normal or log-normal distributions for PK parameters, we present a mathematically consistent nonparametric maximum likelihood (NPML) method for estimating multivariate mixing distributions without any assumption about the shape of the distribution. This approach can handle distributions of any shape for all PK parameters. Convexity theory shows that the NPML estimator is discrete, meaning that it has a finite number of points with nonzero probability. In fact, there are at most N such points, where N is the number of observed subjects. The original infinite-dimensional NPML problem then becomes the finite-dimensional problem of finding the locations and probabilities of the support points. In the simplest case, each point essentially represents the set of PK parameters for one patient. The probabilities of the points are found by a primal-dual interior-point method; the locations of the support points are found by an adaptive grid method. Our method is able to handle high-dimensional and complex multivariate mixture models. An important application to population pharmacokinetics is discussed and a nontrivial example is treated. Our algorithm has been successfully applied in hundreds of published pharmacometric studies. In addition to population pharmacokinetics, this research also applies to empirical Bayes estimation and many other areas of applied mathematics. This approach thus presents an important addition to the pharmacometric toolbox for drug development and optimal patient dosing.
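The finite-dimensional problem of weighting the support points can be illustrated in one dimension. The sketch below is not the authors' algorithm: it keeps the grid fixed instead of adapting it, and it swaps the primal-dual interior-point method for a plain EM update of the weights, which targets the same likelihood. The Gaussian error model, the grid, the data and all names are assumptions.

```python
import math


def gaussian_likelihood(observation, support_point, sigma):
    """Likelihood of one observation given one candidate parameter value."""
    z = (observation - support_point) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def npml_weights_em(observations, grid, sigma, n_iter=200):
    """Estimate mixing weights on a fixed grid by EM (stand-in solver)."""
    n, k = len(observations), len(grid)
    lik = [[gaussian_likelihood(y, g, sigma) for g in grid] for y in observations]
    weights = [1.0 / k] * k  # start from the uniform distribution
    for _ in range(n_iter):
        new_weights = [0.0] * k
        for i in range(n):
            # Mixture density of observation i under the current weights.
            # Assumes at least one grid point gives a nonzero likelihood.
            mixture = sum(w * l for w, l in zip(weights, lik[i]))
            for j in range(k):
                # Posterior probability that subject i belongs to point j.
                new_weights[j] += weights[j] * lik[i][j] / mixture
        weights = [w / n for w in new_weights]
    return weights


# Hypothetical data: six subjects whose parameter clusters around -2 and +2.
observations = [-2.1, -1.9, -2.0, 2.0, 2.1, 1.9]
grid = [-2.0, 0.0, 2.0]
weights = npml_weights_em(observations, grid, sigma=0.5)
```

In this toy case the weight on the middle grid point shrinks toward zero, which mirrors the statement that the estimator places probability on only a finite set of support points.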

The landscape of gene co-expression modules correlating with prognostic genetic abnormalities in AML

Journal of Translational Medicine ◽

10.1186/s12967-021-02914-2 ◽

2021 ◽

Vol 19 (1) ◽

Author(s):

Chao Guo ◽

Ya-yue Gao ◽

Qian-qian Ju ◽

Chun-xia Zhang ◽

Ming Gong ◽

...

Keyword(s):

Regression Analysis ◽

Prediction Model ◽

HOX Genes ◽

Expression Profiles ◽

Penalized Regression ◽

Diagnostic Utility ◽

Lasso Regression ◽

Hub Genes ◽

NPM1 Mutation ◽

Genetic Abnormalities

Background. AML patients harbor heterogeneous cytogenetic and molecular variations, some of which are related to AML pathogenesis and clinical outcomes. We aimed to uncover the intrinsic expression profiles correlating with prognostic genetic abnormalities by WGCNA. Methods. We downloaded the clinical and expression datasets from the BeatAML, TCGA and GEO databases. Using R (version 4.0.2) and the ‘WGCNA’ package, we identified the co-expression modules correlating with the ELN2017 prognostic markers (R2 ≥ 0.4, p < 0.01). ORA detected the enriched pathways for the key co-expression modules. The patients in the TCGA cohort were randomly assigned to a training set (50%) and a testing set (50%). LASSO penalized regression analysis was employed to build the prediction model, fitting OS to the expression level of hub genes with the ‘glmnet’ package. The testing set and 2 independent validation sets (GSE12417 and GSE37642) were then used to validate the diagnostic utility and accuracy of the model. Results. A total of 37 gene co-expression modules and 973 hub genes were identified for the BeatAML cohort. We found that 3 modules were significantly correlated with genetic markers (the ‘lightyellow’ module for NPM1 mutation, the ‘saddlebrown’ module for RUNX1 mutation, and the ‘lightgreen’ module for TP53 mutation). ORA revealed that the ‘lightyellow’ module was mainly enriched in DNA-binding transcription factor activity and activation of HOX genes. The ‘saddlebrown’ module was enriched in immune response processes, and the ‘lightgreen’ module was predominantly enriched in mitotic cell cycle processes. The LASSO regression analysis identified 6 genes (NFKB2, NEK9, HOXA7, APRC5L, FAM30A and LOC105371592) with non-zero coefficients. The risk score generated from the 6-gene model was associated with ELN2017 risk stratification, relapsed disease, and prior MDS history. The 5-year AUC for the model was 0.822 and 0.824 in the training and testing sets, respectively. Moreover, the diagnostic utility of the model was robust when it was applied to the 2 validation sets (5-year AUC 0.743–0.79). Conclusions. We established the co-expression network signature correlated with the ELN2017 recommended prognostic genetic abnormalities in AML. The 6-gene prediction model for AML survival was developed and validated in multiple datasets.
