Multi-kernel linear mixed model with adaptive lasso for prediction analysis on high-dimensional multi-omics data

2019 ◽  
Vol 36 (6) ◽  
pp. 1785-1794
Author(s):  
Jun Li ◽  
Qing Lu ◽  
Yalu Wen

Abstract Motivation The use of human genome discoveries and other established factors to build an accurate risk prediction model is an essential step toward precision medicine. While multi-layer high-dimensional omics data provide unprecedented data resources for prediction studies, their corresponding analytical methods are much less developed. Results We present a multi-kernel penalized linear mixed model with adaptive lasso (MKpLMM), a predictive modeling framework that extends the standard linear mixed models widely used in genomic risk prediction, for multi-omics data analysis. MKpLMM can capture not only the predictive effects from each layer of omics data but also their interactions by using multiple kernel functions. It adopts a data-driven approach to select predictive regions as well as predictive layers of omics data, and achieves robust selection performance. Through extensive simulation studies, the analyses of PET-imaging outcomes from the Alzheimer’s Disease Neuroimaging Initiative study, and the analyses of 64 drug responses, we demonstrate that MKpLMM consistently outperforms competing methods in phenotype prediction. Availability and implementation The R-package is available at https://github.com/YaluWen/OmicPred. Supplementary information Supplementary data are available at Bioinformatics online.
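The multi-kernel idea in the abstract above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MKpLMM implementation: the linear kernel per omics layer, the element-wise (Hadamard) product as the interaction kernel, and the fixed weights are all choices made here for illustration, and the function names are hypothetical. In a penalized model the weights would be estimated from data, with some of them shrunk to zero.

```python
from typing import List

Matrix = List[List[float]]


def linear_kernel(Z: Matrix) -> Matrix:
    """Sample-by-sample similarity K = Z Z^T / p for one omics layer."""
    p = len(Z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / p for zj in Z] for zi in Z]


def hadamard(K1: Matrix, K2: Matrix) -> Matrix:
    """Element-wise product, used here as an assumed interaction kernel."""
    return [[a * b for a, b in zip(r1, r2)] for r1, r2 in zip(K1, K2)]


def combine_kernels(kernels: List[Matrix], weights: List[float]) -> Matrix:
    """Weighted sum of kernels; a zero weight drops that kernel entirely."""
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]


# Two toy omics layers measured on the same two samples (hypothetical data).
layer_a = [[1.0, 0.0], [0.0, 1.0]]
layer_b = [[1.0, 1.0], [1.0, -1.0]]

K_a = linear_kernel(layer_a)
K_b = linear_kernel(layer_b)
K_ab = hadamard(K_a, K_b)

# Fixed illustrative weights; a data-driven method would estimate these.
K_total = combine_kernels([K_a, K_b, K_ab], [1.0, 2.0, 4.0])
```

Setting a weight to zero removes the corresponding layer or interaction from the combined covariance, which is one way to picture how selecting predictive layers could work.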

Author(s):  
Yang Hai ◽  
Yalu Wen

Abstract Motivation Accurate disease risk prediction is essential for precision medicine. Existing models assume either that diseases are caused by groups of predictors with small-to-moderate effects or that they are caused by a few isolated predictors with large effects. Their performance can be sensitive to the underlying disease mechanisms, which are usually unknown in advance. Results We developed a Bayesian linear mixed model (BLMM), where genetic effects were modelled using a hybrid of the sparsity regression and linear mixed model with multiple random effects. The parameters in BLMM were inferred through a computationally efficient variational Bayes algorithm. The proposed method can approximate the shape of the true effect size distributions, capture the predictive effects from both common and rare variants, and remain robust against various disease models. Through extensive simulations and the application to a whole-genome sequencing dataset obtained from the Alzheimer’s Disease Neuroimaging Initiative, we have demonstrated that BLMM has better prediction performance than existing methods and can detect variables and/or genetic regions that are predictive. Availability The R-package is available at https://github.com/yhai943/BLMM. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Author(s):  
Wikum Dinalankara ◽  
Qian Ke ◽  
Donald Geman ◽  
Luigi Marchionni

Abstract Given the ever-increasing amount of high-dimensional and complex omics data becoming available, it is increasingly important to discover simple but effective methods of analysis. Divergence analysis transforms each entry of a high-dimensional omics profile into a digitized (binary or ternary) code based on the deviation of the entry from a given baseline population. This is a novel framework that is significantly different from existing omics data analysis methods: it allows digitization of continuous omics data at the univariate or multivariate level, facilitates sample level analysis, and is applicable on many different omics platforms. The divergence package, available on the R platform through the Bioconductor repository collection, provides easy-to-use functions for carrying out this transformation. Here we demonstrate how to use the package with sample high throughput sequencing data from the Cancer Genome Atlas.


PLoS ONE ◽  
2021 ◽  
Vol 16 (4) ◽  
pp. e0249002
Author(s):  
Wikum Dinalankara ◽  
Qian Ke ◽  
Donald Geman ◽  
Luigi Marchionni

Given the ever-increasing amount of high-dimensional and complex omics data becoming available, it is increasingly important to discover simple but effective methods of analysis. Divergence analysis transforms each entry of a high-dimensional omics profile into a digitized (binary or ternary) code based on the deviation of the entry from a given baseline population. This is a novel framework that is significantly different from existing omics data analysis methods: it allows digitization of continuous omics data at the univariate or multivariate level, facilitates sample level analysis, and is applicable on many different omics platforms. The divergence package, available on the R platform through the Bioconductor repository collection, provides easy-to-use functions for carrying out this transformation. Here we demonstrate how to use the package with data from the Cancer Genome Atlas.
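As a rough illustration of the digitization idea described above, the following Python sketch codes each feature of a sample as -1, 0, or +1 according to whether it falls below, inside, or above an interval estimated from the baseline population. The quantile-based interval and the names `baseline_interval` and `ternary_code` are assumptions made for illustration; the divergence package itself is written in R and may estimate the baseline range differently.

```python
import math
from typing import List, Sequence, Tuple


def baseline_interval(values: Sequence[float], lower: float = 0.05,
                      upper: float = 0.95) -> Tuple[float, float]:
    """Empirical (lower, upper) quantile interval for one feature.

    Assumption: a simple order-statistic quantile stands in for whatever
    baseline-support estimate a real implementation would use.
    """
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[math.floor(lower * last)], ordered[math.ceil(upper * last)]


def ternary_code(sample: Sequence[float], baseline: Sequence[Sequence[float]],
                 lower: float = 0.05, upper: float = 0.95) -> List[int]:
    """Code each feature as -1 (below), 0 (within), or +1 (above) baseline."""
    codes = []
    for j, x in enumerate(sample):
        lo, hi = baseline_interval([row[j] for row in baseline], lower, upper)
        codes.append(-1 if x < lo else (1 if x > hi else 0))
    return codes


# Hypothetical baseline population: three samples, two features.
baseline = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
divergent = ternary_code([0.5, 35.0], baseline, lower=0.0, upper=1.0)
typical = ternary_code([2.0, 20.0], baseline, lower=0.0, upper=1.0)
```

A binary code follows by taking the absolute value of each entry, so that a feature is marked only as divergent or not divergent from the baseline.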


Forests ◽  
2019 ◽  
Vol 10 (3) ◽  
pp. 254 ◽  
Author(s):  
Omar Cabrera ◽  
Andreas Fries ◽  
Patrick Hildebrandt ◽  
Sven Günter ◽  
Reinhard Mosandl

Research Highlights: This study determined that the treatment “release from competitors” causes different reactions in selected timber species with respect to diametrical growth, in which the initial size of the tree (diametric class) is important. Also, the growth habit and phenological traits (defoliation) of the species must be considered, which may have an influence on growth after release. Background and Objectives: The objective of the study was to analyze the diametric growth of nine timber species after their release to answer the following questions: (i) Can the diametric growth of the selected timber species be increased by release? (ii) Does the release cause different responses among the tree species? (iii) Are other factors important, such as the initial diameter at breast height (DBH) or the general climate conditions? Materials and Methods: Four-hundred and eighty-eight trees belonging to nine timber species were selected and monitored over a three-year period. Release was applied to 197 trees, whereas 251 trees served as control trees to evaluate the response of diametrical growth. To determine the response of the trees, a linear mixed model (GLMM, R package: LMER4) was used, which was adjusted by a one-way ANOVA test. Results: All species showed a similar annual cycle with respect to diametric increases, which is due to the per-humid climate in the area. Precipitation is secondary for the diametric growth because sufficient rainfall occurs throughout the year. What is more important are variations in temperature. The species, however, responded differently to release, because the initial DBH and growth habit are more important factors. Therefore, the species could be classified into three specific groups: positive, negative and no response to release. Conclusions: Species which prefer open sites responded positively to release, while shade tolerant species and species with pronounced phenological traits responded negatively. The initial DBH was also an important factor for diametric increases, because trees of class I (20 cm to 30 cm DBH) responded positively to the treatment, whereas for bigger or older individuals, the differences decreased or became negative.


2019 ◽  
Vol 35 (17) ◽  
pp. 3143-3145
Author(s):  
Kevin Matlock ◽  
Raziur Rahman ◽  
Souparno Ghosh ◽  
Ranadip Pal

Abstract Summary Biological processes are characterized by a variety of different genomic feature sets. However, when building models, portions of these features are often missing for a subset of the dataset. We provide a modeling framework to effectively integrate this type of heterogeneous data to improve prediction accuracy. To test our methodology, we have stacked data from the Cancer Cell Line Encyclopedia to increase the accuracy of drug sensitivity prediction. The package addresses the dynamic regime of information integration involving sequential addition of features and samples. Availability and implementation The framework has been implemented as an R package Sstack, which can be downloaded from https://cran.r-project.org/web/packages/Sstack/index.html, where further explanation of the package is available. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 36 (3) ◽  
pp. 842-850 ◽  
Author(s):  
Cheng Peng ◽  
Jun Wang ◽  
Isaac Asante ◽  
Stan Louie ◽  
Ran Jin ◽  
...  

Abstract Motivation Epidemiologic, clinical and translational studies are increasingly generating multiplatform omics data. Methods that can integrate across multiple high-dimensional data types while accounting for differential patterns are critical for uncovering novel associations and underlying relevant subgroups. Results We propose an integrative model to estimate latent unknown clusters (LUCID) aiming to both distinguish unique genomic, exposure and informative biomarkers/omic effects while jointly estimating subgroups relevant to the outcome of interest. Simulation studies indicate that we can obtain consistent estimates reflective of the true simulated values, accurately estimate subgroups and recapitulate subgroup-specific effects. We also demonstrate the use of the integrated model for future prediction of risk subgroups and phenotypes. We apply this approach to two real data applications to highlight the integration of genomic, exposure and metabolomic data. Availability and Implementation The LUCID method is implemented through the LUCIDus R package available on CRAN (https://CRAN.R-project.org/package=LUCIDus). Supplementary information Supplementary materials are available at Bioinformatics online.


2019 ◽  
Vol 35 (23) ◽  
pp. 4879-4885 ◽  
Author(s):  
Chao Ning ◽  
Dan Wang ◽  
Lei Zhou ◽  
Julong Wei ◽  
Yuanxin Liu ◽  
...  

Abstract Motivation Current dynamic phenotyping systems introduce time as an extra dimension to genome-wide association studies (GWAS), which helps to explore the mechanism of dynamic genetic control for complex longitudinal traits. However, existing methods for longitudinal GWAS either ignore the covariance among observations of different time points or encounter computational efficiency issues. Results We herein developed efficient genome-wide multivariate association algorithms for longitudinal data. In contrast to existing univariate linear mixed model analyses, the proposed method has improved statistical power for association detection and computational speed. In addition, the new method can analyze unbalanced longitudinal data with thousands of individuals and more than ten thousand records within a few hours. The corresponding time for balanced longitudinal data is just a few minutes. Availability and implementation A software package to implement the efficient algorithm named GMA (https://github.com/chaoning/GMA) is available freely for interested users in relevant fields. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 35 (24) ◽  
pp. 5182-5190 ◽  
Author(s):  
Luis G Leal ◽  
Alessia David ◽  
Marjo-Riita Jarvelin ◽  
Sylvain Sebert ◽  
Minna Männikkö ◽  
...  

Abstract Motivation Integration of different omics data could markedly help to identify biological signatures, understand the missing heritability of complex diseases and ultimately achieve personalized medicine. Standard regression models used in Genome-Wide Association Studies (GWAS) identify loci with a strong effect size, whereas GWAS meta-analyses are often needed to capture weak loci contributing to the missing heritability. Development of novel machine learning algorithms for merging genotype data with other omics data is highly needed as it could enhance the prioritization of weak loci. Results We developed cNMTF (corrected non-negative matrix tri-factorization), an integrative algorithm based on clustering techniques of biological data. This method assesses the inter-relatedness between genotypes, phenotypes, the damaging effect of the variants and gene networks in order to identify loci-trait associations. cNMTF was used to prioritize genes associated with lipid traits in two population cohorts. We replicated 129 genes reported in GWAS world-wide and provided evidence that supports 85% of our findings (226 out of 265 genes), including recent associations in literature (NLGN1), regulators of lipid metabolism (DAB1) and pleiotropic genes for lipid traits (CARM1). Moreover, cNMTF performed efficiently against strong population structures by accounting for the individuals’ ancestry. As the method is flexible in the incorporation of diverse omics data sources, it can be easily adapted to the user’s research needs. Availability and implementation An R package (cnmtf) is available at https://lgl15.github.io/cnmtf_web/index.html. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 36 (Supplement_1) ◽  
pp. i128-i135
Author(s):  
Rui Zhu ◽  
Chao Jiang ◽  
Xiaofeng Wang ◽  
Shuang Wang ◽  
Hao Zheng ◽  
...  

Abstract Motivation The generalized linear mixed model (GLMM) is an extension of the generalized linear model (GLM) in which the linear predictor takes random effects into account. Given its power of precisely modeling the mixed effects from multiple sources of random variations, the method has been widely used in biomedical computation, for instance in the genome-wide association studies (GWASs) that aim to detect genetic variance significantly associated with phenotypes such as human diseases. Collaborative GWAS on large cohorts of patients across multiple institutions is often impeded by the privacy concerns of sharing personal genomic and other health data. To address such concerns, we present in this paper a privacy-preserving Expectation–Maximization (EM) algorithm to build GLMM collaboratively when input data are distributed to multiple participating parties and cannot be transferred to a central server. We assume that the data are horizontally partitioned among participating parties: i.e. each party holds a subset of records (including observational values of fixed effect variables and their corresponding outcome), and for all records, the outcome is regulated by the same set of known fixed effects and random effects. Results Our collaborative EM algorithm is mathematically equivalent to the original EM algorithm commonly used in GLMM construction. The algorithm also runs efficiently when tested on simulated and real human genomic data, and thus can be practically used for privacy-preserving GLMM construction. We implemented the algorithm for collaborative GLMM (cGLMM) construction in R. The data communication was implemented using the rsocket package. Availability and implementation The software is released in open source at https://github.com/huthvincent/cGLMM. Supplementary information Supplementary data are available at Bioinformatics online.
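The horizontal-partitioning setting can be illustrated with a much simpler model than a GLMM. The sketch below is not the authors' EM algorithm and includes no privacy protection; it only shows why exchanging per-party summary statistics can reproduce a pooled fit. For ordinary least squares, the quantities X^T X and X^T y computed locally by each party add up to the pooled ones, so the combined estimate equals the estimate from the pooled records. All function names and data are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def local_stats(X: Matrix, y: Vector) -> Tuple[Matrix, Vector]:
    """Summary statistics (X^T X, X^T y) computed by one party on its rows."""
    p = len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(p)]
    return xtx, xty


def combine(stats: List[Tuple[Matrix, Vector]]) -> Tuple[Matrix, Vector]:
    """Add the parties' statistics; no individual-level record is shared."""
    p = len(stats[0][1])
    xtx = [[sum(s[0][a][b] for s in stats) for b in range(p)] for a in range(p)]
    xty = [sum(s[1][a] for s in stats) for a in range(p)]
    return xtx, xty


def solve(A: Matrix, b: Vector) -> Vector:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


# Two parties hold different records for the same variables (intercept, x).
X_a, y_a = [[1.0, 0.0], [1.0, 1.0]], [1.0, 3.0]
X_b, y_b = [[1.0, 2.0], [1.0, 3.0]], [5.0, 7.0]

xtx, xty = combine([local_stats(X_a, y_a), local_stats(X_b, y_b)])
beta_collaborative = solve(xtx, xty)

# Reference fit on the pooled records, which the parties would never form.
pooled_xtx, pooled_xty = local_stats(X_a + X_b, y_a + y_b)
beta_pooled = solve(pooled_xtx, pooled_xty)
```

In this toy case the outcome follows y = 1 + 2x exactly, so both estimates recover the intercept and slope. A real privacy-preserving protocol would additionally protect the exchanged statistics.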

