scholarly journals A latent unknown clustering integrating multi-omics data (LUCID) with phenotypic traits

2019 ◽  
Vol 36 (3) ◽  
pp. 842-850 ◽  
Author(s):  
Cheng Peng ◽  
Jun Wang ◽  
Isaac Asante ◽  
Stan Louie ◽  
Ran Jin ◽  
...  

Abstract Motivation Epidemiologic, clinical and translational studies are increasingly generating multiplatform omics data. Methods that can integrate across multiple high-dimensional data types while accounting for differential patterns are critical for uncovering novel associations and underlying relevant subgroups. Results We propose an integrative model to estimate latent unknown clusters (LUCID) aiming to both distinguish unique genomic, exposure and informative biomarkers/omic effects while jointly estimating subgroups relevant to the outcome of interest. Simulation studies indicate that we can obtain consistent estimates reflective of the true simulated values, accurately estimate subgroups and recapitulate subgroup-specific effects. We also demonstrate the use of the integrated model for future prediction of risk subgroups and phenotypes. We apply this approach to two real data applications to highlight the integration of genomic, exposure and metabolomic data. Availability and Implementation The LUCID method is implemented through the LUCIDus R package available on CRAN (https://CRAN.R-project.org/package=LUCIDus). Supplementary information Supplementary materials are available at Bioinformatics online.

2019 ◽  
Vol 35 (21) ◽  
pp. 4336-4343 ◽  
Author(s):  
W Jenny Shi ◽  
Yonghua Zhuang ◽  
Pamela H Russell ◽  
Brian D Hobbs ◽  
Margaret M Parker ◽  
...  

Abstract Motivation Complex diseases often involve a wide spectrum of phenotypic traits. Better understanding of the biological mechanisms relevant to each trait promotes understanding of the etiology of the disease and the potential for targeted and effective treatment plans. There have been many efforts towards omics data integration and network reconstruction, but limited work has examined the incorporation of relevant (quantitative) phenotypic traits. Results We propose a novel technique, sparse multiple canonical correlation network analysis (SmCCNet), for integrating multiple omics data types along with a quantitative phenotype of interest, and for constructing multi-omics networks that are specific to the phenotype. As a case study, we focus on miRNA–mRNA networks. Through simulations, we demonstrate that SmCCNet has better overall prediction performance compared to popular gene expression network construction and integration approaches under realistic settings. Applying SmCCNet to studies on chronic obstructive pulmonary disease (COPD) and breast cancer, we found enrichment of known relevant pathways (e.g. the Cadherin pathway for COPD and the interferon-gamma signaling pathway for breast cancer) as well as less known omics features that may be important to the diseases. Although those applications focus on miRNA–mRNA co-expression networks, SmCCNet is applicable to a variety of omics and other data types. It can also be easily generalized to incorporate multiple quantitative phenotype simultaneously. The versatility of SmCCNet suggests great potential of the approach in many areas. Availability and implementation The SmCCNet algorithm is written in R, and is freely available on the web at https://cran.r-project.org/web/packages/SmCCNet/index.html. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 36 (7) ◽  
pp. 2017-2024
Author(s):  
Weiwei Zhang ◽  
Ziyi Li ◽  
Nana Wei ◽  
Hua-Jun Wu ◽  
Xiaoqi Zheng

Abstract Motivation Inference of differentially methylated (DM) CpG sites between two groups of tumor samples with different geno- or pheno-types is a critical step to uncover the epigenetic mechanism of tumorigenesis, and identify biomarkers for cancer subtyping. However, as a major source of confounding factor, uneven distributions of tumor purity between two groups of tumor samples will lead to biased discovery of DM sites if not properly accounted for. Results We here propose InfiniumDM, a generalized least square model to adjust tumor purity effect for differential methylation analysis. Our method is applicable to a variety of experimental designs including with or without normal controls, different sources of normal tissue contaminations. We compared our method with conventional methods including minfi, limma and limma corrected by tumor purity using simulated datasets. Our method shows significantly better performance at different levels of differential methylation thresholds, sample sizes, mean purity deviations and so on. We also applied the proposed method to breast cancer samples from TCGA database to further evaluate its performance. Overall, both simulation and real data analyses demonstrate favorable performance over existing methods serving similar purpose. Availability and implementation InfiniumDM is a part of R package InfiniumPurify, which is freely available from GitHub (https://github.com/Xiaoqizheng/InfiniumPurify). Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 36 (11) ◽  
pp. 3563-3565
Author(s):  
Li Chen

Abstract Summary Power analysis is essential to decide the sample size of metagenomic sequencing experiments in a case–control study for identifying differentially abundant (DA) microbes. However, the complexity of microbial data characteristics, such as excessive zeros, over-dispersion, compositionality, intrinsically microbial correlations and variable sequencing depths, makes the power analysis particularly challenging because the analytical form is usually unavailable. Here, we develop a simulation-based power assessment strategy and R package powmic, which considers the complexity of microbial data characteristics. A real data example demonstrates the usage of powmic. Availability and implementation powmic R package and online tutorial are available at https://github.com/lichen-lab/powmic. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


2018 ◽  
Author(s):  
M. Petrizzelli ◽  
D. de Vienne ◽  
C. Dillmann

ABSTRACTHeterosis (hybrid vigor) and inbreeding depression, commonly considered as corollary phenomena, could nevertheless be decoupled under certain assumptions according to theoretical population genetics works. In order to explore this issue on real data, we analyzed the components of genetic variation in a population derived from a half-diallel cross between strains fromSaccharomyces cerevisiaeandS. uvarum, two related yeast species involved in alcoholic fermentation. A large number of phenotypic traits, either molecular (coming from quantitative proteomics) or related to fermentation and life-history, were measured during alcoholic fermentation. Because the parental strains were included in the design, we were able to distinguish between inbreeding effects, which measures phenotypic differences between inbred and hybrids, and heterosis, which measures phenotypic differences between a specific hybrid and the other hybrids sharing a common parent. The sources of phenotypic variation differed depending on the temperature, indicating the predominance of genotype by environment interactions. Decomposing the total genetic variance into variances of additive (intra- and inter-specific) effects, of inbreeding effects and of heterosis (intra- and inter-specific) effects, we showed that the distribution of variance components defined clear-cut groups of proteins and traits. Moreover, it was possible to cluster fermentation and life-history traits into most proteomic groups. Within groups, we observed positive, negative or null correlations between the variances of heterosis and inbreeding effects. To our knowledge, such a decoupling had never been experimentally demonstrated. This result suggests that, despite a common evolutionary history of individuals within a species, the different types of traits have been subject to different selective pressures.


2019 ◽  
Author(s):  
Zachary B. Abrams ◽  
Caitlin E. Coombes ◽  
Suli Li ◽  
Kevin R. Coombes

AbstractSummaryUnsupervised data analysis in many scientific disciplines is based on calculating distances between observations and finding ways to visualize those distances. These kinds of unsupervised analyses help researchers uncover patterns in large-scale data sets. However, researchers can select from a vast number of different distance metrics, each designed to highlight different aspects of different data types. There are also numerous visualization methods with their own strengths and weaknesses. To help researchers perform unsupervised analyses, we developed the Mercator R package. Mercator enables users to see important patterns in their data by generating multiple visualizations using different standard algorithms, making it particularly easy to compare and contrast the results arising from different metrics. By allowing users to select the distance metric that best fits their needs, Mercator helps researchers perform unsupervised analyses that use pattern identification through computation and visual inspection.Availability and ImplementationMercator is freely available at the Comprehensive R Archive Network (https://cran.r-project.org/web/packages/Mercator/index.html)[email protected] informationSupplementary data are available at Bioinformatics online.


2019 ◽  
Vol 36 (6) ◽  
pp. 1785-1794
Author(s):  
Jun Li ◽  
Qing Lu ◽  
Yalu Wen

Abstract Motivation The use of human genome discoveries and other established factors to build an accurate risk prediction model is an essential step toward precision medicine. While multi-layer high-dimensional omics data provide unprecedented data resources for prediction studies, their corresponding analytical methods are much less developed. Results We present a multi-kernel penalized linear mixed model with adaptive lasso (MKpLMM), a predictive modeling framework that extends the standard linear mixed models widely used in genomic risk prediction, for multi-omics data analysis. MKpLMM can capture not only the predictive effects from each layer of omics data but also their interactions via using multiple kernel functions. It adopts a data-driven approach to select predictive regions as well as predictive layers of omics data, and achieves robust selection performance. Through extensive simulation studies, the analyses of PET-imaging outcomes from the Alzheimer’s Disease Neuroimaging Initiative study, and the analyses of 64 drug responses, we demonstrate that MKpLMM consistently outperforms competing methods in phenotype prediction. Availability and implementation The R-package is available at https://github.com/YaluWen/OmicPred. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 35 (24) ◽  
pp. 5182-5190 ◽  
Author(s):  
Luis G Leal ◽  
Alessia David ◽  
Marjo-Riita Jarvelin ◽  
Sylvain Sebert ◽  
Minna Männikkö ◽  
...  

Abstract Motivation Integration of different omics data could markedly help to identify biological signatures, understand the missing heritability of complex diseases and ultimately achieve personalized medicine. Standard regression models used in Genome-Wide Association Studies (GWAS) identify loci with a strong effect size, whereas GWAS meta-analyses are often needed to capture weak loci contributing to the missing heritability. Development of novel machine learning algorithms for merging genotype data with other omics data is highly needed as it could enhance the prioritization of weak loci. Results We developed cNMTF (corrected non-negative matrix tri-factorization), an integrative algorithm based on clustering techniques of biological data. This method assesses the inter-relatedness between genotypes, phenotypes, the damaging effect of the variants and gene networks in order to identify loci-trait associations. cNMTF was used to prioritize genes associated with lipid traits in two population cohorts. We replicated 129 genes reported in GWAS world-wide and provided evidence that supports 85% of our findings (226 out of 265 genes), including recent associations in literature (NLGN1), regulators of lipid metabolism (DAB1) and pleiotropic genes for lipid traits (CARM1). Moreover, cNMTF performed efficiently against strong population structures by accounting for the individuals’ ancestry. As the method is flexible in the incorporation of diverse omics data sources, it can be easily adapted to the user’s research needs. Availability and implementation An R package (cnmtf) is available at https://lgl15.github.io/cnmtf_web/index.html. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 36 (10) ◽  
pp. 3115-3123 ◽  
Author(s):  
Teng Fei ◽  
Tianwei Yu

Abstract Motivation Batch effect is a frequent challenge in deep sequencing data analysis that can lead to misleading conclusions. Existing methods do not correct batch effects satisfactorily, especially with single-cell RNA sequencing (RNA-seq) data. Results We present scBatch, a numerical algorithm for batch-effect correction on bulk and single-cell RNA-seq data with emphasis on improving both clustering and gene differential expression analysis. scBatch is not restricted by assumptions on the mechanism of batch-effect generation. As shown in simulations and real data analyses, scBatch outperforms benchmark batch-effect correction methods. Availability and implementation The R package is available at github.com/tengfei-emory/scBatch. The code to generate results and figures in this article is available at github.com/tengfei-emory/scBatch-paper-scripts. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Giacomo Baruzzo ◽  
Ilaria Patuzzi ◽  
Barbara Di Camillo

Abstract Motivation Single cell RNA-seq (scRNA-seq) count data show many differences compared with bulk RNA-seq count data, making the application of many RNA-seq pre-processing/analysis methods not straightforward or even inappropriate. For this reason, the development of new methods for handling scRNA-seq count data is currently one of the most active research fields in bioinformatics. To help the development of such new methods, the availability of simulated data could play a pivotal role. However, only few scRNA-seq count data simulators are available, often showing poor or not demonstrated similarity with real data. Results In this article we present SPARSim, a scRNA-seq count data simulator based on a Gamma-Multivariate Hypergeometric model. We demonstrate that SPARSim allows to generate count data that resemble real data in terms of count intensity, variability and sparsity, performing comparably or better than one of the most used scRNA-seq simulator, Splat. In particular, SPARSim simulated count matrices well resemble the distribution of zeros across different expression intensities observed in real count data. Availability and implementation SPARSim R package is freely available at http://sysbiobig.dei.unipd.it/? q=SPARSim and at https://gitlab.com/sysbiobig/sparsim. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Krzysztof J Szkop ◽  
David S Moss ◽  
Irene Nobeli

Abstract Motivation We present flexible Modeling of Alternative PolyAdenylation (flexiMAP), a new beta-regression-based method implemented in R, for discovering differential alternative polyadenylation events in standard RNA-seq data. Results We show, using both simulated and real data, that flexiMAP exhibits a good balance between specificity and sensitivity and compares favourably to existing methods, especially at low fold changes. In addition, the tests on simulated data reveal some hitherto unrecognized caveats of existing methods. Importantly, flexiMAP allows modeling of multiple known covariates that often confound the results of RNA-seq data analysis. Availability and implementation The flexiMAP R package is available at: https://github.com/kszkop/flexiMAP. Scripts and data to reproduce the analysis in this paper are available at: https://doi.org/10.5281/zenodo.3689788. Supplementary information Supplementary data are available at Bioinformatics online.


Sign in / Sign up

Export Citation Format

Share Document