Model-based joint visualization of multiple compositional omics datasets

2020 ◽  
Vol 2 (3) ◽  
Author(s):  
Stijn Hawinkel ◽  
Luc Bijnens ◽  
Kim-Anh Lê Cao ◽  
Olivier Thas

Abstract The integration of multiple omics datasets measured on the same samples is a challenging task: data come from heterogeneous sources and vary in signal quality. In addition, some omics data are inherently compositional, e.g. sequence count data. Most integrative methods are limited in their ability to handle covariates, missing values, compositional structure and heteroscedasticity. In this article we introduce COMBI, a flexible model-based approach to data integration that addresses these limitations. We combine concepts such as compositional biplots and log-ratio link functions with latent variable models, and propose an attractive visualization through multiplots to improve interpretation. Using real data examples and simulations, we illustrate and compare our method with other data integration techniques. Our algorithm is available in the R package combi.
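The log-ratio machinery the abstract builds on can be sketched in a few lines: a centred log-ratio (CLR) transform followed by an SVD yields the coordinates of a compositional biplot. This is an illustrative numpy sketch with made-up counts, not the COMBI model itself.

```python
import numpy as np

# Hypothetical count matrix: 5 samples x 4 taxa. Only relative
# abundances are meaningful (compositional data).
counts = np.array([
    [10, 20, 30, 40],
    [5, 15, 40, 40],
    [20, 20, 30, 30],
    [8, 12, 35, 45],
    [25, 25, 25, 25],
], dtype=float)

props = counts / counts.sum(axis=1, keepdims=True)

# Centred log-ratio transform: log of each part relative to the
# geometric mean of its row; transformed rows sum to zero.
log_p = np.log(props)
clr = log_p - log_p.mean(axis=1, keepdims=True)

# A compositional biplot is then a PCA (via SVD) of the CLR matrix.
centered = clr - clr.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
sample_scores = u[:, :2] * s[:2]   # sample coordinates
loadings = vt[:2].T                # feature (taxa) rays

print(np.allclose(clr.sum(axis=1), 0))
```

COMBI generalizes this idea by replacing the plain SVD with a latent variable model linked through log-ratios, which is what allows covariates and heteroscedasticity to enter.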

2016 ◽  
Vol 15 ◽  
pp. CIN.S40043 ◽  
Author(s):  
Yaohui Zeng ◽  
Patrick Breheny

Discovering important genes that account for the phenotype of interest has long been a challenge in genome-wide expression analysis. Analyses such as gene set enrichment analysis (GSEA) that incorporate pathway information have become widespread in hypothesis testing, but pathway-based approaches have been largely absent from regression methods due to the challenges of dealing with overlapping pathways and the resulting lack of available software. The R package grpreg is widely used to fit group lasso and other group-penalized regression models; in this study, we develop an extension, grpregOverlap, to allow for overlapping group structure using a latent variable approach. We compare this approach to the ordinary lasso and to GSEA using both simulated and real data. We find that incorporation of prior pathway information can substantially improve the accuracy of gene expression classifiers, and we shed light on several ways in which hypothesis-testing approaches such as GSEA differ from regression approaches with respect to the analysis of pathway data.
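The latent variable device for overlapping groups can be illustrated without fitting anything: each feature is replicated once per pathway it belongs to, so the expanded groups no longer overlap and an ordinary group lasso applies; a gene's coefficient is recovered as the sum of its latent copies. This is a hedged numpy sketch with toy indices, not grpregOverlap's internals.

```python
import numpy as np

# Toy overlapping pathways (groups of gene column indices); gene 1
# belongs to both groups, which is what breaks the ordinary group lasso.
groups = [[0, 1], [1, 2, 3]]

X = np.random.default_rng(0).normal(size=(20, 4))

# Latent-variable trick: duplicate each column once per group it
# belongs to. The expanded groups are disjoint by construction.
cols, expanded_groups, start = [], [], 0
for g in groups:
    cols.append(X[:, g])
    expanded_groups.append(list(range(start, start + len(g))))
    start += len(g)
X_latent = np.hstack(cols)

print(X_latent.shape)    # gene 1 now appears in two columns
print(expanded_groups)   # non-overlapping index sets
```

A group-penalized fit on `X_latent` with `expanded_groups` then selects whole pathways in and out of the model.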


Author(s):  
Thomas P. Quinn ◽  
Ionas Erb

Abstract In the health sciences, many data sets produced by next-generation sequencing (NGS) contain only relative information, because biological and technical factors limit the total number of nucleotides observed for a given sample. Because the components are mutually dependent, none can be interpreted in isolation, at least not without normalization. The field of compositional data analysis (CoDA) has emerged with alternative methods for relative data based on log-ratio transforms. However, NGS data often contain many more features than samples, and thus require creative new ways to reduce the dimensionality of the data without sacrificing interpretability. The summation of parts, called amalgamation, is a practical way of reducing dimensionality, but can introduce a non-linear distortion to the data. We exploit this non-linearity to propose a powerful yet interpretable dimension reduction method. In this report, we present data-driven amalgamation as a new method and conceptual framework for reducing the dimensionality of compositional data. Unlike expert-driven amalgamation, which requires prior domain knowledge, our data-driven amalgamation method uses a genetic algorithm to answer the question, “What is the best way to amalgamate the data to achieve the user-defined objective?”. We present a user-friendly R package, called amalgam, that can quickly find the optimal amalgamation to (a) preserve the distance between samples, or (b) classify samples as diseased or not. Our benchmark on 13 real data sets confirms that these amalgamations compete with state-of-the-art unsupervised and supervised dimension reduction methods in terms of performance, but result in new variables that are much easier to understand: they are groups of features added together.
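An amalgamation itself is just a binary membership matrix that sums original parts into super-parts, and the result is still a composition. The sketch below shows one fixed amalgamation on simulated data; the amalgam package's contribution, skipped here, is searching over such matrices with a genetic algorithm.

```python
import numpy as np

# Toy composition: 3 samples x 6 parts, each row sums to 1.
rng = np.random.default_rng(1)
X = rng.dirichlet(np.ones(6), size=3)

# One candidate amalgamation: assign each of the 6 parts to one of
# 3 super-parts via a binary 6 x 3 membership matrix.
A = np.zeros((6, 3))
A[[0, 1], 0] = 1   # parts 0-1 -> amalgam 0
A[[2, 3], 1] = 1   # parts 2-3 -> amalgam 1
A[[4, 5], 2] = 1   # parts 4-5 -> amalgam 2

# Summing parts preserves compositionality: rows still sum to 1.
amalgamated = X @ A
print(np.allclose(amalgamated.sum(axis=1), 1.0))
```

The genetic algorithm scores each candidate `A` by a user-defined objective (distance preservation or classification accuracy) and evolves the membership matrix accordingly.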


Entropy ◽  
2020 ◽  
Vol 22 (9) ◽  
pp. 937 ◽
Author(s):  
Jakub Nowosad ◽  
Peichao Gao

Entropy is a fundamental concept in thermodynamics that is important in many fields, including image processing, neurobiology, urban planning, and sustainability. Until recently, the application of Boltzmann entropy to landscape patterns was mostly limited to conceptual discussion. In the last several years, however, a number of methods for calculating Boltzmann entropy for landscape mosaics and gradients have been proposed. We developed the R package belg as an open-source tool for calculating the Boltzmann entropy of landscape gradients. The package contains functions to calculate relative and absolute Boltzmann entropy using the hierarchy-based and the aggregation-based methods. It also supports input rasters with missing (NA) values, allowing for calculations on real data. In this study, we explain the ideas behind the implemented methods, describe the core functionality of the software, and present three examples of its use. The examples show the basic functions in the package, how to adjust Boltzmann entropy values for data with missing values, and how to use the belg package in larger workflows. We expect that the belg package will be a useful tool in the discussion of using entropy to describe landscape patterns and will facilitate a thermodynamic understanding of landscape dynamics.
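The core idea of Boltzmann entropy for a gradient can be shown on a single window: the macrostate is an aggregated summary (here, the rounded mean) and the entropy is log W, where W counts the microstates consistent with it. This brute-force stdlib sketch is conceptual only and is not belg's (much faster) hierarchy- or aggregation-based algorithm.

```python
import itertools
import math

def boltzmann_entropy_2x2(window, lo, hi):
    """log W for one 2x2 window of integer values in [lo, hi].

    The macrostate is the rounded window mean; W counts all 2x2
    integer configurations in the data range that aggregate to the
    same macrostate.
    """
    target = round(sum(window) / 4)
    W = sum(
        1
        for cells in itertools.product(range(lo, hi + 1), repeat=4)
        if round(sum(cells) / 4) == target
    )
    return math.log(W)

print(boltzmann_entropy_2x2([1, 2, 2, 3], lo=0, hi=4))
```

Summing such window entropies over a raster (and handling NA cells, as belg does) yields the entropy of the whole gradient.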


Mathematics ◽  
2020 ◽  
Vol 8 (4) ◽  
pp. 597 ◽  
Author(s):  
Vincent Vandewalle

In model-based clustering, it is often assumed that a single clustering latent variable explains the heterogeneity of the whole dataset. In many cases, however, several latent variables could explain the heterogeneity of the data at hand. Finding such class variables could yield a richer interpretation of the data. In the continuous data setting, a multi-partition model-based clustering approach is proposed. It assumes the existence of several latent clustering variables, each one explaining the heterogeneity of the data with respect to some clustering subspace, and it simultaneously finds the multi-partitions and the related subspaces. Parameters of the model are estimated through an EM algorithm relying on a probabilistic reinterpretation of factorial discriminant analysis. A model choice strategy relying on the BIC criterion is proposed to select the number of subspaces and the number of clusters per subspace. The obtained results are thus several projections of the data, each one conveying its own clustering of the data. The model's behavior is illustrated on simulated and real data.


2019 ◽  
Vol 36 (7) ◽  
pp. 2017-2024 ◽
Author(s):  
Weiwei Zhang ◽  
Ziyi Li ◽  
Nana Wei ◽  
Hua-Jun Wu ◽  
Xiaoqi Zheng

Abstract Motivation Inference of differentially methylated (DM) CpG sites between two groups of tumor samples with different geno- or phenotypes is a critical step in uncovering the epigenetic mechanism of tumorigenesis and identifying biomarkers for cancer subtyping. However, as a major source of confounding, uneven distributions of tumor purity between two groups of tumor samples will lead to biased discovery of DM sites if not properly accounted for. Results We propose InfiniumDM, a generalized least squares model to adjust for the effect of tumor purity in differential methylation analysis. Our method is applicable to a variety of experimental designs, including those with or without normal controls and with different sources of normal tissue contamination. We compared our method with conventional methods, including minfi, limma and limma corrected for tumor purity, using simulated datasets. Our method shows significantly better performance at different levels of differential methylation thresholds, sample sizes, mean purity deviations and so on. We also applied the proposed method to breast cancer samples from the TCGA database to further evaluate its performance. Overall, both simulation and real data analyses demonstrate favorable performance over existing methods serving a similar purpose. Availability and implementation InfiniumDM is part of the R package InfiniumPurify, which is freely available from GitHub (https://github.com/Xiaoqizheng/InfiniumPurify). Supplementary information Supplementary data are available at Bioinformatics online.
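The confounding mechanism is easy to reproduce: when purity differs between groups and methylation tracks purity, a naive group comparison reports a spurious effect that disappears once purity enters the design. The numpy sketch below uses ordinary least squares on simulated data as a stand-in; it is not the InfiniumDM model.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
group = np.repeat([0, 1], n // 2)
# Uneven purity between groups: the confounder.
purity = np.where(group == 1,
                  rng.uniform(0.6, 0.9, n),
                  rng.uniform(0.3, 0.6, n))

# True model: methylation depends on purity only (no real DM effect).
beta = 0.5 * purity + rng.normal(0, 0.02, n)

# Naive design (intercept + group) vs purity-adjusted design.
X_naive = np.column_stack([np.ones(n), group])
X_adj = np.column_stack([np.ones(n), group, purity])
b_naive, *_ = np.linalg.lstsq(X_naive, beta, rcond=None)
b_adj, *_ = np.linalg.lstsq(X_adj, beta, rcond=None)

print(abs(b_naive[1]) > abs(b_adj[1]))  # adjustment shrinks the artifact
```

InfiniumDM builds on this design-based adjustment with generalized least squares, which additionally models the error covariance.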


Biometrika ◽  
2020 ◽  
Author(s):  
S Na ◽  
M Kolar ◽  
O Koyejo

Abstract Differential graphical models are designed to represent the difference between the conditional dependence structures of two groups, and thus are of particular interest for scientific investigation. Motivated by modern applications, this manuscript considers an extended setting where each group is generated by a latent variable Gaussian graphical model. Due to the existence of latent factors, the differential network is decomposed into sparse and low-rank components, both of which are symmetric indefinite matrices. We estimate these two components simultaneously using a two-stage procedure: (i) an initialization stage, which computes a simple, consistent estimator, and (ii) a convergence stage, implemented using a projected alternating gradient descent algorithm applied to a nonconvex objective, initialized using the output of the first stage. We prove that given the initialization, the estimator converges linearly with a nontrivial, minimax optimal statistical error. Experiments on synthetic and real data illustrate that the proposed nonconvex procedure outperforms existing methods.
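Alternating schemes of this kind rest on two projection primitives: soft-thresholding to keep the sparse component sparse, and a truncated SVD to keep the latent component low-rank. The sketch below shows only those two primitives on a random symmetric matrix; it is a hedged illustration, not the paper's algorithm or its step sizes.

```python
import numpy as np

def soft_threshold(M, lam):
    # Entrywise shrinkage toward zero: the proximal map of the l1 norm.
    return np.sign(M) * np.maximum(np.abs(M) - lam, 0.0)

def project_rank(M, r):
    # Best rank-r approximation via truncated SVD.
    u, s, vt = np.linalg.svd(M, full_matrices=False)
    return (u[:, :r] * s[:r]) @ vt[:r]

rng = np.random.default_rng(4)
M = rng.normal(size=(6, 6))
M = (M + M.T) / 2  # symmetric, like a differential network estimate

S = soft_threshold(M, 1.0)  # candidate sparse component
L = project_rank(M, 2)      # candidate low-rank component

print(np.linalg.matrix_rank(L) <= 2)
```

In the actual procedure these projections are interleaved with gradient steps on the nonconvex objective, starting from the consistent first-stage estimator.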


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Bing Song ◽  
August E. Woerner ◽  
John Planz

Abstract Background Multi-locus genotype data are widely used in population genetics and disease studies. In evaluating the utility of multi-locus data, the independence of markers is commonly considered in many genomic assessments. Generally, pairwise non-random associations are tested by linkage disequilibrium; however, dependence within a panel might involve triplets, quartets, or higher-order combinations. Therefore, a compatible and user-friendly software package is necessary for testing and assessing global linkage disequilibrium among mixed genetic data. Results This study describes a software package for testing the mutual independence of mixed genetic datasets. Mutual independence is defined as the absence of non-random associations among all subsets of the tested panel. The new R package “mixIndependR” calculates basic genetic parameters such as allele frequency, genotype frequency, heterozygosity, Hardy–Weinberg equilibrium, and linkage disequilibrium (LD) from population data, regardless of the type of markers, such as single nucleotide polymorphisms, short tandem repeats, insertions and deletions, or any other genetic markers. A novel method of assessing the dependence of mixed genetic panels is developed in this study and implemented in the software package. The overall independence is tested by comparing the observed distributions of two common summary statistics (the number of heterozygous loci [K] and the number of shared alleles [X]) with their expected distributions under the assumption of mutual independence. Conclusion The package “mixIndependR” is compatible with all categories of genetic markers and detects overall non-random associations. Compared to pairwise disequilibrium, the approach described herein tends to have higher power, especially when the number of markers is large. With this package, more multi-functional or stronger genetic panels can be developed, such as mixed panels with different kinds of markers.
In population genetics, the package “mixIndependR” makes it possible to discover more about population admixture, natural selection, genetic drift, and population demographics, as a more powerful method of detecting LD. Moreover, this new approach can optimize variant selection in disease studies and contribute to combining panels for treatments in multimorbidity. Application of this approach to real data is expected in the future and may bring substantial advances in genetic analysis. Availability The R package mixIndependR is available on the Comprehensive R Archive Network (CRAN) at: https://cran.r-project.org/web/packages/mixIndependR/index.html.
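One of the two summary statistics is simple to compute from raw genotypes: K, the per-individual count of heterozygous loci. The toy below uses a hypothetical allele-pair coding and is illustrative only, not mixIndependR's internals; the package's test then compares the observed distribution of K (and of shared alleles X) against its expectation under mutual independence.

```python
import numpy as np

# Hypothetical coding: genotypes[i, j] is the (allele, allele) pair of
# individual i at locus j. 2 individuals x 3 loci.
genotypes = np.array([
    [[1, 1], [1, 2], [3, 3]],   # individual 1
    [[1, 2], [2, 2], [2, 3]],   # individual 2
])

# K: number of loci where the two alleles differ (heterozygous).
K = (genotypes[:, :, 0] != genotypes[:, :, 1]).sum(axis=1)
print(K.tolist())  # one heterozygous locus for ind. 1, two for ind. 2
```

Under mutual independence, K is a sum of independent Bernoulli variables with per-locus heterozygosity as success probabilities, which gives the reference distribution for the test.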


2021 ◽  
Author(s):  
Rosa F Ropero ◽  
M Julia Flores ◽  
Rafael Rumí

<p>Environmental data often present missing values or a lack of information that makes modelling tasks difficult. Under the framework of the SAICMA Research Project, a flood risk management system is modelled for an Andalusian Mediterranean catchment using information from the Andalusian Hydrological System. Hourly data were collected from October 2011 to September 2020, and present two issues:</p><ul><li>In the Guadarranque River, there are no data for the dam level variable from May to August 2020, probably because of sensor damage.</li> <li>No information about river level is collected in the lower part of the Guadiaro River, which makes it difficult to estimate flood risk in the coastal area.</li> </ul><p>In order to avoid removing the dam variable from the entire model (or discarding those missing months), or even rejecting the modelling of one river system, this abstract aims to provide modelling solutions based on Bayesian networks (BNs) that overcome these limitations.</p><p><em>Guadarranque River. Missing values.</em></p><p>The dataset contains 75687 observations of 6 continuous variables. BN regression models based on fixed structures (Naïve Bayes, NB, and Tree Augmented Naïve Bayes, TAN) were learnt using the complete dataset (until September 2019) with the aim of predicting the dam level variable as accurately as possible. A scenario was carried out with data from October 2019 to March 2020, and the predictions for the target variable were compared with the real data. Results show that both NB (RMSE: 6.29) and TAN (RMSE: 5.74) are able to predict the behaviour of the target variable.</p><p>In addition, a BN based on expert structural learning was learnt with the real data and with both datasets whose missing values were imputed by NB and TAN. Results show that the models learnt with imputed data (NB: 3.33; TAN: 3.07) improve on the error of the model learnt with the real data alone (4.26).</p><p><em>Guadiaro River. Lack of information.</em></p><p>The dataset contains 73636 observations of 14 continuous variables. 
Since the rainfall variables present a high percentage of zero values (over 94%), they were discretised by the equal-frequency method with 4 intervals. The aim is to predict flooding risk in the coastal area, but no data are collected from this area. Thus, an unsupervised classification based on hybrid BNs was performed, in which the target variable classifies all observations into a set of homogeneous groups and gives, for each observation, the probability of belonging to each group. Results show a total of 3 groups:</p><ul><li>Group 0, “Normal situation”: rainfall values equal to 0 and a very low mean river level.</li> <li>Group 1, “Storm situation”: mean rainfall values are over 0.3 mm and all river level variables double their means with respect to group 0.</li> <li>Group 2, “Extreme situation”: both rainfall and river level means are far higher than in the two previous groups.</li> </ul><p>Although validation shows that this methodology is able to identify extreme events, further work is needed. To this end, data from the autumn-winter season (from October 2020 to March 2021) will be used. By including this new information, it would be possible to check whether the latest extreme events (the flooding event during December and storm Filomena during January) are identified.</p>


Biometrika ◽  
2016 ◽  
Vol 103 (1) ◽  
pp. 175-187 ◽  
Author(s):  
Jun Shao ◽  
Lei Wang

Abstract To estimate unknown population parameters based on data having nonignorable missing values with a semiparametric exponential tilting propensity, Kim & Yu (2011) assumed that the tilting parameter is known or can be estimated from external data, in order to avoid the identifiability issue. To remove this serious limitation on the methodology, we use an instrument, i.e., a covariate related to the study variable but unrelated to the missing data propensity, to construct some estimating equations. Because these estimating equations are semiparametric, we profile the nonparametric component using a kernel-type estimator and then estimate the tilting parameter based on the profiled estimating equations and the generalized method of moments. Once the tilting parameter is estimated, so is the propensity, and then other population parameters can be estimated using the inverse propensity weighting approach. Consistency and asymptotic normality of the proposed estimators are established. The finite-sample performance of the estimators is studied through simulation, and a real-data example is also presented.
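The final inverse-propensity-weighting step can be illustrated with simulated nonignorable missingness, using the true propensity for clarity (a hedged sketch; the paper's actual contribution, estimating the tilting parameter from an instrument via GMM, is skipped here).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10000
y = rng.normal(10, 2, n)                 # study variable, true mean 10

# Nonignorable response: the probability of observing y depends on y.
pi = 1 / (1 + np.exp(-(y - 10)))
observed = rng.uniform(size=n) < pi

# Complete-case mean is biased (large y is over-represented).
naive = y[observed].mean()

# Inverse propensity weighting (Hajek form): reweight observed values
# by 1/pi to recover the full-population mean.
w = 1 / pi[observed]
ipw = np.sum(w * y[observed]) / np.sum(w)

print(abs(ipw - 10) < abs(naive - 10))  # IPW is closer to the truth
```

In the paper, the propensity is unknown and semiparametric; once its tilting parameter is estimated from the profiled estimating equations, the same weighting recovers the population parameters.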

