NetMix: A network-structured mixture model for reduced-bias estimation of altered subnetworks

Author(s):  
Matthew A. Reyna ◽  
Uthsav Chitra ◽  
Rebecca Elyanow ◽  
Benjamin J. Raphael

Abstract: A classic problem in computational biology is the identification of altered subnetworks: subnetworks of an interaction network that contain genes/proteins that are differentially expressed, highly mutated, or otherwise aberrant compared to other genes/proteins. Numerous methods have been developed to solve this problem under various assumptions, but the statistical properties of these methods are often unknown. For example, some widely-used methods are reported to output very large subnetworks that are difficult to interpret biologically. In this work, we formulate the identification of altered subnetworks as the problem of estimating the parameters of a class of probability distributions which we call the Altered Subset Distribution (ASD). We derive a connection between a popular method, jActiveModules, and the maximum likelihood estimator (MLE) of the ASD. We show that the MLE is statistically biased, explaining the large subnetworks output by jActiveModules. We introduce NetMix, an algorithm that uses Gaussian mixture models to obtain less biased estimates of the parameters of the ASD. We demonstrate that NetMix outperforms existing methods in identifying altered subnetworks on both simulated and real data, including the identification of differentially expressed genes from both microarray and RNA-seq experiments and the identification of cancer driver genes in somatic mutation data. Availability: NetMix is available online at https://github.com/raphael-group/[email protected]
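
The mixture-model idea in the abstract can be illustrated with a small sketch. It is not the NetMix implementation: it assumes per-gene scores are unit-variance Gaussians, with unaltered genes centred at 0 and altered genes at an unknown mean `mu`, and it omits the network step entirely. The function names and the synthetic data are hypothetical.

```python
import math
import random


def normal_pdf(x, mean):
    """Density of a unit-variance Gaussian with the given mean."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def fit_two_component_mixture(scores, n_iter=200):
    """EM for the mixture (1 - alpha) * N(0, 1) + alpha * N(mu, 1).

    The null component is held fixed at N(0, 1) (an assumption of this
    sketch); only the altered fraction alpha and the altered mean mu are
    fitted. Returns (alpha, mu).
    """
    alpha, mu = 0.5, 1.0  # arbitrary starting values
    for _ in range(n_iter):
        # E-step: posterior probability that each score is "altered".
        resp = []
        for x in scores:
            altered = alpha * normal_pdf(x, mu)
            null = (1.0 - alpha) * normal_pdf(x, 0.0)
            resp.append(altered / (altered + null))
        # M-step: update the mixing weight and the altered mean.
        total = sum(resp)
        alpha = total / len(scores)
        mu = sum(r * x for r, x in zip(resp, scores)) / total
    return alpha, mu


# Synthetic scores: 200 of 2000 hypothetical genes are drawn from N(3, 1).
rng = random.Random(0)
scores = [rng.gauss(3.0, 1.0) if i < 200 else rng.gauss(0.0, 1.0)
          for i in range(2000)]
alpha_hat, mu_hat = fit_two_component_mixture(scores)
estimated_size = round(alpha_hat * len(scores))
print(alpha_hat, mu_hat, estimated_size)
```

In this sketch the fitted `alpha_hat` gives an estimate of how many genes are altered before any subnetwork is selected, so the size of the reported set is not left to a score-maximising search.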

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Cesim Erten ◽  
Aissa Houdjedj ◽  
Hilal Kazan

Abstract Background Recent cancer genomic studies have generated detailed molecular data on a large number of cancer patients. A key remaining problem in cancer genomics is the identification of driver genes. Results We propose BetweenNet, a computational approach that integrates genomic data with a protein-protein interaction network to identify cancer driver genes. BetweenNet utilizes a measure based on betweenness centrality on patient-specific networks to identify the so-called outlier genes that correspond to dysregulated genes for each patient. Setting up the relationship between the mutated genes and the outliers through a bipartite graph, it employs a random-walk process on the graph, which provides the final prioritization of the mutated genes. We compare BetweenNet against state-of-the-art cancer gene prioritization methods on lung, breast, and pan-cancer datasets. Conclusions Our evaluations show that BetweenNet is better at recovering known cancer genes based on multiple reference databases. Additionally, we show that the GO terms and the reference pathways enriched in BetweenNet-ranked genes and those that are enriched in known cancer genes overlap significantly when compared to the overlaps achieved by the rankings of the alternative methods.
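
The random-walk prioritization step can be sketched as follows. The abstract does not specify the exact walk, so this assumes a random walk with restart whose restart mass is spread uniformly over the mutated genes. The graph, the node names (`M1`, `M2`, `O1`, ...), and the restart probability are all hypothetical.

```python
def random_walk_with_restart(neighbors, restart_nodes, restart_prob=0.3, n_iter=200):
    """Power iteration for a random walk with restart on an undirected graph.

    neighbors: dict mapping each node to a non-empty list of adjacent nodes.
    restart_nodes: nodes the walker jumps back to (uniformly) on restart.
    Returns a dict of visiting probabilities that sums to 1.
    """
    restart = {v: 0.0 for v in neighbors}
    for v in restart_nodes:
        restart[v] = 1.0 / len(restart_nodes)
    prob = dict(restart)
    for _ in range(n_iter):
        # Restart mass plus mass pushed evenly along each node's edges.
        nxt = {v: restart_prob * restart[v] for v in neighbors}
        for v, adj in neighbors.items():
            share = (1.0 - restart_prob) * prob[v] / len(adj)
            for u in adj:
                nxt[u] += share
        prob = nxt
    return prob


# Toy bipartite graph: mutated genes (M*) on one side, outlier genes (O*) on the other.
graph = {
    "M1": ["O1", "O2", "O3"],
    "M2": ["O3"],
    "O1": ["M1"],
    "O2": ["M1"],
    "O3": ["M1", "M2"],
}
mutated = ["M1", "M2"]
prob = random_walk_with_restart(graph, mutated)
ranking = sorted(mutated, key=lambda g: -prob[g])
print(ranking)
```

In this toy graph the mutated gene linked to more outlier genes accumulates more visiting probability and is ranked first.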


2011 ◽  
Vol 23 (6) ◽  
pp. 1605-1622 ◽  
Author(s):  
Lingyan Ruan ◽  
Ming Yuan ◽  
Hui Zou

Finite gaussian mixture models are widely used in statistics thanks to their great flexibility. However, parameter estimation for gaussian mixture models with high dimensionality can be challenging because of the large number of parameters that need to be estimated. In this letter, we propose a penalized likelihood estimator to address this difficulty. The [Formula: see text]-type penalty we impose on the inverse covariance matrices encourages sparsity on its entries and therefore helps to reduce the effective dimensionality of the problem. We show that the proposed estimate can be efficiently computed using an expectation-maximization algorithm. To illustrate the practical merits of the proposed method, we consider its applications in model-based clustering and mixture discriminant analysis. Numerical experiments with both simulated and real data show that the new method is a valuable tool for high-dimensional data analysis.


2019 ◽  
Vol 12 (S7) ◽  
Author(s):  
Ying Hui ◽  
Pi-Jing Wei ◽  
Junfeng Xia ◽  
Yu-Tian Wang ◽  
Chun-Hou Zheng

Abstract Background Although there are huge volumes of genomic data, deciphering them and identifying driver events remains a challenge. Current network-based methods typically use the relationship between genomic events and consequent changes in gene expression to nominate putative driver genes, but there may also be relationships within the transcriptional network itself. Methods We developed MECoRank, a novel method that improves the recognition accuracy of driver genes. MECoRank uses a bipartite graph to propagate scores via an iterative process. After iteration, we obtain a ranked gene list for each patient sample. We then apply the Condorcet voting method to determine the most impactful drivers in a population. Results We applied MECoRank to three cancer datasets to reveal candidate driver genes that have a greater impact on gene expression. Experimental results show that our method not only identifies more validated driver genes than other methods, but also recognizes some impactful novel genes that have been shown to be important in the literature. Conclusions We propose a novel approach named MECoRank to prioritize driver genes based on their impact on expression in the molecular interaction network. This method assesses not only the effect of mutations on the transcriptional network, but also the effect of differential expression within the transcriptional network. The results demonstrate that MECoRank performs better than the competing approaches in identifying driver genes.
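
The aggregation of per-patient ranked lists can be sketched with a pairwise-majority tally. The abstract names only "the Condorcet voting method", so the Copeland-style scoring below (one point per pairwise majority win) is an assumption, as are the gene labels and the requirement that every list ranks the same genes.

```python
from itertools import combinations


def condorcet_rank(rankings):
    """Aggregate ranked lists by pairwise majority (Copeland-style tally).

    rankings: list of lists, each ordering the same genes from best to worst.
    Returns genes sorted by the number of pairwise contests they win.
    """
    genes = sorted(rankings[0])
    positions = [{g: i for i, g in enumerate(r)} for r in rankings]
    wins = {g: 0 for g in genes}
    for a, b in combinations(genes, 2):
        # Count the patients whose list places a ahead of b.
        a_pref = sum(1 for pos in positions if pos[a] < pos[b])
        b_pref = len(positions) - a_pref
        if a_pref > b_pref:
            wins[a] += 1
        elif b_pref > a_pref:
            wins[b] += 1
    # Ties in win count are broken alphabetically for determinism.
    return sorted(genes, key=lambda g: (-wins[g], g))


# Three hypothetical patients, each with a ranked list of the same three genes.
patient_rankings = [
    ["G1", "G2", "G3"],
    ["G1", "G3", "G2"],
    ["G2", "G1", "G3"],
]
consensus = condorcet_rank(patient_rankings)
print(consensus)  # ['G1', 'G2', 'G3']
```

Here `G1` beats both other genes in head-to-head comparisons, so it is the Condorcet winner and heads the consensus list.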


2019 ◽  
Vol 2019 ◽  
pp. 1-10 ◽  
Author(s):  
Bo Ma ◽  
Mingyang Wu ◽  
Zhilu Wu ◽  
Zhendong Yin ◽  
Tao Shen

In this paper, an effective multiuser detection (MUD) method is proposed for direct sequence ultrawideband- (DS-UWB-) based space formation flying systems. The proposed method, called GMM-MUD, is based on Gaussian mixture models (GMMs) to suppress multiple access interference. The GMM describes the probability distributions of the hypothesis testing problem used for bit classification. To reveal the difference between correct bits and error bits, the preprocessing operation applies a mapping function based on optimal multiuser detection. The parameters of the GMM are estimated using the expectation-maximization (EM) algorithm. The EM algorithm uses an iterative procedure to reduce the complexity of maximum likelihood estimation and treats the mapping values of received bits as the observations. Simulation results demonstrate that the proposed GMM-MUD algorithm performs well in terms of bit error rate, user capacity, and near-far resistance. Moreover, the computational complexity is low enough for space formation flying applications.


2020 ◽  
Vol 2020 ◽  
pp. 1-13
Author(s):  
Nada A. Alqahtani ◽  
Zakiah I. Kalantan

Data scientists use various machine learning algorithms to discover patterns in large data that can lead to actionable insights. In general, high-dimensional data are reduced by obtaining a set of principal components so as to highlight similarities and differences. In this work, we deal with the reduced data using a bivariate mixture model and learning with a bivariate Gaussian mixture model. We discuss a heuristic for detecting important components by choosing the initial values of location parameters using two different techniques: cluster means, k-means and hierarchical clustering, and default values in the “mixtools” R package. The parameters of the model are obtained via an expectation maximization algorithm. The criteria from Bayesian point are evaluated for both techniques, demonstrating that both techniques are efficient with respect to computation capacity. The effectiveness of the discussed techniques is demonstrated through a simulation study and using real data sets from different fields.


2005 ◽  
Vol 128 (3) ◽  
pp. 479-483
Author(s):  
Hani Hamdan ◽  
Gérard Govaert

In this paper, we present a new and original mixture model approach for acoustic emission (AE) data clustering. AE techniques have been used in a variety of applications in industrial plants. These techniques can provide the most sophisticated monitoring test and can generally be done with the plant/pressure equipment operating at several conditions. Since the AE clusters may present several constraints (different proportions, volumes, orientations, and shapes), we propose to base the AE cluster analysis on Gaussian mixture models, which are a powerful approach in such situations. Furthermore, the diagonal Gaussian mixture model seems to be well adapted to the detection and monitoring of defect classes since the weldings of cylindrical pressure equipment are lengthened horizontally and vertically (cluster shapes lengthened along the axes). The EM (Expectation-Maximization) algorithm applied to a diagonal Gaussian mixture model provides a satisfactory solution, but the real-time constraints imposed in our problem make the application of this algorithm impossible if the number of points becomes too large. The solution that we propose is to use the CEM (Classification Expectation-Maximization) algorithm, which converges faster and generates comparable solutions in terms of the resulting partition. The practical results on real data are very satisfactory from the experts' point of view.
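
A minimal sketch of CEM for a diagonal Gaussian mixture is given below. It differs from EM by a hard classification step: each point is assigned to its most probable component before parameters are updated. The data, starting means, and variance floor are illustrative assumptions, not the authors' setup.

```python
import math
import random


def log_diag_gauss(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * math.log(2.0 * math.pi * v) - 0.5 * (xi - m) ** 2 / v
               for xi, m, v in zip(x, mean, var))


def cem(points, init_means, n_iter=50, min_var=1e-6):
    """Classification EM: hard assignments, then per-cluster updates."""
    k, dim = len(init_means), len(points[0])
    means = [list(m) for m in init_means]
    variances = [[1.0] * dim for _ in range(k)]
    weights = [1.0 / k] * k
    labels = None
    for _ in range(n_iter):
        # E-step and C-step: assign each point to its most probable component.
        new_labels = [
            max(range(k), key=lambda j: math.log(weights[j])
                + log_diag_gauss(x, means[j], variances[j]))
            for x in points
        ]
        if new_labels == labels:
            break  # the partition is stable, so the algorithm has converged
        labels = new_labels
        # M-step: re-estimate proportion, mean, and per-axis variance.
        for j in range(k):
            members = [x for x, lab in zip(points, labels) if lab == j]
            if not members:
                continue  # keep the previous parameters of an empty cluster
            weights[j] = len(members) / len(points)
            for d in range(dim):
                mean_d = sum(x[d] for x in members) / len(members)
                var_d = sum((x[d] - mean_d) ** 2 for x in members) / len(members)
                means[j][d] = mean_d
                variances[j][d] = max(var_d, min_var)
    return labels, means, variances, weights


# Two synthetic clusters, one stretched along each axis.
rng = random.Random(1)
points = [(rng.gauss(0.0, 2.0), rng.gauss(0.0, 0.3)) for _ in range(100)]
points += [(rng.gauss(10.0, 0.3), rng.gauss(10.0, 2.0)) for _ in range(100)]
labels, means, variances, weights = cem(points, [(1.0, 1.0), (9.0, 9.0)])
print(means)
```

Because the loop stops as soon as the partition stops changing, CEM typically needs few iterations, which is in line with the faster convergence described in the abstract.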


2021 ◽  
Vol 16 ◽  
Author(s):  
Xianghua Peng ◽  
Fang Liu ◽  
Ping Liu ◽  
Xing Li ◽  
Xinguo Lu

Aim: In exploring cancer initiation and progression, a great challenge is to identify the driver genes. Background: With advances in next-generation sequencing (NGS) technologies, identification of specific oncogenic genes has emerged through integrating multi-omics data. Although the existing computational models have identified many common driver genes, they rely on individual regulatory mechanisms or independent copy number variants, ignoring the dynamic function of genes in pathways and networks. Objective: The molecular metabolic pathway is a critical biological process in tumor initiation, progression and maintenance. Establishing the role of genes in pathways and networks helps to describe their functional roles under physiological and pathological conditions at multiple levels. Methods: We present a metabolic pathway based driver gene identification method (pathDriver) to distinguish different cancer types/subtypes. In pathDriver, the metabolic pathway is combined with a protein-protein interaction network to construct the pathway network. Then the interaction frequency (IF) and inverse pathway frequency (IPF) are used to evaluate the collaborative impact factor of genes in the pathway network. Finally, the cancer-specific driver genes are identified by calculating the scores of edges connected to genes in the pathway network. Results: We applied it to 16 kinds of TCGA cancers for pan-cancer analysis. Conclusion: The driving pathway identified biologically significant known cancer genes and potential new candidate genes.


Author(s):  
Ziquan Liu ◽  
Lei Yu ◽  
Janet H. Hsiao ◽  
Antoni B. Chan

The Gaussian Mixture Model (GMM) is among the most widely used parametric probability distributions for representing data. However, it is complicated to analyze the relationship among GMMs since they lie on a high-dimensional manifold. Previous works either perform clustering of GMMs, which learns a limited discrete latent representation, or kernel-based embedding of GMMs, which is not interpretable due to difficulty in computing the inverse mapping. In this paper, we propose Parametric Manifold Learning of GMMs (PML-GMM), which learns a parametric mapping from a low-dimensional latent space to a high-dimensional GMM manifold. Similar to PCA, the proposed mapping is parameterized by the principal axes for the component weights, means, and covariances, which are optimized to minimize the reconstruction loss measured using Kullback-Leibler divergence (KLD). As the KLD between two GMMs is intractable, we approximate the objective function by a variational upper bound, which is optimized by an EM-style algorithm. Moreover, we derive an efficient solver by alternating optimization of subproblems and exploit Monte Carlo sampling to escape from local minima. We demonstrate the effectiveness of PML-GMM through experiments on synthetic, eye-fixation, flow cytometry, and social check-in data.
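
To illustrate why an upper bound is useful here, the sketch below computes one simple closed-form bound on the KLD between two one-dimensional GMMs and checks it against numerical integration. The abstract does not state which variational bound PML-GMM uses, so the matched-component bound shown is an assumption: it pairs components by index, and the example mixtures are hypothetical.

```python
import math


def gauss_kl(m1, s1, m2, s2):
    """Closed-form KL( N(m1, s1^2) || N(m2, s2^2) )."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2.0 * s2 ** 2) - 0.5


def matched_kl_upper_bound(f, g):
    """Upper bound on KL(f || g) for GMMs with components paired by index.

    f, g: lists of (weight, mean, std) tuples of equal length.
    Bound: sum_a w_a * (log(w_a / v_a) + KL(f_a || g_a)).
    """
    return sum(wf * (math.log(wf / wg) + gauss_kl(mf, sf, mg, sg))
               for (wf, mf, sf), (wg, mg, sg) in zip(f, g))


def gmm_pdf(x, gmm):
    """Density of a 1-D Gaussian mixture at x."""
    return sum(w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
               for w, m, s in gmm)


def numeric_kl(f, g, lo=-20.0, hi=20.0, steps=40000):
    """Midpoint-rule estimate of KL(f || g); feasible only in low dimension."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        fx, gx = gmm_pdf(x, f), gmm_pdf(x, g)
        if fx > 0.0 and gx > 0.0:
            total += fx * math.log(fx / gx) * dx
    return total


f = [(0.4, -1.0, 1.0), (0.6, 2.0, 0.5)]
g = [(0.5, 0.0, 1.0), (0.5, 2.5, 1.0)]
bound = matched_kl_upper_bound(f, g)
reference = numeric_kl(f, g)
print(reference, bound)
```

The bound needs only per-component quantities, so it stays cheap in high dimension, where the grid integration used here as a reference is no longer practical.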


2020 ◽  
Author(s):  
Cesim Erten ◽  
Aissa Houdjedj ◽  
Hilal Kazan

Abstract Background: Recent cancer genomic studies have generated detailed molecular data on a large number of cancer patients. A key remaining problem in cancer genomics is the identification of driver genes. Results: We propose BetweenNet, a computational approach that integrates genomic data with a protein-protein interaction network to identify cancer driver genes. BetweenNet utilizes a measure based on betweenness centrality on patient-specific networks to identify the so-called outlier genes that correspond to dysregulated genes for each patient. Setting up the relationship between the mutated genes and the outliers through a bipartite graph, it employs a random-walk process on the graph, which provides the final prioritization of the mutated genes. We compare BetweenNet against state-of-the-art cancer gene prioritization methods on lung, breast, and pan-cancer datasets. Conclusions: Our evaluations show that BetweenNet is better at recovering known cancer genes based on multiple reference databases. Additionally, we show that the GO terms and the reference pathways enriched in BetweenNet-ranked genes and those that are enriched in known cancer genes overlap significantly when compared to the overlaps achieved by the rankings of the alternative methods.

