A zero-inflated non-negative matrix factorization for the deconvolution of mixed signals of biological data

2021 ◽  
Vol 0 (0) ◽  
Author(s):  
Yixin Kong ◽  
Ariangela Kozik ◽  
Cindy H. Nakatsu ◽  
Yava L. Jones-Hall ◽  
Hyonho Chun

Abstract A latent factor model for count data is popularly applied in deconvoluting mixed signals in biological data as exemplified by sequencing data for transcriptome or microbiome studies. Due to the availability of pure samples such as single-cell transcriptome data, the accuracy of the estimates could be much improved. However, the advantage quickly disappears in the presence of excessive zeros. To correctly account for this phenomenon in both mixed and pure samples, we propose a zero-inflated non-negative matrix factorization and derive an effective multiplicative parameter updating rule. In simulation studies, our method yielded the smallest bias. We applied our approach to brain gene expression as well as fecal microbiome datasets, illustrating the superior performance of the approach. Our method is implemented as a publicly available R-package, iNMF.

2021 ◽  
Author(s):  
Zhihong Zhang ◽  
Sai Hu ◽  
Wei Yan ◽  
Bihai Zhao ◽  
Lei Wang

Abstract BackgroundIdentification of essential proteins is very important for understanding the basic requirements to sustain a living organism. In recent years, various different computational methods have been proposed to identify essential proteins based on protein-protein interaction (PPI) networks. However, there has been reliable evidence that a huge amount of false negatives and false positives exist in PPI data. Therefore, it is necessary to reduce the influence of false data on accuracy of essential proteins prediction by integrating multi-source biological information with PPI networks.ResultsIn this paper, we proposed a non-negative matrix factorization and multiple biological information based model (NDM) for identifying essential proteins. The first stage in this progress was to construct a weighted PPI network by combing the information of protein domain, protein complex and the topology characteristic of the original PPI network. Then, the non-negative matrix factorization technique was used to reconstruct an optimized PPI network with whole enough weight of edges. In the final stage, the ranking score of each protein was computed by the PageRank algorithm in which the initial scores were calculated with homologous and subcellular localization information. In order to verify the effectiveness of the NDM method, we compared the NDM with other state-of-the-art essential proteins prediction methods. The comparison of the results obtained from different methods indicated that our NDM model has better performance in predicting essential proteins.ConclusionEmploying the non-negative matrix factorization and integrating multi-source biological data can effectively improve quality of the PPI network, which resulted in the led to optimization of the performance essential proteins identification. This will also provide a new perspective for other prediction based on protein-protein interaction networks.


2019 ◽  
Author(s):  
Rebecca Chen ◽  
Lav R. Varshney

AbstractWe extend the approximation-theoretic technique of optimal recovery to the setting of imputing missing values in clustered data, specifically for non-negative matrix factorization (NMF), and develop an implementable algorithm. Under certain geometric conditions, we prove tight upper bounds on NMF relative error, which is the first bound of this type for missing values. We also give probabilistic bounds for the same geometric assumptions. Experiments on image data and biological data show that this theoretically-grounded technique performs as well as or better than other imputation techniques that account for local structure.


2021 ◽  
Vol 8 ◽  
Author(s):  
Junmin Zhao ◽  
Yuanyuan Ma ◽  
Lifang Liu

A network is an efficient tool to organize complicated data. The Laplacian graph has attracted more and more attention for its good properties and has been applied to many tasks including clustering, feature selection, and so on. Recently, studies have indicated that though the Laplacian graph can capture the global information of data, it lacks the power to capture fine-grained structure inherent in network. In contrast, a Vicus matrix can make full use of local topological information from the data. Given this consideration, in this paper we simultaneously introduce Laplacian and Vicus graphs into a symmetric non-negative matrix factorization framework (LVSNMF) to seek and exploit the global and local structure patterns that inherent in the original data. Extensive experiments are conducted on three real datasets (cancer, cell populations, and microbiome data). The experimental results show the proposed LVSNMF algorithm significantly outperforms other competing algorithms, suggesting its potential in biological data analysis.


2021 ◽  
Author(s):  
Zachary J. DeBruine ◽  
Karsten Melcher ◽  
Timothy J. Triche

AbstractNon-negative matrix factorization (NMF) is an intuitively appealing method to extract additive combinations of measurements from noisy or complex data. NMF is applied broadly to text and image processing, time-series analysis, and genomics, where recent technological advances permit sequencing experiments to measure the representation of tens of thousands of features in millions of single cells. In these experiments, a count of zero for a given feature in a given cell may indicate either the absence of that feature or an insufficient read coverage to detect that feature (“dropout”). In contrast to spectral decompositions such as the Singular Value Decomposition (SVD), the strictly positive imputation of signal by NMF is an ideal fit for single-cell data with ambiguous zeroes. Nevertheless, most single-cell analysis pipelines apply SVD or Principal Component Analysis (PCA) on transformed counts because these implementations are fast while current NMF implementations are slow. To address this need, we present an accessible NMF implementation that is much faster than PCA and rivals the runtimes of state-of-the-art SVD. NMF models learned with our implementation from raw count matrices yield intuitive summaries of complex biological processes, capturing coordinated gene activity and enrichment of sample metadata. Our NMF implementation, available in the RcppML (Rcpp Machine Learning library) R package, improves upon current NMF implementations by introducing a scaling diagonal to enable convex L1 regularization for feature engineering, reproducible factor scalings, and symmetric factorizations. RcppML NMF easily handles sparse datasets with millions of samples, making NMF an attractive replacement for PCA in the analysis of single-cell experiments.


2019 ◽  
Author(s):  
Alexis Vandenbon ◽  
Diego Diez

AbstractSummarySingle-cell sequencing data is often visualized in 2-dimensional plots, including t-SNE plots. However, it is not straightforward to extract biological knowledge, such as differentially expressed genes, from these plots. Here we introduce singleCellHaystack, a methodology that addresses this problem. singleCellHaystack uses Kullback-Leibler Divergence to find genes that are expressed in subsets of cells that are non-randomly positioned on a 2D plot. We illustrate the usage of singleCellHaystack through applications on several single-cell datasets. singleCellHaystack is implemented as an R package, and includes additional functions for clustering and visualization of genes with interesting expression patterns.Availability and implementationhttps://github.com/alexisvdb/[email protected]


2016 ◽  
Vol 30 (20) ◽  
pp. 1650130 ◽  
Author(s):  
Xiao Liu ◽  
Yi-Ming Wei ◽  
Jian Wang ◽  
Wen-Jun Wang ◽  
Dong-Xiao He ◽  
...  

Community detection is a meaningful task in the analysis of complex networks, which has received great concern in various domains. A plethora of exhaustive studies has made great effort and proposed many methods on community detection. Particularly, a kind of attractive one is the two-step method which first makes a preprocessing for the network and then identifies its communities. However, not all types of methods can achieve satisfactory results by using such preprocessing strategy, such as the non-negative matrix factorization (NMF) methods. In this paper, rather than using the above two-step method as most works did, we propose a graph regularized-based model to improve, specialized, the NMF-based methods for the detection of communities, namely NMFGR. In NMFGR, we introduce the similarity metric which contains both the global and local information of networks, to reflect the relationships between two nodes, so as to improve the accuracy of community detection. Experimental results on both artificial and real-world networks demonstrate the superior performance of NMFGR to some competing methods.


2021 ◽  
Vol 25 (2) ◽  
pp. 339-357
Author(s):  
Guowang Du ◽  
Lihua Zhou ◽  
Kevin Lü ◽  
Haiyan Ding

Multi-view clustering aims to group similar samples into the same clusters and dissimilar samples into different clusters by integrating heterogeneous information from multi-view data. Non-negative matrix factorization (NMF) has been widely applied to multi-view clustering owing to its interpretability. However, most NMF-based algorithms only factorize multi-view data based on the shallow structure, neglecting complex hierarchical and heterogeneous information in multi-view data. In this paper, we propose a deep multiple non-negative matrix factorization (DMNMF) framework based on AutoEncoder for multi-view clustering. DMNMF consists of multiple Encoder Components and Decoder Components with deep structures. Each pair of Encoder Component and Decoder Component are used to hierarchically factorize the input data from a view for capturing the hierarchical information, and all Encoder and Decoder Components are integrated into an abstract level to learn a common low-dimensional representation for combining the heterogeneous information across multi-view data. Furthermore, graph regularizers are also introduced to preserve the local geometric information of each view. To optimize the proposed framework, an iterative updating scheme is developed. Besides, the corresponding algorithm called MVC-DMNMF is also proposed and implemented. Extensive experiments on six benchmark datasets have been conducted, and the experimental results demonstrate the superior performance of our proposed MVC-DMNMF for multi-view clustering compared to other baseline algorithms.


2017 ◽  
Author(s):  
G. Durif ◽  
L. Modolo ◽  
J. E. Mold ◽  
S. Lambert-Lacroix ◽  
F. Picard

AbstractMotivationThe development of high throughput single-cell sequencing technologies now allows the investigation of the population diversity of cellular transcriptomes. The expression dynamics (gene-to-gene variability) can be quantified more accurately, thanks to the measurement of lowly-expressed genes. In addition, the cell-to-cell variability is high, with a low proportion of cells expressing the same genes at the same time/level. Those emerging patterns appear to be very challenging from the statistical point of view, especially to represent a summarized view of single-cell expression data. PCA is a most powerful tool for high dimensional data representation, by searching for latent directions catching the most variability in the data. Unfortunately, classical PCA is based on Euclidean distance and projections that poorly work in presence of over-dispersed count data with dropout events like single-cell expression data.ResultsWe propose a probabilistic Count Matrix Factorization (pCMF) approach for single-cell expression data analysis, that relies on a sparse Gamma-Poisson factor model. This hierarchical model is inferred using a variational EM algorithm. It is able to jointly build a low dimensional representation of cells and genes. We show how this probabilistic framework induces a geometry that is suitable for single-cell data visualization, and produces a compression of the data that is very powerful for clustering purposes. Our method is competed against other standard representation methods like t-SNE, and we illustrate its performance for the representation of single-cell expression (scRNA-seq) data.AvailabilityOur work is implemented in the pCMF R-package1.


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Xihui Lin ◽  
Paul C. Boutros

Abstract Background Non-negative matrix factorization (NMF) is a technique widely used in various fields, including artificial intelligence (AI), signal processing and bioinformatics. However existing algorithms and R packages cannot be applied to large matrices due to their slow convergence or to matrices with missing entries. Besides, most NMF research focuses only on blind decompositions: decomposition without utilizing prior knowledge. Finally, the lack of well-validated methodology for choosing the rank hyperparameters also raises concern on derived results. Results We adopt the idea of sequential coordinate-wise descent to NMF to increase the convergence rate. We demonstrate that NMF can handle missing values naturally and this property leads to a novel method to determine the rank hyperparameter. Further, we demonstrate some novel applications of NMF and show how to use masking to inject prior knowledge and desirable properties to achieve a more meaningful decomposition. Conclusions We show through complexity analysis and experiments that our implementation converges faster than well-known methods. We also show that using NMF for tumour content deconvolution can achieve results similar to existing methods like ISOpure. Our proposed missing value imputation is more accurate than conventional methods like multiple imputation and comparable to missForest while achieving significantly better computational efficiency. Finally, we argue that the suggested rank tuning method based on missing value imputation is theoretically superior to existing methods. All algorithms are implemented in the R package NNLM, which is freely available on CRAN and Github.


Sign in / Sign up

Export Citation Format

Share Document