Differentiating isoform functions with collaborative matrix factorization

Bioinformatics ◽

10.1093/bioinformatics/btz847 ◽

2019 ◽

Author(s):

Keyao Wang ◽

Jun Wang ◽

Carlotta Domeniconi ◽

Xiangliang Zhang ◽

Guoxian Yu

Keyword(s):

Matrix Factorization ◽

Characteristic Curve ◽

Function Prediction ◽

Low Rank ◽

Data Matrix ◽

Supplementary Information ◽

Genomic Databases ◽

Gene Level ◽

The Matrix ◽

Level Function

Abstract Motivation Isoforms are alternatively spliced mRNAs of genes. They can be translated into different functional proteoforms, and thus greatly increase the functional diversity of protein variants (or proteoforms). Differentiating the functions of isoforms (or proteoforms) helps in understanding the underlying pathology of various complex diseases at a deeper granularity. Since existing functional genomic databases uniformly record the annotations at the gene-level, and rarely record the annotations at the isoform-level, differentiating isoform functions is more challenging than the traditional gene-level function prediction. Results Several approaches have been proposed to differentiate the functions of isoforms. They generally follow the multi-instance learning paradigm by viewing each gene as a bag and the spliced isoforms as its instances, and push functions of bags onto instances. These approaches implicitly assume the collected annotations of genes are complete and only integrate multiple RNA-seq datasets. As such, they have compromised performance. We propose a data integrative solution (called DisoFun) to Differentiate isoform Functions with collaborative matrix factorization. DisoFun assumes the functional annotations of genes are aggregated from those of key isoforms. It collaboratively factorizes the isoform data matrix and gene-term data matrix (storing Gene Ontology (GO) annotations of genes) into low-rank matrices to simultaneously explore the latent key isoforms, and achieve function prediction by aggregating predictions to their originating genes. In addition, it leverages the PPI network and GO structure to further coordinate the matrix factorization. Extensive experimental results show that DisoFun improves the AUROC (area under the receiver-operating characteristic curve) and AUPRC (area under the precision-recall curve) of existing solutions by at least 7.7% and 28.9%, respectively. We further investigated DisoFun on four exemplar genes (LMNA, ADAM15, BCL2L1, and CFLAR) with known functions at the isoform-level, and observed that DisoFun can differentiate functions of their isoforms with 90.5% accuracy. Availability The code of DisoFun is available at mlda.swu.edu.cn/codes.php?name=DisoFun. Supplementary information Supplementary data are available at Bioinformatics online.

Isoform function prediction based on bi-random walks on a heterogeneous network

Bioinformatics ◽

10.1093/bioinformatics/btz535 ◽

2019 ◽

Vol 36 (1) ◽

pp. 303-310 ◽

Cited By ~ 4

Author(s):

Guoxian Yu ◽

Keyao Wang ◽

Carlotta Domeniconi ◽

Maozu Guo ◽

Jun Wang

Keyword(s):

Random Walks ◽

Heterogeneous Network ◽

Expression Profiles ◽

Function Prediction ◽

Supplementary Information ◽

Receiver Operating Curve ◽

Genomic Databases ◽

Developmental Abnormalities ◽

Functional Annotations ◽

Gene Level

Abstract Motivation Alternative splicing contributes to the functional diversity of protein species, and the proteoforms translated from alternatively spliced isoforms of a gene actually execute the biological functions. Computationally predicting the functions of genes has been studied for decades. However, how to distinguish the functional annotations of isoforms, whose annotations are essential for understanding developmental abnormalities and cancers, is rarely explored. The main bottleneck is that functional annotations of isoforms are generally unavailable and functional genomic databases universally store the functional annotations at the gene level. Results We propose IsoFun to accomplish Isoform Function prediction based on bi-random walks on a heterogeneous network. IsoFun first constructs an isoform functional association network based on the expression profiles of isoforms derived from multiple RNA-seq datasets. Next, IsoFun uses the available Gene Ontology annotations of genes, gene–gene interactions and the relations between genes and isoforms to construct a heterogeneous network. After this, IsoFun performs a tailored bi-random walk on the heterogeneous network to predict the association between GO terms and isoforms, thus accomplishing the prediction of GO annotations of isoforms. Experimental results show that IsoFun significantly outperforms the state-of-the-art algorithms and improves the area under the receiver-operating curve (AUROC) and the area under the precision-recall curve (AUPRC) by 17% and 44% at the gene-level, respectively. We further validated the performance of IsoFun on the genes ADAM15 and BCL2L1. IsoFun accurately differentiates the functions of the respective isoforms of these two genes. Availability and implementation The code of IsoFun is available at http://mlda.swu.edu.cn/codes.php?name=IsoFun. Supplementary information Supplementary data are available at Bioinformatics online.

ExperimentSubset: an R package to manage subsets of Bioconductor Experiment objects

Bioinformatics ◽

10.1093/bioinformatics/btab179 ◽

2021 ◽

Author(s):

Irzam Sarfraz ◽

Muhammad Asif ◽

Joshua D Campbell

Keyword(s):

Single Cell ◽

R Package ◽

Poor Quality ◽

Data Matrix ◽

Supplementary Information ◽

Data Provenance ◽

Rna Seq ◽

Efficient Management ◽

The Matrix ◽

The Relationship

Abstract Motivation R Experiment objects such as the SummarizedExperiment or SingleCellExperiment are data containers for storing one or more matrix-like assays along with associated row and column data. These objects have been used to facilitate the storage and analysis of high-throughput genomic data generated from technologies such as single-cell RNA sequencing. One common computational task in many genomics analysis workflows is to perform subsetting of the data matrix before applying down-stream analytical methods. For example, one may need to subset the columns of the assay matrix to exclude poor-quality samples or subset the rows of the matrix to select the most variable features. Traditionally, a second object is created that contains the desired subset of the assay from the original object. However, this approach is inefficient as it requires the creation of an additional object containing a copy of the original assay and leads to challenges with data provenance. Results To overcome these challenges, we developed an R package called ExperimentSubset, which is a data container that implements classes for efficient storage and streamlined retrieval of assays that have been subsetted by rows and/or columns. These classes are able to inherently provide data provenance by maintaining the relationship between the subsetted and parent assays. We demonstrate the utility of this package on a single-cell RNA-seq dataset by storing and retrieving subsets at different stages of the analysis while maintaining a lower memory footprint. Overall, ExperimentSubset is a flexible container for the efficient management of subsets. Availability and implementation The ExperimentSubset package is available at Bioconductor: https://bioconductor.org/packages/ExperimentSubset/ and GitHub: https://github.com/campbio/ExperimentSubset. Supplementary information Supplementary data are available at Bioinformatics online.

A PSO Based Approach for Producing Optimized Latent Factor in Special Reference to Big Data

International Journal of Service Science Management Engineering and Technology ◽

10.4018/ijssmet.2016070104 ◽

2016 ◽

Vol 7 (3) ◽

pp. 55-70 ◽

Cited By ~ 9

Author(s):

Bharat Singh ◽

Om Prakash Vyas

Keyword(s):

Big Data ◽

Matrix Factorization ◽

Nonnegative Matrix ◽

Low Rank ◽

Data Matrix ◽

Experimental Result ◽

Low Rank Approximation ◽

Latent Factor ◽

New Approach ◽

Rank Approximation

Nowadays, applications that deal with Big Data are used extensively in many popular areas. To tackle this kind of data, researchers have developed various approaches over the last few decades. One recently investigated technique, matrix factorization, factors the data matrix through latent factors in a lower-dimensional space. One problem with nonnegative matrix factorization (NMF) approaches is that, with randomly initialized values, they cannot reach a global optimum within a limited number of iterations and settle on a local optimum instead. For this reason, the authors propose a new approach that considers the initial values of the decomposition to address the computational expense. They devise an algorithm for initializing the values of the decomposed matrices based on particle swarm optimization (PSO). In this paper, the authors present a genetic algorithm based technique that incorporates nonnegative matrix factorization. Through experimental results, they show that the proposed method converges very fast in comparison to other low-rank approximation techniques such as simple multiplicative NMF and ACLS.
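For orientation, the baseline that the abstract calls simple multiplicative NMF can be sketched as below. This is a minimal illustration of the standard multiplicative update rule with random initialization, not the authors' PSO-based initializer; the function name `nmf_multiplicative`, the iteration count, and the toy matrix are assumptions made for the example.

```python
import numpy as np

def nmf_multiplicative(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Approximate a nonnegative matrix V (m x n) as W @ H using
    multiplicative updates for the Frobenius-norm objective."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    # Random nonnegative starting point (the step a smarter initializer would replace).
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(n_iter):
        # Element-wise updates keep W and H nonnegative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy data: an exactly rank-3 nonnegative matrix (hypothetical example input).
rng = np.random.default_rng(1)
V = rng.random((20, 3)) @ rng.random((3, 15))
W, H = nmf_multiplicative(V, rank=3)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {rel_err:.4f}")
```

Because the result depends on the random starting point, changing `seed` can change how quickly the error drops, which is the sensitivity that initialization schemes such as the one described above aim to reduce.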

Another look at matrix correlations

Bioinformatics ◽

10.1093/bioinformatics/btz281 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4748-4753 ◽

Cited By ~ 1

Author(s):

Ahmad Borzou ◽

Razie Yousefi ◽

Rovshan G Sadygov

Keyword(s):

Matrix Decomposition ◽

Data Matrix ◽

Supplementary Information ◽

Biological Processes ◽

Single Experiment ◽

Sample Number ◽

The Matrix ◽

Rv Coefficient ◽

Systems View ◽

Data Matrices

Abstract Motivation High throughput technologies are widely employed in modern biomedical research. They yield measurements of a large number of biomolecules in a single experiment. The number of experiments usually is much smaller than the number of measurements in each experiment. The simultaneous measurements of biomolecules provide a basis for a comprehensive, systems view for describing relevant biological processes. Often it is necessary to determine correlations between the data matrices under different conditions or pathways. However, the techniques for analyzing the data with a low number of samples for possible correlations within or between conditions are still in development. Earlier developed correlative measures, such as the RV coefficient, use the trace of the product of data matrices as the most relevant characteristic. However, a recent study has shown that the RV coefficient consistently overestimates the correlations in the case of low sample numbers. To correct for this bias, it was suggested to discard the diagonal elements of the outer products of each data matrix. In this work, a principled approach based on the matrix decomposition generates three trace-independent parts for every matrix. These components are unique, and they are used to determine different aspects of correlations between the original datasets. Results Simulations show that the decomposition results in the removal of high correlation bias and the dependence on the sample number intrinsic to the RV coefficient. We then use the correlations to analyze a real proteomics dataset. Availability and implementation The python code can be downloaded from http://dynamic-proteome.utmb.edu/MatrixCorrelations.aspx. Supplementary information Supplementary data are available at Bioinformatics online.
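The low-sample bias of the RV coefficient mentioned in this abstract can be reproduced with a short sketch. It uses the conventional definition, RV = tr(S_X S_Y) / sqrt(tr(S_X^2) tr(S_Y^2)) with S_X = X X^T on column-centered data; the function name `rv_coefficient` and the simulated matrices are assumptions for illustration, and the paper's own decomposition is not implemented here.

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two data matrices that share the same rows (samples)."""
    Xc = X - X.mean(axis=0)          # center each variable
    Yc = Y - Y.mean(axis=0)
    Sx = Xc @ Xc.T                   # sample-by-sample outer products
    Sy = Yc @ Yc.T
    return np.trace(Sx @ Sy) / np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))

rng = np.random.default_rng(0)

# Identical matrices give an RV of 1.
X = rng.normal(size=(30, 10))
rv_same = rv_coefficient(X, X)

# Independent data, few samples and many variables: RV is still close to 1.
rv_few_samples = rv_coefficient(rng.normal(size=(5, 1000)),
                                rng.normal(size=(5, 1000)))

# Independent data, many samples and few variables: RV is close to 0.
rv_many_samples = rv_coefficient(rng.normal(size=(500, 5)),
                                 rng.normal(size=(500, 5)))

print(rv_same, rv_few_samples, rv_many_samples)
```

The two simulated pairs are equally unrelated, yet the small-sample pair scores far higher, which is the overestimation the abstract sets out to correct.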

Moving Target Detection Using Dynamic Mode Decomposition

Sensors ◽

10.3390/s18103461 ◽

2018 ◽

Vol 18 (10) ◽

pp. 3461 ◽

Cited By ~ 4

Author(s):

Jingwei Yin ◽

Bing Liu ◽

Guangping Zhu ◽

Zhinan Xie

Keyword(s):

Matrix Decomposition ◽

Dynamic Mode Decomposition ◽

Low Rank ◽

Data Matrix ◽

Dynamic Mode ◽

Moving Target ◽

Real Time Processing ◽

Sparse Decomposition ◽

Mode Decomposition ◽

The Matrix

Detecting a moving target in a reverberant environment has long been a challenge. In recent years, a class of methods based on low-rank and sparse theory has been developed to study this problem. The multiframe data containing the target echo and reverberation are arranged in a matrix, and the detection is then achieved by low-rank and sparse decomposition of the data matrix. In this paper, we introduce a new method for the matrix decomposition using dynamic mode decomposition (DMD). DMD is usually used to calculate eigenmodes of an approximate linear model. We divided the eigenmodes into two categories to realize low-rank and sparse decomposition such that we detected the target from the sparse component. Compared with the previous methods based on low-rank and sparse theory, our method improves the computation speed by approximately 4 to 90 times at the expense of a slight loss of detection gain. The efficient method has a big advantage for real-time processing. This method can spare time for other stages of processing to improve the detection performance. We have validated the method with three sets of underwater acoustic data.
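As background for the DMD step, the eigenmodes of an approximate linear model can be computed with the standard SVD-based ("exact") DMD procedure sketched below. The function name `dmd` and the two-state toy system are assumptions for illustration; the abstract does not state how the eigenmodes are split into low-rank and sparse groups, so that step is not shown.

```python
import numpy as np

def dmd(snapshots, rank):
    """SVD-based DMD: fit Y ~ A X for consecutive snapshot columns and
    return the eigenvalues and modes of the rank-reduced operator."""
    X, Y = snapshots[:, :-1], snapshots[:, 1:]
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank, :]
    V = Vh.conj().T
    # Reduced operator A_tilde = U^H Y V S^{-1} (dividing by s scales columns).
    A_tilde = U.conj().T @ Y @ V / s
    eigvals, W = np.linalg.eig(A_tilde)
    modes = (Y @ V / s) @ W
    return eigvals, modes

# Toy linear system x_{k+1} = A x_k with known eigenvalues 0.9 and 0.5.
A = np.array([[0.9, 0.1],
              [0.0, 0.5]])
x = np.array([1.0, 1.0])
columns = [x]
for _ in range(9):
    x = A @ x
    columns.append(x)
snapshots = np.column_stack(columns)

eigvals, modes = dmd(snapshots, rank=2)
recovered = np.sort(eigvals.real)
print(recovered)
```

On this noise-free toy system the recovered eigenvalues match those of `A`; with real multiframe data the rank would have to be chosen by the user.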

BEM: Mining Coregulation Patterns in Transcriptomics via Boolean Matrix Factorization

Bioinformatics ◽

10.1093/bioinformatics/btz977 ◽

2020 ◽

Vol 36 (13) ◽

pp. 4030-4037

Author(s):

Lifan Liang ◽

Kunju Zhu ◽

Songjian Lu

Keyword(s):

Matrix Factorization ◽

Cell Types ◽

Reconstruction Error ◽

Boolean Matrix ◽

Supplementary Information ◽

Rna Seq ◽

Transcriptomic Data ◽

Real World Application ◽

The Matrix ◽

Data Points

Abstract Motivation Matrix factorization is an important way to analyze coregulation patterns in transcriptomic data, which can reveal the tumor signal perturbation status and subtype classification. However, current matrix factorization methods do not provide a clear bicluster structure. Furthermore, these algorithms are based on the assumption of linear combination, which may not be sufficient to capture the coregulation patterns. Results We presented a new algorithm for Boolean matrix factorization (BMF) via expectation maximization (BEM). BEM is more aligned with the molecular mechanism of transcriptomic coregulation and can scale to matrices with over 100 million data points. Synthetic experiments showed that BEM outperformed other BMF methods in terms of reconstruction error. Real-world application demonstrated that BEM is applicable to all kinds of transcriptomic data, including bulk RNA-seq, single-cell RNA-seq and spatial transcriptomic datasets. Given appropriate binarization, BEM was able to extract coregulation patterns consistent with disease subtypes, cell types or spatial anatomy. Availability and implementation Python source code of BEM is available at https://github.com/LifanLiang/EM_BMF. Supplementary information Supplementary data are available at Bioinformatics online.
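To make the contrast with linear combination concrete, the sketch below shows the Boolean matrix product and the reconstruction error it implies. It only illustrates the BMF model, not the BEM fitting algorithm; the helper names `boolean_product` and `reconstruction_error` and the small factor matrices are assumptions for the example.

```python
import numpy as np

def boolean_product(A, B):
    """Boolean matrix product: entry (i, j) is OR over k of (A[i, k] AND B[k, j])."""
    return (A.astype(int) @ B.astype(int)) > 0

def reconstruction_error(X, A, B):
    """Number of entries where X differs from the Boolean product of A and B."""
    return int(np.sum(X != boolean_product(A, B)))

# Two overlapping biclusters: row 1 belongs to both patterns.
A = np.array([[1, 0],
              [1, 1],
              [0, 1],
              [0, 0]], dtype=bool)
B = np.array([[1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0]], dtype=bool)

X = boolean_product(A, B)
error = reconstruction_error(X, A, B)
linear_max = int((A.astype(int) @ B.astype(int)).max())
print(X.astype(int))
print(error, linear_max)
```

Where the two patterns overlap, the ordinary product yields 2 while the Boolean product stays at 1, so each factor pair reads directly as a bicluster (a set of rows and a set of columns).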

Distributed-memory lattice H-matrix factorization

The International Journal of High Performance Computing Applications ◽

10.1177/1094342019861139 ◽

2019 ◽

Vol 33 (5) ◽

pp. 1046-1063 ◽

Cited By ~ 3

Author(s):

Ichitaro Yamazaki ◽

Akihiro Ida ◽

Rio Yokota ◽

Jack Dongarra

Keyword(s):

Matrix Factorization ◽

Distributed Memory ◽

Lower Cost ◽

Linear Complexity ◽

Low Rank ◽

Dense Matrix ◽

Matrix Vector Multiplication ◽

Matrix Block ◽

The Matrix ◽

Performance Results

We parallelize the LU factorization of a hierarchical low-rank matrix (H-matrix) on a distributed-memory computer. This is much more difficult than the H-matrix-vector multiplication due to the dataflow of the factorization, and it is much harder than the parallelization of a dense matrix factorization due to the irregular hierarchical block structure of the matrix. Block low-rank (BLR) format gets rid of the hierarchy and simplifies the parallelization, often increasing concurrency. However, this comes at a price of losing the near-linear complexity of the H-matrix factorization. In this work, we propose to factorize the matrix using a “lattice H-matrix” format that generalizes the BLR format by storing each of the blocks (both diagonals and off-diagonals) in the H-matrix format. These blocks stored in the H-matrix format are referred to as lattices. Thus, this lattice format aims to combine the parallel scalability of BLR factorization with the near-linear complexity of H-matrix factorization. We first compare factorization performances using the H-matrix, BLR, and lattice H-matrix formats under various conditions on a shared-memory computer. Our performance results show that the lattice format has storage and computational complexities similar to those of the H-matrix format, and hence a much lower cost of factorization than BLR. We then compare the BLR and lattice H-matrix factorization on distributed-memory computers. Our performance results demonstrate that compared with BLR, the lattice format with the lower cost of factorization may lead to faster factorization on the distributed-memory computer.
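The low-rank block storage that both BLR and hierarchical formats rely on can be illustrated with a truncated SVD of a single off-diagonal block. This is a minimal sketch, not the paper's lattice format or its LU factorization; the 1/|x - y| kernel, the point sets, and the name `compress_block` are assumptions chosen so that the block is numerically low-rank.

```python
import numpy as np

def compress_block(block, rank):
    """Return factors (left, right) with left @ right approximating the block."""
    U, s, Vh = np.linalg.svd(block, full_matrices=False)
    return U[:, :rank] * s[:rank], Vh[:rank, :]

# Interaction block between two well-separated point clusters.
x = np.linspace(0.0, 1.0, 64)
y = np.linspace(2.0, 3.0, 64)
block = 1.0 / np.abs(x[:, None] - y[None, :])

left, right = compress_block(block, rank=8)
rel_err = np.linalg.norm(block - left @ right) / np.linalg.norm(block)

dense_entries = block.size                  # 64 * 64 stored values
low_rank_entries = left.size + right.size   # 2 * 64 * 8 stored values
print(rel_err, dense_entries, low_rank_entries)
```

Storing the two thin factors instead of the dense block is the source of the savings; hierarchical formats apply the same idea recursively to blocks of different sizes.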

Scalable Probabilistic Matrix Factorization with Graph-Based Priors

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6043 ◽

2020 ◽

Vol 34 (04) ◽

pp. 5851-5858

Author(s):

Jonathan Strahl ◽

Jaakko Peltonen ◽

Hiroshi Mamitsuka ◽

Samuel Kaski

Keyword(s):

Matrix Factorization ◽

Prediction Accuracy ◽

Side Information ◽

Matrix Completion ◽

Real Data ◽

Data Matrix ◽

Laptop Computer ◽

Completion Problem ◽

Graphical Lasso ◽

The Matrix

In matrix factorization, available graph side-information may not be well suited for the matrix completion problem, having edges that disagree with the latent-feature relations learnt from the incomplete data matrix. We show that removing these contested edges improves prediction accuracy and scalability. We identify the contested edges through a highly-efficient graphical lasso approximation. The identification and removal of contested edges adds no computational complexity to state-of-the-art graph-regularized matrix factorization, remaining linear with respect to the number of non-zeros. Computational load even decreases proportional to the number of edges removed. Formulating a probabilistic generative model and using expectation maximization to extend graph-regularised alternating least squares (GRALS) guarantees convergence. Rich simulated experiments illustrate the desired properties of the resulting algorithm. On real data experiments we demonstrate improved prediction accuracy with fewer graph edges (empirical evidence that graph side-information is often inaccurate). A 300 thousand dimensional graph with three million edges (Yahoo music side-information) can be analyzed in under ten minutes on a standard laptop computer demonstrating the efficiency of our graph update.

An Efficient Tensor Completion Method Combining Matrix Factorization and Smoothness

Wireless Communications and Mobile Computing ◽

10.1155/2021/5515446 ◽

2021 ◽

Vol 2021 ◽

pp. 1-13

Author(s):

Leiming Tang ◽

Xunjie Cao ◽

Weiyang Chen ◽

Changbo Ye

Keyword(s):

Matrix Factorization ◽

Low Complexity ◽

Low Rank ◽

Total Variation Regularization ◽

Tensor Completion ◽

Alternating Direction ◽

Factorization Model ◽

Efficiency And Effectiveness ◽

The Matrix ◽

Accelerate Convergence

In this paper, the low-complexity tensor completion (LTC) scheme is proposed to improve the efficiency of tensor completion. On one hand, a matrix factorization model is established for complexity reduction by incorporating matrix factorization into the model of low-rank tensor completion. On the other hand, we introduce smoothness through total variation regularization and framelet regularization to guarantee the completion performance. Accordingly, given the proposed smooth matrix factorization (SMF) model, an alternating direction method of multipliers (ADMM)-based solution is further proposed to realize efficient and effective tensor completion. Additionally, we employ a novel tensor initialization approach to accelerate convergence. Finally, simulation results are presented to confirm the system gain of the proposed LTC scheme in both efficiency and effectiveness.

Low-Rank and Sparse Matrix Decomposition for Genetic Interaction Data

BioMed Research International ◽

10.1155/2015/573956 ◽

2015 ◽

Vol 2015 ◽

pp. 1-11 ◽

Cited By ~ 1

Author(s):

Yishu Wang ◽

Dejie Yang ◽

Minghua Deng

Keyword(s):

Large Scale ◽

Genetic Interaction ◽

Sparse Matrix ◽

Protein Complexes ◽

Matrix Decomposition ◽

Low Rank ◽

Data Matrix ◽

Model Organisms ◽

The Matrix ◽

Genetic Interaction Data

Background. Epistatic miniarray profile (EMAP) studies have enabled the mapping of large-scale genetic interaction networks and generated large amounts of data in model organisms. One approach to analyzing EMAP data is to identify gene modules with densely interacting genes. In addition, the genetic interaction score (S-score) reflects the degree of synergizing or mitigating effect of two mutants, which is also informative. Statistical approaches that exploit both modularity and the pairwise interactions may provide more insight into the underlying biology. However, the high missing rate in EMAP data hinders the development of such approaches. To address the above problem, we adopted the matrix decomposition methodology “low-rank and sparse decomposition” (LRSDec) to decompose the EMAP data matrix into a low-rank part and a sparse part. Results. LRSDec has been demonstrated to be an effective technique for analyzing EMAP data. We applied it to a synthetic dataset and an EMAP dataset studying RNA-related processes in Saccharomyces cerevisiae. Global views of the genetic cross talk between different RNA-related protein complexes and processes have been structured, and novel functions of genes have been predicted.
