Differentiating isoform functions with collaborative matrix factorization

2019 ◽  
Author(s):  
Keyao Wang ◽  
Jun Wang ◽  
Carlotta Domeniconi ◽  
Xiangliang Zhang ◽  
Guoxian Yu

Abstract Motivation Isoforms are alternatively spliced mRNAs of genes. They can be translated into different functional proteoforms, and thus greatly increase the functional diversity of protein variants (or proteoforms). Differentiating the functions of isoforms (or proteoforms) helps understanding the underlying pathology of various complex diseases at a deeper granularity. Since existing functional genomic databases uniformly record the annotations at the gene-level, and rarely record the annotations at the isoform-level, differentiating isoform functions is more challenging than the traditional gene-level function prediction. Results Several approaches have been proposed to differentiate the functions of isoforms. They generally follow the multi-instance learning paradigm by viewing each gene as a bag and the spliced isoforms as its instances, and push functions of bags onto instances. These approaches implicitly assume the collected annotations of genes are complete and only integrate multiple RNA-seq datasets. As such, they have compromised performance. We propose a data integrative solution (called DisoFun) to Differentiate isoform Functions with collaborative matrix factorization. DisoFun assumes the functional annotations of genes are aggregated from those of key isoforms. It collaboratively factorizes the isoform data matrix and gene-term data matrix (storing Gene Ontology (GO) annotations of genes) into low-rank matrices to simultaneously explore the latent key isoforms, and achieve function prediction by aggregating predictions to their originating genes. In addition, it leverages the PPI network and GO structure to further coordinate the matrix factorization. Extensive experimental results show that DisoFun improves the AUROC (area under the receiver-operating characteristic curve) and AUPRC (area under the precision-recall curve) of existing solutions by at least 7.7% and 28.9%, respectively. 
We further investigated DisoFun on four exemplar genes (LMNA, ADAM15, BCL2L1, and CFLAR) with known functions at the isoform level, and observed that DisoFun can differentiate the functions of their isoforms with 90.5% accuracy. Availability The code of DisoFun is available at mlda.swu.edu.cn/codes.php?name=DisoFun. Supplementary information Supplementary data are available at Bioinformatics online.
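The assumption that a gene's annotations are aggregated from those of its key isoforms can be illustrated with a small sketch. This is not DisoFun's implementation: the max-aggregation rule, the function name `aggregate_to_genes`, and the toy identifiers and scores are all assumptions chosen for illustration.

```python
from collections import defaultdict


def aggregate_to_genes(isoform_scores, isoform_to_gene):
    """Aggregate isoform-level term scores into gene-level scores.

    Assumption for this sketch: a gene carries a term as strongly as its
    best-scoring ("key") isoform does, so the per-term maximum is taken.
    """
    gene_scores = defaultdict(dict)
    for isoform, term_scores in isoform_scores.items():
        gene = isoform_to_gene[isoform]
        for term, score in term_scores.items():
            previous = gene_scores[gene].get(term, 0.0)
            gene_scores[gene][term] = max(previous, score)
    return dict(gene_scores)


# Hypothetical toy input: two isoforms of one gene, two GO-like terms.
isoform_scores = {
    "geneA-iso1": {"term_1": 0.9, "term_2": 0.1},
    "geneA-iso2": {"term_1": 0.2, "term_2": 0.8},
}
isoform_to_gene = {"geneA-iso1": "geneA", "geneA-iso2": "geneA"}
gene_scores = aggregate_to_genes(isoform_scores, isoform_to_gene)
```

In this toy case the gene inherits `term_1` from the first isoform and `term_2` from the second, which is the situation that gene-level annotation alone cannot resolve.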

2019 ◽  
Vol 36 (1) ◽  
pp. 303-310 ◽  
Author(s):  
Guoxian Yu ◽  
Keyao Wang ◽  
Carlotta Domeniconi ◽  
Maozu Guo ◽  
Jun Wang

Abstract Motivation Alternative splicing contributes to the functional diversity of protein species, and the proteoforms translated from alternatively spliced isoforms of a gene actually execute the biological functions. Computationally predicting the functions of genes has been studied for decades. However, how to distinguish the functional annotations of isoforms, whose annotations are essential for understanding developmental abnormalities and cancers, is rarely explored. The main bottleneck is that functional annotations of isoforms are generally unavailable and functional genomic databases universally store the functional annotations at the gene level. Results We propose IsoFun to accomplish Isoform Function prediction based on bi-random walks on a heterogeneous network. IsoFun first constructs an isoform functional association network based on the expression profiles of isoforms derived from multiple RNA-seq datasets. Next, IsoFun uses the available Gene Ontology annotations of genes, gene–gene interactions and the relations between genes and isoforms to construct a heterogeneous network. After this, IsoFun performs a tailored bi-random walk on the heterogeneous network to predict the association between GO terms and isoforms, thus accomplishing the prediction of GO annotations of isoforms. Experimental results show that IsoFun significantly outperforms the state-of-the-art algorithms and improves the area under the receiver-operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) by 17% and 44% at the gene level, respectively. We further validated the performance of IsoFun on the genes ADAM15 and BCL2L1. IsoFun accurately differentiates the functions of the respective isoforms of these two genes. Availability and implementation The code of IsoFun is available at http://mlda.swu.edu.cn/codes.php?name=IsoFun. Supplementary information Supplementary data are available at Bioinformatics online.
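A generic bi-random walk can be sketched as follows. This is one common textbook form, not IsoFun's tailored walk: the update rule `R = alpha * P_iso R P_term + (1 - alpha) * R0`, the row normalization, and all names and toy networks are assumptions made for illustration.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def row_normalize(A):
    """Scale each row to sum to 1; all-zero rows stay zero."""
    normalized = []
    for row in A:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def bi_random_walk(iso_net, term_net, R0, alpha=0.5, steps=20):
    """Propagate isoform-term scores over both networks at once.

    iso_net:  isoform-by-isoform association weights (assumed input)
    term_net: term-by-term association weights (assumed input)
    R0:       initial isoform-by-term association scores
    """
    P_iso = row_normalize(iso_net)
    P_term = row_normalize(term_net)
    R = [row[:] for row in R0]
    for _ in range(steps):
        walked = matmul(matmul(P_iso, R), P_term)
        # Restart: blend the walked scores with the initial associations.
        R = [[alpha * w + (1 - alpha) * r0 for w, r0 in zip(rw, r0_row)]
             for rw, r0_row in zip(walked, R0)]
    return R


# Toy example: two linked isoforms, two linked terms, one known association.
iso_net = [[0, 1], [1, 0]]
term_net = [[0, 1], [1, 0]]
R0 = [[1, 0], [0, 0]]
one_step = bi_random_walk(iso_net, term_net, R0, alpha=0.5, steps=1)
```

After one step the second isoform, which started with no annotation, receives a score for the second term because it neighbours the annotated isoform and the terms are linked.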


Author(s):  
Irzam Sarfraz ◽  
Muhammad Asif ◽  
Joshua D Campbell

Abstract Motivation R Experiment objects such as the SummarizedExperiment or SingleCellExperiment are data containers for storing one or more matrix-like assays along with associated row and column data. These objects have been used to facilitate the storage and analysis of high-throughput genomic data generated from technologies such as single-cell RNA sequencing. One common computational task in many genomics analysis workflows is to perform subsetting of the data matrix before applying down-stream analytical methods. For example, one may need to subset the columns of the assay matrix to exclude poor-quality samples or subset the rows of the matrix to select the most variable features. Traditionally, a second object is created that contains the desired subset of assay from the original object. However, this approach is inefficient as it requires the creation of an additional object containing a copy of the original assay and leads to challenges with data provenance. Results To overcome these challenges, we developed an R package called ExperimentSubset, which is a data container that implements classes for efficient storage and streamlined retrieval of assays that have been subsetted by rows and/or columns. These classes are able to inherently provide data provenance by maintaining the relationship between the subsetted and parent assays. We demonstrate the utility of this package on a single-cell RNA-seq dataset by storing and retrieving subsets at different stages of the analysis while maintaining a lower memory footprint. Overall, the ExperimentSubset is a flexible container for the efficient management of subsets. Availability and implementation ExperimentSubset package is available at Bioconductor: https://bioconductor.org/packages/ExperimentSubset/ and Github: https://github.com/campbio/ExperimentSubset. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Bharat Singh ◽  
Om Prakash Vyas

Nowadays, applications that deal with big data are widely used in many popular areas. To handle such data, researchers have developed various approaches over the last few decades. One recently investigated technique, matrix factorization, factors the data matrix through latent factors in a lower-dimensional space. One problem with nonnegative matrix factorization (NMF) approaches is that randomly chosen initial values cannot guarantee a global optimum within a limited number of iterations, only a local one. For this reason, the authors have proposed a new approach that considers the initial values of the decomposition in order to reduce the computational expense. They have devised an algorithm for initializing the values of the decomposed matrices based on particle swarm optimization (PSO). In this paper, the authors propose a genetic-algorithm-based technique that incorporates nonnegative matrix factorization. Through experimental results, they show that the proposed method converges very quickly in comparison with other low-rank approximation techniques such as simple multiplicative NMF and ACLS.
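For reference, the baseline that the abstract compares against, NMF with multiplicative updates and random initialization, can be sketched in plain Python. This is the standard Lee and Seung style update for the Frobenius objective, not the authors' PSO or genetic-algorithm initialization; the function names, the seed, and the small constant `eps` are assumptions for illustration.

```python
import random


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    """Frobenius norm of the residual V - W H."""
    WH = matmul(W, H)
    return sum((v - r) ** 2 for rv, rr in zip(V, WH) for v, r in zip(rv, rr)) ** 0.5


def nmf_multiplicative(V, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a nonnegative matrix V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    # Random nonnegative initialization: the step the abstract aims to improve.
    W = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H


# Toy rank-1 nonnegative matrix, which a rank-1 factorization can fit exactly.
V = [[1, 2], [2, 4], [3, 6]]
W, H = nmf_multiplicative(V, rank=1, iterations=50)
error = frobenius_error(V, W, H)
```

Because the updates only multiply by nonnegative ratios, `W` and `H` stay nonnegative throughout; the quality of the result for harder matrices depends on the starting point, which is why initialization schemes are studied.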


2019 ◽  
Vol 35 (22) ◽  
pp. 4748-4753 ◽  
Author(s):  
Ahmad Borzou ◽  
Razie Yousefi ◽  
Rovshan G Sadygov

Abstract Motivation High throughput technologies are widely employed in modern biomedical research. They yield measurements of a large number of biomolecules in a single experiment. The number of experiments usually is much smaller than the number of measurements in each experiment. The simultaneous measurements of biomolecules provide a basis for a comprehensive, systems view for describing relevant biological processes. Often it is necessary to determine correlations between the data matrices under different conditions or pathways. However, the techniques for analyzing the data with a low number of samples for possible correlations within or between conditions are still in development. Earlier developed correlative measures, such as the RV coefficient, use the trace of the product of data matrices as the most relevant characteristic. However, a recent study has shown that the RV coefficient consistently overestimates the correlations in the case of low sample numbers. To correct for this bias, it was suggested to discard the diagonal elements of the outer products of each data matrix. In this work, a principled approach based on the matrix decomposition generates three trace-independent parts for every matrix. These components are unique, and they are used to determine different aspects of correlations between the original datasets. Results Simulations show that the decomposition results in the removal of high correlation bias and the dependence on the sample number intrinsic to the RV coefficient. We then use the correlations to analyze a real proteomics dataset. Availability and implementation The python code can be downloaded from http://dynamic-proteome.utmb.edu/MatrixCorrelations.aspx. Supplementary information Supplementary data are available at Bioinformatics online.
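The RV coefficient mentioned above, and the bias correction that discards the diagonal of each outer-product matrix, can be sketched as follows. This illustrates only those two earlier measures, not the three-part decomposition proposed in the paper; the function names and the column-centering step are assumptions, and rows are taken to be samples.

```python
import math


def center_columns(X):
    """Subtract each column mean; rows are samples, columns are variables."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]


def outer_product(X):
    """Sample-by-sample matrix X X^T."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in X] for ri in X]


def frobenius_inner(P, Q):
    """Sum of elementwise products; equals trace(P Q) for symmetric P and Q."""
    return sum(p * q for rp, rq in zip(P, Q) for p, q in zip(rp, rq))


def rv_coefficient(X, Y, drop_diagonal=False):
    """RV coefficient between two data matrices measured on the same samples."""
    A = outer_product(center_columns(X))
    B = outer_product(center_columns(Y))
    if drop_diagonal:
        # Modified variant: ignore the diagonal of each outer-product matrix.
        for i in range(len(A)):
            A[i][i] = 0.0
            B[i][i] = 0.0
    return frobenius_inner(A, B) / math.sqrt(frobenius_inner(A, A) * frobenius_inner(B, B))


# Toy data: three samples, two variables; Y is X rescaled by 2.
X = [[1, 2], [3, 4], [5, 7]]
Y = [[2 * v for v in row] for row in X]
rv_same = rv_coefficient(X, Y)
rv_same_modified = rv_coefficient(X, Y, drop_diagonal=True)
```

Both variants are invariant to rescaling, so a matrix compared with a scaled copy of itself scores 1 up to rounding.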


Sensors ◽  
2018 ◽  
Vol 18 (10) ◽  
pp. 3461 ◽  
Author(s):  
Jingwei Yin ◽  
Bing Liu ◽  
Guangping Zhu ◽  
Zhinan Xie

Detecting a moving target in a reverberant environment has long been challenging. In recent years, methods based on low-rank and sparse theory have been developed to study this problem. The multiframe data containing the target echo and reverberation are arranged in a matrix, and detection is then achieved by low-rank and sparse decomposition of the data matrix. In this paper, we introduce a new method for the matrix decomposition using dynamic mode decomposition (DMD). DMD is usually used to calculate the eigenmodes of an approximate linear model. We divide the eigenmodes into two categories to realize the low-rank and sparse decomposition, and we detect the target from the sparse component. Compared with previous methods based on low-rank and sparse theory, our method improves the computation speed by a factor of approximately 4 to 90 at the expense of a slight loss of detection gain. This efficiency is a major advantage for real-time processing, since it frees time for other processing stages that improve detection performance. We have validated the method with three sets of underwater acoustic data.


2020 ◽  
Vol 36 (13) ◽  
pp. 4030-4037
Author(s):  
Lifan Liang ◽  
Kunju Zhu ◽  
Songjian Lu

Abstract Motivation Matrix factorization is an important way to analyze coregulation patterns in transcriptomic data, which can reveal the tumor signal perturbation status and subtype classification. However, current matrix factorization methods do not provide a clear bicluster structure. Furthermore, these algorithms are based on the assumption of linear combination, which may not be sufficient to capture the coregulation patterns. Results We present a new algorithm for Boolean matrix factorization (BMF) via expectation maximization (BEM). BEM is more aligned with the molecular mechanism of transcriptomic coregulation and can scale to matrices with over 100 million data points. Synthetic experiments showed that BEM outperformed other BMF methods in terms of reconstruction error. Real-world applications demonstrated that BEM is applicable to all kinds of transcriptomic data, including bulk RNA-seq, single-cell RNA-seq and spatial transcriptomic datasets. Given appropriate binarization, BEM was able to extract coregulation patterns consistent with disease subtypes, cell types or spatial anatomy. Availability and implementation Python source code of BEM is available at https://github.com/LifanLiang/EM_BMF. Supplementary information Supplementary data are available at Bioinformatics online.
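The difference between a Boolean and a linear factorization lies in how the factors are combined. The sketch below shows the Boolean product (OR of ANDs) and the reconstruction error counted as mismatched entries. It does not implement the BEM algorithm itself; the function names and the toy factors are assumptions for illustration.

```python
def boolean_product(A, B):
    """Boolean matrix product: entry (i, j) is OR over k of (A[i][k] AND B[k][j])."""
    return [[int(any(a and b for a, b in zip(row, col))) for col in zip(*B)]
            for row in A]


def reconstruction_error(X, A, B):
    """Number of entries where the Boolean product of A and B differs from X."""
    R = boolean_product(A, B)
    return sum(x != r for rx, rr in zip(X, R) for x, r in zip(rx, rr))


# Toy factors: 3 samples x 2 patterns, and 2 patterns x 3 genes.
A = [[1, 0], [1, 1], [0, 1]]
B = [[1, 1, 0], [0, 1, 1]]
X = boolean_product(A, B)
mismatches = reconstruction_error(X, A, B)
```

The second sample belongs to both patterns, and both patterns cover the second gene. A linear product would give 2 at that entry, whereas the Boolean product saturates at 1, which is what lets overlapping patterns read as clean biclusters.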


Author(s):  
Ichitaro Yamazaki ◽  
Akihiro Ida ◽  
Rio Yokota ◽  
Jack Dongarra

We parallelize the LU factorization of a hierarchical low-rank matrix (H-matrix) on a distributed-memory computer. This is much more difficult than the H-matrix-vector multiplication due to the dataflow of the factorization, and it is much harder than the parallelization of a dense matrix factorization due to the irregular hierarchical block structure of the matrix. Block low-rank (BLR) format gets rid of the hierarchy and simplifies the parallelization, often increasing concurrency. However, this comes at a price of losing the near-linear complexity of the H-matrix factorization. In this work, we propose to factorize the matrix using a “lattice H-matrix” format that generalizes the BLR format by storing each of the blocks (both diagonals and off-diagonals) in the H-matrix format. These blocks stored in the H-matrix format are referred to as lattices. Thus, this lattice format aims to combine the parallel scalability of BLR factorization with the near-linear complexity of H-matrix factorization. We first compare factorization performances using the H-matrix, BLR, and lattice H-matrix formats under various conditions on a shared-memory computer. Our performance results show that the lattice format has storage and computational complexities similar to those of the H-matrix format, and hence a much lower cost of factorization than BLR. We then compare the BLR and lattice H-matrix factorization on distributed-memory computers. Our performance results demonstrate that compared with BLR, the lattice format with the lower cost of factorization may lead to faster factorization on the distributed-memory computer.


2020 ◽  
Vol 34 (04) ◽  
pp. 5851-5858
Author(s):  
Jonathan Strahl ◽  
Jaakko Peltonen ◽  
Hiroshi Mamitsuka ◽  
Samuel Kaski

In matrix factorization, available graph side-information may not be well suited for the matrix completion problem, having edges that disagree with the latent-feature relations learnt from the incomplete data matrix. We show that removing these contested edges improves prediction accuracy and scalability. We identify the contested edges through a highly-efficient graphical lasso approximation. The identification and removal of contested edges adds no computational complexity to state-of-the-art graph-regularized matrix factorization, remaining linear with respect to the number of non-zeros. Computational load even decreases proportional to the number of edges removed. Formulating a probabilistic generative model and using expectation maximization to extend graph-regularised alternating least squares (GRALS) guarantees convergence. Rich simulated experiments illustrate the desired properties of the resulting algorithm. On real data experiments we demonstrate improved prediction accuracy with fewer graph edges (empirical evidence that graph side-information is often inaccurate). A 300 thousand dimensional graph with three million edges (Yahoo music side-information) can be analyzed in under ten minutes on a standard laptop computer demonstrating the efficiency of our graph update.


2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Leiming Tang ◽  
Xunjie Cao ◽  
Weiyang Chen ◽  
Changbo Ye

In this paper, a low-complexity tensor completion (LTC) scheme is proposed to improve the efficiency of tensor completion. On the one hand, a matrix factorization model is established for complexity reduction, which incorporates matrix factorization into the low-rank tensor completion model. On the other hand, we introduce smoothness through total variation regularization and framelet regularization to guarantee the completion performance. Accordingly, given the proposed smooth matrix factorization (SMF) model, an alternating direction method of multipliers (ADMM)-based solution is further proposed to realize efficient and effective tensor completion. Additionally, we employ a novel tensor initialization approach to accelerate convergence. Finally, simulation results are presented to confirm the gain of the proposed LTC scheme in both efficiency and effectiveness.


2015 ◽  
Vol 2015 ◽  
pp. 1-11 ◽  
Author(s):  
Yishu Wang ◽  
Dejie Yang ◽  
Minghua Deng

Background. Epistatic miniarray profile (EMAP) studies have enabled the mapping of large-scale genetic interaction networks and generated large amounts of data in model organisms. One approach to analyzing EMAP data is to identify gene modules with densely interacting genes. In addition, the genetic interaction score (S-score) reflects the degree of synergizing or mitigating effect of two mutants, which is also informative. Statistical approaches that exploit both modularity and the pairwise interactions may provide more insight into the underlying biology. However, the high missing rate in EMAP data hinders the development of such approaches. To address this problem, we adopted the matrix decomposition methodology “low-rank and sparse decomposition” (LRSDec) to decompose the EMAP data matrix into a low-rank part and a sparse part. Results. LRSDec has been demonstrated to be an effective technique for analyzing EMAP data. We applied it to a synthetic dataset and to an EMAP dataset studying RNA-related processes in Saccharomyces cerevisiae. Global views of the genetic cross talk between different RNA-related protein complexes and processes have been structured, and novel functions of genes have been predicted.

