Sparse data embedding and prediction by tropical matrix factorization

Abstract. Background: Matrix factorization methods are linear models, with limited capability to model complex relations. In our work, we use the tropical semiring to introduce non-linearity into matrix factorization models. We propose a method called Sparse Tropical Matrix Factorization (STMF) for the estimation of missing (unknown) values in sparse data. Results: We evaluate the efficiency of the STMF method on both synthetic data and biological data in the form of gene expression measurements downloaded from The Cancer Genome Atlas (TCGA) database. Tests on unique synthetic data showed that the STMF approximation achieves a higher correlation than non-negative matrix factorization (NMF), which is unable to recover patterns effectively. On real data, STMF outperforms NMF on six out of nine gene expression datasets. While NMF assumes a normal distribution and tends toward the mean value, STMF can better fit extreme values and distributions. Conclusion: STMF is the first work that uses the tropical semiring on sparse data. We show that in certain cases semirings are useful because they consider the structure, which is different and simpler to understand than it is with standard linear algebra.
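
To make the idea of a tropical product concrete, here is a minimal sketch, assuming the max-plus convention (addition is replaced by max, multiplication by +). The function names, the toy factors, and the use of None to mark missing entries are illustrative assumptions, not the authors' implementation, and no fitting procedure is shown.

```python
def maxplus_matmul(U, V):
    """Max-plus product: result[i][j] = max over k of (U[i][k] + V[k][j])."""
    inner = len(V)
    cols = len(V[0])
    return [[max(row[k] + V[k][j] for k in range(inner)) for j in range(cols)]
            for row in U]


def fill_missing(X, U, V):
    """Keep observed entries of X; replace None with the max-plus prediction."""
    R = maxplus_matmul(U, V)
    return [[R[i][j] if x is None else x for j, x in enumerate(row)]
            for i, row in enumerate(X)]


# Toy factors (hypothetical values chosen by hand, not learned).
U = [[0.0, 2.0], [1.0, 0.0]]
V = [[1.0, 3.0], [0.0, 1.0]]
X = [[2.0, None], [None, 4.0]]  # None marks an unknown value

completed = fill_missing(X, U, V)
print(completed)
```

Because each predicted entry is a maximum of sums rather than a sum of products, a single dominant factor determines the value, which is one way to read the claim that the tropical model fits extreme values better than a linear one.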

Application of a Deep Matrix Factorization Model on Integrated Gene Expression Data

Current Bioinformatics ◽

10.2174/1574893614666191017094331 ◽

2020 ◽

Vol 15 (4) ◽

pp. 359-367

Author(s):

Yong-Jing Hao ◽

Mi-Xiao Hou ◽

Ying-Lian Gao ◽

Jin-Xing Liu ◽

Xiang-Zhen Kong

Keyword(s):

Gene Expression ◽

Differentially Expressed Genes ◽

Gene Expression Data ◽

Matrix Factorization ◽

Coefficient Matrix ◽

The Cancer Genome Atlas ◽

Differentially Expressed ◽

Complex Data ◽

Expression Data ◽

Factorization Model

Background: Non-negative Matrix Factorization (NMF) has been extensively applied to gene expression data. However, most NMF-based methods have single-layer structures, which may perform poorly on complex data. Deep learning, with its carefully designed hierarchical structure, has shown significant advantages in learning data features. Objective: To discover differentially expressed genes in gene expression data and to improve sample clustering results, which can provide reference value for the prevention and treatment of cancer. Method: In this paper, we apply a deep NMF method called Deep Semi-NMF to integrated gene expression data. In each layer, the coefficient matrix is directly decomposed into the basis matrix and coefficient matrix of the next layer. We apply this factorization model to The Cancer Genome Atlas (TCGA) genomic data. Results: The experimental results demonstrate the superiority of the Deep Semi-NMF method in identifying differentially expressed genes and clustering samples. Conclusion: The Deep Semi-NMF model decomposes a matrix into multiple factor matrices whose product approximates the original matrix. It can also improve the clustering performance of samples while identifying more accurate key genes for disease treatment.
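
The layered structure described in the Method can be sketched as a chain of matrix products. This is only an illustration of the shapes involved, assuming two layers and hand-picked toy factors; the names Z1, Z2, H1, H2 are hypothetical, and the optimization that would actually learn the factors is not shown.

```python
def matmul(A, B):
    """Plain matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def reconstruct(bases, H_last):
    """Multiply Z_1 ... Z_m by the final coefficient matrix H_m."""
    out = H_last
    for Z in reversed(bases):
        out = matmul(Z, out)
    return out


# Toy two-layer factors (hypothetical values, not learned from data).
Z1 = [[1.0, -1.0], [0.5, 2.0], [0.0, 1.0]]  # 3 x 2 basis, may contain negatives
Z2 = [[1.0], [2.0]]                          # 2 x 1 basis of the second layer
H2 = [[1.0, 0.0, 3.0, 2.0]]                  # 1 x 4 non-negative coefficients

H1 = matmul(Z2, H2)                 # layer-1 coefficients, decomposed as Z2 * H2
X_hat = reconstruct([Z1, Z2], H2)   # 3 x 4 approximation of the data matrix
print(X_hat)
```

Each layer factorizes the coefficient matrix of the previous one, so the inner dimension shrinks from layer to layer (here from 2 to 1).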

The covariance shift (C-SHIFT) algorithm for normalizing biological data

10.1101/2020.04.13.038463 ◽

2020 ◽

Author(s):

Evgenia Chunikhina ◽

Paul Logan ◽

Yevgeniy Kovchegov ◽

Anatoly Yambartsev ◽

Debashis Mondal ◽

...

Keyword(s):

Gene Expression ◽

Covariance Matrix ◽

Gene Expression Data ◽

Synthetic Data ◽

Biological Data ◽

Optimization Techniques ◽

Expression Data ◽

Absolute Deviation ◽

Normalization Methods ◽

Gene Network Analysis

Abstract. Omics technologies are powerful tools for analyzing patterns in gene expression data for thousands of genes. Due to a number of systematic variations in experiments, the raw gene expression data are often obscured by undesirable technical noise. Various normalization techniques have been designed in an attempt to remove these non-biological errors prior to any statistical analysis. One of the reasons for normalizing data is the need to recover the covariance matrix used in gene network analysis. In this paper, we introduce a novel normalization technique, called the covariance shift (C-SHIFT) method. This normalization algorithm uses optimization techniques together with the blessing of dimensionality philosophy and the energy minimization hypothesis for covariance matrix recovery under additive noise (in biology, known as the bias). Thus, it is well suited for the analysis of logarithmic gene expression data. Numerical experiments on synthetic data demonstrate the method's advantage over classical normalization techniques. Namely, the comparison is made with rank, quantile, cyclic LOESS (locally estimated scatterplot smoothing), and MAD (median absolute deviation) normalization methods.
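
For readers unfamiliar with the baselines, the following is a minimal sketch of one common form of MAD normalization: center each sample at its median and divide by the median absolute deviation. This illustrates a comparison method only, not C-SHIFT itself; the function name is hypothetical, and some variants additionally rescale the MAD by a constant.

```python
from statistics import median


def mad_normalize(values):
    """Center at the median and scale by the median absolute deviation."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; values cannot be scaled")
    return [(v - center) / mad for v in values]


# One sample with a single extreme value.
sample = [1.0, 2.0, 3.0, 4.0, 100.0]
print(mad_normalize(sample))
```

Because both the center and the scale are medians, the extreme value 100.0 does not affect how the other four values are normalized.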

Gene Expression Analysis through Parallel Non-Negative Matrix Factorization

Computation ◽

10.3390/computation9100106 ◽

2021 ◽

Vol 9 (10) ◽

pp. 106

Author(s):

Angelica Alejandra Serrano-Rubio ◽

Guillermo B. Morales-Luna ◽

Amilcar Meneses-Viveros

Keyword(s):

Gene Expression ◽

Expression Analysis ◽

Matrix Factorization ◽

Clustering Algorithm ◽

Gene Selection ◽

Clustering Algorithms ◽

Biological Data ◽

Dimensional Structure ◽

Computational Time ◽

Non Negative Matrix Factorization

Gene expression analysis is a principal tool for explaining the behavior of genes in an organism exposed to different experimental conditions. Many clustering algorithms have been proposed in the state of the art. However, the amount of biological data is overwhelming, and its high-dimensional structure exceeds the capacity of most current computational architectures. Computational time and memory consumption therefore become decisive factors in choosing a clustering algorithm. We propose a clustering algorithm based on Non-negative Matrix Factorization and K-means that reduces data dimensionality while preserving the biological context and prioritizing gene selection; it is implemented in parallel GPU-based environments through the CUDA library. A well-known dataset is used in our tests, and the quality of the results is measured with the Rand and Accuracy indices. The results show a speedup of 6.22× compared to the sequential version. The algorithm is competitive in the analysis of biological datasets and is invariant with respect to the number of classes and the size of the gene expression matrix.
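
The Rand index mentioned as a quality measure can be computed directly from its definition: the fraction of sample pairs on which two labelings agree (both place the pair together, or both place it apart). The sketch below is a plain sequential illustration with made-up labels, not the paper's GPU code.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs treated the same way by both labelings."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical labels for four samples.
truth = [0, 0, 1, 1]
predicted = [0, 0, 1, 2]
print(rand_index(truth, predicted))
```

In this toy case five of the six pairs agree; only the last two samples, which share a true class but are split by the prediction, count as a disagreement.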

A Sparse and Low-Rank Regression Model for Identifying the Relationships Between DNA Methylation and Gene Expression Levels in Gastric Cancer and the Prediction of Prognosis

Genes ◽

10.3390/genes12060854 ◽

2021 ◽

Vol 12 (6) ◽

pp. 854

Author(s):

Yishu Wang ◽

Lingyun Xu ◽

Dongmei Ai

Keyword(s):

Gene Expression ◽

Gastric Cancer ◽

DNA Methylation ◽

Regression Model ◽

The Cancer Genome Atlas ◽

Low Rank ◽

Expression Levels ◽

Rank Regression ◽

Prediction Of Prognosis ◽

Gene Expression Levels

DNA methylation is an important regulator of gene expression that can influence tumor heterogeneity and shows weak and varying expression levels among different genes. Gastric cancer (GC) is a highly heterogeneous cancer of the digestive system with a high mortality rate worldwide. The heterogeneous subtypes of GC lead to different prognoses. In this study, we explored the relationships between DNA methylation and gene expression levels by introducing a sparse low-rank regression model based on a GC dataset with 375 tumor samples and 32 normal samples from The Cancer Genome Atlas database. Differences in the DNA methylation levels and sites were found to be associated with differences in the expressed genes related to GC development. Overall, 29 methylation-driven genes were found to be related to the GC subtypes, and in the prognostic model, we explored five prognoses related to the methylation sites. Finally, based on a low-rank matrix, seven subgroups were identified with different methylation statuses. These specific classifications based on DNA methylation levels may help to account for heterogeneity and aid in personalized treatments.

Identification of glycolysis related pathways in pancreatic adenocarcinoma and liver hepatocellular carcinoma based on TCGA and GEO datasets

Cancer Cell International ◽

10.1186/s12935-021-01809-y ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Ji Li ◽

Chen Zhu ◽

Peipei Yue ◽

Tianyu Zheng ◽

Yan Li ◽

...

Keyword(s):

Gene Expression ◽

Hepatocellular Carcinoma ◽

Energy Metabolism ◽

Pancreatic Adenocarcinoma ◽

Digestive System ◽

Prognostic Significance ◽

R Package ◽

The Cancer Genome Atlas ◽

System Structure ◽

Liver Hepatocellular Carcinoma

Abstract. Background: Abnormal energy metabolism is one of the characteristics of tumor cells and has been a research hotspot in recent years. Due to the complexity of the digestive system structure, tumors occur there relatively frequently. We aim to clarify the prognostic significance of energy metabolism in digestive system tumors and the underlying mechanisms. Methods: The Gene Set Variation Analysis (GSVA) R package was used to establish the metabolic score, and the score was used to represent the metabolic level. The relationship between metabolism and prognosis of digestive system tumors was explored using The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) databases. Volcano plots and gene ontology (GO) analysis were used to show the genes and functions that differ between glycolysis levels, and GSEA was used to analyze pathway enrichment. A nomogram was constructed with an R package based on gene characteristics and clinical parameters. qPCR and Western blot were applied to analyze gene expression. All statistical analyses were conducted using SPSS, GraphPad Prism 7, and R software. All validation experiments were performed three times independently. Results: A high glycolysis metabolism score was significantly associated with poor prognosis in pancreatic adenocarcinoma (PAAD) and liver hepatocellular carcinoma (LIHC). The STAT3 (signal transducer and activator of transcription 3) and YAP1 (Yes1-associated transcriptional regulator) pathways were the most critical signaling pathways in glycolysis modulation in PAAD and LIHC, respectively. Interestingly, elevated glycolysis levels could also enhance STAT3 and YAP1 activity in PAAD and LIHC cells, respectively, forming a positive feedback loop. Conclusions: Our results may provide new insights into the indispensable role of glycolysis metabolism in digestive system tumors and guide the direction of future metabolism–signaling target combined therapy.

A zero-inflated non-negative matrix factorization for the deconvolution of mixed signals of biological data

The International Journal of Biostatistics ◽

10.1515/ijb-2020-0039 ◽

2021 ◽

Vol 0 (0) ◽

Author(s):

Yixin Kong ◽

Ariangela Kozik ◽

Cindy H. Nakatsu ◽

Yava L. Jones-Hall ◽

Hyonho Chun

Keyword(s):

Matrix Factorization ◽

Factor Model ◽

R Package ◽

Biological Data ◽

Superior Performance ◽

Sequencing Data ◽

Fecal Microbiome ◽

Brain Gene Expression ◽

Cell Transcriptome ◽

Non Negative Matrix Factorization

Abstract. A latent factor model for count data is widely applied to deconvolute mixed signals in biological data, as exemplified by sequencing data for transcriptome or microbiome studies. When pure samples such as single-cell transcriptome data are available, the accuracy of the estimates can be much improved. However, the advantage quickly disappears in the presence of excessive zeros. To correctly account for this phenomenon in both mixed and pure samples, we propose a zero-inflated non-negative matrix factorization and derive an effective multiplicative parameter updating rule. In simulation studies, our method yielded the smallest bias. We applied our approach to brain gene expression as well as fecal microbiome datasets, illustrating the superior performance of the approach. Our method is implemented as a publicly available R package, iNMF.
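
As background on what a multiplicative updating rule looks like, the sketch below implements the classical multiplicative updates for ordinary NMF under a squared-error objective. It is not the zero-inflated rule derived in the paper, which the abstract does not spell out; the helper names, the toy matrix, the starting factors, and the small eps added to the denominators are all assumptions made for illustration.

```python
def matmul(A, B):
    """Plain matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def squared_error(X, W, H):
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry))


def nmf_step(X, W, H, eps=1e-9):
    """One round of classical multiplicative updates for X ~ W H."""
    # H <- H * (W^T X) / (W^T W H), element-wise
    Wt = transpose(W)
    num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
    H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
         for rh, rn, rd in zip(H, num, den)]
    # W <- W * (X H^T) / (W H H^T), element-wise
    Ht = transpose(H)
    num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
    W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
         for rw, rn, rd in zip(W, num, den)]
    return W, H


# Toy non-negative matrix with several zeros, and fixed positive starting factors.
X = [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0], [2.0, 0.0, 4.0]]
W = [[1.0, 0.5], [0.5, 1.0], [1.0, 0.5]]
H = [[1.0, 0.5, 1.0], [0.5, 1.0, 0.5]]

error_before = squared_error(X, W, H)
for _ in range(200):
    W, H = nmf_step(X, W, H)
error_after = squared_error(X, W, H)
print(error_before, error_after)
```

The updates only multiply by non-negative ratios, so factors that start non-negative stay non-negative without any explicit projection step.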

Multi-view feature selection for identifying gene markers: a diversified biological data driven approach

BMC Bioinformatics ◽

10.1186/s12859-020-03810-0 ◽

2020 ◽

Vol 21 (S18) ◽

Author(s):

Sudipta Acharya ◽

Laizhong Cui ◽

Yi Pan

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Selection ◽

Marker Gene ◽

Biological Data ◽

Protein Interaction Data ◽

Marker Genes ◽

Data Sets ◽

Gene Markers ◽

Multi Objective

Abstract. Background: In recent years, the use of multiple genomic and proteomic sources to investigate challenging bioinformatics problems has become immensely popular among researchers. One such problem is feature or gene selection: identifying relevant and non-redundant marker genes from high-dimensional gene expression data sets. In that context, designing an efficient feature selection algorithm that exploits knowledge from multiple biological resources may be an effective way to understand the spectrum of cancer or other diseases, with applications in specific epidemiology for a particular population. Results: In the current article, we formulate feature selection and marker gene detection as a multi-view multi-objective clustering problem. We propose an Unsupervised Multi-View Multi-Objective clustering-based gene selection approach called UMVMO-select. Three important resources of biological data (gene ontology, protein interaction data, protein sequence) along with gene expression values are collectively used to design two different views. UMVMO-select aims to reduce the gene space with no or minimal compromise in sample classification efficiency and determines relevant and non-redundant gene markers from three cancer gene expression benchmark data sets. Conclusion: A thorough comparative analysis has been performed against five clustering and nine existing feature selection methods with respect to several internal and external validity metrics. The results show the superiority of the proposed method. Reported results are also validated through a biological significance test and heatmap plotting.

A novel prognostic model based on epithelial-mesenchymal transition-related genes predicts patient survival in gastric cancer

World Journal of Surgical Oncology ◽

10.1186/s12957-021-02329-9 ◽

2021 ◽

Vol 19 (1) ◽

Author(s):

Wanting Song ◽

Yi Bai ◽

Jialin Zhu ◽

Fanxin Zeng ◽

Chunmeng Yang ◽

...

Keyword(s):

Gene Expression ◽

Gastric Cancer ◽

Risk Model ◽

Cox Regression ◽

Epithelial Mesenchymal Transition ◽

Risk Groups ◽

The Cancer Genome Atlas ◽

Critical Function ◽

Mesenchymal Transition ◽

Patient Prognosis

Abstract. Background: Gastric cancer (GC) represents a major malignancy and is the third deadliest cancer globally. Several lines of evidence indicate that the epithelial-mesenchymal transition (EMT) has a critical function in the development of gastric cancer. Although many molecular biomarkers have been identified, a precise risk model is still necessary to help doctors determine patient prognosis in GC. Methods: Gene expression data and clinical information for GC were acquired from The Cancer Genome Atlas (TCGA) database, and 200 EMT-related genes (ERGs) from the Molecular Signatures Database (MSigDB). Then, ERGs correlated with patient prognosis in GC were assessed by univariable and multivariable Cox regression analyses. Next, a risk score formula was established for evaluating patient outcome in GC and validated by survival and ROC curves. In addition, Kaplan-Meier curves were generated to assess the associations of the clinicopathological data with prognosis. A cohort from the Gene Expression Omnibus (GEO) database was used for validation. Results: Six EMT-related genes, CDH6, COL5A2, ITGAV, MATN3, PLOD2, and POSTN, were identified. Based on the risk model, GC patients were assigned to high- and low-risk groups. The results revealed that the model had good performance in predicting patient prognosis in GC. Conclusions: We constructed a prognosis risk model for GC and verified its performance; the model may help doctors predict patient prognosis.
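
A risk score of the kind described is usually a weighted sum of gene expression values, with the weights taken from the multivariable Cox model. The sketch below assumes that form and assumes a median cutoff for the high- and low-risk groups; the abstract states neither explicitly. The coefficients and patient values are invented for illustration and are not the fitted values from the study.

```python
from statistics import median

# Hypothetical coefficients for the six reported genes (NOT the study's values).
COEFS = {
    "CDH6": 0.10, "COL5A2": 0.20, "ITGAV": 0.15,
    "MATN3": 0.05, "PLOD2": 0.30, "POSTN": 0.25,
}


def risk_score(expression, coefs=COEFS):
    """Weighted sum of expression values over the signature genes."""
    return sum(coefs[gene] * expression[gene] for gene in coefs)


def assign_groups(scores):
    """Label scores above the cohort median 'high', all others 'low'."""
    cutoff = median(scores)
    return ["high" if s > cutoff else "low" for s in scores]


# Three made-up patients with uniform expression levels of 1, 2 and 3.
patients = [{gene: level for gene in COEFS} for level in (1.0, 2.0, 3.0)]
scores = [risk_score(p) for p in patients]
print(assign_groups(scores))
```

A median split guarantees balanced groups within the training cohort, but the same numeric cutoff may divide an external validation cohort unevenly.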

Novel Epigenetic Eight-Gene Signature Predictive of Poor Prognosis and MSI-Like Phenotype in Human Metastatic Colorectal Carcinomas

Cancers ◽

10.3390/cancers13010158 ◽

2021 ◽

Vol 13 (1) ◽

pp. 158

Author(s):

Valentina Condelli ◽

Giovanni Calice ◽

Alessandra Cassano ◽

Michele Basso ◽

Maria Grazia Rodriquenz ◽

...

Keyword(s):

Gene Expression ◽

Poor Prognosis ◽

CpG Island ◽

Colon Adenocarcinoma ◽

Gene Signature ◽

Gene Expression Omnibus ◽

The Cancer Genome Atlas ◽

CpG Island Methylator Phenotype ◽

Prognostic Information ◽

Global Methylation

Epigenetics is involved in tumor progression and drug resistance in human colorectal carcinoma (CRC). This study addressed the hypothesis that DNA methylation profiling may predict the clinical behavior of metastatic CRCs (mCRCs). The global methylation profile of two human mCRC subgroups with significantly different outcomes was analyzed and compared with gene expression and methylation data from The Cancer Genome Atlas COlon ADenocarcinoma (TCGA COAD) and the NCBI Gene Expression Omnibus repository (GEO) GSE48684 mCRC datasets to identify a prognostic signature of functionally methylated genes. A novel epigenetic signature of eight hypermethylated genes was characterized that was able to identify mCRCs with poor prognosis, which had a CpG-island methylator phenotype (CIMP)-high and microsatellite instability (MSI)-like phenotype. Interestingly, methylation events were enriched in genes located on the q-arm of chromosomes 13 and 20, two chromosomal regions with gain/loss alterations associated with adenoma-to-carcinoma progression. Finally, the expression of the eight-gene signature and MSI-enriching genes was confirmed in oxaliplatin- and irinotecan-resistant CRC cell lines. These data reveal that the hypermethylation of specific genes may provide prognostic information that is able to identify a subgroup of mCRCs with poor prognosis.

Screening of gene markers related to the prognosis of metastatic skin cutaneous melanoma based on Logit regression and survival analysis

BMC Medical Genomics ◽

10.1186/s12920-021-00923-0 ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Guoliang Jia ◽

Zheyu Song ◽

Zhonghang Xu ◽

Youmao Tao ◽

Yuanyu Wu ◽

...

Keyword(s):

Gene Expression ◽

Cutaneous Melanoma ◽

Expression Profiles ◽

The Cancer Genome Atlas ◽

Training Dataset ◽

Validation Dataset ◽

Survival Prognosis ◽

Gene Markers ◽

Clinical Prognosis ◽

Logit Regression

Abstract. Background: Bioinformatics was used to analyze the skin cutaneous melanoma (SKCM) gene expression profile to provide a theoretical basis for further studying the mechanism underlying metastatic SKCM and the clinical prognosis. Methods: We downloaded the gene expression profiles of 358 metastatic and 102 primary (nonmetastatic) CM samples from The Cancer Genome Atlas (TCGA) database as a training dataset and the GSE65904 dataset from the National Center for Biotechnology Information database as a validation dataset. Differentially expressed genes (DEGs) were screened using the limma package in R 3.4.1, and prognosis-related feature DEGs were screened using Logit regression (LR) and survival analyses. We also used the STRING online database, Cytoscape software, and Database for Annotation, Visualization and Integrated Discovery software for protein–protein interaction network, Gene Ontology, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses based on the screened DEGs. Results: Of the 876 DEGs selected, 11 (ZNF750, NLRP6, TGM3, KRTDAP, CAMSAP3, KRT6C, CALML5, SPRR2E, CD3G, RTP5, and FAM83C) were screened using LR analysis. The survival prognosis of the nonmetastatic group was better than that of the metastatic group in both the TCGA training and validation datasets. The 11 DEGs were involved in 9 KEGG signaling pathways, and of these 11 DEGs, CALML5 was a feature DEG involved in the melanogenesis pathway, 12 targets of which were collected. Conclusion: The feature DEGs screened, such as CALML5, are related to the prognosis of metastatic CM according to LR. Our results provide new ideas for exploring the molecular mechanism underlying CM metastasis and finding new diagnostic and prognostic markers.
