SOMDE: A scalable method for identifying spatially variable genes with self-organizing map

2020 ◽  
Author(s):  
Minsheng Hao ◽  
Kui Hua ◽  
Xuegong Zhang

Abstract: Recent developments in spatial transcriptomic sequencing technologies provide powerful tools for understanding cells in the physical context of tissue micro-environments. A fundamental task in spatial gene expression analysis is to identify genes with spatially variable expression patterns, or spatially variable genes (SVgenes). Several computational methods have been developed for this task, but their high computational complexity limits their scalability to the latest and future large-scale spatial expression data. We present SOMDE, an efficient method for identifying SVgenes in large-scale spatial expression data. SOMDE uses a self-organizing map (SOM) to cluster neighboring cells into nodes, and then uses a Gaussian process to fit the node-level spatial gene expression to identify SVgenes. Experiments show that SOMDE is about 5-50 times faster than existing methods with comparable results. The adjustable resolution of SOMDE makes it the only method that can give results in ~5 minutes on large datasets of more than 20,000 sequencing sites. SOMDE is available as a Python package on PyPI at https://pypi.org/project/somde.
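The first stage described in this abstract (clustering neighboring sites into SOM nodes and summarizing expression per node) can be sketched in plain Python. This is a minimal illustration and not the somde package's API: the names `train_som`, `nearest_node` and `aggregate_by_node`, the linear decay schedules, and the use of the mean as the node-level summary are assumptions made for the sketch. The Gaussian process step is omitted.

```python
import math
import random


def nearest_node(nodes, x):
    """Return the grid key of the best-matching unit for point x."""
    return min(nodes, key=lambda k: sum((a - b) ** 2
                                        for a, b in zip(nodes[k], x)))


def train_som(coords, grid_w, grid_h, n_iter=2000, lr0=0.5, seed=0):
    """Fit a small SOM to site coordinates (hypothetical helper)."""
    rng = random.Random(seed)
    sigma0 = max(grid_w, grid_h) / 2.0
    # Each grid position (i, j) holds a weight vector in coordinate space,
    # initialized from a randomly chosen site.
    nodes = {(i, j): list(rng.choice(coords))
             for i in range(grid_w) for j in range(grid_h)}
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                  # assumed linear decay
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # assumed linear decay
        x = rng.choice(coords)
        bmu = nearest_node(nodes, x)
        for (i, j), w in nodes.items():
            # Neighborhood weight depends on distance on the SOM grid.
            grid_d2 = (i - bmu[0]) ** 2 + (j - bmu[1]) ** 2
            g = math.exp(-grid_d2 / (2.0 * sigma ** 2))
            for d in range(len(w)):
                w[d] += lr * g * (x[d] - w[d])
    return nodes


def aggregate_by_node(coords, expr, nodes):
    """Assign each site to its nearest node and average expression per node."""
    assignment = [nearest_node(nodes, c) for c in coords]
    sums, counts = {}, {}
    for key, value in zip(assignment, expr):
        sums[key] = sums.get(key, 0.0) + value
        counts[key] = counts.get(key, 0) + 1
    node_expr = {k: sums[k] / counts[k] for k in sums}
    return node_expr, assignment
```

In this sketch the grid size plays the role of the adjustable resolution: a smaller grid yields fewer nodes, so any downstream spatial model is fitted to fewer points.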

2009 ◽  
Vol 07 (04) ◽  
pp. 645-661 ◽  
Author(s):  
XIN CHEN

There is an increasing interest in clustering time course gene expression data to investigate a wide range of biological processes. However, developing a clustering algorithm ideal for time course gene expression data is still challenging. As timing is an important factor in defining true clusters, a clustering algorithm should explore expression correlations between time points in order to achieve a high clustering accuracy. Moreover, inter-cluster gene relationships are often desired in order to facilitate the computational inference of biological pathways and regulatory networks. In this paper, a new clustering algorithm called CurveSOM is developed to offer both features above. It first represents each gene by a cubic smoothing spline fitted to the time course expression profile, and then groups genes into clusters by applying a self-organizing map-based clustering on the resulting splines. CurveSOM has been tested on three well-studied yeast cell cycle datasets, and compared with four popular programs including Cluster 3.0, GENECLUSTER, MCLUST, and SSClust. The results show that CurveSOM is a very promising tool for the exploratory analysis of time course expression data, as it is not only able to group genes into clusters with high accuracy but also able to find true time-shifted correlations of expression patterns across clusters.
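The notion of a time-shifted correlation between two expression profiles can be illustrated with a lagged Pearson correlation. The sketch below is not the CurveSOM algorithm (it involves neither splines nor a self-organizing map); the function names `pearson` and `best_lag` are hypothetical and the profiles are assumed to be sampled at equally spaced time points.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for a flat profile
    return cov / math.sqrt(var_a * var_b)


def best_lag(a, b, max_lag):
    """Return (lag, r) maximizing the correlation of a[t] with b[t + lag]."""
    best = (0, -2.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = a[:len(a) - lag], b[lag:]
        else:
            x, y = a[-lag:], b[:len(b) + lag]
        r = pearson(x, y)
        if r > best[1]:
            best = (lag, r)
    return best
```

A positive lag means the second profile follows the first; `max_lag` should stay well below the profile length so that each comparison keeps enough overlapping time points.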


2015 ◽  
Vol 2015 ◽  
pp. 1-13 ◽  
Author(s):  
Rachel Caldwell ◽  
Yan-Xia Lin ◽  
Ren Zhang

There is a continuing interest in the analysis of gene architecture and gene expression to determine the relationship that may exist. Advances in high-quality sequencing technologies and large-scale resource datasets have increased the understanding of relationships and cross-referencing of expression data to the large genome data. Although a negative correlation between expression level and gene (especially transcript) length has been generally accepted, there have been some conflicting results arising from the literature concerning the impacts of different regions of genes, and the underlying reason is not well understood. The research aims to apply quantile regression techniques for statistical analysis of coding and noncoding sequence length and gene expression data in the plant, Arabidopsis thaliana, and fruit fly, Drosophila melanogaster, to determine if a relationship exists and if there is any variation or similarities between these species. The quantile regression analysis found that the coding sequence length and gene expression correlations varied, and similarities emerged for the noncoding sequence length (5′ and 3′ UTRs) between animal and plant species. In conclusion, the information described in this study provides the basis for further exploration into gene regulation with regard to coding and noncoding sequence length.
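For readers unfamiliar with quantile regression, the following sketch fits a single straight line for a chosen quantile `tau` by minimizing the pinball (check) loss. It is a brute-force illustration for small inputs, not the software used in the study: it relies on the known property that, for a two-parameter linear fit, an optimal line passes through at least two data points, and simply tries every such pair. The names `pinball_loss` and `quantile_line` are hypothetical.

```python
from itertools import combinations


def pinball_loss(residuals, tau):
    """Check loss: positive residuals cost tau, negative ones cost 1 - tau."""
    return sum(tau * r if r >= 0 else (tau - 1.0) * r for r in residuals)


def quantile_line(x, y, tau):
    """Return (intercept, slope) of a line minimizing the pinball loss.

    Brute force over all pairs of points with distinct x values,
    so only suitable for small illustrative datasets.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line is not a valid regression line
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        residuals = [yk - (intercept + slope * xk) for xk, yk in zip(x, y)]
        loss = pinball_loss(residuals, tau)
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    if best is None:
        raise ValueError("need at least two points with distinct x values")
    return best[1], best[2]
```

With `tau=0.5` this gives a median regression line, which is far less sensitive to a single extreme expression value than a least-squares fit; other values of `tau` describe how the upper or lower part of the expression distribution changes with sequence length.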


2021 ◽  
Author(s):  
Xiangyu Liu ◽  
Zhengchang Su ◽  
Guojun Li

Abstract Background: Identifying significant biclusters of genes with specific expression patterns is an effective approach to reveal functionally correlated genes in gene expression data. However, existing algorithms are limited to finding either broad or narrow biclusters, but not both, because they fail to balance effectiveness and efficiency. Methods: We developed a new algorithm, ARBic, which can accurately identify meaningful biclusters of any shape, broad or narrow, in a large-scale gene expression data matrix, even when the values in the biclusters to be identified have the same distribution as the background data. ARBic integrates column-based and row-based strategies into the biclustering procedure. The column-based strategy, borrowed from ReBic, a recently published biclustering tool, favors narrow biclusters. The row-based strategy, newly designed in this article, repeatedly finds a longest path in a specific directed graph and favors broader ones. Results and Conclusion: When tested and compared with seven other salient biclustering algorithms on simulated datasets, ARBic achieved recovery, relevance and F1 scores 29% higher than the second-best algorithm. Furthermore, ARBic substantially outperforms all of them on real datasets and is robust to noise, bicluster shape and dataset type. Code: https://github.com/holyzews/ARBic Data: https://doi.org/10.5281/zenodo.5121018
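The row-based strategy is described as repeatedly finding a longest path in a directed graph. The abstract does not say how that graph is built, so the sketch below only shows the generic subproblem under an explicit assumption: the graph is acyclic, in which case a longest path can be found by memoized depth-first search. The function name `longest_path` and the edge-list input format are hypothetical.

```python
from functools import lru_cache


def longest_path(edges):
    """Return one longest path (as a list of nodes) in a directed acyclic graph.

    `edges` is an iterable of (source, target) pairs. The graph is assumed
    to be acyclic; a cycle would cause unbounded recursion.
    """
    successors = {}
    for u, v in edges:
        successors.setdefault(u, []).append(v)
        successors.setdefault(v, [])
    if not successors:
        return []

    @lru_cache(maxsize=None)
    def best_from(u):
        # Longest path starting at u, stored as a tuple of nodes.
        best = (u,)
        for v in successors[u]:
            candidate = (u,) + best_from(v)
            if len(candidate) > len(best):
                best = candidate
        return best

    return list(max((best_from(u) for u in successors), key=len))
```

Memoization means each node's best continuation is computed once, so the search visits every edge a single time. Storing whole paths as tuples keeps the example short; a production version would store only the best successor of each node.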


2015 ◽  
Vol 2015 ◽  
pp. 1-8 ◽  
Author(s):  
Ning Ye ◽  
Hengfu Yin ◽  
Jingjing Liu ◽  
Xiaogang Dai ◽  
Tongming Yin

The huge amount of gene expression data generated by microarray and next-generation sequencing technologies presents challenges in exploiting its biological meaning. When searching for coexpressed genes, the data mining process is largely affected by the selection of algorithms. Thus, it is highly desirable to provide multiple algorithm options in a user-friendly analytical toolkit to explore gene expression signatures. For this purpose, we developed GESearch, an interactive graphical user interface (GUI) toolkit, which is written in MATLAB and supports a variety of gene expression data files. This analytical toolkit provides four models, including the mean, the regression, the delegate, and the ensemble models, to identify coexpressed genes, and enables users to filter data and to select gene expression patterns by browsing the display window or by importing knowledge-based genes. The utility of this analytical toolkit is demonstrated by analyzing two real-life microarray datasets from cell-cycle experiments. Overall, we have developed an interactive GUI toolkit that allows users to choose among multiple algorithms for analyzing gene expression signatures.


2019 ◽  
Author(s):  
Lara H Urban ◽  
Christian W Remmele ◽  
Marcus Dittrich ◽  
Roland F Schwarz ◽  
Tobias Müller

Abstract Objective: The biological interpretation of gene expression measurements is a challenging task. While ordination methods are routinely used to identify clusters of samples or co-expressed genes, these methods do not take sample or gene annotations into account. We aim to provide a tool that allows users of all backgrounds to assess and visualize the intrinsic correlation structure of complex annotated gene expression data and discover the covariates that jointly affect expression patterns. Results: The Bioconductor package covRNA provides a convenient and fast interface for testing and visualizing complex relationships between sample and gene covariates mediated by gene expression data in an entirely unsupervised setting. The relationships between sample and gene covariates are tested by statistical permutation tests and visualized by ordination. The methods are inspired by the fourthcorner and RLQ analyses used in ecological research for the analysis of species abundance data, which we modified to make them suitable for the distributional characteristics of both RNA-Seq read counts and microarray intensities, and to provide a high-performance parallelized implementation for the analysis of large-scale gene expression data on multi-core computational systems. covRNA provides additional modules for unsupervised gene filtering and plotting functions to ensure a smooth and coherent analysis workflow.


Mathematics ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 772
Author(s):  
Seonghun Kim ◽  
Seockhun Bae ◽  
Yinhua Piao ◽  
Kyuri Jo

Genomic profiles of cancer patients such as gene expression have become a major source to predict responses to drugs in the era of personalized medicine. As large-scale drug screening data with cancer cell lines are available, a number of computational methods have been developed for drug response prediction. However, few methods incorporate both gene expression data and the biological network, which can harbor essential information about the underlying process of the drug response. We proposed an analysis framework called DrugGCN for prediction of Drug response using a Graph Convolutional Network (GCN). DrugGCN first generates a gene graph by combining a Protein-Protein Interaction (PPI) network and gene expression data with feature selection of drug-related genes, and the GCN model detects the local features such as subnetworks of genes that contribute to the drug response by localized filtering. We demonstrated the effectiveness of DrugGCN using biological data showing its high prediction accuracy among the competing methods.
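As a rough illustration of how a graph convolution mixes each gene's features with those of its network neighbors, the sketch below implements one first-order propagation step, ReLU(D^-1/2 (A + I) D^-1/2 X W), in plain Python. The abstract does not specify which graph-convolution variant DrugGCN uses, so this particular rule, the function names `matmul` and `gcn_layer`, and the toy inputs are assumptions for illustration only.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj, features, weights):
    """One graph-convolution step with self-loops and symmetric normalization.

    adj:      n x n adjacency matrix of the gene graph (0/1 entries)
    features: n x f matrix, one row of features per gene
    weights:  f x h matrix of layer weights
    """
    n = len(adj)
    # Add self-loops so each gene keeps part of its own signal.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in a_hat]
    normalized = [[a_hat[i][j] / math.sqrt(degree[i] * degree[j])
                   for j in range(n)]
                  for i in range(n)]
    aggregated = matmul(normalized, features)  # average over neighbors
    projected = matmul(aggregated, weights)    # linear map on the features
    return [[max(0.0, v) for v in row] for row in projected]  # ReLU
```

For two connected genes with expression values 1 and 3 and an identity weight, both outputs become 2: each gene's value is blended with its neighbor's, which is the sense in which the filter is "localized" to the graph neighborhood.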


2008 ◽  
Vol 5 (2) ◽  
Author(s):  
Li Teng ◽  
Laiwan Chan

Summary: Traditional analysis of gene expression profiles uses clustering to find groups of coexpressed genes that have similar expression patterns. However, clustering is time consuming and can be difficult for very large-scale datasets. We propose the idea of Discovering Distinct Patterns (DDP) in gene expression profiles. Since the patterns shown by gene expression reveal regulatory mechanisms, it is important to find all the different patterns existing in a dataset when there is little prior knowledge. It is also a helpful start before undertaking further analysis. We propose an algorithm for DDP that iteratively picks out pairs of gene expression patterns which have the largest dissimilarities. This method can also be used as preprocessing to initialize centers for clustering methods such as K-means. Experiments on both synthetic and real gene expression datasets show that our method is very effective in finding distinct patterns which have gene functional significance and is also efficient.
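The abstract gives only a one-sentence description of the DDP procedure, so the following is one possible reading rather than the authors' algorithm: repeatedly take the most dissimilar pair among the profiles not yet selected and add both to the set of distinct patterns. The function names, the Euclidean dissimilarity and the `n_pairs` stopping rule are all assumptions made for the sketch.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def pick_distinct_patterns(profiles, n_pairs, dissimilarity=euclidean):
    """Return indices of profiles chosen as mutually distinct patterns.

    At each step the most dissimilar remaining pair is selected and
    removed from the pool (hypothetical reading of the DDP idea).
    """
    remaining = list(range(len(profiles)))
    chosen = []
    for _ in range(n_pairs):
        if len(remaining) < 2:
            break
        i, j = max(
            combinations(remaining, 2),
            key=lambda pair: dissimilarity(profiles[pair[0]],
                                           profiles[pair[1]]),
        )
        chosen.extend([i, j])
        remaining.remove(i)
        remaining.remove(j)
    return chosen
```

The selected profiles could then serve as initial centers for K-means, as the abstract suggests. Note that this naive version compares all pairs at every step, so it would not scale to large datasets without further optimization.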


2005 ◽  
Vol 289 (4) ◽  
pp. L545-L553 ◽  
Author(s):  
Joseph Zabner ◽  
Todd E. Scheetz ◽  
Hakeem G. Almabrazi ◽  
Thomas L. Casavant ◽  
Jian Huang ◽  
...  

Cystic fibrosis (CF) is caused by mutations in the cystic fibrosis transmembrane conductance regulator (CFTR), an epithelial chloride channel regulated by phosphorylation. Most of the disease-associated morbidity is the consequence of chronic lung infection with progressive tissue destruction. As an approach to investigate the cellular effects of CFTR mutations, we used large-scale microarray hybridization to contrast the gene expression profiles of well-differentiated primary cultures of human CF and non-CF airway epithelia grown under resting culture conditions. We surveyed the expression profiles for 10 non-CF and 10 ΔF508 homozygote samples. Of the 22,283 genes represented on the Affymetrix U133A GeneChip, we found evidence of significant changes in expression in 24 genes by two-sample t-test (P < 0.00001). A second, three-filter method of comparative analysis found no significant differences between the groups. The levels of CFTR mRNA were comparable in both groups. There were no significant differences in the gene expression patterns between male and female CF specimens. There were 18 genes with significant increases and 6 genes with decreases in CF relative to non-CF samples. Although the function of many of the differentially expressed genes is unknown, one transcript that was elevated in CF, the KCl cotransporter (KCC4), is a candidate for further study. Overall, the results indicate that CFTR dysfunction has little direct impact on airway epithelial gene expression in samples grown under these conditions.
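The per-gene comparison can be sketched as a two-sample t statistic. The abstract does not say whether the pooled-variance (Student) or unequal-variance (Welch) form was used; the pooled form is assumed here, and the function names are hypothetical. The second helper does the simple arithmetic behind the stringent threshold: the number of false positives expected if no gene were truly different.

```python
import math
import statistics


def pooled_t_statistic(a, b):
    """Student two-sample t statistic and degrees of freedom.

    Assumes equal variances in both groups (pooled variance).
    """
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    t = (statistics.mean(a) - statistics.mean(b)) / std_err
    return t, n_a + n_b - 2


def expected_false_positives(n_tests, alpha):
    """Expected number of tests passing the threshold if every null is true."""
    return n_tests * alpha
```

With 10 samples per group the statistic has 18 degrees of freedom. For 22,283 genes tested at P < 0.00001, `expected_false_positives(22283, 0.00001)` is about 0.22, so fewer than one of the 24 reported genes would be expected to pass by chance alone.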


2021 ◽  
Author(s):  
Taylor Reiter ◽  
Rachel Montpetit ◽  
Ron Runnebaum ◽  
C. Titus Brown ◽  
Ben Montpetit

Abstract: Grapes grown in a particular geographic region often produce wines with consistent characteristics, suggesting there are site-specific factors driving recurrent fermentation outcomes. However, the relationship between site-specific factors, microbial metabolism, and wine fermentation outcomes is not well understood. Here, we used differences in Saccharomyces cerevisiae gene expression as a biosensor for differences among Pinot noir fermentations from 15 vineyard sites. We profiled time series gene expression patterns of primary fermentations, but fermentations proceeded at different rates, making analyses of these data with conventional differential expression tools difficult. This led us to develop a novel approach that combines diffusion mapping with continuous differential expression analysis. Using this method, we identified vineyard-specific deviations in gene expression, including changes in gene expression correlated with the activity of the non-Saccharomyces yeast Hanseniaspora uvarum, as well as with initial nitrogen concentrations in grape musts. These results highlight novel relationships between site-specific variables and Saccharomyces cerevisiae gene expression that are linked to repeated wine fermentation outcomes. In addition, we demonstrate that our analysis approach can extract biologically relevant gene expression patterns in other contexts (e.g., hypoxic response of Saccharomyces cerevisiae), indicating that this approach offers a general method for investigating asynchronous time series gene expression data.

Importance: While it is generally accepted that foods, in particular wine, possess sensory characteristics associated with or derived from their place of origin, we lack knowledge of the biotic and abiotic factors central to this phenomenon. We have used Saccharomyces cerevisiae gene expression as a biosensor to capture differences in fermentations of Pinot noir grapes from 15 vineyards across two vintages. We find that gene expression by non-Saccharomyces yeasts and initial nitrogen content in the grape must correlate with differences in gene expression among fermentations from these vintages. These findings highlight important relationships between site-specific variables and gene expression that can be used to understand, or possibly modify, wine fermentation outcomes. Our work also provides a novel analysis method for investigating asynchronous gene expression data sets that is able to reveal both global shifts and subtle differences in gene expression due to varied cell-environment interactions.

