ARBic: An All-Round Biclustering Algorithm for Analyzing Gene Expression Data

Mapping Intimacies ◽

10.21203/rs.3.rs-936551/v1 ◽

2021 ◽

Author(s):

Xiangyu Liu ◽

Zhengchang Su ◽

Guojun Li

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Large Scale ◽

Expression Patterns ◽

Data Matrix ◽

Expression Data ◽

Specific Expression ◽

Longest Path ◽

Background Data ◽

Effectiveness And Efficiency

Abstract Background: Identifying significant biclusters of genes with specific expression patterns is an effective approach to revealing functionally correlated genes in gene expression data. However, existing algorithms are limited to finding either broad or narrow biclusters, but not both, because they fail to balance effectiveness and efficiency. Methods: We developed a new algorithm, ARBic, which can accurately identify any meaningful biclusters, whether broad or narrow in shape, in a large-scale gene expression data matrix, even when the values in the biclusters to be identified have the same distribution as the background data. ARBic integrates column-based and row-based strategies into the biclustering procedure. The column-based strategy, borrowed from ReBic, a recently published biclustering tool, favors narrow biclusters. The row-based strategy, newly designed in this article, repeatedly finds a longest path in a specific directed graph and favors broader ones. Results and Conclusion: When tested and compared with seven other salient biclustering algorithms on simulated datasets, ARBic achieved recovery, relevance and F1 scores 29% higher than the second-best algorithm. Furthermore, ARBic substantially outperforms all of them on real datasets and is robust to noise, bicluster shape and dataset type. Code: https://github.com/holyzews/ARBic Data: https://doi.org/10.5281/zenodo.5121018
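The ARBic abstract mentions repeatedly finding a longest path in a directed graph but does not say how that graph is built. The following is only a generic sketch of the longest-path step, assuming the graph is acyclic with nodes numbered from 0; the function name `longest_path_dag` and the edge-list input are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict, deque


def longest_path_dag(num_nodes, edges):
    """Return one longest path (list of nodes) in a DAG with nodes 0..num_nodes-1.

    Hypothetical helper for illustration; not taken from the ARBic code base.
    """
    successors = defaultdict(list)
    indegree = [0] * num_nodes
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    # Kahn's algorithm produces a topological order of the nodes.
    queue = deque(n for n in range(num_nodes) if indegree[n] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != num_nodes:
        raise ValueError("graph contains a cycle")

    # best[v] is the number of edges on the longest path ending at v.
    best = [0] * num_nodes
    parent = [None] * num_nodes
    for u in order:
        for v in successors[u]:
            if best[u] + 1 > best[v]:
                best[v] = best[u] + 1
                parent[v] = u

    # Walk back from the node where the longest path ends.
    end = max(range(num_nodes), key=best.__getitem__)
    path = []
    while end is not None:
        path.append(end)
        end = parent[end]
    return path[::-1]
```

Processing nodes in topological order means each edge is relaxed once, so the search runs in time linear in the number of nodes and edges.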

covRNA - Discovering covariate associations in large-scale gene expression data

10.21203/rs.2.17618/v1 ◽

2019 ◽

Author(s):

Lara H Urban ◽

Christian W Remmele ◽

Marcus Dittrich ◽

Roland F Schwarz ◽

Tobias Müller

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

High Performance ◽

Large Scale ◽

Expression Patterns ◽

Species Abundance ◽

Expression Data ◽

Analysis Workflow ◽

Biological Interpretation ◽

Complex Relationships

Abstract Objective: The biological interpretation of gene expression measurements is a challenging task. While ordination methods are routinely used to identify clusters of samples or co-expressed genes, these methods do not take sample or gene annotations into account. We aim to provide a tool that allows users of all backgrounds to assess and visualize the intrinsic correlation structure of complex annotated gene expression data and discover the covariates that jointly affect expression patterns. Results: The Bioconductor package covRNA provides a convenient and fast interface for testing and visualizing complex relationships between sample and gene covariates mediated by gene expression data in an entirely unsupervised setting. The relationships between sample and gene covariates are tested by statistical permutation tests and visualized by ordination. The methods are inspired by the fourthcorner and RLQ analyses used in ecological research for the analysis of species abundance data, which we modified to make them suitable for the distributional characteristics of both RNA-Seq read counts and microarray intensities, and to provide a high-performance parallelized implementation for the analysis of large-scale gene expression data on multi-core computational systems. covRNA provides additional modules for unsupervised gene filtering and plotting functions to ensure a smooth and coherent analysis workflow.
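The idea of testing a sample covariate against a gene covariate through the expression matrix can be sketched with a simple permutation test. This is a simplified illustration in Python, not the covRNA R interface: the function name `covariate_association_test`, the cross-product statistic, and the choice to permute only the sample labels are all assumptions made for the example.

```python
import numpy as np


def covariate_association_test(expr, sample_cov, gene_cov, n_perm=999, seed=0):
    """Permutation test linking one gene covariate to one sample covariate.

    expr: genes x samples matrix; sample_cov: one value per sample;
    gene_cov: one value per gene. Hypothetical helper, not the covRNA API.
    """
    expr = np.asarray(expr, dtype=float)
    x = np.asarray(sample_cov, dtype=float)
    y = np.asarray(gene_cov, dtype=float)
    x = x - x.mean()
    y = y - y.mean()

    def statistic(sample_values):
        # Cross-product of gene covariate, expression matrix and sample covariate.
        return float(y @ expr @ sample_values)

    observed = statistic(x)
    rng = np.random.default_rng(seed)
    # Null distribution: shuffle the sample covariate across samples.
    null = np.array([statistic(rng.permutation(x)) for _ in range(n_perm)])
    # Two-sided p-value with the usual +1 correction.
    p_value = (np.sum(np.abs(null) >= abs(observed)) + 1) / (n_perm + 1)
    return observed, p_value
```

A small p-value suggests that genes with a high covariate value tend to be expressed differently in samples with a high covariate value than would be expected by chance.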

Graph Convolutional Network for Drug Response Prediction Using Gene Expression Data

Mathematics ◽

10.3390/math9070772 ◽

2021 ◽

Vol 9 (7) ◽

pp. 772

Author(s):

Seonghun Kim ◽

Seockhun Bae ◽

Yinhua Piao ◽

Kyuri Jo

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Large Scale ◽

Drug Response ◽

Response Prediction ◽

Biological Data ◽

Expression Data ◽

Convolutional Network ◽

Essential Information ◽

Protein Protein Interaction

Genomic profiles of cancer patients such as gene expression have become a major source to predict responses to drugs in the era of personalized medicine. As large-scale drug screening data with cancer cell lines are available, a number of computational methods have been developed for drug response prediction. However, few methods incorporate both gene expression data and the biological network, which can harbor essential information about the underlying process of the drug response. We proposed an analysis framework called DrugGCN for prediction of Drug response using a Graph Convolutional Network (GCN). DrugGCN first generates a gene graph by combining a Protein-Protein Interaction (PPI) network and gene expression data with feature selection of drug-related genes, and the GCN model detects the local features such as subnetworks of genes that contribute to the drug response by localized filtering. We demonstrated the effectiveness of DrugGCN using biological data showing its high prediction accuracy among the competing methods.
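To make the notion of a graph convolution over a gene graph concrete, here is a minimal NumPy sketch of a single propagation step. It uses the common symmetric-normalization form (self-loops added, then degree-normalized), which is a simplification chosen for illustration; the DrugGCN abstract does not specify its filter, and the name `gcn_layer` and its arguments are assumptions.

```python
import numpy as np


def gcn_layer(adjacency, features, weights):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W).

    adjacency: genes x genes 0/1 matrix (for example, from a PPI network);
    features: genes x in_dim matrix (for example, expression values);
    weights: in_dim x out_dim matrix. Illustrative only.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    a_hat = adjacency + np.eye(adjacency.shape[0])  # add self-loops
    degree = a_hat.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Scale each entry by 1/sqrt(deg_i * deg_j).
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weights, 0.0)  # ReLU
```

Each gene's output is a weighted average of its own features and those of its direct neighbors, which is what makes the filter local to small subnetworks.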

GENE DISCOVERY METHODS FROM LARGE-SCALE GENE EXPRESSION DATA

Quantum Bio-Informatics III ◽

10.1142/9789814304061_0040 ◽

2010 ◽

Author(s):

AKIFUMI SHIMIZU ◽

KENTARO YANO

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Large Scale ◽

Gene Discovery ◽

Expression Data

Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types

10.1101/103069 ◽

2017 ◽

Cited By ~ 19

Author(s):

Hilary K. Finucane ◽

Yakir A. Reshef ◽

Verneri Anttila ◽

Kamil Slowikowski ◽

Alexander Gusev ◽

...

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Complex Disease ◽

Genome Wide Association Study ◽

Ex Vivo ◽

Cell Types ◽

Inhibitory Neurons ◽

Biliary Cirrhosis ◽

Expression Data ◽

Specific Expression

Abstract: Genetics can provide a systematic approach to discovering the tissues and cell types relevant for a complex disease or trait. Identifying these tissues and cell types is critical for following up on non-coding allelic function, developing ex-vivo models, and identifying therapeutic targets. Here, we analyze gene expression data from several sources, including the GTEx and PsychENCODE consortia, together with genome-wide association study (GWAS) summary statistics for 48 diseases and traits with an average sample size of 169,331, to identify disease-relevant tissues and cell types. We develop and apply an approach that uses stratified LD score regression to test whether disease heritability is enriched in regions surrounding genes with the highest specific expression in a given tissue. We detect tissue-specific enrichments at FDR < 5% for 34 diseases and traits across a broad range of tissues that recapitulate known biology. In our analysis of traits with observed central nervous system enrichment, we detect an enrichment of neurons over other brain cell types for several brain-related traits, enrichment of inhibitory over excitatory neurons for bipolar disorder but excitatory over inhibitory neurons for schizophrenia and body mass index, and enrichments in the cortex for schizophrenia and in the striatum for migraine. In our analysis of traits with observed immunological enrichment, we identify enrichments of T cells for asthma and eczema, B cells for primary biliary cirrhosis, and myeloid cells for Alzheimer's disease, which we validated with independent chromatin data. Our results demonstrate that our polygenic approach is a powerful way to leverage gene expression data for interpreting GWAS signal.

LSTrAP-Crowd: Prediction of novel components of bacterial ribosomes with crowd-sourced analysis of RNA sequencing data

10.1101/2020.04.20.005249 ◽

2020 ◽

Author(s):

Benedict Hew ◽

Qiao Wen Tan ◽

William Goh ◽

Jonathan Wei Xiong Ng ◽

Kenny Koh ◽

...

Keyword(s):

Gene Expression ◽

Protein Synthesis ◽

Rna Sequencing ◽

Gene Expression Data ◽

Large Scale ◽

Bacterial Resistance ◽

Expression Data ◽

Sequencing Data ◽

Novel Proteins ◽

Novel Antibiotics

Abstract: Bacterial resistance to antibiotics is a growing problem that is projected to cause more deaths than cancer in 2050. Consequently, novel antibiotics are urgently needed. Since more than half of the available antibiotics target the bacterial ribosomes, proteins that are involved in protein synthesis are thus prime targets for the development of novel antibiotics. However, experimental identification of these potential antibiotic target proteins can be labor-intensive and challenging, as these proteins are likely to be poorly characterized and specific to few bacteria. In order to identify these novel proteins, we established a Large-Scale Transcriptomic Analysis Pipeline in Crowd (LSTrAP-Crowd), where 285 individuals processed 26 terabytes of RNA-sequencing data of the 17 most notorious bacterial pathogens. In total, the crowd processed 26,269 RNA-seq experiments and used the data to construct gene co-expression networks, which were used to identify more than a hundred uncharacterized genes that were transcriptionally associated with protein synthesis. We provide the identity of these genes together with the processed gene expression data. The data can be used to identify other vulnerabilities or bacteria, while our approach demonstrates how the processing of gene expression data can be easily crowdsourced.
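A gene co-expression network of the kind mentioned in the LSTrAP-Crowd abstract can be sketched by correlating expression profiles and keeping strongly correlated pairs. The abstract does not state which similarity measure or cutoff the pipeline uses, so the Pearson correlation, the 0.8 threshold, and the function name `coexpression_edges` below are assumptions for illustration.

```python
import numpy as np


def coexpression_edges(expr, gene_names, threshold=0.8):
    """List gene pairs whose expression profiles are strongly correlated.

    expr: genes x samples matrix; gene_names: one label per row.
    Returns (gene_a, gene_b, correlation) tuples. Illustrative only.
    """
    # np.corrcoef treats each row as one variable, giving a genes x genes matrix.
    corr = np.corrcoef(np.asarray(expr, dtype=float))
    edges = []
    n_genes = len(gene_names)
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            if abs(corr[i, j]) >= threshold:
                edges.append((gene_names[i], gene_names[j], float(corr[i, j])))
    return edges
```

Uncharacterized genes that share an edge with known protein-synthesis genes would then be candidates for follow-up, in the spirit of the guilt-by-association reasoning the abstract describes.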

SOMDE: A scalable method for identifying spatially variable genes with self-organizing map

10.1101/2020.12.10.419549 ◽

2020 ◽

Author(s):

Minsheng Hao ◽

Kui Hua ◽

Xuegong Zhang

Keyword(s):

Gene Expression ◽

Large Scale ◽

Expression Patterns ◽

Self Organizing Map ◽

Expression Data ◽

Spatial Expression ◽

Variable Expression ◽

Sequencing Technologies ◽

Physical Context ◽

Variable Genes

Abstract: Recent developments in spatial transcriptomic sequencing technologies provide powerful tools for understanding cells in the physical context of tissue micro-environments. A fundamental task in spatial gene expression analysis is to identify genes with spatially variable expression patterns, or spatially variable genes (SVgenes). Several computational methods have been developed for this task, but their high computational complexity limits their scalability to the latest and future large-scale spatial expression data. We present SOMDE, an efficient method for identifying SVgenes in large-scale spatial expression data. SOMDE uses a self-organizing map (SOM) to cluster neighboring cells into nodes, and then uses a Gaussian Process to fit the node-level spatial gene expression to identify SVgenes. Experiments show that SOMDE is about 5-50 times faster than existing methods with comparable results. The adjustable resolution of SOMDE makes it the only method that can give results in ~5 minutes on large datasets of more than 20,000 sequencing sites. SOMDE is available as a Python package on PyPI at https://pypi.org/project/somde.

Defining transcription modules using large-scale gene expression data

Bioinformatics ◽

10.1093/bioinformatics/bth166 ◽

2004 ◽

Vol 20 (13) ◽

pp. 1993-2003 ◽

Cited By ~ 216

Author(s):

J. Ihmels ◽

S. Bergmann ◽

N. Barkai

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Large Scale ◽

Expression Data

Large-Scale Integration of MicroRNA and Gene Expression Data for Identification of Enriched MicroRNA–mRNA Associations in Biological Systems

Methods in Molecular Biology - MicroRNAs and the Immune System ◽

10.1007/978-1-60761-811-9_20 ◽

2010 ◽

pp. 297-315 ◽

Cited By ~ 28

Author(s):

Preethi H. Gunaratne ◽

Chad J. Creighton ◽

Michael Watson ◽

Jayantha B. Tennakoon

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Large Scale ◽

Biological Systems ◽

Expression Data ◽

Large Scale Integration ◽

Scale Integration

Building Gene Networks by Analyzing Gene Expression Profiles

Advanced Methodologies and Technologies in Medicine and Healthcare - Advances in Medical Diagnosis, Treatment, and Care ◽

10.4018/978-1-5225-7489-7.ch003 ◽

2019 ◽

pp. 27-44

Author(s):

Crescenzio Gallo

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Gene Networks ◽

Dna Microarrays ◽

Expression Profiles ◽

Expression Patterns ◽

Gene Expression Profiles ◽

Expression Data ◽

Gene Expressions ◽

Over Time

The possible applications of modeling and simulation in the field of bioinformatics are very extensive, ranging from understanding basic metabolic paths to exploring genetic variability. Experiments carried out with DNA microarrays allow researchers to measure expression levels for thousands of genes simultaneously, across different conditions and over time. A key step in the analysis of gene expression data is the detection of groups of genes that manifest similar expression patterns. In this chapter, the authors examine various methods for analyzing gene expression data, addressing the important topics of (1) selecting the most differentially expressed genes, (2) grouping them by means of their relationships, and (3) classifying samples based on gene expressions.

Processing Large-Scale, High-Dimension Genetic and Gene Expression Data

Handbook on Analyzing Human Genetic Data ◽

10.1007/978-3-540-69264-5_11 ◽

2009 ◽

pp. 307-330

Author(s):

Cliona Molony ◽

Solveig K. Sieberts ◽

Eric E. Schadt

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

High Dimension ◽

Large Scale ◽

Expression Data