Vertical integration methods for gene expression data analysis

Author(s):  
Mengyun Wu ◽  
Huangdi Yi ◽  
Shuangge Ma

Abstract Gene expression data have played an essential role in many biomedical studies. When the number of genes is large and sample size is limited, there is a ‘lack of information’ problem, leading to low-quality findings. To tackle this problem, both horizontal and vertical data integrations have been developed, where vertical integration methods collectively analyze data on gene expressions as well as their regulators (such as mutations, DNA methylation and miRNAs). In this article, we conduct a selective review of vertical data integration methods for gene expression data. The reviewed methods cover both marginal and joint analysis and supervised and unsupervised analysis. The main goal is to provide a sketch of the vertical data integration paradigm without digging into too many technical details. We also briefly discuss potential pitfalls, directions for future developments and application notes.

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Ewe Seng Ch’ng

Abstract Distinguishing bladder urothelial carcinomas from prostate adenocarcinomas among poorly differentiated carcinomas arising from the bladder neck requires a panel of lineage markers. Publicly available gene expression data from The Cancer Genome Atlas (TCGA) provide an avenue to examine the utility of these markers. This study aimed to verify the expression of urothelial and prostate lineage markers in the respective carcinomas and to determine the relative importance of these markers in making this distinction. Gene expression data for these markers were downloaded from the TCGA Pan-Cancer database for bladder and prostate carcinomas, and differential expression of the markers was analyzed. Standard linear discriminant analyses were applied to establish the relative importance of these markers in lineage determination and to construct the model that best makes the distinction. All urothelial lineage genes except the gene for uroplakin III were significantly expressed in bladder urothelial carcinomas (p < 0.001). In descending order of importance for distinguishing them from prostate adenocarcinomas, the genes for uroplakin II, S100P, GATA3 and thrombomodulin had high discriminant loadings (> 0.3). All prostate lineage genes were significantly expressed in prostate adenocarcinomas (p < 0.001). In descending order of importance for distinguishing them from bladder urothelial carcinomas, the genes for NKX3.1, prostate-specific antigen (PSA), prostate-specific acid phosphatase, prostein and prostate-specific membrane antigen had high discriminant loadings (> 0.3). A combination of gene expressions for uroplakin II, S100P, NKX3.1 and PSA approached 100% accuracy in tumor classification in both the training and validation sets. Mining of gene expression data thus shows that a combination of four lineage markers helps distinguish between bladder urothelial carcinomas and prostate adenocarcinomas.
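
To illustrate the kind of two-class linear discriminant analysis the abstract describes, the following is a minimal sketch of Fisher's discriminant on two features. It is not the study's model: the function names `fit_two_gene_lda` and `classify` are hypothetical, the expression values are invented toy numbers rather than TCGA data, and the two features merely stand in for one urothelial marker and one prostate marker. Equal class priors are assumed, so the decision threshold sits at the midpoint between the class means. The weights computed here are discriminant coefficients, not the discriminant loadings reported in the abstract.

```python
def fit_two_gene_lda(class_a, class_b):
    """Fit Fisher's linear discriminant for two classes and two features.

    Each class is a list of (gene_1, gene_2) expression pairs.
    Returns (weights, threshold); a score above the threshold means class_a.
    """
    def mean_vec(rows):
        n = len(rows)
        return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

    def scatter(rows, mu):
        # Within-class scatter terms: sum of squared and crossed deviations.
        sxx = sum((r[0] - mu[0]) ** 2 for r in rows)
        syy = sum((r[1] - mu[1]) ** 2 for r in rows)
        sxy = sum((r[0] - mu[0]) * (r[1] - mu[1]) for r in rows)
        return sxx, syy, sxy

    mu_a, mu_b = mean_vec(class_a), mean_vec(class_b)
    sa, sb = scatter(class_a, mu_a), scatter(class_b, mu_b)
    # Pooled within-class scatter matrix [[sxx, sxy], [sxy, syy]].
    sxx, syy, sxy = (sa[i] + sb[i] for i in range(3))
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("pooled scatter matrix is singular")

    # weights = inverse(pooled scatter) @ (mu_a - mu_b), via the 2x2 inverse.
    dx, dy = mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]
    weights = [(syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det]

    # Equal priors assumed: threshold is the score of the midpoint of the means.
    mid = [(mu_a[0] + mu_b[0]) / 2, (mu_a[1] + mu_b[1]) / 2]
    threshold = weights[0] * mid[0] + weights[1] * mid[1]
    return weights, threshold


def classify(model, sample, labels=("bladder", "prostate")):
    """Assign a sample to the first label if its score exceeds the threshold."""
    weights, threshold = model
    score = weights[0] * sample[0] + weights[1] * sample[1]
    return labels[0] if score > threshold else labels[1]


# Invented toy values: (urothelial marker, prostate marker) per sample.
bladder = [(9.0, 2.0), (8.0, 3.0), (10.0, 2.5), (9.5, 1.5)]
prostate = [(2.0, 9.0), (3.0, 8.5), (1.5, 10.0), (2.5, 9.5)]

model = fit_two_gene_lda(bladder, prostate)
predictions = [classify(model, s) for s in [(8.5, 2.0), (2.0, 8.0)]]
```

With more than two markers the same idea applies, but the scatter matrix would need a general linear solver rather than the closed-form 2x2 inverse used here.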


Author(s):  
Crescenzio Gallo

The possible applications of modeling and simulation in the field of bioinformatics are very extensive, ranging from understanding basic metabolic pathways to exploring genetic variability. Experiments carried out with DNA microarrays allow researchers to measure expression levels for thousands of genes simultaneously, across different conditions and over time. A key step in the analysis of gene expression data is the detection of groups of genes that manifest similar expression patterns. In this chapter, the authors examine various methods for analyzing gene expression data, addressing the important topics of (1) selecting the most differentially expressed genes, (2) grouping them by means of their relationships, and (3) classifying samples based on gene expressions.


2012 ◽  
Vol 4 (2) ◽  
pp. 300-318 ◽  
Author(s):  
Xinlei Wang ◽  
Min Chen ◽  
Arkady B. Khodursky ◽  
Guanghua Xiao

2009 ◽  
Vol 10 (1) ◽  
pp. 158 ◽  
Author(s):  
Jaume Mercade ◽  
Antonio Espinosa ◽  
Jose Enrique Adsuara ◽  
Rosa Adrados ◽  
Jordi Segura ◽  
...  

2012 ◽  
Vol 28 (17) ◽  
pp. 2283-2284
Author(s):  
Peter M. Krempl ◽  
Juergen Mairhofer ◽  
Gerald Striedner ◽  
Gerhard G. Thallinger

2019 ◽  
Author(s):  
Gregory J. Hunt ◽  
Johann A. Gagnon-Bartsch

Abstract Complex tissues are composed of a large number of different types of cells, each involved in a multitude of biological processes. Consequently, an important component of understanding such processes is understanding the cell-type composition of the tissues. Estimating cell-type composition using high-throughput gene expression data is known as cell-type deconvolution. In this paper, we first summarize the extensive deconvolution literature by identifying a common regression-like approach to deconvolution. We call this approach the Unified Deconvolution-as-Regression (UDAR) framework. While methods that fall under this framework all use a similar model, they fit it using data on different scales. Two popular scales for gene expression data are logarithmic and linear. Unfortunately, each of these scales has problems in the UDAR framework. Using log-scale gene expressions yields a biologically implausible model, and using linear-scale gene expressions leads to statistically inefficient estimators. To overcome these problems, we propose a new approach for cell-type deconvolution that works on a hybrid of the two scales. This new approach is biologically plausible and improves statistical efficiency. We compare the hybrid approach to other methods on simulations as well as a collection of eleven real benchmark datasets. Here, we find the hybrid approach to be accurate and robust.

Keywords: deconvolution, gene expression, microarray, RNA-seq
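
The regression view of deconvolution mentioned in the abstract can be sketched as follows. This is only an illustration of the linear-scale variant, not the hybrid method the authors propose and not any published tool: the function name `deconvolve`, the signature matrix and the mixture are all hypothetical. The sketch assumes the mixed-tissue expression of each gene is a proportion-weighted sum of cell-type reference expressions, and fits the proportions by non-negative least squares using projected gradient descent.

```python
def deconvolve(signature, mixture, n_iter=2000):
    """Estimate cell-type proportions from a linear-scale mixture.

    signature: list of rows, one per gene, each with one reference
               expression value per cell type.
    mixture:   observed expression of each gene in the mixed tissue.
    Minimizes ||signature @ w - mixture||^2 subject to w >= 0 by projected
    gradient descent, then rescales w so the proportions sum to one.
    """
    n_genes = len(signature)
    n_types = len(signature[0])
    # The squared Frobenius norm bounds the largest eigenvalue of S^T S,
    # so this step size keeps the iteration stable.
    step = 1.0 / sum(v * v for row in signature for v in row)
    weights = [1.0 / n_types] * n_types

    for _ in range(n_iter):
        residual = [
            sum(signature[g][k] * weights[k] for k in range(n_types)) - mixture[g]
            for g in range(n_genes)
        ]
        gradient = [
            sum(signature[g][k] * residual[g] for g in range(n_genes))
            for k in range(n_types)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        weights = [max(0.0, w - step * d) for w, d in zip(weights, gradient)]

    total = sum(weights)
    if total == 0:
        raise ValueError("all estimated weights are zero")
    return [w / total for w in weights]


# Hypothetical signature: four genes (rows) by two cell types (columns).
signature = [[10.0, 1.0], [1.0, 8.0], [5.0, 5.0], [2.0, 6.0]]
true_props = [0.3, 0.7]
# Noise-free mixture built from the assumed proportions.
mixture = [sum(s * p for s, p in zip(row, true_props)) for row in signature]

estimated = deconvolve(signature, mixture)
```

On this noise-free toy input the estimate recovers the assumed proportions. With real data, measurement noise on the linear scale is what motivates the efficiency concern raised in the abstract.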


2016 ◽  
Author(s):  
Alina Frolova ◽  
Vladyslav Bondarenko ◽  
Maria Obolenska

Abstract Background: According to the statistics of major public repositories, an overwhelming majority of existing and newly uploaded data originates from microarray experiments. Unfortunately, the potential of these data to bring new insights is limited by study-specific biases that arise from the small number of biological samples. Increasing the sample size by direct microarray data integration increases the statistical power to obtain a more precise estimate of gene expression in a population of individuals, resulting in lower false discovery rates. However, despite numerous recommendations for gene expression data integration, there is a lack of systematic comparison of the processing approaches aimed at assessing microarray platform diversity and ambiguous probeset-to-gene correspondence, which leads to a low number of studies applying integration. Results: Here, we investigated five different approaches to microarray data processing in comparison with RNA-seq data on breast cancer samples. We aimed to evaluate different probeset annotations as well as different procedures for choosing between probesets mapped to the same gene. We show that pipeline rankings are mostly preserved across Affymetrix and Illumina platforms. The BrainArray approach, based on updated annotation and redesigned probeset definitions, and the approach of choosing the probeset with the maximum average signal across the samples have the best correlation with RNA-seq, while averaging probeset signals and scoring the quality of the mapping of probe sequences to the transcripts of the targeted gene have worse correlation. Finally, randomly selecting one probeset among those mapped to the same gene significantly decreases the correlation with RNA-seq. Conclusion: We show that methods that rely on actual probeset signal intensities are advantageous over methods that consider only the biological characteristics of the probe sequences, and that cross-platform integration of datasets improves correlation with the RNA-seq data. We consider the results obtained in this paper a contribution to integrative analysis as a worthwhile alternative to the classical meta-analysis of multiple gene expression datasets.
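
Two of the probeset-selection procedures compared in the abstract, keeping the probeset with the maximum average signal and averaging all probesets of a gene, can be sketched as below. This is an assumption-laden illustration rather than the authors' pipeline: the function name `collapse_probesets`, the probeset identifiers, the gene symbols and the intensities are all hypothetical.

```python
from statistics import mean


def collapse_probesets(signals, probeset_to_gene, strategy="max_mean"):
    """Collapse probeset-level signals to one expression vector per gene.

    signals:          dict of probeset id -> list of per-sample intensities
    probeset_to_gene: dict of probeset id -> gene symbol
    strategy:         "max_mean" keeps the probeset with the highest average
                      signal across samples; "average" averages all
                      probesets of the gene sample by sample.
    """
    by_gene = {}
    for probeset, gene in probeset_to_gene.items():
        by_gene.setdefault(gene, []).append(probeset)

    collapsed = {}
    for gene, probesets in by_gene.items():
        if strategy == "max_mean":
            best = max(probesets, key=lambda p: mean(signals[p]))
            collapsed[gene] = list(signals[best])
        elif strategy == "average":
            # zip(*...) walks the samples, giving one column per sample.
            columns = zip(*(signals[p] for p in probesets))
            collapsed[gene] = [mean(col) for col in columns]
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return collapsed


# Hypothetical toy data: three probesets, two genes, three samples.
signals = {
    "ps1": [8.0, 9.0, 10.0],
    "ps2": [4.0, 5.0, 6.0],
    "ps3": [7.0, 7.0, 7.0],
}
probeset_to_gene = {"ps1": "GENE_A", "ps2": "GENE_A", "ps3": "GENE_B"}

max_mean = collapse_probesets(signals, probeset_to_gene, "max_mean")
averaged = collapse_probesets(signals, probeset_to_gene, "average")
```

For `GENE_A`, the first strategy returns the `ps1` signal unchanged, while the second blends `ps1` with the weaker `ps2`, which shows how averaging can dilute a strong probeset with a weak one.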

