Detect tissue heterogeneity in gene expression data with BioQC

BMC Genomics ◽  
2017 ◽  
Vol 18 (1) ◽  
Author(s):  
Jitao David Zhang ◽  
Klas Hatje ◽  
Gregor Sturm ◽  
Clemens Broger ◽  
Martin Ebeling ◽  
...  
2020 ◽  
Author(s):  
Gregor Sturm ◽  
Markus List ◽  
Jitao David Zhang

Background: Lack of reproducibility in gene expression studies has recently attracted much attention in and beyond the biomedical research community. Previous efforts have identified many underlying factors, such as batch effects and incorrect sample annotations. Recently, tissue heterogeneity, a consequence of the unintended profiling of cells from origins other than the tissue of interest, was proposed as a commonly ignored source of variance that exacerbates irreproducibility. Results: Here, we systematically analyzed 2,692 publicly available gene expression datasets, comprising 78,332 samples, for tissue heterogeneity. We found that tissue heterogeneity affects on average 5-15% of the samples, depending on the tissue type. We distinguish cases of severe heterogeneity, which may be caused by mistakes in annotation or sample handling, from cases of moderate heterogeneity, which are more likely caused by tissue infiltration or sample contamination. Conclusions: Tissue heterogeneity is a widespread issue in publicly available gene expression datasets and thus an important source of variance that should not be ignored. We advocate the application of quality control methods such as BioQC to detect tissue heterogeneity prior to mining or analysing gene expression data.


2021 ◽  
Vol 3 (3) ◽  
Author(s):  
Gregor Sturm ◽  
Markus List ◽  
Jitao David Zhang

Abstract: Lack of reproducibility in gene expression studies is a serious issue being actively addressed by the biomedical research community. Besides established factors such as batch effects and incorrect sample annotations, we recently reported tissue heterogeneity, a consequence of the unintended profiling of cells from origins other than the tissue of interest, as a source of variance. Although tissue heterogeneity exacerbates irreproducibility, its prevalence in gene expression data remains unknown. Here, we systematically analyse 2,667 publicly available gene expression datasets covering 76,576 samples. Using two independent data compendia and a reproducible, open-source software pipeline, we find that tissue heterogeneity affects between 1% and 40% of the samples, depending on the tissue type. We discover both cases of severe heterogeneity, which may be caused by mistakes in annotation or sample handling, and cases of moderate heterogeneity, which are likely caused by tissue infiltration or sample contamination. Our analysis establishes tissue heterogeneity as a widespread phenomenon in publicly available gene expression datasets and an important source of variance that should not be ignored. Consequently, we advocate the application of quality-control methods such as BioQC to detect tissue heterogeneity prior to mining or analysing gene expression data.


BMC Genomics ◽  
2018 ◽  
Vol 19 (1) ◽  
Author(s):  
Jitao David Zhang ◽  
Klas Hatje ◽  
Gregor Sturm ◽  
Clemens Broger ◽  
Martin Ebeling ◽  
...  

2006 ◽  
Vol 45 (05) ◽  
pp. 557-563 ◽  
Author(s):  
M. Jacobsen ◽  
D. Repsilber ◽  
A. Gutschmidt ◽  
A. Neher ◽  
K. Feldmann ◽  
...  

Summary. Objectives: Microarray analysis requires standardized specimens and evaluation procedures to achieve acceptable results. A major limitation of this method is heterogeneity in the cellular composition of tissue specimens, which frequently confounds data analysis. We introduce a linear model to deconfound gene expression data from tissue heterogeneity for genes exclusively expressed by a single cell type. Methods: Gene expression data are deconfounded from tissue heterogeneity effects by fitting an appropriate linear regression model. In our illustrative dataset, tissue heterogeneity is measured using flow cytometry. Gene expression is measured in parallel by real-time quantitative polymerase chain reaction (qPCR) and microarray analyses. The deconfounding is verified by protein quantification for the respective marker genes. Results: For our illustrative dataset, quantification of cell type proportions for peripheral blood mononuclear cells (PBMC) from tuberculosis patients and controls revealed differences in B cell and monocyte proportions between the two study groups, and thus heterogeneity for the tissue under investigation. Gene expression analyses reflected these differences in cell type distribution. Fitting an appropriate linear model allowed us to deconfound measured transcriptome levels from tissue heterogeneity effects. In the case of monocytes, additional differential expression at the single-cell level could be proposed. Protein quantification verified these deconfounded results. Conclusions: Deconfounding transcriptome analyses for cellular heterogeneity greatly improves the interpretability, and hence the validity, of transcriptome profiling results.
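
The regression idea in this summary can be illustrated with a minimal sketch. The abstract does not give the exact form of the authors' model, so the additive form assumed here (expression = intercept + slope × cell-type proportion + group effect) is only one plausible reading. The function name `fit_deconfounding_model`, the simulated proportions, and all numeric values are hypothetical and are not taken from the study. The simulated marker gene has no per-cell difference between groups, yet the groups differ in cell-type proportion, so a naive comparison of group means suggests a difference that disappears once the proportion is included as a covariate.

```python
import numpy as np


def fit_deconfounding_model(expression, proportion, group):
    """Ordinary least-squares fit of
    expression ~ intercept + slope * proportion + effect * group.

    Returns the three coefficients in that order. This additive form is an
    assumption made for illustration, not the published model.
    """
    design = np.column_stack([np.ones_like(proportion), proportion, group])
    coefficients, *_ = np.linalg.lstsq(design, expression, rcond=None)
    return coefficients


# Simulated data (hypothetical values): 20 controls (group 0), 20 cases (group 1).
rng = np.random.default_rng(0)
n_per_group = 20
group = np.repeat([0.0, 1.0], n_per_group)

# Cases have a higher proportion of the cell type that expresses the marker gene.
proportion = np.concatenate([
    np.linspace(0.05, 0.15, n_per_group),  # controls
    np.linspace(0.15, 0.25, n_per_group),  # cases
])

# Expression depends only on the proportion; there is no true group effect.
expression = 2.0 + 10.0 * proportion + rng.normal(0.0, 0.1, size=proportion.size)

# Naive comparison: difference in mean expression, confounded by composition.
naive_diff = expression[group == 1].mean() - expression[group == 0].mean()

# Adjusted comparison: group coefficient after accounting for the proportion.
intercept, slope, adjusted_effect = fit_deconfounding_model(
    expression, proportion, group
)

print(f"naive group difference:  {naive_diff:.3f}")
print(f"adjusted group effect:   {adjusted_effect:.3f}")
```

In this simulation the naive difference reflects cell composition alone, while the adjusted group coefficient stays close to zero. A clearly non-zero adjusted coefficient would instead point to differential expression within the cell type itself, which is the situation the summary describes for monocytes.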

