Monitoring of Technical Variation in Quantitative High-Throughput Datasets

2013 ◽  
Vol 12 ◽  
pp. CIN.S12862 ◽  
Author(s):  
Martin Lauss ◽  
Ilhami Visne ◽  
Albert Kriegner ◽  
Markus Ringnér ◽  
Göran Jönsson ◽  
...  

High-dimensional datasets can be confounded by variation from technical sources, such as batches. Undetected batch effects can have severe consequences for the validity of a study's conclusion(s). We evaluate high-throughput RNAseq and miRNAseq as well as DNA methylation and gene expression microarray datasets, mainly from the Cancer Genome Atlas (TCGA) project, with respect to technical and biological annotations. We observe technical bias in these datasets and discuss corrective interventions. We then suggest a general procedure to control study design, detect technical bias using linear regression of principal components, correct for batch effects, and re-evaluate principal components. This procedure is implemented in the R package swamp, and as graphical user interface software. In conclusion, high-throughput platforms that generate continuous measurements are sensitive to various forms of technical bias. For such data, monitoring of technical variation is an important analysis step.
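The detection step named in this abstract, linear regression of principal components on a technical annotation, can be illustrated with a minimal Python sketch. It is not the swamp package's implementation; the function name `pc_batch_r2`, the simulated data, and the two-level batch label are assumptions made for illustration only.

```python
import numpy as np

def pc_batch_r2(data, batch, n_components=3):
    """Regress each of the top principal components on a batch label.

    data:  samples x features array of continuous measurements
    batch: one label per sample (a hypothetical technical annotation)
    Returns the R^2 of the batch regression for each component.
    """
    data = np.asarray(data, dtype=float)
    batch = np.asarray(batch)
    centered = data - data.mean(axis=0)
    # PCA via SVD: u * s gives the sample scores on each component
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    # One-hot design matrix for batch; the least-squares fit is the per-batch mean
    levels = np.unique(batch)
    design = (batch[:, None] == levels[None, :]).astype(float)
    r2 = []
    for k in range(scores.shape[1]):
        y = scores[:, k]
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        total = np.sum((y - y.mean()) ** 2)
        r2.append(1.0 - np.sum(resid ** 2) / total)
    return np.array(r2)

# Simulated example (assumption): batch B is shifted on every feature
rng = np.random.default_rng(0)
batch = np.repeat(["A", "B"], 20)
data = rng.normal(size=(40, 50))
data[batch == "B"] += 2.0
r2 = pc_batch_r2(data, batch)
```

A component whose R^2 against the batch label is high is a candidate for technical bias; after a correction step, the same function can be rerun to re-evaluate the components.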

2018 ◽  
Vol 28 (7) ◽  
pp. 2137-2149 ◽  
Author(s):  
Wei Wei ◽  
Zequn Sun ◽  
Willian A da Silveira ◽  
Zhenning Yu ◽  
Andrew Lawson ◽  
...  

Identification of cancer patient subgroups using high throughput genomic data is of critical importance to clinicians and scientists because it can offer opportunities for more personalized treatment and overlapping treatments of cancers. In spite of tremendous efforts, this problem remains challenging because of low reproducibility and instability of identified cancer subgroups and molecular features. In order to address this challenge, we developed Integrative Genomics Robust iDentification of cancer subgroups (InGRiD), a statistical approach that integrates information from biological pathway databases with high-throughput genomic data to improve the robustness for identification and interpretation of molecularly-defined subgroups of cancer patients. We applied InGRiD to the gene expression data of high-grade serous ovarian cancer from The Cancer Genome Atlas and the Australian Ovarian Cancer Study. The results indicate clear benefits of the pathway-level approaches over the gene-level approaches. In addition, using the proposed InGRiD framework, we also investigate and address the issue of gene sharing among pathways, which often occurs in practice, to further facilitate biological interpretation of key molecular features associated with cancer progression. The R package “InGRiD” implementing the proposed approach is currently available on our research group's GitHub webpage ( https://dongjunchung.github.io/INGRID/ ).


2019 ◽  
Author(s):  
Wikum Dinalankara ◽  
Qian Ke ◽  
Donald Geman ◽  
Luigi Marchionni

Given the ever-increasing amount of high-dimensional and complex omics data becoming available, it is increasingly important to discover simple but effective methods of analysis. Divergence analysis transforms each entry of a high-dimensional omics profile into a digitized (binary or ternary) code based on the deviation of the entry from a given baseline population. This is a novel framework that is significantly different from existing omics data analysis methods: it allows digitization of continuous omics data at the univariate or multivariate level, facilitates sample level analysis, and is applicable on many different omics platforms. The divergence package, available on the R platform through the Bioconductor repository collection, provides easy-to-use functions for carrying out this transformation. Here we demonstrate how to use the package with sample high throughput sequencing data from the Cancer Genome Atlas.
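The idea of coding each entry by its deviation from a baseline population can be sketched in Python as follows. This is a simplified univariate illustration, not the divergence package's estimator: the per-feature quantile interval, the function name `ternary_digitize`, and the toy arrays are all assumptions.

```python
import numpy as np

def ternary_digitize(profiles, baseline, lower_q=0.025, upper_q=0.975):
    """Code each entry as -1, 0, or +1 relative to a per-feature baseline interval.

    profiles: samples x features array to digitize
    baseline: samples x features array from the reference population
    The quantile interval is an illustrative assumption, not the package's method.
    """
    profiles = np.asarray(profiles, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    low = np.quantile(baseline, lower_q, axis=0)
    high = np.quantile(baseline, upper_q, axis=0)
    codes = np.zeros(profiles.shape, dtype=int)
    codes[profiles < low] = -1   # below the baseline interval
    codes[profiles > high] = 1   # above the baseline interval
    return codes

# Toy data (assumption): the interval is the baseline min and max per feature
baseline = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
profiles = np.array([[0.5, 20.0], [2.0, 35.0]])
codes = ternary_digitize(profiles, baseline, lower_q=0.0, upper_q=1.0)
```

Taking the absolute value of the codes gives the binary form, in which an entry is flagged only as inside or outside the baseline interval.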


PLoS ONE ◽  
2021 ◽  
Vol 16 (4) ◽  
pp. e0249002
Author(s):  
Wikum Dinalankara ◽  
Qian Ke ◽  
Donald Geman ◽  
Luigi Marchionni

Given the ever-increasing amount of high-dimensional and complex omics data becoming available, it is increasingly important to discover simple but effective methods of analysis. Divergence analysis transforms each entry of a high-dimensional omics profile into a digitized (binary or ternary) code based on the deviation of the entry from a given baseline population. This is a novel framework that is significantly different from existing omics data analysis methods: it allows digitization of continuous omics data at the univariate or multivariate level, facilitates sample level analysis, and is applicable on many different omics platforms. The divergence package, available on the R platform through the Bioconductor repository collection, provides easy-to-use functions for carrying out this transformation. Here we demonstrate how to use the package with data from the Cancer Genome Atlas.


Genetics ◽  
2019 ◽  
Vol 213 (4) ◽  
pp. 1209-1224 ◽  
Author(s):  
Juho A. J. Kontio ◽  
Mikko J. Sillanpää

Gaussian process (GP)-based automatic relevance determination (ARD) is known to be an efficient technique for identifying determinants of gene-by-gene interactions important to trait variation. However, the estimation of GP models is feasible only for low-dimensional datasets (∼200 variables), which severely limits application of the GP-based ARD method for high-throughput sequencing data. In this paper, we provide a nonparametric prescreening method that preserves virtually all the major benefits of the GP-based ARD method and extends its scalability to the typical high-dimensional datasets used in practice. In several simulated test scenarios, the proposed method compared favorably with existing nonparametric dimension reduction/prescreening methods suitable for higher-order interaction searches. As a real-data example, the proposed method was applied to a high-throughput dataset downloaded from The Cancer Genome Atlas (TCGA) with measured expression levels of 16,976 genes (after preprocessing) from patients diagnosed with acute myeloid leukemia.


2019 ◽  
Author(s):  
Shaolong Cao ◽  
Zeya Wang ◽  
Fan Gao ◽  
Jingxiao Chen ◽  
Feng Zhang ◽  
...  

The deconvolution of transcriptomic data from heterogeneous tissues in cancer studies remains challenging. Available software has difficulty accurately estimating both component-specific proportions and expression profiles for individual samples. To address these challenges, we present a new R-implementation pipeline for the more accurate and efficient transcriptome deconvolution of high dimensional data from mixtures of more than two components. The pipeline utilizes the computationally efficient DeMixT R-package with OpenMP and additional cancer-specific biological information to perform three-component deconvolution without requiring data from the immune profiles. It enables a wide application of DeMixT to gene expression datasets available from cancer consortia such as the Cancer Genome Atlas (TCGA) projects, where, other than the mixed tumor samples, a handful of normal samples are profiled in multiple cancer types. We have applied this pipeline to two TCGA datasets in colorectal adenocarcinoma (COAD) and prostate adenocarcinoma (PRAD). In COAD, we found varying distributions of immune proportions across the Consensus Molecular Subtypes, from the highest to the lowest being CMS1, CMS3, CMS4 and CMS2. In PRAD, we found the immune proportions are associated with progression-free survival (p<0.01) and negatively correlated with Gleason scores (p<0.001). Our DeMixT-centered analysis protocol opens up new opportunities to investigate the tumor-stroma-immune microenvironment, by providing both proportions and component-specific expressions, and thus better define the underlying biology of cancer progression. Availability and implementation: An R package, scripts and data are available: https://github.com/wwylab/DeMixTallmaterials.


2022 ◽  
Vol 12 ◽  
Author(s):  
Guoda Song ◽  
Yucong Zhang ◽  
Hao Li ◽  
Zhuo Liu ◽  
Wen Song ◽  
...  

Background: Ubiquitin and ubiquitin-like (UB/UBL) conjugations are among the most important post-translational modifications and are involved in the occurrence of cancers. However, the biological function and clinical significance of ubiquitin related genes (URGs) in prostate cancer (PCa) are still unclear. Methods: The transcriptome data and clinicopathological data were downloaded from The Cancer Genome Atlas (TCGA), which served as the training cohort. The GSE21034 dataset was used for validation. Batch effects were removed from the two datasets, which were then normalized, using the “sva” R package. Univariate Cox, LASSO Cox, and multivariate Cox regression were performed to identify a URGs prognostic signature. Then Kaplan-Meier curve and receiver operating characteristic (ROC) curve analyses were used to evaluate the performance of the URGs signature. Thereafter, a nomogram was constructed and evaluated. Results: A six-URGs signature was established to predict biochemical recurrence (BCR) of PCa, which included ARIH2, FBXO6, GNB4, HECW2, LZTR1 and RNF185. Kaplan-Meier curve and ROC curve analyses revealed good performance of the prognostic signature in both the training cohort and the validation cohort. Univariate and multivariate Cox analyses showed the signature was an independent prognostic factor for BCR of PCa in the training cohort. Then a nomogram based on the URGs signature and clinicopathological factors was established and showed an accurate prediction for prognosis in PCa. Conclusion: Our study established a URGs prognostic signature and constructed a nomogram to predict the BCR of PCa. This study could help with individualized treatment and identify PCa patients with high BCR risks.


2014 ◽  
Author(s):  
Belinda Phipson ◽  
Alicia Oshlack

Methylation of DNA is known to be essential to development and dramatically altered in cancers. The Illumina HumanMethylation450 BeadChip has been used extensively as a cost-effective way to profile nearly half a million CpG sites across the human genome. Here we present DiffVar, a novel method to test for differential variability between sample groups. DiffVar employs an empirical Bayes model framework that can take into account any experimental design and is robust to outliers. We applied DiffVar to several datasets from The Cancer Genome Atlas, as well as an aging dataset. DiffVar is available in the missMethyl Bioconductor R package.
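To make the notion of a differential variability test concrete, the following Python sketch compares absolute deviations from the group mean between two groups with a Welch t-statistic, a plain Levene-type comparison. It omits the empirical Bayes framework that the abstract attributes to DiffVar and is not the missMethyl implementation; the function name `abs_deviation_t`, the group labels, and the simulated data are assumptions.

```python
import numpy as np

def abs_deviation_t(values, groups):
    """Welch t-statistic per feature on absolute deviations from the group mean.

    values: samples x features array
    groups: one label per sample, exactly two distinct labels
    A positive statistic means the second label (in sorted order) is more variable.
    """
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    if labels.size != 2:
        raise ValueError("expected exactly two groups")
    devs = []
    for label in labels:
        block = values[groups == label]
        devs.append(np.abs(block - block.mean(axis=0)))
    a, b = devs
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return (b.mean(axis=0) - a.mean(axis=0)) / se

# Simulated example (assumption): feature 1 is far more variable in "tumor"
rng = np.random.default_rng(1)
groups = np.repeat(["normal", "tumor"], 100)
values = rng.normal(size=(200, 2))
values[groups == "tumor", 1] *= 5.0
t = abs_deviation_t(values, groups)
```

Because the test works on deviations rather than raw values, a feature can score highly here even when the two group means are identical.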


2020 ◽  
Author(s):  
Martin Pirkl ◽  
Niko Beerenwinkel

Motivation: Cancer is one of the most prevalent diseases in the world. Tumors arise due to important genes changing their activity, e.g., when inhibited or over-expressed. But these gene perturbations are difficult to observe directly. Molecular profiles of tumors can provide indirect evidence of gene perturbations. However, inferring perturbation profiles from molecular alterations is challenging due to error-prone molecular measurements and incomplete coverage of all possible molecular causes of gene perturbations. Results: We have developed a novel mathematical method to analyze cancer driver genes and their patient-specific perturbation profiles. We combine genetic aberrations with gene expression data in a causal network derived across patients to infer unobserved perturbations. We show that our method can predict perturbations in simulations, CRISPR perturbation screens, and breast cancer samples from The Cancer Genome Atlas. Availability: The method is available as the R-package nempi at https://github.com/cbg-ethz/[email protected], [email protected]


2016 ◽  
Author(s):  
Alexandra R. Buckley ◽  
Kristopher A. Standish ◽  
Kunal Bhutani ◽  
Trey Ideker ◽  
Hannah Carter ◽  
...  

The degree to which germline variation drives cancer development and shapes tumor phenotypes remains largely unexplored, possibly due to a lack of large scale publicly available germline data for a cancer cohort. Here we called germline variants on 9,618 cases from The Cancer Genome Atlas (TCGA) database representing 31 cancer types. We identified batch effects affecting loss of function (LOF) variant calls that can be traced back to differences in the way the sequence data were generated both within and across cancer types. Overall, LOF indel calls were more sensitive to technical artifacts than LOF Single Nucleotide Variant (SNV) calls. In particular, whole genome amplification of DNA prior to sequencing led to an artificially increased burden of LOF indel calls, which confounded association analyses relating germline variants to tumor type despite stringent indel filtering strategies. Due to the inherent noise we chose to remove all 614 amplified DNA samples, including all acute myeloid leukemia and virtually all ovarian cancer samples, from the final dataset. This study demonstrates how insufficient quality control can lead to false positive germline-tumor type associations and draws attention to the need to be sensitive to problems associated with a lack of uniformity in data generation in TCGA data. Author Summary: Cancer research to date has largely focused on genetic aberrations specific to tumor tissue. In contrast, the degree to which germline, or inherited, variation contributes to tumorigenesis remains unclear, possibly due to a lack of accessible germline variant data. In this study we identify germline variants in 9,618 samples using raw germline exome data from The Cancer Genome Atlas (TCGA). There are substantial differences in the way exome sequence data were generated both across and within cancer types in TCGA. We observe that differences in sequence data generation introduced batch effects, or variation that is due to technical factors rather than true biological variation, in our variant data. Most notably, we observe that amplification of DNA prior to sequencing resulted in an excess of predicted damaging indel variants. We show how these batch effects can confound germline association analyses if not properly addressed. Our study highlights the difficulties of working with large public genomic datasets like TCGA where samples are collected over time and across data centers, and particularly cautions against the use of amplified DNA samples for genetic association analyses.


2020 ◽  
Author(s):  
Jinlong Cao ◽  
Jianpeng Li ◽  
Xin Yang ◽  
Pan Li ◽  
Zhiqiang Yao ◽  
...  

Background: Cancer is often defined as a disease of aging. The majority of patients with urogenital cancers are elderly, and their clinical characteristics are greatly affected by age and aging. Here, we aimed to explore age-related biological changes in three major urogenital cancers by integrative bioinformatics analysis. Methods: First, mRNA (count format) and clinical data for bladder cancer, prostate cancer and renal cell carcinoma were downloaded from the Cancer Genome Atlas (TCGA) portal. The expressions of 64 cells were obtained by the xCell deconvolution method. The EdgeR package and limma package were used to analyze differentially expressed genes and cells in the young group and the old group, respectively. The ClusterProfiler R package and clueGO plugin were used for enrichment analysis, and the cytohubba plugin was used for hub gene analysis. Then co-expression analysis and chromosome distribution for hub genes were analyzed and demonstrated by the RIdeogram R package. The clinical correlation of hub genes and key cells was analyzed by Graphpad Prism software. Finally, the correlation between hub genes and key cells was explored by the corrplot R package. Results: We screened and identified 14 hub genes and 4 key cells related to age and urogenital cancers. The age-related differentially expressed genes and co-expressed genes were mainly enriched in muscle movement (Cl-, Ca2+), inflammatory response, antibacterial humoral immune response, substance metabolism and transport, redox reaction, etc. Most of the age-related genes are on chromosome 17. Moreover, the correlation between cells and genes was analyzed. Conclusion: Our study analyzed age-related genes and cells in the tumor microenvironment of urogenital cancers, and explored the pathways involved. This could contribute to personalized therapy for patients of different ages and a new understanding of the potential relationship between the aging microenvironment and urogenital cancers.

