AbstractBackgroundThere exist massive transcriptome profiles in the form of microarray, enabling reuse. The challenge is that they are processed with diverse platforms and preprocessing tools, requiring considerable time and informatics expertise for cross-dataset or cross-cancer analyses. If there exists a single, integrated data source consisting of thousands of samples, similar to TCGA, data-reuse will be facilitated for discovery, analysis, and validation of biomarker-based clinical strategy.FindingsWe present 11 merged microarray-acquired datasets (MMDs) of major cancer types, curating 8,386 patient-derived tumor and tumor-free samples from 95 GEO datasets. Highly concordant MMD-derived patterns of genome-wide differential gene expression were observed with matching TCGA cohorts. Using machine learning algorithms, we show that clinical models trained from all MMDs, except breast MMD, can be directly applied to RNA-seq-acquired TCGA data with an average accuracy of 0.96 in classifying cancer. Machine learning optimized MMD further aids to reveal immune landscape of human cancers critically needed in disease management and clinical interventions.ConclusionsTo facilitate large-scale meta-analysis, we generated a newly curated, unified, large-scale MMD across 11 cancer types. Besides TCGA, this single data source may serve as an excellent training or test set to apply, develop, and refine machine learning algorithms that can be tapped to better define genomic landscape of human cancers.