The Impact of Digital Histopathology Batch Effect on Deep Learning Model Accuracy and Bias
Abstract
The Cancer Genome Atlas (TCGA) is one of the largest biorepositories of digital histology. Deep learning (DL) models have been trained on TCGA to predict numerous features directly from histology, including survival, gene expression patterns, and driver mutations. However, in more than 3,000 TCGA patients spanning six cancer subtypes, we show that these features vary substantially across tissue submitting sites. We also show that DL can easily identify histologic image differences between submitting sites. Site detection remains possible despite commonly used color normalization and augmentation methods, and we quantify the digital image characteristics that constitute this histologic batch effect. For example, patient ethnicity in the TCGA breast cancer cohort can be inferred from histology because of site-level batch effect, which must be accounted for to ensure the equitable application of DL. Batch effect also leads to overoptimistic estimates of model performance, and we propose a quadratic programming method to guide validation in a way that abrogates this bias.
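Assuming the overoptimism arises when slides from the same submitting site appear in both training and validation data, one way to make the idea concrete is a site-preserved split, in which every slide from a site is kept in a single fold. The sketch below uses a hypothetical helper, `site_preserved_folds`, which greedily places whole sites into folds so that slide counts and positive-label counts stay close to even under a quadratic penalty. It is an illustrative stand-in, not the quadratic programming method described above: a true quadratic program would optimize the whole assignment at once rather than one site at a time. The inputs are assumptions: a per-slide site identifier (for TCGA, the tissue source site code in the second field of a barcode) and a binary label.

```python
from collections import Counter


def site_preserved_folds(sites, labels, n_folds=3):
    """Map each submitting site to one fold (hypothetical helper).

    sites:  per-slide site identifiers, e.g. TCGA tissue source site codes.
    labels: per-slide binary outcome (0 or 1), aligned with ``sites``.
    Returns {site: fold_index}. Every slide from a site lands in the same
    fold, so no site contributes to both training and validation.
    This is a greedy heuristic, not an exact quadratic program.
    """
    if len(sites) != len(labels):
        raise ValueError("sites and labels must have the same length")
    n_per_site = Counter(sites)
    pos_per_site = Counter(s for s, y in zip(sites, labels) if y == 1)

    # Ideal per-fold totals if slides and positives were split evenly.
    target_n = len(sites) / n_folds
    target_pos = sum(pos_per_site.values()) / n_folds

    fold_n = [0] * n_folds
    fold_pos = [0] * n_folds
    assignment = {}

    # Place the largest sites first; they are the hardest to fit evenly.
    for site in sorted(n_per_site, key=lambda s: (-n_per_site[s], s)):
        n_s, p_s = n_per_site[site], pos_per_site[site]

        def cost_increase(f):
            # Change in the objective sum_f (n_f - target_n)^2 + (pos_f - target_pos)^2
            # if this whole site were added to fold f.
            before = (fold_n[f] - target_n) ** 2 + (fold_pos[f] - target_pos) ** 2
            after = (fold_n[f] + n_s - target_n) ** 2 + (fold_pos[f] + p_s - target_pos) ** 2
            return after - before

        best = min(range(n_folds), key=cost_increase)
        assignment[site] = best
        fold_n[best] += n_s
        fold_pos[best] += p_s
    return assignment


# Toy usage with made-up site codes and labels (illustrative only).
sites = ["A2"] * 4 + ["BH"] * 3 + ["E2"] * 3 + ["AR"] * 2
labels = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0]
folds = site_preserved_folds(sites, labels, n_folds=3)
print(folds)
```

Because assignment happens at the site level, any validation fold evaluates the model only on sites it never saw during training; swapping the greedy loop for an exact solver is the natural way to enforce balance more strictly.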