Background: Routine clinical data from clinical charts are indispensable for retrospective and prospective observational studies and clinical trials. Their reproducibility is often not assessed.
Objective: To develop a prostate cancer-specific database with a defined source hierarchy for clinical annotations in conjunction with molecular profiling and to evaluate data reproducibility.
Design, setting, and participants: For men with prostate cancer and clinical-grade paired tumor-normal sequencing, we performed team-based retrospective data collection from the electronic medical record at a comprehensive cancer center. We developed an open-source R package for data processing. We assessed reproducibility using blinded repeat annotation by a reference medical oncologist.
Outcome measurements and statistical analysis: We evaluated completeness of data elements, reproducibility of team-based annotation compared to the reference, and impact of measurement error on bias in survival analyses.
Results and limitations: Data elements on demographics, diagnosis and staging, disease state at the time of procuring a genomically characterized sample, and clinical outcomes were piloted and then abstracted for 2,261 patients (with 2,631 samples). Completeness of data elements was generally high. Comparing to the repeat annotation by a medical oncologist blinded to the database (100 patients/samples), reproducibility of annotations was high to very high; T stage, metastasis date, and presence and date of castration resistance had lower reproducibility. Impact of measurement error on estimates for strong prognostic factors was modest.
Conclusions: With a prostate cancer-specific data dictionary and quality control measures, manual clinical annotations by a multidisciplinary team can be scalable and reproducible. The data dictionary and the R package for reproducible data processing are freely available to increase data quality in clinical prostate cancer research.