761 Characterisation and clustering of diseases by their association with well-known risk factors
Abstract

Background
Data-driven classifications are increasing statistical power, refining prognoses, and improving our understanding of autoimmune, respiratory, infectious, and neurological diseases. Classifications have used molecular information, age of incidence, and sequences of disease onset (“disease trajectories”). Here we consider whether associations with easily measured, established risk factors such as height and BMI can usefully characterise disease.

Methods
UK Biobank data and their linked hospital episode statistics were used to study 172 common age-related diseases. A proportional hazards model was used to estimate associations with potential risk factors and to adjust for well-known confounders. Diseases were compared and hierarchically clustered using novel but rigorous multivariate statistical methods.

Results
For diseases affecting both sexes, over 38% can be uniquely identified by their associations with risk factors. Equivalent diseases often clustered adjacently. After an FDR multiple-testing adjustment, roughly 5% have statistically significant differences. Similar remarks applied to several symptoms of unknown cause. Many clustered diseases are associated with a shared, known pathogenesis; others suggest likely but unconfirmed causes.

Conclusions
Risk factors for disease can be surprisingly precise and can be used to cluster diseases in a meaningful way. Risk factors for men and women may differ for some diseases. Several symptoms of unknown cause have disease-specific, statistically significant risk factors.

Key messages
Big datasets and modern statistics are providing new insights into the relationships between diseases and their associations with risk factors. Diseases can be identified and clustered by their associations with well-known risk factors.
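As a rough illustration of two steps named in the Methods and Results, the sketch below applies an FDR adjustment to a list of p-values and hierarchically clusters diseases by their risk-factor association profiles. It is not the authors' method: the abstract does not say which FDR procedure or which multivariate comparison was used, so the Benjamini-Hochberg procedure, Euclidean distance, and average linkage are all assumptions chosen for simplicity. The function names, the disease labels, and every number are hypothetical and are not UK Biobank results.

```python
from itertools import combinations
from math import sqrt


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: the abstract only says "FDR adjustment"; BH is one common choice.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in the rank of the raw p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def euclidean(a, b):
    """Euclidean distance between two association profiles (an assumption)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering of name -> coefficient-vector profiles.

    Repeatedly merges the two clusters with the smallest mean pairwise
    distance until n_clusters remain.
    """
    clusters = [[name] for name in profiles]

    def dist(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(euclidean(profiles[a], profiles[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical log hazard ratios for three risk factors (e.g. height, BMI,
# and one more); invented purely to exercise the code.
example_profiles = {
    "disease_A": [0.10, 0.30, 0.50],
    "disease_B": [0.12, 0.28, 0.55],
    "disease_C": [-0.20, 0.05, 0.00],
    "disease_D": [-0.25, 0.00, 0.05],
}

if __name__ == "__main__":
    print(average_linkage(example_profiles, n_clusters=2))
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

In practice the profiles would be fitted coefficients from a proportional hazards model, and a comparison that accounts for the uncertainty of those estimates would be more appropriate than plain Euclidean distance.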