A supervised machine learning study of online discussion forums about type-2 diabetes

Author(s): Jonathan-Raphael Reichert, Klaus Langholz Kristensen, Raghava Rao Mukkamala, Ravi Vatrapu
2020
Author(s): Benjamin Lam, Michael Catt, Sophie Cassidy, Jaume Bacardit, Philip Darke, ...

BACKGROUND Between 2013 and 2015, the UK Biobank collected accelerometer traces from 103,712 volunteers aged between 40 and 69, each of whom wore a wrist-worn triaxial accelerometer for one week. This dataset has previously been used to verify that individuals with chronic diseases exhibit reduced activity levels compared to healthy populations. However, the dataset is likely to be noisy, because the devices were allocated to participants without a specific set of inclusion criteria and the traces reflect uncontrolled free-living conditions.

OBJECTIVE To determine the extent to which accelerometer traces can be used to distinguish individuals with Type-2 Diabetes (T2D) from normoglycaemic controls, and to quantify their limitations.

METHODS Supervised machine learning classifiers were trained using different sets of features to separate T2D-positive individuals from normoglycaemic individuals. Multiple criteria, based on a combination of self-assessment UK Biobank variables and primary care health records linked to the UK Biobank participants, were used to identify 3,103 individuals in this population who have T2D. The remaining 19,852 non-diabetic participants were further scored on the severity of their physical activity impairment, based on other conditions found in their primary care data, and those likely to have been physically impaired at the time were excluded. Physical activity features were first extracted from the raw accelerometer traces for each participant, using an algorithm that extends the previously developed Biobank Accelerometry Analysis toolkit from Oxford University [1]. These features were complemented by a selected collection of socio-demographic and lifestyle features available from UK Biobank.

RESULTS Three types of classifiers were tested, with AUC close to 0.86 (95% CI: 0.85-0.87) for all three, and F1 scores in the range 0.80-0.82 for T2D positives and 0.73-0.74 for controls. Results obtained using non-physically impaired controls were compared to those obtained using highly physically impaired controls, to test the hypothesis that non-diabetes conditions reduce classifier performance. Models built using a training set that includes highly impaired controls with other conditions performed worse: AUC 0.75-0.77 (95% CI: 0.74-0.78) and F1 in the range 0.76-0.77 (positives) and 0.63-0.65 (controls).

CONCLUSIONS Granular measures of free-living physical activity can be used to train machine learning models that discriminate between T2D and normoglycaemic controls, albeit with limitations due to the intrinsic noise in the datasets. From a broader clinical perspective, these findings motivate further research into the use of physical activity traces to screen individuals at risk of diabetes and for early detection, in conjunction with routinely used risk scores, provided that appropriate quality control is enforced on the data collection protocol to improve the signal-to-noise ratio.
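The results are reported as AUC with a 95% confidence interval and as per-class F1 scores. The abstract does not state how the interval was computed, so the following is only a minimal sketch under stated assumptions: it uses a percentile bootstrap (an assumption, not necessarily the authors' method), purely synthetic scores rather than UK Biobank data, and hypothetical helper names (`roc_auc`, `f1_for_class`, `bootstrap_auc_ci`) introduced here for illustration.

```python
import random


def roc_auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) formula; tied scores share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def f1_for_class(labels, preds, cls):
    """F1 score treating `cls` as the class of interest."""
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC (assumed method, for illustration only)."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # a resample needs both classes for AUC to be defined
            aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lo_idx = max(0, int((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot) - 1)
    return aucs[lo_idx], aucs[hi_idx]


# Synthetic example: 1 = T2D positive, 0 = control. These are NOT study data.
rng = random.Random(42)
labels = [1] * 200 + [0] * 200
scores = [rng.gauss(1.0, 1.0) for _ in range(200)] + [rng.gauss(0.0, 1.0) for _ in range(200)]
preds = [1 if s > 0.5 else 0 for s in scores]  # arbitrary decision threshold

auc = roc_auc(labels, scores)
ci_low, ci_high = bootstrap_auc_ci(labels, scores, n_boot=200)
print(f"AUC {auc:.2f} (95% CI: {ci_low:.2f}-{ci_high:.2f})")
print(f"F1 positives {f1_for_class(labels, preds, 1):.2f}, "
      f"F1 controls {f1_for_class(labels, preds, 0):.2f}")
```

Reporting F1 separately for positives and controls, as the abstract does, shows whether a classifier performs unevenly across the two groups, which a single pooled score would hide.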

