Predicting Health Educational Material Understandability using Machine Learning Algorithms (Preprint)
BACKGROUND Improving the understandability of health information can significantly increase the cost-effectiveness and efficiency of health education programs for vulnerable populations. There is a pressing need for clinically informed computerized tools that enable rapid, reliable assessment of the linguistic understandability of specialized health and medical education resources. OBJECTIVE This paper addresses a critical gap in current patient-oriented health resource development, which requires reliable, accurate evaluation instruments to increase the efficiency and cost-effectiveness of health education resource evaluation. We aim to translate an internationally endorsed clinical guideline, the Patient Education Materials Assessment Tool (PEMAT), into machine learning algorithms to facilitate the evaluation of the understandability of health resources for international students at Australian universities. METHODS Based on international patient health resource assessment guidelines, we developed machine learning algorithms to predict the linguistic understandability of health texts for Australian college students (aged 25-30) from non-English-speaking backgrounds. We compared extreme gradient boosting, random forest, neural networks, and C5 decision tree models for automated health information understandability evaluation. The five machine learning models achieved statistically better results than the baseline logistic regression model. We also evaluated the impact of each linguistic feature on the performance of each of the five models. RESULTS Information evidentness, relevance to educational purposes, and logical sequence were consistently more important than numeracy skills and medical knowledge when assessing the linguistic understandability of health education resources for international tertiary students with adequate English skills (mean IELTS score 6.5) and high health literacy (mean score 16.5 on the Short Assessment of Health Literacy-English test).
The results challenge the traditional view that a lack of medical knowledge and numerical skills constitutes the barrier to understanding health educational materials. CONCLUSIONS Machine learning algorithms were developed to predict health information understandability for international college students aged 25-30. Thirteen natural language features and five evaluation dimensions were identified and compared in terms of their impact on the performance of the models. Health information understandability varies according to the demographic profile of the target readers; for international tertiary students, improving the evidentness, relevance, and logical sequence of health information is critical.
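
The feature-impact comparison summarized above can be illustrated with permutation importance, one common way to measure how much a fitted classifier relies on each input feature. The abstract does not state which importance method was used, so the sketch below is an assumption, not the authors' procedure. The data are synthetic, the three features are placeholders rather than the study's 13 linguistic features, and all function names (`train_logistic`, `permutation_importance`, `make_synthetic`) are hypothetical. A plain logistic regression stands in for the classifier, and scoring is done on the training data only to keep the sketch short.

```python
import math
import random


def train_logistic(X, y, epochs=300, lr=0.1):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(model, X, y):
    """Fraction of rows whose predicted class matches the label."""
    w, b = model
    hits = 0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi)) + b
        hits += int((z > 0) == (yi == 1))
    return hits / len(y)


def permutation_importance(model, X, y, rng, repeats=10):
    """Mean drop in accuracy when one feature column is shuffled."""
    base = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the link between feature j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(base - accuracy(model, X_perm, y))
        importances.append(sum(drops) / repeats)
    return importances


def make_synthetic(n, rng):
    """Synthetic data: the label depends strongly on feature 0, weakly on feature 1,
    and not at all on feature 2 (placeholders, not the study's features)."""
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0, 1) for _ in range(3)]
        y.append(1 if 3.0 * row[0] + 0.5 * row[1] > 0 else 0)
        X.append(row)
    return X, y


rng = random.Random(0)
X, y = make_synthetic(400, rng)
model = train_logistic(X, y)
importances = permutation_importance(model, X, y, rng)
for j, imp in enumerate(importances):
    print(f"feature_{j}: mean accuracy drop = {imp:.3f}")
```

In an actual evaluation, the same shuffling step would be applied to each of the compared models on held-out data, so that the feature rankings can be compared across models.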