Predicting Health Material Cognitive Accessibility for Non-Native English Speakers Using Multidimensional Semantic Features as Predictors of Machine Learning Algorithms (Preprint)
BACKGROUND Much of current health information understandability research uses medical readability formula to assess the cognitive difficulty of health education resources. This is based on an implicit assumption that medical domain knowledge represented by uncommon words or jargons form the sole barriers to health information access among the public. Our study challenged this by showing that for readers from non-English speaking backgrounds with higher education attainment, semantic features of English health texts which underpin the knowledge structure of English health texts, rather than medical jargons can explain the cognitive accessibility of health materials among readers with better understanding of English health terms, yet very limited exposure to English-based health education environments and traditions. OBJECTIVE Our study explored multidimensional semantic features for developing machine learning algorithms to predict the perceived level of cognitive accessibility of English health materials on health risks and diseases for young adults enrolled in Australian tertiary institutes. We compared algorithms to evaluate the cognitive accessibility of health information for non-native English speaker with advanced education levels yet very limited exposure to English health education environments. METHODS We used 108 semantic features to measure the content complexity and accessibility of original English resources. Using 1000 English health texts collected from Australian and international health organization websites, rated by overseas tertiary students, we compared machine learning (decision tree, SVM, ensemble tree, logistic regression) after hyperparameter optimization (grid search for the best hyperparameter combination of minimal classification errors). We applied 10-fold cross-validation on the whole dataset for the model training and testing, calculated the AUC, sensitivity, specificity, and accuracy as the measurement of the model performance. RESULTS We developed, compared four machine learning algorithms using multidimensional semantic features as predictors. The results showed that ensemble tree (LogitBoost) outperformed in terms of AUC (0.97), sensitivity (0.966), specificity (0.972) and accuracy (0.969). Decision tree followed closely with an AUC (0.924), sensitivity (0.912), specificity (0.9358), and accuracy (0.924), and SVM with an AUC (0.8946), sensitivity (0.8952), specificity (0.894), accuracy (0.8946). Decision tree, ensemble tree, SVM achieved statistically significant improvement over logistic regression in AUC, specificity, accuracy. As the best performing algorithm, ensemble tree reached statistically significant improvement over SVM in AUC, specificity, accuracy, and a statistically significant improvement over decision tree in sensitivity. CONCLUSIONS Our study showed that cognitive accessibility of English health texts is not limited to word length and sentence length as had been conventionally measured by the medical readability formula. We compared machine learning algorithms based on semantic features to explore the cognitive accessibility of health information for non-native English speakers. The results showed the new models reached statistically increased AUC, sensitivity, and accuracy to predict health resource accessibility for the target readership. Our study illustrated that semantic features such as cognitive abilities related semantic features, communicative actions and processes, power relationships in healthcare settings, and lexical familiarity and diversity of health texts are large contributors to the comprehension of health information and that for readers such as international students, semantic features of health texts which outweigh syntax and domain knowledge.