scholarly journals Predicting Health Material Cognitive Accessibility Using Multidimensional Semantic Features and Readability Tools as Predicators (Preprint)

2021 ◽  
Author(s):  
Meng Ji ◽  
Yanmeng Liu ◽  
Tianyong Hao

BACKGROUND Much of current health information understandability research uses medical readability formula (MRF) to assess the cognitive difficulty of health education resources. This is based on an implicit assumption that medical domain knowledge represented by uncommon words or jargons form the sole barriers to health information access among the public. Our study challenged this by showing that for readers from non-English speaking backgrounds with higher education attainment, semantic features of English health texts rather than medical jargons can explain the lack of cognitive access of health materials among readers with better understanding of health terms, yet limited exposure to English health education materials. OBJECTIVE Our study explored combined MRF and multidimensional semantic features (MSF) for developing machine learning algorithms to predict the actual level of cognitive accessibility of English health materials on health risks and diseases for specific populations. We compare algorithms to evaluate the cognitive accessibility of specialised health information for non-native English speaker with advanced education levels yet very limited exposure to English health education environments. METHODS We used 108 semantic features to measure the content complexity and accessibility of original English resources. Using 1000 English health texts collected from international health organization websites, rated by international tertiary students, we compared machine learning (decision tree, SVM, discriminant analysis, ensemble tree and logistic regression) after automatic hyperparameter optimization (grid search for the best combination of hyperparameters of minimal classification errors). We applied 10-fold cross-validation on the whole dataset for the model training and testing, calculated the AUC, sensitivity, specificity, and accuracy as the measured of the model performance. RESULTS Using two sets of predictor features: widely tested MRF and MSF proposed in our study, we developed and compared three sets of machine learning algorithms: the first set of algorithms used MRF as predictors only, the second set of algorithms used MSF as predictors only, and the last set of algorithms used both MRF and MSF as integrated models. The results showed that the integrated models outperformed in terms of AUC, sensitivity, accuracy, and specificity. CONCLUSIONS Our study showed that cognitive accessibility of English health texts is not limited to word length and sentence length conventionally measured by MRF. We compared machine learning algorithms combing MRF and MSF to explore the cognitive accessibility of health information from syntactic and semantic perspectives. The results showed the strength of integrated models in terms of statistically increased AUC, sensitivity, and accuracy to predict health resource accessibility for the target readership, indicating that both MRF and MSF contribute to the comprehension of health information, and that for readers with advanced education, semantic features outweigh syntax and domain knowledge.

2021 ◽  
Author(s):  
Meng Ji ◽  
Yanmeng Liu ◽  
Tianyong Hao

BACKGROUND Much of current health information understandability research uses medical readability formula to assess the cognitive difficulty of health education resources. This is based on an implicit assumption that medical domain knowledge represented by uncommon words or jargons form the sole barriers to health information access among the public. Our study challenged this by showing that for readers from non-English speaking backgrounds with higher education attainment, semantic features of English health texts which underpin the knowledge structure of English health texts, rather than medical jargons can explain the cognitive accessibility of health materials among readers with better understanding of English health terms, yet very limited exposure to English-based health education environments and traditions. OBJECTIVE Our study explored multidimensional semantic features for developing machine learning algorithms to predict the perceived level of cognitive accessibility of English health materials on health risks and diseases for young adults enrolled in Australian tertiary institutes. We compared algorithms to evaluate the cognitive accessibility of health information for non-native English speaker with advanced education levels yet very limited exposure to English health education environments. METHODS We used 108 semantic features to measure the content complexity and accessibility of original English resources. Using 1000 English health texts collected from Australian and international health organization websites, rated by overseas tertiary students, we compared machine learning (decision tree, SVM, ensemble tree, logistic regression) after hyperparameter optimization (grid search for the best hyperparameter combination of minimal classification errors). We applied 10-fold cross-validation on the whole dataset for the model training and testing, calculated the AUC, sensitivity, specificity, and accuracy as the measurement of the model performance. RESULTS We developed, compared four machine learning algorithms using multidimensional semantic features as predictors. The results showed that ensemble tree (LogitBoost) outperformed in terms of AUC (0.97), sensitivity (0.966), specificity (0.972) and accuracy (0.969). Decision tree followed closely with an AUC (0.924), sensitivity (0.912), specificity (0.9358), and accuracy (0.924), and SVM with an AUC (0.8946), sensitivity (0.8952), specificity (0.894), accuracy (0.8946). Decision tree, ensemble tree, SVM achieved statistically significant improvement over logistic regression in AUC, specificity, accuracy. As the best performing algorithm, ensemble tree reached statistically significant improvement over SVM in AUC, specificity, accuracy, and a statistically significant improvement over decision tree in sensitivity. CONCLUSIONS Our study showed that cognitive accessibility of English health texts is not limited to word length and sentence length as had been conventionally measured by the medical readability formula. We compared machine learning algorithms based on semantic features to explore the cognitive accessibility of health information for non-native English speakers. The results showed the new models reached statistically increased AUC, sensitivity, and accuracy to predict health resource accessibility for the target readership. Our study illustrated that semantic features such as cognitive abilities related semantic features, communicative actions and processes, power relationships in healthcare settings, and lexical familiarity and diversity of health texts are large contributors to the comprehension of health information and that for readers such as international students, semantic features of health texts which outweigh syntax and domain knowledge.


10.2196/29175 ◽  
2021 ◽  
Vol 9 (9) ◽  
pp. e29175
Author(s):  
Meng Ji ◽  
Yanmeng Liu ◽  
Tianyong Hao

Background Current health information understandability research uses medical readability formulas to assess the cognitive difficulty of health education resources. This is based on an implicit assumption that medical domain knowledge represented by uncommon words or jargon form the sole barriers to health information access among the public. Our study challenged this by showing that, for readers from non-English speaking backgrounds with higher education attainment, semantic features of English health texts that underpin the knowledge structure of English health texts, rather than medical jargon, can explain the cognitive accessibility of health materials among readers with better understanding of English health terms yet limited exposure to English-based health education environments and traditions. Objective Our study explores multidimensional semantic features for developing machine learning algorithms to predict the perceived level of cognitive accessibility of English health materials on health risks and diseases for young adults enrolled in Australian tertiary institutes. We compared algorithms to evaluate the cognitive accessibility of health information for nonnative English speakers with advanced education levels yet limited exposure to English health education environments. Methods We used 113 semantic features to measure the content complexity and accessibility of original English resources. Using 1000 English health texts collected from Australian and international health organization websites rated by overseas tertiary students, we compared machine learning (decision tree, support vector machine [SVM], ensemble tree, and logistic regression) after hyperparameter optimization (grid search for the best hyperparameter combination of minimal classification errors). We applied 10-fold cross-validation on the whole data set for the model training and testing, and calculated the area under the operating characteristic curve (AUC), sensitivity, specificity, and accuracy as the measurement of the model performance. Results We developed and compared 4 machine learning algorithms using multidimensional semantic features as predictors. The results showed that ensemble tree (LogitBoost) outperformed in terms of AUC (0.97), sensitivity (0.966), specificity (0.972), and accuracy (0.969). Decision tree (AUC 0.924, sensitivity 0.912, specificity 0.9358, and accuracy 0.924) and SVM (AUC 0.8946, sensitivity 0.8952, specificity 0.894, and accuracy 0.8946) followed closely. Decision tree, ensemble tree, and SVM achieved statistically significant improvement over logistic regression in AUC, specificity, and accuracy. As the best performing algorithm, ensemble tree reached statistically significant improvement over SVM in AUC, specificity, and accuracy, and statistically significant improvement over decision tree in sensitivity. Conclusions Our study shows that cognitive accessibility of English health texts is not limited to word length and sentence length as had been conventionally measured by medical readability formulas. We compared machine learning algorithms based on semantic features to explore the cognitive accessibility of health information for nonnative English speakers. The results showed the new models reached statistically increased AUC, sensitivity, and accuracy to predict health resource accessibility for the target readership. Our study illustrated that semantic features such as cognitive ability–related semantic features, communicative actions and processes, power relationships in health care settings, and lexical familiarity and diversity of health texts are large contributors to the comprehension of health information; for readers such as international students, semantic features of health texts outweigh syntax and domain knowledge.


2021 ◽  
Author(s):  
Meng Ji ◽  
Yanmeng Liu ◽  
Tianyong Hao

BACKGROUND Current health information understandability research uses medical readability formulas to assess the cognitive difficulty of health education resources. This is based on an implicit assumption that medical domain knowledge represented by uncommon words or jargon form the sole barriers to health information access among the public. Our study challenged this by showing that, for readers from non-English speaking backgrounds with higher education attainment, semantic features of English health texts that underpin the knowledge structure of English health texts, rather than medical jargon, can explain the cognitive accessibility of health materials among readers with better understanding of English health terms yet limited exposure to English-based health education environments and traditions. OBJECTIVE Our study explores multidimensional semantic features for developing machine learning algorithms to predict the perceived level of cognitive accessibility of English health materials on health risks and diseases for young adults enrolled in Australian tertiary institutes. We compared algorithms to evaluate the cognitive accessibility of health information for nonnative English speakers with advanced education levels yet limited exposure to English health education environments. METHODS We used 113 semantic features to measure the content complexity and accessibility of original English resources. Using 1000 English health texts collected from Australian and international health organization websites rated by overseas tertiary students, we compared machine learning (decision tree, support vector machine, ensemble classifier, and logistic regression) after hyperparameter optimization (grid search for the best hyperparameter combination of minimal classification errors). We applied 5-fold cross-validation on the whole data set for the model training and testing; and calculated the area under the operating characteristic curve (AUC), sensitivity, specificity, and accuracy as the measurement of the model performance. RESULTS We developed and compared 4 machine learning algorithms using multidimensional semantic features as predictors. The results showed that ensemble classifier (LogitBoost) outperformed in terms of AUC (0.858), sensitivity (0.787), specificity (0.813), and accuracy (0.802). Support vector machine (AUC 0.848, sensitivity 0.783, specificity 0.791, and accuracy 0.786) and decision tree (AUC 0.754, sensitivity 0.7174, specificity 0.7424, and accuracy 0.732) followed. Ensemble classifier (LogitBoost), support vector machine, and decision tree achieved statistically significant improvement over logistic regression in AUC, sensitivity, specificity, and accuracy. Support vector machine reached statistically significant improvement over decision tree in AUC and accuracy. As the best performing algorithm, ensemble classifier (LogitBoost) reached statistically significant improvement over decision tree in AUC, sensitivity, specificity, and accuracy. CONCLUSIONS Our study shows that cognitive accessibility of English health texts is not limited to word length and sentence length as had been conventionally measured by medical readability formulas. We compared machine learning algorithms based on semantic features to explore the cognitive accessibility of health information for nonnative English speakers. The results showed the new models reached statistically increased AUC, sensitivity, and accuracy to predict health resource accessibility for the target readership. Our study illustrated that semantic features such as cognitive ability–related semantic features, communicative actions and processes, power relationships in health care settings, and lexical familiarity and diversity of health texts are large contributors to the comprehension of health information; for readers such as international students, semantic features of health texts outweigh syntax and domain knowledge.


2021 ◽  
Author(s):  
Wenxiu Xie ◽  
Meng Ji ◽  
Yanmeng Liu ◽  
Tianyong Hao ◽  
Chi-Yin Chow

BACKGROUND Suitability of health resources for specific readerships represents a critical yet underexplored area of research in health informatics, despite its importance in health literacy and health education. High relevance of health information can improve the suitability and readability of online health educational resources for young readers. It has an important role in developing the health literacy of children with increasing exposure to online health information. Existing research on health resource evaluation is limited to the analysis of the morphological and syntactic complexity. Besides, empirical instruments do not exist to evaluate the suitability of online health information for children. OBJECTIVE We aimed to develop algorithms to predict suitability of online health information for this understudied user group, using a small number of semantic features to provide accurate and convenient tools for automatic prediction of the suitability of online health information for children. METHODS Combining machine learning and linguistic insights, we identified semantic features to predict the suitability of online health information for children, as an emerging and large readership on online health information. The selection of natural language features as predicator variables of algorithms went through initial automatic feature selection using Ridge Classifier, support vector machine, extreme gradient boost, followed by revision by linguists, education experts based on effective health information design. We compared algorithms using the automatically selected features (19) and linguistically enhanced features (20), using the initial features (115) as the baseline. RESULTS Using 5-fold cross-validation, comparing with the baseline (115 features), the Gaussian Naive Bayes model (20 features) achieved statistically higher mean sensitivity (P =0.0206, 95% CI: -0.016, 0.1929); mean specificity (P = 0.0205, 95% CI: -0.016, 0.199); mean AUC (P =0.017, 95% CI: -0.007, 0.140); mean Macro F1 (P =0.0061, 95% CI: 0.016, 0.167). The statistically improved performance of the final model (20 features) stands in contrast with the statistically insignificant changes between the original feature set (115) and the automatically selected features (19): mean sensitivity (P =0.134, 95% CI: -0.1699, 0.0681), mean specificity (P = 0.1001, 95% CI: -0.1389, 0.4017); mean AUC (P =0.0082, 95% CI: 0.0059, 0.1126), and mean macro F1 (P = 0.9796, 95% CI: -0.0555, 0.0548). This demonstrates the importance and effectiveness of combing automatic feature selection and expert-based linguistic revision to develop most effective machine learning algorithms from high-dimensional datasets. CONCLUSIONS Our study developed machine learning algorithms for evaluating health information suitability for children, an important readership who is having increasing reliance on online health information for developing their health literacy. User-adaptive automatic assessment of online health contents holds much promise for distant and remote health education among young readers. Our study leveraged the precision, adaptability of machine learning algorithms and insights from health linguistics to help advance this significant yet understudied area of research.


2021 ◽  
Author(s):  
Meng Ji ◽  
Yanmeng Liu ◽  
Mengdan Zhao ◽  
Ziqing Lyu ◽  
Boren Zhang ◽  
...  

BACKGROUND Improving the understandability of health information can significantly increase the cost-effectiveness and efficiency of health education programs for vulnerable populations. There is a pressing need to develop clinically informed computerized tools to enable rapid, reliable assessment of the linguistic understandability of specialized health and medical education resources. OBJECTIVE This paper fills a critical gap in current patient-oriented health resource development, which requires reliable, accurate evaluation instruments to increase the efficiency, cost-effectiveness of health education resource evaluation. We aim to translate internationally endorsed clinical guidelines, Patient Education Materials Assessment Tool (PEMAT) to machine learning algorithms to facilitate the evaluation of the understandability of health resources for international students at Australian universities. METHODS Based on international patient health resource assessment guidelines, we developed machine learning algorithms to predict the linguistic understandability of health texts for Australian college students (aged 25-30) from non-English speaking backgrounds. We compared extreme gradient boosting, random forest, neural networks, C5 decision tree for automated health information understandability evaluation. The five machine learning models achieved statistically better results compared to the baseline logistic regression model. We also evaluated the impact of each linguistic feature on the performance of each of the five models. RESULTS It was found that information evidentness, relevance to educational purposes and logical sequence were consistently more important than numeracy skills and medical knowledge when assessing the linguistic understandability of health education resources for international tertiary students with adequate English skills (IELT test score mean 6.5) and high health literacy (mean 16.5 in the Short Assessment of Health Literacy-English test). The results challenged traditional views that lack of medical knowledge and numerical skills constituted the barriers to the understanding of health educational materials. CONCLUSIONS Machine learning algorithms were developed to predict health information understandability for international college students aged 25-30. 13 natural language features and 5 evaluation dimensions were identified and compared in terms of their impact on the performance of the models. Health information understandability varies according to the demographic profiles of the target readers, and for international tertiary students, improving health information evidentness, relevance and logic is critical.


2021 ◽  
Vol 4 (2) ◽  
pp. p10
Author(s):  
Yanmeng Liu

The success of health education resources largely depends on their readability, as the health information can only be understood and accepted by the target readers when the information is uttered with proper reading difficulty. Unlike other populations, children feature limited knowledge and underdeveloped reading comprehension, which poses more challenges for the readability research on health education resources. This research aims to explore the readability prediction of health education resources for children by using semantic features to develop machine learning algorithms. A data-driven method was applied in this research:1000 health education articles were collected from international health organization websites, and they were grouped into resources for kids and resources for non-kids according to their sources. Moreover, 73 semantic features were used to train five machine learning algorithms (decision tree, support vector machine, k-nearest neighbors algorithm, ensemble classifier, and logistic regression). The results showed that the k-nearest neighbors algorithm and ensemble classifier outperformed in terms of area under the operating characteristic curve sensitivity, specificity, and accuracy and achieved good performance in predicting whether the readability of health education resources is suitable for children or not.


Terminology ◽  
2004 ◽  
Vol 10 (1) ◽  
pp. 153-168 ◽  
Author(s):  
Noriko Tomuro

Question terminology is a set of terms which appear in keywords, idioms and fixed expressions commonly observed in questions. This paper investigates ways to automatically extract question terminology from a corpus of questions and represent them for the purpose of classifying by question type. Our key interest is to see whether or not semantic features can enhance the representation of strongly lexical nature of question sentences. We compare two feature sets: one with lexical features only, and another with a mixture of lexical and semantic features. For evaluation, we measure the classification accuracy made by two machine learning algorithms, C5.0 and PEBLS, by using a procedure called domain cross-validation, which effectively measures the domain transferability of features.


Author(s):  
K L Malanchev ◽  
M V Pruzhinskaya ◽  
V S Korolev ◽  
P D Aleo ◽  
M V Kornilov ◽  
...  

Abstract We present results from applying the SNAD anomaly detection pipeline to the third public data release of the Zwicky Transient Facility (ZTF DR3). The pipeline is composed of 3 stages: feature extraction, search of outliers with machine learning algorithms and anomaly identification with followup by human experts. Our analysis concentrates in three ZTF fields, comprising more than 2.25 million objects. A set of 4 automatic learning algorithms was used to identify 277 outliers, which were subsequently scrutinised by an expert. From these, 188 (68%) were found to be bogus light curves – including effects from the image subtraction pipeline as well as overlapping between a star and a known asteroid, 66 (24%) were previously reported sources whereas 23 (8%) correspond to non-catalogued objects, with the two latter cases of potential scientific interest (e. g. 1 spectroscopically confirmed RS Canum Venaticorum star, 4 supernovae candidates, 1 red dwarf flare). Moreover, using results from the expert analysis, we were able to identify a simple bi-dimensional relation which can be used to aid filtering potentially bogus light curves in future studies. We provide a complete list of objects with potential scientific application so they can be further scrutinised by the community. These results confirm the importance of combining automatic machine learning algorithms with domain knowledge in the construction of recommendation systems for astronomy. Our code is publicly available*.


2019 ◽  
Vol 12 (2) ◽  
pp. 50-56
Author(s):  
A. Saravanan ◽  
S. Sathya Bama

Cyber attacks have become quite common in this internet era. The cybercrimes are getting increased every year and the intensity of damage is also increasing. providing security against cyber-attacks becomes the most significant in this digital world. However, ensuring cyber security is an extremely intricate task as requires domain knowledge about the attacks and capability of analysing the possibility of threats. The main challenge of cybersecurity is the evolving nature of the attacks. This paper presents the significance of cyber security along with the various risks that are in the current digital era. The analysis made for cyber-attacks and their statistics shows the intensity of the attacks. Various cybersecurity threats are presented along with the machine learning algorithms that can be applied to cyber attacks detection. The need for the fifth generation cybersecurity architecture is discussed.


Sign in / Sign up

Export Citation Format

Share Document