An automated method to enrich consumer health vocabularies using GloVe word embeddings and an auxiliary lexical resource

2021 ◽  
Vol 7 ◽  
pp. e668
Author(s):  
Mohammed Ibrahim ◽  
Susan Gauch ◽  
Omar Salman ◽  
Mohammed Alqahtani

Background: Clear language makes communication easier between any two parties. A layman may have difficulty communicating with a professional due to not understanding the specialized terms common to the domain. In healthcare, it is rare to find a layman knowledgeable in medical terminology, which can lead to poor understanding of their condition and/or treatment. To bridge this gap, several professional vocabularies and ontologies have been created to map laymen medical terms to professional medical terms and vice versa. Objective: Many of the existing vocabularies are built manually or semi-automatically, requiring large investments of time and human effort; consequently, these vocabularies grow slowly. In this paper, we present an automatic method to enrich laymen's vocabularies that can be applied to vocabularies in any domain. Methods: Our entirely automatic approach uses machine learning, specifically Global Vectors for Word Representation (GloVe), on a corpus collected from a social media healthcare platform to extend and enhance consumer health vocabularies. Our approach further improves the consumer health vocabularies by incorporating synonyms and hyponyms from the WordNet ontology. Basic GloVe and our novel algorithms incorporating WordNet were evaluated using two laymen datasets from the National Library of Medicine (NLM): the Open-Access Consumer Health Vocabulary (OAC CHV) and the MedlinePlus Healthcare Vocabulary. Results: The results show that GloVe was able to find new laymen terms with an F-score of 48.44%. Our enhanced GloVe approach outperformed basic GloVe with an average F-score of 61%, a relative improvement of 25%. Furthermore, the improvement of the enhanced GloVe was statistically significant across the two ground truth datasets (P < 0.001). Conclusions: This paper presents an automatic approach to enrich consumer health vocabularies using GloVe word embeddings and an auxiliary lexical source, WordNet. Our approach was evaluated using healthcare text downloaded from MedHelp.org, a healthcare social media platform, against two standard laymen vocabularies, OAC CHV and MedlinePlus. We used the WordNet ontology to expand the healthcare corpus by including synonyms, hyponyms, and hypernyms for each layman term occurrence in the corpus. Given a seed term selected from a concept in the ontology, we measured our algorithms' ability to automatically extract synonyms for those terms that appeared in the ground truth concept. We found that enhanced GloVe outperformed GloVe with a relative improvement of 25% in the F-score.
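As a rough illustration of the approach, the sketch below loads pre-trained embedding vectors with gensim, proposes the nearest neighbours of a seed laymen term as candidate synonyms, and expands them with synonyms and hyponyms from WordNet via NLTK. The vector file, seed term, and cut-off are assumptions for illustration, not the authors' exact implementation.

```python
# Hedged sketch: candidate laymen-term extraction with word embeddings plus
# WordNet expansion. Vector file, seed term, and cut-offs are assumptions.
from gensim.models import KeyedVectors
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

# Load vectors trained on a healthcare corpus (word2vec text format assumed).
vectors = KeyedVectors.load_word2vec_format("medhelp_glove.txt", binary=False)

def embedding_candidates(seed, topn=20):
    """Nearest neighbours of the seed term in the embedding space."""
    return [word for word, _ in vectors.most_similar(seed, topn=topn)]

def wordnet_expansion(term):
    """Synonyms and hyponyms of the term taken from WordNet."""
    related = set()
    for synset in wn.synsets(term):
        related.update(lemma.replace("_", " ") for lemma in synset.lemma_names())
        for hypo in synset.hyponyms():
            related.update(lemma.replace("_", " ") for lemma in hypo.lemma_names())
    related.discard(term)
    return related

seed = "heartburn"                       # hypothetical seed term from a CHV concept
candidates = set(embedding_candidates(seed)) | wordnet_expansion(seed)
print(sorted(candidates))
```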

2020 ◽  
Author(s):  
Mohammed Ibrahim ◽  
Susan Gauch ◽  
Omar Salman ◽  
Mohammed Alqahtani

BACKGROUND: Clear language makes communication easier between any two parties. A layman may have difficulty communicating with a professional due to not understanding the specialized terms common to the domain. In healthcare, it is rare to find a layman knowledgeable in medical jargon, which can lead to poor understanding of their condition and/or treatment. To bridge this gap, several professional vocabularies and ontologies have been created to map laymen medical terms to professional medical terms and vice versa. OBJECTIVE: Many of the existing vocabularies are built manually or semi-automatically, requiring large investments of time and human effort; consequently, these vocabularies grow slowly. In this paper, we present an automatic method to enrich laymen's vocabularies that can be applied to vocabularies in any domain. METHODS: Our entirely automatic approach uses machine learning, specifically Global Vectors for Word Representation (GloVe), on a corpus collected from a social media healthcare platform to extend and enhance consumer health vocabularies (CHV). Our approach further improves the CHV by incorporating synonyms and hyponyms from the WordNet ontology. Basic GloVe and our novel algorithms incorporating WordNet were evaluated using two laymen datasets from the National Library of Medicine (NLM): the Open-Access Consumer Health Vocabulary (OAC CHV) and the MedlinePlus Healthcare Vocabulary. RESULTS: The results show that GloVe was able to find new laymen terms with an F-score of 48.44%. Furthermore, our enhanced GloVe approach outperformed basic GloVe with an average F-score of 61%, a relative improvement of 25%. CONCLUSIONS: This paper presents an automatic approach to enrich consumer health vocabularies using GloVe word embeddings and an auxiliary lexical source, WordNet. Our approach was evaluated using healthcare text downloaded from MedHelp.org, a healthcare social media platform, against two standard laymen vocabularies, OAC CHV and MedlinePlus. We used the WordNet ontology to expand the healthcare corpus by including synonyms, hyponyms, and hypernyms for each CHV layman term occurrence in the corpus. Given a seed term selected from a concept in the ontology, we measured our algorithms' ability to automatically extract synonyms for those terms that appeared in the ground truth concept. We found that enhanced GloVe outperformed GloVe with a relative improvement of 25% in the F-score.


2020 ◽  
Vol 4 (3) ◽  
pp. 177-190
Author(s):  
Daqing He ◽  
Zhendong Wang ◽  
Khushboo Thaker ◽  
Ning Zou

Academic collections, such as the COVID-19 Open Research Dataset (CORD-19), contain a large number of scholarly articles on COVID-19 and other related viruses. These articles represent the latest developments in combating the COVID-19 pandemic across various disciplines. However, it is difficult for laypeople to access these articles due to the term-mismatch problem caused by their limited medical knowledge. In this article, we present an effort to help laypeople access the CORD-19 collection by translating and expanding laypeople's keywords into their corresponding medical terminology using the National Library of Medicine's Consumer Health Vocabulary. We then developed a retrieval system called Search engine for Laypeople to access the COVID-19 literature (SLAC) using open-source software. Using the Centers for Disease Control and Prevention's FAQ questions as the basis for developing common questions that laypeople could be interested in, we performed a set of experiments testing the SLAC system and the translation and expansion (T&E) process. Our experimental results demonstrate that the T&E process indeed helped to overcome the term-mismatch problem and mapped laypeople's terms to the medical terms in the academic articles. However, we also found that not all of laypeople's search topics are meaningful to search on the CORD-19 collection. This indicates both the scope and the limitations of enabling laypeople to search an academic article collection for high-quality information.
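To illustrate the translation and expansion (T&E) idea, the sketch below maps layperson query terms to professional equivalents through a small lookup table standing in for the NLM Consumer Health Vocabulary; the entries and the query are hypothetical, and the SLAC system itself is not reproduced.

```python
# Hedged sketch of query translation and expansion (T&E). The mapping below is a
# toy stand-in for the NLM Consumer Health Vocabulary, not the SLAC implementation.
chv = {
    "runny nose": ["rhinorrhea"],
    "shortness of breath": ["dyspnea"],
    "fever": ["pyrexia"],
}

def translate_and_expand(query):
    """Return the original query terms plus their professional equivalents."""
    expanded = []
    for term in query:
        expanded.append(term)                   # keep the layperson term
        expanded.extend(chv.get(term, []))      # add mapped medical terminology
    return expanded

print(translate_and_expand(["fever", "shortness of breath", "cough"]))
# ['fever', 'pyrexia', 'shortness of breath', 'dyspnea', 'cough']
```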


2015 ◽  
Vol 2 ◽  
pp. 110-121
Author(s):  
María Shcherbakova

During the last few decades, the problems of terminology have caught the attention of many researchers and scientists, which can be explained by the growing importance of science in people's lives. The formation of medical terminology began centuries ago and continues to develop today. The main objective of this paper is to discuss a terminological glossary of the cardiovascular system created on the basis of the Nomina Anatomica of 2001, together with a comprehensive analysis of the translation of specialized terminology. Apart from developing the Spanish-Russian bilingual glossary, I have also focused on analysing the collected data and on comments that can prevent errors and confusion among translators and recipients of the translated information. To achieve these objectives, the method of analysis of parallel texts on the chosen subject in Spanish and Russian was used, together with visualization, which made it possible to translate the terms from the list while guaranteeing a high level of fidelity, objectivity, accuracy, equivalence, and adequacy. The main hypothesis of this article is that, despite the Greek and Latin origin of most of the selected terms in Spanish and of a large part of the terms in Russian, their literal translation represents the most serious and most common mistake made by translators of medical texts: owing to the particular development of the medical language systems in Spanish and Russian, the terminology of each language has followed its own path of evolution.


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Yahya Albalawi ◽  
Jim Buckley ◽  
Nikola S. Nikolov

This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB, and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two datasets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four out of the 26 pre-processing techniques improve classification accuracy significantly. For the first dataset of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second dataset, Mazajak Skip-Gram pre-trained word embeddings as the input to BLSTM led to the most accurate model, with an F1 score of 75.2% and an accuracy of 90.7%, compared with an F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with a lower accuracy of 70.89%. Our results also show that the performance of the best of the traditional classifiers we trained is comparable to the deep learning methods on the first dataset, but significantly worse on the second dataset.
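A minimal sketch of the best-performing configuration reported above (pre-trained word embeddings feeding a BLSTM) is shown below; the vocabulary size, dimensions, and randomly initialized matrix are placeholders standing in for the real Mazajak vectors and tokenized tweets.

```python
# Hedged sketch (assumed vocabulary size, dimensions, and data): a BLSTM classifier
# fed by fixed pre-trained word embeddings, in the spirit of the Mazajak + BLSTM
# configuration reported above. The random matrix stands in for real Arabic vectors.
import numpy as np
import tensorflow as tf

vocab_size, embed_dim = 20000, 300
embedding_matrix = np.random.rand(vocab_size, embed_dim)    # placeholder for Mazajak vectors

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        vocab_size, embed_dim,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False),                                    # keep pre-trained vectors frozen
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),          # health-related vs. not
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```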


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Sakthi Kumar Arul Prakash ◽  
Conrad Tucker

This work investigates the ability to classify misinformation in online social media networks in a manner that avoids the need for ground truth labels. Rather than approach the classification problem as a task for humans or machine learning algorithms, this work leverages user–user and user–media (i.e., media likes) interactions to infer the type of information (fake vs. authentic) being spread, without needing to know the actual details of the information itself. To study the inception and evolution of user–user and user–media interactions over time, we create an experimental platform that mimics the functionality of real-world social media networks. We develop a graphical model that considers the evolution of this network topology to model the uncertainty (entropy) propagation when fake and authentic media disseminate across the network. The creation of a real-world social media network enables a wide range of hypotheses to be tested pertaining to users, their interactions with other users, and with media content. The discovery that the entropy of user–user and user–media interactions approximates fake and authentic media likes enables us to classify fake media in an unsupervised manner.
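For a concrete sense of the entropy measure the abstract refers to, the sketch below computes the Shannon entropy of a single user's interaction distribution; the counts are invented and the authors' graphical model is not reproduced.

```python
# Hedged sketch: Shannon entropy of an interaction distribution. The counts are
# invented; the graphical model described above is not reproduced here.
import numpy as np

def shannon_entropy(counts):
    """Entropy (in bits) of the empirical distribution implied by raw counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]                       # ignore zero-probability interaction types
    return float(-(p * np.log2(p)).sum())

# e.g. a user's likes spread over [fake media, authentic media, user-user replies]
print(shannon_entropy([12, 3, 5]))     # higher values = more uncertain behaviour
```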


2021 ◽  
Author(s):  
Hansi Hettiarachchi ◽  
Mariam Adedoyin-Olowe ◽  
Jagdev Bhogal ◽  
Mohamed Medhat Gaber

Social media is becoming a primary medium to discuss what is happening around the world. Therefore, the data generated by social media platforms contain rich information that describes ongoing events. Further, the timeliness associated with these data is capable of facilitating immediate insights. However, considering the dynamic nature and high volume of data production in social media data streams, it is impractical to filter events manually, and therefore automated event detection mechanisms are invaluable to the community. Apart from a few notable exceptions, most previous research on automated event detection has focused only on statistical and syntactical features in data and lacked the involvement of underlying semantics, which are important for effective information retrieval from text since they represent the connections between words and their meanings. In this paper, we propose a novel method termed Embed2Detect for event detection in social media by combining the characteristics of word embeddings and hierarchical agglomerative clustering. The adoption of word embeddings gives Embed2Detect the capability to incorporate powerful semantic features into event detection and overcome a major limitation inherent in previous approaches. We evaluated our method on two recent real-world social media datasets representing the sports and political domains, and compared the results to several state-of-the-art methods. The obtained results show that Embed2Detect is capable of effective and efficient event detection and that it outperforms recent event detection methods. For the sports dataset, Embed2Detect achieved a 27% higher F-measure than the best-performing baseline, and for the political dataset the increase was 29%.
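The sketch below illustrates the core combination Embed2Detect builds on, hierarchical agglomerative clustering over word embeddings; the vectors are random placeholders and the paper's actual event-detection logic is not reproduced.

```python
# Hedged sketch: grouping words by hierarchical agglomerative clustering of their
# embeddings, the combination Embed2Detect builds on. Vectors here are random
# placeholders; the event-detection logic of the paper is not reproduced.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
words = ["goal", "penalty", "referee", "election", "ballot", "parliament"]
vectors = rng.normal(size=(len(words), 50))        # stand-in for learned embeddings

clustering = AgglomerativeClustering(n_clusters=2, linkage="average", metric="cosine")
labels = clustering.fit_predict(vectors)

for word, label in zip(words, labels):
    print(f"{word}: cluster {label}")              # words sharing a cluster hint at one event
```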


Author(s):  
Richard Fletcher ◽  
Steve Schifferes ◽  
Neil Thurman

Social media is now used as an information source in many different contexts. For professional journalists, the use of social media for news production creates new challenges for the verification process. This article describes the development and evaluation of the ‘Truthmeter’ – a tool that automatically scores the journalistic credibility of social media contributors in order to inform overall credibility assessments. The Truthmeter was evaluated using a three-stage process that used both qualitative and quantitative methods, consisting of (1) obtaining a ground truth, (2) building a description of existing practices and (3) calibration, modification and testing. As a result of the evaluation process, which could be generalized and applied in other contexts, the Truthmeter produced credibility scores that were closely aligned with those of trainee journalists. Substantively, the evaluation also highlighted the importance of ‘relational’ credibility assessments, where credibility may be attributed based on networked connections to other credible contributors.


Author(s):  
TATIANA V. LUKOYANOVA ◽  
LYUBOV M. KASIMTSEVA ◽  

The problem of interaction between language and society is increasingly relevant in the modern world. Changes in the world have influenced discourse in various spheres of communication. Moreover, the progress of medicine has led to changes in the system of medical terms and, consequently, to difficulties in understanding those terms in the ordinary consciousness. A person always wants to understand other people and to be understood by them; this is the main goal of communication. The article deals with the functioning of medical terms in the minds of first-year French-speaking students. First-year students operate with the ordinary consciousness, so the medical terms they study often function there in an incorrect form: they rest on the associations and feelings a person forms during the process of cognition and therefore must be interpreted. During the learning process, there is a need to clarify the personal meanings that medical students attach to special terms. In this article the author presents the results of research on French-speaking students' knowledge of medical terminology, gives examples of how medical terms are perceived and reinterpreted on the basis of common knowledge, and examines how scientific knowledge is formed so that the language of medicine develops in students' consciousness and can be used competently in the future.


Author(s):  
Mohamed Estai ◽  
Marc Tennant ◽  
Dieter Gebauer ◽  
Andrew Brostek ◽  
Janardhan Vignarajan ◽  
...  

Objective: This study aimed to evaluate an automated detection system to detect and classify permanent teeth on orthopantomogram (OPG) images using convolutional neural networks (CNNs). Methods: In total, 591 digital OPGs were collected from patients older than 18 years. Three qualified dentists performed individual tooth labelling on the images to generate the ground truth annotations. A three-step procedure, relying upon CNNs, was proposed for automated detection and classification of teeth. First, U-Net, a type of CNN, performed preliminary segmentation of tooth regions, detecting regions of interest (ROIs) on the panoramic images. Second, Faster R-CNN, an advanced object detection architecture, identified each tooth within the ROIs determined by the U-Net. Third, a VGG-16 architecture classified each tooth into one of 32 categories, assigning a tooth number. A total of 17,135 teeth cropped from 591 radiographs were used to train and validate the tooth detection and tooth numbering modules. 90% of the OPG images were used for training and the remaining 10% for validation; 10-fold cross-validation was performed to measure performance. The intersection over union (IoU), F1 score, precision, and recall (i.e. sensitivity) were used as metrics to evaluate the performance of the resultant CNNs. Results: The ROI detection module had an IoU of 0.70. The tooth detection module achieved a recall of 0.99 and a precision of 0.99. The tooth numbering module had a recall, precision, and F1 score of 0.98. Conclusion: The resultant automated method achieved high performance for automated tooth detection and numbering from OPG images. Deep learning can be helpful in the automatic filing of dental charts in general dentistry and forensic medicine.
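For reference, the sketch below shows how the reported evaluation metrics can be computed: bounding-box IoU from corner coordinates, and precision, recall, and F1 from detection counts. The boxes and counts are invented examples, not the study's data.

```python
# Hedged sketch: the evaluation metrics named above (IoU, precision, recall, F1),
# computed on invented boxes and counts rather than the study's data.
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                  # recall = sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(round(iou((10, 10, 50, 50), (30, 30, 70, 70)), 3))   # overlap of two example boxes
print(precision_recall_f1(tp=95, fp=1, fn=1))
```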


2021 ◽  
Vol 11 (3) ◽  
pp. 56
Author(s):  
Thapanee Seechaliao ◽  
Phamornpun Yurayat

The main research purpose was to examine the effects of conducting an instructional model based on the principles of creative problem solving with social media to promote the creation of educational innovations by pre-service teachers. The participants consisted of twelve pre-service teachers. The research instruments were 1) the instructional model based on the principles of creative problem solving with social media, 2) a test of knowledge and creation of educational innovation, 3) an evaluation form for the created educational innovations, and 4) questionnaires on conducting this instructional model. Collected data were analyzed statistically and categorized into key issues based on the literature. Results were reported using the Shapiro-Wilk test, the Wilcoxon signed-rank test, arithmetic means, standard deviations, and descriptive analysis. The research findings were as follows: 1) the instructional model was conducted over sixteen weeks in the course 0537211 Innovation in Educational Technology and Communications in the first semester of 2020, and the research hypotheses were supported: 1.1) the pre-service teachers' post-test scores on knowledge and creation of educational innovation were higher than their pre-test scores, with statistical significance at the .01 level; 1.2) their post-learning scores for the process of creating educational innovations were at an overall excellent level (M = 92.83, S.D. = 11.78), and their educational innovations were rated at an overall good level (M = 48.33, S.D. = 7.45); 2) the pre-service teachers held positive opinions toward conducting this instructional model, at an overall excellent level (M = 4.92, S.D. = 0.25).
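As an illustration of the pre/post comparison reported above, the sketch below applies the Wilcoxon signed-rank test to made-up score vectors for twelve participants; the numbers are invented and do not reproduce the study's data.

```python
# Hedged sketch: Wilcoxon signed-rank test on invented pre/post scores for twelve
# participants, mirroring the analysis named above but not the study's data.
from scipy.stats import wilcoxon

pre  = [52, 48, 55, 60, 47, 50, 58, 53, 49, 51, 56, 54]   # hypothetical pre-test scores
post = [78, 74, 81, 85, 70, 76, 83, 79, 72, 77, 84, 80]   # hypothetical post-test scores

stat, p_value = wilcoxon(pre, post)        # paired, non-parametric comparison
print(f"W = {stat}, p = {p_value:.4f}")    # small p suggests a significant pre/post change
```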

