Word Embedding for the French Natural Language in Health Care: Comparative Study

10.2196/12310 ◽  
2019 ◽  
Vol 7 (3) ◽  
pp. e12310 ◽  
Author(s):  
Emeric Dynomant ◽  
Romain Lelong ◽  
Badisse Dahamna ◽  
Clément Massonnaud ◽  
Gaétan Kerdelhué ◽  
...  

Background Word embedding technologies, a set of language modeling and feature learning techniques in natural language processing (NLP), are now used in a wide range of applications. However, no formal evaluation and comparison have been made on the ability of each of the 3 current most famous unsupervised implementations (Word2Vec, GloVe, and FastText) to keep track of the semantic similarities existing between words, when trained on the same dataset. Objective The aim of this study was to compare embedding methods trained on a corpus of French health-related documents produced in a professional context. The best method will then help us develop a new semantic annotator. Methods Unsupervised embedding models have been trained on 641,279 documents originating from the Rouen University Hospital. These data are not structured and cover a wide range of documents produced in a clinical setting (discharge summary, procedure reports, and prescriptions). In total, 4 rated evaluation tasks were defined (cosine similarity, odd one, analogy-based operations, and human formal evaluation) and applied on each model, as well as embedding visualization. Results Word2Vec had the highest score on 3 out of 4 rated tasks (analogy-based operations, odd one similarity, and human validation), particularly regarding the skip-gram architecture. Conclusions Although this implementation had the best rate for semantic properties conservation, each model has its own qualities and defects, such as the training time, which is very short for GloVe, or morphological similarity conservation observed with FastText. Models and test sets produced by this study will be the first to be publicly available through a graphical interface to help advance the French biomedical research.
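The cosine-similarity and analogy-based rated tasks can be illustrated with a toy sketch. The vectors below are hypothetical 4-dimensional stand-ins, not the study's trained French clinical embeddings, and the vocabulary is illustrative only.

```python
import numpy as np

# Hypothetical toy embeddings; the study's models are trained on 641,279
# clinical documents and are far higher-dimensional.
vectors = {
    "roi":   np.array([0.9, 0.1, 0.8, 0.1]),  # "king"
    "reine": np.array([0.9, 0.1, 0.1, 0.9]),  # "queen"
    "homme": np.array([0.1, 0.8, 0.8, 0.1]),  # "man"
    "femme": np.array([0.1, 0.8, 0.1, 0.9]),  # "woman"
}

def cosine(u, v):
    """Cosine similarity, the first of the four rated tasks."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vectors):
    """Analogy-based operation: a - b + c ~ ?  (e.g. roi - homme + femme ~ reine)."""
    target = vectors[a] - vectors[b] + vectors[c]
    # Exclude the query words themselves, as is conventional for this task.
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))
```

With real trained models, the same two functions would be applied to the learned vectors of each implementation (Word2Vec, GloVe, FastText) to produce comparable task scores.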

2018 ◽  
Author(s):  
Emeric Dynomant ◽  
Romain Lelong ◽  
Badisse Dahamna ◽  
Clément Massonnaud ◽  
Gaétan Kerdelhué ◽  
...  

BACKGROUND Word embedding technologies are now used in a wide range of applications. However, no formal evaluation and comparison have been made of the models produced by the three best-known implementations (Word2Vec, GloVe, and FastText). OBJECTIVE The goal of this study was to compare embedding implementations on a corpus of documents produced in a working context by health professionals. METHODS Models were trained on documents from the Rouen University Hospital. These data are unstructured and cover a wide range of documents produced in a clinical setting (discharge summaries, prescriptions ...). Four evaluation tasks were defined (cosine similarity, odd one out, mathematical operations, and human formal evaluation) and applied to each model. RESULTS Word2Vec had the highest score on three of the four tasks (mathematical operations, odd-one-out similarity, and human validation), particularly with the Skip-Gram architecture. CONCLUSIONS Although this implementation scored best, each model has its own strengths and weaknesses, such as the very short training time of GloVe or the morphosyntactic similarity conservation observed with FastText. The models and test sets produced by this study will be the first made publicly available through a graphical interface to help advance French biomedical research.


Author(s):  
Clifford Nangle ◽  
Stuart McTaggart ◽  
Margaret MacLeod ◽  
Jackie Caldwell ◽  
Marion Bennie

ABSTRACT Objectives: The Prescribing Information System (PIS) datamart, hosted by NHS National Services Scotland, receives around 90 million electronic prescription messages per year from GP practices across Scotland. Prescription messages contain information including drug name, quantity and strength stored as coded, machine-readable data, while prescription dose instructions are unstructured free text and difficult to interpret and analyse in volume. The aim, using Natural Language Processing (NLP), was to extract drug dose amount, unit and frequency metadata from freely typed text in dose instructions to support calculating the intended number of days' treatment. This then allows comparison with actual prescription frequency, treatment adherence and the impact upon prescribing safety and effectiveness. Approach: An NLP algorithm was developed using the Ciao implementation of Prolog to extract dose amount, unit and frequency metadata from dose instructions held in the PIS datamart for drugs used in the treatment of gastrointestinal, cardiovascular and respiratory disease. Accuracy estimates were obtained by randomly sampling 0.1% of the distinct dose instructions from source records, comparing these with metadata extracted by the algorithm, and an iterative approach was used to modify the algorithm to increase accuracy and coverage. Results: The NLP algorithm was applied to 39,943,465 prescription instructions issued in 2014, consisting of 575,340 distinct dose instructions. For drugs used in the gastrointestinal, cardiovascular and respiratory systems (i.e. chapters 1, 2 and 3 of the British National Formulary (BNF)) the NLP algorithm successfully extracted drug dose amount, unit and frequency metadata from 95.1%, 98.5% and 97.4% of prescriptions respectively.
However, instructions containing terms such as 'as directed' or 'as required' reduce the usability of the metadata by making it difficult to calculate the total dose intended for a specific time period: 7.9%, 0.9% and 27.9% of dose instructions contained terms meaning 'as required', while 3.2%, 3.7% and 4.0% contained terms meaning 'as directed', for drugs used in BNF chapters 1, 2 and 3 respectively. Conclusion: The NLP algorithm developed can extract dose, unit and frequency metadata from text found in prescriptions issued to treat a wide range of conditions, and this information may be used to support calculating treatment durations, medicines adherence and cumulative drug exposure. The presence of terms such as 'as required' and 'as directed' has a negative impact on the usability of the metadata, and further work is required to determine the level of impact this has on calculating treatment durations and cumulative drug exposure.
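The amount/unit/frequency extraction can be sketched with a hypothetical regular-expression grammar. The study's actual algorithm was written in the Ciao implementation of Prolog and covers far more phrasings; the pattern and unit list below are illustrative assumptions.

```python
import re

# Hypothetical pattern covering a narrow slice of dose-instruction phrasings.
DOSE_RE = re.compile(
    r"(?P<amount>\d+(\.\d+)?)\s*"
    r"(?P<unit>tablet|capsule|ml|mg|puff)s?\s+"
    r"(?P<frequency>once|twice|three times|four times)\s+(a|per)\s+day",
    re.IGNORECASE,
)

FREQ_PER_DAY = {"once": 1, "twice": 2, "three times": 3, "four times": 4}

def parse_dose(instruction):
    """Return (amount, unit, times_per_day), or None for free text such as
    'as directed', which the study found reduces metadata usability."""
    m = DOSE_RE.search(instruction)
    if not m:
        return None
    return (float(m.group("amount")),
            m.group("unit").lower(),
            FREQ_PER_DAY[m.group("frequency").lower()])
```

From a parsed result, the intended number of days' treatment follows as quantity divided by (amount x times_per_day), which is the comparison with actual prescription frequency the abstract describes.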


10.2196/23230 ◽  
2021 ◽  
Vol 9 (8) ◽  
pp. e23230
Author(s):  
Pei-Fu Chen ◽  
Ssu-Ming Wang ◽  
Wei-Chih Liao ◽  
Lu-Cheng Kuo ◽  
Kuan-Chih Chen ◽  
...  

Background The International Classification of Diseases (ICD) code is widely used as the reference for medical systems and billing purposes. However, classifying diseases into ICD codes still mainly relies on humans reading a large amount of written material as the basis for coding. Coding is both laborious and time-consuming. Since the conversion of ICD-9 to ICD-10, the coding task has become much more complicated, and deep learning– and natural language processing–related approaches have been studied to assist disease coders. Objective This paper aims at constructing a deep learning model for ICD-10 coding, where the model is meant to automatically determine the corresponding diagnosis and procedure codes based solely on free-text medical notes to improve accuracy and reduce human effort. Methods We used diagnosis records of the National Taiwan University Hospital as resources and applied natural language processing techniques, including global vectors, word to vectors, embeddings from language models, bidirectional encoder representations from transformers, and single head attention recurrent neural network, on the deep neural network architecture to implement ICD-10 auto-coding. In addition, we introduced the attention mechanism into the classification model to extract the keywords from diagnoses and visualize the coding reference for training novice ICD-10 coders. Sixty discharge notes were randomly selected to examine the change in the F1-score and the coding time by coders before and after using our model. Results In experiments on the medical data set of National Taiwan University Hospital, our prediction results revealed F1-scores of 0.715 and 0.618 for the ICD-10 Clinical Modification code and Procedure Coding System code, respectively, with a bidirectional encoder representations from transformers embedding approach in the Gated Recurrent Unit classification model. The well-trained models were applied on the ICD-10 web service for coding and training to ICD-10 users.
With this service, coders achieved a significantly higher F1-score (median increased from 0.832 to 0.922; P<.05), although coding time was not significantly reduced. Conclusions The proposed model significantly improved the F1-score but did not decrease the time consumed in coding by disease coders.
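The keyword-extraction role of the attention mechanism can be sketched with a scaled dot-product attention pass over token embeddings. This is a minimal NumPy stand-in, not the authors' BERT/GRU pipeline; the token vectors and query vector are hypothetical.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_weights(token_vectors, query):
    """Score each token embedding against a query vector (scaled dot product)
    and normalize; high-weight tokens are the 'keywords' to visualize."""
    scores = token_vectors @ query / np.sqrt(len(query))
    return softmax(scores)

# Illustrative: 3 tokens, 4-dim embeddings; the second token dominates,
# so it would be highlighted as a coding cue.
tokens = np.array([[0.1, 0.0, 0.1, 0.0],
                   [0.9, 0.8, 0.9, 0.7],
                   [0.2, 0.1, 0.0, 0.1]])
query = np.array([1.0, 1.0, 1.0, 1.0])
weights = attention_weights(tokens, query)
```

In a trained model the query vector is learned; visualizing `weights` over diagnosis text is what lets novice coders see which phrases drove a predicted code.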


Author(s):  
Phillip Osial ◽  
Arnold Kim ◽  
Kalle Kauranen

Despite rapid advancements in technology, the healthcare industry is known to lag behind when it comes to adopting new changes. Most often, when a new technology such as a CPOE or EHR system presents itself in the healthcare industry, clinicians are left struggling to keep up with their workloads while learning to adjust to a new workflow. Instead of disrupting the clinician's clinical workflow, the authors propose a system for transforming clinical narratives, presented in the form of discharge summaries from the i2b2 Natural Language Processing dataset, into a standardized order set. The proposed system uses natural language processing techniques based on Scala to extract discharge summary information about a patient, and has proven highly scalable. The goal of this system is to increase interoperability between CPOE systems by performing further transformations on the extracted data. The authors adhere to HL7's FHIR standards and use JSON as the primary medical messaging format, which is used by both US and international healthcare industry organizations and companies.
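The FHIR/JSON output stage can be sketched as follows. The system described is built in Scala; this Python sketch is only illustrative, and the field subset shown is a minimal slice of a FHIR MedicationRequest resource, not a complete or validated one.

```python
import json

def to_fhir_medication_request(drug_text, patient_id):
    """Wrap an order extracted from a discharge summary as a minimal
    FHIR-style MedicationRequest JSON message (illustrative field subset)."""
    resource = {
        "resourceType": "MedicationRequest",
        "status": "active",
        "intent": "order",
        "medicationCodeableConcept": {"text": drug_text},
        "subject": {"reference": f"Patient/{patient_id}"},
    }
    return json.dumps(resource)

# Hypothetical extracted order; 'example-001' is a made-up patient id.
msg = to_fhir_medication_request("aspirin 81 mg oral tablet", "example-001")
```

Emitting a standard resource type rather than a proprietary format is what makes the extracted order sets consumable by different CPOE systems.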


AI Magazine ◽  
2015 ◽  
Vol 36 (1) ◽  
pp. 99-102
Author(s):  
Tiffany Barnes ◽  
Oliver Bown ◽  
Michael Buro ◽  
Michael Cook ◽  
Arne Eigenfeldt ◽  
...  

The AIIDE-14 Workshop program was held Friday and Saturday, October 3–4, 2014 at North Carolina State University in Raleigh, North Carolina. The workshop program included five workshops covering a wide range of topics. The titles of the workshops held Friday were Games and Natural Language Processing, and Artificial Intelligence in Adversarial Real-Time Games. The titles of the workshops held Saturday were Diversity in Games Research, Experimental Artificial Intelligence in Games, and Musical Metacreation. This article presents short summaries of those events.


2004 ◽  
Vol 10 (1) ◽  
pp. 57-89 ◽  
Author(s):  
MARJORIE MCSHANE ◽  
SERGEI NIRENBURG ◽  
RON ZACHARSKI

The topic of mood and modality (MOD) is a difficult aspect of language description because, among other reasons, the inventory of modal meanings is not stable across languages, moods do not map neatly from one language to another, modality may be realised morphologically or by free-standing words, and modality interacts in complex ways with other modules of the grammar, like tense and aspect. Describing MOD is especially difficult if one attempts to develop a unified approach that not only provides cross-linguistic coverage, but is also useful in practical natural language processing systems. This article discusses an approach to MOD that was developed for and implemented in the Boas Knowledge-Elicitation (KE) system. Boas elicits knowledge about any language, L, from an informant who need not be a trained linguist. That knowledge then serves as the static resources for an L-to-English translation system. The KE methodology used throughout Boas is driven by a resident inventory of parameters, value sets, and means of their realisation for a wide range of language phenomena. MOD is one of those parameters, whose values are the inventory of attested and not yet attested moods (e.g. indicative, conditional, imperative), and whose realisations include flective morphology, agglutinating morphology, isolating morphology, words, phrases and constructions. Developing the MOD elicitation procedures for Boas amounted to wedding the extensive theoretical and descriptive research on MOD with practical approaches to guiding an untrained informant through this non-trivial task. We believe that our experience in building the MOD module of Boas offers insights not only into cross-linguistic aspects of MOD that have not previously been detailed in the natural language processing literature, but also into KE methodologies that could be applied more broadly.
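The parameter/value/realisation scheme that drives Boas elicitation can be pictured as a small data structure. This is a toy encoding under stated assumptions: the value and realisation lists below are the examples named in the abstract, not the full resident inventory.

```python
# Illustrative slice of the Boas-style inventory for the MOD parameter.
MOD = {
    "parameter": "mood",
    "values": ["indicative", "conditional", "imperative"],
    "realisations": ["flective morphology", "agglutinating morphology",
                     "isolating morphology", "words", "phrases",
                     "constructions"],
}

def elicit(value, realisation):
    """Check an informant's answer against the resident inventory; a real
    KE system would also record not-yet-attested values the informant adds."""
    return value in MOD["values"] and realisation in MOD["realisations"]
```

Driving elicitation from such an inventory is what lets an untrained informant describe language L by selecting values and realisations rather than writing grammar rules.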


2020 ◽  
Author(s):  
Masashi Sugiyama

Recently, word embeddings have been used successfully in many natural language processing problems, and how to train a robust and accurate word embedding system efficiently is a popular research area. Since many, if not all, words have more than one sense, it is necessary to learn vectors for each sense of a word separately. Therefore, in this project, we have explored two multi-sense word embedding models: the Multi-Sense Skip-gram (MSSG) model and the Non-Parametric Multi-Sense Skip-gram (NP-MSSG) model. Furthermore, we propose an extension of the Multi-Sense Skip-gram model called the Incremental Multi-Sense Skip-gram (IMSSG) model, which can learn the vectors of all senses per word incrementally. We evaluate all the systems on the word similarity task and show that IMSSG is better than the other models.
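The sense-assignment step shared by these models can be illustrated with a toy sketch: average the context-window vectors and pick the sense whose cluster center is closest in cosine similarity. The 2-dimensional vectors and the "bank" example are hypothetical, not learned parameters.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def assign_sense(context_vectors, sense_centers):
    """MSSG-style hard sense assignment: average the context window and
    return the index of the most similar sense cluster center."""
    context = np.mean(context_vectors, axis=0)
    sims = [cosine(context, c) for c in sense_centers]
    return int(np.argmax(sims))

# Toy example: "bank" with a finance sense and a river sense.
finance_center = np.array([1.0, 0.0])
river_center = np.array([0.0, 1.0])
money_context = [np.array([0.9, 0.1]), np.array([0.8, 0.2])]
sense = assign_sense(money_context, [finance_center, river_center])
```

During training, only the chosen sense vector is updated for that occurrence; the non-parametric and incremental variants additionally decide when to spawn a new sense cluster instead of reusing an existing one.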


Sentiment classification is one of the best-known and most popular domains of machine learning and natural language processing, in which an algorithm is developed to understand the opinion of an entity in a way similar to human beings. This article presents research along those lines. Concepts from natural language processing are used for text representation, and a novel word embedding model is then proposed for effective classification of the data. TF-IDF and common bag-of-words (BoW) representation models were considered for representing the text data; the importance of these models is discussed in the respective sections. The proposed model is tested on the IMDB dataset, using a 50% training / 50% testing split with three random shufflings of the dataset for evaluation.
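The TF-IDF representation mentioned above can be sketched in a few lines. This is a minimal stand-in using smoothed IDF, not the article's exact weighting scheme; the tokenized documents are toy examples.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Plain TF-IDF over pre-tokenized documents: term frequency within a
    document, scaled by smoothed inverse document frequency."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # Smoothed IDF so terms present in every document still get weight > 0.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * idf[t] for t in tf})
    return vectors

# Toy IMDB-style snippets, already tokenized.
docs = [["great", "movie"], ["terrible", "movie"], ["great", "acting"]]
vecs = tf_idf(docs)
```

A common-BoW representation would keep only the raw counts (`tf`), which is why TF-IDF typically separates sentiment-bearing words better: frequent, uninformative terms such as "movie" are down-weighted.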


2017 ◽  
Vol 22 (1) ◽  
pp. 7-31 ◽  
Author(s):  
Alexandra Pomares Quimbaya ◽  
Rafael A. Gonzalez ◽  
Oscar Muñoz ◽  
Olga Milena García ◽  
Wilson Ricardo Bohorquez

Objective: Electronic medical records (EMR) typically contain both structured attributes and narrative text. The usefulness of EMR for research and administration is hampered by the difficulty of automatically analyzing their narrative portions. Accordingly, this paper proposes SPIRE, a strategy for prioritizing EMR using natural language processing in combination with analysis of structured data, in order to identify and rank EMR that match specific queries from clinical researchers and health administrators. Materials and Methods: The resulting software tool was evaluated technically and validated with three cases (heart failure, pulmonary hypertension, and diabetes mellitus), comparing its output against expert-obtained results. Results and Discussion: Our preliminary results show high sensitivity (70%, 82% and 87%, respectively) and specificity (85%, 73.7% and 87.5%) in the resulting set of records. The AUC was between 0.84 and 0.9. Conclusions: SPIRE was successfully implemented and used in the context of a university hospital information system, enabling clinical researchers to obtain prioritized EMR to solve their information needs through collaborative search templates, with faster and more accurate results than other existing methods.
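The idea of combining structured attributes with narrative-text matching to rank records can be sketched as a scoring rule. Everything here is a hypothetical illustration: the weights, the ICD-10 filter, and the example records are not from the SPIRE paper.

```python
def priority_score(record, query_terms, structured_filter):
    """Hypothetical SPIRE-style ranking: blend a structured-data match
    (e.g. a diagnosis-code predicate) with the fraction of query terms
    found in the narrative text. The 0.5/0.5 weights are illustrative."""
    text = record["narrative"].lower()
    text_score = sum(t.lower() in text for t in query_terms) / len(query_terms)
    structured_score = 1.0 if structured_filter(record) else 0.0
    return 0.5 * structured_score + 0.5 * text_score

# Made-up records for a heart-failure query.
records = [
    {"icd10": "I50.9",
     "narrative": "Patient with dyspnea and reduced ejection fraction."},
    {"icd10": "E11.9",
     "narrative": "Routine follow-up, no acute complaints."},
]
ranked = sorted(
    records,
    key=lambda r: priority_score(r, ["dyspnea", "ejection"],
                                 lambda rec: rec["icd10"].startswith("I50")),
    reverse=True,
)
```

Ranking (rather than filtering) is the key design choice: records that match only the narrative or only the structured data still surface, just lower in the list.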

