2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text

2011 ◽  
Vol 18 (5) ◽  
pp. 552-556 ◽  
Author(s):  
Özlem Uzuner ◽  
Brett R South ◽  
Shuying Shen ◽  
Scott L DuVall

Abstract The 2010 i2b2/VA Workshop on Natural Language Processing Challenges for Clinical Records presented three tasks: a concept extraction task focused on the extraction of medical concepts from patient reports; an assertion classification task focused on assigning assertion types for medical problem concepts; and a relation classification task focused on assigning relation types that hold between medical problems, tests, and treatments. i2b2 and the VA provided an annotated reference standard corpus for the three tasks. Using this reference standard, 22 systems were developed for concept extraction, 21 for assertion classification, and 16 for relation classification. These systems showed that machine learning approaches could be augmented with rule-based systems to determine concepts, assertions, and relations. Depending on the task, the rule-based systems can either provide input for machine learning or post-process the output of machine learning. Ensembles of classifiers, information from unlabeled data, and external knowledge sources can help when the training data are inadequate.
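The abstract notes that rule-based systems can post-process machine learning output. A minimal hypothetical sketch of such a post-processing rule for assertion classification (the cue list, window size, and labels are invented for illustration, not taken from the challenge systems):

```python
# Illustrative post-processing rule: a negation cue shortly before a
# medical problem concept overrides a model's "present" assertion.

NEGATION_CUES = {"no", "denies", "without"}

def postprocess(tokens, concept_span, predicted_assertion):
    # concept_span is a (start, end) token index pair for the concept
    start, _ = concept_span
    window = {t.lower() for t in tokens[max(0, start - 3):start]}
    if window & NEGATION_CUES:
        return "absent"
    return predicted_assertion
```

For example, with tokens `["Patient", "denies", "chest", "pain"]` and the concept span covering "chest pain", a model prediction of "present" would be corrected to "absent".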

Author(s):  
Raymond Chiong

In the field of Natural Language Processing, one of the most important research areas within Information Extraction (IE) is Named Entity Recognition (NER). NER is a subtask of IE that seeks to identify named entities in text documents and classify them into predefined categories. A considerable amount of work has been done on NER in recent years, owing to the increasing demand for automated text processing and the wide availability of electronic corpora. While it is relatively easy and natural for a human reader to understand the context of a given article, getting a machine to understand and differentiate between words is a big challenge. For instance, the word ‘brown’ may refer to a person called Mr. Brown, or to the colour of an item that is brown. Human readers can easily discern the meaning of the word from the context of the sentence, but it would be almost impossible for a computer to interpret it without additional information. To deal with this issue, researchers in the NER field have proposed various rule-based systems (Wakao, Gaizauskas & Wilks, 1996; Krupka & Hausman, 1998; Maynard, Tablan, Ursu, Cunningham & Wilks, 2001). These systems are able to achieve high recognition accuracy with the help of lists of known named entities called gazetteers. The problem with the rule-based approach is that it lacks robustness and portability, and it incurs steep maintenance costs, especially when new rules need to be introduced for new information or new domains. A better option is thus a machine learning approach that is trainable and adaptable. Three well-known machine learning approaches that have been used extensively in NER are the Hidden Markov Model (HMM), the Maximum Entropy Model (MEM) and the Decision Tree.
Many of the existing machine learning-based NER systems (Bikel, Schwartz & Weischedel, 1999; Zhou & Su, 2002; Borthwick, Sterling, Agichten & Grisham, 1998; Bender, Och & Ney, 2003; Chieu & Ng, 2002; Sekine, Grisham & Shinnou, 1998) are able to achieve near-human performance for named entity tagging, even though their overall performance is still about 2% below that of the rule-based systems. There have also been many attempts to improve the performance of NER using a hybrid approach that combines handcrafted rules with statistical models (Mikheev, Moens & Grover, 1999; Srihari & Li, 2000; Seon, Ko, Kim & Seo, 2001). These systems can achieve relatively good performance in their targeted domains owing to the comprehensive handcrafted rules. Nevertheless, the portability problem remains unsolved when it comes to dealing with NER across various domains. As such, this article presents a hybrid machine learning approach that uses MEM and HMM in succession. The reason for using two statistical models in succession instead of one lies in the distinctive nature of the two models. HMM achieves better performance than other statistical models and is generally regarded as the most successful machine learning approach; however, it suffers from a data sparseness problem, meaning that a considerable amount of data is needed before it achieves acceptable performance. MEM, on the other hand, maintains reasonable performance even when little data is available for training. The idea is therefore to walk through the testing corpus with MEM first to generate a temporary tagging result, a procedure that simultaneously serves as a training process for HMM. In the second walkthrough, HMM produces the final tagging, using the temporary tagging result generated by MEM as a reference for subsequent error checking and correction.
Even when little training data is available, the final result can still be reliable thanks to the contribution of the initial MEM tagging result.
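The two-pass cascade described above can be sketched in miniature. In this toy version (all words, tags, and scores are invented; this is not the article's model), a memorised unigram tagger stands in for the MEM, and a count-based bigram transition model with Viterbi decoding stands in for the HMM, with the first-pass output supplying the HMM's training material:

```python
from collections import defaultdict

# Toy two-pass tagger sketching the MEM-then-HMM cascade: a unigram
# lookup stands in for the MEM; its first-pass output trains the HMM's
# transition counts, and Viterbi decoding produces the final tags.

SEED = [("Mr.", "O"), ("Brown", "PER"), ("visited", "O"), ("London", "LOC"),
        ("Ms.", "O"), ("Green", "PER"), ("left", "O"), ("Paris", "LOC")]

def train_unigram(pairs):
    counts = defaultdict(lambda: defaultdict(int))
    for tok, tag in pairs:
        counts[tok][tag] += 1
    return {tok: max(tags, key=tags.get) for tok, tags in counts.items()}

def first_pass(tokens, model):
    # MEM stand-in: independent per-token decisions, unknowns default to "O"
    return [model.get(t, "O") for t in tokens]

def train_transitions(tags):
    trans = defaultdict(lambda: defaultdict(int))
    for a, b in zip(tags, tags[1:]):
        trans[a][b] += 1
    return trans

def second_pass(tokens, model, trans, states=("O", "PER", "LOC")):
    # HMM stand-in: Viterbi over transition counts, with emissions that
    # prefer agreement with the first-pass (MEM) tags
    guess = first_pass(tokens, model)
    col = {s: ((1.0 if s == guess[0] else 0.1), [s]) for s in states}
    for i in range(1, len(tokens)):
        new = {}
        for s in states:
            emit = 1.0 if s == guess[i] else 0.1
            best = max(states, key=lambda p: col[p][0] * (1 + trans[p][s]))
            new[s] = (col[best][0] * (1 + trans[best][s]) * emit,
                      col[best][1] + [s])
        col = new
    return max(col.values(), key=lambda v: v[0])[1]

model = train_unigram(SEED)
trans = train_transitions(first_pass([t for t, _ in SEED], model))
```

For `["Mr.", "Brown", "visited", "Paris"]` the second pass confirms the first-pass tags `O PER O LOC`; with sparser seed data, the transition counts would instead help override unreliable first-pass guesses, which is the error-correction role the article assigns to the HMM stage.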


Literator ◽  
2008 ◽  
Vol 29 (1) ◽  
pp. 21-42 ◽  
Author(s):  
S. Pilon ◽  
M.J. Puttkammer ◽  
G.B. Van Huyssteen

The development of a hyphenator and compound analyser for Afrikaans. This article describes the development of two core technologies for Afrikaans, viz. a hyphenator and a compound analyser. As no annotated Afrikaans data existed prior to this project to serve as training data for a machine learning classifier, the core technologies in question were first developed using a rule-based approach. The rule-based hyphenator and compound analyser were evaluated: the hyphenator obtains an f-score of 90,84%, while the compound analyser only reaches an f-score of 78,20%. Since these results are somewhat disappointing and/or insufficient for practical implementation, it was decided that a machine learning technique (memory-based learning) would be used instead. Training data for each of the two core technologies was then developed using “TurboAnnotate”, an interface designed to improve the accuracy and speed of manual annotation. The hyphenator developed using machine learning was trained on 39 943 words and reaches an f-score of 98,11%, while the compound analyser reaches an f-score of 90,57% after being trained on 77 589 annotated words. It is concluded that machine learning (specifically memory-based learning) is an appropriate approach for developing core technologies for Afrikaans.
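Memory-based learning, as used above, reduces a task to storing labelled instances and classifying new ones by their nearest stored instance. A toy sketch of hyphenation framed this way (the stored character windows below are invented English examples, not the article's Afrikaans training data, and real systems use richer features and k > 1 neighbours):

```python
# Memory-based hyphenation sketch: each position in a word is represented
# by a character window and labelled break (1) or no-break (0); a new
# window takes the label of its most similar stored window.

MEMORY = [
    ("wate", 1),  # toy instance: break allowed here ("wa-ter")
    ("ater", 0),
    ("pape", 1),  # toy instance: break allowed here ("pa-per")
    ("aper", 0),
]

def overlap(a, b):
    # positional character overlap, a crude similarity metric
    return sum(x == y for x, y in zip(a, b))

def may_break(window):
    # nearest-neighbour classification over the instance memory
    return max(MEMORY, key=lambda m: overlap(m[0], window))[1]
```

The design choice here is characteristic of memory-based learning: no abstraction step, so exceptions are retained verbatim, which suits morphologically irregular data.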


This paper studies Kokborok named entity recognition using a rule-based approach. Named entity recognition is one of the applications of natural language processing and is considered a subtask of information extraction: the means of identifying named entities, such as the names of persons, organizations, locations, etc., for some specific task. We have studied a named entity recognition system for the Kokborok language. Kokborok is the official language of the state of Tripura, situated in the north-eastern part of India; it is also widely spoken in other parts of the north-eastern states of India and adjoining areas of Bangladesh. Named entity recognition can be approached through machine learning, through rules, or through a hybrid of the two. Rule-based named entity recognition draws on linguistic knowledge of the language, while a machine learning approach requires a large amount of training data. Kokborok, being a low-resource language, has very limited training data; the rule-based approach, by contrast, requires only linguistic rules, and its results do not depend on the size of the data available. We have framed heuristic rules for identifying named entities based on linguistic knowledge of the language, and obtained an encouraging result when testing our data with the rule-based approach. We also study and frame rules for the counting system in Kokborok. The rule-based approach to named entity recognition is found suitable for a low-resource language with limited digital resources and no named-entity-tagged data. We have framed a suitable algorithm using these rules to solve the named entity recognition task with desirable results.
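A heuristic rule-based tagger of the general kind described can be sketched as a gazetteer lookup plus trigger-word rules. The gazetteer entries and honorific triggers below are hypothetical placeholders (not actual Kokborok linguistic data), included only to show the mechanism:

```python
# Rule-based NER sketch: a gazetteer handles known entities, and a
# heuristic rule tags capitalized words following an honorific as persons.

GAZETTEER = {"Agartala": "LOC", "Tripura": "LOC"}       # placeholder entries
PERSON_TRIGGERS = {"Mr.", "Smt.", "Shri"}               # placeholder honorifics

def tag(tokens):
    tags = ["O"] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in GAZETTEER:
            tags[i] = GAZETTEER[tok]
        elif i > 0 and tokens[i - 1] in PERSON_TRIGGERS and tok[0].isupper():
            tags[i] = "PER"
    return tags
```

Because the rules encode linguistic knowledge directly, the tagger's behaviour does not depend on the size of any training corpus, which is the property that makes this approach attractive for low-resource languages.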


2021 ◽  
Vol 12 (2) ◽  
pp. 136
Author(s):  
Arnan Dwika Diasmara ◽  
Aditya Wikan Mahastama ◽  
Antonius Rachmat Chrismanto

Abstract. Intelligent System for the Battle of Honor Board Game with Decision Making and Machine Learning. The Battle of Honor is a board game in which 2 players face each other to bring down their opponent's flag. The game requires a third party to act as the referee, because the players cannot see each other's pawns during the game. The solution adopted here is to implement Rule-Based Systems (RBS) in a system developed with Unity, supporting the referee's role of making decisions based on the rules of the game. The researchers also developed an Artificial Intelligence (AI) opponent by applying Case-Based Reasoning (CBR), supported by the nearest neighbour algorithm to find cases with a high degree of similarity. In the basic test, the CBR achieved a highest formulated accuracy across the 3 examiners of 97.101%. In the scenario test of the AI as referee, colliding pieces were analyzed and the system gave the correct decision in determining victory.
Keywords: The Battle of Honor, CBR, RBS, unity, AI
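The CBR-with-nearest-neighbour mechanism the abstract describes can be sketched in a few lines: past cases are stored as feature vectors with outcomes, and the most similar stored case supplies the decision for a new one. The features and case base below are invented for illustration, not taken from the game's actual implementation:

```python
# Case-based reasoning sketch: retrieve the nearest past case by
# Euclidean distance over case features and reuse its recorded outcome.

CASE_BASE = [
    # (attacker_strength, defender_strength) -> recorded winner (toy data)
    ((9, 3), "attacker"),
    ((2, 8), "defender"),
    ((5, 5), "draw"),
]

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def decide(new_case):
    # retrieval + reuse: the two core CBR steps
    _, outcome = min(CASE_BASE, key=lambda c: distance(c[0], new_case))
    return outcome
```

A new case of `(8, 2)` retrieves the stored `(9, 3)` case and reuses its "attacker" outcome; a fuller CBR cycle would also revise and retain new cases.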


2018 ◽  
Vol 24 (3) ◽  
pp. 367-382
Author(s):  
Nassau de Nogueira Nardez ◽  
Cláudia Pereira Krueger ◽  
Rosana Sueli da Motta Jafelice ◽  
Marcio Augusto Reolon Schmidt

Abstract Knowledge of the Phase Center Offset (PCO) is an important aspect in the calibration of GNSS antennas and has a direct influence on the quality of high-precision positioning. Studies show that meteorological variables correlate with the determination of the north (N), east (E) and vertical up (H) components of the PCO. This article presents results from applying Fuzzy Rule-Based Systems (FRBS) to determine the position of these components. Adaptive Neuro-Fuzzy Inference Systems (ANFIS) were used to generate the FRBS, with the PCO components as output variables. As input data, environmental variables such as temperature, relative humidity and precipitation were used, along with variables obtained from the antenna calibration process, such as the Positional Dilution of Precision and the multipath effect. An FRBS was constructed for each planimetric component (N and E) of the carriers L1 and L2, using a training data set by means of ANFIS. Once the FRBS were defined, the verification data set was applied, and the components obtained by the FRBS were compared with those from the Antenna Calibration Base at the Federal University of Paraná. For the planimetric components the difference was less than 1.00 mm, which shows the applicability of the method for horizontal components.
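The kind of fuzzy rule-based system ANFIS fits can be illustrated with a tiny first-order Sugeno-style sketch: membership functions over an input weight rule consequents, and a weighted average defuzzifies the result. The membership parameters and consequent values below are invented for illustration, not the article's fitted model:

```python
# Minimal Sugeno-style fuzzy inference sketch: triangular memberships
# over one input (temperature) weight constant rule outputs, and the
# weighted average gives the crisp result.

def tri(x, a, b, c):
    # triangular membership function rising from a, peaking at b, falling to c
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def infer(temp):
    rules = [
        (tri(temp, 0, 10, 20), 0.5),    # IF temp is low  THEN offset = 0.5 (toy)
        (tri(temp, 15, 25, 35), 1.2),   # IF temp is high THEN offset = 1.2 (toy)
    ]
    num = sum(w * out for w, out in rules)
    den = sum(w for w, _ in rules) or 1.0
    return num / den  # weighted-average defuzzification
```

What ANFIS adds to this structure is the training step: the membership parameters (`a`, `b`, `c`) and the consequents are tuned against the training data rather than set by hand.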


2016 ◽  
Vol 23 (6) ◽  
pp. 1166-1173 ◽  
Author(s):  
Vibhu Agarwal ◽  
Tanya Podchiyska ◽  
Juan M Banda ◽  
Veena Goel ◽  
Tiffany I Leung ◽  
...  

Abstract Objective Traditionally, patient groups with a phenotype are selected through rule-based definitions whose creation and validation are time-consuming. Machine learning approaches to electronic phenotyping are limited by the paucity of labeled training datasets. We demonstrate the feasibility of utilizing semi-automatically labeled training sets to create phenotype models via machine learning, using a comprehensive representation of the patient medical record. Methods We use a list of keywords specific to the phenotype of interest to generate noisy labeled training data. We train L1 penalized logistic regression models for a chronic and an acute disease and evaluate the performance of the models against a gold standard. Results Our models for Type 2 diabetes mellitus and myocardial infarction achieve precision and accuracy of 0.90, 0.89, and 0.86, 0.89, respectively. Local implementations of the previously validated rule-based definitions for Type 2 diabetes mellitus and myocardial infarction achieve precision and accuracy of 0.96, 0.92 and 0.84, 0.87, respectively. We have demonstrated feasibility of learning phenotype models using imperfectly labeled data for a chronic and acute phenotype. Further research in feature engineering and in specification of the keyword list can improve the performance of the models and the scalability of the approach. Conclusions Our method provides an alternative to manual labeling for creating training sets for statistical models of phenotypes. Such an approach can accelerate research with large observational healthcare datasets and may also be used to create local phenotype models.
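The semi-automatic labelling step the abstract describes can be sketched simply: notes mentioning phenotype-specific keywords become noisy positive examples. The keyword list and notes below are illustrative placeholders, not the study's actual lexicon or data:

```python
# Noisy-labelling sketch: keyword presence assigns a provisional
# phenotype label to each clinical note.

T2DM_KEYWORDS = {"type 2 diabetes", "t2dm", "metformin"}  # placeholder list

def noisy_label(note):
    text = note.lower()
    return 1 if any(k in text for k in T2DM_KEYWORDS) else 0

notes = [
    "Patient started on metformin for glycemic control.",
    "No significant past medical history.",
]
labels = [noisy_label(n) for n in notes]
```

Labels produced this way are imperfect by construction (a mention is not a diagnosis); the study's point is that a regularized classifier, such as an L1-penalized logistic regression, trained on enough such noisy labels can still approach the precision of hand-built rule definitions.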


Literator ◽  
2008 ◽  
Vol 29 (1) ◽  
pp. 93-110
Author(s):  
S. Pilon

The development of an inflected form generator for Afrikaans. This article describes the development of an inflected form generator for Afrikaans. Two requirements are set for this inflected form generator, viz. generating one specific inflected form of a lemma, and generating all possible inflected forms of a lemma. The decision to use machine learning instead of the more traditional rule-based approach in the development of this core technology is explained, and a brief overview of the development of LIA, a lemmatiser for Afrikaans, is given. Experiments are done with three different methods, and it is shown that the most effective way of developing an inflected form generator for Afrikaans is to train a different classifier for each affix: one classifier is trained to generate the plural form, one the diminutive, one the plural of the diminutive, et cetera. The final inflected form generator for Afrikaans (AIL-3) reaches an average accuracy of 86,37% on the training data and 86,88% on a small amount of new data. It is shown that, with the help of a preprocessing module, AIL-3 meets the requirements set for an Afrikaans inflected form generator. Finally, suggestions are made on how to improve the accuracy of AIL-3.
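The per-affix design described above can be sketched as a dispatch table: one generator per inflection type, which directly supports both stated requirements. The toy suffix rules below are hypothetical stand-ins for the trained classifiers, not Afrikaans morphology as AIL-3 learned it:

```python
# Per-affix generator sketch: each affix gets its own generator, mirroring
# the one-classifier-per-affix architecture described in the abstract.

GENERATORS = {
    "plural": lambda lemma: lemma + "e",       # toy rule standing in for a classifier
    "diminutive": lambda lemma: lemma + "ie",  # toy rule standing in for a classifier
}

def inflect(lemma, affix):
    # requirement 1: generate one specific inflected form of a lemma
    return GENERATORS[affix](lemma)

def all_forms(lemma):
    # requirement 2: generate all possible inflected forms of a lemma
    return {affix: gen(lemma) for affix, gen in GENERATORS.items()}
```

Splitting by affix keeps each learned mapping narrow, which is the design rationale the article reports as most effective.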


2019 ◽  
Vol 27 (1) ◽  
pp. 3-12 ◽  
Author(s):  
Sam Henry ◽  
Kevin Buchan ◽  
Michele Filannino ◽  
Amber Stubbs ◽  
Ozlem Uzuner

Abstract Objective This article summarizes the preparation, organization, evaluation, and results of Track 2 of the 2018 National NLP Clinical Challenges shared task. Track 2 focused on extraction of adverse drug events (ADEs) from clinical records and evaluated 3 tasks: concept extraction, relation classification, and end-to-end systems. We perform an analysis of the results to identify the state of the art in these tasks, learn from it, and build on it. Materials and Methods For all tasks, teams were given raw text of narrative discharge summaries, and in all the tasks, participants proposed deep learning–based methods with hand-designed features. In the concept extraction task, participants used sequence labelling models (bidirectional long short-term memory being the most popular), whereas in the relation classification task, they also experimented with instance-based classifiers (namely support vector machines and rules). Ensemble methods were also popular. Results A total of 28 teams participated in task 1, with 21 teams in tasks 2 and 3. The best performing systems set a high performance bar with F1 scores of 0.9418 for concept extraction, 0.9630 for relation classification, and 0.8905 for end-to-end. However, the results were much lower for concepts and relations of Reasons and ADEs. These were often missed because local context is insufficient to identify them. Conclusions This challenge shows that clinical concept extraction and relation classification systems have a high performance for many concept types, but significant improvement is still required for ADEs and Reasons. Incorporating the larger context or outside knowledge will likely improve the performance of future systems.

