Markup: A Web-Based Annotation Tool Powered by Active Learning

2021 ◽  
Vol 3 ◽  
Author(s):  
Samuel Dobbie ◽  
Huw Strafford ◽  
W. Owen Pickrell ◽  
Beata Fonferko-Shadrach ◽  
Carys Jones ◽  
...  

Across various domains, such as health and social care, law, news, and social media, there are increasing quantities of unstructured texts being produced. These potential data sources often contain rich information that could be used for domain-specific and research purposes. However, the unstructured nature of free-text data poses a significant challenge for its utilisation, due to the necessity of substantial manual intervention from domain experts to label embedded information. Annotation tools can assist with this process by providing functionality that enables the accurate capture and transformation of unstructured texts into structured annotations, which can be used individually or as part of larger Natural Language Processing (NLP) pipelines. We present Markup (https://www.getmarkup.com/), an open-source, web-based annotation tool that is undergoing continued development for use across all domains. Markup incorporates NLP and Active Learning (AL) technologies to enable rapid and accurate annotation using custom user configurations, predictive annotation suggestions, and automated mapping suggestions to both domain-specific ontologies, such as the Unified Medical Language System (UMLS), and custom, user-defined ontologies. We demonstrate a real-world use case of how Markup has been used in a healthcare setting to annotate structured information from unstructured clinic letters, where captured annotations were used to build and test NLP applications.
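The predictive annotation suggestions described above typically surface the spans a model is least sure about, so the annotator's effort goes where it adds the most information. A minimal sketch of that idea, with illustrative names and toy confidence values rather than Markup's actual API:

```python
# Toy sketch of uncertainty-based annotation suggestion: spans with the
# lowest model confidence are surfaced to the annotator first.
# All names and values here are illustrative, not Markup's real interface.

def least_confident_spans(predictions, k=2):
    """predictions: list of (span_text, confidence) pairs from some model.
    Returns the k spans with the lowest confidence, i.e. those for which
    a human annotation is most informative."""
    return sorted(predictions, key=lambda p: p[1])[:k]

preds = [("carbamazepine 200mg", 0.95),
         ("possible focal seizure", 0.41),
         ("follow-up in 6 weeks", 0.78),
         ("left temporal lobe", 0.55)]

for span, conf in least_confident_spans(preds):
    print(f"suggest for review: {span!r} (confidence {conf})")
```

A real tool would draw these confidences from its underlying NLP model and retrain as annotations accumulate, which is the active learning loop.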

Author(s):  
Samuel Dobbie ◽  
Huw Strafford ◽  
W Owen Pickrell ◽  
Beata Fonferko-Shadrach ◽  
Ashley Akbari ◽  
...  

Introduction: Unstructured free-text clinical notes often contain valuable information relating to patient symptoms, prescriptions and diagnoses. These can assist with better care for patients and novel healthcare research if transformed into accessible, structured clinical text. In particular, Natural Language Processing (NLP) algorithms can produce such structured outputs, but require gold standard data to train and validate their accuracy. While existing tools such as Brat and WebAnno provide interfaces to manually annotate text, there is a lack of capability to efficiently annotate complex clinical information. Objectives and Approach: We present Markup, an open-source, web-based annotation tool developed for use within clinical contexts by domain experts to produce gold standard annotations for NLP development. Markup incorporates NLP and Active Learning technologies to enable rapid and accurate annotation of unstructured documents. Markup supports custom user configurations, automated annotation suggestions, and automated mapping to existing clinical ontologies such as the Unified Medical Language System (UMLS) and the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT), or to custom, user-defined ontologies. Results: Markup has been tested on epilepsy clinic letters, where captured annotations were used to build and test NLP applications. Markup allowed inter-annotator statistics to be calculated in the case of multiple annotators. Re-annotation, following iterations of annotation definitions, was incorporated for flexibility. UMLS codes, certainty context, and multiple components from complex phrases could all be captured and exported in a structured format. Conclusions/Implications: Markup allows gold standard annotations to be collected efficiently across unstructured text and is optimized to capture health-specific information.
These annotations are important for developing and validating NLP algorithms that automate the capture of important information from clinic letters at scale.
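One inter-annotator statistic commonly computed in this setting is Cohen's kappa, which corrects raw agreement between two annotators for agreement expected by chance. A generic stdlib implementation (not Markup's own code), with toy labels:

```python
# Cohen's kappa for two annotators labelling the same items.
# kappa = (observed agreement - chance agreement) / (1 - chance agreement).
# A generic textbook implementation; the labels below are made up.

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick class c, summed over c.
    expected = sum((freq_a[c] / n) * (freq_b[c] / n)
                   for c in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)

a = ["drug", "dose", "drug", "date", "drug"]
b = ["drug", "dose", "date", "date", "drug"]
print(round(cohens_kappa(a, b), 3))
```

Values near 1 indicate near-perfect agreement; values near 0 indicate agreement no better than chance.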


2019 ◽  
Vol 6 ◽  
pp. 12-41
Author(s):  
Chris Dijkshoorn ◽  
Victor De Boer ◽  
Lora Aroyo ◽  
Guus Schreiber

With the increase of cultural heritage data published online, the usefulness of data in this open context hinges on the quality and diversity of descriptions of collection objects. In many cases, existing descriptions are not sufficient for retrieval and research tasks, resulting in the need for more specific annotations. However, eliciting such annotations is a challenge since it often requires domain-specific knowledge. Where crowdsourcing can be successfully used to execute simple annotation tasks, identifying people with the required expertise might prove troublesome for more complex and domain-specific tasks. Nichesourcing addresses this problem, by tapping into the expert knowledge available in niche communities. This paper presents Accurator, a methodology for conducting nichesourcing campaigns for cultural heritage institutions, by addressing communities, organizing events and tailoring a web-based annotation tool to a domain of choice. The contribution of this paper is fourfold: 1) a nichesourcing methodology, 2) an annotation tool for experts, 3) validation of the methodology in three case studies and 4) a dataset including the obtained annotations. The three domains of the case studies are birds on art, bible prints and fashion images. We compare the quality and quantity of obtained annotations in the three case studies, showing that the nichesourcing methodology in combination with the image annotation tool can be used to collect high-quality annotations in a variety of domains. A user evaluation indicates the tool is suited and usable for domain-specific annotation tasks.


Entropy ◽  
2019 ◽  
Vol 21 (7) ◽  
pp. 651 ◽  
Author(s):  
Dina Elreedy ◽  
Amir F. Atiya ◽  
Samir I. Shaheen

Active learning has recently been considered a promising approach for data acquisition, due to the significant cost of the data labeling process in many real-world applications such as natural language processing and image processing. Most active learning methods are designed merely to enhance the learning model's accuracy. However, model accuracy may not be the primary goal, and there could be other domain-specific objectives to be optimized. In this work, we develop a novel active learning framework that aims to solve a general class of optimization problems. The proposed framework mainly targets optimization problems exposed to the exploration-exploitation trade-off. The framework is comprehensive: it includes exploration-based, exploitation-based, and balancing strategies that seek to achieve a balance between exploration and exploitation. The paper mainly considers regression tasks, as they are under-researched in the active learning field compared to classification tasks. Furthermore, we investigate and compare the two main active querying approaches: pool-based sampling and query synthesis. We apply the proposed framework to the problem of learning the price-demand function, an application that is important in optimal product pricing and dynamic (or time-varying) pricing. In our experiments, we provide a comparative study including the proposed framework's strategies and some other baselines. The results demonstrate strong performance for the proposed methods.
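A balancing strategy of the kind described can be sketched as a pool-based query rule that scores each candidate by a weighted mix of exploitation (predicted objective value) and exploration (distance from already-queried points). The weighting scheme, names, and toy price-demand model below are assumptions for illustration, not the authors' exact method:

```python
# Pool-based query selection balancing exploration and exploitation.
# alpha = 1: pure exploitation (query where the predicted objective is high);
# alpha = 0: pure exploration (query far from previously labeled points).

def select_query(pool, labeled_x, predict, alpha=0.5):
    def novelty(x):  # distance to the nearest already-queried point
        return min(abs(x - lx) for lx in labeled_x) if labeled_x else 1.0
    max_nov = max(novelty(x) for x in pool) or 1.0
    max_val = max(abs(predict(x)) for x in pool) or 1.0
    def score(x):    # convex combination of normalised value and novelty
        return alpha * predict(x) / max_val + (1 - alpha) * novelty(x) / max_nov
    return max(pool, key=score)

# Toy price-demand setting: revenue = price * demand(price).
revenue = lambda p: p * max(0.0, 10 - p)
pool = [1, 3, 5, 7, 9]
print(select_query(pool, labeled_x=[1, 9], predict=revenue, alpha=0.5))
```

With the endpoints already queried, the rule picks the mid-range price that is both far from known points and has high predicted revenue.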


2015 ◽  
Vol 23 (2) ◽  
pp. 289-296 ◽  
Author(s):  
Mahnoosh Kholghi ◽  
Laurianne Sitbon ◽  
Guido Zuccon ◽  
Anthony Nguyen

Abstract Objective: This paper presents an automatic, active learning-based system for the extraction of medical concepts from clinical free-text reports. Specifically, we determine (1) the contribution of active learning in reducing the annotation effort and (2) the robustness of an incremental active learning framework across different selection criteria and data sets. Materials and Methods: The comparative performance of an active learning framework and a fully supervised approach was investigated to study how active learning reduces the annotation effort while achieving the same effectiveness as a supervised approach. Conditional random fields were used as the supervised method, with least confidence and information density as the 2 selection criteria for the active learning framework. The effect of incremental learning vs standard learning on the robustness of the models within the active learning framework with different selection criteria was also investigated. The following 2 clinical data sets were used for evaluation: the Informatics for Integrating Biology and the Bedside/Veteran Affairs (i2b2/VA) 2010 natural language processing challenge and the Shared Annotated Resources/Conference and Labs of the Evaluation Forum (ShARe/CLEF) 2013 eHealth Evaluation Lab. Results: The annotation effort saved by active learning to achieve the same effectiveness as supervised learning is up to 77%, 57%, and 46% of the total number of sequences, tokens, and concepts, respectively. Compared with the random sampling baseline, the saving is at least doubled. Conclusion: Incremental active learning is a promising approach for building effective and robust medical concept extraction models while significantly reducing the burden of manual annotation.
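The least confidence criterion named above queries the sequence whose single best labelling the model is least sure about. A minimal sketch, where the toy probabilities stand in for CRF sequence posteriors:

```python
# Least-confidence selection for sequence labelling: pick the sequence whose
# most likely labelling has the lowest model probability, since a human label
# there is most informative. Probabilities below are toy stand-ins for the
# posteriors a trained CRF would produce.

def least_confidence(sequences):
    """sequences: list of (text, p_best) where p_best is the model's
    probability of its best labelling for that sequence."""
    return min(sequences, key=lambda s: s[1])

batch = [("Patient denies chest pain.", 0.92),
         ("Started metformin 500 mg bid.", 0.61),
         ("Hx of CHF, r/o exacerbation.", 0.34)]
print(least_confidence(batch)[0])
```

The information density criterion additionally weights this uncertainty by a sequence's average similarity to the rest of the pool, so outliers are not over-sampled.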


2021 ◽  
Author(s):  
George Alexander ◽  
Mohammed Bahja ◽  
Gibran F Butt

Obtaining patient feedback is an essential mechanism for healthcare service providers to assess their quality and effectiveness. Unlike assessments of clinical outcomes, feedback from patients offers insights into their lived experience. The Department of Health and Social Care in England, via NHS Digital, operates a patient feedback web service through which patients can leave feedback on their experiences in structured and free-text report forms. Compared with structured questionnaires, free-text feedback may be less biased by the feedback collector and thus more representative; however, it is harder to analyse in large quantities, and it is challenging to derive meaningful, quantitative outcomes that better represent general public feedback. This study details the development of a text analysis tool that utilises contemporary natural language processing (NLP) and machine learning models to analyse free-text clinical service reviews and develop a robust classification model. An accompanying interactive visualisation web application, a Vue.js application with NodeJS working with a C# serverless API and SQL server, all hosted on the Microsoft Azure platform, facilitates exploration of the data and is designed for use by all stakeholders. Of the 11,103 possible clinical services that could be reviewed across England, 2030 different services had received a combined total of 51,845 reviews between 1/10/2017 and 31/10/2019; these were included for analysis. Dominant topics were identified for the entire corpus, and then for negative and positive sentiment in turn. Reviews containing high and low sentiment topics occurred more frequently than less polarised topics. Time series analysis can identify trends in topic and sentiment occurrence frequency across the study period.
This tool automates the analysis of large volumes of free text specific to medical services, and the web application summarises the results and presents them in an accessible and interactive format. Such a tool has the potential to considerably reduce administrative burden and increase user uptake.
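A pipeline of this shape — score each review's sentiment, tag topic occurrences, and aggregate by month for trend analysis — can be sketched with the stdlib. The lexicon and topic keywords below are toy assumptions, not the study's trained models:

```python
# Minimal sketch of a review-analysis pipeline: lexicon-based sentiment
# scoring plus monthly topic-occurrence counts suitable for time series
# plots. The word lists are illustrative, not the study's actual models.

from collections import defaultdict

POSITIVE = {"caring", "excellent", "helpful"}
NEGATIVE = {"rude", "delay", "dirty"}
TOPICS = {"waiting times": {"delay", "wait"},
          "staff attitude": {"caring", "rude"}}

def score(review):
    """Positive minus negative lexicon hits; sign gives the sentiment."""
    words = set(review.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

reviews = [("2019-03", "Staff were caring and helpful"),
           ("2019-03", "Long delay and rude reception"),
           ("2019-04", "Another delay at the clinic")]

monthly = defaultdict(lambda: defaultdict(int))
for month, text in reviews:
    words = set(text.lower().split())
    for topic, keys in TOPICS.items():
        if words & keys:
            monthly[month][topic] += 1

print(score(reviews[0][1]), dict(monthly["2019-04"]))
```

The real system replaces the lexicon with trained NLP models, but the aggregation step that feeds the interactive visualisation is essentially this month-by-topic count.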


JAMIA Open ◽  
2020 ◽  
Vol 3 (2) ◽  
pp. 185-189
Author(s):  
Timothy A Miller ◽  
Paul Avillach ◽  
Kenneth D Mandl

Abstract Objective: To develop scalable natural language processing (NLP) infrastructure for processing the free text in electronic health records (EHRs). Materials and Methods: We extend the open-source Apache cTAKES NLP software with several standard technologies for scalability. We remove processing bottlenecks by monitoring component queue size. We process EHR free text for patients in the PrecisionLink Biobank at Boston Children’s Hospital. The extracted concepts are made searchable via a web-based portal. Results: We processed over 1.2 million notes for over 8000 patients, extracting 154 million concepts. Our largest tested configuration processes over 1 million notes per day. Discussion: The unique information represented by extracted NLP concepts has great potential to provide a more complete picture of patient status. Conclusion: NLP of large EHR document collections can be done efficiently, in service of high-throughput phenotyping.
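The queue-monitoring idea is simple: in a multi-stage pipeline, the component whose input queue keeps growing is the bottleneck and should receive more workers. A toy sketch (component names are illustrative; this is not cTAKES code):

```python
# Bottleneck detection by queue depth: in a staged pipeline, the deepest
# input queue sits in front of the slowest component. Names and depths
# below are made up for illustration.

def find_bottleneck(queue_sizes):
    """queue_sizes: mapping of component name -> current input queue depth."""
    return max(queue_sizes, key=queue_sizes.get)

sizes = {"tokenizer": 3, "pos_tagger": 5, "concept_mapper": 1240, "writer": 2}
print(find_bottleneck(sizes))
```

Scaling the flagged component (e.g. running more parallel instances of it) and re-measuring repeats until throughput is balanced across stages.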


Author(s):  
Azad Dehghan ◽  
Tom Liptrot ◽  
Catherine O’Hara ◽  
Matthew Barker-Hewitt ◽  
Daniel Tibble ◽  
...  

Abstract Objectives: Increasing interest in using unstructured electronic patient records for research has drawn attention to automated de-identification methods for conducting large-scale removal of Personal Identifiable Information (PII). PII mainly includes identifiable information such as person names, dates (e.g., date of birth), reference numbers (e.g., hospital number, NHS number), locations (e.g., hospital names, addresses), contacts (e.g., telephone, e-mail), occupation, age, and other identity information (ethnicity, religion, sexuality) mentioned in a private context. De-identification of clinical free text remains crucial to enable large-scale data access for health research while adhering to legal (Data Protection Act 1998) and ethical obligations. Here we present a computational method developed to automatically remove PII from clinical text. Approach: In order to automatically identify PII in clinical text, we have developed and validated a Natural Language Processing (NLP) method which combines knowledge-driven (lexical dictionaries and rules) and data-driven (linear-chain conditional random fields) techniques. In addition, we have designed a novel two-pass recognition approach that uses the output of the initial pass to create patient-level and run-time dictionaries, which are used to identify PII mentions that lack the specific contextual clues considered by the initial entity extraction modules. The labelled data used to model and validate our techniques were generated by six human annotators on two distinct types of free text from The Christie NHS Foundation Trust: (1) clinical correspondence (400 documents) and (2) clinical notes (1,300 documents). Results: The de-identification approach was developed and validated using a 60/40 percent split between the development and test datasets. The preliminary results show that our method achieves 97% and 93% token-level F1-measure on clinical correspondence and clinical notes respectively.
In addition, the proposed two-pass recognition method was found particularly effective for longitudinal records. Notably, these performances are comparable to human benchmarks (using inter-annotator agreement) of 97% and 90% F1 respectively. Conclusions: We have developed and validated a state-of-the-art method that matches human benchmarks in the identification and removal of PII from free-text clinical records. The method has been further validated across multiple institutions and countries (United States and United Kingdom), where we have identified a notable NLP challenge of cross-dataset adaptation and have proposed using active learning methods to address this problem. The algorithm, including an active learning component, will be provided as open source to the healthcare community.
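The two-pass idea can be sketched compactly: pass 1 finds PII using contextual cues, the hits seed a run-time dictionary, and pass 2 redacts the same strings even where the cue is absent. The regex and toy records below stand in for the authors' CRF and dictionary modules:

```python
# Sketch of two-pass de-identification. Pass 1 uses a contextual cue
# (a title such as Mr/Mrs/Dr before a capitalised word) to find names;
# pass 2 redacts every occurrence of those names, cue or no cue.
# A toy stand-in for the described knowledge- and data-driven modules.

import re

def first_pass(text):
    return set(re.findall(r"\b(?:Mr|Mrs|Dr)\.?\s+([A-Z][a-z]+)", text))

def two_pass_redact(documents):
    dictionary = set()
    for doc in documents:            # pass 1: build the run-time dictionary
        dictionary |= first_pass(doc)
    redacted = []
    for doc in documents:            # pass 2: redact all known mentions
        for name in dictionary:
            doc = re.sub(rf"\b{name}\b", "[NAME]", doc)
        redacted.append(doc)
    return redacted

docs = ["Seen by Dr. Patel in clinic.", "Patel advised a repeat MRI."]
print(two_pass_redact(docs))
```

Note that the second document contains no title cue, yet "Patel" is still caught, which is exactly why the approach helps on longitudinal records where a name recurs without context.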


2018 ◽  
Vol 23 (3) ◽  
pp. 175-191
Author(s):  
Anneke Annassia Putri Siswadi ◽  
Avinanta Tarigan

To fulfill prospective students' information needs about student admission, Gunadarma University already offers many kinds of services, all of which are time-limited, such as a website, a book, registration desks, the Media Information Center, and a question-answering website (UG-Pedia). A service is needed that can serve students anytime and anywhere. This research therefore develops UGLeo, a web-based intelligent question-answering (QA) chatbot application for Gunadarma University's student admission portal. UGLeo is based on the MegaHAL style of chatbot, which implements the Markov chain method. In this research, some modifications are made to the MegaHAL style, namely to the structure of the natural language processing and the structure of the database. The accuracy of UGLeo's replies is 65%. To increase this accuracy, some improvements could be applied to the UGLeo system, both in natural language processing and in the MegaHAL style.
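The Markov chain core of a MegaHAL-style chatbot learns word-to-word transition options from a corpus and walks them to produce a reply. A minimal forward-chain sketch (a real MegaHAL model also builds a backward chain and scores candidate replies by surprise; the corpus here is made up):

```python
# Forward Markov chain over word bigrams, the generation mechanism behind
# MegaHAL-style chatbots. Training corpus and start word are toy examples.

import random
from collections import defaultdict

def train(corpus):
    chain = defaultdict(list)
    for sentence in corpus:
        words = sentence.split()
        for a, b in zip(words, words[1:]):
            chain[a].append(b)   # duplicates preserve transition frequency
    return chain

def generate(chain, start, max_len=8, rng=random):
    out = [start]
    while len(out) < max_len and chain[out[-1]]:
        out.append(rng.choice(chain[out[-1]]))
    return " ".join(out)

corpus = ["registration opens in june",
          "registration closes in august"]
chain = train(corpus)
print(generate(chain, "registration"))
```

Because transitions are sampled, repeated calls vary the reply, which is what gives such chatbots their characteristic fluency-without-understanding.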


BMC Nursing ◽  
2021 ◽  
Vol 20 (1) ◽  
Author(s):  
Linda Ahlstrom ◽  
Christopher Holmberg

Abstract Background: Despite the advantages of using active learning strategies in nursing education, researchers have rarely investigated how such pedagogic approaches can be used to assess students or how interactive examinations can be modified depending on the circumstances of practice (e.g., in online education). Aims: The aim was to compare three interactive examination designs, all based on active learning pedagogy, in terms of nursing students’ engagement and preparedness, their learning achievement, and instructional aspects. Methods: A comparative research design was used, including final-year undergraduate nursing students. All students were enrolled in a quality improvement course at a metropolitan university in Sweden. In this comparative study to evaluate three course layouts, participants (Cohort 1, n = 89; Cohort 2, n = 97; Cohort 3, n = 60) completed different examinations assessing the same course content and learning objectives, after which they evaluated the examinations on a questionnaire in numerical and free-text responses. Chi-squared tests were conducted to compare background variables between the cohorts, and Kruskal–Wallis H tests were used to assess numerical differences in experiences between cohorts. Following the guidelines of the Good Reporting of a Mixed Methods Study (GRAMMS), a sequential mixed-methods analysis was performed on the quantitative findings, and the qualitative findings were used in a complementary manner to support the interpretation of the quantitative results. Results: The 246 students who completed the questionnaire generally appreciated the interactive examination in active learning classrooms. Among the significant differences in the results, Cohort 2 (which conducted the examination on campus) scored highest for overall positive experience and engagement, whereas Cohort 3 (which conducted the examination online) scored the lowest.
Students in Cohort 3 generally commended the online examination’s chat function available for use during the examination. Conclusions: Interactive examinations for nursing students succeed when they are campus-based, focus on student preparation, and provide the necessary time to be completed.
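The Kruskal–Wallis H test used to compare the three cohorts ranks all responses jointly and asks whether the rank sums differ more across groups than chance would allow. A stdlib sketch of the H statistic with average ranks for ties (the tie-correction factor applied by statistical packages is omitted, and the scores below are made up, not the study's data):

```python
# Kruskal-Wallis H statistic: rank the pooled observations, sum ranks per
# group, then H = 12/(N(N+1)) * sum(R_i^2 / n_i) - 3(N+1).
# Ties get average ranks; the usual tie-correction divisor is omitted here.

def kruskal_wallis_h(*groups):
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    values = [v for v, _ in pooled]
    n = len(values)
    rank_of = [0.0] * n
    i = 0
    while i < n:                      # assign average ranks to tied runs
        j = i
        while j < n and values[j] == values[i]:
            j += 1
        avg = (i + 1 + j) / 2         # mean of ranks i+1 .. j
        for k in range(i, j):
            rank_of[k] = avg
        i = j
    sums = [0.0] * len(groups)
    for (v, gi), r in zip(pooled, rank_of):
        sums[gi] += r
    return (12 / (n * (n + 1))
            * sum(s * s / len(g) for s, g in zip(sums, groups))
            - 3 * (n + 1))

c1, c2, c3 = [4, 5, 5, 3], [5, 5, 4, 5], [2, 3, 3, 2]
print(round(kruskal_wallis_h(c1, c2, c3), 3))
```

The H value is then compared against a chi-squared distribution with (number of groups - 1) degrees of freedom to obtain a p-value.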


Author(s):  
Mario Jojoa Acosta ◽  
Gema Castillo-Sánchez ◽  
Begonya Garcia-Zapirain ◽  
Isabel de la Torre Díez ◽  
Manuel Franco-Martín

The use of artificial intelligence in health care has grown quickly. In this context, we present our work on applying Natural Language Processing (NLP) techniques as a tool to analyze the sentiment perception of users who answered two questions from the CSQ-8 questionnaire with raw Spanish free text. Their responses relate to mindfulness, a novel technique used to control stress and anxiety caused by different factors in daily life. We proposed an online course where this method was applied in order to improve the quality of life of health care professionals during the COVID-19 pandemic. We also carried out an evaluation of the satisfaction level of the participants involved, with a view to establishing strategies to improve future experiences. To perform this task automatically, we used NLP models such as swivel embedding, neural networks, and transfer learning to classify the inputs into three categories: negative, neutral, and positive. Due to the limited amount of data available (86 records for the first question and 68 for the second), transfer learning techniques were required. The length of the text had no limit from the user’s standpoint, and our approach attained a maximum accuracy of 93.02% and 90.53%, respectively, based on ground truth labeled by three experts. Finally, we proposed a complementary analysis, using graphical text representation based on word frequency, to help researchers identify relevant information about the opinions with an objective approach to sentiment. The main conclusion drawn from this work is that applying NLP techniques with transfer learning to small amounts of data can achieve sufficient accuracyy in the sentiment analysis and text classification stages.
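The classification stage can be illustrated with a toy pipeline: map each response to a vector via a pretrained embedding lookup (a tiny made-up table standing in for swivel embeddings) and assign the nearest class centroid. The real system fine-tunes a neural network via transfer learning; everything below is an assumption for illustration:

```python
# Toy embed-and-classify sketch: average word vectors from a (made-up)
# pretrained lookup, then pick the nearest sentiment centroid. A stand-in
# for the swivel-embedding + neural-network pipeline described above.

EMB = {"excelente": (1.0, 0.1), "útil": (0.8, 0.2),
       "aburrido": (-0.7, 0.3), "inútil": (-0.9, 0.1)}

def embed(text):
    vecs = [EMB[w] for w in text.lower().split() if w in EMB]
    if not vecs:
        return (0.0, 0.0)                 # no known words -> neutral origin
    return tuple(sum(c) / len(vecs) for c in zip(*vecs))

CENTROIDS = {"positive": (0.9, 0.15), "neutral": (0.0, 0.0),
             "negative": (-0.8, 0.2)}

def classify(text):
    v = embed(text)
    return min(CENTROIDS,
               key=lambda c: sum((a - b) ** 2
                                 for a, b in zip(v, CENTROIDS[c])))

print(classify("curso excelente y útil"))
```

The transfer-learning step corresponds to reusing the embedding table (trained on a large corpus) so that only the small classifier on top must be fitted to the 86- and 68-record datasets.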

