Interpretable segmentation of medical free-text records based on word embeddings

AbstractMedical free-text records store a lot of useful information that can be exploited in developing computer-supported medicine. However, extracting the knowledge from the unstructured text is difficult and depends on the language. In the paper, we apply Natural Language Processing methods to process raw medical texts in Polish and propose a new methodology for clustering of patients’ visits. We (1) extract medical terminology from a corpus of free-text clinical records, (2) annotate data with medical concepts, (3) compute vector representations of medical concepts and validate them on the proposed term analogy tasks, (4) compute visit representations as vectors, (5) introduce a new method for clustering of patients’ visits and (6) apply the method to a corpus of 100,000 visits. We use several approaches to visual exploration that facilitate interpretation of segments. With our method, we obtain stable and separated segments of visits which are positively validated against final medical diagnoses. In this paper we show how algorithm for segmentation of medical free-text records may be used to aid medical doctors. In addition to this, we share implementation of described methods with examples as open-source package .

Download Full-text

An Integrated Approach to Biomedical Term Identification Systems

Applied Sciences ◽

10.3390/app10051726 ◽

2020 ◽

Vol 10 (5) ◽

pp. 1726 ◽

Cited By ~ 1

Author(s):

Pilar López-Úbeda ◽

Manuel Carlos Díaz-Galiano ◽

Arturo Montejo-Ráez ◽

María-Teresa Martín-Valdivia ◽

L. Alfonso Ureña-López

Keyword(s):

Language Processing ◽

Integrated Approach ◽

Knowledge Bases ◽

Sources Of Information ◽

Medical Texts ◽

Medical Term ◽

Conceptual Graph ◽

Medical Concepts ◽

Fine Tune ◽

Term Identification

In this paper a novel architecture to build biomedical term identification systems is presented. The architecture combines several sources of information and knowledge bases to provide practical and exploration-enabled biomedical term identification systems. We have implemented a system to evidence the convenience of the different modules considered in the architecture. Our system includes medical term identification, retrieval of specialized literature and semantic concept browsing from medical ontologies. By applying several Natural Language Processing (NLP) technologies, we have developed a prototype that offers an easy interface for helping to understand biomedical specialized terminology present in Spanish medical texts. The result is a system that performs term identification of medical concepts over any textual document written in Spanish. It is possible to perform a sub-concept selection using the previously identified terms to accomplish a fine-tune retrieval process over resources like SciELO, Google Scholar and MedLine. Moreover, the system generates a conceptual graph which semantically relates all the terms found in the text. In order to evaluate our proposal on medical term identification, we present the results obtained by our system using the MANTRA corpus and compare its performance with the Freeling-Med tool.

Download Full-text

A Database De-identification Framework to Enable Direct Queries on Medical Data for Secondary Use

Methods of Information in Medicine ◽

10.3414/me11-01-0048 ◽

2012 ◽

Vol 51 (03) ◽

pp. 229-241 ◽

Cited By ~ 9

Author(s):

B. S. Erdal ◽

J. Liu ◽

J. Ding ◽

J. Chen ◽

C. B. Marsh ◽

...

Keyword(s):

Language Processing ◽

Modular Design ◽

Target System ◽

Free Text ◽

Institutional Research ◽

Secondary Use ◽

Data Analyst ◽

Sequence Of Events ◽

Pilot Implementation ◽

Clinical Records

SummaryObjective: To qualify the use of patient clinical records as non-human-subject for research purpose, electronic medical record data must be de-identified so there is minimum risk to protected health information exposure. This study demonstrated a robust framework for structured data de-identification that can be applied to any relational data source that needs to be de-identified.Methods: Using a real world clinical data warehouse, a pilot implementation of limited subject areas were used to demonstrate and evaluate this new de-identification process. Query results and performances are compared between source and target system to validate data accuracy and usability.Results: The combination of hashing, pseudonyms, and session dependent randomizer provides a rigorous de-identification framework to guard against 1) source identifier exposure; 2) internal data analyst manually linking to source identifiers; and 3) identifier cross-link among different researchers or multiple query sessions by the same researcher. In addition, a query rejection option is provided to refuse queries resulting in less than preset numbers of subjects and total records to prevent users from accidental subject identification due to low volume of data.This framework does not prevent subject re-identification based on prior knowledge and sequence of events. Also, it does not deal with medical free text de-identification, although text de-identification using natural language processing can be included due its modular design.Conclusion: We demonstrated a framework resulting in HIPAA Compliant databases that can be directly queried by researchers. This technique can be augmented to facilitate inter-institutional research data sharing through existing middleware such as caGrid.

Download Full-text

Cognitive Impairments in Schizophrenia: A Study in a Large Clinical Sample Using Natural Language Processing

Frontiers in Digital Health ◽

10.3389/fdgth.2021.711941 ◽

2021 ◽

Vol 3 ◽

Author(s):

Aurelie Mascio ◽

Robert Stewart ◽

Riley Botelle ◽

Marcus Williams ◽

Luwaiza Mirza ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Clinical Outcomes ◽

Language Processing ◽

Large Scale ◽

Rating Scales ◽

Cognitive Impairments ◽

Free Text ◽

Socio Demographic Factors ◽

Clinical Records

Background: Cognitive impairments are a neglected aspect of schizophrenia despite being a major factor of poor functional outcome. They are usually measured using various rating scales, however, these necessitate trained practitioners and are rarely routinely applied in clinical settings. Recent advances in natural language processing techniques allow us to extract such information from unstructured portions of text at a large scale and in a cost effective manner. We aimed to identify cognitive problems in the clinical records of a large sample of patients with schizophrenia, and assess their association with clinical outcomes.Methods: We developed a natural language processing based application identifying cognitive dysfunctions from the free text of medical records, and assessed its performance against a rating scale widely used in the United Kingdom, the cognitive component of the Health of the Nation Outcome Scales (HoNOS). Furthermore, we analyzed cognitive trajectories over the course of patient treatment, and evaluated their relationship with various socio-demographic factors and clinical outcomes.Results: We found a high prevalence of cognitive impairments in patients with schizophrenia, and a strong correlation with several socio-demographic factors (gender, education, ethnicity, marital status, and employment) as well as adverse clinical outcomes. Results obtained from the free text were broadly in line with those obtained using the HoNOS subscale, and shed light on additional associations, notably related to attention and social impairments for patients with higher education.Conclusions: Our findings demonstrate that cognitive problems are common in patients with schizophrenia, can be reliably extracted from clinical records using natural language processing, and are associated with adverse clinical outcomes. Harvesting the free text from medical records provides a larger coverage in contrast to neurocognitive batteries or rating scales, and access to additional socio-demographic and clinical variables. Text mining tools can therefore facilitate large scale patient screening and early symptoms detection, and ultimately help inform clinical decisions.

Download Full-text

The 2019 National Natural language processing (NLP) Clinical Challenges (n2c2)/Open Health NLP (OHNLP) shared task on clinical concept normalization for clinical records

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocaa106 ◽

2020 ◽

Vol 27 (10) ◽

pp. 1529-1537 ◽

Cited By ~ 1

Author(s):

Sam Henry ◽

Yanshan Wang ◽

Feichen Shen ◽

Ozlem Uzuner

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

State Of The Art ◽

Controlled Vocabulary ◽

Future Research ◽

Shared Task ◽

Data Set ◽

Clinical Records ◽

Medical Concepts

Abstract Objective The 2019 National Natural language processing (NLP) Clinical Challenges (n2c2)/Open Health NLP (OHNLP) shared task track 3, focused on medical concept normalization (MCN) in clinical records. This track aimed to assess the state of the art in identifying and matching salient medical concepts to a controlled vocabulary. In this paper, we describe the task, describe the data set used, compare the participating systems, present results, identify the strengths and limitations of the current state of the art, and identify directions for future research. Materials and Methods Participating teams were provided with narrative discharge summaries in which text spans corresponding to medical concepts were identified. This paper refers to these text spans as mentions. Teams were tasked with normalizing these mentions to concepts, represented by concept unique identifiers, within the Unified Medical Language System. Submitted systems represented 4 broad categories of approaches: cascading dictionary matching, cosine distance, deep learning, and retrieve-and-rank systems. Disambiguation modules were common across all approaches. Results A total of 33 teams participated in the MCN task. The best-performing team achieved an accuracy of 0.8526. The median and mean performances among all teams were 0.7733 and 0.7426, respectively. Conclusions Overall performance among the top 10 teams was high. However, several mention types were challenging for all teams. These included mentions requiring disambiguation of misspelled words, acronyms, abbreviations, and mentions with more than 1 possible semantic type. Also challenging were complex mentions of long, multi-word terms that may require new ways of extracting and representing mention meaning, the use of domain knowledge, parse trees, or hand-crafted rules.

Download Full-text

Identification of Atherosclerotic and Cardiovascular Clinical Phenotypes in Spanish Electronic Health Records: Assessment of an Automated Information Extraction System. (Preprint)

10.2196/preprints.25888 ◽

2020 ◽

Author(s):

Ignacio Hernández-Medrano ◽

Marisa Serrano ◽

Sergio Collazo ◽

Ana López-Ballesteros ◽

Blai Coll ◽

...

Keyword(s):

Electronic Health Records ◽

Language Processing ◽

Research Quality ◽

Free Text ◽

Case Detection ◽

Significant Information ◽

Health Records ◽

Time Dynamic ◽

Electronic Health ◽

Medical Concepts

BACKGROUND Research efforts to develop strategies to effectively identify patients and to reduce the burden of cardiovascular diseases is essential for the future of the health system. Most research studies have used only coded parts of electronic health records (EHRs) for case-detection, obtaining missed data cases and reducing study quality. Incorporating information from free-text into case-detection through Natural Language Processing (NLP) techniques improves research quality. SAVANA was born as an innovating data-driven system based on NLP and big data techniques designed to retrieve prominent biomedical information from narratives clinic notes and to maximize the huge amount of information contained in Spanish EHRs. OBJECTIVE The aim of this work if to assess the performance of SAVANA when identifying concepts within the cardiovascular domain in Spanish EHRs. METHODS SAVANA is a platform for acceleration of clinical research, based on real-time dynamic exploitation of all the information contained in EHRs corpora that uses its own technology (EHRead) to allow unstructured information contained in EHRs to be analysed and expressed by means of medical concepts that contain the most significant information in the text. RESULTS The evaluation corpus consisted of a stratified random sample of patients from 3 Spanish sites. For site 01, the corpus contained a total of 280 mentions of cardiovascular clinical entities, where 249 were correctly identified, obtaining a P=0.93. In site 02, SAVANA correctly detected 53 mentions of cardiovascular entities among 57 annotations, achieving a P=0.98; and in site 03, among 165 manual annotations, 75 were correctly identified, yielding a P= 0.99. CONCLUSIONS This research clearly demonstrates the ability of SAVANA at identifying mentions of atherosclerotic/cardiovascular clinical phenotype in Spanish EHRs, as well as retrieving patients and records related to this pathology.

Download Full-text

Automated de-identification of clinical free-text

International Journal for Population Data Science ◽

10.23889/ijpds.v1i1.345 ◽

2017 ◽

Vol 1 (1) ◽

Author(s):

Azad Dehghan ◽

Tom Liptrot ◽

Catherine O’hara ◽

Matthew Barker-Hewitt ◽

Daniel Tibble ◽

...

Keyword(s):

Active Learning ◽

Language Processing ◽

Large Scale ◽

Linear Chain ◽

Data Access ◽

Computational Method ◽

Free Text ◽

Clinical Text ◽

Clinical Notes ◽

Clinical Records

ABSTRACTObjectivesIncreasing interest to use unstructured electronic patient records for research has attracted attention to automated de-identification methods to conduct large scale removal of Personal Identifiable Information (PII). PII mainly include identifiable information such as person names, dates (e.g., date of birth), reference numbers (e.g., hospital number, NHS number), locations (e.g., hospital names, addresses), contacts (e.g., telephone, e-mail), occupation, age, and other identity information (ethnical, religion, sexual) mentioned in a private context. De-identification of clinical free-text remains crucial to enable large-scale data access for health research while adhering to legal (Data Protection Act 1998) and ethical obligations. Here we present a computational method developed to automatically remove PII from clinical text. ApproachIn order to automatically identify PII in clinical text, we have developed and validated a Natural Language Processing (NLP) method which combine knowledge- (lexical dictionaries and rules) and data-driven (linear-chain conditional random fields) techniques. In addition, we have designed a novel two-pass recognition approach that uses the output of the initial pass to create patient-level and run-time dictionaries used to identify PII mentions that lack specific contextual clues considered by the initial entity extraction modules. The labelled data used to model and validate our techniques were generated using six human annotators and two distinct types of free-text from The Christie NHS Foundation Trust: (1) clinical correspondence (400 documents) and (2) clinical notes (1,300 documents). ResultsThe de-identification approach was developed and validated using a 60/40 percent split between the development and test datasets. The preliminary results show that our method achieves 97% and 93% token-level F1-measure on clinical correspondence and clinical notes respectively. In addition, the proposed two-pass recognition method was found particularly effective for longitudinal records. Notably, the performances are comparable to human benchmarks (using inter annotator agreements) of 97% and 90% F1 respectively. ConclusionsWe have developed and validated a state-of-the-art method that matches human benchmarks in identification and removal of PII from free-text clinical records. The method has been further validated across multiple institutions and countries (United States and United Kingdom), where we have identified a notable NLP challenge of cross-dataset adaption and have proposed using active learning methods to address this problem. The algorithm, including an active learning component, will be provided as open source to the healthcare community.

Download Full-text

Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition

10.26434/chemrxiv.5513581.v1 ◽

2017 ◽

Author(s):

Sabrina Jaeger ◽

Simone Fulle ◽

Samo Turk

Keyword(s):

Machine Learning ◽

Language Processing ◽

Supervised Machine Learning ◽

Learning Approach ◽

Learning Approaches ◽

Unsupervised Machine Learning ◽

Feature Representations ◽

Machine Learning Approach ◽

The Individual ◽

Vector Representations

Inspired by natural language processing techniques we here introduce Mol2vec which is an unsupervised machine learning approach to learn vector representations of molecular substructures. Similarly, to the Word2vec models where vectors of closely related words are in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that are pointing in similar directions for chemically related substructures. Compounds can finally be encoded as vectors by summing up vectors of the individual substructures and, for instance, feed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised machine learning approach on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pre-trained once, yields dense vector representations and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. The prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as reference compound representation. Mol2vec can be easily combined with ProtVec, which employs the same Word2vec concept on protein sequences, resulting in a proteochemometric approach that is alignment independent and can be thus also easily used for proteins with low sequence similarities.

Download Full-text

Sentiment Analysis Techniques Applied to Raw-Text Data from a Csq-8 Questionnaire about Mindfulness in Times of COVID-19 to Improve Strategy Generation

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18126408 ◽

2021 ◽

Vol 18 (12) ◽

pp. 6408

Author(s):

Mario Jojoa Acosta ◽

Gema Castillo-Sánchez ◽

Begonya Garcia-Zapirain ◽

Isabel de la Torre Díez ◽

Manuel Franco-Martín

Keyword(s):

Health Care ◽

Natural Language Processing ◽

Natural Language ◽

Sentiment Analysis ◽

Transfer Learning ◽

Language Processing ◽

Health Care Professionals ◽

Ground Truth ◽

Relevant Information ◽

Free Text

The use of artificial intelligence in health care has grown quickly. In this sense, we present our work related to the application of Natural Language Processing techniques, as a tool to analyze the sentiment perception of users who answered two questions from the CSQ-8 questionnaires with raw Spanish free-text. Their responses are related to mindfulness, which is a novel technique used to control stress and anxiety caused by different factors in daily life. As such, we proposed an online course where this method was applied in order to improve the quality of life of health care professionals in COVID 19 pandemic times. We also carried out an evaluation of the satisfaction level of the participants involved, with a view to establishing strategies to improve future experiences. To automatically perform this task, we used Natural Language Processing (NLP) models such as swivel embedding, neural networks, and transfer learning, so as to classify the inputs into the following three categories: negative, neutral, and positive. Due to the limited amount of data available—86 registers for the first and 68 for the second—transfer learning techniques were required. The length of the text had no limit from the user’s standpoint, and our approach attained a maximum accuracy of 93.02% and 90.53%, respectively, based on ground truth labeled by three experts. Finally, we proposed a complementary analysis, using computer graphic text representation based on word frequency, to help researchers identify relevant information about the opinions with an objective approach to sentiment. The main conclusion drawn from this work is that the application of NLP techniques in small amounts of data using transfer learning is able to obtain enough accuracy in sentiment analysis and text classification stages.

Download Full-text

Applying natural language processing and machine learning techniques to patient experience feedback: a systematic review

BMJ Health & Care Informatics ◽

10.1136/bmjhci-2020-100262 ◽

2021 ◽

Vol 28 (1) ◽

pp. e100262

Author(s):

Mustafa Khanbhai ◽

Patrick Anyadi ◽

Joshua Symons ◽

Kelsey Flott ◽

Ara Darzi ◽

...

Keyword(s):

Machine Learning ◽

Systematic Review ◽

Social Media ◽

Natural Language Processing ◽

Natural Language ◽

Patient Experience ◽

Language Processing ◽

Performance Metrics ◽

Free Text ◽

Patient Feedback

ObjectivesUnstructured free-text patient feedback contains rich information, and analysing these data manually would require a lot of personnel resources which are not available in most healthcare organisations.To undertake a systematic review of the literature on the use of natural language processing (NLP) and machine learning (ML) to process and analyse free-text patient experience data.MethodsDatabases were systematically searched to identify articles published between January 2000 and December 2019 examining NLP to analyse free-text patient feedback. Due to the heterogeneous nature of the studies, a narrative synthesis was deemed most appropriate. Data related to the study purpose, corpus, methodology, performance metrics and indicators of quality were recorded.ResultsNineteen articles were included. The majority (80%) of studies applied language analysis techniques on patient feedback from social media sites (unsolicited) followed by structured surveys (solicited). Supervised learning was frequently used (n=9), followed by unsupervised (n=6) and semisupervised (n=3). Comments extracted from social media were analysed using an unsupervised approach, and free-text comments held within structured surveys were analysed using a supervised approach. Reported performance metrics included the precision, recall and F-measure, with support vector machine and Naïve Bayes being the best performing ML classifiers.ConclusionNLP and ML have emerged as an important tool for processing unstructured free text. Both supervised and unsupervised approaches have their role depending on the data source. With the advancement of data analysis tools, these techniques may be useful to healthcare organisations to generate insight from the volumes of unstructured free-text data.

Download Full-text

Ascertaining Framingham heart failure phenotype from inpatient electronic health record data using natural language processing: a multicentre Atherosclerosis Risk in Communities (ARIC) validation study

BMJ Open ◽

10.1136/bmjopen-2020-047356 ◽

2021 ◽

Vol 11 (6) ◽

pp. e047356

Author(s):

Carlton R Moore ◽

Saumya Jain ◽

Stephanie Haas ◽

Harish Yadav ◽

Eric Whitsel ◽

...

Keyword(s):

Language Processing ◽

Validation Dataset ◽

Free Text ◽

Electronic Health Record Data ◽

Atherosclerosis Risk In Communities ◽

Clinical Notes ◽

Atherosclerosis Risk ◽

Record Data ◽

Sensitivity Specificity ◽

Aric Study

ObjectivesUsing free-text clinical notes and reports from hospitalised patients, determine the performance of natural language processing (NLP) ascertainment of Framingham heart failure (HF) criteria and phenotype.Study designA retrospective observational study design of patients hospitalised in 2015 from four hospitals participating in the Atherosclerosis Risk in Communities (ARIC) study was used to determine NLP performance in the ascertainment of Framingham HF criteria and phenotype.SettingFour ARIC study hospitals, each representing an ARIC study region in the USA.ParticipantsA stratified random sample of hospitalisations identified using a broad range of International Classification of Disease, ninth revision, diagnostic codes indicative of an HF event and occurring during 2015 was drawn for this study. A randomly selected set of 394 hospitalisations was used as the derivation dataset and 406 hospitalisations was used as the validation dataset.InterventionUse of NLP on free-text clinical notes and reports to ascertain Framingham HF criteria and phenotype.Primary and secondary outcome measuresNLP performance as measured by sensitivity, specificity, positive-predictive value (PPV) and agreement in ascertainment of Framingham HF criteria and phenotype. Manual medical record review by trained ARIC abstractors was used as the reference standard.ResultsOverall, performance of NLP ascertainment of Framingham HF phenotype in the validation dataset was good, with 78.8%, 81.7%, 84.4% and 80.0% for sensitivity, specificity, PPV and agreement, respectively.ConclusionsBy decreasing the need for manual chart review, our results on the use of NLP to ascertain Framingham HF phenotype from free-text electronic health record data suggest that validated NLP technology holds the potential for significantly improving the feasibility and efficiency of conducting large-scale epidemiologic surveillance of HF prevalence and incidence.

Download Full-text