A Database De-identification Framework to Enable Direct Queries on Medical Data for Secondary Use

2012 ◽  
Vol 51 (03) ◽  
pp. 229-241 ◽  
Author(s):  
B. S. Erdal ◽  
J. Liu ◽  
J. Ding ◽  
J. Chen ◽  
C. B. Marsh ◽  
...  

Summary Objective: To qualify the use of patient clinical records as non-human-subjects research, electronic medical record data must be de-identified so that the risk of exposing protected health information is minimal. This study demonstrates a robust framework for structured data de-identification that can be applied to any relational data source requiring de-identification. Methods: Using a real-world clinical data warehouse, a pilot implementation covering a limited set of subject areas was used to demonstrate and evaluate this new de-identification process. Query results and performance were compared between the source and target systems to validate data accuracy and usability. Results: The combination of hashing, pseudonyms, and a session-dependent randomizer provides a rigorous de-identification framework that guards against 1) source identifier exposure; 2) internal data analysts manually linking back to source identifiers; and 3) identifier cross-linking among different researchers or across multiple query sessions by the same researcher. In addition, a query rejection option refuses queries that would return fewer than a preset number of subjects or total records, preventing users from accidentally identifying subjects through low volumes of data. This framework does not prevent subject re-identification based on prior knowledge or a sequence of events. It also does not address de-identification of medical free text, although text de-identification using natural language processing can be added thanks to the framework's modular design. Conclusion: We demonstrated a framework that produces HIPAA-compliant databases that researchers can query directly. This technique can be augmented to facilitate inter-institutional research data sharing through existing middleware such as caGrid.
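As a rough illustration of the combination this abstract describes (salted hashing, session-dependent pseudonyms, and low-count query rejection), the following minimal Python sketch may help; the class, salt handling, and threshold are illustrative assumptions, not the authors' implementation.

```python
import hashlib
import secrets

class DeidentifiedQuerySession:
    """Sketch of the combination above: salted hashing produces pseudonyms,
    a per-session random key blocks cross-session linkage, and low-count
    queries are rejected."""

    def __init__(self, site_salt: str, min_subjects: int = 5):
        self.site_salt = site_salt                # institution-wide secret (assumed)
        self.session_key = secrets.token_hex(16)  # fresh randomizer per query session
        self.min_subjects = min_subjects          # preset rejection threshold

    def pseudonym(self, source_id: str) -> str:
        # The same patient maps to a different token in a different session,
        # so two researchers (or two sessions) cannot cross-link identifiers.
        digest = hashlib.sha256(
            (self.site_salt + self.session_key + source_id).encode()
        ).hexdigest()
        return digest[:16]

    def run_query(self, rows: list[dict]) -> list[dict]:
        # Query rejection option: refuse result sets with too few subjects
        # to prevent accidental identification from low data volumes.
        subjects = {row["patient_id"] for row in rows}
        if len(subjects) < self.min_subjects:
            raise PermissionError("Query rejected: fewer subjects than the preset threshold")
        return [dict(row, patient_id=self.pseudonym(row["patient_id"])) for row in rows]
```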

2014 ◽  
Vol 23 (01) ◽  
pp. 167-169 ◽  
Author(s):  
N. Griffon ◽  
J. Charlet ◽  
S. J. Darmoni ◽  

Summary Objective: To summarize the best papers in the field of Knowledge Representation and Management (KRM). Methods: A comprehensive review of the medical informatics literature was performed to select some of the most interesting papers on KRM and natural language processing (NLP) published in 2013. Results: Four articles were selected. One focuses on Electronic Health Record (EHR) interoperability for clinical pathway personalization based on structured data. The other three focus on NLP (corpus creation, de-identification, and co-reference resolution) and highlight the improving performance of NLP tools. Conclusion: NLP tools are close to seriously competing with humans on some annotation tasks. Their use could drastically increase the amount of data usable for the meaningful use of EHRs.


2021 ◽  
Vol 3 ◽  
Author(s):  
Aurelie Mascio ◽  
Robert Stewart ◽  
Riley Botelle ◽  
Marcus Williams ◽  
Luwaiza Mirza ◽  
...  

Background: Cognitive impairments are a neglected aspect of schizophrenia despite being a major factor in poor functional outcomes. They are usually measured using various rating scales; however, these require trained practitioners and are rarely applied routinely in clinical settings. Recent advances in natural language processing techniques allow such information to be extracted from unstructured portions of text at large scale and in a cost-effective manner. We aimed to identify cognitive problems in the clinical records of a large sample of patients with schizophrenia and to assess their association with clinical outcomes. Methods: We developed a natural language processing based application that identifies cognitive dysfunctions in the free text of medical records, and assessed its performance against a rating scale widely used in the United Kingdom, the cognitive component of the Health of the Nation Outcome Scales (HoNOS). Furthermore, we analyzed cognitive trajectories over the course of patient treatment and evaluated their relationship with various socio-demographic factors and clinical outcomes. Results: We found a high prevalence of cognitive impairments in patients with schizophrenia, and a strong correlation with several socio-demographic factors (gender, education, ethnicity, marital status, and employment) as well as with adverse clinical outcomes. Results obtained from the free text were broadly in line with those obtained using the HoNOS subscale, and shed light on additional associations, notably related to attention and social impairments in patients with higher education. Conclusions: Our findings demonstrate that cognitive problems are common in patients with schizophrenia, can be reliably extracted from clinical records using natural language processing, and are associated with adverse clinical outcomes. Harvesting the free text of medical records provides larger coverage than neurocognitive batteries or rating scales, along with access to additional socio-demographic and clinical variables. Text mining tools can therefore facilitate large-scale patient screening and early symptom detection, and ultimately help inform clinical decisions.
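The abstract does not detail the extraction application itself, but a minimal sketch of spotting cognitive-impairment mentions in free text might look as follows; the patterns, the crude negation check, and the example note are all illustrative assumptions, not the study's system.

```python
import re

# Illustrative patterns only; the study's actual lexicon is not described here.
COGNITIVE_PATTERNS = re.compile(
    r"\b(poor|impaired|reduced)\s+(memory|concentration|attention)\b"
    r"|\bcognitive (impairment|decline|deficit)s?\b",
    re.IGNORECASE,
)
NEGATION = re.compile(r"\b(no|denies|without)\b[^.]{0,40}$", re.IGNORECASE)

def find_cognitive_mentions(note: str) -> list[str]:
    """Return non-negated cognitive-impairment mentions from a clinical note."""
    mentions = []
    for match in COGNITIVE_PATTERNS.finditer(note):
        # Crude negation check: inspect the text just before the mention.
        if not NEGATION.search(note[: match.start()]):
            mentions.append(match.group(0))
    return mentions

print(find_cognitive_mentions("Patient reports poor memory; denies low mood."))
# -> ['poor memory']
```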


Author(s):  
Adam Gabriel Dobrakowski ◽  
Agnieszka Mykowiecka ◽  
Małgorzata Marciniak ◽  
Wojciech Jaworski ◽  
Przemysław Biecek

Abstract Medical free-text records store a great deal of useful information that can be exploited in developing computer-supported medicine. However, extracting knowledge from unstructured text is difficult and language-dependent. In this paper, we apply Natural Language Processing methods to raw medical texts in Polish and propose a new methodology for clustering patients' visits. We (1) extract medical terminology from a corpus of free-text clinical records, (2) annotate the data with medical concepts, (3) compute vector representations of medical concepts and validate them on proposed term analogy tasks, (4) compute visit representations as vectors, (5) introduce a new method for clustering patients' visits, and (6) apply the method to a corpus of 100,000 visits. We use several approaches to visual exploration that facilitate the interpretation of segments. With our method, we obtain stable and well-separated segments of visits that are positively validated against final medical diagnoses. We show how this algorithm for segmenting medical free-text records may be used to aid medical doctors. In addition, we share an implementation of the described methods, with examples, as an open-source package.
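As a hedged sketch of steps (3)-(6), the snippet below uses generic word2vec-style concept embeddings, mean-pooled visit vectors, and k-means; the paper's actual Polish-language pipeline and clustering method differ, and the toy visit data are invented.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Hypothetical inputs: each visit is a sequence of medical concept IDs,
# standing in for the output of the terminology-extraction and
# annotation stages (1)-(2).
visits = [
    ["cough", "fever", "infection"],
    ["hypertension", "headache"],
    ["cough", "infection", "antibiotic"],
    ["hypertension", "dizziness"],
] * 25  # repeat to give the toy corpus some mass

# Step (3): learn dense vector representations of medical concepts.
model = Word2Vec(sentences=visits, vector_size=32, window=5, min_count=1, seed=0)

# Step (4): represent each visit as the mean of its concept vectors.
visit_vectors = np.array(
    [np.mean([model.wv[c] for c in visit], axis=0) for visit in visits]
)

# Steps (5)-(6): cluster the visit vectors into segments.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(visit_vectors)
print(kmeans.labels_[:8])
```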


Author(s):  
Azad Dehghan ◽  
Tom Liptrot ◽  
Catherine O’Hara ◽  
Matthew Barker-Hewitt ◽  
Daniel Tibble ◽  
...  

ABSTRACT Objectives: Increasing interest in using unstructured electronic patient records for research has drawn attention to automated de-identification methods for the large-scale removal of Personally Identifiable Information (PII). PII mainly includes identifying information such as person names, dates (e.g., date of birth), reference numbers (e.g., hospital number, NHS number), locations (e.g., hospital names, addresses), contacts (e.g., telephone, e-mail), occupation, age, and other identity information (ethnicity, religion, sexuality) mentioned in a private context. De-identification of clinical free text remains crucial to enabling large-scale data access for health research while adhering to legal (Data Protection Act 1998) and ethical obligations. Here we present a computational method developed to automatically remove PII from clinical text. Approach: To automatically identify PII in clinical text, we developed and validated a Natural Language Processing (NLP) method that combines knowledge-driven (lexical dictionaries and rules) and data-driven (linear-chain conditional random fields) techniques. In addition, we designed a novel two-pass recognition approach that uses the output of the initial pass to create patient-level, run-time dictionaries, which are then used to identify PII mentions that lack the specific contextual clues required by the initial entity extraction modules. The labelled data used to train and validate our techniques were generated by six human annotators from two distinct types of free text from The Christie NHS Foundation Trust: (1) clinical correspondence (400 documents) and (2) clinical notes (1,300 documents). Results: The de-identification approach was developed and validated using a 60/40 split between development and test datasets. Preliminary results show that our method achieves 97% and 93% token-level F1-measure on clinical correspondence and clinical notes, respectively. The proposed two-pass recognition method was found to be particularly effective for longitudinal records. Notably, these performances are comparable to human benchmarks (based on inter-annotator agreement) of 97% and 90% F1, respectively. Conclusions: We have developed and validated a state-of-the-art method that matches human benchmarks in identifying and removing PII from free-text clinical records. The method has been further validated across multiple institutions and countries (United States and United Kingdom), where we identified a notable NLP challenge of cross-dataset adaptation and proposed active learning methods to address it. The algorithm, including an active learning component, will be provided as open source to the healthcare community.
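The two-pass idea can be sketched in a few lines: a first pass (standing in for the dictionary, rule, and CRF extractors) harvests PII found via contextual clues, and a second pass redacts those same strings wherever they recur without such clues. The rule and documents below are toy assumptions, not the validated system.

```python
import re

def first_pass(documents):
    """Pass 1 sketch: a stand-in for the knowledge- and data-driven
    extractors described above, reduced to one trivial contextual rule
    to keep the example self-contained."""
    found = set()
    for doc in documents:
        # Toy rule: a capitalised token following an honorific is a name.
        found.update(re.findall(r"\b(?:Mr|Mrs|Dr)\.?\s+([A-Z][a-z]+)", doc))
    return found

def second_pass(documents, runtime_dictionary):
    """Pass 2: re-scan with the patient-level dictionary built in pass 1,
    catching mentions that lack the contextual clues pass 1 relies on."""
    redacted = []
    for doc in documents:
        for name in runtime_dictionary:
            doc = re.sub(rf"\b{re.escape(name)}\b", "[NAME]", doc)
        redacted.append(doc)
    return redacted

docs = ["Seen by Dr. Smith today.", "Smith reviewed the scan results."]
names = first_pass(docs)          # {'Smith'} - found via the contextual clue
print(second_pass(docs, names))   # both mentions redacted, even without clues
```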


Author(s):  
Mario Jojoa Acosta ◽  
Gema Castillo-Sánchez ◽  
Begonya Garcia-Zapirain ◽  
Isabel de la Torre Díez ◽  
Manuel Franco-Martín

The use of artificial intelligence in health care has grown quickly. Here we present our work on applying Natural Language Processing techniques to analyze the sentiment of users who answered two questions from the CSQ-8 questionnaire in raw Spanish free text. Their responses concern mindfulness, a technique used to control the stress and anxiety caused by different factors in daily life. We offered an online course applying this method to improve the quality of life of health care professionals during the COVID-19 pandemic, and we also evaluated the satisfaction of the participants involved, with a view to establishing strategies for improving future experiences. To perform this task automatically, we used Natural Language Processing (NLP) models such as Swivel embeddings, neural networks, and transfer learning to classify the inputs into three categories: negative, neutral, and positive. Because of the limited amount of data available (86 records for the first question and 68 for the second), transfer learning techniques were required. The length of the text was unrestricted from the user's standpoint, and our approach attained maximum accuracies of 93.02% and 90.53%, respectively, against ground truth labeled by three experts. Finally, we proposed a complementary analysis using graphical text representations based on word frequency to help researchers identify relevant information in the opinions through an objective approach to sentiment. The main conclusion drawn from this work is that applying NLP techniques with transfer learning to small amounts of data can achieve sufficient accuracy in the sentiment analysis and text classification stages.
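A minimal transfer-learning sketch in the spirit of this setup is shown below, freezing a pre-trained Swivel sentence embedding under a small softmax classifier; the TF Hub handle is a public English demo module used here as a stand-in (the study worked with Spanish text), and the miniature dataset is invented.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Pre-trained Swivel embedding, kept frozen (the transfer-learning part).
embedding = hub.KerasLayer(
    "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1",
    input_shape=[], dtype=tf.string, trainable=False,
)

model = tf.keras.Sequential([
    embedding,                                       # sentence -> 20-dim vector
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),  # negative / neutral / positive
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Hypothetical miniature dataset; the real labels came from three experts.
texts = tf.constant(["me ayudo mucho", "sin cambios", "no me gusto"])
labels = tf.constant([2, 1, 0])
model.fit(texts, labels, epochs=5, verbose=0)
```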


2021 ◽  
Vol 28 (1) ◽  
pp. e100262
Author(s):  
Mustafa Khanbhai ◽  
Patrick Anyadi ◽  
Joshua Symons ◽  
Kelsey Flott ◽  
Ara Darzi ◽  
...  

Objectives: Unstructured free-text patient feedback contains rich information, but analysing these data manually would require personnel resources that are not available in most healthcare organisations. We undertook a systematic review of the literature on the use of natural language processing (NLP) and machine learning (ML) to process and analyse free-text patient experience data. Methods: Databases were systematically searched to identify articles published between January 2000 and December 2019 that examined NLP for analysing free-text patient feedback. Given the heterogeneous nature of the studies, a narrative synthesis was deemed most appropriate. Data on study purpose, corpus, methodology, performance metrics and indicators of quality were recorded. Results: Nineteen articles were included. The majority (80%) of studies applied language analysis techniques to patient feedback from social media sites (unsolicited), followed by structured surveys (solicited). Supervised learning was used most frequently (n=9), followed by unsupervised (n=6) and semi-supervised (n=3) approaches. Comments extracted from social media were analysed using an unsupervised approach, whereas free-text comments held within structured surveys were analysed using a supervised approach. Reported performance metrics included precision, recall and the F-measure, with support vector machines and Naïve Bayes being the best performing ML classifiers. Conclusion: NLP and ML have emerged as important tools for processing unstructured free text. Both supervised and unsupervised approaches have a role, depending on the data source. With the advancement of data analysis tools, these techniques may help healthcare organisations generate insight from their volumes of unstructured free-text data.
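To make the best-performing supervised setup concrete, here is a hedged sketch of TF-IDF features feeding the two classifiers the review highlights; the comments, labels, and pipeline are invented for illustration, not drawn from any reviewed study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labelled feedback; real studies used social media posts and
# survey comments annotated with experience themes or sentiment.
comments = ["staff were kind and helpful", "waited four hours, no updates",
            "clean ward and clear discharge advice", "rude receptionist"]
labels = ["positive", "negative", "positive", "negative"]

for clf in (LinearSVC(), MultinomialNB()):
    pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    pipeline.fit(comments, labels)
    print(type(clf).__name__, pipeline.predict(["helpful and kind staff"]))
```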


BMJ Open ◽  
2021 ◽  
Vol 11 (6) ◽  
pp. e047356
Author(s):  
Carlton R Moore ◽  
Saumya Jain ◽  
Stephanie Haas ◽  
Harish Yadav ◽  
Eric Whitsel ◽  
...  

Objectives: Using free-text clinical notes and reports from hospitalised patients, determine the performance of natural language processing (NLP) in ascertaining the Framingham heart failure (HF) criteria and phenotype. Study design: A retrospective observational study of patients hospitalised in 2015 at four hospitals participating in the Atherosclerosis Risk in Communities (ARIC) study was used to determine NLP performance in ascertaining the Framingham HF criteria and phenotype. Setting: Four ARIC study hospitals, each representing an ARIC study region in the USA. Participants: A stratified random sample of hospitalisations occurring during 2015 and identified using a broad range of International Classification of Diseases, ninth revision, diagnostic codes indicative of an HF event was drawn for this study. A randomly selected set of 394 hospitalisations was used as the derivation dataset and 406 as the validation dataset. Intervention: Use of NLP on free-text clinical notes and reports to ascertain the Framingham HF criteria and phenotype. Primary and secondary outcome measures: NLP performance, measured by sensitivity, specificity, positive-predictive value (PPV) and agreement in ascertaining the Framingham HF criteria and phenotype. Manual medical record review by trained ARIC abstractors served as the reference standard. Results: Overall, the performance of NLP ascertainment of the Framingham HF phenotype in the validation dataset was good: 78.8%, 81.7%, 84.4% and 80.0% for sensitivity, specificity, PPV and agreement, respectively. Conclusions: By decreasing the need for manual chart review, our results on the use of NLP to ascertain the Framingham HF phenotype from free-text electronic health record data suggest that validated NLP technology holds the potential to significantly improve the feasibility and efficiency of conducting large-scale epidemiologic surveillance of HF prevalence and incidence.
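The reported measures can all be derived from a 2x2 confusion table comparing NLP output against the abstractor reference standard, as in this small sketch; the counts passed in are hypothetical, since the abstract reports only the derived percentages.

```python
def nlp_vs_reference(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Performance of binary phenotype ascertainment against manual
    chart review, using the measures reported above."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),   # recall against the abstractor standard
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),           # positive-predictive value
        "agreement": (tp + tn) / total,  # raw per cent agreement
    }

# Hypothetical counts for illustration only.
print(nlp_vs_reference(tp=130, fp=24, fn=35, tn=217))
```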


2020 ◽  
Vol 4 (Supplement_1) ◽  
pp. 183-183
Author(s):  
Javad Razjouyan ◽  
Jennifer Freytag ◽  
Edward Odom ◽  
Lilian Dindo ◽  
Aanand Naik

Abstract Patient Priorities Care (PPC) is a model of care that aligns health care recommendations with the priorities of older adults with multiple chronic conditions. Social workers (SWs), after online training, document PPC in the patient's electronic health record (EHR). Our goal was to identify free-text notes containing PPC language using a natural language processing (NLP) model and to measure PPC adoption and its effect on long-term services and supports (LTSS) use. Free-text notes from the EHR produced by trained SWs were passed through a hybrid NLP model combining rule-based and statistical machine learning. NLP accuracy was validated against chart review. Patients who received PPC were propensity matched with patients not receiving PPC (controls) on age, gender, BMI, Charlson comorbidity index, facility and SW. The change in LTSS utilization over 6-month intervals was compared between groups with univariate analysis. Chart review indicated that 491 of 689 notes contained PPC language, and the NLP model reached a precision of 0.85, a recall of 0.90, an F1 of 0.87, and an accuracy of 0.91. Within-group analysis showed that the intervention group used LTSS 1.8 times more in the 6 months after the encounter than in the 6 months prior. Between-group analysis showed that the intervention group had significantly higher LTSS utilization (p=0.012). An automated NLP model can be used to reliably measure the adoption of PPC by SWs. PPC appears to encourage use of LTSS, which may delay time to long-term care placement.
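A hedged sketch of a hybrid rule-based plus statistical detector of PPC language follows; the phrase rule, the stand-in logistic regression classifier, and the toy notes are all assumptions, not the study's model.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-written rule component: explicit PPC phrases (illustrative only).
PPC_RULE = re.compile(r"patient('s)? priorit(y|ies)|what matters most", re.IGNORECASE)

# Statistical component: a toy classifier standing in for the study's
# machine-learned model; the training notes and labels are invented.
notes = ["Discussed what matters most to the veteran",
         "Routine medication reconciliation performed",
         "Aligned plan with patient priorities", "Follow-up scheduled"]
labels = [1, 0, 1, 0]
clf = make_pipeline(CountVectorizer(), LogisticRegression()).fit(notes, labels)

def has_ppc_language(note: str) -> bool:
    # Hybrid decision: fire on either a rule hit or the model's vote.
    return bool(PPC_RULE.search(note)) or clf.predict([note])[0] == 1

print(has_ppc_language("Reviewed the patient's priorities for care"))
```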


2021 ◽  
pp. 1063293X2098297
Author(s):  
Ivar Örn Arnarsson ◽  
Otto Frost ◽  
Emil Gustavsson ◽  
Mats Jirstrand ◽  
Johan Malmqvist

Product development companies collect data in the form of Engineering Change Requests for logged design issues, tests, and product iterations. These documents are rich in unstructured data (e.g., free text). Previous research affirms that product developers find current IT systems lacking in capabilities to accurately retrieve relevant documents containing unstructured data. In this research, we demonstrate a method that uses Natural Language Processing and document clustering algorithms to find structurally or contextually related documents in databases of Engineering Change Request documents. The aim is to radically decrease the time needed to search effectively for related engineering documents, organize search results, and create labeled clusters from these documents using Natural Language Processing algorithms. A domain knowledge expert at the case company evaluated the results and confirmed that the applied algorithms found relevant document clusters for the queries tested.
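As an illustration of the general approach (not the paper's specific algorithms), the sketch below clusters toy Engineering Change Request texts with TF-IDF and k-means and labels each cluster by its highest-weighted terms; all data are invented.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy ECR texts; real inputs would be the free-text fields of the
# company's Engineering Change Request documents.
ecrs = ["bracket cracks under vibration load", "update wiring harness routing",
        "fatigue crack in mounting bracket", "harness connector pin mismatch"]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(ecrs)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Label each cluster with its highest-weighted terms, giving reviewers
# a quick summary of what the grouped documents share.
terms = vectorizer.get_feature_names_out()
for i, centroid in enumerate(kmeans.cluster_centers_):
    top = centroid.argsort()[::-1][:3]
    print(f"cluster {i}:", ", ".join(terms[j] for j in top))
```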

