Developing data governance standards for using free-text data in research (TexGov)

Author(s):  
Kerina Jones ◽  
Elizabeth Ford ◽  
Nathan Lea ◽  
Lucy Griffiths ◽  
Sharon Heys ◽  
...  

Background Free-text data represent a vast, untapped source of rich information to guide research and public service delivery. They contain a wealth of additional detail that, if more accessible, would clarify and supplement information coded in structured data fields. Personal data usually need to be de-identified or anonymised before they can be used for purposes such as audit and research, but there are major challenges in finding effective methods to de-identify free text that do not damage data utility as a by-product. The main aim of the TexGov project is to work towards data governance standards to enable free-text data to be used safely for public benefit. Methods We conducted a rapid literature review to explore the data governance models used in working with free-text data, plus case studies of systems making de-identified free-text data available for research; we engaged with text-mining researchers and the general public to explore barriers and solutions in working with free text; and we outlined (UK) data protection legislation and regulations for context. Results We reviewed 50 articles and the models of 4 systems providing access to de-identified free text. The main emerging themes were: i) patient involvement at identifiable and de-identified data stages; ii) questions of consent and notification for the reuse of free-text data; iii) working with identifiable data for Natural Language Processing (NLP) algorithm development; and iv) de-identification methods and thresholds of reliability. Conclusion We have proposed a set of recommendations, including: ensuring public transparency in data flows and uses; adhering to the principle of minimal data extraction; treating de-identified, blacklisted free text as potentially identifiable, with use limited to accredited data safe havens; and committing to a culture of continuous improvement to understand the relationships between the accuracy of de-identification and re-identification risk, so that this can be communicated to all stakeholders.
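The caution around blacklist-style de-identification is easier to see with a toy example: a blacklist only removes what its patterns anticipate, so anything missed leaks through, which is why such output is treated as only potentially de-identified. A minimal, hypothetical Python sketch (all patterns, placeholders, and the example note are invented for illustration):

```python
import re

# Toy illustration of blacklist-style de-identification: spans matching
# known identifier patterns are replaced with typed placeholders. These
# patterns are invented; real systems use far richer dictionaries,
# NLP-based entity recognition, and validation.
BLACKLIST_PATTERNS = [
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "[DATE]"),           # e.g. 04/07/2019
    (re.compile(r"\b[A-Z]{2}\d{6}\b"), "[ID]"),                 # invented ID format
    (re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.?\s+[A-Z][a-z]+\b"), "[NAME]"),
]

def deidentify(text: str) -> str:
    """Redact blacklisted spans; anything the patterns miss leaks through,
    which is why the output is treated as only *potentially* de-identified."""
    for pattern, placeholder in BLACKLIST_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

note = "Dr Smith reviewed the patient (AB123456) on 04/07/2019."
print(deidentify(note))
# -> "[NAME] reviewed the patient ([ID]) on [DATE]."
```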


10.2196/16760 ◽  
2020 ◽  
Vol 22 (6) ◽  
pp. e16760
Author(s):  
Kerina H Jones ◽  
Elizabeth M Ford ◽  
Nathan Lea ◽  
Lucy J Griffiths ◽  
Lamiece Hassan ◽  
...  

Background Clinical free-text data (eg, outpatient letters or nursing notes) represent a vast, untapped source of rich information that, if more accessible for research, would clarify and supplement information coded in structured data fields. Data usually need to be deidentified or anonymized before they can be reused for research, but there is a lack of established guidelines to govern effective deidentification and use of free-text information and avoid damaging data utility as a by-product. Objective This study aimed to develop recommendations for the creation of data governance standards to integrate with existing frameworks for personal data use, to enable free-text data to be used safely for research for patient and public benefit. Methods We outlined data protection legislation and regulations relating to the United Kingdom for context and conducted a rapid literature review and UK-based case studies to explore data governance models used in working with free-text data. We also engaged with stakeholders, including text-mining researchers and the general public, to explore perceived barriers and solutions in working with clinical free-text. Results We proposed a set of recommendations, including the need for authoritative guidance on data governance for the reuse of free-text data, to ensure public transparency in data flows and uses, to treat deidentified free-text data as potentially identifiable with use limited to accredited data safe havens, and to commit to a culture of continuous improvement to understand the relationships between the efficacy of deidentification and reidentification risks, so this can be communicated to all stakeholders. Conclusions By drawing together the findings of a combination of activities, we present a position paper to contribute to the development of data governance standards for the reuse of clinical free-text data for secondary purposes. While working in accordance with existing data governance frameworks, there is a need for further work to take forward the recommendations we have proposed, with commitment and investment, to assure and expand the safe reuse of clinical free-text data for public benefit.


2021 ◽  
Author(s):  
Anahita Davoudi ◽  
Natalie Lee ◽  
Thaibinh Luong ◽  
Timothy Delaney ◽  
Elizabeth Asch ◽  
...  

Background: Free-text communication between patients and providers is playing an increasing role in chronic disease management, through platforms varying from traditional healthcare portals to more novel mobile messaging applications. These text data are rich resources for clinical and research purposes, but their sheer volume renders them difficult to manage. Even automated approaches such as natural language processing require labor-intensive manual classification for developing training datasets, which is a rate-limiting step. Automated approaches to organizing free-text data are therefore necessary to facilitate the use of free-text communication for clinical care and research. Objective: We applied unsupervised learning approaches to 1) understand the types of topics discussed and 2) learn medication-related intents from messages sent between patients and providers through a bidirectional text messaging system for managing participants' blood pressure. Methods: This study was a secondary analysis of de-identified messages from a remote, mobile, text-based employee hypertension management program at an academic institution. In experiment 1, we trained a Latent Dirichlet Allocation (LDA) model for each message type (inbound-patient and outbound-provider) and identified the distribution of major topics and significant topics (probability >0.20) across message types. In experiment 2, we annotated all medication-related messages with a single medication intent. We then trained a second LDA model (medLDA) to assess how well the unsupervised method could identify more fine-grained medication intents. We encoded each medication message with n-grams (n = 1-3 words) using spaCy, clinical named entities using Stanza, and medication categories using MedEx, and then applied chi-square feature selection to learn the most informative features associated with each medication intent. Results: A total of 253 participants and 5 providers engaged in the program, generating 12,131 messages: 47% patient messages and 53% provider messages. Most patient messages corresponded to blood pressure (BP) reporting, BP encouragement, and appointment scheduling, whereas most provider messages corresponded to BP reporting, medication adherence, and confirmatory statements. In experiment 1, for both patient and provider messages, most messages contained 1 topic, and few contained more than 3 topics, as identified using LDA. However, manual review of messages within topics revealed significant heterogeneity even within single-topic messages as identified by LDA. In experiment 2, among the 534 medication messages annotated with a single medication intent, most of the 282 patient medication messages referred to medication requests (48%; n=134) and medication taking (28%; n=79), and most of the 252 provider medication messages referred to medication questions (69%; n=173). Although medLDA could identify a majority intent within each topic, the model could not distinguish medication intents with low prevalence within either patient or provider messages. Richer feature engineering identified informative lexical-semantic patterns associated with each medication intent class. Conclusion: LDA can be an effective method for generating subgroups of messages with similar term usage and can facilitate the review of topics to inform annotation. However, few training cases and shared vocabulary between intents preclude the use of LDA for fully automated, deep medication intent classification.
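For readers wanting to see the shape of experiment 1, here is a minimal sketch of LDA topic assignment using scikit-learn; the toy messages, topic count, and feature settings are invented and are not the authors' configuration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus standing in for patient messages (invented examples).
messages = [
    "my blood pressure this morning was 132 over 84",
    "can you refill my lisinopril prescription",
    "took my medication today, bp 128/80",
    "need to reschedule my appointment next week",
]

# Bag-of-words counts over 1-3 grams; a richer pipeline would add
# clinical entities (e.g. via Stanza) and medication categories
# (e.g. via MedEx) as additional features.
vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words="english")
counts = vectorizer.fit_transform(messages)

# Fit LDA and inspect per-message topic distributions; topics with
# probability > 0.20 would count as "significant" under the paper's rule.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(counts)
for msg, dist in zip(messages, doc_topics):
    significant = [i for i, p in enumerate(dist) if p > 0.20]
    print(f"{msg[:40]!r} -> significant topics: {significant}")
```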


2018 ◽  
Author(s):  
Jeremy Petch ◽  
Jane Batt ◽  
Joshua Murray ◽  
Muhammad Mamdani

BACKGROUND The increasing adoption of electronic health records (EHRs) in clinical practice holds the promise of improving care and advancing research by serving as a rich source of data, but most EHRs allow clinicians to enter data in a text format without much structure. Natural language processing (NLP) may reduce reliance on manual abstraction of these text data by extracting clinical features directly from unstructured clinical digital text data and converting them into structured data. OBJECTIVE This study aimed to assess the performance of a commercially available NLP tool for extracting clinical features from free-text consult notes. METHODS We conducted a pilot, retrospective, cross-sectional study of the accuracy of NLP from dictated consult notes from our tuberculosis clinic with manual chart abstraction as the reference standard. Consult notes for 130 patients were extracted and processed using NLP. We extracted 15 clinical features from these consult notes and grouped them a priori into categories of simple, moderate, and complex for analysis. RESULTS For the primary outcome of overall accuracy, NLP performed best for features classified as simple, achieving an overall accuracy of 96% (95% CI 94.3-97.6). Performance was slightly lower for features of moderate clinical and linguistic complexity at 93% (95% CI 91.1-94.4), and lowest for complex features at 91% (95% CI 87.3-93.1). CONCLUSIONS The findings of this study support the use of NLP for extracting clinical features from dictated consult notes in the setting of a tuberculosis clinic. Further research is needed to fully establish the validity of NLP for this and other purposes.
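The reported accuracies are binomial proportions with 95% confidence intervals; one conventional way to compute such an interval is shown below (the counts are invented for illustration, and the study's exact CI method is not stated in the abstract):

```python
from statsmodels.stats.proportion import proportion_confint

# Invented counts: agreements between NLP output and manual chart
# abstraction (the reference standard) for one feature category.
n_correct, n_total = 1248, 1300  # e.g. a "simple" feature group

accuracy = n_correct / n_total
# The Wilson score interval is a common choice for binomial proportions.
low, high = proportion_confint(n_correct, n_total, alpha=0.05, method="wilson")
print(f"accuracy = {accuracy:.1%} (95% CI {low:.1%}-{high:.1%})")
```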


Author(s):  
Keno K Bressem ◽  
Lisa C Adams ◽  
Robert A Gaudin ◽  
Daniel Tröltzsch ◽  
Bernd Hamm ◽  
...  

Abstract Motivation The development of deep, bidirectional transformers such as Bidirectional Encoder Representations from Transformers (BERT) has led to substantial improvements on several Natural Language Processing (NLP) benchmarks. In radiology especially, large amounts of free-text data are generated in the daily clinical workflow. These report texts could be of particular use for the generation of labels in machine learning, especially for image classification. However, as report texts are mostly unstructured, advanced NLP methods are needed to enable accurate text classification. While neural networks can be used for this purpose, they must first be trained on large amounts of manually labelled data to achieve good results. In contrast, BERT models can be pre-trained on unlabelled data and then require fine-tuning on only a small amount of manually labelled data to achieve even better results. Results Using BERT to identify the most important findings in intensive care chest radiograph reports, we achieve areas under the receiver operating characteristic curve of 0.98 for congestion, 0.97 for effusion, 0.97 for consolidation and 0.99 for pneumothorax, surpassing the accuracy of previous approaches with comparatively little annotation effort. Our approach could therefore help to improve information extraction from free-text medical reports. Availability and implementation We make the source code for fine-tuning the BERT models freely available at https://github.com/fast-raidiology/bert-for-radiology. Supplementary information Supplementary data are available at Bioinformatics online.
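The authors' own fine-tuning code is in the linked repository; purely as a generic illustration of multi-label BERT fine-tuning, a sketch with the Hugging Face transformers library follows (checkpoint name, label set, and example data are placeholders, not the paper's setup):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["congestion", "effusion", "consolidation", "pneumothorax"]

# Placeholder checkpoint for illustration; a domain-appropriate BERT
# checkpoint would be substituted in practice.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # one sigmoid per label, BCE loss
)

# One invented training example: report text with a multi-hot label vector.
report = "Interval increase in left pleural effusion. No pneumothorax."
labels = torch.tensor([[0.0, 1.0, 0.0, 0.0]])  # effusion only

inputs = tokenizer(report, truncation=True, return_tensors="pt")
outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # one fine-tuning step; a real run loops over batches

probs = torch.sigmoid(outputs.logits).detach()
print(dict(zip(LABELS, probs.squeeze().tolist())))
```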


2020 ◽  
Vol 102-B (7_Supple_B) ◽  
pp. 99-104
Author(s):  
Romil F. Shah ◽  
Stefano Bini ◽  
Thomas Vail

Aims Natural Language Processing (NLP) offers an automated method to extract data from unstructured free-text fields for arthroplasty registry participation. Our objective was to investigate how accurately NLP can be used to extract structured clinical data from unstructured clinical notes when compared with manual data extraction. Methods A group of 1,000 randomly selected clinical and hospital notes from eight different surgeons were collected for patients undergoing primary arthroplasty between 2012 and 2018. In all, 19 preoperative, 17 operative, and 2 postoperative variables of interest were manually extracted from these notes. An NLP algorithm was created to automatically extract these variables from a training sample of these notes, and the algorithm was tested on a random test sample of notes. Performance of the NLP algorithm was measured in Statistical Analysis System (SAS) by calculating the accuracy of the variables collected, the ability of the algorithm to collect the correct information when it was indeed in the note (sensitivity), and the ability of the algorithm to not collect a certain data element when it was not in the note (specificity). Results The NLP algorithm performed well at extracting variables from unstructured data in our random test dataset (accuracy = 96.3%, sensitivity = 95.2%, and specificity = 97.4%). It performed better at extracting data in a structured, templated format, such as range of movement (ROM) (accuracy = 98%) and implant brand (accuracy = 98%), than data entered with variation depending on the author of the note, such as the presence of deep-vein thrombosis (DVT) (accuracy = 90%). Conclusion The NLP algorithm used in this study was able to identify a subset of variables from randomly selected unstructured notes in arthroplasty with an accuracy above 90%. For some variables, such as objective exam data, the accuracy was very high. Our findings suggest that automated algorithms using NLP can help orthopaedic practices retrospectively collect information for registries and quality improvement (QI) efforts. Cite this article: Bone Joint J 2020;102-B(7 Supple B):99–104.
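Templated fields such as ROM are exactly where pattern-based extraction performs best, since the text follows a predictable shape; a hypothetical sketch of the idea (the note format and regex are invented and are not the study's algorithm):

```python
import re

# Invented templated sentence, imitating structured exam documentation.
note = "Exam: ROM 5-120 degrees. Implant: ACME KneeFlex 3. No evidence of DVT."

# Templated data such as ROM follow a predictable shape, so one pattern
# suffices; free-form findings like DVT vary by author and are harder.
rom_pattern = re.compile(r"ROM\s+(\d+)\s*-\s*(\d+)\s*degrees", re.IGNORECASE)

match = rom_pattern.search(note)
if match:
    rom_min, rom_max = map(int, match.groups())
    print({"rom_min_deg": rom_min, "rom_max_deg": rom_max})
else:
    print("ROM not documented")  # correctly reporting absence drives specificity
```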


Author(s):  
Mario Jojoa ◽  
Gema Castillo-Sánchez ◽  
Begonya Garcia-Zapirain ◽  
Isabel De la Torre Diez ◽  
Manuel Franco-Martín

The aim of this study was to build a tool to analyze, using artificial intelligence, the sentiment perception of users who answered two questions from the CSQ-8 questionnaire with raw Spanish free text. Their responses relate to mindfulness, a novel technique used to control stress and anxiety caused by different factors in daily life. As such, we proposed an online course where this method was applied in order to improve the quality of life of health care professionals during the COVID-19 pandemic. We also carried out an evaluation of the satisfaction level of the participants involved, with a view to establishing strategies to improve future experiences. To automatically perform this task, we used Natural Language Processing (NLP) models such as Swivel embeddings, neural networks, and transfer learning, so as to classify the inputs into the following 3 categories: negative, neutral, and positive. Due to the limited amount of data available - 86 records for the first question and 68 for the second - transfer learning techniques were required. The length of the text had no limit from the user's standpoint, and our approach attained a maximum accuracy of 93.02% and 90.53%, respectively, based on ground truth labeled by 3 experts. Finally, we proposed a complementary analysis, using computer graphic text representation based on word frequency, to help researchers identify relevant information about the opinions with an objective approach to sentiment. The main conclusion drawn from this work is that the application of NLP techniques with transfer learning on small amounts of data can achieve sufficient accuracy in the sentiment analysis and text classification stages.
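A common way to combine pretrained Swivel embeddings with transfer learning on a small labeled set is a frozen TensorFlow Hub embedding feeding a small classifier head; the sketch below is a generic illustration rather than the authors' model (the Hub module shown embeds English text, and the toy data are invented):

```python
import tensorflow as tf
import tensorflow_hub as hub

# Pretrained Swivel sentence embedding (20-d, trained on English news);
# frozen here so the tiny labeled set only trains the classifier head.
embed = hub.KerasLayer(
    "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1",
    input_shape=[], dtype=tf.string, trainable=False,
)

model = tf.keras.Sequential([
    embed,
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),  # negative / neutral / positive
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Invented toy examples; the study had only ~86 and ~68 labeled responses.
texts = tf.constant(["very helpful course", "it was ok", "did not help at all"])
labels = tf.constant([2, 1, 0])
model.fit(texts, labels, epochs=10, verbose=0)
print(model.predict(tf.constant(["excellent experience"])))
```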


2021 ◽  
Author(s):  
Dane Gunter ◽  
Paulo Puac-Polanco ◽  
Olivier Miguel ◽  
Rebecca E. Thornhill ◽  
Amy Y. X. Yu ◽  
...  

BACKGROUND Data extraction from radiology free-text reports is time-consuming when performed manually. Recently, more automated extraction methods using natural language processing (NLP) have been proposed. A previously developed rule-based NLP algorithm showed promise in its ability to extract stroke-related data from radiology reports. OBJECTIVE We aimed to externally validate the accuracy of CHARTextract, a rule-based NLP algorithm, in extracting stroke-related data from free-text radiology reports. METHODS Free-text reports of CT angiography (CTA) and CT perfusion (CTP) studies of consecutive patients with acute ischemic stroke admitted to a regional stroke center for endovascular thrombectomy were analyzed from January 2015 to 2021. Stroke-related variables were manually extracted (reference standard) from the reports, including proximal and distal anterior circulation occlusion, posterior circulation occlusion, presence of ischemia, hemorrhage, Alberta Stroke Program Early CT Score (ASPECTS), and collateral status. These variables were simultaneously extracted using the rule-based NLP algorithm. The NLP algorithm's accuracy, specificity, sensitivity, positive predictive value (PPV), and negative predictive value (NPV) were assessed. RESULTS The NLP algorithm's accuracy was >90% for identifying distal anterior circulation occlusion, posterior circulation occlusion, hemorrhage, and ASPECTS. Accuracy was 85%, 74%, and 79% for proximal anterior circulation occlusion, presence of ischemia, and collateral status, respectively. The algorithm had an accuracy of 87-100% for the detection of variables not reported in radiology reports. CONCLUSIONS Rule-based NLP has moderate to good performance for stroke-related data extraction from free-text imaging reports. The algorithm's accuracy was affected by inconsistent report styles and lexicons among reporting radiologists.
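To give a flavor of what rule-based extraction from radiology text involves, including the negation handling that inconsistent report styles complicate, here is a hypothetical sketch (the rules are invented and are not CHARTextract's):

```python
import re

report = "CT head: ASPECTS 7. No intracranial hemorrhage. M1 occlusion on CTA."

# Invented rules for illustration. Each variable needs its own patterns
# plus negation handling, which is why inconsistent report styles and
# lexicons across radiologists degrade accuracy.
aspects_re = re.compile(r"ASPECTS?\s*[:=]?\s*(\d{1,2})", re.I)
hemorrhage_re = re.compile(r"h(?:a)?emorrhage", re.I)
neg_hemorrhage_re = re.compile(r"\bno\b[^.]*h(?:a)?emorrhage", re.I)  # same sentence
occlusion_re = re.compile(r"\b(M1|M2|ICA)\b[^.]*occlusion", re.I)

m = aspects_re.search(report)
extracted = {
    "aspects": int(m.group(1)) if m else None,
    "hemorrhage": bool(hemorrhage_re.search(report))
                  and not neg_hemorrhage_re.search(report),
    "proximal_occlusion": bool(occlusion_re.search(report)),
}
print(extracted)  # {'aspects': 7, 'hemorrhage': False, 'proximal_occlusion': True}
```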


2021 ◽  
Author(s):  
Peter Elkin ◽  
Sarah Mullin ◽  
Jack Mardekian ◽  
Chris Crowner ◽  
Sylvester Sakilay ◽  
...  

BACKGROUND Non-valvular atrial fibrillation (NVAF) affects almost 6 million Americans and is a major contributor to strokes, but it is significantly underdiagnosed and undertreated despite explicit guidelines for oral anticoagulation. OBJECTIVE We investigated whether semi-supervised natural language processing (NLP) of free-text information in electronic health records (EHRs), combined with structured EHR data, improves NVAF discovery and treatment, perhaps offering a method to prevent thousands of deaths and save billions of dollars. METHODS We abstracted a set of 96,681 participants from the University at Buffalo's faculty practice's EHR. NLP was used to index the notes and to compare the ability to identify NVAF, CHA2DS2-VASc, and HAS-BLED scores using structured data (ICD codes) alone versus structured plus unstructured data from clinical notes. Additionally, we analyzed data from 63,296,120 participants in the Optum and Truven databases to determine the frequency of NVAF, the rates of CHA2DS2-VASc ≥ 2 with no contraindications to oral anticoagulants (OACs), the rates of stroke and death in the untreated population, and first-year costs after stroke [16,17]. RESULTS The structured-plus-unstructured method would have identified 3,976,056 additional true NVAF cases (P<0.001) and improved sensitivity for CHA2DS2-VASc and HAS-BLED scores compared with structured data alone (P=0.00195 and P<0.001, respectively), a 32.1% improvement. For the US, this method would prevent an estimated 176,537 strokes, save 10,575 lives, and save over $13.5 billion. CONCLUSIONS AI-informed bio-surveillance combining NLP of free-text information with structured EHR data improves data completeness, could prevent thousands of strokes, and would save lives and funds. This method is applicable to many disorders with profound public health consequences. CLINICALTRIAL None
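For context on the scores involved, CHA2DS2-VASc is a simple additive stroke-risk rule, so computing it from already-extracted patient features is straightforward; a sketch of the standard scoring (the patient fields are illustrative, not the study's data model):

```python
from dataclasses import dataclass

@dataclass
class Patient:
    # Illustrative fields only, not the study's data model.
    age: int
    female: bool
    chf: bool                 # congestive heart failure / LV dysfunction
    hypertension: bool
    diabetes: bool
    prior_stroke_tia: bool    # stroke, TIA, or thromboembolism
    vascular_disease: bool    # MI, PAD, or aortic plaque

def cha2ds2_vasc(p: Patient) -> int:
    """Standard CHA2DS2-VASc: 2 points for age >= 75 and for prior
    stroke/TIA; 1 point each for the remaining factors (maximum 9)."""
    score = 0
    score += 1 if p.chf else 0
    score += 1 if p.hypertension else 0
    score += 2 if p.age >= 75 else (1 if p.age >= 65 else 0)
    score += 1 if p.diabetes else 0
    score += 2 if p.prior_stroke_tia else 0
    score += 1 if p.vascular_disease else 0
    score += 1 if p.female else 0
    return score

p = Patient(age=72, female=True, chf=False, hypertension=True,
            diabetes=False, prior_stroke_tia=False, vascular_disease=True)
print(cha2ds2_vasc(p))  # 4 -> meets the paper's CHA2DS2-VASc >= 2 threshold
```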

