Protected Health Information filter (Philter): accurately and securely de-identifying free-text clinical notes

2020, Vol 3 (1)
Author(s): Beau Norgeot, Kathleen Muenzen, Thomas A. Peterson, Xuancheng Fan, Benjamin S. Glicksberg, ...

2019
Author(s): Zachary N. Flamholz, Lyle H. Ungar, Gary E. Weissman

Abstract. Rationale: Word embeddings are used to create vector representations of text data, but not all embeddings appropriately capture clinical information, are free of protected health information, and are computationally accessible to most researchers. Methods: We trained word embeddings on published case reports because their language mimics that of clinical notes, the manuscripts are already de-identified by virtue of being published, and the corpus is much smaller than the large, publicly available datasets commonly used for training. We tested the performance of these embeddings across five clinically relevant tasks and compared the results to embeddings trained on a large Wikipedia corpus, all publicly available manuscripts, and notes from the MIMIC-III database, using fastText, GloVe, and word2vec at different dimensions. Tasks included clinical applications of lexicographic coverage, semantic similarity, clustering purity, linguistic regularity, and mortality prediction. Results: The embeddings trained on the published case reports performed as well as, if not better than, those trained on other corpora on most tasks. The embeddings trained on all published manuscripts had the most consistent performance across all tasks and required a corpus with 100 times as many tokens as the corpus comprising only case reports. Embeddings trained on the MIMIC-III dataset had marginally better scores on the clustering task, which was itself based on clinical notes from the MIMIC-III dataset. Embeddings trained on the Wikipedia corpus, although it contained almost twice as many tokens as all available published manuscripts, performed poorly compared to those trained on medical and clinical corpora. Conclusion: Word embeddings trained on freely available published case reports performed well for most clinical tasks, are free of protected health information, and are small compared to commonly used embeddings trained on larger clinical and non-clinical corpora. The choice of corpus, dimension size, and embedding model for a given task involves tradeoffs in privacy, reproducibility, performance, and computational resources.
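As a rough illustration of two of the tasks named in the abstract above (lexicographic coverage and semantic similarity), the following minimal sketch computes both on a toy vocabulary. The vectors, the term list, and the function names are hypothetical and chosen only for illustration; they are not the authors' evaluation code, and real vectors would come from a trained fastText, GloVe, or word2vec model.

```python
import math

# Hypothetical toy embeddings (assumption): three-dimensional vectors
# standing in for vectors produced by a trained embedding model.
toy_embeddings = {
    "fever": [0.9, 0.1, 0.0],
    "pyrexia": [0.8, 0.2, 0.1],
    "fracture": [0.0, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def coverage(vocabulary, terms):
    """Fraction of the given terms that have an entry in the vocabulary."""
    terms = list(terms)
    if not terms:
        return 0.0
    return sum(1 for term in terms if term in vocabulary) / len(terms)
```

With these toy vectors, "fever" scores closer to "pyrexia" than to "fracture", and a term list containing one known and one unknown word yields a coverage of 0.5.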


2019, Vol 08 (02), pp. 01-11
Author(s): Geetha Mahadevaiah, M.S Dinesh, Rithesh Sreenivasan, Sana Moin, Andre Dekker

Author(s): Saman Hina, Raheela Asif, Syed Abbas Ali

In the medical domain, it is imperative that information protection does not overlook any individual. The research community in this domain encourages the use of real-time datasets for research purposes. These real-time datasets contain structured and unstructured (natural language free text) information that can be useful to researchers in various disciplines, including computational linguistics. However, these datasets cannot be distributed without anonymization of Protected Health Information (PHI), because disclosing PHI that can identify an individual (such as name, age, or address) is unethical. Therefore, we present a rule-based Natural Language Processing (NLP) anonymization system developed on a challenging corpus containing medical narratives and ICD-10 codes (medical codes). This anonymization module can be used to pre-process a corpus containing identifiable information. The corpus used in this research contains 2534 PHIs in 1984 medical records in total. 15% of the labelled corpus was used to improve the guidelines for identifying and classifying PHI groups, and 85% was held out for evaluation. Our anonymization system follows a two-step process: (1) identification and cataloging of PHIs into four PHI categories ('Patients Name', 'Doctors Name', 'Other Name [Names other than patients and doctors]', 'Place Name'), and (2) anonymization of PHIs by replacing each identified PHI with its respective PHI category. Our method uses basic language processing, dictionaries, rules and heuristics to identify, classify and anonymize PHIs. We use standard metrics for evaluation. Measured against a human-annotated gold standard, our system achieves an F-measure of 100%, an increase of 39% over the baseline results, which supports the reliability of the anonymized data for research use.
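The two-step process described above (identify and categorize a PHI, then replace it with its category label) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' system: the cue patterns, the tiny place dictionary, and the `anonymize` function are all hypothetical, and the 'Other Name' category is omitted for brevity.

```python
import re

# Hypothetical dictionary and cue patterns (assumptions for illustration);
# a real system would rely on much larger curated resources.
PLACE_NAMES = {"Karachi", "Lahore"}
PATIENT_CUES = r"(?:Patient|Mr\.?|Mrs\.?|Ms\.?)"
DOCTOR_CUES = r"(?:Dr\.?|Doctor)"
# One capitalized word, optionally followed by a second capitalized word.
NAME = r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)?"


def anonymize(text):
    """Replace recognized PHI spans with their PHI category labels."""
    # Rule 1: a patient cue followed by a capitalized name.
    text = re.sub(rf"\b({PATIENT_CUES})\s+{NAME}", r"\1 [Patients Name]", text)
    # Rule 2: a doctor title followed by a capitalized name.
    text = re.sub(rf"\b({DOCTOR_CUES})\s+{NAME}", r"\1 [Doctors Name]", text)
    # Rule 3: dictionary lookup for place names.
    for place in sorted(PLACE_NAMES):
        text = re.sub(rf"\b{re.escape(place)}\b", "[Place Name]", text)
    return text
```

For example, `anonymize("Patient Ali Raza was seen by Dr. Ahmed Khan in Karachi.")` returns `"Patient [Patients Name] was seen by Dr. [Doctors Name] in [Place Name]."`. Names that appear without a cue word are not caught by these rules, which is one reason such systems combine rules with dictionaries and further heuristics.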


Healthcare, 2021, Vol 9 (6), pp. 735
Author(s): Schoultz Mariyana, Leung Janni, Bonsaksen Tore, Ruffolo Mary, Thygesen Hilde, ...

Background: Due to the COVID-19 pandemic and the strict national policies regarding social distancing behavior in Europe, America and Australia, people became reliant on social media as a means for gathering information and as a tool for staying connected to family, friends and work. This is the first trans-national study exploring the qualitative experiences and challenges of using social media while in lockdown or shelter-in-place during the current pandemic. Methods: This study was part of a wider cross-sectional online survey conducted in Norway, the UK, USA and Australia during April/May 2020. The manuscript reports on the qualitative free-text component of the study asking about the challenges of social media users during the COVID-19 pandemic in the UK, USA and Australia. A total of 1991 responses were included in the analysis. Thematic analysis was conducted independently by two researchers. Results: Three overarching themes identified were: Emotional/Mental Health, Information and Being Connected. Participants experienced that using social media during the pandemic amplified anxiety, depression, fear, panic, anger, frustration and loneliness. They felt that there was information overload and social media was full of misleading or polarized opinions which were difficult to switch off. Nonetheless, participants also thought that there was an urge for connection and learning, which was positive and stressful at the same time. Conclusion: Using social media while in a shelter-in-place or lockdown could have a negative impact on the emotional and mental health of some of the population. To support policy and practice in strengthening mental health care in the community, social media could be used to deliver practical advice on coping and stress management. Communication with the public should be strengthened by unambiguous and clear messages and clear communication pathways. We should be looking at alternative ways of staying connected.


BMJ Open, 2021, Vol 11 (6), pp. e047356
Author(s): Carlton R Moore, Saumya Jain, Stephanie Haas, Harish Yadav, Eric Whitsel, ...

Objectives: Using free-text clinical notes and reports from hospitalised patients, determine the performance of natural language processing (NLP) ascertainment of Framingham heart failure (HF) criteria and phenotype. Study design: A retrospective observational study design of patients hospitalised in 2015 from four hospitals participating in the Atherosclerosis Risk in Communities (ARIC) study was used to determine NLP performance in the ascertainment of Framingham HF criteria and phenotype. Setting: Four ARIC study hospitals, each representing an ARIC study region in the USA. Participants: A stratified random sample of hospitalisations identified using a broad range of International Classification of Disease, ninth revision, diagnostic codes indicative of an HF event and occurring during 2015 was drawn for this study. A randomly selected set of 394 hospitalisations was used as the derivation dataset and 406 hospitalisations was used as the validation dataset. Intervention: Use of NLP on free-text clinical notes and reports to ascertain Framingham HF criteria and phenotype. Primary and secondary outcome measures: NLP performance as measured by sensitivity, specificity, positive-predictive value (PPV) and agreement in ascertainment of Framingham HF criteria and phenotype. Manual medical record review by trained ARIC abstractors was used as the reference standard. Results: Overall, performance of NLP ascertainment of Framingham HF phenotype in the validation dataset was good, with 78.8%, 81.7%, 84.4% and 80.0% for sensitivity, specificity, PPV and agreement, respectively. Conclusions: By decreasing the need for manual chart review, our results on the use of NLP to ascertain Framingham HF phenotype from free-text electronic health record data suggest that validated NLP technology holds the potential for significantly improving the feasibility and efficiency of conducting large-scale epidemiologic surveillance of HF prevalence and incidence.
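The outcome measures named above can be computed from a 2x2 table that compares NLP output with the reference standard. The sketch below is illustrative only: the function name is hypothetical, and "agreement" is taken here to mean simple percent agreement, which is an assumption because the abstract does not define it.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, PPV and percent agreement.

    tp, fp, tn, fn are counts of true positives, false positives,
    true negatives and false negatives against the reference standard.
    """
    def ratio(numerator, denominator):
        # Return NaN rather than raising when a denominator is zero.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        # Assumption: agreement = share of cases where NLP and reviewers concur.
        "agreement": ratio(tp + tn, tp + fp + tn + fn),
    }
```

As a usage example with made-up counts (not the study's data), `diagnostic_metrics(tp=40, fp=10, tn=45, fn=5)` gives a PPV of 0.8 and an agreement of 0.85.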


2021, Vol 8 (1)
Author(s): Michael Rutherford, Seong K. Mun, Betty Levine, William Bennett, Kirk Smith, ...

Abstract: We developed a DICOM dataset that can be used to evaluate the performance of de-identification algorithms. DICOM objects (a total of 1,693 CT, MRI, PET, and digital X-ray images) were selected from datasets published in the Cancer Imaging Archive (TCIA). Synthetic Protected Health Information (PHI) was generated and inserted into selected DICOM Attributes to mimic typical clinical imaging exams. The DICOM Standard and TCIA curation audit logs guided the insertion of synthetic PHI into standard and non-standard DICOM data elements. A TCIA curation team tested the utility of the evaluation dataset. With this publication, the evaluation dataset (containing synthetic PHI) and de-identified evaluation dataset (the result of TCIA curation) are released on TCIA in advance of a competition, sponsored by the National Cancer Institute (NCI), for algorithmic de-identification of medical image datasets. The competition will use a much larger evaluation dataset constructed in the same manner. This paper describes the creation of the evaluation datasets and guidelines for their use.
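One way such an evaluation dataset could be used is to check whether any planted synthetic PHI string survives de-identification. The sketch below is a hypothetical illustration, not the scoring method of the paper or the competition: it models a DICOM header as a plain Python dictionary rather than reading real DICOM files, and the `find_leaks` function and sample values are invented for the example.

```python
def find_leaks(planted_values, deidentified_header):
    """Return (attribute, value) pairs where planted PHI is still present.

    planted_values: synthetic PHI strings that were inserted before
        de-identification.
    deidentified_header: attribute name -> value mapping after
        de-identification (a stand-in for a DICOM header).
    """
    leaks = []
    for attribute, value in deidentified_header.items():
        for phi in planted_values:
            # Every attribute is scanned, because PHI can also appear in
            # free-text fields, not only in the attribute it was meant for.
            if phi and phi in str(value):
                leaks.append((attribute, phi))
    return leaks


# Hypothetical example: the name was removed from PatientName but
# survived inside a free-text description.
planted = ["DOE^JANE", "19700101"]
header_after = {
    "PatientName": "ANON",
    "PatientBirthDate": "",
    "StudyDescription": "CT chest for DOE^JANE",
}
leaks = find_leaks(planted, header_after)
```

Here `leaks` contains the single pair `("StudyDescription", "DOE^JANE")`. Exact substring matching is a deliberately simple choice; it would miss reformatted or partially redacted values.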


2010, Vol 01 (01), pp. 1-10
Author(s): S. E. Ross, B. K. Mellis, B. L. Beaty, L. M. Schilling, A. J. Davidson, ...

Summary. Objective: Assess the interest in and preferences of ambulatory practitioners in HIE. Background: Health information exchange (HIE) may improve the quality and efficiency of care. Identifying the value proposition for smaller ambulatory practices may help those practices engage in HIE. Methods: Survey of primary care and specialist practitioners in the State of Colorado. Results: Clinical data were commonly (always [2%], often [29%] or sometimes [49%]) missing during clinic visits. Of 12 data types proposed as available through HIE, ten were considered “extremely useful” by most practitioners. “Clinical notes/consultation reports,” “diagnosis or problem lists,” and “hospital discharge summaries” were considered the three most useful data types. Interest in EKG reports, diagnosis/problem lists, childhood immunizations, and discharge summaries differed among ambulatory practitioner groups (primary care, obstetrics-gynecology, and internal medicine subspecialties). Conclusion: Practitioners express strong interest in most of the data types, but opinions differed by specialties on what types were most important. All providers felt that a system that provided all data types would be useful. These results support the potential benefit of HIE in ambulatory practices.


2012, Vol 03 (02), pp. 175-185
Author(s): J.A. Bernstein, R.B. McKenzie, B.J. King, C.A. Longhurst, J.S. Hahn

Summary: Electronic physician documentation is an essential element of a complete electronic medical record (EMR). At Lucile Packard Children’s Hospital, a teaching hospital affiliated with Stanford University, we implemented an inpatient electronic documentation system for physicians over a 12-month period. Using an EMR-based free-text editor coupled with automated import of system data elements, we were able to achieve voluntary, widespread adoption of the electronic documentation process. When given the choice between electronic versus dictated report creation, the vast majority of users preferred the electronic method. In addition to increasing the legibility and accessibility of clinical notes, we also decreased the volume of dictated notes and scanning of handwritten notes, which provides the opportunity for cost savings to the institution.

