PHIs (Protected Health Information) identification from free text clinical records based on machine learning

Author(s):  
Kunal Rajput ◽  
Girija Chetty ◽  
Rachel Davey
2018 ◽  
Vol 116 ◽  
pp. 24-32 ◽  
Author(s):  
Liting Du ◽  
Chenxi Xia ◽  
Zhaohua Deng ◽  
Gary Lu ◽  
Shuxu Xia ◽  
...  

2019 ◽  
Vol 08 (02) ◽  
pp. 01-11
Author(s):  
Geetha Mahadevaiah ◽  
M.S Dinesh ◽  
Rithesh Sreenivasan ◽  
Sana Moin ◽  
Andre Dekker

2020 ◽  
Vol 3 (1) ◽  
Author(s):  
Beau Norgeot ◽  
Kathleen Muenzen ◽  
Thomas A. Peterson ◽  
Xuancheng Fan ◽  
Benjamin S. Glicksberg ◽  
...  

Author(s):  
Saman Hina ◽  
Raheela Asif ◽  
Syed Abbas Ali

In the medical domain, it is imperative that information protection does not overlook any individual. The medical research community encourages the use of real-time datasets for research purposes. These real-time datasets contain structured and unstructured (natural language free text) information that can be useful to researchers in various disciplines, including computational linguistics. On the other hand, these datasets cannot be distributed without anonymization of Protected Health Information (PHI), because disclosing PHI (such as name, age, address, etc.) that can identify an individual is unethical. Therefore, we present a rule-based Natural Language Processing (NLP) anonymization system using a challenging corpus containing medical narratives and ICD-10 codes (medical codes). This anonymization module can be used for pre-processing a corpus containing identifiable information. The corpus used in this research contains 2534 PHIs in 1984 medical records in total. 15% of the labelled corpus was used to improve the guidelines for the identification and classification of PHI groups, and 85% was held out for evaluation. Our anonymization system follows a two-step process: (1) identification and cataloging of PHIs into four PHI categories ('Patients Name', 'Doctors Name', 'Other Name [Names other than patients and doctors]', 'Place Name'), and (2) anonymization of PHIs by replacing identified PHIs with their respective PHI categories. Our method uses basic language processing, dictionaries, rules and heuristics to identify, classify and anonymize PHIs with PHI categories. Evaluated with standard metrics against a human-annotated gold standard, our system achieves an F-measure of 100%, an increase of 39% over the baseline results, which demonstrates the reliability of the data for research use.
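The two-step process described above (identify PHIs and assign a category, then replace each PHI with its category) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not give the actual dictionaries, rules or heuristics, so the place dictionary, the regular expressions and the function names `identify_phis` and `anonymize` are all hypothetical.

```python
import re

# Hypothetical resources for illustration; the paper's dictionaries are not given.
PLACE_DICTIONARY = {"Springfield", "Riverside"}
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)?"  # one or two capitalized words

# Hypothetical context rules: the name itself is captured in the group "phi".
RULES = [
    ("Doctors Name", re.compile(r"\bDr\.? (?P<phi>" + NAME + r")")),
    ("Patients Name", re.compile(r"\b[Pp]atient (?P<phi>" + NAME + r")")),
]


def identify_phis(text):
    """Step 1: return sorted (start, end, category) spans for each PHI found."""
    spans = []
    for category, pattern in RULES:
        for match in pattern.finditer(text):
            spans.append((match.start("phi"), match.end("phi"), category))
    for place in PLACE_DICTIONARY:
        for match in re.finditer(r"\b" + re.escape(place) + r"\b", text):
            spans.append((match.start(), match.end(), "Place Name"))
    return sorted(spans)


def anonymize(text):
    """Step 2: replace each identified span with its category label."""
    pieces, cursor = [], 0
    for start, end, category in identify_phis(text):
        if start < cursor:  # skip a span that overlaps one already replaced
            continue
        pieces.append(text[cursor:start])
        pieces.append("[" + category + "]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


note = "Patient John Smith was seen by Dr. Alice Brown in Springfield."
print(anonymize(note))
```

Keeping identification separate from replacement means the identified spans can be compared against a gold standard before any text is rewritten.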


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Pratyusha Rakshit ◽  
Onintze Zaballa ◽  
Aritz Pérez ◽  
Elisa Gómez-Inhiesto ◽  
Maria T. Acaiturri-Ayesta ◽  
...  

Abstract: This paper presents a novel machine learning approach to perform an early prediction of the healthcare cost of breast cancer patients. The learning phase of our prediction method consists of two steps: (1) first, the patients are clustered according to their sequences of actions, so that patients in a group undergo similar clinical activities and incur similar healthcare costs, and (2) a Markov chain is then learned for each group to describe the action sequences of the patients in the cluster. A two-step procedure is undertaken in the prediction phase: (1) first, the healthcare cost of a new patient’s treatment is estimated based on the average healthcare cost of its k-nearest neighbors in each group, and (2) finally, an aggregate measure of the healthcare cost estimated by each group is used as the final predicted cost. Experiments reveal a mean absolute percentage error as small as 6%, even when only half of a patient's clinical records are available, substantiating the early prediction capability of the proposed method. Comparative analysis substantiates the superiority of the proposed algorithm over state-of-the-art techniques.
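As a rough illustration of two ingredients named above, the sketch below estimates a first-order Markov chain from the action sequences of one patient group and computes a mean absolute percentage error. It assumes a plain maximum-likelihood estimate of the transition probabilities; the abstract does not specify how the chains are learned, and the function names and example actions are hypothetical.

```python
from collections import Counter, defaultdict


def fit_markov_chain(sequences):
    """Estimate first-order transition probabilities from action sequences."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        # Count each observed transition current -> next.
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }


def mean_absolute_percentage_error(actual, predicted):
    """MAPE in percent; assumes no actual value is zero."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical action sequences for the patients of one cluster.
group = [
    ["visit", "biopsy", "surgery"],
    ["visit", "biopsy", "chemo"],
    ["visit", "imaging"],
]
chain = fit_markov_chain(group)
print(chain["visit"])
print(mean_absolute_percentage_error([100.0, 200.0], [110.0, 180.0]))
```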


Healthcare ◽  
2021 ◽  
Vol 9 (6) ◽  
pp. 735
Author(s):  
Schoultz Mariyana ◽  
Leung Janni ◽  
Bonsaksen Tore ◽  
Ruffolo Mary ◽  
Thygesen Hilde ◽  
...  

Background: Due to the COVID-19 pandemic and the strict national policies regarding social distancing behavior in Europe, America and Australia, people became reliant on social media as a means for gathering information and as a tool for staying connected to family, friends and work. This is the first trans-national study exploring the qualitative experiences and challenges of using social media while in lockdown or shelter-in-place during the current pandemic. Methods: This study was part of a wider cross-sectional online survey conducted in Norway, the UK, USA and Australia during April/May 2020. The manuscript reports on the qualitative free-text component of the study asking about the challenges of social media users during the COVID-19 pandemic in the UK, USA and Australia. A total of 1991 responses were included in the analysis. Thematic analysis was conducted independently by two researchers. Results: Three overarching themes identified were: Emotional/Mental Health, Information and Being Connected. Participants experienced that using social media during the pandemic amplified anxiety, depression, fear, panic, anger, frustration and loneliness. They felt that there was information overload and social media was full of misleading or polarized opinions which were difficult to switch off. Nonetheless, participants also thought that there was an urge for connection and learning, which was positive and stressful at the same time. Conclusion: Using social media while in a shelter-in-place or lockdown could have a negative impact on the emotional and mental health of some of the population. To support policy and practice in strengthening mental health care in the community, social media could be used to deliver practical advice on coping and stress management. Communication with the public should be strengthened by unambiguous and clear messages and clear communication pathways. We should be looking at alternative ways of staying connected.


2021 ◽  
Vol 28 (1) ◽  
pp. e100262
Author(s):  
Mustafa Khanbhai ◽  
Patrick Anyadi ◽  
Joshua Symons ◽  
Kelsey Flott ◽  
Ara Darzi ◽  
...  

Objectives: Unstructured free-text patient feedback contains rich information, and analysing these data manually would require a lot of personnel resources which are not available in most healthcare organisations. To undertake a systematic review of the literature on the use of natural language processing (NLP) and machine learning (ML) to process and analyse free-text patient experience data. Methods: Databases were systematically searched to identify articles published between January 2000 and December 2019 examining NLP to analyse free-text patient feedback. Due to the heterogeneous nature of the studies, a narrative synthesis was deemed most appropriate. Data related to the study purpose, corpus, methodology, performance metrics and indicators of quality were recorded. Results: Nineteen articles were included. The majority (80%) of studies applied language analysis techniques on patient feedback from social media sites (unsolicited) followed by structured surveys (solicited). Supervised learning was frequently used (n=9), followed by unsupervised (n=6) and semisupervised (n=3). Comments extracted from social media were analysed using an unsupervised approach, and free-text comments held within structured surveys were analysed using a supervised approach. Reported performance metrics included the precision, recall and F-measure, with support vector machine and Naïve Bayes being the best performing ML classifiers. Conclusion: NLP and ML have emerged as an important tool for processing unstructured free text. Both supervised and unsupervised approaches have their role depending on the data source. With the advancement of data analysis tools, these techniques may be useful to healthcare organisations to generate insight from the volumes of unstructured free-text data.
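For readers unfamiliar with the reported metrics, precision, recall and F-measure for a single class can be computed as in the sketch below. It uses the standard definitions with the balanced F1 variant; the function name and the toy labels are illustrative and do not come from any of the reviewed studies.

```python
def precision_recall_f1(gold_labels, predicted_labels, positive=1):
    """Precision, recall and balanced F-measure (F1) for one class."""
    pairs = list(zip(gold_labels, predicted_labels))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example: 1 marks a comment of the class of interest.
gold = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
print(precision_recall_f1(gold, predicted))
```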


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Michael Rutherford ◽  
Seong K. Mun ◽  
Betty Levine ◽  
William Bennett ◽  
Kirk Smith ◽  
...  

Abstract: We developed a DICOM dataset that can be used to evaluate the performance of de-identification algorithms. DICOM objects (a total of 1,693 CT, MRI, PET, and digital X-ray images) were selected from datasets published in the Cancer Imaging Archive (TCIA). Synthetic Protected Health Information (PHI) was generated and inserted into selected DICOM Attributes to mimic typical clinical imaging exams. The DICOM Standard and TCIA curation audit logs guided the insertion of synthetic PHI into standard and non-standard DICOM data elements. A TCIA curation team tested the utility of the evaluation dataset. With this publication, the evaluation dataset (containing synthetic PHI) and de-identified evaluation dataset (the result of TCIA curation) are released on TCIA in advance of a competition, sponsored by the National Cancer Institute (NCI), for algorithmic de-identification of medical image datasets. The competition will use a much larger evaluation dataset constructed in the same manner. This paper describes the creation of the evaluation datasets and guidelines for their use.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Eyal Klang ◽  
Benjamin R. Kummer ◽  
Neha S. Dangayach ◽  
Amy Zhong ◽  
M. Arash Kia ◽  
...  

Abstract: Early admission to the neurosciences intensive care unit (NSICU) is associated with improved patient outcomes. Natural language processing offers new possibilities for mining free text in electronic health record data. We sought to develop a machine learning model using both tabular and free text data to identify patients requiring NSICU admission shortly after arrival to the emergency department (ED). We conducted a single-center, retrospective cohort study of adult patients at the Mount Sinai Hospital, an academic medical center in New York City. All patients presenting to our institutional ED between January 2014 and December 2018 were included. Structured (tabular) demographic, clinical, bed movement record data, and free text data from triage notes were extracted from our institutional data warehouse. A machine learning model was trained to predict likelihood of NSICU admission at 30 min from arrival to the ED. We identified 412,858 patients presenting to the ED over the study period, of whom 1900 (0.5%) were admitted to the NSICU. The daily median number of ED presentations was 231 (IQR 200–256) and the median time from ED presentation to the decision for NSICU admission was 169 min (IQR 80–324). A model trained only with text data had an area under the receiver-operating curve (AUC) of 0.90 (95% confidence interval (CI) 0.87–0.91). A structured data-only model had an AUC of 0.92 (95% CI 0.91–0.94). A combined model trained on structured and text data had an AUC of 0.93 (95% CI 0.92–0.95). At a false positive rate of 1:100 (99% specificity), the combined model was 58% sensitive for identifying NSICU admission. A machine learning model using structured and free text data can predict NSICU admission soon after ED arrival. This may potentially improve ED and NSICU resource allocation. Further studies should validate our findings.
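The two headline figures above, AUC and sensitivity at a fixed specificity, can be computed from model scores as in the following sketch. It assumes binary labels (1 = admitted) and uses the pairwise-ranking definition of AUC; the function names and toy scores are illustrative and are not the study's data or code.

```python
import math


def auc(scores, labels):
    """AUC as the probability that a positive case outscores a negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_at_specificity(scores, labels, specificity=0.99):
    """Share of positives flagged while keeping at least the given specificity."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    # Number of negatives that may be flagged (small epsilon guards float error).
    allowed_fp = math.floor((1 - specificity) * len(neg) + 1e-9)
    # Flag a case only if it scores strictly above this threshold.
    threshold = neg[allowed_fp] if allowed_fp < len(neg) else -math.inf
    return sum(s > threshold for s in pos) / len(pos)


# Toy scores: higher means the model considers admission more likely.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
print(auc(scores, labels))
print(sensitivity_at_specificity(scores, labels, specificity=0.8))
```

Reporting sensitivity at a fixed high specificity is useful when positives are rare, as with the 0.5% admission rate reported above, because even a small false positive rate translates into many false alarms.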

