Natural Language Processing to Identify Cancer Treatments With Electronic Medical Records

2021 ◽  
pp. 379-393
Author(s):  
Jiaming Zeng ◽  
Imon Banerjee ◽  
A. Solomon Henry ◽  
Douglas J. Wood ◽  
Ross D. Shachter ◽  
...  

PURPOSE Knowing which treatments were administered to patients with cancer is important for treatment planning and for correlating treatment patterns with outcomes in personalized medicine studies. However, existing methods to identify treatments are often lacking. We develop a natural language processing approach that uses structured electronic medical records and unstructured clinical notes to identify the initial treatment administered to patients with cancer.
METHODS We used 4,412 patients with 483,782 clinical notes from the Stanford Cancer Institute Research Database, covering patients with nonmetastatic prostate, oropharynx, and esophagus cancer. We trained treatment identification models for each cancer type separately and compared the performance of using only structured data, only unstructured data (bag-of-words, doc2vec, fastText), and combinations of both (structured + bow, structured + doc2vec, structured + fastText). We optimized the identification model among five machine learning methods (logistic regression, multilayer perceptrons, random forest, support vector machines, and stochastic gradient boosting). We used the treatment information recorded in the cancer registry as the gold standard and compared our methods with a baseline that identifies treatments from billing codes.
RESULTS For prostate cancer, we achieved an F1-score of 0.99 (95% CI, 0.97 to 1.00) for radiation and 1.00 (95% CI, 0.99 to 1.00) for surgery using structured + doc2vec. For oropharynx cancer, we achieved an F1-score of 0.78 (95% CI, 0.58 to 0.93) for chemoradiation and 0.83 (95% CI, 0.69 to 0.95) for surgery using doc2vec. For esophagus cancer, we achieved an F1-score of 1.0 (95% CI, 1.0 to 1.0) for both chemoradiation and surgery using all combinations of structured and unstructured data. We found that employing the free-text clinical notes outperforms using the billing codes or only structured data for all three cancer types.
CONCLUSION Our results show that treatment identification using free-text clinical notes greatly improves upon the performance using billing codes and simple structured data. The approach can be used for treatment cohort identification and adapted for longitudinal cancer treatment identification.
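
The F1-scores and 95% confidence intervals reported above can be illustrated with a minimal Python sketch. The abstract does not say how the intervals were computed, so the percentile bootstrap used here is an assumption, and the function names and toy labels are hypothetical rather than taken from the study.

```python
import random


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1 (an assumed method, see lead-in)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Resample patients with replacement, keeping label/prediction pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy example: 1 = treatment recorded in the registry, 0 = not recorded.
registry = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 1, 1]
point = f1_score(registry, predicted)
low, high = bootstrap_ci(registry, predicted)
print(f"F1 = {point:.2f} (95% CI, {low:.2f} to {high:.2f})")
```

With only a handful of patients the interval is very wide, which is consistent with the wider intervals reported for the smaller oropharynx cohort than for prostate cancer.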

PLoS ONE ◽  
2021 ◽  
Vol 16 (2) ◽  
pp. e0247404
Author(s):  
Akshaya V. Annapragada ◽  
Marcella M. Donaruma-Kwoh ◽  
Ananth V. Annapragada ◽  
Zbigniew A. Starosolski

Child physical abuse is a leading cause of traumatic injury and death in children. In 2017, child abuse was responsible for 1688 fatalities in the United States, among 3.5 million children referred to Child Protection Services and 674,000 substantiated victims. While large referral hospitals maintain teams trained in Child Abuse Pediatrics, smaller community hospitals often do not have such dedicated resources to evaluate patients for potential abuse. Moreover, identification of abuse leaves little margin for error, as false positive identifications lead to unwarranted separations, while false negatives allow dangerous situations to continue. This context makes the consistent detection of and response to abuse difficult, particularly given subtle signs in young, non-verbal patients. Here, we describe the development of artificial intelligence algorithms that use unstructured free text in the electronic medical record (including notes from physicians, nurses, and social workers) to identify children who are suspected victims of physical abuse. Importantly, only the notes from the time of first encounter (e.g., birth, routine visit, sickness) to the last record before child protection team involvement were used. This allowed us to develop an algorithm using only information available prior to referral to the specialized child protection team. The study was performed in a multi-center referral pediatric hospital on patients screened for abuse at five different locations between 2015 and 2019. Of 1123 patients, 867 records were available after data cleaning and processing, and 55% were abuse-positive as determined by a multi-disciplinary team of clinical professionals. These electronic medical records were encoded with three natural language processing (NLP) algorithms, Bag of Words (BOW), Word Embeddings (WE), and Rules-Based (RB), and used to train multiple neural network architectures. The BOW and WE encodings use the full free text, while RB selects crucial phrases as identified by physicians. The best architecture was selected by averaging the classification accuracy of the best-performing model from each train-test split of a cross-validation experiment. Natural language processing coupled with neural networks detected cases of likely child abuse, using only information available to clinicians prior to child protection team referral, with average accuracy of 0.90±0.02 and average area under the receiver operating characteristic curve (ROC-AUC) of 0.93±0.02 for the best-performing Bag of Words models. The best-performing rules-based models achieved average accuracy of 0.77±0.04 and average ROC-AUC of 0.81±0.05, while a Word Embeddings strategy was severely limited by a lack of representative embeddings. Importantly, the best-performing model had a false positive rate of 8%, compared with rates of 20% or higher in previously reported studies. This artificial intelligence approach can help screen patients for whom an abuse concern exists and streamline the identification of patients who may benefit from referral to a child protection team. Furthermore, this approach could be applied to develop computer-aided-diagnosis platforms for the challenging and often intractable problem of reliably identifying pediatric patients suffering from physical abuse.
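
The difference between the Bag of Words and Rules-Based encodings described above can be sketched in a few lines of Python. This is an illustrative sketch only: the tokenizer, the function names, the example notes, and the phrase list are all hypothetical and are not the physician-selected phrases or the preprocessing used in the study.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the note and keep alphabetic tokens only (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(notes):
    """Sorted list of every distinct token seen across the notes."""
    return sorted({tok for note in notes for tok in tokenize(note)})


def bow_encode(note, vocabulary):
    """Bag of Words: one count per vocabulary term, using the full free text."""
    counts = Counter(tokenize(note))
    return [counts[tok] for tok in vocabulary]


def rules_encode(note, phrases):
    """Rules-based: one 0/1 flag per pre-selected phrase, matched on whole words."""
    text = " " + " ".join(tokenize(note)) + " "
    return [1 if f" {phrase} " in text else 0 for phrase in phrases]


# Hypothetical notes and phrases, for illustration only.
notes = [
    "Bruising on left arm, no history of fall.",
    "Routine visit. No bruising noted.",
]
phrases = ["left arm", "routine visit"]

vocabulary = build_vocabulary(notes)
print(vocabulary)
print(bow_encode(notes[1], vocabulary))
print([rules_encode(note, phrases) for note in notes])
```

Note that neither encoding in this sketch handles negation ("No bruising noted" still counts the token "bruising"); the resulting vectors would be the inputs to a downstream classifier such as the neural networks mentioned in the abstract.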


2021 ◽  
Author(s):  
Raquel A. J. Gritz ◽  
Rafael S. Pereira ◽  
Henrique Matheus F. da Silva ◽  
Henrique G. Zatti ◽  
Laura E. A. Viana ◽  
...  

COVID-19 became a pandemic that has infected more than 100 million people worldwide and has lasted for over a year. Patient visits have produced a large volume of textual data in electronic medical records. Extracting this information could be very useful for better understanding COVID-19. However, medical records typed as free text are difficult to interpret because doctors may use different terms to record their observations. To address this, we created an ontology in Portuguese that describes the terms used in COVID-19 medical records in Brazil. In this paper, we present a brief overview of the ontology and how we are using it as the first step of a more complex NLP task.
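
One way an ontology can serve as a first NLP step is by mapping the different terms doctors type onto a shared concept. The Python sketch below assumes a tiny hand-written term list; the concepts, the variants, and the function names are hypothetical and are not drawn from the authors' ontology.

```python
import re

# Hypothetical mini-ontology: canonical concept -> surface forms seen in notes.
ONTOLOGY = {
    "febre": ["febre", "febril", "hipertermia"],
    "dispneia": ["dispneia", "falta de ar"],
    "tosse": ["tosse"],
}


def build_lookup(ontology):
    """Invert the ontology so each surface form points to its concept."""
    return {variant: concept
            for concept, variants in ontology.items()
            for variant in variants}


def annotate(text, lookup):
    """Return the sorted list of concepts whose surface forms appear in the text."""
    # Normalize to lowercase words separated by single spaces, padded so that
    # matches are restricted to whole words or whole multi-word phrases.
    normalized = " " + " ".join(re.findall(r"\w+", text.lower())) + " "
    found = set()
    for variant, concept in lookup.items():
        if f" {variant} " in normalized:
            found.add(concept)
    return sorted(found)


lookup = build_lookup(ONTOLOGY)
print(annotate("Paciente febril, com falta de ar.", lookup))
```

In this sketch two differently worded notes that mention "febril" and "hipertermia" would both be annotated with the same concept, which is the kind of term variation the abstract describes.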


BMJ ◽  
2015 ◽  
Vol 350 (apr24 11) ◽  
pp. h1885-h1885 ◽  
Author(s):  
K. P. Liao ◽  
T. Cai ◽  
G. K. Savova ◽  
S. N. Murphy ◽  
E. W. Karlson ◽  
...  
