BACKGROUND
Colorectal cancer is a leading cause of cancer deaths. Screening tests such as colonoscopy can be used to find polyps or colorectal cancer. Colonoscopy reports are often written as unstructured narrative text. The information embedded in these reports can be used for various purposes, including colorectal cancer risk prediction, follow-up recommendation, and quality measurement. However, despite the large amount of accumulated data, the availability and accessibility of this unstructured text remain very low.
OBJECTIVE
We aimed to develop a deep learning-based natural language processing (NLP) method for named entity recognition (NER) in colonoscopy reports. To the best of our knowledge, no previous studies on clinical NLP for colonoscopy reports have applied deep learning techniques.
METHODS
This study proposed a method that applies word embeddings, pre-trained on a large corpus of unlabeled colonoscopy reports, to a deep learning-based NER model. A total of 280,668 colonoscopy reports were extracted from the clinical data warehouse of the Samsung Medical Center. For 5,000 reports, procedural information and colonoscopic findings were manually annotated with 17 labels. We compared variants of the long short-term memory (LSTM) model to select the one with the best performance on colonoscopy reports, which was the bidirectional LSTM with conditional random fields. We then applied word embeddings pre-trained on the large unlabeled corpus (280,668 reports) to the selected model.
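To illustrate how a conditional random field layer turns per-token scores from a bidirectional LSTM into a label sequence, the following is a minimal sketch of Viterbi decoding in plain Python. It is not the study's implementation: the label names (`O`, `B-LESION`, `I-LESION`), the BIO tagging scheme, and all scores are hypothetical values chosen for illustration, and the function name `viterbi_decode` is our own.

```python
from typing import Dict, List, Tuple


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    labels: List[str],
) -> List[str]:
    """Return the highest-scoring label path for one sentence.

    emissions[t][label] is the per-token score (in a BiLSTM-CRF this would
    come from the BiLSTM); transitions[(prev, cur)] is the score of moving
    from label prev to label cur (missing pairs default to 0.0).
    """
    # Best score of any path ending in each label at the first token.
    best = {lab: emissions[0][lab] for lab in labels}
    backpointers: List[Dict[str, str]] = []

    for scores in emissions[1:]:
        new_best: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in labels:
            # Pick the previous label that maximizes the path score into cur.
            prev = max(
                labels,
                key=lambda p: best[p] + transitions.get((p, cur), 0.0),
            )
            new_best[cur] = (
                best[prev] + transitions.get((prev, cur), 0.0) + scores[cur]
            )
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)

    # Trace back from the best final label.
    path = [max(labels, key=lambda lab: best[lab])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical labels and scores, for illustration only.
LABELS = ["O", "B-LESION", "I-LESION"]
# Penalize an inside tag that directly follows an outside tag.
TRANSITIONS = {("O", "I-LESION"): -10.0}

emissions = [
    {"O": 2.0, "B-LESION": 0.0, "I-LESION": 0.0},
    {"O": 0.5, "B-LESION": 0.6, "I-LESION": 1.0},
]
decoded = viterbi_decode(emissions, TRANSITIONS, LABELS)
```

In this toy example the second token scores highest for `I-LESION` on its own, but the transition penalty makes the path `O`, `B-LESION` the best overall. This is the kind of sequence-level consistency that a conditional random field layer adds on top of independent per-token predictions.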
RESULTS
The NER model with pre-trained word embedding performed better for most labels than the model with one-hot encoding. The F1 scores for colonoscopic findings were 0.9564 for lesions, 0.9722 for locations, 0.9809 for shapes, 0.9720 for colors, 0.9862 for sizes, and 0.9717 for numbers.
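As a reminder of how a per-label F1 score can be computed for NER, the sketch below scores predicted entity spans against gold spans, with F1 = 2PR / (P + R) for precision P and recall R. It assumes BIO-tagged sequences and exact-match, entity-level scoring; the abstract does not state which tagging scheme or matching criterion was used, and the tag names and example sequences here are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, label)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence.

    An I- tag that does not continue a span of the same label is ignored.
    """
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
        start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def f1_for_label(gold: List[str], pred: List[str], label: str) -> float:
    """Exact-match, entity-level F1 for a single label."""
    gold_spans = {s for s in extract_spans(gold) if s[2] == label}
    pred_spans = {s for s in extract_spans(pred) if s[2] == label}
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted tags for one six-token sentence.
gold = ["B-LESION", "I-LESION", "O", "B-SIZE", "O", "B-LESION"]
pred = ["B-LESION", "I-LESION", "O", "B-SIZE", "O", "O"]

lesion_f1 = f1_for_label(gold, pred, "LESION")  # P = 1.0, R = 0.5
size_f1 = f1_for_label(gold, pred, "SIZE")      # P = 1.0, R = 1.0
```

Here the prediction misses one of the two gold lesion entities, so precision is 1.0 and recall is 0.5, giving an F1 of 2/3 for that label, while the size label is recovered exactly.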
CONCLUSIONS
In this study, clinical NER was applied to extract meaningful information from colonoscopy reports. We proposed a deep learning-based NER model with pre-trained word embedding. The proposed method achieved promising results, suggesting that it can be applied to various practical purposes.