KM-BERT: A Pre-trained BERT for Korean Medical Natural Language Processing (Preprint)
BACKGROUND With advances in deep learning and natural language processing, the analysis of medical texts is becoming increasingly important. Nonetheless, despite the importance of medical texts, a study on medical domain-specific language models has not yet been conducted.
OBJECTIVE Korean medical text is particularly difficult to analyze because of the agglutinative characteristics of the language and the complex terminology of the medical domain. To address this problem, we collected a Korean medical corpus and used it to train language models.
METHODS In this paper, we present a Korean medical language model based on deep learning natural language processing. The proposed model was trained for the medical context under the pre-training framework of BERT, starting from a state-of-the-art Korean language model.
RESULTS After pre-training, the proposed method showed accuracy increases of 0.147 and 0.148 for the masked language model with next sentence prediction. In the intrinsic evaluation, the next sentence prediction accuracy improved by 0.258, which is a remarkable enhancement. In addition, the extrinsic evaluation on Korean medical semantic textual similarity data showed an increase of 0.046 in the Pearson correlation.
CONCLUSIONS The results demonstrate the superiority of the proposed model for Korean medical natural language processing. We expect that our proposed model can be extended and applied to various languages and domains.
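For readers who want a concrete picture of the pre-training setup summarized in METHODS, the sketch below illustrates continued pre-training of a Korean BERT checkpoint on a medical corpus with the masked language model (MLM) and next sentence prediction (NSP) objectives. It is a minimal sketch assuming the Hugging Face Transformers library; the checkpoint name, corpus path, and hyperparameters are illustrative placeholders, not the values used in this work.

```python
# Minimal sketch: continued pre-training of a Korean BERT on a medical corpus
# with the MLM + NSP objectives of the BERT pre-training framework.
# Assumption: Hugging Face Transformers; checkpoint name, corpus path, and
# hyperparameters are placeholders, not the paper's actual configuration.
from transformers import (
    BertTokenizerFast,
    BertForPreTraining,
    DataCollatorForLanguageModeling,
    TextDatasetForNextSentencePrediction,
    Trainer,
    TrainingArguments,
)

BASE_CHECKPOINT = "korean-bert-base"        # placeholder: a pre-trained Korean BERT
CORPUS_PATH = "korean_medical_corpus.txt"   # placeholder: one sentence per line,
                                            # blank lines separating documents

tokenizer = BertTokenizerFast.from_pretrained(BASE_CHECKPOINT)
model = BertForPreTraining.from_pretrained(BASE_CHECKPOINT)  # MLM head + NSP head

# Builds sentence-pair examples with next_sentence_label for the NSP objective.
dataset = TextDatasetForNextSentencePrediction(
    tokenizer=tokenizer,
    file_path=CORPUS_PATH,
    block_size=128,
)

# Randomly masks 15% of input tokens and supplies MLM labels at batch time.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="km-bert-pretraining",
        num_train_epochs=1,
        per_device_train_batch_size=16,
        learning_rate=5e-5,
    ),
    data_collator=collator,
    train_dataset=dataset,
)

trainer.train()                        # optimizes the combined MLM + NSP loss
trainer.save_model("km-bert-pretrained")
```

The resulting checkpoint can then be evaluated intrinsically on MLM and NSP accuracy, or fine-tuned on downstream data such as Korean medical semantic textual similarity, where performance is reported as the Pearson correlation between predicted and gold similarity scores.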