Think Twice: A Post-Processing Approach for the Chinese Spelling Error Correction

2021 ◽  
Vol 11 (13) ◽  
pp. 5832
Author(s):  
Wei Gou ◽  
Zheng Chen

Chinese Spelling Error Correction is an active subject in the field of natural language processing. Researchers have already produced many strong solutions, from early rule-based systems to current deep learning methods. At present, SpellGCN, proposed by Alibaba's team, achieves the best results, with a character-level precision of 98.4% on SIGHAN2013. However, when we apply this algorithm to practical error correction tasks, it produces many false corrections. We believe this is because the corpus used for model training contains significantly more errors than the text the model is asked to correct. In response to this problem, we propose a post-processing operation for error correction tasks. We treat the initial model's output as a candidate character, obtain various features of the character itself and its context, and then use a classification model to filter out the initial model's false corrections. The post-processing idea introduced in this paper can be applied to most Chinese Spelling Error Correction models to improve their performance on practical error correction tasks.
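The two-stage idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the toy features, and the stand-in classifier are all assumptions; the paper's filter uses a trained model over much richer character and context features.

```python
# Sketch of the post-processing step: the base corrector proposes candidate
# character edits, and a second-stage classifier inspects features of each
# edit and vetoes likely false corrections.

def featurize(sentence, pos, original_ch, candidate_ch):
    """Toy features of a candidate edit: the characters involved and a
    one-character context window (the paper uses richer features)."""
    left = sentence[pos - 1] if pos > 0 else ""
    right = sentence[pos + 1] if pos + 1 < len(sentence) else ""
    return (original_ch, candidate_ch, left, right)

def post_process(sentence, edits, accept):
    """Apply only the candidate edits that the filter `accept` approves."""
    chars = list(sentence)
    for pos, candidate in edits:
        feats = featurize(sentence, pos, chars[pos], candidate)
        if accept(feats):
            chars[pos] = candidate
    return "".join(chars)

# Stand-in for a trained filter classifier: reject no-op "corrections".
def toy_accept(feats):
    original_ch, candidate_ch, left, right = feats
    return original_ch != candidate_ch

print(post_process("abcd", [(1, "x"), (2, "c")], toy_accept))  # -> axcd
```

In a real pipeline, `toy_accept` would be replaced by a classifier trained to distinguish true corrections from false ones using the extracted features.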

10.2196/23230 ◽  
2021 ◽  
Vol 9 (8) ◽  
pp. e23230
Author(s):  
Pei-Fu Chen ◽  
Ssu-Ming Wang ◽  
Wei-Chih Liao ◽  
Lu-Cheng Kuo ◽  
Kuan-Chih Chen ◽  
...  

Background The International Classification of Diseases (ICD) code is widely used as a reference in medical systems and for billing purposes. However, classifying diseases into ICD codes still relies mainly on humans reading large amounts of written material as the basis for coding. Coding is both laborious and time-consuming. Since the conversion from ICD-9 to ICD-10, the coding task has become much more complicated, and deep learning– and natural language processing–related approaches have been studied to assist disease coders. Objective This paper aims to construct a deep learning model for ICD-10 coding, where the model automatically determines the corresponding diagnosis and procedure codes based solely on free-text medical notes, to improve accuracy and reduce human effort. Methods We used diagnosis records of the National Taiwan University Hospital and applied natural language processing techniques, including global vectors (GloVe), word2vec, embeddings from language models (ELMo), bidirectional encoder representations from transformers (BERT), and a single-head attention recurrent neural network, on a deep neural network architecture to implement ICD-10 auto-coding. In addition, we introduced an attention mechanism into the classification model to extract keywords from diagnoses and to visualize the coding reference for training newcomers to ICD-10. Sixty discharge notes were randomly selected to examine the change in F1-score and coding time before and after coders used our model. Results In experiments on the National Taiwan University Hospital data set, our predictions achieved F1-scores of 0.715 and 0.618 for the ICD-10 Clinical Modification and Procedure Coding System codes, respectively, with a BERT embedding approach in a gated recurrent unit (GRU) classification model. The trained models were deployed as an ICD-10 web service for coding and for training ICD-10 users.
With this service, coders' F1-scores increased significantly, from a median of 0.832 to 0.922 (P<.05), but coding time was not reduced. Conclusions The proposed model significantly improved the F1-score but did not decrease the time disease coders spent on coding.
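The recurrent classifier at the heart of the pipeline is a GRU over the note's embedded tokens. The toy scalar cell below illustrates the gating arithmetic only; the weights, dimensions, and names are illustrative assumptions, and the real model operates on BERT embedding vectors with a softmax over ICD-10 codes on top.

```python
import math

# Minimal scalar GRU cell: update gate z, reset gate r, candidate state
# h_tilde, then an interpolation between the old and candidate states.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, w):
    """One GRU update for a scalar input x and scalar hidden state h."""
    z = sigmoid(w["wz"] * x + w["uz"] * h)          # update gate
    r = sigmoid(w["wr"] * x + w["ur"] * h)          # reset gate
    h_tilde = math.tanh(w["wh"] * x + w["uh"] * (r * h))
    return (1.0 - z) * h + z * h_tilde

def run_gru(xs, w):
    """Run the cell over a sequence; the final state would feed a
    softmax over the ICD-10 code vocabulary."""
    h = 0.0
    for x in xs:
        h = gru_step(x, h, w)
    return h

weights = {"wz": 0.5, "uz": 0.1, "wr": 0.5, "ur": 0.1, "wh": 1.0, "uh": 0.8}
h = run_gru([1.0, -0.5, 0.2], weights)
assert -1.0 < h < 1.0  # hidden state stays tanh-bounded
```

Because the new state is a convex combination of the old state and a tanh-bounded candidate, the hidden state never leaves (-1, 1), which is one reason GRUs train stably on long clinical notes.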


2020 ◽  
Author(s):  
Wojciech Ozimek

Automatic text summarization is one of the most complex problems in the field of natural language processing. In this dissertation, we present an abstraction-based summarization approach that paraphrases the original text and generates new sentences. Creating new formulations, completely different from the original text, is similar to how humans summarize texts. To achieve this, we propose a deep learning method using a sequence-to-sequence architecture with an attention mechanism. The goal is to create a model for the Polish language, using a dataset containing over 200,000 articles from Polish websites, split into text and summary parts. The presented outcomes look promising, with decent results on the standard metrics for this type of task. Based on a review of prior research conducted during the experiments, this is the first attempt to apply abstractive text summarization techniques to the Polish language.
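The attention mechanism mentioned above lets the decoder weight the encoder's states at each generation step. A minimal dot-product-attention sketch, with toy 2-dimensional vectors (the vector sizes and values are illustrative assumptions, not the dissertation's configuration):

```python
import math

# Dot-product attention: score each encoder state against the decoder
# state, normalize with softmax, and build a weighted context vector.

def softmax(xs):
    m = max(xs)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(decoder_state, encoder_states):
    scores = [sum(d * e for d, e in zip(decoder_state, h))
              for h in encoder_states]
    weights = softmax(scores)
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(len(decoder_state))]
    return weights, context

enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w, ctx = attend([1.0, 0.0], enc)
assert abs(sum(w) - 1.0) < 1e-9   # weights form a probability distribution
assert w[0] > w[1]                # the aligned encoder state scores higher
```

The context vector is then concatenated with the decoder state before predicting the next summary word, which is what allows the model to paraphrase rather than copy.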


2020 ◽  
Author(s):  
Hui Chen ◽  
Honglei Liu ◽  
Ni Wang ◽  
Yanqun Huang ◽  
Zhiqiang Zhang ◽  
...  

BACKGROUND Liver cancer remains a substantial disease burden in China. As one of the primary diagnostic means for liver cancer, the dynamic enhanced computed tomography (CT) scan provides detailed diagnostic evidence that is recorded in free-text radiology reports. OBJECTIVE In this study, we combined knowledge-driven deep learning methods and data-driven natural language processing (NLP) methods to extract radiological features from these reports and designed a computer-aided liver cancer diagnosis framework. METHODS We collected 1089 CT radiology reports in Chinese. We fine-tuned a pre-trained BERT (Bidirectional Encoder Representations from Transformers) language model for word embedding. The embeddings served as inputs to a BiLSTM (Bidirectional Long Short-Term Memory) and CRF (Conditional Random Field) model (BERT-BiLSTM-CRF) to extract the features of hyperintense enhancement in the arterial phase (APHE) and hypointensity in the portal and delayed phases (PDPH). We also extracted features using a traditional rule-based NLP method based on the content of the radiology reports. We then applied a random forest for liver cancer diagnosis and calculated the Gini impurity to identify diagnosis evidence. RESULTS The BERT-BiLSTM-CRF predicted the features of APHE and PDPH with F1 scores of 98.40% and 90.67%, respectively. The prediction model using combined features performed better (F1 score, 88.55%) than models using a single kind of feature obtained by BERT-BiLSTM-CRF (84.88%) or by the traditional rule-based NLP method (83.52%). The features of APHE and PDPH were the two most important features for liver cancer diagnosis.
CONCLUSIONS We proposed a BERT-based deep learning method for diagnosis evidence extraction based on clinical knowledge. With the recognized features of APHE and PDPH, liver cancer diagnosis achieved high performance, which was further increased by combining them with the radiological features obtained by the traditional rule-based NLP method. The BERT-BiLSTM-CRF achieved state-of-the-art performance in this study and could be extended to other kinds of Chinese clinical texts. CLINICALTRIAL None
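The Gini impurity that the study uses to rank diagnosis evidence measures how mixed the class labels are within a node of a random-forest tree; a split whose children are purer earns the splitting feature a larger impurity decrease. A small self-contained sketch of the calculation (toy labels, not the study's data):

```python
# Gini impurity of a label set, and the impurity decrease achieved by
# splitting a parent node into two children -- the quantity that a random
# forest accumulates per feature to rank its importance.

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return 1.0 - sum(p * p for p in probs)

def gini_decrease(parent, left, right):
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# A perfectly informative feature (e.g. "APHE present") removes all impurity:
parent = [1, 1, 0, 0]          # 1 = liver cancer, 0 = not
assert gini(parent) == 0.5
assert gini_decrease(parent, [1, 1], [0, 0]) == 0.5
```

Features such as APHE and PDPH rank highest precisely because splits on them produce large decreases like the one above.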


2022 ◽  
Vol 355 ◽  
pp. 03028
Author(s):  
Saihan Li ◽  
Zhijie Hu ◽  
Rong Cao

Natural language inference refers to the problem of determining the relationship between a premise and a hypothesis; it is an emerging area of natural language processing. This paper uses deep learning methods to complete the natural language inference task. The data include a 3GPP dataset and the SNLI dataset. The Gensim library is used to obtain word embeddings, with two methods, word2vec and doc2vec, for mapping a sentence to an array. Two deep learning models, a DNNClassifier and an attention model, are implemented separately to classify the relationships between proposals from the telecommunication-domain dataset. The highest accuracy in the experiments is 88%, and we found that the quality of the dataset determines the upper bound of the accuracy.
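The "map the sentence to an array" step with word2vec-style vectors is commonly done by averaging the vectors of the sentence's words. The sketch below illustrates that recipe with a toy vocabulary; it is an assumed simplification, not the authors' exact preprocessing, and doc2vec instead learns a sentence vector directly.

```python
# Average the word2vec-style vectors of a sentence's known words to get a
# fixed-length array for the downstream classifier; unknown words are skipped.

def sentence_vector(sentence, word_vectors, dim):
    vecs = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    if not vecs:
        return [0.0] * dim            # no known words: zero vector
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Toy 2-dimensional embeddings standing in for trained Gensim vectors.
toy_vectors = {"premise": [1.0, 0.0], "entails": [0.0, 1.0]}
vec = sentence_vector("premise entails hypothesis", toy_vectors, 2)
assert vec == [0.5, 0.5]  # average of the two in-vocabulary words
```

With Gensim, the per-word lookups would come from a trained `Word2Vec` model's `wv` keyed vectors, and the resulting arrays for premise and hypothesis are what the classifier compares.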


Author(s):  
Tian Kang ◽  
Adler Perotte ◽  
Youlan Tang ◽  
Casey Ta ◽  
Chunhua Weng

Abstract Objective The study sought to develop and evaluate a knowledge-based data augmentation method to improve the performance of deep learning models for biomedical natural language processing by overcoming training data scarcity. Materials and Methods We extended the easy data augmentation (EDA) method for biomedical named entity recognition (NER) by incorporating Unified Medical Language System (UMLS) knowledge, and called this method UMLS-EDA. We designed experiments to systematically evaluate the effect of UMLS-EDA on popular deep learning architectures for both NER and classification. We also compared UMLS-EDA to BERT. Results UMLS-EDA enables substantial improvement for NER tasks over the original long short-term memory conditional random fields (LSTM-CRF) model (micro-F1 score: +5%, +17%, and +15%), helps the LSTM-CRF model (micro-F1 score: 0.66) outperform LSTM-CRF with transfer learning by BERT (0.63), and improves the performance of the state-of-the-art sentence classification model. The largest gain in micro-F1 score is 9%, from 0.75 to 0.84, better than classifiers with BERT pretraining (0.82). Conclusions This study presents a UMLS-based data augmentation method, UMLS-EDA. It is effective at improving deep learning models for both NER and sentence classification, and it contributes original insights for designing new, superior deep learning approaches for low-resource biomedical domains.
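The core EDA operation that UMLS-EDA builds on is synonym replacement: swap a few tokens for synonyms while keeping the labels, yielding extra training examples. The sketch below approximates the UMLS Metathesaurus with a toy synonym table; the table contents and function names are illustrative assumptions, not the paper's code.

```python
import random

# EDA-style synonym replacement: pick up to n tokens that have synonyms
# and swap each for a randomly chosen synonym. Labels carry over, so each
# augmented sentence becomes a new NER/classification training example.

SYNONYMS = {"tumor": ["neoplasm", "mass"], "kidney": ["renal"]}

def synonym_replace(tokens, n, rng):
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    rng.shuffle(candidates)           # randomize which tokens get replaced
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

rng = random.Random(0)                # seeded for reproducibility
aug = synonym_replace(["left", "kidney", "tumor"], 1, rng)
assert aug != ["left", "kidney", "tumor"]   # exactly one token was swapped
assert len(aug) == 3
```

UMLS-EDA's contribution is sourcing `SYNONYMS` from UMLS concept synonymy rather than a general-purpose thesaurus, which keeps the replacements medically faithful.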

