Pre-trained Data Augmentation for Text Classification

BACKGROUND: Traditional Chinese medicine (TCM) has been shown to be an efficient mode to manage advanced lung cancer, and accurate syndrome differentiation is crucial to treatment. Documented evidence of TCM treatment cases and the progress of artificial intelligence technology are enabling the development of intelligent TCM syndrome differentiation models. This is expected to expand the benefits of TCM to lung cancer patients.

OBJECTIVE: The objective of this work was to establish end-to-end TCM diagnostic models to imitate lung cancer syndrome differentiation. The proposed models used unstructured medical records as inputs to capitalize on data collected for practical TCM treatment cases by lung cancer experts. The resulting models were expected to be more efficient than approaches that leverage structured TCM datasets.

METHODS: We approached lung cancer TCM syndrome differentiation as a multilabel text classification problem. First, entity representation was conducted with Bidirectional Encoder Representations from Transformers and conditional random fields models. Then, five deep learning–based text classification models were applied to the construction of a medical record multilabel classifier, during which two data augmentation strategies were adopted to address overfitting issues. Finally, a fusion model approach was used to elevate the performance of the models.

RESULTS: The F1 score of the recurrent convolutional neural network (RCNN) model with augmentation was 0.8650, a 2.41% improvement over the unaugmented model. The Hamming loss for RCNN with augmentation was 0.0987, which is 1.8% lower than that of the same model without augmentation. Among the models, the text-hierarchical attention network (Text-HAN) model achieved the highest F1 scores of 0.8676 and 0.8751. The mean average precision for the word encoding–based RCNN was 10% higher than that of the character encoding–based representation. A fusion model of the text-convolutional neural network, text-recurrent neural network, and Text-HAN models achieved an F1 score of 0.8884, which showed the best performance among the models.

CONCLUSIONS: Medical records could be used more productively by constructing end-to-end models to facilitate TCM diagnosis. With the aid of entity-level representation, data augmentation, and model fusion, deep learning–based multilabel classification approaches can better imitate TCM syndrome differentiation in complex cases such as advanced lung cancer.
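
The abstract reports F1 and Hamming loss for a fused multilabel classifier, but it does not say how the fusion was performed or which F1 averaging was used. The following is a minimal Python sketch under stated assumptions: soft voting (averaging per-label probabilities across models), a 0.5 decision threshold, and micro-averaged F1. All function names and the toy numbers are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def fuse_probabilities(model_probs: Sequence[Matrix]) -> Matrix:
    """Soft voting (an assumption): average per-label probabilities over models."""
    n_models = len(model_probs)
    n_samples = len(model_probs[0])
    n_labels = len(model_probs[0][0])
    return [
        [sum(m[i][j] for m in model_probs) / n_models for j in range(n_labels)]
        for i in range(n_samples)
    ]


def binarize(probs: Matrix, threshold: float = 0.5) -> List[List[int]]:
    """Turn per-label probabilities into 0/1 label decisions."""
    return [[1 if p >= threshold else 0 for p in row] for row in probs]


def hamming_loss(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Fraction of individual label slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def micro_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Micro-averaged F1: pool TP/FP/FN over all samples and labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data (made up for illustration): 3 records, 3 syndrome labels, 3 models.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
probs_a = [[0.9, 0.2, 0.4], [0.1, 0.8, 0.3], [0.7, 0.4, 0.2]]
probs_b = [[0.8, 0.1, 0.7], [0.3, 0.6, 0.1], [0.6, 0.7, 0.6]]
probs_c = [[0.7, 0.3, 0.6], [0.2, 0.9, 0.2], [0.4, 0.6, 0.1]]

single_pred = binarize(probs_a)
fused_pred = binarize(fuse_probabilities([probs_a, probs_b, probs_c]))

print("single model:", hamming_loss(y_true, single_pred), micro_f1(y_true, single_pred))
print("fused model: ", hamming_loss(y_true, fused_pred), micro_f1(y_true, fused_pred))
```

Lower Hamming loss and higher F1 are better, which matches how the abstract compares the augmented and unaugmented models. On this toy data the fused prediction happens to correct the single model's two missed labels. That only illustrates the mechanics and says nothing about the paper's actual gains.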

Hierarchical Data Augmentation and the Application in Text Classification

IEEE Access, 2019, Vol 7, pp. 185476-185485. DOI: 10.1109/access.2019.2960263
Cited By ~ 2
Author(s): Shujuan Yu, Jie Yang, Danlei Liu, Runqi Li, Yun Zhang, ...
Keyword(s): Text Classification, Data Augmentation, Hierarchical Data

AUG-BERT: An Efficient Data Augmentation Algorithm for Text Classification

Lecture Notes in Electrical Engineering - Communications, Signal Processing, and Systems, 2020, pp. 2191-2198. DOI: 10.1007/978-981-13-9409-6_266
Author(s): Linqing Shi, Danyang Liu, Gongshen Liu, Kui Meng
Keyword(s): Text Classification, Data Augmentation, Augmentation Algorithm, Efficient Data

Text Classification by Contrastive Learning and Cross-lingual Data Augmentation for Alzheimer’s Disease Detection

2020. DOI: 10.18653/v1/2020.coling-main.542
Author(s): Zhiqiang Guo, Zhaoci Liu, Zhenhua Ling, Shijin Wang, Lingjing Jin, ...
Keyword(s): Alzheimer’s Disease, Text Classification, Data Augmentation, Disease Detection, Cross Lingual

Data Augmentation Based on Distributed Expressions in Text Classification Tasks

2019. DOI: 10.18653/v1/w19-8304
Author(s): Sugawara Yu
Keyword(s): Text Classification, Data Augmentation, Classification Tasks

Data Augmentation For Chinese Text Classification Using Back-Translation

Journal of Physics: Conference Series, 2020, Vol 1651, pp. 012039. DOI: 10.1088/1742-6596/1651/1/012039
Author(s): Jun Ma, Langlang Li
Keyword(s): Chinese Text, Text Classification, Data Augmentation, Chinese Text Classification, Back Translation

End-to-End Models to Imitate Traditional Chinese Medicine Syndrome Differentiation in Lung Cancer Diagnosis: Model Development and Validation

JMIR Medical Informatics, 2020, Vol 8 (6), pp. e17821. DOI: 10.2196/17821
Author(s): Ziqing Liu, Haiyang He, Shixing Yan, Yong Wang, Tao Yang, ...
Keyword(s): Neural Network, Lung Cancer, Text Classification, Medical Records, Data Augmentation, Advanced Lung Cancer, Syndrome Differentiation, Fusion Model, End-to-End, TCM Syndrome

