Hybrid Model of Data Augmentation Methods for Text Classification Task

Abstract: The Sundanese language has over 32 million speakers worldwide, but it has reaped little to no benefit from recent advances in natural language understanding. As with other low-resource languages, the only alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In subsequent analyses, our models benefited strongly from the Sundanese pre-training corpus size and did not exhibit socially biased behavior. We released our models for other researchers and practitioners to use.

Voiceprint Identification for Limited Dataset Using the Deep Migration Hybrid Model Based on Transfer Learning

Sensors ◽ 10.3390/s18072399 ◽ 2018 ◽ Vol 18 (7) ◽ pp. 2399 ◽ Cited By ~ 9

Author(s): Cunwei Sun ◽ Yuxin Yang ◽ Chang Wen ◽ Kai Xie ◽ Fangqing Wen

Keyword(s): Neural Network ◽ Convolutional Neural Network ◽ Transfer Learning ◽ Hybrid Model ◽ Data Augmentation ◽ Small Sample ◽ Restricted Boltzmann Machine ◽ Small Samples ◽ Boltzmann Machine ◽ Training Time

The convolutional neural network (CNN) has made great strides in voiceprint recognition, but training a deep neural network requires a huge number of data samples. In practice, a large number of training samples is difficult to obtain, and a network trained on a limited dataset cannot reach a good convergence state. To solve this problem, a new method using a deep migration hybrid model is put forward, which makes voiceprint recognition easier to realize for small samples. First, it uses transfer learning to transfer the network trained on a large voiceprint dataset to our limited voiceprint dataset for further training. The fully connected layers of the pre-trained model are replaced by restricted Boltzmann machine layers. Second, data augmentation is adopted to increase the amount of voiceprint data. Finally, we introduce fast batch normalization algorithms to speed up network convergence and shorten the training time. Our new voiceprint recognition approach uses the TLCNN-RBM model (a convolutional neural network mixed with restricted Boltzmann machines, based on transfer learning). This deep migration hybrid model achieves an average accuracy of over 97%, which is higher than that of either the CNN or the TL-CNN network (a convolutional neural network based on transfer learning). Thus, an effective method for small-sample voiceprint recognition is provided.
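To make the batch normalization step concrete, here is a minimal sketch of standard batch normalization in plain Python. It is not the "fast" variant the abstract mentions, whose details are not given here; the function name `batch_norm` and the parameters `gamma`, `beta`, and `eps` are illustrative assumptions.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standard batch normalization over a mini-batch (illustrative sketch).

    batch is a list of samples, each a list of feature values. Every feature
    column is shifted to zero mean and scaled to unit variance, then rescaled
    by gamma and shifted by beta.
    """
    n = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(n)]
    for j in range(n_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        # Biased variance (divide by n), as used during training.
        var = sum((x - mean) ** 2 for x in column) / n
        for i in range(n):
            out[i][j] = gamma * (column[i] - mean) / math.sqrt(var + eps) + beta
    return out


# Two features on very different scales end up on a comparable scale.
toy_batch = [[1.0, 10.0], [3.0, 30.0]]
normalized = batch_norm(toy_batch)
```

Keeping layer inputs on a comparable scale in this way is the usual reason batch normalization is credited with faster convergence.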

Reducing and Exploiting Data Augmentation Noise through Meta Reweighting Contrastive Learning for Text Classification

10.1109/bigdata52589.2021.9671510 ◽ 2021

Author(s): Guanyi Mou ◽ Yichuan Li ◽ Kyumin Lee

Keyword(s): Text Classification ◽ Data Augmentation

Explicit Interaction Model towards Text Classification

Proceedings of the AAAI Conference on Artificial Intelligence ◽ 10.1609/aaai.v33i01.33016359 ◽ 2019 ◽ Vol 33 ◽ pp. 6359-6366 ◽ Cited By ~ 3

Author(s): Cunxiao Du ◽ Zhaozheng Chen ◽ Fuli Feng ◽ Lei Zhu ◽ Tian Gan ◽ ...

Keyword(s): Language Processing ◽ Text Classification ◽ Deep Neural Networks ◽ Interaction Mechanism ◽ Interaction Model ◽ Classification Task ◽ Fine Grained ◽ Word Level ◽ Benchmark Datasets ◽ Classification Tasks

Text classification is one of the fundamental tasks in natural language processing. Recently, deep neural networks have achieved promising performance in the text classification task compared to shallow models. Despite the significance of deep models, they ignore fine-grained classification clues (matching signals between words and classes), since their classifications mainly rely on text-level representations. To address this problem, we introduce the interaction mechanism to incorporate word-level matching signals into the text classification task. In particular, we design a novel framework, the EXplicit interAction Model (dubbed EXAM), equipped with the interaction mechanism. We validated the proposed approach on several benchmark datasets, including both multi-label and multi-class text classification tasks. Extensive experimental results demonstrate the superiority of the proposed method. As a byproduct, we have released the code and parameter settings to facilitate other research.
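The idea of word-level matching signals between words and classes can be sketched as follows. This is only an illustration under stated assumptions, not the EXAM architecture: each word and each class is assumed to have a small embedding vector, matching is a plain dot product, and the per-class aggregation is a simple sum. The names `interaction_matrix` and `classify` and the toy vectors are hypothetical.

```python
def interaction_matrix(word_vecs, class_vecs):
    """Return a words x classes matrix of dot-product matching scores."""
    return [
        [sum(w * c for w, c in zip(word_vec, class_vec)) for class_vec in class_vecs]
        for word_vec in word_vecs
    ]


def classify(word_vecs, class_vecs):
    """Aggregate word-level matching scores per class and pick the best class.

    Summation is an assumption made for this sketch; a learned aggregation
    could be used instead.
    """
    matrix = interaction_matrix(word_vecs, class_vecs)
    logits = [sum(row[k] for row in matrix) for k in range(len(class_vecs))]
    return logits.index(max(logits)), matrix


# Toy 2-dimensional embeddings: two classes and a three-word text.
toy_classes = [[1.0, 0.0], [0.0, 1.0]]
toy_words = [[0.9, 0.1], [0.2, 0.3], [0.8, 0.0]]
predicted, scores = classify(toy_words, toy_classes)
```

In contrast with a model that first pools the words into a single text vector, every word here contributes its own score to every class before any aggregation happens.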

Pre-trained Data Augmentation for Text Classification

Intelligent Systems - Lecture Notes in Computer Science ◽ 10.1007/978-3-030-61377-8_38 ◽ 2020 ◽ pp. 551-565

Author(s): Hugo Queiroz Abonizio ◽ Sylvio Barbon Junior

Keyword(s): Text Classification ◽ Data Augmentation

An Efficient Method for Text Classification Task

Proceedings of the 2019 International Conference on Big Data Engineering (BDE 2019) - BDE 2019 ◽ 10.1145/3341620.3341631 ◽ 2019 ◽ Cited By ~ 1

Author(s): Qiancheng Liang ◽ Ping Wu ◽ Chaoyi Huang

Keyword(s): Text Classification ◽ Efficient Method ◽ Classification Task

Logo-2K+: A Large-Scale Logo Dataset for Scalable Logo Classification

Proceedings of the AAAI Conference on Artificial Intelligence ◽ 10.1609/aaai.v34i04.6085 ◽ 2020 ◽ Vol 34 (04) ◽ pp. 6194-6201

Author(s): Jing Wang ◽ Weiqing Min ◽ Sujuan Hou ◽ Shengnan Ma ◽ Yuanjie Zheng ◽ ...

Keyword(s): Image Recognition ◽ Real World ◽ Large Scale ◽ Data Augmentation ◽ Ground Truth ◽ Classification Task ◽ The Real ◽ Product Recommendation ◽ Contextual Advertising ◽ Benchmark Datasets

Logo classification has gained increasing attention for its various applications, such as copyright infringement detection, product recommendation, and contextual advertising. Compared with other types of object images, real-world logo images have greater variety in logo appearance and more complex backgrounds. Therefore, recognizing the logo in an image is challenging. To support efforts towards the scalable logo classification task, we have curated Logo-2K+, a new large-scale, publicly available real-world logo dataset with 2,341 categories and 167,140 images. Compared with existing popular logo datasets, such as FlickrLogos-32 and LOGO-Net, Logo-2K+ has more comprehensive coverage of logo categories and a larger quantity of logo images. Moreover, we propose a Discriminative Region Navigation and Augmentation Network (DRNA-Net), which is capable of discovering more informative logo regions and augmenting these image regions for logo classification. DRNA-Net consists of four sub-networks: the navigator sub-network first selects informative logo-relevant regions guided by the teacher sub-network, which evaluates the confidence that each region belongs to the ground-truth logo class. The data augmentation sub-network then augments the selected regions via both region cropping and region dropping. Finally, the scrutinizer sub-network fuses features from the augmented regions and the whole image for logo classification. Comprehensive experiments on Logo-2K+ and three other existing benchmark datasets demonstrate the effectiveness of the proposed method. Logo-2K+ and the proposed strong baseline DRNA-Net are expected to further the development of scalable logo image recognition, and the Logo-2K+ dataset can be found at https://github.com/msn199959/Logo-2k-plus-Dataset.
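Region cropping and region dropping, the two augmentations named above, can be illustrated with a minimal sketch on a toy image stored as a nested list. It assumes the region coordinates are already known; how DRNA-Net selects regions is not reproduced here, and the helper names `crop_region` and `drop_region` are hypothetical.

```python
def crop_region(image, top, left, height, width):
    """Region cropping: return only the selected rectangle as a new image."""
    return [row[left:left + width] for row in image[top:top + height]]


def drop_region(image, top, left, height, width, fill=0):
    """Region dropping: return a copy of the image with the rectangle masked out."""
    out = [list(row) for row in image]  # copy so the input stays untouched
    for r in range(top, min(top + height, len(out))):
        for c in range(left, min(left + width, len(out[r]))):
            out[r][c] = fill
    return out


# Toy 4x4 single-channel "image" whose pixel values encode their position.
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
cropped = crop_region(toy_image, top=1, left=1, height=2, width=2)
dropped = drop_region(toy_image, top=1, left=1, height=2, width=2)
```

Cropping keeps only the selected region so a classifier can focus on it, while dropping hides that region so the classifier has to rely on the rest of the image.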

Few-Shot Text Classification with Triplet Networks, Data Augmentation, and Curriculum Learning

10.18653/v1/2021.naacl-main.434 ◽ 2021

Author(s): Jason Wei ◽ Chengyu Huang ◽ Soroush Vosoughi ◽ Yu Cheng ◽ Shiqi Xu

Keyword(s): Text Classification ◽ Data Augmentation

Spam Classification on 2019 Indonesian President Election Youtube Comments Using Multinomial Naïve-Bayes

Indonesian Journal of Artificial Intelligence and Data Mining ◽ 10.24014/ijaidm.v2i1.6445 ◽ 2019 ◽ Vol 2 (1) ◽ Cited By ~ 1

Author(s): Jonathan Radot Fernando ◽ Raymond Budiraharjo ◽ Emeraldi Haganusa

Keyword(s): Text Classification ◽ Naive Bayes ◽ Naïve Bayes ◽ Classification Task ◽ Bag Of Words ◽ Text Representation ◽ Frequency Data ◽ Bayes Algorithm ◽ Representation Method ◽ The Way

Text classification is used in many areas of technology, such as spam classification, news categorization, and auto-correct texting. One of the most popular algorithms for text classification today is Multinomial Naïve-Bayes. This paper explains how the Naïve-Bayes assumption works to classify YouTube comments on the 2019 Indonesian election. The algorithm's output prediction is spam or not spam. Spam messages are defined as racist comments, advertising comments, and unsolicited comments. The algorithm's text representation method is bag-of-words, which defines a text as the multiset of its words. The algorithm then calculates the probability of a word given the class, spam or not spam. The main difference between the normal Naïve-Bayes algorithm and Multinomial Naïve-Bayes is the way the algorithm treats the data: Multinomial Naïve-Bayes treats data as frequency data, hence it is suitable for the text classification task.
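The pipeline described above (bag-of-words counts, per-class word probabilities, spam or not spam output) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the paper's implementation: the whitespace tokenizer, the add-one (Laplace) smoothing, the skipping of unseen words, and the toy comments are all hypothetical choices.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def train(docs):
    """Collect the counts Multinomial Naive Bayes needs from (text, label) pairs."""
    class_docs = Counter()  # number of documents per class
    word_counts = {}        # per-class bag of words (multiset of tokens)
    vocab = set()
    for text, label in docs:
        class_docs[label] += 1
        bag = word_counts.setdefault(label, Counter())
        for token in tokenize(text):
            bag[token] += 1
            vocab.add(token)
    return class_docs, word_counts, vocab


def predict(text, model):
    """Return the class with the highest log posterior for the given text."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        bag = word_counts[label]
        total_tokens = sum(bag.values())
        score = math.log(class_docs[label] / total_docs)  # log prior
        for token in tokenize(text):
            if token not in vocab:
                continue  # assumption: ignore words never seen in training
            # Add-one smoothing keeps unseen class/word pairs from zeroing the score.
            score += math.log((bag[token] + 1) / (total_tokens + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy comments, not taken from the paper's dataset.
TOY_COMMENTS = [
    ("buy cheap followers now", "spam"),
    ("click here to win a prize", "spam"),
    ("great debate between the candidates", "not spam"),
    ("i agree with his economic plan", "not spam"),
]
model = train(TOY_COMMENTS)
```

Because the per-class bags store how often each word occurs rather than only whether it occurs, repeated words add to the score more than once, which is the frequency-data treatment the abstract refers to.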
