Research on Multi-label Text Classification Method Based on tALBERT-CNN

Author(s):  
Wenfu Liu ◽  
Jianmin Pang ◽  
Nan Li ◽  
Xin Zhou ◽  
Feng Yue

Single-label classification technology has difficulty meeting the needs of text classification, and multi-label text classification has become an important research issue in natural language processing (NLP). Extracting semantic features from different levels and granularities of text is a basic and key task in multi-label text classification research. A topic model is an effective method for the automatic organization and induction of text information: it can reveal the latent semantics of documents and analyze the topics contained in massive information. Therefore, this paper proposes a multi-label text classification method based on tALBERT-CNN. An LDA topic model and an ALBERT model are used to obtain the topic vector and the semantic context vector of each word (document), a fusion mechanism is adopted to obtain an in-depth topic and semantic representation of the document, and the multi-label features of the text are extracted through a TextCNN model to train a multi-label classifier. Experimental results on standard datasets show that the proposed method can extract multi-label features from documents and that its performance is better than that of existing state-of-the-art multi-label text classification algorithms.
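
A minimal PyTorch sketch of the fusion-plus-TextCNN stage described above. The abstract does not specify the fusion mechanism, so simple concatenation of each token's ALBERT hidden state with the document's LDA topic distribution is assumed; the class name `TopicSemanticTextCNN` and all dimensions are illustrative, and real ALBERT outputs and LDA topic vectors would replace the random tensors.

```python
import torch
import torch.nn as nn

class TopicSemanticTextCNN(nn.Module):
    """Hypothetical tALBERT-CNN-style classifier: fuses per-token ALBERT
    hidden states with an LDA document-topic vector, then applies a
    TextCNN with several kernel sizes and a sigmoid multi-label head."""
    def __init__(self, hidden_dim=768, n_topics=50, n_labels=20,
                 kernel_sizes=(3, 4, 5), n_filters=100):
        super().__init__()
        fused_dim = hidden_dim + n_topics
        self.convs = nn.ModuleList(
            nn.Conv1d(fused_dim, n_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_labels)

    def forward(self, albert_states, topic_vec):
        # albert_states: (batch, seq_len, hidden_dim) from an ALBERT encoder
        # topic_vec:     (batch, n_topics) LDA document-topic distribution
        seq_len = albert_states.size(1)
        topics = topic_vec.unsqueeze(1).expand(-1, seq_len, -1)
        fused = torch.cat([albert_states, topics], dim=-1).transpose(1, 2)
        pooled = [conv(fused).relu().max(dim=-1).values for conv in self.convs]
        return torch.sigmoid(self.fc(torch.cat(pooled, dim=-1)))

model = TopicSemanticTextCNN()
probs = model(torch.randn(2, 64, 768), torch.softmax(torch.randn(2, 50), -1))
print(probs.shape)  # torch.Size([2, 20])
```

A sigmoid head with one output per label (rather than a softmax) is what makes the classifier multi-label: each label probability is predicted independently.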

2011 ◽  
Vol 179-180 ◽  
pp. 940-944
Author(s):  
Jian Chen ◽  
Wen Rong Jiang ◽  
Ji Hong Yan

Information retrieval is a crucial issue for many areas, such as industry, national security, and disease control, and automatically organizing massive information into an understandable, readable form is a key step toward understanding data. Text classification is a key task for automatic text understanding; this paper provides a text classification method based on the SOM neural network model and delivers reasonable performance.
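
The abstract gives no implementation details, so the sketch below uses the MiniSom library with TF-IDF document vectors (a common but assumed feature choice) and labels each map node by the majority class of the training documents it wins.

```python
# pip install minisom scikit-learn
from collections import Counter, defaultdict
from minisom import MiniSom
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["stocks fell sharply today", "new vaccine trial begins",
        "team wins championship game", "market rallies on earnings"]
labels = ["finance", "health", "sports", "finance"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()

# Train a 4x4 self-organizing map on the document vectors.
som = MiniSom(4, 4, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=42)
som.train_random(X, 500)

# Label each map node by the majority class of the documents it wins.
node_votes = defaultdict(Counter)
for x, lab in zip(X, labels):
    node_votes[som.winner(x)][lab] += 1

# Classify a new document by its winning node's majority label.
q = vec.transform(["quarterly profits rise"]).toarray()[0]
print(node_votes[som.winner(q)].most_common(1) or "unlabelled node")
```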


2019 ◽  
Vol 45 (1) ◽  
pp. 11-14
Author(s):  
Zuhair Ali

Automated classification of text into predefined categories has always been considered a vital method in the natural language processing field. In this paper, new methods based on the Radial Basis Function (RBF) and the Fuzzy Radial Basis Function (FRBF) are used to solve the problem of text classification: a set of features is extracted for each sentence in the document collection, and these features are introduced to the FRBF and RBF to classify documents. The Reuters-21578 dataset is utilized for the purpose of text classification. The results showed that FRBF is more effective than RBF.
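
A compact NumPy sketch of an RBF-network classifier: cluster centers come from k-means, Gaussian basis activations form the hidden layer, and the output weights are fit by regularized least squares. The fuzzy variant (FRBF) would replace the crisp activations with fuzzy memberships; its exact formulation is not given in the abstract, so only the plain RBF path is shown, with random stand-ins for sentence feature vectors.

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf_features(X, centers, gamma):
    # Gaussian basis activations: phi_j(x) = exp(-gamma * ||x - c_j||^2)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def train_rbf(X, Y, n_centers=10, gamma=1.0):
    centers = KMeans(n_clusters=n_centers, n_init=10,
                     random_state=0).fit(X).cluster_centers_
    Phi = rbf_features(X, centers, gamma)
    # Output weights via regularized least squares, one column per class.
    W = np.linalg.solve(Phi.T @ Phi + 1e-6 * np.eye(n_centers), Phi.T @ Y)
    return centers, W

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))                 # toy sentence feature vectors
Y = np.eye(3)[rng.integers(0, 3, size=60)]    # one-hot targets, 3 classes
centers, W = train_rbf(X, Y)
pred = rbf_features(X, centers, 1.0) @ W
print((pred.argmax(1) == Y.argmax(1)).mean())  # training accuracy
```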


2021 ◽  
Author(s):  
Yuting Guo ◽  
Yao Ge ◽  
Yuan-Chi Yang ◽  
Mohammed Ali Al-Garadi ◽  
Abeed Sarker

Motivation: Pretrained contextual language models proposed in the recent past have been reported to achieve state-of-the-art performance in many natural language processing (NLP) tasks. There is a need to benchmark such models for targeted NLP tasks, and to explore effective pretraining strategies to improve machine learning performance. Results: In this work, we addressed the task of health-related social media text classification. We benchmarked five models (RoBERTa, BERTweet, TwitterBERT, BioClinical_BERT, and BioBERT) on 22 tasks. We attempted to boost performance for the best models by comparing distinct pretraining strategies: domain-adaptive pretraining (DAPT), source-adaptive pretraining (SAPT), and topic-specific pretraining (TSPT). RoBERTa and BERTweet performed comparably in most tasks, and better than the others. Among pretraining strategies, SAPT performed better than or comparably to the off-the-shelf models, and significantly outperformed DAPT. SAPT+TSPT showed consistently high performance, with a statistically significant improvement in one task. Our findings demonstrate that RoBERTa and BERTweet are excellent off-the-shelf models for health-related social media text classification, and that extended pretraining using SAPT and TSPT can further improve performance.
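
All three strategies (DAPT, SAPT, TSPT) share the same mechanical step: continued masked-language-model pretraining on an additional corpus before task fine-tuning; only the corpus choice differs (target-domain text, source-platform text, or topic-specific text). A minimal Hugging Face sketch of that shared step, with a tiny stand-in corpus:

```python
# pip install transformers torch
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Stand-in corpus; for SAPT this would be source-platform posts,
# for DAPT in-domain text, for TSPT topic-filtered text.
corpus = ["felt dizzy after the second dose",
          "my migraine meds stopped working again"]
dataset = [tok(t, truncation=True, max_length=128) for t in corpus]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="adapted-ckpt", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
)
trainer.train()  # the adapted checkpoint is then fine-tuned per task
```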


2014 ◽  
Vol 2014 ◽  
pp. 1-9 ◽  
Author(s):  
Jin Dai ◽  
Xin Liu

The similarity between objects is a core research area of data mining. In order to reduce the interference caused by the uncertainty of natural language, a similarity measurement between normal cloud models is applied to text classification research. On this basis, a novel text classifier based on cloud concept jumping up (CCJU-TC) is proposed, which can efficiently convert between qualitative concepts and quantitative data. Through the conversion from a text set to a text information table based on the VSM model, the qualitative concepts extracted from texts of the same category are jumped up into a whole-category concept. According to the cloud similarity between a test text and each category concept, the test text is assigned to the most similar category. Comparisons among different text classifiers on different feature selection sets show not only that CCJU-TC has a strong ability to adapt to different text features, but also that its classification performance is better than that of traditional classifiers.
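
The sketch below shows the two cloud-model operations the abstract relies on: a backward cloud generator that turns quantitative samples into a qualitative concept (Ex, En, He), and a similarity between two normal clouds. The abstract does not give the exact similarity formula, so the overlap of the two expectation curves is assumed here as one common choice.

```python
import numpy as np

def backward_cloud(samples):
    # Standard backward cloud generator: estimate the cloud's digital
    # characteristics (Ex, En, He) from quantitative samples.
    x = np.asarray(samples, dtype=float)
    Ex = x.mean()
    En = np.sqrt(np.pi / 2) * np.abs(x - Ex).mean()
    He = np.sqrt(max(x.var(ddof=1) - En ** 2, 0.0))
    return Ex, En, He

def cloud_similarity(c1, c2, n=2000):
    # Overlap ratio of the expectation curves exp(-(x-Ex)^2 / (2*En^2));
    # an assumed measure, since the abstract does not specify one.
    (Ex1, En1, _), (Ex2, En2, _) = c1, c2
    lo = min(Ex1 - 3 * En1, Ex2 - 3 * En2)
    hi = max(Ex1 + 3 * En1, Ex2 + 3 * En2)
    xs = np.linspace(lo, hi, n)
    y1 = np.exp(-(xs - Ex1) ** 2 / (2 * En1 ** 2))
    y2 = np.exp(-(xs - Ex2) ** 2 / (2 * En2 ** 2))
    return np.minimum(y1, y2).sum() / np.maximum(y1, y2).sum()

rng = np.random.default_rng(1)
# e.g. the distribution of one term's weight within two categories
c_a = backward_cloud(rng.normal(0.40, 0.10, 500))
c_b = backward_cloud(rng.normal(0.45, 0.12, 500))
print(round(cloud_similarity(c_a, c_b), 3))
```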


2019 ◽  
Author(s):  
Dimmy Magalhães ◽  
Aurora Pozo ◽  
Roberto Santana

Text classification is one of the tasks of Natural Language Processing (NLP). In this area, Graph Convolutional Networks (GCNs) have achieved results higher than CNNs and other related models. For a GCN, the metric that defines the correlation between words in a vector space plays a crucial role in classification, because it determines the weight of the edge between two words (represented by nodes in the graph). In this study, we empirically investigated the impact of thirteen measures of distance/similarity. A representation was built for each document using word embeddings from a word2vec model. Also, a graph-based representation of five datasets was created for each measure analyzed, where each word is a node in the graph and each edge is weighted by the distance/similarity between words. Finally, each model was run in a simple graph neural network. The results show that, concerning text classification, there is no statistically significant difference among the analyzed metrics in the Graph Convolutional Network. Even with the incorporation of external words or external knowledge, the results were similar to those of the methods without such incorporation. However, the results indicate that some distance metrics capture context better than others, with Euclidean distance reaching the best values or being statistically similar to the best.
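
A NumPy sketch of the pipeline being compared: word2vec vectors (random stand-ins here) define edge weights under a chosen distance/similarity metric, and one symmetric-normalized GCN propagation step produces node features for classification. Two of the thirteen metrics are shown; the mappings from distance to edge weight are illustrative assumptions.

```python
import numpy as np

def adjacency_from_embeddings(E, metric="euclidean"):
    # Edge weights from a pairwise distance/similarity between word vectors.
    if metric == "euclidean":
        d = np.linalg.norm(E[:, None] - E[None, :], axis=-1)
        A = 1.0 / (1.0 + d)                  # closer words -> heavier edges
    elif metric == "cosine":
        En = E / np.linalg.norm(E, axis=1, keepdims=True)
        A = (En @ En.T + 1.0) / 2.0          # map [-1, 1] to [0, 1]
    else:
        raise ValueError(f"unknown metric: {metric}")
    np.fill_diagonal(A, 1.0)                  # self-loops
    return A

def gcn_layer(A, X, W):
    # One propagation step: H = ReLU(D^{-1/2} A D^{-1/2} X W)
    d = A.sum(1)
    A_hat = A / np.sqrt(np.outer(d, d))
    return np.maximum(A_hat @ X @ W, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 50))      # stand-in word2vec vectors for 12 words
W = rng.normal(size=(50, 16)) * 0.1
H = gcn_layer(adjacency_from_embeddings(X, "euclidean"), X, W)
print(H.shape)  # (12, 16) node features; pooled, they feed the classifier
```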


2021 ◽  
Author(s):  
Deniz Kavi

Text generation is the task of generating natural language and producing outputs similar to, or better than, human texts. Due to deep learning's recent success in the field of natural language processing, computer-generated text has come closer to becoming indistinguishable from human writing. Genetic algorithms have not been as popular in the field of text generation. We propose a genetic algorithm combined with text classification and clustering models that automatically grade the texts generated by the genetic algorithm. The genetic algorithm is given poorly generated texts from a Markov chain; these texts are then graded by a text classifier and a text clustering model. We then apply crossover to pairs of texts, with emphasis on those that received higher grades. Changes to the grading system and further improvements to the genetic algorithm are to be the focus of future research.
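
A self-contained sketch of the proposed loop: a Markov-chain stand-in seeds the population, a fitness function stands in for the classifier/clustering grader, and crossover plus mutation favours higher-graded texts. The real grader and generator are models the paper trains; everything here is a toy substitute.

```python
import random
random.seed(0)

def markov_seed(vocab, length=12):
    # Stand-in for the paper's Markov-chain generator: random word salad.
    return [random.choice(vocab) for _ in range(length)]

def fitness(text):
    # Stand-in grader; the paper scores candidates with a trained text
    # classifier plus a clustering model. Here: mean word length.
    return sum(len(w) for w in text) / len(text)

def crossover(a, b):
    cut = random.randrange(1, min(len(a), len(b)))
    return a[:cut] + b[cut:]

vocab = "the quick brown fox jumps over a lazy dog near still water".split()
pop = [markov_seed(vocab) for _ in range(20)]

for _ in range(30):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]                        # favour higher-graded texts
    children = [crossover(random.choice(parents), random.choice(parents))
                for _ in range(10)]
    for child in children:                    # occasional point mutation
        if random.random() < 0.3:
            child[random.randrange(len(child))] = random.choice(vocab)
    pop = parents + children

print(" ".join(max(pop, key=fitness)))
```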


Author(s):  
Yakobus Wiciaputra ◽  
Julio Young ◽  
Andre Rusli

With the large amount of text information circulating on the internet, there is a need for a solution that can help process data in the form of text for various purposes. In Indonesia, text information circulating on the internet generally uses two languages, English and Indonesian. This research focuses on building a model that is able to classify text in more than one language, commonly known as multilingual text classification. The multilingual text classification uses the XLM-RoBERTa model in its implementation. This study applied the transfer learning concept used by XLM-RoBERTa to build a classification model for texts in Indonesian using only the English News Dataset as a training dataset, achieving a Matthews Correlation Coefficient (MCC) of 42.2%. The best results were obtained when the large Mixed News Dataset (108,190 samples) was used for training: tested on the large English News Dataset (37,886 samples), the model reached an MCC of 90.8%, accuracy of 93.3%, precision of 93.4%, recall of 93.3%, and F1 of 93.3%; tested on the large Indonesian News Dataset (70,304 samples), it reached an MCC of 86.4% with accuracy, precision, recall, and F1 of 90.2%. Keywords: Multilingual Text Classification, Natural Language Processing, News Dataset, Transfer Learning, XLM-RoBERTa
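
A minimal sketch of the transfer setup: XLM-RoBERTa is fine-tuned on English examples only and then evaluated directly on Indonesian inputs, relying on the model's shared multilingual representation. The two-example "datasets", label scheme, and training loop are illustrative stand-ins for the News Datasets described above.

```python
# pip install transformers torch scikit-learn
import torch
from sklearn.metrics import matthews_corrcoef
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)

# Fine-tune on English examples only (0 = business, 1 = sport).
en_texts = ["Stocks rallied after the earnings report.",
            "The striker scored twice in the final."]
en_labels = torch.tensor([0, 1])
batch = tok(en_texts, padding=True, return_tensors="pt")
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
for _ in range(3):                             # tiny illustrative loop
    loss = model(**batch, labels=en_labels).loss
    loss.backward(); opt.step(); opt.zero_grad()

# Zero-shot evaluation on Indonesian inputs never seen in training.
id_texts = ["Saham menguat setelah laporan laba.",         # business
            "Penyerang itu mencetak dua gol di final."]    # sport
with torch.no_grad():
    logits = model(**tok(id_texts, padding=True, return_tensors="pt")).logits
print(matthews_corrcoef([0, 1], logits.argmax(-1).tolist()))
```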


Author(s):  
Jeow Li Huan ◽  
Arif Ahmed Sekh ◽  
Chai Quek ◽  
Dilip K. Prasad

Text classification is one of the most widely used tasks in natural language processing. State-of-the-art text classifiers use the vector space model for extracting features. Recent progress in deep models, such as recurrent neural networks that preserve the positional relationships among words, achieves higher accuracy. To push text classification accuracy even higher, multi-dimensional document representations, such as vector sequences or matrices combined with document sentiment, should be explored. In this paper, we show that documents can be represented as sequences of vectors carrying semantic meaning and classified using a recurrent neural network that recognizes long-range relationships. We show that in this representation, additional sentiment vectors can easily be attached as a fully connected layer to the word vectors to further improve classification accuracy. On the UCI sentiment labelled dataset, using the sequence of vectors alone achieved an accuracy of 85.6%, which is better than the 80.7% from a ridge regression classifier, the best among the classical techniques we tested. Additional sentiment information further increases accuracy to 86.3%. On our suicide notes dataset, the best classical technique, the Naïve Bayes Bernoulli classifier, achieves an accuracy of 71.3%, while our classifier, incorporating semantic and sentiment information, exceeds that with 75% accuracy.
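
A PyTorch sketch of the described representation: an LSTM reads the document as a sequence of word vectors, and a document-level sentiment vector is concatenated before the fully connected classification layer, as the abstract outlines. All dimensions and the sentiment-vector source (e.g. lexicon polarity scores) are assumptions.

```python
import torch
import torch.nn as nn

class SentimentAwareRNN(nn.Module):
    """Sketch: an LSTM over word vectors, with a document sentiment
    vector concatenated before the fully connected classifier."""
    def __init__(self, word_dim=300, sent_dim=4, hidden=128, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(word_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden + sent_dim, n_classes)

    def forward(self, word_vecs, sentiment):
        # word_vecs: (batch, seq_len, word_dim), e.g. pretrained embeddings
        # sentiment: (batch, sent_dim), e.g. lexicon-based polarity scores
        _, (h, _) = self.lstm(word_vecs)
        return self.fc(torch.cat([h[-1], sentiment], dim=-1))

model = SentimentAwareRNN()
logits = model(torch.randn(8, 40, 300), torch.rand(8, 4))
print(logits.shape)  # torch.Size([8, 2])
```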


Author(s):  
Qian-Wen Zhang ◽  
Ximing Zhang ◽  
Zhao Yan ◽  
Ruifang Liu ◽  
Yunbo Cao ◽  
...  

Multi-label text classification is an essential task in natural language processing. Existing multi-label classification models generally treat labels as categorical variables and ignore the exploitation of label semantics. In this paper, we view the task as a correlation-guided text representation problem: an attention-based two-step framework is proposed to integrate text information and label semantics by jointly learning words and labels in the same space. In this way, we aim to capture high-order label-label correlations as well as context-label correlations. Specifically, the proposed approach works by learning token-level representations of words and labels globally through a multi-layer Transformer and constructing an attention vector through a word-label correlation matrix to generate the text representation. This ensures that relevant words receive higher weights than irrelevant words, directly optimizing classification performance. Extensive experiments over benchmark multi-label datasets clearly validate the effectiveness of the proposed approach, and further analysis demonstrates that it is competitive in both predicting low-frequency labels and convergence speed.
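
A compact sketch of the correlation-guided step: token representations (stand-ins for the multi-layer Transformer outputs) and learned label embeddings share one space; their correlation matrix yields an attention vector that up-weights label-relevant words before per-label scoring. Dimensions and the max-over-labels pooling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LabelCorrelationAttention(nn.Module):
    """Sketch of the two-step idea: words and labels embedded in a
    shared space, a word-label correlation matrix driving an attention
    vector, and the attended text representation scored per label."""
    def __init__(self, dim=256, n_labels=54):
        super().__init__()
        self.label_emb = nn.Parameter(torch.randn(n_labels, dim) * 0.02)

    def forward(self, token_states):
        # token_states: (batch, seq_len, dim), e.g. Transformer outputs
        G = token_states @ self.label_emb.T              # (B, T, L) correlations
        attn = G.max(dim=-1).values.softmax(dim=-1)      # weight relevant words
        text_rep = (attn.unsqueeze(-1) * token_states).sum(1)   # (B, dim)
        return torch.sigmoid(text_rep @ self.label_emb.T)       # per-label prob

probs = LabelCorrelationAttention()(torch.randn(4, 32, 256))
print(probs.shape)  # torch.Size([4, 54])
```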


Author(s):  
Shujuan Yu ◽  
Danlei Liu ◽  
Yun Zhang ◽  
Shengmei Zhao ◽  
Weigang Wang

In Natural Language Processing (NLP), extracting useful text information and effective long-range associations has always been a bottleneck for text classification. Through the great effort of deep learning researchers, deep Convolutional Neural Networks (CNNs) have made remarkable achievements in computer vision but remain controversial in NLP tasks. In this paper, we propose a novel deep CNN named Deep Pyramid Temporal Convolutional Network (DPTCN) for short text classification, which mainly consists of a concatenated embedding layer, causal convolutions, 1/2 max pooling down-sampling, and residual blocks. Our work was highly inspired by two well-designed models: the temporal convolutional network for sequence modeling and the deep pyramid CNN for text categorization, whose applicability and pertinence guided how we built a model for this specific domain. In the experiments, we evaluate the proposed model on 7 datasets against 6 models and analyze the impact of three different embedding methods. The results show that our work is a good attempt at applying a word-level deep convolutional network to short text classification.
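
A PyTorch sketch of the DPTCN building blocks named above: causal convolutions (left padding so each position sees only current and past tokens), a residual block, and stride-2 max pooling that halves the sequence length at every stage, giving the pyramid shape. Channel counts and depth are assumptions.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    # Left-pads so each output position only sees current and past tokens.
    def __init__(self, ch, k=3):
        super().__init__()
        self.pad, self.conv = k - 1, nn.Conv1d(ch, ch, k)
    def forward(self, x):
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

class DPTCNBlock(nn.Module):
    # Residual causal-conv block followed by stride-2 max pooling, so
    # each block halves the sequence length (the "pyramid" shape).
    def __init__(self, ch):
        super().__init__()
        self.c1, self.c2 = CausalConv1d(ch), CausalConv1d(ch)
    def forward(self, x):
        y = self.c2(self.c1(x).relu()).relu()
        return nn.functional.max_pool1d(x + y, kernel_size=2, stride=2)

x = torch.randn(2, 64, 32)        # (batch, channels, seq_len=32)
for block in [DPTCNBlock(64) for _ in range(3)]:
    x = block(x)
print(x.shape)  # torch.Size([2, 64, 4]): 32 -> 16 -> 8 -> 4
```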

