DPTCN: A novel deep CNN model for short text classification

As an important branch of Natural Language Processing (NLP), text classification has long been limited by the difficulty of extracting useful text information and effective long-range associations. Thanks to the efforts of deep learning researchers, deep Convolutional Neural Networks (CNNs) have made remarkable achievements in Computer Vision, but their value in NLP tasks remains controversial. In this paper, we propose a novel deep CNN named Deep Pyramid Temporal Convolutional Network (DPTCN) for short text classification, which mainly consists of a concatenated embedding layer, causal convolution, 1/2 max-pooling down-sampling, and residual blocks. Our work was strongly inspired by two well-designed models: the temporal convolutional network for sequence modeling and the deep pyramid CNN for text categorization. Their applicability and pertinence show how to build a model for a specific domain. In the experiments, we evaluate the proposed model on 7 datasets with 6 models and analyze the impact of three different embedding methods. The results show that our work is a good attempt to apply a word-level deep convolutional network to short text classification.
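The building blocks named in this abstract can be illustrated with a minimal sketch. It is not the DPTCN implementation: the function names, the pooling window of 3 with stride 2, and the single-channel list representation are all assumptions made for illustration.

```python
from typing import List


def causal_conv1d(x: List[float], kernel: List[float]) -> List[float]:
    """Causal 1-D convolution: output[t] depends only on x[t] and earlier steps.

    The input is left-padded with zeros, so the output keeps the input
    length and never reads future positions.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    # kernel[-1] multiplies the current step, kernel[0] the oldest step.
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(x))
    ]


def max_pool_half(x: List[float], size: int = 3, stride: int = 2) -> List[float]:
    """Max pooling with stride 2, which roughly halves the sequence length.

    The window size of 3 is an assumption, not taken from the abstract.
    """
    return [max(x[i:i + size]) for i in range(0, len(x), stride)]


def residual_block(x: List[float], kernel: List[float]) -> List[float]:
    """Add the block input back to the convolved output (shortcut connection)."""
    y = causal_conv1d(x, kernel)
    return [a + b for a, b in zip(x, y)]
```

Stacking `residual_block` and `max_pool_half` repeatedly shortens the sequence at each stage, which is the "pyramid" idea: deeper layers see a wider span of the original text at a lower cost.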

Aspect-level sentiment analysis merged with knowledge graph and graph convolutional neural network

Journal of Physics Conference Series ◽

10.1088/1742-6596/2083/4/042044 ◽

2021 ◽

Vol 2083 (4) ◽

pp. 042044

Author(s):

Zuhua Dai ◽

Yuanyuan Liu ◽

Shilong Di ◽

Qi Fan

Keyword(s):

Neural Network ◽

Sentiment Analysis ◽

Structural Information ◽

Knowledge Graph ◽

Convolutional Network ◽

Text Data ◽

Short Text ◽

Fine Grained ◽

Syntactic Information ◽

Text Information

Aspect-level sentiment analysis is a form of fine-grained sentiment analysis that has attracted extensive research in academic circles in recent years. For this task, the recurrent neural network (RNN) model is usually used for feature extraction, but the model cannot effectively obtain the structural information of the text. Recent studies have begun to use the graph convolutional network (GCN) to model the syntactic dependency tree of the text to solve this problem. For short text data, the text information is not enough to accurately determine the emotional polarity of the aspect words, and the knowledge graph is not effectively used as external knowledge that can enrich the semantic information. To solve these problems, this paper proposes a graph convolutional neural network (GCN) model that can process syntactic information, knowledge graphs, and text semantic information. The model works on the “syntax-knowledge” graph to extract syntactic information and common-sense information at the same time. Compared with the latest model, the model in this paper effectively improves the accuracy of aspect-level sentiment classification on two datasets.
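As a rough illustration of how a graph convolution passes information along the edges of such a graph, the sketch below implements one layer with the commonly used symmetric normalization and self-loops. The abstract does not specify the paper's layer, so this propagation rule, the function names, and the dense list-of-lists matrices are assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def normalize_adjacency(adj: Matrix) -> Matrix:
    """Return D^{-1/2} (A + I) D^{-1/2}, where D is the degree matrix of A + I."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj: Matrix, features: Matrix, weights: Matrix) -> Matrix:
    """One graph convolution: ReLU(normalized_adjacency @ features @ weights)."""
    propagated = matmul(matmul(normalize_adjacency(adj), features), weights)
    return [[max(0.0, v) for v in row] for row in propagated]
```

In a dependency-tree setting, `adj` would hold an edge for each syntactic relation between two words, so each word's new representation mixes in those of its syntactic neighbours.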

Joint Representations of Texts and Labels with Compositional Loss for Short Text Classification

Journal of Web Engineering ◽

10.13052/jwe1540-9589.2035 ◽

2021 ◽

Author(s):

Ming Hao ◽

Weijing Wang ◽

Fang Zhou

Keyword(s):

Language Processing ◽

Text Classification ◽

Ground Truth ◽

Language Models ◽

Text Representation ◽

Short Text ◽

Practical Applications ◽

Classical Models ◽

Multi Class Classification ◽

Public Datasets

Short text classification is an important foundation for natural language processing (NLP) tasks. Although text classification based on deep language models (DLMs) has made significant headway, in practical applications some texts are ambiguous and hard to classify, especially in multi-class classification of short texts whose context length is limited. Mainstream methods improve the distinction of ambiguous text by adding context information. However, these methods rely only on the text representation and ignore that the categories overlap and are not completely independent of each other. In this paper, we establish a new general method to solve the problem of ambiguous text classification by introducing label embeddings to represent each category, which makes the difference between categories measurable. Further, a new compositional loss function is proposed to train the model, which pulls the text representation closer to the ground-truth label and pushes it farther away from the others. Finally, a constraint is obtained by calculating the similarity between the text representation and the label embeddings. Errors caused by ambiguous text can be corrected by adding this constraint to the output layer of the model. We apply the method to three classical models and conduct experiments on six public datasets. Experiments show that our method can effectively improve the classification accuracy of ambiguous texts. In addition, combining our method with BERT, we obtain state-of-the-art results on the CNT dataset.
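The similarity-based constraint can be pictured with a small sketch. The abstract does not give the similarity measure or how the constraint enters the output layer, so the cosine similarity, the additive combination, the weight `alpha`, and all function names below are assumptions for illustration only.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def label_similarities(text_vec: List[float],
                       label_embeddings: Dict[str, List[float]]) -> Dict[str, float]:
    """Similarity between one text representation and every label embedding."""
    return {label: cosine(text_vec, emb) for label, emb in label_embeddings.items()}


def predict_with_constraint(logits: Dict[str, float],
                            sims: Dict[str, float],
                            alpha: float = 1.0) -> str:
    """Hypothetical correction step: add the weighted similarity to each logit."""
    scores = {label: logits[label] + alpha * sims[label] for label in logits}
    return max(scores, key=scores.get)
```

With `alpha = 0` the prediction is the plain classifier output; increasing `alpha` lets the label embeddings overrule a narrow margin between two overlapping categories.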

Low-Rank Deep Convolutional Neural Network for Multitask Learning

Computational Intelligence and Neuroscience ◽

10.1155/2019/7410701 ◽

2019 ◽

Vol 2019 ◽

pp. 1-10

Author(s):

Fang Su ◽

Hai-Yang Shang ◽

Jing-Yan Wang

Keyword(s):

Language Processing ◽

Back Propagation ◽

Multitask Learning ◽

Low Rank ◽

Learning Problem ◽

Convolutional Network ◽

Deep Network ◽

Benchmark Datasets ◽

Deep Cnn ◽

Fully Connected

In this paper, we propose a novel multitask learning method based on the deep convolutional network. The proposed deep network has four convolutional layers, three max-pooling layers, and two parallel fully connected layers. To adapt the deep network to the multitask learning problem, we propose to learn a low-rank deep network so that the relations among different tasks can be explored. We propose to minimize the number of independent parameter rows of one fully connected layer, measured by the nuclear norm of that layer's parameter matrix, and thereby seek a low-rank parameter matrix. Meanwhile, we also propose to regularize the other fully connected layer with a sparsity penalty so that the useful features learned by the lower layers can be selected. The learning problem is solved by an iterative algorithm based on gradient descent and back-propagation algorithms. The proposed algorithm is evaluated over benchmark datasets of multiple face attribute prediction, multitask natural language processing, and joint economics index predictions. The evaluation results show the advantage of the low-rank deep CNN model over multitask problems.
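The regularized learning problem described here can be written compactly. This is only a sketch in assumed notation: $\Theta$ stands for the convolutional parameters, $W_1$ and $W_2$ for the two fully connected layers, $\mathcal{L}$ for the multitask loss, and $\lambda, \mu$ for trade-off weights. The abstract does not name the sparsity penalty, so the $\ell_1$ norm is an assumption.

```latex
\min_{\Theta,\, W_1,\, W_2} \;
  \mathcal{L}(\Theta, W_1, W_2)
  + \lambda \, \lVert W_1 \rVert_{*}
  + \mu \, \lVert W_2 \rVert_{1},
\qquad
\lVert W_1 \rVert_{*} = \sum_{i} \sigma_i(W_1)
```

Here $\sigma_i(W_1)$ are the singular values of $W_1$. The nuclear norm is the standard convex surrogate for matrix rank, which is why minimizing it encourages a low-rank parameter matrix.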

Temporal and Contextual Evaluation of Background Knowledge Discovery for Short Text Classification

International Journal of Organizational and Collective Intelligence ◽

10.4018/ijoci.2012070103 ◽

2012 ◽

Vol 3 (3) ◽

pp. 36-55

Author(s):

Isak Taksa ◽

Sarah Zelikovitz ◽

Amanda Spink

Keyword(s):

Text Classification ◽

Query Expansion ◽

Background Knowledge ◽

Short Text ◽

Tuning Parameters ◽

The Past ◽

Dynamic Web ◽

The Impact ◽

Apparent Age ◽

Insight Into

Background Knowledge (BK) plays an essential role in machine learning for short-text and non-topical classification. In this paper the authors present and evaluate two Information Retrieval techniques used to assemble four sets of BK over the past seven years. These sets were applied to classify a commercial corpus of search queries by the apparent age of the user. Temporal and contextual evaluations were used to examine the results of various classification scenarios, providing insight into the choice, significance, and range of tuning parameters. The evaluations also demonstrated the impact of the dynamic Web collection on classification results, and the advantages of Automatic Query Expansion (AQE) over basic search. The authors discuss other results of this research and its implications for the advancement of short text classification.

An empirical comparison of distance/similarity measures for Natural Language Processing

10.5753/eniac.2019.9328 ◽

2019 ◽

Author(s):

Dimmy Magalhães ◽

Aurora Pozo ◽

Roberto Santana

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Text Classification ◽

Euclidean Distance ◽

Similarity Measures ◽

Convolutional Networks ◽

Statistical Similarity ◽

The Impact ◽

Better Than

Text classification is one of the tasks of Natural Language Processing (NLP). In this area, Graph Convolutional Networks (GCNs) have achieved higher scores than CNNs and other related models. For a GCN, the metric that defines the correlation between words in a vector space plays a crucial role in classification because it determines the weight of the edge between two words (represented by nodes in the graph). In this study, we empirically investigated the impact of thirteen distance/similarity measures. A representation was built for each document using word embeddings from the word2vec model. A graph-based representation of five datasets was also created for each measure analyzed, where each word is a node in the graph and each edge is weighted by the distance/similarity between words. Finally, each model was run in a simple graph neural network. The results show that, concerning text classification, there is no statistical difference between the analyzed metrics and the Graph Convolution Network. Even with the incorporation of external words or external knowledge, the results were similar to those of the methods without the incorporation of words. However, the results indicate that some distance metrics behave better than others in relation to context capture, with Euclidean distance reaching the best values or being statistically similar to the best.
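To make the edge-weighting step concrete, the sketch below builds a word graph whose edges are weighted by a pluggable measure. Only two simple measures are shown; the toy vectors and all function names are assumptions, and the study's own thirteen measures and graph construction may differ.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def euclidean(u: Vector, v: Vector) -> float:
    """Euclidean (L2) distance between two word vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u: Vector, v: Vector) -> float:
    """Manhattan (L1) distance between two word vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def build_weighted_edges(
    word_vectors: Dict[str, Vector],
    measure: Callable[[Vector, Vector], float],
) -> Dict[Tuple[str, str], float]:
    """Weight every unordered word pair with the chosen distance/similarity."""
    words = sorted(word_vectors)
    return {
        (w1, w2): measure(word_vectors[w1], word_vectors[w2])
        for i, w1 in enumerate(words)
        for w2 in words[i + 1:]
    }
```

Swapping the `measure` argument is all that is needed to rebuild the graph under a different metric, which mirrors the one-graph-per-measure comparison described above.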

SALTClass: classifying clinical short notes using background knowledge from unlabeled data

10.1101/801944 ◽

2019 ◽

Author(s):

Ayoub Bagheri ◽

Daniel Oberski ◽

Arjan Sammani ◽

Peter G.M. van der Heijden ◽

Folkert W. Asselbergs

Keyword(s):

Machine Learning ◽

Language Processing ◽

Text Classification ◽

Latent Dirichlet Allocation ◽

Machine Learning Algorithms ◽

Unlabeled Data ◽

Specific Information ◽

Short Text ◽

Link Type ◽

Python Package

Background: With the increasing use of unstructured text in electronic health records, extracting useful related information has become a necessity. Text classification can be applied to extract patients’ medical history from clinical notes. However, the sparsity in clinical short notes, that is, excessively small word counts in the text, can lead to large classification errors. Previous studies demonstrated that natural language processing (NLP) can be useful in the text classification of clinical outcomes. We propose incorporating the knowledge from unlabeled data, as this may alleviate the problem of short noisy sparse text. Results: The software package SALTClass (short and long text classifier) is a machine learning NLP toolkit. It uses seven clustering algorithms, namely, latent Dirichlet allocation, K-Means, MiniBatchK-Means, BIRCH, MeanShift, DBScan, and GMM. Smoothing methods are applied to the resulting cluster information to enrich the representation of sparse text. For the subsequent prediction step, SALTClass can be used on either the original document-term matrix or in an enrichment pipeline. To this end, ten different supervised classifiers have also been integrated into SALTClass. We demonstrate the effectiveness of the SALTClass NLP toolkit in the identification of patients’ family history in a Dutch clinical cardiovascular text corpus from University Medical Center Utrecht, the Netherlands. Conclusions: The considerable amount of unstructured short text in healthcare applications, particularly in clinical cardiovascular notes, has created an urgent need for tools that can parse specific information from text reports. Using machine learning algorithms for enriching short text can improve the representation for further applications. Availability: SALTClass can be downloaded as a Python package from the Python Package Index (PyPI) website at https://pypi.org/project/saltclass and from GitHub at https://github.com/bagheria/saltclass.

PAAD: POLITICAL ARABIC ARTICLES DATASET FOR AUTOMATIC TEXT CATEGORIZATION

Iraqi Journal for Computers and Informatics ◽

10.25195/ijci.v46i1.246 ◽

2020 ◽

Vol 46 (1) ◽

pp. 1-10

Author(s):

Dhafar Hamed Abd ◽

Ahmed T. Sadiq ◽

Ayad R. Abbas

Keyword(s):

Computational Linguistics ◽

Language Processing ◽

Text Classification ◽

Text Categorization ◽

Political Orientation ◽

Huge Amount ◽

Textual Data ◽

Automatic Text ◽

Excel File ◽

Modern Standard

Nowadays, text classification and sentiment analysis are considered among the most popular Natural Language Processing (NLP) tasks. These techniques play a significant role in human activities and have an impact on daily behaviours. Articles in different fields, such as politics and business, represent different opinions according to the writer's tendency. A huge amount of data can be acquired through that differentiation, which creates a need to determine the political orientation of an online article automatically. However, no corpus for political categorization has been directed towards this task in Arabic, due to the lack of rich representative resources for training an Arabic text classifier. We therefore introduce the political Arabic articles dataset (PAAD), consisting of textual data collected from newspapers, social networks, general forums, and ideology websites. The dataset contains 206 articles distributed into three categories (Reform, Conservative, and Revolutionary), which we offer to the research community on Arabic computational linguistics. We anticipate that this dataset will be a great aid for a variety of NLP tasks on Modern Standard Arabic, including political text classification. We present the data in raw form and as an Excel file. The Excel file comes in four versions: V1 raw data, V2 preprocessing, V3 root stemming, and V4 light stemming.

Bilingual Text Classification in English and Indonesian via Transfer Learning using XLM-RoBERTa

International Journal of Advances in Soft Computing and its Applications ◽

10.15849/ijasca.211128.06 ◽

2021 ◽

Vol 13 (3) ◽

pp. 73-87

Author(s):

Yakobus Wiciaputra ◽

Julio Young ◽

Andre Rusli

Keyword(s):

Correlation Coefficient ◽

Transfer Learning ◽

Language Processing ◽

Text Classification ◽

Classification Model ◽

Training Dataset ◽

The Internet ◽

Matthews Correlation Coefficient ◽

Text Information ◽

Multilingual Text

With the large amount of text information circulating on the internet, there is a need for a solution that can help process data in the form of text for various purposes. In Indonesia, text information circulating on the internet generally uses two languages, English and Indonesian. This research focuses on building a model that is able to classify text in more than one language, commonly known as multilingual text classification. The multilingual text classification uses the XLM-RoBERTa model in its implementation. This study applied the transfer learning concept used by XLM-RoBERTa to build a classification model for texts in Indonesian using only the English News Dataset as a training dataset, reaching a Matthews Correlation Coefficient of 42.2%. The highest scores were obtained when the large Mixed News Dataset (108,190) was used in the model training process: on a large English News Dataset (37,886), a Matthews Correlation Coefficient of 90.8%, accuracy of 93.3%, precision of 93.4%, recall of 93.3%, and F1 of 93.3%; and on a large Indonesian News Dataset (70,304), a Matthews Correlation Coefficient of 86.4% and accuracy, precision, recall, and F1 values of 90.2%. Keywords: Multilingual Text Classification, Natural Language Processing, News Dataset, Transfer Learning, XLM-RoBERTa
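For readers unfamiliar with the metric, the sketch below computes the Matthews Correlation Coefficient in its binary form from confusion-matrix counts. The study's news classification task is presumably multi-class, which uses a generalized formula, so the binary form and the function names here are simplifying assumptions.

```python
import math
from typing import List, Tuple


def confusion_counts(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn) for binary labels encoded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Binary MCC: (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn)).

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

The coefficient ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), and unlike accuracy it stays informative when classes are imbalanced.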

An Ensemble Learning Strategy for Eligibility Criteria Text Classification for Clinical Trial Recruitment: Algorithm Development and Validation

JMIR Medical Informatics ◽

10.2196/17832 ◽

2020 ◽

Vol 8 (7) ◽

pp. e17832

Author(s):

Kun Zeng ◽

Zhiwei Pan ◽

Yibin Xu ◽

Yingying Qu

Keyword(s):

Clinical Trial ◽

Clinical Trials ◽

Natural Language Processing ◽

Ensemble Learning ◽

Language Processing ◽

Text Classification ◽

State Of The Art ◽

Shared Task ◽

Eligibility Criteria ◽

Short Text

Background: Eligibility criteria are the main strategy for screening appropriate participants for clinical trials. Automatic analysis of clinical trial eligibility criteria by digital screening, leveraging natural language processing techniques, can improve recruitment efficiency and reduce the costs involved in promoting clinical research. Objective: We aimed to create a natural language processing model to automatically classify clinical trial eligibility criteria. Methods: We proposed a classifier for short text eligibility criteria based on ensemble learning, where a set of pretrained models was integrated. The pretrained models included state-of-the-art deep learning methods for training and classification, including Bidirectional Encoder Representations from Transformers (BERT), XLNet, and A Robustly Optimized BERT Pretraining Approach (RoBERTa). The classification results by the integrated models were combined as new features for training a Light Gradient Boosting Machine (LightGBM) model for eligibility criteria classification. Results: Our proposed method obtained an accuracy of 0.846, a precision of 0.803, and a recall of 0.817 on a standard data set from a shared task of an international conference. The macro F1 value was 0.807, outperforming the state-of-the-art baseline methods on the shared task. Conclusions: We designed a model for screening short text classification criteria for clinical trials based on multimodel ensemble learning. Through experiments, we concluded that performance was improved significantly with a model ensemble compared to a single model. The introduction of focal loss could reduce the impact of class imbalance to achieve better performance.
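The effect of focal loss on class imbalance can be seen in a short sketch of its standard per-example form, -alpha * (1 - p)^gamma * log(p), where p is the predicted probability of the true class. The abstract does not report the settings used, so the defaults `gamma = 2.0` and `alpha = 1.0` and the function name are assumptions.

```python
import math


def focal_loss(prob_true_class: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss for one example, given the probability of its true class.

    The factor (1 - p)^gamma shrinks the loss of well-classified examples,
    so training focuses on hard (often minority-class) examples.
    With gamma = 0 and alpha = 1 this reduces to ordinary cross-entropy.
    """
    p = min(max(prob_true_class, 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)
```

Comparing `focal_loss(0.9)` with `focal_loss(0.5)` shows the down-weighting: the confident prediction contributes far less than it would under plain cross-entropy.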

Improving Arabic Text Classification Using P-Stemmer

Recent Advances in Computer Science and Communications ◽

10.2174/2666255813999200904114023 ◽

2020 ◽

Vol 13 ◽

Author(s):

Tarek Kanan ◽

Bilal Hawashin ◽

Shadi Alzubi ◽

Eyad Almaita ◽

Ahmad Alkhatib ◽

...

Keyword(s):

Language Processing ◽

Text Classification ◽

Text Categorization ◽

English Language ◽

Arabic Language ◽

Online News ◽

Support Vector ◽

Arabic Text ◽

Fast Learning ◽

Arabic Text Classification

Introduction: Stemming is an important preprocessing step in text classification and can contribute to increasing text classification accuracy. Although many works have proposed stemmers for the English language, few stemmers have been proposed for Arabic text. The Arabic language has gained increasing attention in recent decades, and there is a vital need to further improve Arabic text classification. Method: This work combined the recently proposed P-Stemmer with various classifiers to find the optimal classifier for the P-Stemmer in terms of Arabic text classification. As part of this work, a synthesized dataset was collected. Result: The experiments show that the use of the P-Stemmer has a positive effect on classification. The degree of improvement was classifier-dependent, which is reasonable as classifiers vary in their methodologies. Moreover, the experiments show that the best classifier with the P-Stemmer was NB. This is an interesting result, as this classifier is well known for its fast learning and classification time. Discussion: First, continuous improvement of the P-Stemmer through more optimization steps is necessary to further improve Arabic text categorization. This can be done by combining more classifiers with the stemmer, by optimizing the other natural language processing steps, and by improving the set of stemming rules. Second, the lack of sufficient Arabic datasets, especially large ones, is still an issue. Conclusion: In this work, an improved P-Stemmer was proposed by combining its use with various classifiers. To evaluate its performance, and due to the lack of Arabic datasets, a novel Arabic dataset was synthesized from various online news pages. Next, the P-Stemmer was combined with Naïve Bayes, Random Forest, Support Vector Machines, K-Nearest Neighbor, and K-Star.
