Text classification with semantically enriched word embeddings

Abstract The recent breakthroughs in deep neural architectures across multiple machine learning fields have led to the widespread use of deep neural models. These learners are often applied as black-box models that ignore or insufficiently utilize a wealth of preexisting semantic information. In this study, we focus on the text classification task, investigating methods for augmenting the input to deep neural networks (DNNs) with semantic information. We extract semantics for the words in the preprocessed text from the WordNet semantic graph, in the form of weighted concept terms that form a semantic frequency vector. Concepts are selected via a variety of semantic disambiguation techniques, including a basic, a part-of-speech-based, and a semantic embedding projection method. Additionally, we consider a weight propagation mechanism that exploits semantic relationships in the concept graph and conveys a spreading activation component. We enrich word2vec embeddings with the resulting semantic vector through concatenation or replacement and apply the semantically augmented word embeddings on the classification task via a DNN. Experimental results over established datasets demonstrate that our approach of semantic augmentation in the input space boosts classification performance significantly, with concatenation offering the best performance. We also note additional interesting findings produced by our approach regarding the behavior of term frequency - inverse document frequency normalization on semantic vectors, along with the radical dimensionality reduction potential with negligible performance loss.

Download Full-text

Pronunciation-Enhanced Chinese Word Embedding

Cognitive Computation ◽

10.1007/s12559-021-09850-9 ◽

2021 ◽

Author(s):

Qinjuan Yang ◽

Haoran Xie ◽

Gary Cheng ◽

Fu Lee Wang ◽

Yanghui Rao

Keyword(s):

Sentiment Analysis ◽

Text Classification ◽

Semantic Information ◽

Word Embedding ◽

Chinese Characters ◽

Learning Method ◽

Word Embeddings ◽

Chinese Word ◽

Word Similarity ◽

Meaning Structure

AbstractChinese word embeddings have recently garnered considerable attention. Chinese characters and their sub-character components, which contain rich semantic information, are incorporated to learn Chinese word embeddings. Chinese characters can represent a combination of meaning, structure, and pronunciation. However, existing embedding learning methods focus on the structure and meaning of Chinese characters. In this study, we aim to develop an embedding learning method that can make complete use of the information represented by Chinese characters, including phonology, morphology, and semantics. Specifically, we propose a pronunciation-enhanced Chinese word embedding learning method, where the pronunciations of context characters and target characters are simultaneously encoded into the embeddings. Evaluation of word similarity, word analogy reasoning, text classification, and sentiment analysis validate the effectiveness of our proposed method.

Download Full-text

Near-Lossless Binarization of Word Embeddings

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33017104 ◽

2019 ◽

Vol 33 ◽

pp. 7104-7111 ◽

Cited By ~ 3

Author(s):

Julien Tissier ◽

Christophe Gravier ◽

Amaury Habrard

Keyword(s):

Sentiment Analysis ◽

Semantic Similarity ◽

Text Classification ◽

Semantic Information ◽

State Of The Art ◽

Floating Point ◽

Word Embeddings ◽

Binary Vectors ◽

Starting Point ◽

Memory Footprint

Word embeddings are commonly used as a starting point in many NLP models to achieve state-of-the-art performances. However, with a large vocabulary and many dimensions, these floating-point representations are expensive both in terms of memory and calculations which makes them unsuitable for use on low-resource devices. The method proposed in this paper transforms real-valued embeddings into binary embeddings while preserving semantic information, requiring only 128 or 256 bits for each vector. This leads to a small memory footprint and fast vector operations. The model is based on an autoencoder architecture, which also allows to reconstruct original vectors from the binary ones. Experimental results on semantic similarity, text classification and sentiment analysis tasks show that the binarization of word embeddings only leads to a loss of ∼2% in accuracy while vector size is reduced by 97%. Furthermore, a top-k benchmark demonstrates that using these binary vectors is 30 times faster than using real-valued vectors.

Download Full-text

Extract Semantic Information from WordNet to Improve Text Classification Performance

Advances in Computer Science and Information Technology - Lecture Notes in Computer Science ◽

10.1007/978-3-642-13577-4_36 ◽

2010 ◽

pp. 409-420 ◽

Cited By ~ 5

Author(s):

Rujiang Bai ◽

Xiaoyue Wang ◽

Junhua Liao

Keyword(s):

Text Classification ◽

Semantic Information ◽

Classification Performance

Download Full-text

Text Classification Based on Title Semantic Information

2020 5th International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS) ◽

10.1109/iciibms50712.2020.9336401 ◽

2020 ◽

Author(s):

YunXiang Liu ◽

Qi Xu ◽

ChunYa Wang

Keyword(s):

Text Classification ◽

Semantic Information

Download Full-text

A Comparative Study on Word Embeddings in Deep Learning for Text Classification

Proceedings of the 4th International Conference on Natural Language Processing and Information Retrieval ◽

10.1145/3443279.3443304 ◽

2020 ◽

Author(s):

Congcong Wang ◽

Paul Nulty ◽

David Lillis

Keyword(s):

Deep Learning ◽

Comparative Study ◽

Text Classification ◽

Word Embeddings

Download Full-text

Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese

Journal of the Brazilian Computer Society ◽

10.1186/s13173-014-0020-x ◽

2015 ◽

Vol 21 (1) ◽

Cited By ~ 16

Author(s):

Erick R Fonseca ◽

João Luís G Rosa ◽

Sandra Maria Aluísio

Keyword(s):

Word Embeddings ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Speech Tagging

Download Full-text

Text Classification of Public Feedbacks using Convolutional Neural Network Based on Differential Evolution Algorithm

International Journal of Computers Communications & Control ◽

10.15837/ijccc.2019.1.3420 ◽

2019 ◽

Vol 14 (1) ◽

pp. 124-134 ◽

Cited By ~ 2

Author(s):

Shuai Zhang ◽

Yong Chen ◽

Xiaoling Huang ◽

Yishuai Cai

Keyword(s):

Neural Network ◽

Differential Evolution ◽

Convolutional Neural Network ◽

Text Classification ◽

Differential Evolution Algorithm ◽

Classification Performance ◽

Classification Model ◽

Evolution Algorithm ◽

Classification Prediction

Online feedback is an effective way of communication between government departments and citizens. However, the daily high number of public feedbacks has increased the burden on government administrators. The deep learning method is good at automatically analyzing and extracting deep features of data, and then improving the accuracy of classification prediction. In this study, we aim to use the text classification model to achieve the automatic classification of public feedbacks to reduce the work pressure of administrator. In particular, a convolutional neural network model combined with word embedding and optimized by differential evolution algorithm is adopted. At the same time, we compared it with seven common text classification models, and the results show that the model we explored has good classification performance under different evaluation metrics, including accuracy, precision, recall, and F1-score.

Download Full-text

ConAnomaly: Content-Based Anomaly Detection for System Logs

Sensors ◽

10.3390/s21186125 ◽

2021 ◽

Vol 21 (18) ◽

pp. 6125

Author(s):

Dan Lv ◽

Nurbol Luktarhan ◽

Yiyong Chen

Keyword(s):

Anomaly Detection ◽

Semantic Information ◽

Short Term Memory ◽

Weighted Average ◽

Detection Methods ◽

Detection Model ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

System Logs ◽

System Maintenance

Enterprise systems typically produce a large number of logs to record runtime states and important events. Log anomaly detection is efficient for business management and system maintenance. Most existing log-based anomaly detection methods use log parser to get log event indexes or event templates and then utilize machine learning methods to detect anomalies. However, these methods cannot handle unknown log types and do not take advantage of the log semantic information. In this article, we propose ConAnomaly, a log-based anomaly detection model composed of a log sequence encoder (log2vec) and multi-layer Long Short Term Memory Network (LSTM). We designed log2vec based on the Word2vec model, which first vectorized the words in the log content, then deleted the invalid words through part of speech tagging, and finally obtained the sequence vector by the weighted average method. In this way, ConAnomaly not only captures semantic information in the log but also leverages log sequential relationships. We evaluate our proposed approach on two log datasets. Our experimental results show that ConAnomaly has good stability and can deal with unseen log types to a certain extent, and it provides better performance than most log-based anomaly detection methods.

Download Full-text

Specializing Word Embeddings (for Parsing) by Information Bottleneck (Extended Abstract)

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/658 ◽

2020 ◽

Author(s):

Xiang Lisa Li ◽

Jason Eisner

Keyword(s):

Dimensionality Reduction ◽

Semantic Information ◽

State Of The Art ◽

Word Embedding ◽

Discrete Version ◽

Word Embeddings ◽

Continuous Version ◽

Continuous Vector ◽

Information Bottleneck ◽

Art Performance

Pre-trained word embeddings like ELMo and BERT contain rich syntactic and semantic information, resulting in state-of-the-art performance on various tasks. We propose a very fast variational information bottleneck (VIB) method to nonlinearly compress these embeddings, keeping only the information that helps a discriminative parser. We compress each word embedding to either a discrete tag or a continuous vector. In the discrete version, our automatically compressed tags form an alternative tag set: we show experimentally that our tags capture most of the information in traditional POS tag annotations, but our tag sequences can be parsed more accurately at the same level of tag granularity. In the continuous version, we show experimentally that moderately compressing the word embeddings by our method yields a more accurate parser in 8 of 9 languages, unlike simple dimensionality reduction.

Download Full-text

Pre-Trained Transformer-Based Language Models for Sundanese

10.21203/rs.3.rs-907893/v1 ◽

2021 ◽

Author(s):

Wilson Wongso ◽

Henry Lucky ◽

Derwin Suhartono

Keyword(s):

Natural Language ◽

Text Classification ◽

Training Data ◽

Language Models ◽

Classification Task ◽

Language Understanding ◽

Training Corpus ◽

Low Resource ◽

Corpus Size ◽

Fine Tune

Abstract The Sundanese language has over 32 million speakers worldwide, but the language has reaped little to no benefits from the recent advances in natural language understanding. Like other low-resource languages, the only alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, we found that most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In the subsequent analyses, our models benefited strongly from the Sundanese pre-training corpus size and do not exhibit socially biased behavior. We released our models for other researchers and practitioners to use.

Download Full-text