Text Classification Based on Title Semantic Information

Recently, the Graph Convolutional Neural Network (GCN) has been widely used in text classification and has performed well on tasks considered to have a rich relational structure. However, because the adjacency matrix constructed by GCN is sparse, GCN cannot make full use of context-dependent information in text classification and cannot capture local information. Bidirectional Encoder Representations from Transformers (BERT) has been shown to capture the contextual information in a sentence or document, but its ability to capture global information about the vocabulary of a language is relatively limited; capturing that global information is the strength of GCN. Therefore, this paper proposes Mutual Graph Convolution Networks (MGCN) to address these problems. MGCN introduces a semantic dictionary (WordNet), dependency relations, and BERT: it uses dependency relations to address context dependence and WordNet to obtain more semantic information. The local information generated by BERT and the global information generated by GCN then interact through an attention mechanism, so that they influence each other and improve the model's classification performance. Experimental results on three text classification datasets show that our model is more effective than previously reported approaches.

On Document Representation and Term Weights in Text Classification

Handbook of Research on Text and Web Mining Technologies ◽ 10.4018/978-1-59904-990-8.ch001 ◽ 2010 ◽ pp. 1-22 ◽ Cited By ~ 1

Author(s): Ying Liu

Keyword(s): Text Classification ◽ Semantic Information ◽ Weighting Scheme ◽ Bag Of Words ◽ Document Representation ◽ Term Weighting ◽ Word Sequence ◽ Sentence Level ◽ Sequence Method ◽ Classic Approach

In automated text classification, a bag-of-words representation followed by tf-idf weighting is the most popular approach to converting textual documents into numeric vectors for the induction of classifiers. In this chapter, we explore the potential of enriching the document representation with semantic information systematically discovered at the sentence level of each document. The salient semantic information is found using a frequent word sequence method. In contrast to the classic tf-idf weighting scheme, we propose a probability-based term weighting scheme that directly reflects a term's strength in representing a specific category. An experimental study based on the semantically enriched document representation and the newly proposed probability-based term weighting scheme shows a significant improvement in F-score over the classic approach, i.e., bag-of-words plus tf-idf. This study encourages us to further investigate applying the semantically enriched document representation to a wide range of text mining tasks.
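
The classic baseline named in this abstract, bag-of-words followed by tf-idf, can be sketched in a few lines. This is only an illustration of one common tf-idf variant (term frequency normalized by document length, idf = log(N/df)); the chapter's probability-based weighting scheme is not specified in the abstract and is not reproduced here. The function name `tfidf_vectors` and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of non-empty tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Weight = (relative term frequency) * log(N / document frequency).
        vec = [(counts[t] / len(doc)) * math.log(n_docs / df[t]) for t in vocab]
        vectors.append(vec)
    return vocab, vectors

# Toy corpus (hypothetical), already tokenized on whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stocks fell on monday".split(),
]
vocab, vectors = tfidf_vectors(docs)
```

Each document becomes one vector with a position per vocabulary term; terms that occur in every document receive weight zero under this variant, while terms concentrated in few documents are weighted more heavily.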

CrowdTC: Crowd-powered Learning for Text Classification

ACM Transactions on Knowledge Discovery from Data ◽ 10.1145/3457216 ◽ 2021 ◽ Vol 16 (1) ◽ pp. 1-23

Author(s): Keyu Yang ◽ Yunjun Gao ◽ Lei Liang ◽ Song Bian ◽ Lu Chen ◽ ...

Keyword(s): Neural Network ◽ Neural Networks ◽ Text Classification ◽ Deep Neural Networks ◽ Semantic Information ◽ Human Beings ◽ Hybrid Neural Network ◽ Public Datasets ◽ Almost All ◽ The Cost

Text classification is a fundamental task in content analysis. Deep learning has demonstrated promising performance in text classification compared with shallow models. However, almost none of the existing models take advantage of human knowledge to help text classification. Human beings are better than machine learning models at understanding and capturing the implicit semantic information in text. In this article, we use guidance from human beings to classify text. We propose Crowd-powered learning for Text Classification (CrowdTC for short). We design and post questions on a crowdsourcing platform to extract keywords from text, and use sampling and clustering techniques to reduce the cost of crowdsourcing. We also present an attention-based neural network and a hybrid neural network that incorporate the extracted keywords as human guidance into deep neural networks. Extensive experiments on public datasets confirm that CrowdTC improves the text classification accuracy of neural networks by using the crowd-powered keyword guidance.

Text classification with semantically enriched word embeddings

Natural Language Engineering ◽ 10.1017/s1351324920000170 ◽ 2020 ◽ pp. 1-35

Author(s): N. Pittaras ◽ G. Giannakopoulos ◽ G. Papadakis ◽ V. Karkaletsis

Keyword(s): Text Classification ◽ Semantic Information ◽ Classification Performance ◽ Classification Task ◽ Propagation Mechanism ◽ Word Embeddings ◽ Performance Loss ◽ Part Of Speech ◽ Box Models ◽ Document Frequency

The recent breakthroughs in deep neural architectures across multiple machine learning fields have led to the widespread use of deep neural models. These learners are often applied as black-box models that ignore or insufficiently utilize a wealth of preexisting semantic information. In this study, we focus on the text classification task, investigating methods for augmenting the input to deep neural networks (DNNs) with semantic information. We extract semantics for the words in the preprocessed text from the WordNet semantic graph, in the form of weighted concept terms that form a semantic frequency vector. Concepts are selected via a variety of semantic disambiguation techniques, including a basic method, a part-of-speech-based method, and a semantic embedding projection method. Additionally, we consider a weight propagation mechanism that exploits semantic relationships in the concept graph and conveys a spreading activation component. We enrich word2vec embeddings with the resulting semantic vector through concatenation or replacement and apply the semantically augmented word embeddings to the classification task via a DNN. Experimental results over established datasets demonstrate that our approach of semantic augmentation in the input space boosts classification performance significantly, with concatenation offering the best performance. We also note additional findings produced by our approach regarding the behavior of term frequency-inverse document frequency normalization on semantic vectors, along with the potential for radical dimensionality reduction with negligible performance loss.
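
The concatenation step described in this abstract can be illustrated with a toy sketch. Everything below is hypothetical: the two-dimensional `EMBEDDINGS` stand in for word2vec vectors, the hand-written `CONCEPTS` table stands in for WordNet lookups, and attaching one text-level concept-frequency vector to every word embedding is an assumption about how the pieces fit together, not the authors' confirmed pipeline. Disambiguation and weight propagation are omitted.

```python
from collections import Counter

# Hypothetical toy data; a real system would use word2vec and WordNet.
EMBEDDINGS = {"dog": [0.2, 0.7], "cat": [0.3, 0.6], "bank": [0.9, 0.1]}
CONCEPTS = {
    "dog": ["canine", "animal"],
    "cat": ["feline", "animal"],
    "bank": ["institution"],
}
# Fixed ordering of concepts so every semantic vector has the same layout.
CONCEPT_INDEX = sorted({c for cs in CONCEPTS.values() for c in cs})

def semantic_vector(tokens):
    """Count how often each concept is evoked by the tokens of a text."""
    counts = Counter(c for t in tokens for c in CONCEPTS.get(t, []))
    return [float(counts[c]) for c in CONCEPT_INDEX]

def augment_by_concatenation(tokens):
    """Append the text's semantic vector to each known word's embedding."""
    sem = semantic_vector(tokens)
    return [EMBEDDINGS[t] + sem for t in tokens if t in EMBEDDINGS]

augmented = augment_by_concatenation(["dog", "cat"])
```

With four concepts in the index, each two-dimensional embedding grows to six dimensions; the "replacement" alternative mentioned in the abstract would feed the semantic vector in place of the embedding rather than alongside it.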

Pronunciation-Enhanced Chinese Word Embedding

Cognitive Computation ◽ 10.1007/s12559-021-09850-9 ◽ 2021

Author(s): Qinjuan Yang ◽ Haoran Xie ◽ Gary Cheng ◽ Fu Lee Wang ◽ Yanghui Rao

Keyword(s): Sentiment Analysis ◽ Text Classification ◽ Semantic Information ◽ Word Embedding ◽ Chinese Characters ◽ Learning Method ◽ Word Embeddings ◽ Chinese Word ◽ Word Similarity ◽ Meaning Structure

Chinese word embeddings have recently garnered considerable attention. Chinese characters and their sub-character components, which contain rich semantic information, are incorporated to learn Chinese word embeddings. Chinese characters can represent a combination of meaning, structure, and pronunciation. However, existing embedding learning methods focus on the structure and meaning of Chinese characters. In this study, we aim to develop an embedding learning method that makes complete use of the information represented by Chinese characters, including phonology, morphology, and semantics. Specifically, we propose a pronunciation-enhanced Chinese word embedding learning method, in which the pronunciations of context characters and target characters are simultaneously encoded into the embeddings. Evaluations on word similarity, word analogy reasoning, text classification, and sentiment analysis validate the effectiveness of our proposed method.

Feature Selection for Effective Text Classification using Semantic Information

International Journal of Computer Applications ◽ 10.5120/19861-1818 ◽ 2015 ◽ Vol 113 (10) ◽ pp. 18-25

Author(s): Rajul Jain ◽ Nitin Pise

Keyword(s): Feature Selection ◽ Text Classification ◽ Semantic Information ◽ Selection For

Large-Scale Hierarchical Text Classification Based on Path Semantic Information

2009 International Conference on Business Intelligence and Financial Engineering ◽ 10.1109/bife.2009.60 ◽ 2009

Author(s): Feng Gao ◽ Chengrong Wu ◽ Naiwang Guo ◽ Danfeng Zhao

Keyword(s): Text Classification ◽ Large Scale ◽ Semantic Information ◽ Hierarchical Text Classification

Using Graph-Kernels to Represent Semantic Information in Text Classification

Machine Learning and Data Mining in Pattern Recognition - Lecture Notes in Computer Science ◽ 10.1007/978-3-642-03070-3_48 ◽ 2009 ◽ pp. 632-646 ◽ Cited By ~ 1

Author(s): Teresa Gonçalves ◽ Paulo Quaresma

Keyword(s): Text Classification ◽ Semantic Information ◽ Graph Kernels

Near-Lossless Binarization of Word Embeddings

Proceedings of the AAAI Conference on Artificial Intelligence ◽ 10.1609/aaai.v33i01.33017104 ◽ 2019 ◽ Vol 33 ◽ pp. 7104-7111 ◽ Cited By ~ 3

Author(s): Julien Tissier ◽ Christophe Gravier ◽ Amaury Habrard

Keyword(s): Sentiment Analysis ◽ Semantic Similarity ◽ Text Classification ◽ Semantic Information ◽ State Of The Art ◽ Floating Point ◽ Word Embeddings ◽ Binary Vectors ◽ Starting Point ◽ Memory Footprint

Word embeddings are commonly used as a starting point in many NLP models to achieve state-of-the-art performance. However, with a large vocabulary and many dimensions, these floating-point representations are expensive in terms of both memory and computation, which makes them unsuitable for use on low-resource devices. The method proposed in this paper transforms real-valued embeddings into binary embeddings while preserving semantic information, requiring only 128 or 256 bits for each vector. This leads to a small memory footprint and fast vector operations. The model is based on an autoencoder architecture, which also allows the original vectors to be reconstructed from the binary ones. Experimental results on semantic similarity, text classification, and sentiment analysis tasks show that binarizing word embeddings leads to a loss of only ∼2% in accuracy while vector size is reduced by 97%. Furthermore, a top-k benchmark demonstrates that using these binary vectors is 30 times faster than using real-valued vectors.
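
The size figures in this abstract can be checked with simple arithmetic, and the speed of binary vectors comes from comparing codes with bitwise operations. The sketch below assumes the original embeddings are 300-dimensional 32-bit floats, which the abstract does not state; under that assumption a 256-bit code is about 97% smaller. The helper names and the example bit patterns are hypothetical, and the autoencoder that produces the codes is not shown.

```python
def size_reduction(n_bits, dims=300, bits_per_float=32):
    """Fraction of storage saved by an n_bits code versus a float vector.

    dims and bits_per_float are assumptions, not values from the abstract.
    """
    return 1 - n_bits / (dims * bits_per_float)

def hamming_similarity(a, b, n_bits):
    """Similarity of two n_bits-wide codes stored as Python ints.

    XOR marks the differing bits; counting them gives the Hamming distance.
    """
    differing = bin(a ^ b).count("1")
    return 1 - differing / n_bits

saving_256 = size_reduction(256)  # 1 - 256/9600
saving_128 = size_reduction(128)  # 1 - 128/9600

# Two hypothetical 4-bit codes that differ in exactly one position.
sim = hamming_similarity(0b1010, 0b1000, 4)
```

Because the comparison reduces to an XOR and a bit count over a few machine words, it avoids the floating-point multiplications needed for cosine similarity on real-valued vectors.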
