A New Vector Representation of Short Texts for Classification

Yangyang Li; Bo Liu

doi:10.34028/iajit/17/2/12

The International Arab Journal of Information Technology ◽

10.34028/iajit/17/2/12 ◽

2019 ◽

Vol 17 (2) ◽

pp. 241-249

Author(s):

Yangyang Li ◽

Bo Liu

Keyword(s):

Text Classification ◽

Web Search ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Classification Performance ◽

New Method ◽

Data Sets ◽

Text Data ◽

Short Text ◽

Space Model

Short-text classification is hampered mainly by the brevity and sparsity of the texts and by synonyms and homonyms. In recent years, research on short-text classification has focused on expanding short texts, but it can rarely guarantee the validity of the expanded words. This study proposes a new method that weakens these effects without external knowledge. The proposed method analyses short texts with a topic model based on Latent Dirichlet Allocation (LDA), represents each short text with a vector space model, and then adjusts the vector of each short text. In the experiments, two open short-text data sets composed of Google News and web search snippets are used to evaluate classification performance and demonstrate the effectiveness of the method.
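
The sparsity and synonym obstacles named in this abstract can be illustrated with a minimal sketch. It is not the authors' method: it assumes a plain bag-of-words vector space model with whitespace tokenisation, and the helper names (bow_vector, cosine) and the example snippets are hypothetical.

```python
import math
from collections import Counter

def bow_vector(text):
    """Map a text to a sparse term-frequency vector (word -> count)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up snippets: a and b describe the same thing with different words.
a = bow_vector("cheap laptop deals")
b = bow_vector("inexpensive notebook offers")
c = bow_vector("cheap laptop offers")

sim_ab = cosine(a, b)  # no shared surface terms
sim_ac = cosine(a, c)  # two of three terms shared
```

Because a and b share no surface terms, their similarity is zero even though they are near-synonymous, which is the kind of effect a topic-based adjustment of the vectors is meant to weaken.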

CLDA: An Effective Topic Model for Mining User Interest Preference under Big Data Background

Complexity ◽

10.1155/2018/2503816 ◽

2018 ◽

Vol 2018 ◽

pp. 1-10 ◽

Cited By ~ 5

Author(s):

Lirong Qiu ◽

Jia Yu

Keyword(s):

Big Data ◽

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

User Interest ◽

Text Data ◽

Data Set ◽

Data Sparsity ◽

Short Text ◽

Text Filtering

Against the present big data background, the central problem is how to extract useful information effectively. The purpose of this study is to construct a more effective method of mining the interest preferences of users in a particular field. We mainly study a large volume of user text data from microblogs. LDA is an effective text mining method, but it does not perform well when applied directly to the large number of short texts in microblogs. In current topic modeling practice, short texts are aggregated into long texts to avoid data sparsity. However, aggregated short texts contain a lot of noise, which reduces the accuracy of mining users’ interest preferences. In this paper, we propose Combining Latent Dirichlet Allocation (CLDA), a new topic model that can learn the latent topics of microblog short texts and long texts simultaneously. The data sparsity of short texts is avoided by using aggregated long texts to assist in learning from short texts. In turn, short texts are used to filter the long texts to improve mining accuracy, so that long and short texts are effectively combined. Experimental results on a real microblog data set show that CLDA outperforms many advanced models in mining user interest, and we also confirm that CLDA performs well in recommender systems.

Improving Term Weighting Schemes for Short Text Classification in Vector Space Model

IEEE Access ◽

10.1109/access.2019.2953918 ◽

2019 ◽

Vol 7 ◽

pp. 166578-166592

Author(s):

Surender Singh Samant ◽

N. L. Bhanu Murthy ◽

Aruna Malapati

Keyword(s):

Vector Space ◽

Text Classification ◽

Vector Space Model ◽

Term Weighting ◽

Weighting Schemes ◽

Short Text ◽

Space Model

An exploration of text mining of narrative reports of injury incidents to assess risk

MATEC Web of Conferences ◽

10.1051/matecconf/201825106020 ◽

2018 ◽

Vol 251 ◽

pp. 06020 ◽

Cited By ~ 4

Author(s):

David Passmore ◽

Chungil Chae ◽

Yulia Kustikova ◽

Rose Baker ◽

Jeong-Ha Yim

Keyword(s):

Latent Dirichlet Allocation ◽

Topic Model ◽

Surface Mining ◽

Modeling Processes ◽

Free Text ◽

Text Data ◽

Injury Occurrence ◽

The USA ◽

Musculoskeletal Systems ◽

Topic Mining

A topic model was explored using unsupervised machine learning to summarize free-text narrative reports of 77,215 injuries that occurred in coal mines in the USA between 2000 and 2015. Latent Dirichlet Allocation modeling processes identified six topics from the free-text data. One topic, a theme describing primarily injury incidents resulting in strains and sprains of musculoskeletal systems, revealed differences in topic emphasis by the location of the mine property at which injuries occurred, the degree of injury, and the year of injury occurrence. Text narratives clustered around this topic refer most frequently to surface or other locations rather than underground locations that resulted in disability and that, also, increased secularly over time. The modeling success enjoyed in this exploratory effort suggests that additional topic mining of these injury text narratives is justified, especially using a broad set of covariates to explain variations in topic emphasis and for comparison of surface mining injuries with injuries occurring during site preparation for construction.

Topic-BERT: Detecting harmful information from social media

Intelligent Decision Technologies ◽

10.3233/idt-200094 ◽

2021 ◽

pp. 1-10

Author(s):

Wang Gao ◽

Hongtao Deng ◽

Xun Zhu ◽

Yuan Fang

Keyword(s):

Social Media ◽

Language Processing ◽

Topic Model ◽

Classification Performance ◽

Critical Research ◽

Short Text ◽

Additional Information ◽

Proposed Model ◽

Weight Calculation ◽

Two Stages

Harmful information identification is a critical research topic in natural language processing. Existing approaches have focused either on rule-based methods or on harmful text identification in normal documents. In this paper, we propose a BERT-based model, called Topic-BERT, to identify harmful information from social media. Firstly, Topic-BERT utilizes BERT to take additional information as input to alleviate the sparseness of short texts. The GPU-DMM topic model is used to capture hidden topics of short texts for attention weight calculation. Secondly, the proposed model divides harmful short text identification into two stages, and labels of different granularity are identified by two similar sub-models. Finally, we conduct extensive experiments on a real-world social media dataset to evaluate our model. Experimental results demonstrate that our model can significantly improve the classification performance compared with baseline methods.

A word embedding topic model for topic detection and summary in social networks

Measurement and Control ◽

10.1177/0020294019865750 ◽

2019 ◽

Vol 52 (9-10) ◽

pp. 1289-1298 ◽

Cited By ~ 1

Author(s):

Lei Shi ◽

Gang Cheng ◽

Shang-ru Xie ◽

Gang Xie

Keyword(s):

Social Networks ◽

Social Network ◽

Latent Dirichlet Allocation ◽

Semantic Analysis ◽

Topic Model ◽

Word Embedding ◽

Probabilistic Latent Semantic Analysis ◽

Topic Detection ◽

Short Text ◽

Internal Relationship

The aim of topic detection is to automatically identify the events and hot topics in social networks and to continuously track known topics. Applying traditional methods such as Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis is difficult given the high dimensionality of massive event texts and the short-text sparsity of social networks. A further problem is that topics become unclear because of their sparse distribution. To address these challenges, we propose a novel word embedding topic model, named Cbow Topic Model (CTM), that combines the topic model with the continuous bag-of-words (Cbow) word embedding method for topic detection and summary in social networks. We cluster similar words in the target social network text dataset by introducing the classic Cbow word vectorization method, which can effectively learn the internal relationship between words and reduce the dimensionality of the input texts. We employ the topic model to model short texts, which effectively weakens the sparsity problem of social network texts. To detect and summarize topics, we propose a topic detection method that leverages similarity computing for social networks. We collected a Sina microblog dataset to conduct various experiments. The experimental results demonstrate that the CTM method is superior to the existing topic model method.

The Automatic Non-Negative Matrix Factorization of the Hierarchy Clustering Method

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.325-326.1489 ◽

2013 ◽

Vol 325-326 ◽

pp. 1489-1492

Author(s):

Tie Qi Li ◽

Wen Shuo Zhang

Keyword(s):

Matrix Factorization ◽

Vector Space Model ◽

Real Data ◽

Data Sets ◽

Text Data ◽

Space Model ◽

Hierarchical Relations ◽

Weight Calculation ◽

Novel Method ◽

Non Negative Matrix Factorization

Finding useful information in such a huge volume of information has become a problem. To deal with hierarchical relations in text data, a novel text mining method, called automatic non-negative matrix factorization of the hierarchy clustering, is proposed. We use the vector space model as the research foundation and mainly discuss two problems: feature selection and weight calculation. The experimental results on real data sets demonstrate that our method outperforms, on average, all six other methods.
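
Weight calculation in a vector space model can be sketched as follows. The abstract does not say which weighting scheme the authors use, so this sketch assumes the common TF-IDF weighting purely for illustration; the function name tf_idf and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        # Relative term frequency times log inverse document frequency.
        weighted.append(
            {t: (tf[t] / len(doc)) * math.log(n_docs / df[t]) for t in tf}
        )
    return weighted

# Toy corpus (assumed for illustration only).
docs = [
    "matrix factorization for text clustering".split(),
    "hierarchy clustering of text data".split(),
    "vector space model for text".split(),
]
weights = tf_idf(docs)
```

In this toy corpus the term "text" occurs in every document and therefore receives weight zero, while a term unique to one document, such as "matrix", receives the largest weight in that document.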

Filtering and Classifying Relevant Short Text with a Few Seed Words

Data and Information Management ◽

10.2478/dim-2019-0011 ◽

2019 ◽

Vol 3 (3) ◽

pp. 165-186 ◽

Cited By ~ 1

Author(s):

Chenliang Li ◽

Shiqian Chen ◽

Yan Qi

Keyword(s):

Latent Dirichlet Allocation ◽

Topic Model ◽

State Of The Art ◽

Superior Performance ◽

Support Vector ◽

Short Text ◽

Text Filtering ◽

Supervised Classifiers ◽

Real World Datasets ◽

Weakly Supervised

Filtering out irrelevant documents and classifying the relevant ones into topical categories is a de facto task in many applications. However, supervised learning solutions require considerable human effort on document labeling. In this paper, we propose a novel seed-guided topic model for dataless short text classification and filtering, named SSCF. Without using any labeled documents, SSCF takes a few “seed words” for each category of interest, and conducts short text filtering and classification in a weakly supervised manner. To overcome the issues of data sparsity and imbalance, the short text collection is mapped to a collection of pseudo-documents, one for each word. SSCF infers two kinds of topics on pseudo-documents: category-topics and general-topics. Each category-topic is associated with one category of interest, covering the meaning of the latter. In SSCF, we devise a novel word relevance estimation process based on the seed words, for hidden topic inference. The dominating topic of a short text is identified through post inference and then used for filtering and classification. On two real-world datasets in two languages, experimental results show that our proposed SSCF consistently achieves better classification accuracy than state-of-the-art baselines. We also observe that SSCF can even achieve better performance than the supervised classifiers supervised latent Dirichlet allocation (sLDA) and support vector machine (SVM) on some testing tasks.
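
The mapping from a short text collection to one pseudo-document per word can be sketched as below. The abstract does not spell out the construction, so this sketch assumes that a word's pseudo-document collects every word co-occurring with it in the same short text; the function name build_pseudo_documents and the sample texts are hypothetical, and the topic inference step of SSCF is not shown.

```python
from collections import defaultdict

def build_pseudo_documents(short_texts):
    """Map each word to the list of words that co-occur with it."""
    pseudo_docs = defaultdict(list)
    for text in short_texts:
        tokens = text.lower().split()
        for i, word in enumerate(tokens):
            # Context = every other token in the same short text.
            pseudo_docs[word].extend(tokens[:i] + tokens[i + 1:])
    return dict(pseudo_docs)

# Made-up short texts for illustration.
texts = ["stock market falls", "market rally continues", "team wins final"]
pseudo = build_pseudo_documents(texts)
```

Under this assumption a frequent word such as "market" accumulates context from several short texts, so its pseudo-document is longer and less sparse than any single short text it came from.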

Lender Trust on the P2P Lending: Analysis Based on Sentiment Analysis of Comment Text

Sustainability ◽

10.3390/su12083293 ◽

2020 ◽

Vol 12 (8) ◽

pp. 3293 ◽

Cited By ~ 2

Author(s):

Beibei Niu ◽

Jinzheng Ren ◽

Ansa Zhao ◽

Xiaotao Li

Keyword(s):

Theoretical Basis ◽

Analytical Approach ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Text Data ◽

The Core ◽

Operational Level ◽

Core Subject ◽

P2P Lending ◽

Subject Areas

Lender trust is important to ensure the sustainability of P2P lending. This paper uses web crawling to collect more than 240,000 unique pieces of comment text data. Based on the mapping relationship between emotion and trust, we use the lexicon-based method and deep learning to check the trust of a given lender in P2P lending. Further, we use the Latent Dirichlet Allocation (LDA) topic model to mine topics concerned with this research. The results show that lenders are positive about P2P lending, though this tendency fluctuates downward with time. The security, rate of return, and compliance of P2P lending are the issues of greatest concern to lenders. This study reveals the core subject areas that influence a lender’s emotions and trust and provides a theoretical basis and empirical reference for relevant platforms to improve their operational level while enhancing competitiveness. This analytical approach offers insights for researchers to understand the hidden content behind the text data.

SenU-PTM: a novel phrase-based topic model for short-text topic discovery by exploiting word embeddings

Data Technologies and Applications ◽

10.1108/dta-02-2021-0039 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Heng-Yang Lu ◽

Yi Zhang ◽

Yuntao Du

Keyword(s):

Latent Dirichlet Allocation ◽

Topic Model ◽

Second Phase ◽

Word Embeddings ◽

Two Phase ◽

Content Type ◽

Short Text ◽

Topic Discovery ◽

Two Phases ◽

Sense Unit

Purpose: Topic models have been widely applied to discover important information from a vast amount of unstructured data. Traditional long-text topic models such as Latent Dirichlet Allocation may suffer from the sparsity problem when dealing with short texts, which mostly come from the Web. These models also suffer from a readability problem when displaying the discovered topics. The purpose of this paper is to propose a novel model called the Sense Unit based Phrase Topic Model (SenU-PTM) for both the sparsity and readability problems.

Design/methodology/approach: SenU-PTM is a novel phrase-based short-text topic model under a two-phase framework. The first phase introduces a phrase-generation algorithm by exploiting word embeddings, which aims to generate phrases from the original corpus. The second phase introduces a new concept of sense unit, which consists of a set of semantically similar tokens for modeling topics with token vectors generated in the first phase. Finally, SenU-PTM infers topics based on the above two phases.

Findings: Experimental results on two real-world and publicly available datasets show the effectiveness of SenU-PTM from the perspectives of topical quality and document characterization. The results reveal that modeling topics on sense units can solve the sparsity of short texts and improve the readability of topics at the same time.

Originality/value: The originality of SenU-PTM lies in the new procedure of modeling topics on the proposed sense units with word embeddings for short-text topic discovery.

Ldagibbs: A Command for Topic Modeling in Stata Using Latent Dirichlet Allocation

The Stata Journal: Promoting communications on statistics and Stata ◽

10.1177/1536867x1801800107 ◽

2018 ◽

Vol 18 (1) ◽

pp. 101-117 ◽

Cited By ~ 10

Author(s):

Carlo Schwarz

Keyword(s):

Machine Learning ◽

Probability Distribution ◽

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Topic Models ◽

Text Documents ◽

Text Data ◽

Dirichlet Allocation

In this article, I introduce the ldagibbs command, which implements latent Dirichlet allocation in Stata. Latent Dirichlet allocation is the most popular machine-learning topic model. Topic models automatically cluster text documents into a user-chosen number of topics. Latent Dirichlet allocation represents each document as a probability distribution over topics and represents each topic as a probability distribution over words. Therefore, latent Dirichlet allocation provides a way to analyze the content of large unclassified text data and an alternative to predefined document classifications.
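
The two distributions described in this abstract combine into a word probability for a document, p(word | document) = sum over topics k of p(k | document) * p(word | k). The following is a minimal sketch of that mixture only, not of the ldagibbs command or of Gibbs sampling; the topic names, probabilities and the function name word_probability are assumed for illustration.

```python
def word_probability(word, doc_topics, topic_words):
    """p(word | doc) = sum over topics k of p(k | doc) * p(word | k)."""
    return sum(
        p_topic * topic_words[k].get(word, 0.0)
        for k, p_topic in doc_topics.items()
    )

# Each topic is a probability distribution over words (made-up values).
topic_words = {
    "sports": {"game": 0.5, "team": 0.4, "market": 0.1},
    "finance": {"market": 0.6, "stock": 0.3, "game": 0.1},
}
# The document is a probability distribution over topics (made-up values).
doc_topics = {"sports": 0.25, "finance": 0.75}

p_market = word_probability("market", doc_topics, topic_words)
vocabulary = ["game", "team", "market", "stock"]
total = sum(word_probability(w, doc_topics, topic_words) for w in vocabulary)
```

Since both inputs are proper distributions, the resulting word probabilities sum to one over the vocabulary, which is a quick sanity check on any fitted model's output.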