Latent Topic Model for Indexing Arabic Documents

In this paper, the authors present latent topic model to index and represent the Arabic text documents reflecting more semantics. Text representation in a language with high inflectional morphology such as Arabic is not a trivial task and requires some special treatments. The authors describe our approach for analyzing and preprocessing Arabic text then we describe the stemming process. Finally, the latent model (LDA) is adapted to extract Arabic latent topics, the authors extracted significant topics of all texts, each theme is described by a particular distribution of descriptors then each text is represented on the vectors of these topics. The experiment of classification is conducted on in house corpus; latent topics are learned with LDA for different topic numbers K (25, 50, 75, and 100) then the authors compare this result with classification in the full words space. The results show that performances, in terms of precision, recall and f-measure, of classification in the reduced topics space outperform classification in full words space and when using LSI reduction.

Download Full-text

Latent Topic Model for Indexing Arabic Documents

International Journal of Information Retrieval Research ◽

10.4018/ijirr.2014040104 ◽

2014 ◽

Vol 4 (2) ◽

pp. 57-72

Author(s):

Rami Ayadi ◽

Mohsen Maraoui ◽

Mounir Zrigui

Keyword(s):

Topic Model ◽

Inflectional Morphology ◽

Arabic Text ◽

Text Representation ◽

Text Documents ◽

Latent Topic ◽

Latent Topics ◽

F Measure

In this paper, the authors present latent topic model to index and represent the Arabic text documents reflecting more semantics. Text representation in a language with high inflectional morphology such as Arabic is not a trivial task and requires some special treatments. The authors describe their approach for analyzing and preprocessing Arabic text then they describe the stemming process. Finally, the latent model (LDA) is adapted to extract Arabic latent topics, the authors extracted significant topics of all texts, each theme is described by a particular distribution of descriptors then each text is represented on the vectors of these topics. The experiment of classification is conducted on in house corpus; latent topics are learned with LDA for different topic numbers K (25, 50, 75, and 100) then they compare this result with classification in the full words space. The results show that performances, in terms of precision, recall and f-measure, of classification in the reduced topics space outperform classification in full words space and when using LSI reduction.

Download Full-text

Latent Topic Estimation Based on Events in a Document

Journal of Advanced Computational Intelligence and Intelligent Informatics ◽

10.20965/jaciii.2012.p0603 ◽

2012 ◽

Vol 16 (5) ◽

pp. 603-610

Author(s):

Risa Kitajima ◽

◽

Ichiro Kobayashi

Keyword(s):

Text Analysis ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Latent Semantic Indexing ◽

Document Retrieval ◽

Retrieval Task ◽

Latent Topic ◽

Latent Topics ◽

Definition Of ◽

The Relationship

Several latent topic model-based methods such as Latent Semantic Indexing (LSI), Probabilistic LSI (pLSI), and Latent Dirichlet Allocation (LDA) have been widely used for text analysis. These methods basically assign topics to words, however, and the relationship between words in a document is therefore not considered. Considering this, we propose a latent topic extraction method that assigns topics to events that represent the relation between words in a document. There are several ways to express events, and the accuracy of estimating latent topics differs depending on the definition of an event. We therefore propose five event types and examine which event type works well in estimating latent topics in a document with a common document retrieval task. As an application of our proposed method, we also show multidocument summarization based on latent topics. Through these experiments, we have confirmed that our proposed method results in higher accuracy than the conventional method.

Download Full-text

Semi-supervised Max-margin Topic Model with Manifold Posterior Regularization

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/259 ◽

2017 ◽

Cited By ~ 2

Author(s):

Wenbo Hu ◽

Jun Zhu ◽

Hang Su ◽

Jingwei Zhuo ◽

Bo Zhang

Keyword(s):

Supervised Learning ◽

Topic Model ◽

Topic Models ◽

Stochastic Gradient ◽

Mcmc Method ◽

Tight Coupling ◽

Label Information ◽

Latent Topic ◽

Latent Topics ◽

Qualitative Performance

Supervised topic models leverage label information to learn discriminative latent topic representations. As collecting a fully labeled dataset is often time-consuming, semi-supervised learning is of high interest. In this paper, we present an effective semi-supervised max-margin topic model by naturally introducing manifold posterior regularization to a regularized Bayesian topic model, named LapMedLDA. The model jointly learns latent topics and a related classifier with only a small fraction of labeled documents. To perform the approximate inference, we derive an efficient stochastic gradient MCMC method. Unlike the previous semi-supervised topic models, our model adopts a tight coupling between the generative topic model and the discriminative classifier. Extensive experiments demonstrate that such tight coupling brings significant benefits in quantitative and qualitative performance.

Download Full-text

Shape-DNA: Effective Character Restoration and Enhancement for Arabic Text Documents

2010 20th International Conference on Pattern Recognition ◽

10.1109/icpr.2010.506 ◽

2010 ◽

Cited By ~ 10

Author(s):

Gulcin Caner ◽

Ismail Haritaoglu

Keyword(s):

Arabic Text ◽

Text Documents

Download Full-text

Analysis and Visualization Latent Topic on COVID-19 Vaccine Tweet use two-stage topic modeling (Preprint)

10.2196/preprints.30290 ◽

2021 ◽

Author(s):

Faizah Faizah ◽

Bor-Shen Lin

Keyword(s):

Topic Modeling ◽

Public Perception ◽

Latent Dirichlet Allocation ◽

World Health ◽

Two Stage ◽

The Public ◽

Global Pandemic ◽

Difficult Time ◽

Latent Topic ◽

Latent Topics

BACKGROUND The World Health Organization (WHO) declared COVID-19 as a global pandemic on January 30, 2020. However, the pandemic has not been over yet. Furthermore, in the first quartal of 2021, some countries face the third wave of the pandemic. During the difficult time, the development of the vaccines for COVID-19 accelerates rapidly. Understanding the public perception of the COVID-19 Vaccine according to the data collected from social media can widen the perspective on the state of the global pandemic OBJECTIVE This study explores and analyzes the latent topic on COVID-19 Vaccine Tweet posted by individuals from various countries by using two-stage topic modeling. METHODS A two-stage analysis in topic modeling was proposed to investigating people’s reactions in five countries. The first stage is Latent Dirichlet Allocation that produces the latent topics with the corresponding term distributions that facilitate the investigators to understand the main issues or opinions. The second stage then performs agglomerative clustering on the latent topics based on Hellinger distance, which merges close topics hierarchically into topic clusters to visualize those topics in either tree or graph views. RESULTS In general, the topic discussion regarding the COVID-19 Vaccine in five countries is similar. Topic themes such as "first vaccine" and & "vaccine effect" dominate the public discussion. The remarkable point is that people in some countries have some topic themes, such as "politician opinion" and " stay home" in Canada, "emergency" in India, and & "blood clots" in the United Kingdom. The analysis also shows the most popular COVID-19 Vaccine, which is gaining more public interest. CONCLUSIONS With LDA and Hierarchical clustering, two-stage topic modeling is powerful for visualizing the latent topics and understanding the public perception regarding the COVID-19 Vaccine.

Download Full-text

Context-Based Query Using Dependency Structures Based on Latent Topic Model

Journal on Data Semantics ◽

10.1007/s13740-013-0031-3 ◽

2013 ◽

Vol 3 (3) ◽

pp. 157-168

Author(s):

Masato Shirai ◽

Takashi Yanagisawa ◽

Takao Miura

Keyword(s):

Topic Model ◽

Latent Topic ◽

Dependency Structures

Download Full-text

The comparative study of text documents clustering algorithms

Environment Conservation Journal ◽

10.36953/ecj.2015.se1614 ◽

2015 ◽

Vol 16 (SE) ◽

pp. 133-138

Author(s):

Mohammad Eiman Jamnezhad ◽

Reza Fattahi

Keyword(s):

Data Mining ◽

Dna Analysis ◽

Clustering Algorithms ◽

Research Area ◽

Large Set ◽

Text Documents ◽

Web Documents ◽

Significant Research ◽

The Comparative Study ◽

F Measure

Clustering is one of the most significant research area in the field of data mining and considered as an important tool in the fast developing information explosion era.Clustering systems are used more and more often in text mining, especially in analyzing texts and to extracting knowledge they contain. Data are grouped into clusters in such a way that the data of the same group are similar and those in other groups are dissimilar. It aims to minimizing intra-class similarity and maximizing inter-class dissimilarity. Clustering is useful to obtain interesting patterns and structures from a large set of data. It can be applied in many areas, namely, DNA analysis, marketing studies, web documents, and classification. This paper aims to study and compare three text documents clustering, namely, k-means, k-medoids, and SOM through F-measure.

Download Full-text

IPHITS: An Incremental Latent Topic Model for Link Structure

Information Retrieval Technology - Lecture Notes in Computer Science ◽

10.1007/978-3-642-04769-5_21 ◽

2009 ◽

pp. 242-253 ◽

Cited By ~ 1

Author(s):

Huifang Ma ◽

Weizhong Zhao ◽

Zhixin Li ◽

Zhongzhi Shi

Keyword(s):

Topic Model ◽

Link Structure ◽

Latent Topic

Download Full-text

The Effect of Stemming on Arabic Text Classification

International Journal of Information Retrieval Research ◽

10.4018/ijirr.2011070104 ◽

2011 ◽

Vol 1 (3) ◽

pp. 54-70 ◽

Cited By ~ 11

Author(s):

Abdullah Wahbeh ◽

Mohammed Al-Kabi ◽

Qasem Al-Radaideh ◽

Emad Al-Shawakfa ◽

Izzat Alsmadi

Keyword(s):

Text Classification ◽

Digital Libraries ◽

Arabic Language ◽

Support Vector ◽

Svm Classifier ◽

Arabic Text ◽

Text Documents ◽

Information Retrieval Systems ◽

Arabic Text Classification ◽

The Web

The information world is rich of documents in different formats or applications, such as databases, digital libraries, and the Web. Text classification is used for aiding search functionality offered by search engines and information retrieval systems to deal with the large number of documents on the web. Many research papers, conducted within the field of text classification, were applied to English, Dutch, Chinese, and other languages, whereas fewer were applied to Arabic language. This paper addresses the issue of automatic classification or classification of Arabic text documents. It applies text classification to Arabic language text documents using stemming as part of the preprocessing steps. Results have showed that applying text classification without using stemming; the support vector machine (SVM) classifier has achieved the highest classification accuracy using the two test modes with 87.79% and 88.54%. On the other hand, stemming has negatively affected the accuracy, where the SVM accuracy using the two test modes dropped down to 84.49% and 86.35%.

Download Full-text

An enhanced Kashida-based watermarking approach for Arabic text-documents

2013 International Conference on Electronics, Computer and Computation (ICECCO) ◽

10.1109/icecco.2013.6718288 ◽

2013 ◽

Cited By ~ 8

Author(s):

Yasser M. Alginahi ◽

Muhammad N. Kabir ◽

Omar Tayan

Keyword(s):

Arabic Text ◽

Text Documents

Download Full-text