Urdu Documents Clustering with Unsupervised and Semi-Supervised Probabilistic Topic Modeling

Document clustering groups documents according to certain semantic features. Topic models have a richer semantic structure and considerable potential for helping users understand document corpora. Unfortunately, because topic models are purely unsupervised, this potential is stymied on text documents whose categories overlap. To solve this problem, some semi-supervised models have been proposed for the English language. However, no such work is available for Urdu, a resource-poor language. Document clustering therefore remains a challenging task in Urdu, which has its own morphology, syntax and semantics. In this study, we propose a semi-supervised framework for Urdu document clustering that deals with the challenges of Urdu morphology. The proposed model combines pre-processing techniques, the seeded-LDA model and Gibbs sampling; we name it seeded-Urdu Latent Dirichlet Allocation (seeded-ULDA). We apply the proposed model and other methods to Urdu news datasets for categorization. Two conditions are considered for document clustering: “Dataset without overlapping”, in which all classes are distinct, and “Dataset with overlapping”, in which the categories overlap and the classes are connected to each other. The aim of this study is threefold. First, it shows that unsupervised models (Latent Dirichlet Allocation (LDA), non-negative matrix factorization (NMF) and K-means) give satisfactory results on the dataset without overlapping. Second, it shows that these unsupervised models do not perform well on the dataset with overlapping, because on this dataset they find some topics that are neither entirely meaningful nor effective in extrinsic tasks. Third, our proposed semi-supervised model, seeded-ULDA, performs well on both datasets because it is a straightforward and effective way to instruct topic models to find topics of specific interest. This paper shows that the semi-supervised model, seeded-ULDA, provides significantly better results than the unsupervised algorithms.
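The abstract does not say how seed words enter the model, so the following is only a minimal sketch of one common way to seed a topic model: giving seed words extra prior mass in their designated topic. It is not the authors' seeded-ULDA implementation. The function names, the `boost` parameter, and the English placeholder tokens (standing in for pre-processed Urdu tokens) are all assumptions made for illustration.

```python
def seeded_topic_word_prior(vocab, seed_words, beta=0.01, boost=1.0):
    """Build one prior vector per topic (hypothetical helper).

    Every word starts with the symmetric prior `beta`; each seed word
    receives an additional `boost` in the topic it was assigned to.
    """
    index = {word: i for i, word in enumerate(vocab)}
    prior = []
    for words in seed_words:
        row = [beta] * len(vocab)
        for word in words:
            if word in index:
                row[index[word]] += boost
        prior.append(row)
    return prior


def initial_topic_weights(word, vocab, prior):
    """Normalized topic weights for `word` before any counts are observed."""
    v = vocab.index(word)
    raw = [row[v] / sum(row) for row in prior]
    total = sum(raw)
    return [r / total for r in raw]


# Placeholder vocabulary and seed lists (assumed, not from the paper).
vocab = ["cricket", "match", "election", "minister", "price"]
seed_words = [["cricket", "match"], ["election", "minister"]]

prior = seeded_topic_word_prior(vocab, seed_words)
seeded = initial_topic_weights("cricket", vocab, prior)
unseeded = initial_topic_weights("price", vocab, prior)
print(seeded)    # strongly favors topic 0
print(unseeded)  # no preference between the two topics
```

In a Gibbs sampler these prior vectors would replace the symmetric word prior, so a seed word is pulled toward its topic while unseeded words are still assigned from the data.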

ldagibbs: A Command for Topic Modeling in Stata Using Latent Dirichlet Allocation

The Stata Journal Promoting communications on statistics and Stata ◽

10.1177/1536867x1801800107 ◽

2018 ◽

Vol 18 (1) ◽

pp. 101-117 ◽

Cited By ~ 10

Author(s):

Carlo Schwarz

Keyword(s):

Machine Learning ◽

Probability Distribution ◽

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Topic Models ◽

Text Documents ◽

Text Data ◽

Dirichlet Allocation

In this article, I introduce the ldagibbs command, which implements latent Dirichlet allocation in Stata. Latent Dirichlet allocation is the most popular machine-learning topic model. Topic models automatically cluster text documents into a user-chosen number of topics. Latent Dirichlet allocation represents each document as a probability distribution over topics and represents each topic as a probability distribution over words. Therefore, latent Dirichlet allocation provides a way to analyze the content of large unclassified text data and an alternative to predefined document classifications.
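To make the two distributions concrete, here is a minimal pure-Python sketch of collapsed Gibbs sampling for latent Dirichlet allocation. It is not the `ldagibbs` Stata code; the function name `lda_gibbs`, the hyperparameter defaults, and the toy documents are assumptions chosen for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (theta, phi, vocab): theta[d] is document d's distribution over
    topics, phi[k] is topic k's distribution over the words in vocab.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]      # words in doc d with topic k
    topic_word = [[0] * V for _ in range(num_topics)]  # times word v has topic k
    topic_total = [0] * num_topics                     # words assigned to topic k
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + V * beta) for c in topic_word[k]]
        for k in range(num_topics)
    ]
    return theta, phi, vocab


# Toy corpus (made up for illustration).
docs = [["a", "b", "a"], ["c", "d", "c"], ["a", "c"]]
theta, phi, vocab = lda_gibbs(docs, num_topics=2, iters=20)
```

Each row of `theta` and each row of `phi` sums to one, which is what "a probability distribution over topics" and "a probability distribution over words" mean in practice.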

Evaluation of Text Semantic Features using Latent Dirichlet Allocation Model

International Journal of Performability Engineering ◽

10.23940/ijpe.20.06.p15.968978 ◽

2020 ◽

Vol 16 (6) ◽

pp. 968

Author(s):

Zhou Chunjie ◽

Li Nao ◽

Zhang Chi ◽

Yang Xiaoyu

Keyword(s):

Latent Dirichlet Allocation ◽

Semantic Features ◽

Allocation Model ◽

Latent Dirichlet Allocation Model ◽

Dirichlet Allocation

Intelligent radar software defect classification approach based on the latent Dirichlet allocation topic model

EURASIP Journal on Advances in Signal Processing ◽

10.1186/s13634-021-00761-3 ◽

2021 ◽

Vol 2021 (1) ◽

Author(s):

Xi Liu ◽

Yongfeng Yin ◽

Haifeng Li ◽

Jiabin Chen ◽

Chang Liu ◽

...

Keyword(s):

Latent Dirichlet Allocation ◽

Topic Model ◽

Recall Rate ◽

Defect Classification ◽

Software Defects ◽

Classification Approach ◽

Software Defect ◽

Model Combining ◽

Dirichlet Allocation

Existing intelligent software defect classification approaches do not consider radar characteristics and prior statistical information. Thus, when these approaches are applied to radar software testing and validation, the precision and recall of defect classification are poor, which reduces the effectiveness of reusing software defects. To solve this problem, this paper proposes a new intelligent defect classification approach for radar software based on the latent Dirichlet allocation (LDA) topic model. The proposed approach includes a defect text segmentation algorithm based on a radar-domain dictionary, a modified LDA model that incorporates radar software requirements, and the top acquisition and classification approach for radar software defects based on the modified LDA model. The proposed approach is applied to typical radar software defects to validate its effectiveness and applicability. The results show that the precision and recall of the proposed approach improve by up to 15 to 20% compared with other defect classification approaches. Thus, the proposed approach can be applied effectively to the segmentation and classification of radar software defects to improve how adequately defects in radar software are identified.

Research progress and trend of leader member exchange based on social complex network and latent Dirichlet allocation topic model

2020 2nd International Conference on Economic Management and Model Engineering (ICEMME) ◽

10.1109/icemme51517.2020.00090 ◽

2020 ◽

Author(s):

Zhang chunyang ◽

Ding kun ◽

Zhang chunbo ◽

Zhang li

Keyword(s):

Complex Network ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Research Progress ◽

Leader Member Exchange ◽

Member Exchange ◽

Dirichlet Allocation

Augmented Latent Dirichlet Allocation (LDA) Topic Model with Gaussian Mixture Topics

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽

10.1109/icassp.2018.8462003 ◽

2018 ◽

Cited By ~ 1

Author(s):

Kedar S. Prabhudesai ◽

Boyla O. Mainsah ◽

Leslie M. Collins ◽

Chandra S. Throckmorton

Keyword(s):

Latent Dirichlet Allocation ◽

Topic Model ◽

Gaussian Mixture ◽

Dirichlet Allocation

Latent Dirichlet Allocation and POS Tags Based Method for External Plagiarism Detection

Scholarly Ethics and Publishing ◽

10.4018/978-1-5225-8057-7.ch015 ◽

2019 ◽

pp. 319-336

Author(s):

Ali Daud ◽

Jamal Ahmad Khan ◽

Jamal Abdul Nasir ◽

Rabeeh Ayaz Abbasi ◽

Naif Radi Aljohani ◽

...

Keyword(s):

Latent Dirichlet Allocation ◽

Plagiarism Detection ◽

Text Documents ◽

Parts Of Speech ◽

Stop Word ◽

Processing Step ◽

Syntactic Information ◽

N Gram ◽

Basic Hypothesis ◽

Dirichlet Allocation

In this article we present a new semantic and syntactic method for external plagiarism detection. In the proposed approach, latent Dirichlet allocation (LDA) and parts of speech (POS) tags are used together to detect plagiarism between the sample and a number of source documents. The basic hypothesis is that considering semantic and syntactic information between two text documents may improve the performance of the plagiarism detection task. Our method has two steps: a pre-processing step, in which we detect the topics of the sentences in the documents using LDA and convert each sentence into an array of POS tags, and a post-processing step, in which the suspicious cases are verified purely on the basis of semantic rules. For two types of external plagiarism (copy and random obfuscation), we empirically compare our approach to the state-of-the-art N-gram based and stop-word N-gram based methods and observe significant improvements.

A Document Clustering Algorithm Based on Semi-constrained Hierarchical Latent Dirichlet Allocation

Knowledge Science, Engineering and Management - Lecture Notes in Computer Science ◽

10.1007/978-3-319-12096-6_5 ◽

2014 ◽

pp. 49-60 ◽

Cited By ~ 2

Author(s):

Jungang Xu ◽

Shilong Zhou ◽

Lin Qiu ◽

Shengyuan Liu ◽

Pengfei Li

Keyword(s):

Clustering Algorithm ◽

Latent Dirichlet Allocation ◽

Document Clustering ◽

Dirichlet Allocation

The surveillance of a supreme audit institution on related party transactions

Journal of Public Budgeting Accounting & Financial Management ◽

10.1108/jpbafm-12-2019-0181 ◽

2020 ◽

Vol 32 (4) ◽

pp. 577-603

Author(s):

Gustavo Cesário ◽

Ricardo Lopes Cardoso ◽

Renato Santos Aranha

Keyword(s):

Public Sector ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Conflicts Of Interest ◽

International Standards ◽

Content Type ◽

Related Party Transactions ◽

The Public ◽

Audit Reports ◽

Dirichlet Allocation

Purpose: This paper aims to analyse how the supreme audit institution (SAI) monitors related party transactions (RPTs) in the Brazilian public sector. It considers definitions and disclosure policies of RPTs by international accounting and auditing standards and their evolution since 1980. Design/methodology/approach: Based on archival research on international standards and using an interpretive approach, the authors investigated definitions and disclosure policies. Using a topic model based on latent Dirichlet allocation, the authors performed a content analysis on over 59,000 SAI decisions to assess how the SAI monitors RPTs. Findings: The SAI investigates nepotism (a kind of RPT) and conflicts of interest up to eight times more frequently than related parties. Brazilian laws prevent nepotism and conflicts of interest, but not RPTs in general. Indeed, Brazilian public-sector accounting standards have not converged towards IPSAS 20, and ISSAI 1550 does not adjust auditing procedures to suit the public sector. Research limitations/implications: The SAI follows a legalistic auditing approach, indicating a need for regulation of related public-sector parties to improve surveillance. In addition to Brazil, other code law countries might face similar circumstances. Originality/value: Public-sector RPTs are an under-investigated field, calling for attention by academics and standard-setters. Text mining and latent Dirichlet allocation, while mature techniques, are underexplored in accounting and auditing studies. Additionally, the Python script created to analyse the audit reports is available at Mendeley Data and may be used to perform similar analyses with minor adaptations.

Topic Modelling on Pharmaceutical Incident Data

European Journal of Mathematics and Statistics ◽

10.24018/ejmath.2021.2.3.33 ◽

2021 ◽

Vol 2 (3) ◽

pp. 92-96

Author(s):

Deepu Dileep ◽

Soumya Rudraraju ◽

V. V. HaraGopal

Keyword(s):

Pharmaceutical Industry ◽

Key Words ◽

Latent Dirichlet Allocation ◽

Topic Modelling ◽

Probability Of Occurrence ◽

Proposed Model ◽

Textual Data ◽

Incident Data ◽

Dirichlet Allocation

The focus of the current study is to explore and analyse textual data, in the form of incidents in the pharmaceutical industry, using topic modelling. The topic modelling applied in the current study is based on Latent Dirichlet Allocation. The proposed model is applied to a corpus of 190 incidents to retrieve the key words with the highest probability of occurrence, which are then used to form informative topics related to the incidents.
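Retrieving the key words with the highest probability can be sketched as a simple ranking over each topic's word distribution. The snippet below is only an illustration: the function name `top_words`, the vocabulary, and the probability values are made up and do not come from the study.

```python
def top_words(topic_word_probs, vocab, n=3):
    """Return the n most probable words for each topic (hypothetical helper)."""
    result = []
    for probs in topic_word_probs:
        ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
        result.append([word for word, _ in ranked[:n]])
    return result


# Illustrative vocabulary and per-topic word probabilities (invented values).
vocab = ["label", "batch", "temperature", "storage", "packaging"]
topic_word_probs = [
    [0.40, 0.30, 0.05, 0.05, 0.20],  # topic 0
    [0.05, 0.10, 0.45, 0.35, 0.05],  # topic 1
]

keywords = top_words(topic_word_probs, vocab, n=2)
print(keywords)  # [['label', 'batch'], ['temperature', 'storage']]
```

The ranked words for each topic are what an analyst would then read and label as an informative topic.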

A Decision Support System for Inbound Marketers: An Empirical Use of Latent Dirichlet Allocation Topic Model to Guide Infographic Designers

SSRN Electronic Journal ◽

10.2139/ssrn.2863111 ◽

2015 ◽

Author(s):

Meisam Hejazi Nia

Keyword(s):

Decision Support ◽

Decision Support System ◽

Support System ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Dirichlet Allocation
