Ensemble Methods for Improving Classification of Data Produced by Latent Dirichlet Allocation

2019 ◽  
Vol 0 (8/2018) ◽  
pp. 17-28
Author(s):  
Maciej Jankowski

Topic models are very popular methods of text analysis, and the most widely used topic modelling algorithm is LDA (Latent Dirichlet Allocation). Recently, many new methods have been proposed that enable this model to be used in large-scale processing. One remaining problem is that the data scientist has to choose the number of topics manually, a step that requires some prior analysis. Several methods have been proposed to automate this choice, but none of them works well when LDA is used as a preprocessing step for further classification. In this paper, we propose an ensemble approach that allows more than one model to be used in the prediction phase, reducing the need to find a single best number of topics. We also analyse several methods of estimating the number of topics.
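
A minimal sketch of the ensemble idea, assuming scikit-learn and a toy corpus (the documents, labels, and candidate topic counts below are placeholders, not the paper's data): several LDA models with different topic counts are fitted, and their per-document topic distributions are concatenated as features for a single downstream classifier.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus and labels (placeholders).
docs = [
    "the delivery arrived late and the package was damaged",
    "excellent customer service and a quick refund",
    "the manual is unclear and setup took hours",
    "very happy with the build quality of this product",
    "the battery drains quickly and support never replied",
    "works exactly as described, would buy again",
]
labels = [0, 1, 0, 1, 0, 1]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Fit one LDA model per candidate number of topics instead of choosing one.
topic_counts = [5, 10, 20]
lda_models = [
    LatentDirichletAllocation(n_components=k, random_state=0).fit(counts)
    for k in topic_counts
]

# Concatenate the per-document topic distributions from every model.
features = np.hstack([lda.transform(counts) for lda in lda_models])

# A single classifier is trained on the combined representation.
clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(clf.predict(features))
```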

Author(s):  
Xi Liu ◽  
Yongfeng Yin ◽  
Haifeng Li ◽  
Jiabin Chen ◽  
Chang Liu ◽  
...  

Existing intelligent software defect classification approaches do not consider radar characteristics or prior statistical information. As a result, when these approaches are applied to radar software testing and validation, the precision and recall of defect classification are poor, which limits the effective reuse of software defects. To solve this problem, this paper proposes a new intelligent defect classification approach for radar software based on the latent Dirichlet allocation (LDA) topic model. The proposed approach comprises a defect text segmentation algorithm based on a radar-domain dictionary, a modified LDA model that incorporates radar software requirements, and a topic acquisition and classification method for radar software defects built on the modified LDA model. The approach is applied to typical radar software defects to validate its effectiveness and applicability. The results show that the prediction precision and recall of the proposed approach improve by roughly 15-20% compared with other defect classification approaches. The proposed approach can therefore be applied effectively to the segmentation and classification of radar software defects and improves the adequacy of defect identification in radar software.
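
A generic sketch of this kind of pipeline, not the authors' modified LDA or radar-domain dictionary: placeholder defect reports are turned into LDA topic vectors, classified, and scored with precision and recall (scikit-learn assumed).

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Placeholder defect reports and defect classes.
reports = [
    "signal processing module returns wrong range value",
    "display freezes when track count exceeds limit",
    "antenna control timeout during calibration",
    "wrong doppler estimate under heavy clutter",
] * 10
classes = [0, 1, 2, 0] * 10

# Bag-of-words counts, then LDA topic vectors as classification features.
counts = CountVectorizer().fit_transform(reports)
topics = LatentDirichletAllocation(n_components=5, random_state=0).fit_transform(counts)

X_train, X_test, y_train, y_test = train_test_split(
    topics, classes, test_size=0.25, random_state=0
)
pred = GaussianNB().fit(X_train, y_train).predict(X_test)
print(precision_score(y_test, pred, average="macro", zero_division=0),
      recall_score(y_test, pred, average="macro", zero_division=0))
```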


2021 ◽  
Vol 297 ◽  
pp. 01071
Author(s):  
Sifi Fatima-Zahrae ◽  
Sabbar Wafae ◽  
El Mzabi Amal

Sentiment classification is one of the hottest research areas among Natural Language Processing (NLP) topics. It aims to detect the sentiment polarity of a given opinion and to classify it, which requires a large number of aspect extractions; however, extracting aspects takes human effort and considerable time. To reduce this effort, the Latent Dirichlet Allocation (LDA) method has recently been adopted to deal with this issue. In this paper, an efficient preprocessing method for sentiment classification is presented and is used to analyse users' comments on the Twitter social network. For this purpose, different text preprocessing techniques were applied to the dataset to obtain an acceptable standard text. Latent Dirichlet Allocation was then applied to the data obtained after this fast and accurate preprocessing phase. Different sentiment analysis methods were implemented, and their results were compared and evaluated. The experimental results show that the combined use of the proposed preprocessing method and Latent Dirichlet Allocation yields acceptable results compared with other baseline methods.
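
An illustrative sketch, not the authors' exact pipeline, of the preprocessing-then-LDA idea using gensim; the tweets and cleaning rules below are assumptions for demonstration.

```python
import re

from gensim import corpora
from gensim.models import LdaModel
from gensim.parsing.preprocessing import STOPWORDS

# Placeholder tweets.
tweets = [
    "Loving the new phone, the camera is great! http://example.com/x",
    "Battery life is terrible, very disappointed @brand",
    "Great service and fast delivery, thank you!",
]

def clean(text):
    # Strip URLs, mentions and hashtags, keep alphabetic tokens, drop stopwords.
    text = re.sub(r"http\S+|@\w+|#", "", text.lower())
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

tokenized = [clean(t) for t in tweets]
dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]

# Fit LDA on the cleaned corpus and inspect the discovered topics.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)
print(lda.print_topics())
```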


2021 ◽  
Vol 2 (3) ◽  
pp. 92-96
Author(s):  
Deepu Dileep ◽  
Soumya Rudraraju ◽  
V. V. HaraGopal

The focus of the current study is to explore and analyse textual data, in the form of incident reports from the pharmaceutical industry, using topic modelling. The topic modelling applied in this study is based on Latent Dirichlet Allocation. The proposed model is applied to a corpus of 190 incidents to retrieve the keywords with the highest probability of occurrence, which are then used to form informative topics related to the incidents.
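
A small sketch, assuming gensim and invented incident tokens rather than the study's 190-incident corpus, of how the highest-probability keywords per topic can be read off a fitted LDA model.

```python
from gensim import corpora
from gensim.models import LdaModel

# Invented, pre-tokenised incident descriptions (placeholders).
incidents = [
    ["deviation", "temperature", "storage", "excursion"],
    ["contamination", "cleanroom", "gowning", "procedure"],
    ["label", "mismatch", "batch", "record"],
    ["equipment", "calibration", "overdue", "sensor"],
]

dictionary = corpora.Dictionary(incidents)
corpus = [dictionary.doc2bow(doc) for doc in incidents]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)

# For each topic, list the terms with the highest probability of occurrence.
for topic_id in range(lda.num_topics):
    print(topic_id, lda.show_topic(topic_id, topn=5))
```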


2020 ◽  
Author(s):  
Holly Clarke ◽  
Stephen Clark ◽  
Mark Birkin ◽  
Heather Iles-Smith ◽  
Adam Glaser ◽  
...  

BACKGROUND Novel consumer and lifestyle data, for example those collected by supermarket loyalty cards or mobile phone exercise tracking apps, offer numerous benefits for researchers wishing to understand diet- and exercise-related risk factors for diseases. Yet limited research has addressed public attitudes towards linking these data with individual health records for research purposes.
OBJECTIVE The aim of this research was to identify key barriers to data linkage and to recommend safeguards and procedures that would encourage individuals to share these data for potential future research.
METHODS The LifeInfo Survey consulted the public on their attitudes towards sharing consumer and lifestyle data for research purposes. Where barriers to data sharing existed, participants provided unstructured survey responses detailing what would make them more likely to share data for linkage with their health record in the future. The topic modelling technique Latent Dirichlet Allocation (LDA) was used to analyse these textual responses and uncover common thematic topics within the texts.
RESULTS Participants provided responses related to sharing their store loyalty card data (n = 2,338) and health/fitness app data (n = 1,531). Key barriers to data sharing identified through topic modelling included data safety and security, personal privacy, a need for further information, fear of data being accessed by others, problems with data accuracy, not understanding the reason for data linkage, and not using data-producing services. We provide recommendations for addressing these issues to establish best practice for future researchers wishing to utilise these data.
CONCLUSIONS This study constitutes a large-scale consultation of public attitudes towards data linkage of this kind and, as such, is an important first step in understanding and addressing barriers to participation in research utilising novel consumer and lifestyle data.
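
A brief sketch, with invented responses, of how themes like the barriers above could be surfaced: each free-text response is assigned to its most probable LDA topic and the responses per theme are counted (gensim assumed; not the authors' exact analysis).

```python
from collections import Counter

from gensim import corpora
from gensim.models import LdaModel

# Invented, pre-tokenised free-text survey responses (placeholders).
responses = [
    ["worried", "data", "security", "hacked"],
    ["privacy", "personal", "information", "concern"],
    ["need", "more", "information", "purpose"],
    ["data", "security", "breach", "concern"],
]

dictionary = corpora.Dictionary(responses)
corpus = [dictionary.doc2bow(r) for r in responses]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)

# Assign each response to its most probable topic and count responses per theme.
dominant = [max(lda[bow], key=lambda pair: pair[1])[0] for bow in corpus]
print(Counter(dominant))
```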


2021 ◽  
Vol 2 (1) ◽  
pp. 12-19
Author(s):  
Ahmad Syaifuddin ◽  
Reddy Alexandro Harianto ◽  
Joan Santoso

WhatsApp is one of the most popular chat applications, especially in Indonesia. WhatsApp data are distinctive because message patterns and topics are highly varied and change very quickly, so identifying a topic from a collection of messages is very difficult and time-consuming when done manually. One way to obtain implicit information from such social media is topic modelling. This study analyses the application of the LDA (Latent Dirichlet Allocation) method to identify which topics are being discussed in WhatsApp groups at Universitas Islam Majapahit, and experiments with topic modelling by adding a time attribute when constructing the documents. The study produces topic models and f-measure evaluation scores for those models based on the experiments conducted. The LDA method was implemented with an LDA library in Python, applying standard text preprocessing and adding slang-word removal to handle non-standard words and abbreviations in the chat logs. The topic models were evaluated with a human-in-the-loop test using a word intrusion task given to an Indonesian-language expert. The best result was obtained by grouping messages into 10-minute documents and merging them with reply chats in the WhatsApp group conversations, which proved to be an effective way of improving topic modelling with the Latent Dirichlet Allocation (LDA) algorithm, yielding a precision of 0.9294, a recall of 0.7900 and an f-measure of 0.8541.
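
A rough sketch, with invented timestamps and messages, of the two quantitative steps mentioned above: grouping chat messages into 10-minute documents and computing the F-measure from precision and recall.

```python
from datetime import datetime, timedelta

# Invented timestamps and messages standing in for WhatsApp chat logs.
messages = [
    (datetime(2021, 3, 1, 9, 0), "exam schedule for next week"),
    (datetime(2021, 3, 1, 9, 4), "assignment deadline is tomorrow"),
    (datetime(2021, 3, 1, 9, 25), "zoom link for the morning lecture"),
]

# Group messages into 10-minute windows to form the documents for LDA.
window = timedelta(minutes=10)
documents, current, window_start = [], [], messages[0][0]
for timestamp, text in messages:
    if timestamp - window_start >= window:
        documents.append(" ".join(current))
        current, window_start = [], timestamp
    current.append(text)
documents.append(" ".join(current))

# F-measure from the precision and recall reported in the abstract.
precision, recall = 0.9294, 0.7900
f_measure = 2 * precision * recall / (precision + recall)
print(documents, round(f_measure, 4))  # ≈ 0.854, in line with the reported 0.8541
```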


2021 ◽  
Vol 13 (19) ◽  
pp. 10856
Author(s):  
I-Cheng Chang ◽  
Tai-Kuei Yu ◽  
Yu-Jie Chang ◽  
Tai-Yi Yu

Facing the big data wave, this study applied artificial intelligence to knowledge discovery and sought a feasible process that can play a crucial role in supplying innovative value in environmental education. Intelligent agents and natural language processing (NLP) are two key areas leading the trend in artificial intelligence; this research adopted NLP to analyze the research topics of environmental education journals in the Web of Science (WoS) database during 2011-2020 and to interpret the categories and characteristics of abstracts of environmental education papers. The corpus was drawn from the abstracts and keywords of journal papers, which were analyzed with text mining, cluster analysis, latent Dirichlet allocation (LDA), and co-word analysis methods. The classification of feature words was determined and reviewed by domain experts, and the associated TF-IDF weights were calculated for the subsequent cluster analysis, which combined hierarchical clustering and K-means analysis. Hierarchical clustering and LDA set the required number of categories at seven, and the K-means cluster analysis classified the documents into seven categories. This study used co-word analysis to check the suitability of the K-means classification, analyzed the terms with high TF-IDF weights for the distinct K-means groups, and examined the terms for the different topics with the LDA technique. A comparison of the results showed that most categories recognized by the K-means and LDA methods were the same and shared similar words; however, two categories showed slight differences. The involvement of field experts helped ensure the consistency and correctness of the classified topics and documents.
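
A simplified sketch of the clustering stage described above, assuming scikit-learn and SciPy with placeholder abstracts: TF-IDF weights are computed, hierarchical (Ward) clustering suggests a number of groups, and K-means assigns documents to that many clusters.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder abstracts standing in for the WoS corpus.
abstracts = [
    "outdoor environmental education programmes for primary schools",
    "climate change literacy and curriculum design for teachers",
    "citizen science projects engaging local communities",
    "teacher training for sustainability and environmental awareness",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(abstracts)

# Ward-linkage hierarchical clustering on the dense TF-IDF matrix;
# cutting the dendrogram suggests how many clusters to request from K-means.
Z = linkage(tfidf.toarray(), method="ward")
n_clusters = len(set(fcluster(Z, t=2, criterion="maxclust")))

labels = KMeans(n_clusters=n_clusters, random_state=0, n_init=10).fit_predict(tfidf)
print(n_clusters, labels)
```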


PLoS ONE ◽  
2021 ◽  
Vol 16 (1) ◽  
pp. e0243208
Author(s):  
Leacky Muchene ◽  
Wende Safari

Unsupervised statistical analysis of unstructured data has gained wide acceptance, especially in the natural language processing and text mining domains. Topic modelling with Latent Dirichlet Allocation is one such statistical tool that has been successfully applied to synthesize collections of legal and biomedical documents and journalistic topics. We applied a novel two-stage topic modelling approach and illustrated the methodology with data from a collection of published abstracts from the University of Nairobi, Kenya. In the first stage, topic modelling with Latent Dirichlet Allocation was applied to derive the per-document topic probabilities. To present the topics more succinctly, in the second stage, hierarchical clustering with the Hellinger distance was applied to derive the final clusters of topics. The analysis showed that the dominant research themes in the university include HIV and malaria research, research on agricultural and veterinary services, and cross-cutting themes in the humanities and social sciences. Further, the use of hierarchical clustering in the second stage reduces the discovered latent topics to clusters of homogeneous topics.
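
A compact sketch of the second stage, assuming placeholder per-document topic probabilities from stage one: documents are compared with the Hellinger distance and grouped by hierarchical clustering (SciPy assumed).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Placeholder per-document topic probabilities from a first-stage LDA fit.
doc_topics = np.array([
    [0.70, 0.20, 0.10],
    [0.65, 0.25, 0.10],
    [0.10, 0.15, 0.75],
    [0.05, 0.20, 0.75],
])

def hellinger(p, q):
    # Hellinger distance between two discrete probability distributions.
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Pairwise Hellinger distances, then average-linkage hierarchical clustering.
distances = pdist(doc_topics, metric=hellinger)
clusters = fcluster(linkage(distances, method="average"), t=2, criterion="maxclust")
print(clusters)  # e.g. [1 1 2 2]
```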

