Ensemble Methods for Improving Classification of Data Produced by Latent Dirichlet Allocation

2019 ◽  
Vol 0 (8/2018) ◽  
pp. 17-28
Author(s):  
Maciej Jankowski

Topic models are very popular methods of text analysis, and the most widely used topic modelling algorithm is LDA (Latent Dirichlet Allocation). Recently, many new methods have been proposed that enable this model to be used in large-scale processing. One remaining problem is that the data scientist has to choose the number of topics manually, a step that requires some prior analysis. Several methods have been proposed to automate this choice, but none of them works well when LDA is used as a preprocessing step for further classification. In this paper, we propose an ensemble approach that allows more than one model to be used in the prediction phase, reducing the need to find a single best number of topics. We also analyse several methods of estimating the number of topics.
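
A minimal sketch of the ensemble idea, assuming scikit-learn and a toy corpus (the documents, labels, and candidate topic counts below are placeholders, not the paper's data): several LDA models with different topic counts are fitted, and their per-document topic distributions are concatenated as features for a single downstream classifier.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus and labels (placeholders).
docs = [
    "the delivery arrived late and the package was damaged",
    "excellent customer service and a quick refund",
    "the manual is unclear and setup took hours",
    "very happy with the build quality of this product",
    "the battery drains quickly and support never replied",
    "works exactly as described, would buy again",
]
labels = [0, 1, 0, 1, 0, 1]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Fit one LDA model per candidate number of topics instead of choosing one.
topic_counts = [5, 10, 20]
lda_models = [
    LatentDirichletAllocation(n_components=k, random_state=0).fit(counts)
    for k in topic_counts
]

# Concatenate the per-document topic distributions from every model.
features = np.hstack([lda.transform(counts) for lda in lda_models])

# A single classifier is trained on the combined representation.
clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(clf.predict(features))
```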

Author(s):  
Xi Liu ◽  
Yongfeng Yin ◽  
Haifeng Li ◽  
Jiabin Chen ◽  
Chang Liu ◽  
...  

Existing intelligent software defect classification approaches do not consider radar characteristics or prior statistical information. As a result, when these approaches are applied to radar software testing and validation, the precision and recall of defect classification are poor, which limits the effective reuse of software defects. To solve this problem, this paper proposes a new intelligent defect classification approach for radar software based on the latent Dirichlet allocation (LDA) topic model. The proposed approach comprises a defect text segmentation algorithm based on a radar-domain dictionary, a modified LDA model that incorporates radar software requirements, and a topic acquisition and classification method for radar software defects built on the modified LDA model. The approach is applied to typical radar software defects to validate its effectiveness and applicability. The results show that the prediction precision and recall of the proposed approach improve by roughly 15-20% compared with other defect classification approaches. The proposed approach can therefore be applied effectively to the segmentation and classification of radar software defects and improves the adequacy of defect identification in radar software.
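
A generic sketch of this kind of pipeline, not the authors' modified LDA or radar-domain dictionary: placeholder defect reports are turned into LDA topic vectors, classified, and scored with precision and recall (scikit-learn assumed).

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Placeholder defect reports and defect classes.
reports = [
    "signal processing module returns wrong range value",
    "display freezes when track count exceeds limit",
    "antenna control timeout during calibration",
    "wrong doppler estimate under heavy clutter",
] * 10
classes = [0, 1, 2, 0] * 10

# Bag-of-words counts, then LDA topic vectors as classification features.
counts = CountVectorizer().fit_transform(reports)
topics = LatentDirichletAllocation(n_components=5, random_state=0).fit_transform(counts)

X_train, X_test, y_train, y_test = train_test_split(
    topics, classes, test_size=0.25, random_state=0
)
pred = GaussianNB().fit(X_train, y_train).predict(X_test)
print(precision_score(y_test, pred, average="macro", zero_division=0),
      recall_score(y_test, pred, average="macro", zero_division=0))
```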


2021 ◽  
Vol 297 ◽  
pp. 01071
Author(s):  
Sifi Fatima-Zahrae ◽  
Sabbar Wafae ◽  
El Mzabi Amal

Sentiment classification is one of the hottest research areas among Natural Language Processing (NLP) topics. It aims to detect the sentiment polarity of a given opinion and to classify it, which requires a large number of aspect extractions; however, extracting aspects takes human effort and considerable time. To reduce this effort, the Latent Dirichlet Allocation (LDA) method has recently been adopted to deal with this issue. In this paper, an efficient preprocessing method for sentiment classification is presented and is used to analyse users' comments on the Twitter social network. For this purpose, different text preprocessing techniques were applied to the dataset to obtain an acceptable standard text. Latent Dirichlet Allocation was then applied to the data obtained after this fast and accurate preprocessing phase. Different sentiment analysis methods were implemented, and their results were compared and evaluated. The experimental results show that the combined use of the proposed preprocessing method and Latent Dirichlet Allocation yields acceptable results compared with other baseline methods.
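
An illustrative sketch, not the authors' exact pipeline, of the preprocessing-then-LDA idea using gensim; the tweets and cleaning rules below are assumptions for demonstration.

```python
import re

from gensim import corpora
from gensim.models import LdaModel
from gensim.parsing.preprocessing import STOPWORDS

# Placeholder tweets.
tweets = [
    "Loving the new phone, the camera is great! http://example.com/x",
    "Battery life is terrible, very disappointed @brand",
    "Great service and fast delivery, thank you!",
]

def clean(text):
    # Strip URLs, mentions and hashtags, keep alphabetic tokens, drop stopwords.
    text = re.sub(r"http\S+|@\w+|#", "", text.lower())
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

tokenized = [clean(t) for t in tweets]
dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]

# Fit LDA on the cleaned corpus and inspect the discovered topics.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)
print(lda.print_topics())
```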


2021 ◽  
Vol 2 (3) ◽  
pp. 92-96
Author(s):  
Deepu Dileep ◽  
Soumya Rudraraju ◽  
V. V. HaraGopal

The focus of the current study is to explore and analyse textual data, in the form of incident reports from the pharmaceutical industry, using topic modelling. The topic modelling applied in this study is based on Latent Dirichlet Allocation. The proposed model is applied to a corpus of 190 incidents to retrieve the keywords with the highest probability of occurrence, which are then used to form informative topics related to the incidents.
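
A small sketch, assuming gensim and invented incident tokens rather than the study's 190-incident corpus, of how the highest-probability keywords per topic can be read off a fitted LDA model.

```python
from gensim import corpora
from gensim.models import LdaModel

# Invented, pre-tokenised incident descriptions (placeholders).
incidents = [
    ["deviation", "temperature", "storage", "excursion"],
    ["contamination", "cleanroom", "gowning", "procedure"],
    ["label", "mismatch", "batch", "record"],
    ["equipment", "calibration", "overdue", "sensor"],
]

dictionary = corpora.Dictionary(incidents)
corpus = [dictionary.doc2bow(doc) for doc in incidents]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)

# For each topic, list the terms with the highest probability of occurrence.
for topic_id in range(lda.num_topics):
    print(topic_id, lda.show_topic(topic_id, topn=5))
```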


2020 ◽  
Author(s):  
Holly Clarke ◽  
Stephen Clark ◽  
Mark Birkin ◽  
Heather Iles-Smith ◽  
Adam Glaser ◽  
...  

BACKGROUND Novel consumer and lifestyle data, for example those collected by supermarket loyalty cards or mobile phone exercise tracking apps, offer numerous benefits for researchers wishing to understand diet- and exercise-related risk factors for diseases. Yet limited research has addressed public attitudes towards linking these data with individual health records for research purposes.
OBJECTIVE The aim of this research was to identify key barriers to data linkage and to recommend safeguards and procedures that would encourage individuals to share these data for potential future research.
METHODS The LifeInfo Survey consulted the public on their attitudes towards sharing consumer and lifestyle data for research purposes. Where barriers to data sharing existed, participants provided unstructured survey responses detailing what would make them more likely to share data for linkage with their health record in the future. The topic modelling technique Latent Dirichlet Allocation (LDA) was used to analyse these textual responses and uncover common thematic topics within the texts.
RESULTS Participants provided responses related to sharing their store loyalty card data (n = 2,338) and health/fitness app data (n = 1,531). Key barriers to data sharing identified through topic modelling included data safety and security, personal privacy, a need for further information, fear of data being accessed by others, problems with data accuracy, not understanding the reason for data linkage, and not using data-producing services. We provide recommendations for addressing these issues to establish best practice for future researchers wishing to utilise these data.
CONCLUSIONS This study constitutes a large-scale consultation of public attitudes towards data linkage of this kind and, as such, is an important first step in understanding and addressing barriers to participation in research utilising novel consumer and lifestyle data.
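
A brief sketch, with invented responses, of how themes like the barriers above could be surfaced: each free-text response is assigned to its most probable LDA topic and the responses per theme are counted (gensim assumed; not the authors' exact analysis).

```python
from collections import Counter

from gensim import corpora
from gensim.models import LdaModel

# Invented, pre-tokenised free-text survey responses (placeholders).
responses = [
    ["worried", "data", "security", "hacked"],
    ["privacy", "personal", "information", "concern"],
    ["need", "more", "information", "purpose"],
    ["data", "security", "breach", "concern"],
]

dictionary = corpora.Dictionary(responses)
corpus = [dictionary.doc2bow(r) for r in responses]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)

# Assign each response to its most probable topic and count responses per theme.
dominant = [max(lda[bow], key=lambda pair: pair[1])[0] for bow in corpus]
print(Counter(dominant))
```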


2021 ◽  
Vol 2 (1) ◽  
pp. 12-19
Author(s):  
Ahmad Syaifuddin ◽  
Reddy Alexandro Harianto ◽  
Joan Santoso

WhatsApp is one of the most popular chat applications, especially in Indonesia. WhatsApp data are distinctive because message patterns and topics are highly varied and change very quickly, so identifying a topic from a collection of messages is very difficult and time-consuming when done manually. One way to obtain implicit information from such social media is topic modelling. This study analyses the application of the LDA (Latent Dirichlet Allocation) method to identify which topics are being discussed in WhatsApp groups at Universitas Islam Majapahit, and experiments with topic modelling by adding a time attribute when constructing the documents. The study produces topic models and f-measure evaluation scores for those models based on the experiments conducted. The LDA method was implemented with an LDA library in Python, applying standard text preprocessing and adding slang-word removal to handle non-standard words and abbreviations in the chat logs. The topic models were evaluated with a human-in-the-loop test using a word intrusion task given to an Indonesian-language expert. The best result was obtained by grouping messages into 10-minute documents and merging them with reply chats in the WhatsApp group conversations, which proved to be an effective way of improving topic modelling with the Latent Dirichlet Allocation (LDA) algorithm, yielding a precision of 0.9294, a recall of 0.7900 and an f-measure of 0.8541.
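
A rough sketch, with invented timestamps and messages, of the two quantitative steps mentioned above: grouping chat messages into 10-minute documents and computing the F-measure from precision and recall.

```python
from datetime import datetime, timedelta

# Invented timestamps and messages standing in for WhatsApp chat logs.
messages = [
    (datetime(2021, 3, 1, 9, 0), "exam schedule for next week"),
    (datetime(2021, 3, 1, 9, 4), "assignment deadline is tomorrow"),
    (datetime(2021, 3, 1, 9, 25), "zoom link for the morning lecture"),
]

# Group messages into 10-minute windows to form the documents for LDA.
window = timedelta(minutes=10)
documents, current, window_start = [], [], messages[0][0]
for timestamp, text in messages:
    if timestamp - window_start >= window:
        documents.append(" ".join(current))
        current, window_start = [], timestamp
    current.append(text)
documents.append(" ".join(current))

# F-measure from the precision and recall reported in the abstract.
precision, recall = 0.9294, 0.7900
f_measure = 2 * precision * recall / (precision + recall)
print(documents, round(f_measure, 4))  # ≈ 0.854, in line with the reported 0.8541
```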


2021 ◽  
Vol 13 (19) ◽  
pp. 10856
Author(s):  
I-Cheng Chang ◽  
Tai-Kuei Yu ◽  
Yu-Jie Chang ◽  
Tai-Yi Yu

Facing the big data wave, this study applied artificial intelligence to knowledge discovery and sought a feasible process that can play a crucial role in supplying innovative value in environmental education. Intelligent agents and natural language processing (NLP) are two key areas leading the trend in artificial intelligence; this research adopted NLP to analyze the research topics of environmental education journals in the Web of Science (WoS) database during 2011-2020 and to interpret the categories and characteristics of abstracts of environmental education papers. The corpus was drawn from the abstracts and keywords of journal papers, which were analyzed with text mining, cluster analysis, latent Dirichlet allocation (LDA), and co-word analysis methods. The classification of feature words was determined and reviewed by domain experts, and the associated TF-IDF weights were calculated for the subsequent cluster analysis, which combined hierarchical clustering and K-means analysis. Hierarchical clustering and LDA set the required number of categories at seven, and the K-means cluster analysis classified the documents into seven categories. This study used co-word analysis to check the suitability of the K-means classification, analyzed the terms with high TF-IDF weights for the distinct K-means groups, and examined the terms for the different topics with the LDA technique. A comparison of the results showed that most categories recognized by the K-means and LDA methods were the same and shared similar words; however, two categories showed slight differences. The involvement of field experts helped ensure the consistency and correctness of the classified topics and documents.
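
A simplified sketch of the clustering stage described above, assuming scikit-learn and SciPy with placeholder abstracts: TF-IDF weights are computed, hierarchical (Ward) clustering suggests a number of groups, and K-means assigns documents to that many clusters.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder abstracts standing in for the WoS corpus.
abstracts = [
    "outdoor environmental education programmes for primary schools",
    "climate change literacy and curriculum design for teachers",
    "citizen science projects engaging local communities",
    "teacher training for sustainability and environmental awareness",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(abstracts)

# Ward-linkage hierarchical clustering on the dense TF-IDF matrix;
# cutting the dendrogram suggests how many clusters to request from K-means.
Z = linkage(tfidf.toarray(), method="ward")
n_clusters = len(set(fcluster(Z, t=2, criterion="maxclust")))

labels = KMeans(n_clusters=n_clusters, random_state=0, n_init=10).fit_predict(tfidf)
print(n_clusters, labels)
```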


PLoS ONE ◽  
2021 ◽  
Vol 16 (1) ◽  
pp. e0243208
Author(s):  
Leacky Muchene ◽  
Wende Safari

Unsupervised statistical analysis of unstructured data has gained wide acceptance, especially in the natural language processing and text mining domains. Topic modelling with Latent Dirichlet Allocation is one such statistical tool that has been successfully applied to synthesize collections of legal and biomedical documents and journalistic topics. We applied a novel two-stage topic modelling approach and illustrated the methodology with data from a collection of published abstracts from the University of Nairobi, Kenya. In the first stage, topic modelling with Latent Dirichlet Allocation was applied to derive the per-document topic probabilities. To present the topics more succinctly, in the second stage, hierarchical clustering with the Hellinger distance was applied to derive the final clusters of topics. The analysis showed that the dominant research themes in the university include HIV and malaria research, research on agricultural and veterinary services, and cross-cutting themes in the humanities and social sciences. Further, the use of hierarchical clustering in the second stage reduces the discovered latent topics to clusters of homogeneous topics.
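
A compact sketch of the second stage, assuming placeholder per-document topic probabilities from stage one: documents are compared with the Hellinger distance and grouped by hierarchical clustering (SciPy assumed).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Placeholder per-document topic probabilities from a first-stage LDA fit.
doc_topics = np.array([
    [0.70, 0.20, 0.10],
    [0.65, 0.25, 0.10],
    [0.10, 0.15, 0.75],
    [0.05, 0.20, 0.75],
])

def hellinger(p, q):
    # Hellinger distance between two discrete probability distributions.
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Pairwise Hellinger distances, then average-linkage hierarchical clustering.
distances = pdist(doc_topics, metric=hellinger)
clusters = fcluster(linkage(distances, method="average"), t=2, criterion="maxclust")
print(clusters)  # e.g. [1 1 2 2]
```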

