Network Pseudohealth Information Recognition Model: An Integrated Architecture of Latent Dirichlet Allocation and Data Block Update

Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-12
Author(s):  
Jie Zhang ◽  
Pingping Sun ◽  
Feng Zhao ◽  
Qianru Guo ◽  
Yue Zou

The wanton dissemination of network pseudohealth information has brought great harm to people’s health, lives, and property, making its detection and identification important. This paper therefore defines the concepts of pseudohealth information, data block, and data block integration; designs an architecture that combines the latent Dirichlet allocation (LDA) algorithm with data block update integration; and proposes the resulting combination algorithm model. In addition, crawler technology is used to collect the pseudohealth information circulating on the Sina Weibo platform during the epidemic period from February to March 2020, yielding an experimental case dataset for simulation testing. The results show that (1) the LDA model can deeply mine the semantics of network pseudohealth information, obtain document-topic distribution features, and use those topic features as input variables for classification and training; (2) the dataset partitioning method can effectively block data according to the text attributes and class labels of network pseudohealth information, and the data block reintegration method can accurately classify and integrate the blocked data; and (3) given that the combination model has certain limitations in detecting network pseudohealth information, the support vector machine (SVM) model can extract the granular content of data blocks in pseudohealth information in real time, greatly improving the recognition performance of the combination model.
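To make the pipeline concrete, here is a minimal sketch of the LDA-to-SVM stage described above. It assumes scikit-learn's implementations and an invented toy corpus, and it substitutes a plain pipeline for the paper's data block update integration; it illustrates the technique, not the authors' code.

```python
# Minimal sketch: LDA document-topic distributions as SVM features.
# Assumptions (not from the paper): scikit-learn's LDA, a toy corpus,
# and no data-block update step.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

# Toy labeled posts: 1 = pseudohealth information, 0 = legitimate.
posts = [
    "garlic water cures the virus overnight",
    "vitamin megadoses guarantee immunity",
    "wash hands regularly and wear a mask",
    "vaccines are tested in controlled clinical trials",
]
labels = [1, 1, 0, 0]

pipeline = Pipeline([
    ("counts", CountVectorizer()),
    # The document-topic distributions become the SVM's input features,
    # mirroring finding (1) above.
    ("lda", LatentDirichletAllocation(n_components=2, random_state=0)),
    ("svm", SVC(kernel="rbf")),
])
pipeline.fit(posts, labels)
print(pipeline.predict(["drinking bleach kills all germs safely"]))
```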

2021 ◽  
Vol 13 (3) ◽  
pp. 128-133
Author(s):  
Attala Rafid Abelard ◽  
Yuliant Sibaroni

Among the many film streaming platforms that have sprung up, Netflix has the most subscribers. However, not all reviews written by Netflix users are positive. In this study, reviews written on the Google Play Store are analyzed with the Latent Dirichlet Allocation (LDA) method to determine which aspects users review. A classification step using the Support Vector Machine (SVM) method then assigns each review to the positive or negative class (sentiment analysis). Two scenarios were carried out. The first scenario found that the best number of LDA topics is 40; the second found that applying a filtering step in the preprocessing stage lowers the F1-score. The best performance, an F1-score of 78.15%, was therefore obtained with LDA and SVM using 40 topics and without the filtering step.
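A hedged sketch of the first scenario follows: sweeping the LDA topic count and scoring an SVM sentiment classifier by F1. The review texts, labels, and candidate topic counts are placeholders, not the study's data.

```python
# Sketch: choose the LDA topic count by cross-validated F1 of the SVM.
# Toy reviews and labels; the study reported 40 topics as best.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

reviews = ["great catalog and smooth playback", "app crashes constantly",
           "love the originals", "billing is a mess",
           "works fine on my tv", "too expensive for the content offered"]
sentiment = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

for n_topics in (10, 20, 40):
    model = Pipeline([
        ("counts", CountVectorizer()),
        ("lda", LatentDirichletAllocation(n_components=n_topics,
                                          random_state=0)),
        ("svm", LinearSVC()),
    ])
    f1 = cross_val_score(model, reviews, sentiment, cv=2,
                         scoring="f1").mean()
    print(f"{n_topics} topics -> mean F1 = {f1:.3f}")
```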


2021 ◽  
Vol 8 (6) ◽  
pp. 1265
Author(s):  
Muhammad Alkaff ◽  
Andreyan Rizky Baskara ◽  
Irham Maulani

Lapor! is a service system through which the Indonesian public submits aspirations and complaints about government services. The Government has long used the system to address citizens' bureaucratic problems. However, the growing volume of reports, which operators sort by reading every incoming complaint, causes frequent errors in which operators forward reports to the wrong agency. A solution is therefore needed that can determine a report's context automatically using Natural Language Processing techniques. This study aims to build an automatic report classifier that routes reports to the authorized agency by topic, combining Latent Dirichlet Allocation (LDA) and Support Vector Machine (SVM). Topic modeling for each report is carried out with the LDA method, which extracts reports to find specific patterns in documents and outputs topic distribution values. The classification step that determines a report's destination agency is then carried out with an SVM over the topic values extracted by LDA. The LDA-SVM model's performance is measured with a confusion matrix, computing accuracy, precision, recall, and F1 score. Test results using a 70:30 train-test split show that the model performs well, with 79.85% accuracy, 79.98% precision, 72.37% recall, and a 74.67% F1 score.
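The evaluation step translates directly into code. Below is a minimal sketch of the 70:30 split and the confusion-matrix-derived scores; the reports, agency labels, and sizes are illustrative stand-ins for the Lapor! data.

```python
# Sketch: 70:30 train-test split, LDA features, SVM routing, and
# accuracy/precision/recall/F1 as in the study's evaluation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

reports = ["the road in my district is full of potholes",
           "national exam registration site is down",
           "bridge railing collapsed near the market",
           "school lacks certified teachers",
           "traffic light broken at the main junction",
           "scholarship payments are delayed again"]
agency = ["public_works", "education", "public_works",
          "education", "public_works", "education"]

X_train, X_test, y_train, y_test = train_test_split(
    reports, agency, test_size=0.3, stratify=agency, random_state=0)

model = Pipeline([
    ("counts", CountVectorizer()),
    ("lda", LatentDirichletAllocation(n_components=4, random_state=0)),
    ("svm", LinearSVC()),
]).fit(X_train, y_train)

pred = model.predict(X_test)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_test, pred, average="macro", zero_division=0)
print(f"accuracy={accuracy_score(y_test, pred):.2%} "
      f"precision={prec:.2%} recall={rec:.2%} f1={f1:.2%}")
```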


2018 ◽  
Vol 61 (1) ◽  
pp. 64-76
Author(s):  
Susan (Sixue) Jia

Fitness clubs have never ceased searching for quality-improvement opportunities to better serve their exercisers, and exercisers in turn post online ratings and reviews of fitness clubs. Studied together, the quantitative ratings and qualitative reviews can comprehensively depict exercisers’ perception of fitness clubs, but the typological and dimensional discrepancies between the two data sets have hindered joint study that fully exploits their business value. This study bridges the gap by examining 53,979 pairs of exerciser online ratings and reviews from 100 fitness clubs in Shanghai, China. Using latent Dirichlet allocation (LDA) based text mining, we identified the 17 major topics exercisers wrote about. A support vector machine (SVM) classifier was then employed to establish the rating-review relations, with an accuracy rate of up to 86%. Finally, the relative impact of each topic on exerciser satisfaction was computed and compared by introducing virtual reviews. The significance of this study is that it systematically creates a standardized protocol for mining and correlating the massive structured/quantitative and unstructured/qualitative data available online, readily transferable to other service and product sectors.
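One plausible reading of the virtual-review step, sketched below, is that a virtual review is a synthetic topic vector loading entirely on a single topic, so the SVM's decision value for it approximates that topic's pull toward satisfaction. The corpus, labels, and topic count here are invented for illustration; the study used 17 topics and real club reviews.

```python
# Sketch: probe a fitted topic->satisfaction SVM with one-hot
# "virtual reviews" to rank each topic's impact. Toy data throughout.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC

reviews = ["trainers are friendly and helpful", "locker room smells awful",
           "great spinning classes", "machines are always broken",
           "coaches design a personal plan", "showers are cold and dirty"]
satisfied = np.array([1, 0, 1, 0, 1, 0])

counts = CountVectorizer().fit_transform(reviews)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
topic_dist = lda.fit_transform(counts)   # document-topic features

svm = LinearSVC().fit(topic_dist, satisfied)

# Each virtual review loads on exactly one topic.
virtual_reviews = np.eye(lda.n_components)
for t, score in enumerate(svm.decision_function(virtual_reviews)):
    print(f"topic {t}: impact {score:+.3f}")
```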


PLoS ONE ◽  
2020 ◽  
Vol 15 (11) ◽  
pp. e0241701
Author(s):  
P. Celard ◽  
A. Seara Vieira ◽  
E. L. Iglesias ◽  
L. Borrajo

This work presents an alternative method of representing documents based on LDA (Latent Dirichlet Allocation) and examines how it affects classification algorithms compared with a common text representation. LDA assumes that each document deals with a set of predefined topics, which are distributions over an entire vocabulary. Our main objective is to use the probability of a document belonging to each topic as a new text representation model. The proposed technique is deployed as a new filter extending the Weka software. To demonstrate its performance, the filter is tested with different classifiers, such as a Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), and Naive Bayes, on several document corpora (OHSUMED, Reuters-21578, 20Newsgroup, Yahoo! Answers, YELP Polarity, and TREC Genomics 2015), and compared with the Bag of Words (BoW) representation technique. Results suggest that the proposed filter achieves accuracy similar to BoW while greatly improving classification processing times.
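The paper ships its representation as a Weka filter; a rough scikit-learn analog (an assumption for illustration, not the authors' code) is a pipeline step that replaces raw counts with LDA topic proportions. The sketch below feeds both representations to the same SVM for a side-by-side accuracy check on toy data.

```python
# Sketch: BoW vs. LDA-topic-proportion representations under one SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

docs = ["protein folding pathways in yeast", "stock markets rallied today",
        "gene expression under heat stress", "central bank raises rates",
        "enzyme kinetics of the mutant strain", "quarterly earnings beat forecasts"]
labels = [0, 1, 0, 1, 0, 1]  # 0 = biomedical, 1 = finance

bow = Pipeline([("counts", CountVectorizer()), ("svm", LinearSVC())])
lda = Pipeline([("counts", CountVectorizer()),
                ("lda", LatentDirichletAllocation(n_components=2,
                                                  random_state=0)),
                ("svm", LinearSVC())])

for name, model in (("BoW", bow), ("LDA", lda)):
    acc = cross_val_score(model, docs, labels, cv=2).mean()
    print(f"{name}: mean accuracy = {acc:.3f}")
```

The LDA variant compresses the feature space from vocabulary size down to the topic count, which is where the reported speedup in classification comes from.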


2021 ◽  
Vol 7 ◽  
pp. e412
Author(s):  
Chiranjibi Sitaula ◽  
Anish Basnet ◽  
Sunil Aryal

Document representations that include outlier tokens degrade classification performance because of the uncertain orientation of such tokens. Most existing document representation methods, in Nepali as in other languages, ignore strategies for filtering them out of documents before learning representations. In this article, we propose a novel document representation method based on a supervised codebook for Nepali documents, where the codebook contains only semantic tokens without outliers. The codebook is domain-specific: it is built from tokens in a given corpus that have higher similarities with the corpus's class labels. Our method adopts a simple yet effective representation for each word, called probability-based word embedding. To show the efficacy of our method, we evaluate its performance in the document classification task using a Support Vector Machine and validate it against widely used document representation methods such as Bag of Words, Latent Dirichlet Allocation, Long Short-Term Memory, Word2Vec, and Bidirectional Encoder Representations from Transformers, using four Nepali text datasets (denoted A1, A2, A3, and A4). The experimental results show that our method produces state-of-the-art classification performance (77.46% accuracy on A1, 67.53% on A2, 80.54% on A3, and 89.58% on A4) compared with the widely used existing document representation methods, yielding the best classification accuracy on three datasets (A1, A2, and A3) and comparable accuracy on the fourth (A4). Furthermore, we introduce the largest Nepali document dataset (A4), called the NepaliLinguistic dataset, to the linguistic community.
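A loose sketch of the supervised-codebook idea follows: score each token by the class distribution of the documents it appears in, keep only tokens that lean strongly toward some class (dropping outlier tokens), and represent a document by the average class-probability vectors of its kept tokens. The corpus, threshold, and scoring rule are illustrative assumptions, not the authors' exact probability-based word embedding.

```python
# Sketch: build a class-leaning codebook from token-class co-occurrence,
# then represent documents over it. Toy English corpus stands in for Nepali.
from collections import Counter, defaultdict
import numpy as np

docs = [("economic growth slows amid inflation", "economy"),
        ("team wins the national football title", "sports"),
        ("markets react to the new tax policy", "economy"),
        ("injured striker misses the final match", "sports")]
classes = sorted({label for _, label in docs})

# Estimate P(class | token) from token-class co-occurrence counts.
token_class = defaultdict(Counter)
for text, label in docs:
    for tok in text.split():
        token_class[tok][label] += 1

codebook = {}
for tok, cnt in token_class.items():
    probs = np.array([cnt[c] for c in classes], dtype=float)
    probs /= probs.sum()
    if probs.max() >= 0.75:       # keep only class-leaning tokens
        codebook[tok] = probs     # a probability-based word embedding

def represent(text):
    """Average the class-probability vectors of in-codebook tokens."""
    vecs = [codebook[t] for t in text.split() if t in codebook]
    return np.mean(vecs, axis=0) if vecs else np.zeros(len(classes))

print(represent("inflation hits football ticket prices"))
```

The resulting fixed-length vectors could then be fed to an SVM, as in the paper's classification experiments.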

