LDA filter: A Latent Dirichlet Allocation preprocess method for Weka

PLoS ONE ◽  
2020 ◽  
Vol 15 (11) ◽  
pp. e0241701
Author(s):  
P. Celard ◽  
A. Seara Vieira ◽  
E. L. Iglesias ◽  
L. Borrajo

This work presents an alternative method for representing documents based on LDA (Latent Dirichlet Allocation) and examines how it affects classification algorithms in comparison with common text representations. LDA assumes that each document deals with a set of predefined topics, which are distributions over an entire vocabulary. Our main objective is to use the probability of a document belonging to each topic to implement a new text representation model. The proposed technique is deployed as a new filter that extends the Weka software. To demonstrate its performance, the filter is tested with different classifiers, such as a Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), and Naive Bayes, on different document corpora (OHSUMED, Reuters-21578, 20Newsgroup, Yahoo! Answers, YELP Polarity, and TREC Genomics 2015). It is then compared with the Bag of Words (BoW) representation technique. Results suggest that applying the proposed filter achieves accuracy similar to BoW while greatly improving classification processing times.
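The idea of using per-document topic probabilities as features can be illustrated with a small sketch. This is not the authors' Weka filter: it is a minimal collapsed Gibbs sampler for LDA in plain Python, and the function name `lda_topic_features`, the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions made for illustration.

```python
import random

def lda_topic_features(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Return one topic-probability vector per document (collapsed Gibbs sampling)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]              # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                             # tokens assigned to topic k
    assignments = []

    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)

    # Resample each token's topic given all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi = word_id[w]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][wi] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wi] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wi] += 1
                topic_total[k] += 1

    # Smoothed document-topic proportions: the new representation.
    features = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        features.append([(doc_topic[d][t] + alpha) / denom for t in range(num_topics)])
    return features

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    "heart blood pressure heart".split(),
    "blood pressure patient heart".split(),
    "stock market price stock".split(),
    "market price trade stock".split(),
]
features = lda_topic_features(docs, num_topics=2)
```

Each document is reduced to `num_topics` values that sum to 1, instead of one value per vocabulary word as in BoW. A much shorter feature vector of this kind is one plausible reason for the faster classification times the abstract reports.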

2021 ◽  
Vol 8 (2) ◽  
pp. 311
Author(s):  
Mohammad Farid Naufal

Weather is an important factor considered in many kinds of decision making. Manual weather classification by humans is time consuming and inconsistent. Computer vision is a branch of science that computers use to recognize or classify images. It can help develop self-autonomous machines that do not depend on an internet connection and can perform their own calculations in real time. There are several popular image classification algorithms, namely K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Convolutional Neural Network (CNN). KNN and SVM are Machine Learning classification algorithms, while CNN is a Deep Neural Network classification algorithm. This study aims to compare the performance of these three algorithms so that the performance gap between them is known. The experiments use 5-fold cross-validation. Several parameters are used to configure the KNN, SVM, and CNN algorithms. In the experiments, CNN achieved the best performance, with an accuracy of 0.942, precision of 0.943, recall of 0.942, and F1 score of 0.942.
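The evaluation protocol mentioned in this abstract, read here as 5-fold cross-validation, can be sketched as follows. This is an illustrative outline only, not the study's code: `k_fold_splits`, `cross_validate`, and the `fit`/`predict` callbacks are hypothetical names, and the classifier is left abstract so that KNN, SVM, or a CNN wrapper could be plugged in.

```python
import random

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in range(k) if f != i for j in folds[f]]
        yield train, test

def cross_validate(features, labels, fit, predict, k=5):
    """Return the mean accuracy of a classifier over k folds."""
    scores = []
    for train, test in k_fold_splits(len(labels), k):
        # Train on k-1 folds, then score on the held-out fold.
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, features[i]) == labels[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)
```

Every sample is used for testing exactly once, so the averaged score depends less on one lucky or unlucky split than a single train/test partition would.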


2021 ◽  
Vol 5 (1) ◽  
pp. 566-576
Author(s):  
Azeez A. Nureni ◽  
Victor E. Ogunlusi ◽  
Emmanuel Junior Uloko

Sentiment analysis involves techniques for analyzing texts in order to identify the dominant sentiment and emotion in them and classify them accordingly. These techniques include, but are not limited to, text preprocessing and the use of a machine learning or lexicon-based approach to classify the texts. In this research, an attempt was made to adopt a machine learning approach to classify tweets on Covid-19, which is considered a global pandemic. To achieve this objective, a cross-dataset approach was applied to train four machine learning classification algorithms: Support Vector Machine (SVM), Random Forest (RF), Naïve Bayes (NB), and K-Nearest Neighbors (KNN). The final result will not only help identify the best-performing algorithm; it will also help raise awareness of Covid-19, with the ultimate aims of destigmatizing patients through the analysis of sentiments and emotions about Covid-19 and of using the same result to contain the spread of the pandemic.


2021 ◽  
Author(s):  
Leonardo Dias Martins ◽  
Fabíola Pantoja Oliveira Araújo

Every day, a large amount of data circulates on the Internet, producing a great deal of information in the form of images, videos, and texts. It is therefore necessary to analyze and extract this information automatically. This work presents a case study that applies text mining to extract the emotional and sentimental profiles from the comments of users of the game Last Day of June, and reports the results and information obtained from the sentiment analysis. Three classification algorithms, Naive Bayes, Support Vector Machine (SVM), and K-Nearest Neighbors (KNN), were used to predict the class of each element according to the emotions or feelings identified in the comment analysis. SVM with a radial kernel achieved the best accuracy (79%), followed by KNN with 3 nearest neighbors (75%) and Naive Bayes (62%).
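The KNN configuration mentioned above (3 nearest neighbors) can be illustrated with a short sketch. It is an assumption-laden toy, not the study's pipeline: `knn_predict` is a hypothetical helper, distances are Euclidean, and the two-dimensional points stand in for whatever numeric features (for example, word counts) are derived from the comments.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups of feature vectors.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["negative", "negative", "negative", "positive", "positive", "positive"]
prediction = knn_predict(points, labels, (0.5, 0.5), k=3)
```

An odd `k` such as 3 avoids tied votes when there are only two classes, which may be one reason it is a common choice.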


2019 ◽  
Vol 8 (2) ◽  
pp. 1139-1143

With social media booming, it is becoming much easier for customers to share their views and comments and to express their feelings about products featured on online social media. If these data can be analyzed efficiently, suggestions can be offered to companies on how to improve their product sales. It also becomes easier for a company to understand customers' reactions to product advertisements posted on social media. This research focuses on analyzing customer sentiment based on comments and reviews of products available on Facebook. Sentiment analysis is performed to categorize customer comments as positive, negative, or neutral, and the comments are later labeled as 0 or 1. After the labeling process, a comparative analysis is performed using different classification algorithms: K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and the Naïve Bayes classifier. The classification algorithm with the highest accuracy is identified in order to predict the sales of online products.


10.29007/h71z ◽  
2020 ◽  
Author(s):  
Waleed Almutairi ◽  
Ryszard Janicki

The paper deals with problems that imbalanced and overlapping datasets often encounter. Performance indicators such as accuracy, precision, and recall for imbalanced datasets, both with and without overlapping, are discussed and compared with the same performance indicators for balanced datasets with overlapping. Three popular classification algorithms, namely Decision Tree, KNN (k-Nearest Neighbors), and SVM (Support Vector Machines), are analyzed and compared.
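Why these indicators behave differently on imbalanced data can be seen in a small worked example. The sketch below is illustrative only: `binary_metrics` is a hypothetical helper, and the 95/5 class split is made-up data, not taken from the paper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # defined as 0 when nothing is predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0     # defined as 0 when there are no true positives
    return accuracy, precision, recall

# Made-up imbalanced data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A classifier that always predicts the majority class.
y_pred = [0] * 100
accuracy, precision, recall = binary_metrics(y_true, y_pred)
```

Here accuracy is 0.95 while precision and recall for the minority class are both 0, which shows why accuracy alone can be misleading on imbalanced datasets.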


Author(s):  
Hedieh Sajedi ◽  
Mehran Bahador

In this paper, a new approach for the segmentation and recognition of Persian handwritten numbers is presented. The method combines the framing feature technique with the outer profile feature; we call the combination the adapted framing feature. In the proposed approach, segmentation of numbers into digits is carried out automatically. In the classification stage, Support Vector Machines (SVM) and k-Nearest Neighbors (k-NN) are used. Experiments are conducted on the IFHCDB database, consisting of 17,740 numeral images, and the HODA database, consisting of 102,352 numeral images. At the isolated-digit level on IFHCDB, a recognition rate of 99.27% is achieved using SVM with a polynomial kernel. Furthermore, at the isolated-digit level on HODA, a recognition rate of 99.07% is achieved using SVM with a polynomial kernel. The experiments show that the proposed method yields higher accuracy than previous research.


2019 ◽  
Vol 58 (06) ◽  
pp. 205-212
Author(s):  
Cirruse Salehnasab ◽  
Abbas Hajifathali ◽  
Farkhondeh Asadi ◽  
Elham Roshandel ◽  
Alireza Kazemi ◽  
...  

Abstract. Background: Acute graft-versus-host disease (aGvHD) is the most important cause of mortality in patients receiving allogeneic hematopoietic stem cell transplantation. Because it is detected only at the stage of severe tissue damage, its diagnosis is late. With the advancement of machine learning (ML), promising real-time models to predict aGvHD have emerged. Objective: This article aims to synthesize the literature on ML classification algorithms for predicting aGvHD, highlighting the algorithms and important predictor variables used. Methods: A systematic review of ML classification algorithms used to predict aGvHD was performed by searching the PubMed, Embase, Web of Science, Scopus, Springer, and IEEE Xplore databases up to April 2019, following the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) statement. Studies focusing on the use of ML classification algorithms for predicting aGvHD were considered. Results: After applying the inclusion and exclusion criteria, 14 studies were selected for evaluation. The analysis showed that the algorithms used were Artificial Neural Network (79%), Support Vector Machine (50%), Naive Bayes (43%), k-Nearest Neighbors (29%), Regression (29%), and Decision Trees (14%). Because many predictor variables were used in these studies, we grouped them into more abstract categories: biomarkers, demographics, infections, clinical, genes, transplants, drugs, and other variables. Conclusion: Each of these ML algorithms has particular characteristics and different proposed predictors. These ML algorithms therefore appear to have high potential for predicting aGvHD if the modeling process is performed correctly.

