Performance of Classifiers on Newsgroups using Specific Subset of Terms

Text classification plays a vital role in data mining, and the same is true for the classification algorithms used in text categorization. Many techniques exist for text classification, but this paper focuses on three approaches: Support Vector Machine (SVM), Naïve Bayes (NB), and k-Nearest Neighbor (k-NN). The paper reports results of these classifiers on the mini-newsgroups data, which consists of a large number of documents, and walks through the steps involved: listing the files, preprocessing, creating the terms (a specific subset of terms), and applying the classifiers to the resulting subsets of the dataset. Finally, after the experiments on the dataset, it is concluded that SVM achieves the best classification output with respect to accuracy, precision, recall, and F-measure, while k-NN has the best execution time.
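The comparison described above can be reproduced in outline with scikit-learn. The sketch below is not the authors' code: it assumes the standard 20 Newsgroups loader as a stand-in for the mini-newsgroups corpus and approximates the "specific subset of terms" by capping the TF-IDF vocabulary size.

```python
# Hedged sketch: SVM, NB and k-NN on a newsgroups corpus with a restricted term set.
import time
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

# "Specific subset of terms" approximated by keeping only the top 2000 TF-IDF terms.
vectorizer = TfidfVectorizer(stop_words="english", max_features=2000)
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

for name, clf in [("SVM", LinearSVC()),
                  ("NB", MultinomialNB()),
                  ("k-NN", KNeighborsClassifier(n_neighbors=5))]:
    start = time.perf_counter()
    clf.fit(X_train, train.target)
    pred = clf.predict(X_test)
    elapsed = time.perf_counter() - start
    p, r, f1, _ = precision_recall_fscore_support(test.target, pred, average="macro")
    print(f"{name}: acc={accuracy_score(test.target, pred):.3f} "
          f"P={p:.3f} R={r:.3f} F1={f1:.3f} time={elapsed:.1f}s")
```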

Author(s):  
Maria Morgan ◽  
Carla Blank ◽  
Raed Seetan

This paper investigates the capability of six existing classification algorithms (Artificial Neural Network, Naïve Bayes, k-Nearest Neighbor, Support Vector Machine, Decision Tree and Random Forest) in classifying and predicting diseases in soybean and mushroom datasets whose attributes are numerical or categorical. While many similar studies have been conducted on image datasets to predict plant diseases, the main objective of this study is to suggest classification methods that can be used for disease classification and prediction in datasets that contain raw measurements instead of images. A fungus dataset and a plant dataset, which differ in many respects, were chosen so that the findings in this paper could be applied to future research on disease prediction and classification in a variety of datasets containing raw measurements. A key difference between the two datasets, other than one being a fungus and one being a plant, is that the mushroom dataset is balanced and contains only two classes, while the soybean dataset is imbalanced and contains eighteen classes. All six algorithms performed well on the mushroom dataset, while the Artificial Neural Network and k-Nearest Neighbor algorithms performed best on the soybean dataset. The findings of this paper can be applied to future research on disease classification and prediction in a variety of dataset types such as fungi, plants, humans, and animals.
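As an illustration of how such a comparison could be set up on a categorical dataset, the sketch below runs the six classifiers with one-hot encoding and cross-validation. The CSV path and the "class" column name are placeholders (e.g. a local copy of the UCI mushroom data), not the authors' files.

```python
# Hedged sketch: six classifiers on a categorical dataset via one-hot encoding.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("mushrooms.csv")                  # hypothetical local copy of the dataset
X, y = df.drop(columns=["class"]), df["class"]     # "class" column name is an assumption

models = {
    "ANN": MLPClassifier(max_iter=500),
    "Naive Bayes": BernoulliNB(),
    "k-NN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
}
for name, model in models.items():
    pipe = make_pipeline(OneHotEncoder(handle_unknown="ignore"), model)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```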


2016 ◽  
Vol 1 (1) ◽  
pp. 13 ◽  
Author(s):  
Debby Erce Sondakh

This study aims to measure and compare the performance of five machine-learning-based text classification algorithms, namely decision rules, decision tree, k-nearest neighbor (k-NN), naïve Bayes, and Support Vector Machine (SVM), on multi-class text documents. The comparison focuses on the effectiveness of the algorithms, i.e., their ability to assign documents to the correct category, using the holdout (percentage split) method. Effectiveness is measured by precision, recall, F-measure, and accuracy. The experimental results show that for the naïve Bayes algorithm, the larger the percentage of training documents, the higher the accuracy of the resulting model. Naïve Bayes reached its highest accuracy at a 90/10 split, SVM at 80/20, and decision tree at 70/30. The experiments also show that naïve Bayes achieved the highest effectiveness among the five algorithms tested, as well as the fastest classification-model-building time, at 0.02 seconds. The decision tree algorithm classified text documents with higher accuracy than SVM, but its model-building time was slower. In terms of model-building time, k-NN was the fastest, but its accuracy was lower.
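A minimal sketch of the percentage-split (holdout) evaluation described above, assuming the corpus and its labels are already loaded as `docs` and `labels` (placeholders). Only naïve Bayes is shown; any of the five classifiers could be swapped in.

```python
# Hedged sketch: holdout evaluation at the 90/10, 80/20 and 70/30 splits.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

def evaluate_split(docs, labels, train_fraction):
    X_train, X_test, y_train, y_test = train_test_split(
        docs, labels, train_size=train_fraction, stratify=labels, random_state=0)
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(X_train, y_train)
    print(f"--- split {int(train_fraction * 100)}/{int(round((1 - train_fraction) * 100))} ---")
    print(classification_report(y_test, model.predict(X_test)))  # precision, recall, F1, accuracy

# e.g. the splits compared in the study:
# for frac in (0.9, 0.8, 0.7):
#     evaluate_split(docs, labels, frac)
```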


Author(s):  
Jiahua Jin ◽  
Lu Lu

Hotel social media provides access to dissatisfied customers and their experiences with services. However, due to the massive number of topics and posts in social media and the sparse distribution of complaint-related posts, manually identifying complaints is inefficient and time-consuming. In this study, we propose a supervised learning method consisting of training sample enlargement and classifier construction. We first identified reliable complaint and noncomplaint samples from the unlabeled dataset by using a small set of labeled samples as training data. Combining the labeled samples and the enlarged samples, the support vector machine and k-nearest neighbor classification algorithms were then adopted to build binary classifiers during the classifier construction process. Experimental results indicate the proposed method can identify complaints from social media efficiently, especially when the amount of labeled training samples is small. This study provides an efficient approach for hotel companies to distinguish a certain kind of consumer complaint information from the large amount of unrelated information in hotel social media.
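The abstract does not fully specify the sample-enlargement procedure, so the sketch below uses scikit-learn's SelfTrainingClassifier merely as a stand-in for that step, with SVM or k-NN as the base binary classifier. All variable names are placeholders.

```python
# Hedged sketch: enlarge a small labeled set from unlabeled posts, then train a
# binary complaint classifier (SVM or k-NN). Not the authors' exact method.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

def build_classifier(labeled_posts, labels, unlabeled_posts, base="svm"):
    """labeled_posts/labels: small labeled sample; unlabeled_posts: the rest."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(list(labeled_posts) + list(unlabeled_posts))
    # -1 marks unlabeled samples for SelfTrainingClassifier
    y = np.concatenate([np.asarray(labels, dtype=int),
                        -np.ones(len(unlabeled_posts), dtype=int)])
    base_clf = SVC(probability=True) if base == "svm" else KNeighborsClassifier()
    clf = SelfTrainingClassifier(base_clf, threshold=0.9)  # keep only confident pseudo-labels
    clf.fit(X, y)
    return vectorizer, clf
```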


Author(s):  
Seyma Kiziltas Koc ◽  
Mustafa Yeniad

Technologies used in the healthcare industry are changing rapidly because technology is constantly evolving to improve people's lifestyles. For instance, different technological devices are used for the diagnosis and treatment of diseases, and with developing technology it has become possible for computer systems to assist in the diagnosis of disease. Machine learning algorithms are frequently used tools because of their high performance in the field of health, as in many other fields. The aim of this study is to investigate different machine learning classification algorithms that can be used in the diagnosis of diabetes and to perform comparative analyses according to the metrics used in the literature. Seven classification algorithms from the literature were used: Logistic Regression, k-Nearest Neighbor, Multilayer Perceptron, Random Forest, Decision Tree, Support Vector Machine, and Naive Bayes. The classification performance of the algorithms was compared on the basis of accuracy, sensitivity, precision, and F1-score. The results showed that the Support Vector Machine algorithm had the highest accuracy, at 78.65%.
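A hedged sketch of such a comparison on a diabetes dataset, assuming a local copy of the Pima Indians data in `diabetes.csv` with an `Outcome` label column (placeholders), reporting the same four metrics as the study:

```python
# Hedged sketch: seven classifiers compared on a diabetes dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

df = pd.read_csv("diabetes.csv")                       # hypothetical local copy
X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "k-NN": KNeighborsClassifier(),
    "MLP": MLPClassifier(max_iter=1000),
    "Random Forest": RandomForestClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    pred = make_pipeline(StandardScaler(), model).fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: acc={accuracy_score(y_te, pred):.3f} "
          f"sens={recall_score(y_te, pred):.3f} "
          f"prec={precision_score(y_te, pred):.3f} "
          f"F1={f1_score(y_te, pred):.3f}")
```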


The world today has made giant leaps in the field of medicine. A tremendous amount of research is being carried out in this field, leading to new discoveries that have a heavy impact on mankind, and the data generated is increasing enormously. A need has arisen to analyze these data in order to find meaningful and relevant hidden patterns, which can then be used for clinical diagnosis. Data mining is an efficient approach for discovering such patterns. Among the many data mining techniques that exist, this paper analyzes medical data using various classification techniques. The classification techniques used in this study include k-Nearest Neighbor (kNN), Decision Tree, and Naive Bayes, which are hard computing algorithms, whereas the soft computing algorithms used include Support Vector Machine (SVM), Artificial Neural Networks (ANN), and Fuzzy k-Means clustering. We applied these algorithms to three datasets: Breast Cancer Wisconsin, Haberman, and Contraceptive Method Choice. Our results show that the soft computing based classification algorithms produce better classifications than the traditional classification algorithms in terms of various classification performance measures.


Author(s):  
MAYY M. AL-TAHRAWI ◽  
RAED ABU ZITAR

Many techniques and algorithms for automatic text categorization have been devised and proposed in the literature. However, there is still much room for researchers in this area to improve existing algorithms or come up with new techniques for text categorization (TC). Polynomial Networks (PNs) had never been used before in TC. This can be attributed to the huge datasets used in TC, as well as to the technique itself, which has high computational demands. In this paper, we investigate and propose using PNs in TC. The proposed PN classifier achieved competitive classification performance in our experiments. More importantly, this high performance is achieved with one-shot (non-iterative) training and using just 0.25%–0.5% of the corpora features. Experiments are conducted on two benchmark datasets in TC: Reuters-21578 and the 20 Newsgroups. Five well-known classifiers are run on the same data and feature subsets: the state-of-the-art Support Vector Machines (SVM), Logistic Regression (LR), the k-nearest-neighbor (kNN), Naive Bayes (NB), and Radial Basis Function (RBF) networks.
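Polynomial Networks are not available in common libraries, so the sketch below should not be read as the authors' PN classifier. It only illustrates the two ingredients the abstract emphasizes, under stated assumptions: aggressive feature reduction (roughly 0.25% of the vocabulary via chi-squared selection) and a polynomial decision surface fitted by regularized least squares, as a loose analogue of one-shot training.

```python
# Hedged sketch: tiny feature subset + polynomial expansion + least-squares classifier.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train.data)
k = max(1, int(0.0025 * X_train.shape[1]))   # keep roughly 0.25% of the vocabulary

model = make_pipeline(
    SelectKBest(chi2, k=k),                   # chi-squared feature selection
    PolynomialFeatures(degree=2, include_bias=False),  # second-order term interactions
    RidgeClassifier(),                        # linear classifier fitted by regularized least squares
)
model.fit(X_train, train.target)
print("accuracy:", model.score(vectorizer.transform(test.data), test.target))
```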


2019 ◽  
Vol 37 (15_suppl) ◽  
pp. 2029-2029 ◽  
Author(s):  
Estela Pineda ◽  
Anna Esteve-Codina ◽  
Maria Martinez-Garcia ◽  
Francesc Alameda ◽  
Cristina Carrato ◽  
...  

2029 Background: Glioblastoma (GBM) gene expression subtypes have been described in recent years, but data from homogeneously treated patients are lacking. Methods: Clinical, molecular and immunohistochemistry (IHC) data from patients with newly diagnosed GBM homogeneously treated with standard radiochemotherapy were studied. Samples were classified, based on their expression profiles, into three subtypes (classical, mesenchymal, proneural) using the Support Vector Machine (SVM), k-nearest neighbor (K-NN) and single-sample Gene Set Enrichment Analysis (ssGSEA) classification algorithms provided by the GlioVis web application. Results: The GLIOCAT Project recruited 432 patients from 6 Catalan institutions, all of whom received standard first-line treatment (2004-2015). The best paraffin tissue samples were selected for RNA-seq, and reliable data were obtained from 124. 82 cases (66%) were classified into the same subtype by all three classification algorithms; SVM and ssGSEA obtained the most similar results (87%). No differences in clinical variables were found between the 3 GBM subtypes. The proneural subtype was enriched in IDH1-mutated and G-CIMP-positive tumors. The mesenchymal subtype (SVM) was enriched in MGMT-unmethylated tumors (p = 0.008), and the classical subtype (SVM) in MGMT-methylated tumors (p = 0.008). Long survivors ( > 30 months) were rarely classified as mesenchymal (0-7.5%) and were more frequently classified as proneural (23.1-26.). Known clinical (age, resection, KPS) and molecular (IDH1, MGMT) prognostic factors were confirmed in this series. Overall, no differences in prognosis were observed between the 3 subtypes, but a trend toward worse survival for the mesenchymal subtype was observed with K-NN (9.6 vs 15). The mesenchymal subtype showed less expression of Olig2 (p < 0.001) and SOX2 (p = 0.003) by IHC, but more YKL-40 expression (p = 0.023, SVM). On the other hand, the classical subtype expressed more Nestin (p = 0.004) compared to the other subtypes (K-NN). Conclusions: In our study we did not find a correlation between glioblastoma expression subtype and outcome. This large series provides reproducible data on the clinical, molecular and immunohistochemistry features of glioblastoma genetic subtypes.


2019 ◽  
Vol 13 (11) ◽  
pp. 31
Author(s):  
Jafar Ababneh

Nowadays, many applications that use large amounts of data have been developed due to the existence of the Internet of Things. These applications are translated into different languages and require automated text classification (ATC). The ATC process assigns documents to one or more predefined classes based on their content. However, this process is problematic for the Arabic translations of the data. This study aims to address this issue by investigating the performance of three classification algorithms, namely, k-nearest neighbor (KNN), decision tree (DT), and naïve Bayes (NB) classifiers, on Saudi Press Agency datasets. Results showed that the NB algorithm outperformed the DT and KNN algorithms in terms of precision, recall, and F1. In future work, a new algorithm that can improve the handling of the ATC problem will be developed.


2018 ◽  
Vol 6 (4) ◽  
pp. 129-134 ◽  
Author(s):  
Jumoke Falilat Ajao ◽  
David Olufemi Olawuyi ◽  
Odetunji Ode Odejobi

This work presents a system for offline Yoruba character recognition using the Freeman chain code and k-Nearest Neighbor (KNN). Much of the work on Latin word and character recognition has used the k-nearest neighbor classifier and other classification algorithms; this research explores the same recognition capability for Yoruba characters. Data were collected from adult indigenous writers, and the scanned images were subjected to preprocessing to enhance the quality of the digitized images. The Freeman chain code was used to extract features from the digitized images, and KNN was used to classify the characters based on the feature space. The performance of the KNN was compared with other classification algorithms, namely Support Vector Machine (SVM) and a Bayes classifier, for recognition of Yoruba characters. The recognition accuracy of the KNN classification algorithm with the Freeman chain code was 87.7%, outperforming the other classifiers used on Yoruba characters.
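As a rough illustration of the pipeline (not the authors' system), the sketch below derives a Freeman chain code from a character's outer contour with OpenCV, summarizes it as an 8-bin direction histogram, and classifies with k-NN. Image paths and labels are placeholders.

```python
# Hedged sketch: Freeman chain code features + k-NN character classification.
import cv2
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# 8-connected neighbour offsets (dx, dy) mapped to Freeman codes 0-7
FREEMAN = {(1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
           (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7}

def chain_code_histogram(binary_img):
    """Normalized 8-bin histogram of the chain code of the largest contour."""
    contours, _ = cv2.findContours(binary_img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea).reshape(-1, 2)
    codes = [FREEMAN[tuple(q - p)] for p, q in zip(contour[:-1], contour[1:])]
    hist = np.bincount(codes, minlength=8).astype(float)
    return hist / hist.sum()

def load_features(paths):
    feats = []
    for path in paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        feats.append(chain_code_histogram(binary))
    return np.array(feats)

# train_paths/train_labels and test_paths (placeholders) would hold the scanned,
# preprocessed character images described above.
# knn = KNeighborsClassifier(n_neighbors=3).fit(load_features(train_paths), train_labels)
# predictions = knn.predict(load_features(test_paths))
```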

