Performance of Classifiers on Newsgroups using Specific Subset of Terms

Text classification plays a vital role in data mining, and the same is true for the classification algorithms used in text categorization. Many techniques exist for text classification, but this paper focuses on three approaches: Support Vector Machine (SVM), Naïve Bayes (NB), and k-Nearest Neighbor (k-NN). The paper reports results of these classifiers on the mini-newsgroups data, which consists of a large number of documents, and walks through the steps involved: listing the files, preprocessing, creating the terms (a specific subset of terms), and applying the classifiers to the resulting subsets of the dataset. Finally, after the experiments on the dataset, it is concluded that SVM achieves the best classification output with respect to accuracy, precision, recall, and F-measure, while k-NN has the best execution time.
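The comparison described above can be reproduced in outline with scikit-learn. The sketch below is not the authors' code: it assumes the standard 20 Newsgroups loader as a stand-in for the mini-newsgroups corpus and approximates the "specific subset of terms" by capping the TF-IDF vocabulary size.

```python
# Hedged sketch: SVM, NB and k-NN on a newsgroups corpus with a restricted term set.
import time
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

# "Specific subset of terms" approximated by keeping only the top 2000 TF-IDF terms.
vectorizer = TfidfVectorizer(stop_words="english", max_features=2000)
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

for name, clf in [("SVM", LinearSVC()),
                  ("NB", MultinomialNB()),
                  ("k-NN", KNeighborsClassifier(n_neighbors=5))]:
    start = time.perf_counter()
    clf.fit(X_train, train.target)
    pred = clf.predict(X_test)
    elapsed = time.perf_counter() - start
    p, r, f1, _ = precision_recall_fscore_support(test.target, pred, average="macro")
    print(f"{name}: acc={accuracy_score(test.target, pred):.3f} "
          f"P={p:.3f} R={r:.3f} F1={f1:.3f} time={elapsed:.1f}s")
```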

Author(s):  
Maria Morgan ◽  
Carla Blank ◽  
Raed Seetan

This paper investigates the capability of six existing classification algorithms (Artificial Neural Network, Naïve Bayes, k-Nearest Neighbor, Support Vector Machine, Decision Tree and Random Forest) in classifying and predicting diseases in soybean and mushroom datasets whose attributes are numerical or categorical. While many similar studies have been conducted on image datasets to predict plant diseases, the main objective of this study is to suggest classification methods that can be used for disease classification and prediction in datasets that contain raw measurements instead of images. A fungus dataset and a plant dataset, which differ in many respects, were chosen so that the findings in this paper could be applied to future research on disease prediction and classification in a variety of datasets containing raw measurements. A key difference between the two datasets, other than one being a fungus and one being a plant, is that the mushroom dataset is balanced and contains only two classes, while the soybean dataset is imbalanced and contains eighteen classes. All six algorithms performed well on the mushroom dataset, while the Artificial Neural Network and k-Nearest Neighbor algorithms performed best on the soybean dataset. The findings of this paper can be applied to future research on disease classification and prediction in a variety of dataset types such as fungi, plants, humans, and animals.
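As an illustration of how such a comparison could be set up on a categorical dataset, the sketch below runs the six classifiers with one-hot encoding and cross-validation. The CSV path and the "class" column name are placeholders (e.g. a local copy of the UCI mushroom data), not the authors' files.

```python
# Hedged sketch: six classifiers on a categorical dataset via one-hot encoding.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("mushrooms.csv")                  # hypothetical local copy of the dataset
X, y = df.drop(columns=["class"]), df["class"]     # "class" column name is an assumption

models = {
    "ANN": MLPClassifier(max_iter=500),
    "Naive Bayes": BernoulliNB(),
    "k-NN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
}
for name, model in models.items():
    pipe = make_pipeline(OneHotEncoder(handle_unknown="ignore"), model)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```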


2016 ◽  
Vol 1 (1) ◽  
pp. 13 ◽  
Author(s):  
Debby Erce Sondakh

This study aims to measure and compare the performance of five machine-learning-based text classification algorithms, namely decision rules, decision tree, k-nearest neighbor (k-NN), naïve Bayes, and Support Vector Machine (SVM), on multi-class text documents. The comparison focuses on the effectiveness of the algorithms, i.e., their ability to assign documents to the correct category, using the holdout (percentage split) method. Effectiveness is measured by precision, recall, F-measure, and accuracy. The experimental results show that for the naïve Bayes algorithm, the larger the percentage of training documents, the higher the accuracy of the resulting model. Naïve Bayes reached its highest accuracy at a 90/10 split, SVM at 80/20, and decision tree at 70/30. The experiments also show that naïve Bayes achieved the highest effectiveness among the five algorithms tested, as well as the fastest classification-model-building time, at 0.02 seconds. The decision tree algorithm classified text documents with higher accuracy than SVM, but its model-building time was slower. In terms of model-building time, k-NN was the fastest, but its accuracy was lower.
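A minimal sketch of the percentage-split (holdout) evaluation described above, assuming the corpus and its labels are already loaded as `docs` and `labels` (placeholders). Only naïve Bayes is shown; any of the five classifiers could be swapped in.

```python
# Hedged sketch: holdout evaluation at the 90/10, 80/20 and 70/30 splits.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

def evaluate_split(docs, labels, train_fraction):
    X_train, X_test, y_train, y_test = train_test_split(
        docs, labels, train_size=train_fraction, stratify=labels, random_state=0)
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(X_train, y_train)
    print(f"--- split {int(train_fraction * 100)}/{int(round((1 - train_fraction) * 100))} ---")
    print(classification_report(y_test, model.predict(X_test)))  # precision, recall, F1, accuracy

# e.g. the splits compared in the study:
# for frac in (0.9, 0.8, 0.7):
#     evaluate_split(docs, labels, frac)
```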


Author(s):  
Jiahua Jin ◽  
Lu Lu

Hotel social media provides access to dissatisfied customers and their experiences with services. However, due to the massive number of topics and posts in social media and the sparse distribution of complaint-related posts, manually identifying complaints is inefficient and time-consuming. In this study, we propose a supervised learning method consisting of training sample enlargement and classifier construction. We first identified reliable complaint and noncomplaint samples from the unlabeled dataset by using a small set of labeled samples as training data. Combining the labeled samples and the enlarged samples, the support vector machine and k-nearest neighbor classification algorithms were then adopted to build binary classifiers during the classifier construction process. Experimental results indicate the proposed method can identify complaints from social media efficiently, especially when the amount of labeled training samples is small. This study provides an efficient approach for hotel companies to distinguish a certain kind of consumer complaint information from the large amount of unrelated information in hotel social media.
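The abstract does not fully specify the sample-enlargement procedure, so the sketch below uses scikit-learn's SelfTrainingClassifier merely as a stand-in for that step, with SVM or k-NN as the base binary classifier. All variable names are placeholders.

```python
# Hedged sketch: enlarge a small labeled set from unlabeled posts, then train a
# binary complaint classifier (SVM or k-NN). Not the authors' exact method.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

def build_classifier(labeled_posts, labels, unlabeled_posts, base="svm"):
    """labeled_posts/labels: small labeled sample; unlabeled_posts: the rest."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(list(labeled_posts) + list(unlabeled_posts))
    # -1 marks unlabeled samples for SelfTrainingClassifier
    y = np.concatenate([np.asarray(labels, dtype=int),
                        -np.ones(len(unlabeled_posts), dtype=int)])
    base_clf = SVC(probability=True) if base == "svm" else KNeighborsClassifier()
    clf = SelfTrainingClassifier(base_clf, threshold=0.9)  # keep only confident pseudo-labels
    clf.fit(X, y)
    return vectorizer, clf
```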


Author(s):  
Seyma Kiziltas Koc ◽  
Mustafa Yeniad

Technologies used in the healthcare industry are changing rapidly because technology is constantly evolving to improve people's lifestyles. For instance, different technological devices are used for the diagnosis and treatment of diseases, and with developing technology it has become possible for computer systems to assist in the diagnosis of disease. Machine learning algorithms are frequently used tools because of their high performance in the field of health, as in many other fields. The aim of this study is to investigate different machine learning classification algorithms that can be used in the diagnosis of diabetes and to perform comparative analyses according to the metrics used in the literature. Seven classification algorithms from the literature were used: Logistic Regression, k-Nearest Neighbor, Multilayer Perceptron, Random Forest, Decision Tree, Support Vector Machine, and Naive Bayes. The classification performance of the algorithms was compared on the basis of accuracy, sensitivity, precision, and F1-score. The results showed that the Support Vector Machine algorithm had the highest accuracy, at 78.65%.
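A hedged sketch of such a comparison on a diabetes dataset, assuming a local copy of the Pima Indians data in `diabetes.csv` with an `Outcome` label column (placeholders), reporting the same four metrics as the study:

```python
# Hedged sketch: seven classifiers compared on a diabetes dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

df = pd.read_csv("diabetes.csv")                       # hypothetical local copy
X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "k-NN": KNeighborsClassifier(),
    "MLP": MLPClassifier(max_iter=1000),
    "Random Forest": RandomForestClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    pred = make_pipeline(StandardScaler(), model).fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: acc={accuracy_score(y_te, pred):.3f} "
          f"sens={recall_score(y_te, pred):.3f} "
          f"prec={precision_score(y_te, pred):.3f} "
          f"F1={f1_score(y_te, pred):.3f}")
```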


The world today has made giant leaps in the field of medicine. A tremendous amount of research is being carried out in this field, leading to new discoveries that have a heavy impact on mankind, and the data generated is increasing enormously. A need has arisen to analyze these data in order to find meaningful and relevant hidden patterns, which can then be used for clinical diagnosis. Data mining is an efficient approach for discovering such patterns. Among the many data mining techniques that exist, this paper analyzes medical data using various classification techniques. The classification techniques used in this study include k-Nearest Neighbor (kNN), Decision Tree, and Naive Bayes, which are hard computing algorithms, whereas the soft computing algorithms used include Support Vector Machine (SVM), Artificial Neural Networks (ANN), and Fuzzy k-Means clustering. We applied these algorithms to three datasets: Breast Cancer Wisconsin, Haberman, and Contraceptive Method Choice. Our results show that the soft computing based classification algorithms produce better classifications than the traditional classification algorithms in terms of various classification performance measures.


Author(s):  
MAYY M. AL-TAHRAWI ◽  
RAED ABU ZITAR

Many techniques and algorithms for automatic text categorization have been devised and proposed in the literature. However, there is still much room for researchers in this area to improve existing algorithms or come up with new techniques for text categorization (TC). Polynomial Networks (PNs) had never been used before in TC. This can be attributed to the huge datasets used in TC, as well as to the technique itself, which has high computational demands. In this paper, we investigate and propose using PNs in TC. The proposed PN classifier achieved competitive classification performance in our experiments. More importantly, this high performance is achieved with one-shot (non-iterative) training and using just 0.25%–0.5% of the corpora features. Experiments are conducted on two benchmark datasets in TC: Reuters-21578 and the 20 Newsgroups. Five well-known classifiers are run on the same data and feature subsets: the state-of-the-art Support Vector Machines (SVM), Logistic Regression (LR), the k-nearest-neighbor (kNN), Naive Bayes (NB), and Radial Basis Function (RBF) networks.
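Polynomial Networks are not available in common libraries, so the sketch below should not be read as the authors' PN classifier. It only illustrates the two ingredients the abstract emphasizes, under stated assumptions: aggressive feature reduction (roughly 0.25% of the vocabulary via chi-squared selection) and a polynomial decision surface fitted by regularized least squares, as a loose analogue of one-shot training.

```python
# Hedged sketch: tiny feature subset + polynomial expansion + least-squares classifier.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train.data)
k = max(1, int(0.0025 * X_train.shape[1]))   # keep roughly 0.25% of the vocabulary

model = make_pipeline(
    SelectKBest(chi2, k=k),                   # chi-squared feature selection
    PolynomialFeatures(degree=2, include_bias=False),  # second-order term interactions
    RidgeClassifier(),                        # linear classifier fitted by regularized least squares
)
model.fit(X_train, train.target)
print("accuracy:", model.score(vectorizer.transform(test.data), test.target))
```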


2019 ◽  
Vol 37 (15_suppl) ◽  
pp. 2029-2029 ◽  
Author(s):  
Estela Pineda ◽  
Anna Esteve-Codina ◽  
Maria Martinez-Garcia ◽  
Francesc Alameda ◽  
Cristina Carrato ◽  
...  

2029 Background: Glioblastoma (GBM) gene expression subtypes have been described in recent years, but data from homogeneously treated patients are lacking. Methods: Clinical, molecular and immunohistochemistry (IHC) data from patients with newly diagnosed GBM homogeneously treated with standard radiochemotherapy were studied. Samples were classified, based on their expression profiles, into three subtypes (classical, mesenchymal, proneural) using the Support Vector Machine (SVM), k-nearest neighbor (K-NN) and single-sample Gene Set Enrichment Analysis (ssGSEA) classification algorithms provided by the GlioVis web application. Results: The GLIOCAT Project recruited 432 patients from 6 Catalan institutions, all of whom received standard first-line treatment (2004-2015). The best paraffin tissue samples were selected for RNA-seq, and reliable data were obtained from 124. 82 cases (66%) were classified into the same subtype by all three classification algorithms; SVM and ssGSEA obtained the most similar results (87%). No differences in clinical variables were found between the 3 GBM subtypes. The proneural subtype was enriched in IDH1-mutated and G-CIMP-positive tumors. The mesenchymal subtype (SVM) was enriched in MGMT-unmethylated tumors (p = 0.008), and the classical subtype (SVM) in MGMT-methylated tumors (p = 0.008). Long survivors ( > 30 months) were rarely classified as mesenchymal (0-7.5%) and were more frequently classified as proneural (23.1-26.). Known clinical (age, resection, KPS) and molecular (IDH1, MGMT) prognostic factors were confirmed in this series. Overall, no differences in prognosis were observed between the 3 subtypes, but a trend toward worse survival for the mesenchymal subtype was observed with K-NN (9.6 vs 15). The mesenchymal subtype showed less expression of Olig2 (p < 0.001) and SOX2 (p = 0.003) by IHC, but more YKL-40 expression (p = 0.023, SVM). On the other hand, the classical subtype expressed more Nestin (p = 0.004) compared to the other subtypes (K-NN). Conclusions: In our study we did not find a correlation between glioblastoma expression subtype and outcome. This large series provides reproducible data on the clinical, molecular and immunohistochemistry features of glioblastoma genetic subtypes.


2019 ◽  
Vol 13 (11) ◽  
pp. 31
Author(s):  
Jafar Ababneh

Nowadays, many applications that use large amounts of data have been developed due to the existence of the Internet of Things. These applications are translated into different languages and require automated text classification (ATC). The ATC process assigns documents to one or more predefined classes based on their content. However, this process is problematic for the Arabic translations of the data. This study aims to address this issue by investigating the performance of three classification algorithms, namely, k-nearest neighbor (KNN), decision tree (DT), and naïve Bayes (NB) classifiers, on Saudi Press Agency datasets. Results showed that the NB algorithm outperformed the DT and KNN algorithms in terms of precision, recall, and F1. In future work, a new algorithm that can improve the handling of the ATC problem will be developed.


2018 ◽  
Vol 6 (4) ◽  
pp. 129-134 ◽  
Author(s):  
Jumoke Falilat Ajao ◽  
David Olufemi Olawuyi ◽  
Odetunji Ode Odejobi

This work presents a system for offline Yoruba character recognition using the Freeman chain code and k-Nearest Neighbor (KNN). Much of the work on Latin word and character recognition has used the k-nearest neighbor classifier and other classification algorithms; this research explores the same recognition capability for Yoruba characters. Data were collected from adult indigenous writers, and the scanned images were subjected to preprocessing to enhance the quality of the digitized images. The Freeman chain code was used to extract features from the digitized images, and KNN was used to classify the characters based on the feature space. The performance of the KNN was compared with other classification algorithms, namely Support Vector Machine (SVM) and a Bayes classifier, for recognition of Yoruba characters. The recognition accuracy of the KNN classification algorithm with the Freeman chain code was 87.7%, outperforming the other classifiers used on Yoruba characters.
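As a rough illustration of the pipeline (not the authors' system), the sketch below derives a Freeman chain code from a character's outer contour with OpenCV, summarizes it as an 8-bin direction histogram, and classifies with k-NN. Image paths and labels are placeholders.

```python
# Hedged sketch: Freeman chain code features + k-NN character classification.
import cv2
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# 8-connected neighbour offsets (dx, dy) mapped to Freeman codes 0-7
FREEMAN = {(1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
           (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7}

def chain_code_histogram(binary_img):
    """Normalized 8-bin histogram of the chain code of the largest contour."""
    contours, _ = cv2.findContours(binary_img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea).reshape(-1, 2)
    codes = [FREEMAN[tuple(q - p)] for p, q in zip(contour[:-1], contour[1:])]
    hist = np.bincount(codes, minlength=8).astype(float)
    return hist / hist.sum()

def load_features(paths):
    feats = []
    for path in paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        feats.append(chain_code_histogram(binary))
    return np.array(feats)

# train_paths/train_labels and test_paths (placeholders) would hold the scanned,
# preprocessed character images described above.
# knn = KNeighborsClassifier(n_neighbors=3).fit(load_features(train_paths), train_labels)
# predictions = knn.predict(load_features(test_paths))
```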

