Effect of information gain on document classification using k-nearest neighbor

2022 ◽  
Vol 8 (1) ◽  
pp. 50
Author(s):  
Rifki Indra Perwira ◽  
Bambang Yuwono ◽  
Risya Ines Putri Siswoyo ◽  
Febri Liantoni ◽  
Hidayatulah Himawan

State university libraries support students' education and research with collections of books, journals, and final assignments. An intelligent document classification system is needed to make these collections easier for library visitors to navigate, as a service to students. The documents held in the library are generally research outputs, and complaints about imbalanced text data and categories, irrelevant document titles, and ambiguous words encountered when searching for documents are the main reasons a classification system is needed. This research uses k-Nearest Neighbor (k-NN) to categorize documents by study interest, with information gain feature selection to handle unbalanced data and cosine similarity to measure the distance between test and training data. In tests conducted with 276 training documents, the best result with information gain feature selection was an accuracy of 87.5%, obtained with 80% training data, 20% test data, and k=5. The highest overall accuracy, 92.9%, was achieved without information gain feature selection, with 90% training data, 10% test data, and k=5, 7, and 9. The paper concludes that the system is more accurate without information gain feature selection than with it, because every word in a document title is considered to play an essential role in forming the classification.
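The pipeline this abstract describes (term weighting, information gain feature selection, and k-NN with cosine distance) can be sketched as follows. This is a minimal illustration in Python/scikit-learn, not the paper's implementation: the toy titles and labels, the number of selected features, and the use of mutual information as a stand-in for information gain are assumptions; only the 80/20 split and k=5 follow the reported best configuration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Toy document titles and study-interest labels (illustrative only)
titles = ["neural network for image classification",
          "deep learning model for face recognition",
          "convolutional network for object detection",
          "machine learning for text categorization",
          "expert system for plant disease diagnosis",
          "relational database design for inventory systems",
          "query optimization in distributed databases",
          "data warehouse design for sales reporting",
          "indexing strategies for large databases",
          "database backup and recovery procedures"]
labels = ["AI", "AI", "AI", "AI", "AI", "DB", "DB", "DB", "DB", "DB"]

X = TfidfVectorizer().fit_transform(titles)

# 80% training / 20% test, as in the best reported information-gain setup
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2,
                                          random_state=0, stratify=labels)

# Information gain approximated by mutual information between term and class
selector = SelectKBest(mutual_info_classif, k=min(20, X_tr.shape[1]))
X_tr_sel = selector.fit_transform(X_tr, y_tr)
X_te_sel = selector.transform(X_te)

# k-NN with cosine distance between test and training vectors, k = 5
knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
knn.fit(X_tr_sel, y_tr)
print("accuracy:", accuracy_score(y_te, knn.predict(X_te_sel)))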

2014 ◽  
Vol 701-702 ◽  
pp. 110-113
Author(s):  
Qi Rui Zhang ◽  
He Xian Wang ◽  
Jiang Wei Qin

This paper reports a comparative study of feature selection algorithms on a hyperlipidemia data set. Three feature selection methods were evaluated: document frequency (DF), information gain (IG), and the χ2 statistic (CHI). The classification systems represent each document as a vector and compute term weights with tfidfie (term frequency, inverted document frequency, and inverted entropy). To compare the effectiveness of feature selection, three classification methods were used: Naïve Bayes (NB), k-Nearest Neighbor (kNN), and Support Vector Machines (SVM). The experimental results show that IG and CHI significantly outperform DF, and that SVM and NB are more effective than kNN when the macro-averaged F1 measure is used. DF is suitable for large-scale text classification tasks.
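A minimal sketch of this comparison framework, under stated assumptions: chi-square and mutual information (standing in for information gain) feature selection, each paired with NB, kNN, and SVM and scored by macro-averaged F1. Plain TF-IDF replaces the paper's tfidfie weighting, document frequency selection is omitted, and the two-class toy corpus merely stands in for the hyperlipidemia data set.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Toy two-class corpus standing in for the real documents
docs = [f"elevated cholesterol and triglyceride levels case {i}" for i in range(8)] + \
       [f"routine blood pressure follow up visit record {i}" for i in range(8)]
labels = ["hyperlipidemia"] * 8 + ["other"] * 8

X = TfidfVectorizer().fit_transform(docs)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25,
                                          random_state=0, stratify=labels)

selectors = {"CHI": chi2, "IG": mutual_info_classif}
classifiers = {"NB": MultinomialNB(),
               "kNN": KNeighborsClassifier(n_neighbors=3),
               "SVM": LinearSVC()}

for s_name, score_fn in selectors.items():
    # Keep the top-scoring terms under the current selection criterion
    sel = SelectKBest(score_fn, k=min(10, X_tr.shape[1])).fit(X_tr, y_tr)
    Xs_tr, Xs_te = sel.transform(X_tr), sel.transform(X_te)
    for c_name, clf in classifiers.items():
        clf.fit(Xs_tr, y_tr)
        f1 = f1_score(y_te, clf.predict(Xs_te), average="macro")
        print(f"{s_name} + {c_name}: macro-F1 = {f1:.3f}")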


Author(s):  
Yeni Kustiyahningsih

The large cattle population increases the potential for cattle disease to develop. A lack of knowledge about the various cattle diseases and how to handle them is one cause of declining cattle productivity. The aim of this research is to classify cattle diseases quickly and accurately, helping cattle breeders to speed up detection and treatment. This study uses the K-Nearest Neighbour (KNN) classification method with F-Score feature selection. The KNN method classifies disease based on the distance between training data and test data, while F-Score feature selection reduces the attribute dimensions so that only relevant attributes are retained. The data set used covers cattle disease in Madura, with 350 records consisting of 21 features and 7 classes. The data were partitioned with k-fold cross-validation using k = 5. Based on the test results, the best performance was obtained with 18 features and KNN (k = 3), which yielded an accuracy of 94.28571%, a recall of 0.942857, and a precision of 0.942857.
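A minimal sketch of the described scheme, assuming scikit-learn's ANOVA F-test (f_classif) as a stand-in for the F-Score ranking: select the 18 top-ranked features, classify with KNN (k = 3), and evaluate with 5-fold cross-validation. The synthetic data only mirrors the stated dimensions (350 records, 21 features, 7 classes); it is not the Madura cattle-disease data.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Synthetic stand-in with the paper's dimensions: 350 samples, 21 features, 7 classes
X, y = make_classification(n_samples=350, n_features=21, n_informative=10,
                           n_classes=7, random_state=0)

# Keep the 18 highest-scoring features, then classify with KNN (k = 3)
model = make_pipeline(SelectKBest(f_classif, k=18),
                      KNeighborsClassifier(n_neighbors=3))

# 5-fold cross-validation, as in the study
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print("mean accuracy:", scores.mean())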


2017 ◽  
Vol 25 (4) ◽  
pp. 103-124 ◽  
Author(s):  
Le Nguyen Bao ◽  
Dac-Nhuong Le ◽  
Gia Nhu Nguyen ◽  
Le Van Chung ◽  
Nilanjan Dey

Face recognition is an important step that can affect the performance of a system. In this paper, the authors propose a novel Max-Min Ant System algorithm for optimal feature selection over Discrete Wavelet Transform features for video-based face recognition. The length of the selected feature vector is adopted as the heuristic information for the ants' pheromone in their algorithm. They select the optimal feature subset in terms of the shortest feature length and the best classifier performance, using a k-nearest neighbor classifier. Face recognition experiments show that the authors' algorithm can be easily implemented without any a priori information about the features, and its evaluated performance is better than previous feature selection approaches.
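The wrapper evaluation at the heart of this approach, trading off classifier performance against subset length, can be sketched as below. This only illustrates the fitness computation under assumed data and an assumed weighting factor; the Max-Min Ant System search itself (pheromone bounds, ant construction and updates) and the Discrete Wavelet Transform features are not reproduced.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic feature matrix standing in for DWT face features
X, y = make_classification(n_samples=200, n_features=40, n_informative=8,
                           random_state=0)

def subset_fitness(mask, alpha=0.95):
    """Score a candidate feature subset: weighted mix of kNN accuracy and brevity."""
    if mask.sum() == 0:
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=3),
                          X[:, mask], y, cv=3).mean()
    brevity = 1.0 - mask.sum() / mask.size   # shorter subsets score higher
    return alpha * acc + (1.0 - alpha) * brevity

rng = np.random.default_rng(0)
candidate = rng.random(X.shape[1]) < 0.5     # one random candidate subset (boolean mask)
print("features kept:", int(candidate.sum()), "fitness:", subset_fitness(candidate))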


2019 ◽  
Vol 4 (1) ◽  
pp. 69
Author(s):  
Kitami Akromunnisa ◽  
Rahmat Hidayat

Various scientific works by academics, such as theses, research reports, and practical work reports, are available in digital form. In general, however, this has not been accompanied by growth in the amount of information or knowledge that can be extracted from these electronic documents. This study aims to classify the abstracts of informatics engineering theses. The algorithm used is K-Nearest Neighbor. The data comprise 50 Indonesian-language abstracts, 454 English-language abstracts, and 504 titles. Each set is divided into training data and test data, and the test data are classified automatically with the trained classifier model. Based on the experiments, classifying the Indonesian abstract data without a stemming step gave higher accuracy: 100.0% at a 9:1 train/test ratio, compared with 90.0% at 8:2, 80.0% at 7:3, 60.0% at 6:4, and 80.0% when the data were split with k-fold cross-validation.
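A minimal sketch of this ratio comparison, under toy assumptions: the same TF-IDF + K-Nearest Neighbor classifier is evaluated at 9:1, 8:2, 7:3, and 6:4 train/test splits and with k-fold cross-validation. The generated documents, the value of k, and the omission of the Indonesian stemming step are all assumptions of the sketch.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

# Toy abstracts standing in for the thesis data (two study areas)
docs = [f"machine learning model experiment {i}" for i in range(10)] + \
       [f"network security protocol audit {i}" for i in range(10)]
labels = ["machine learning"] * 10 + ["security"] * 10

X = TfidfVectorizer().fit_transform(docs)
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")

# Compare the 9:1, 8:2, 7:3, and 6:4 train/test ratios
for test_frac in (0.1, 0.2, 0.3, 0.4):
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=test_frac,
                                              random_state=0, stratify=labels)
    knn.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, knn.predict(X_te))
    print(f"split {1 - test_frac:.0%}/{test_frac:.0%}: accuracy = {acc:.2f}")

# k-fold cross-validation variant
print("5-fold CV accuracy:", cross_val_score(knn, X, labels, cv=5).mean())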


2018 ◽  
Vol 150 ◽  
pp. 06006 ◽  
Author(s):  
Rozlini Mohamed ◽  
Munirah Mohd Yusof ◽  
Noorhaniza Wahidi

Feature selection is the process of selecting the best features from the large number of features in a dataset; the difficulty lies in selecting the subset that performs best under a given classifier. To produce better classification results, feature selection is applied in many classification tasks as part of the preprocessing step, where only a subset of features is used rather than all features of a dataset. This procedure not only removes irrelevant features but can, in some cases, also increase classification performance owing to the finite sample size. In this study, Chi-Square (CH), Information Gain (IG), and the Bat Algorithm (BA) are used to obtain feature subsets on fourteen well-known datasets from various applications. To measure the performance of the selected features, three benchmark classifiers are used: k-Nearest Neighbor (kNN), Naïve Bayes (NB), and Decision Tree (DT). This paper then analyzes the performance of all classifiers with feature selection in terms of accuracy, sensitivity, F-Measure, and ROC. The objective of the study is to determine which of the conventional and heuristic feature selection techniques performs best across various applications.
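A minimal sketch of this evaluation grid, with assumptions: chi-square and mutual information (for information gain) are paired with kNN, NB, and DT and scored on accuracy, sensitivity (recall), F-Measure, and ROC AUC on a synthetic binary dataset. The Bat Algorithm selector and the fourteen benchmark datasets are not reproduced here.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, f1_score, roc_auc_score

X, y = make_classification(n_samples=300, n_features=30, n_informative=8,
                           random_state=0)
X = np.abs(X)                       # chi2 requires non-negative feature values
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

selectors = {"CH": chi2, "IG": mutual_info_classif}
classifiers = {"kNN": KNeighborsClassifier(),
               "NB": GaussianNB(),
               "DT": DecisionTreeClassifier(random_state=0)}

for s_name, score_fn in selectors.items():
    sel = SelectKBest(score_fn, k=10).fit(X_tr, y_tr)
    Xs_tr, Xs_te = sel.transform(X_tr), sel.transform(X_te)
    for c_name, clf in classifiers.items():
        clf.fit(Xs_tr, y_tr)
        pred = clf.predict(Xs_te)
        proba = clf.predict_proba(Xs_te)[:, 1]
        print(f"{s_name}+{c_name}: acc={accuracy_score(y_te, pred):.2f} "
              f"sens={recall_score(y_te, pred):.2f} "
              f"F1={f1_score(y_te, pred):.2f} "
              f"AUC={roc_auc_score(y_te, proba):.2f}")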


2019 ◽  
Vol 1 (1) ◽  
pp. 14-19
Author(s):  
Febrian Wahyu Ramadhan ◽  
Husni Teja Sukmana ◽  
Lee Kyung Oh ◽  
Luh Kesuma Wardhani

Sentiment analysis is a method for reviewing products or services to determine opinions or feelings about them. Companies can use the results as evaluation material and as input for improving the products or services they provide. This study aims to measure public sentiment toward the quality of Bank Mandiri services, which hold ISO 20000-1 certification, by applying sentiment analysis with the K-NN algorithm based on ITSM criteria. The initial classification uses the lexicon method, detecting words that appear in sentiment word lists; the results serve as labels for the training and test data. The K-NN classification is then formed from the indexed training data and the weighted test data, with the value of k as the decision limit. The results of 10 trial scenarios show that K-NN classification reaches an accuracy of 98% with 50 test data against 600 training data, yielding 24% positive, 22% negative, and 55% neutral sentiment, with an F-measure of 95.83%. With 100 test data, the accuracy is 79%, with 21% positive, 42% negative, and 38% neutral sentiment, and an F-measure of 68.42%.
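The two-stage flow described here (lexicon-based labeling followed by K-NN classification) can be sketched as follows. The tiny lexicon, the toy tweets, and k = 3 are illustrative assumptions, not the study's actual lexicon, data, or parameters.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Toy sentiment lexicon (assumption, not the study's word lists)
POSITIVE = {"good", "fast", "helpful", "great"}
NEGATIVE = {"slow", "bad", "error", "down"}

def lexicon_label(text):
    """Label a text by counting lexicon hits, as in the initial lexicon stage."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

tweets = ["great service and fast response",
          "the app is slow and full of error",
          "transferred money this morning",
          "helpful staff with good experience",
          "system down again with bad service",
          "opened a new account yesterday"]
labels = [lexicon_label(t) for t in tweets]   # lexicon output becomes the training labels

# K-NN classifier trained on the lexicon-labeled tweets
vec = TfidfVectorizer()
X = vec.fit_transform(tweets)
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine").fit(X, labels)

print(knn.predict(vec.transform(["fast and helpful mobile banking"])))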


2020 ◽  
Vol 4 (4) ◽  
pp. 711-716 ◽  
Author(s):  
Brenda Irena ◽  
Erwin Budi Setiawan

Social media, Twitter among them, is a means for people to communicate and exchange information. Not all of the disseminated information is true, however; some news does not accord with the facts and is commonly called a hoax. There have been many cases of spreading hoaxes that cause public concern and often harm a particular individual or group. In this research, the authors build a system to identify hoax news on Twitter using the Decision Tree C4.5 classification method on 50,610 tweets. What distinguishes this research from previous work is the set of test scenarios: classification alone, classification with feature weighting, and classification with both feature weighting and feature selection. The weighting method used is TF-IDF, and feature selection uses Information Gain. The features are generated with n-grams consisting of unigrams, bigrams, and trigrams. The final results show that the test combining feature weighting and feature selection produces the best accuracy, 72.91%, with a 90:10 split of training and test data and 5,000 unigram features.
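A minimal sketch of the best-performing configuration, under assumptions: TF-IDF weighting over unigrams, information gain feature selection approximated with mutual information, and a decision tree with the entropy criterion standing in for C4.5. The generated tweets and the feature count are toy stand-ins; only the 90:10 split and the unigram setting follow the reported result.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy tweets with hoax/valid labels (illustrative only)
tweets = [f"breaking miracle cure discovered claim {i}" for i in range(10)] + \
         [f"government publishes official economic report {i}" for i in range(10)]
labels = ["hoax"] * 10 + ["valid"] * 10

# Unigram TF-IDF; ngram_range can be widened to (1, 2) or (1, 3) for bi/trigrams
X = TfidfVectorizer(ngram_range=(1, 1)).fit_transform(tweets)

# 90% training / 10% test, as in the reported best scenario
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.1,
                                          random_state=0, stratify=labels)

# Information gain feature selection, then the decision tree
sel = SelectKBest(mutual_info_classif, k=min(50, X_tr.shape[1])).fit(X_tr, y_tr)
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(sel.transform(X_tr), y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(sel.transform(X_te))))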


Kilat ◽  
2019 ◽  
Vol 8 (1) ◽  
Author(s):  
Riki Ruli A. Siregar ◽  
Zuhdiyyah Ulfah Siregar ◽  
Rakhmat Arianto

Analyzing and classifying comment data by reading negative comments and sorting them one by one in Ms. Excel is not effective when the data to be processed are large. This study therefore applies sentiment analysis to comment data using the K-Nearest Neighbor (KNN) method. The comments used are those written by participants of training courses at Udiklat Jakarta. The comment data are pre-processed, the words are weighted with Term Frequency-Inverse Document Frequency, and the similarity between training data and test data is computed with cosine similarity. Sentiment analysis is applied to determine whether each comment is positive or negative, and the comments are then classified into four categories: instructors, materials, facilities, and infrastructure. The result is a system that can classify comment data automatically with an accuracy of 94.23%.
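The core classification step described here (TF-IDF weighting, cosine similarity between a test comment and every training comment, majority vote over the k most similar) can be sketched as follows, with toy English comments standing in for the training-participant comments and k = 3 as an assumed parameter.

import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy labeled training comments (illustrative only)
train_comments = ["the instructor explained the material clearly",
                  "great facilities and a helpful instructor",
                  "the room was too small and the projector was broken",
                  "poor scheduling and confusing material"]
train_labels = ["positive", "positive", "negative", "negative"]

# TF-IDF weighting of the training comments
vec = TfidfVectorizer()
X_train = vec.fit_transform(train_comments)

def knn_predict(comment, k=3):
    """Cosine similarity to every training comment, majority vote over the top k."""
    sims = cosine_similarity(vec.transform([comment]), X_train).ravel()
    top_k = np.argsort(sims)[::-1][:k]          # indices of the k most similar comments
    return Counter(train_labels[i] for i in top_k).most_common(1)[0][0]

print(knn_predict("the instructor was helpful and clear"))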


2019 ◽  
Vol 8 (3) ◽  
pp. 343-355
Author(s):  
Alwi Assegaf ◽  
Moch. Abdul Mukid ◽  
Abdul Hoyyi

Classification methods continue to be developed in pursuit of more accurate results. The purpose of this research is to compare two extensions of the k-Nearest Neighbor (KNN) method, Local Mean k-Nearest Neighbor (LMKNN) and Multi Local Means k-Harmonic Nearest Neighbor (MLM-KHNN), in a case study of the financial statements of listed banks recorded in full at Bank Indonesia in 2017. LMKNN aims to improve classification performance and reduce the influence of outliers, while MLM-KHNN aims to reduce sensitivity to a single value of k. The study uses seven indicators of bank soundness: Capital Adequacy Ratio, Non-Performing Loans, Loan to Deposit Ratio, Return on Assets, Return on Equity, Net Interest Margin, and Operating Expenses to Operating Income, with bank health classified as very good (class 1), good (class 2), quite good (class 3), or poor (class 4). Classification accuracy is measured with the Apparent Error Rate (APER). The best LMKNN result uses 80% training data and 20% test data with k=7, giving the smallest APER of 0.0556 and an accuracy of 94.44%, while the best MLM-KHNN result uses 80% training data and 20% test data with k=3, giving an APER of 0.1667 and an accuracy of 83.33%. Based on the APER calculation, LMKNN classifies the health status of banks in Indonesia better than MLM-KHNN.
Keywords: Classification, Local Mean k-Nearest Neighbor (LMKNN), Multi Local Means k-Harmonic Nearest Neighbor (MLM-KHNN), Measure of classification accuracy
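A minimal sketch of the LMKNN rule compared in this study: for each class, take the k nearest training points of that class, average them into a local mean vector, and assign the query to the class whose local mean is closest. The MLM-KHNN variant (several local means per class combined through a harmonic mean of distances) is not reproduced, and the synthetic data merely stands in for the seven bank-soundness ratios.

import numpy as np
from sklearn.datasets import make_classification

def lmknn_predict(X_train, y_train, x_query, k=3):
    """Local Mean kNN: pick the class whose local mean of k nearest points is closest."""
    best_class, best_dist = None, np.inf
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        d = np.linalg.norm(Xc - x_query, axis=1)          # distances to class-c points
        local_mean = Xc[np.argsort(d)[:k]].mean(axis=0)   # mean of the k nearest in class c
        dist = np.linalg.norm(local_mean - x_query)
        if dist < best_dist:
            best_class, best_dist = c, dist
    return best_class

# Synthetic data with seven features and four classes, mirroring the stated setup
X, y = make_classification(n_samples=120, n_features=7, n_informative=5,
                           n_classes=4, random_state=0)
print(lmknn_predict(X[:100], y[:100], X[100], k=3))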

