The k conditional nearest neighbor algorithm for classification and class probability estimation

2019 ◽  
Vol 5 ◽  
pp. e194 ◽  
Author(s):  
Hyukjun Gweon ◽  
Matthias Schonlau ◽  
Stefan H. Steiner

The k nearest neighbor (kNN) approach is a simple and effective nonparametric algorithm for classification. One of the drawbacks of kNN is that the method can only give coarse estimates of class probabilities, particularly for low values of k. To avoid this drawback, we propose a new nonparametric classification method based on nearest neighbors conditional on each class: the proposed k conditional nearest neighbor (kCNN) approach calculates the distance between a new instance and the kth nearest neighbor from each class, estimates posterior probabilities of class membership using the distances, and assigns the instance to the class with the largest posterior. We prove that the proposed approach converges to the Bayes classifier as the size of the training data increases. Further, we extend the proposed approach to an ensemble method. Experiments on benchmark data sets show that both the proposed approach and its ensemble version on average outperform kNN, weighted kNN, probabilistic kNN and two similar algorithms (LMkNN and MLM-kHNN) in terms of the error rate. A simulation shows that kCNN may be useful for estimating posterior probabilities when the class distributions overlap.
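The core step of kCNN lends itself to a compact illustration. Below is a minimal sketch, assuming the posterior for each class is taken inversely proportional to the distance to that class's kth nearest neighbor; the paper's exact estimator may differ, so treat the normalization as a placeholder.

```python
import numpy as np

def kcnn_predict(X_train, y_train, x_new, k=3):
    """Sketch of k conditional nearest neighbors (kCNN): for each class,
    find the distance from x_new to that class's k-th nearest neighbor,
    turn the distances into posterior estimates, and pick the argmax."""
    classes = np.unique(y_train)
    dists = []
    for c in classes:
        Xc = X_train[y_train == c]
        d = np.sort(np.linalg.norm(Xc - x_new, axis=1))
        dists.append(d[min(k, len(d)) - 1])  # guard: class may have < k points
    dists = np.array(dists)
    scores = 1.0 / (dists + 1e-12)     # assumed inverse-distance scoring
    post = scores / scores.sum()       # normalize to probabilities
    return classes[np.argmax(post)], dict(zip(classes, post))
```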

2021 ◽  
Vol 87 (6) ◽  
pp. 445-455
Author(s):  
Yi Ma ◽  
Zezhong Zheng ◽  
Yutang Ma ◽  
Mingcang Zhu ◽  
Ran Huang ◽  
...  

Many manifold learning algorithms conduct an eigenvector analysis on a data-similarity matrix of size N×N, where N is the number of data points, so the memory complexity of the analysis is no less than O(N²). We present in this article an incremental manifold learning approach to handle large hyperspectral data sets for land use identification. In our method, the number of dimensions for the high-dimensional hyperspectral-image data set is obtained with the training data set. A local curvature variation algorithm is utilized to sample a subset of data points as landmarks. Then a manifold skeleton is identified based on the landmarks. Our method is validated on three AVIRIS hyperspectral data sets, outperforming the comparison algorithms with a k-nearest-neighbor classifier and achieving the second-best performance with a support vector machine.
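A minimal sketch of the landmark idea follows: keeping only an N×m matrix of distances to m landmarks avoids materializing the O(N²) similarity matrix. The paper selects landmarks by local curvature variation; plain farthest-point sampling is used here as a stand-in.

```python
import numpy as np

def landmark_distances(X, n_landmarks=100, rng=None):
    """Pick landmarks by farthest-point sampling (a stand-in for the
    paper's curvature-based sampling) and return the N x m matrix of
    distances to them, instead of the full N x N similarity matrix."""
    rng = np.random.default_rng(rng)
    idx = [int(rng.integers(len(X)))]
    d = np.linalg.norm(X - X[idx[0]], axis=1)
    for _ in range(n_landmarks - 1):
        idx.append(int(np.argmax(d)))                    # farthest point so far
        d = np.minimum(d, np.linalg.norm(X - X[idx[-1]], axis=1))
    L = X[idx]
    return np.linalg.norm(X[:, None, :] - L[None, :, :], axis=2)  # N x m
```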


Respati ◽  
2018 ◽  
Vol 13 (2) ◽  
Author(s):  
Eri Sasmita Susanto ◽  
Kusrini Kusrini ◽  
Hanif Al Fatta

This research focuses on testing the feasibility of predicting student graduation at Universitas AMIKOM Yogyakarta. The authors chose the K-Nearest Neighbors (K-NN) algorithm because K-NN can process numerical data and does not require a complicated iterative parameter-estimation scheme, which means it can be applied to large datasets. The input to the system is sample student data from 2014-2015. Testing in this research uses two sets: testing data and training data. The criteria used in this study are the GPA for semesters 1-4, credits (SKS) completed, and graduation status. The output of the system is a prediction of student graduation, divided into two classes: graduating on time and not on time. The test results show that k=14 with 5-fold cross-validation gives the best performance in predicting student graduation with the K-Nearest Neighbor method using the four-semester GPA, with accuracy = 98.46%, precision = 99.53% and recall = 97.64%.

Keywords: K-Nearest Neighbors Algorithm, Graduation Prediction, Testing Data, Training Data
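A minimal sketch of the reported setup (k=14, 5-fold cross-validation) using scikit-learn; the feature columns (four semester GPAs and completed credits) and the synthetic data are illustrative placeholders, not the study's data.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: [gpa_sem1..gpa_sem4, credits]; label 1 = on time, 0 = late.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = rng.integers(0, 2, 200)

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=14))
scores = cross_validate(model, X, y, cv=5,
                        scoring=("accuracy", "precision", "recall"))
print({name: s.mean() for name, s in scores.items() if name.startswith("test_")})
```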


2005 ◽  
Vol 15 (02) ◽  
pp. 101-150 ◽  
Author(s):  
GODFRIED TOUSSAINT

In the typical nonparametric approach to classification in instance-based learning and data mining, random data (the training set of patterns) are collected and used to design a decision rule (classifier). One of the best known such rules is the k-nearest-neighbor decision rule (also known as lazy learning), in which an unknown pattern is classified into the majority class among its k nearest neighbors in the training set. Several questions related to this rule have received considerable attention over the years. Such questions include the following. How can the storage of the training set be reduced without degrading the performance of the decision rule? How should the reduced training set be selected to represent the different classes? How large should k be? How should the value of k be chosen? Should all k neighbors be equally weighted when used to decide the class of an unknown pattern? If not, how should the weights be chosen? Should all the features (attributes) be weighted equally, and if not, how should the feature weights be chosen? What distance metric should be used? How can the rule be made robust to overlapping classes or noise present in the training data? How can the rule be made invariant to scaling of the measurements? How can the nearest neighbors of a new point be computed efficiently? What is the smallest neural network that can implement nearest neighbor decision rules? Geometric proximity graphs such as Voronoi diagrams and their many relatives provide elegant solutions to these problems, as well as other related data mining problems such as outlier detection. After a non-exhaustive review of some of the classical canonical approaches to these problems, the methods that use proximity graphs are discussed, some new observations are made, and open problems are listed.
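For reference, the baseline rule the survey starts from, the plain majority-vote k-NN classifier, fits in a few lines:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=5):
    """Classic k-nearest-neighbor rule: classify x into the majority
    class among its k nearest training points (Euclidean distance)."""
    order = np.argsort(np.linalg.norm(X_train - x, axis=1))
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```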


2019 ◽  
Vol 8 (11) ◽  
pp. 24869-24877 ◽  
Author(s):  
Shubham Pandey ◽  
Vivek Sharma ◽  
Garima Agrawal

K-Nearest Neighbor (KNN) classification is one of the most fundamental and simple classification methods. It is among the most frequently used classification algorithms when there is little or no prior knowledge about the distribution of the data. In this paper a modification is proposed to improve the performance of KNN. The main idea is to use a set of robust neighbors in the training data. The modified KNN proposed in this paper improves on traditional KNN in both robustness and performance. Inspired by the traditional KNN algorithm, the main idea is to classify an input query according to the most frequent tag in the set of neighbor tags, with the tag closest to the new tuple having the greatest say. The proposed modified KNN can be considered a kind of weighted KNN, so that the query label is approximated by weighting the neighbors of the query. The procedure computes the frequency of each label among the neighbors relative to the total number of neighbors, with the value associated with each label multiplied by a factor inversely proportional to the distance between the new tuple and the neighbor. The proposed method is evaluated on several standard UCI data sets. Experiments show a significant improvement in the performance of the KNN method.
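A minimal sketch of the distance-weighted voting described above; the exact weighting factor in the paper may differ, so the inverse-distance weight here is one standard instantiation.

```python
import numpy as np

def weighted_knn_classify(X_train, y_train, x, k=5):
    """Each of the k nearest neighbors votes for its label with a
    weight inversely proportional to its distance from the query,
    so the closest neighbor has the largest say."""
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]
    weights = 1.0 / (d[idx] + 1e-12)   # avoid division by zero
    scores = {}
    for i, w in zip(idx, weights):
        scores[y_train[i]] = scores.get(y_train[i], 0.0) + w
    return max(scores, key=scores.get)
```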


2015 ◽  
Vol 2 (1) ◽  
pp. 1
Author(s):  
Agung Nugroho ◽  
Kusrini Kusrini ◽  
M. Rudyanto Arief

Many factors and variables affect credit risk in decision making for the People's Business Credit (KUR) program. The factors used as the basis for KUR assessment at the PT Bank Rakyat Indonesia Kaliangkrik unit follow the basic principle known as the "5 C's of Credit": Character, Capacity, Capital, Condition, and Collateral. From these assessment factors, a classification-rule-mining method is used to build a decision support system for granting KUR. Several data mining algorithms can be used for classification; one of them is the k-nearest neighbor algorithm. The decision support system is designed to classify an object based on the training data closest to that object and to recommend customers who are eligible for KUR based on user input, using the k-nearest neighbors (KNN) method. Payment-transaction data of existing customers serve as training data, with each record assigned a class beforehand. Class assignment is carried out by classifying the data according to customer-status categories based on the amount of overdue credit payments. From the similarity computed between a prospective customer's data and the existing customers' training data using the K-Nearest Neighbor algorithm, the result with the highest score serves as the reference for the decision maker.
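A minimal sketch of the described pipeline; the numeric encoding of the five credit criteria and the eligibility labels are hypothetical placeholders.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Past borrowers scored on the "5 C's" (character, capacity, capital,
# condition, collateral; 1-5 scale) and labeled by repayment history.
X_train = np.array([[4, 3, 2, 3, 4],
                    [2, 1, 1, 2, 2],
                    [3, 4, 3, 3, 3]])
y_train = np.array(["eligible", "not_eligible", "eligible"])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
applicant = np.array([[3, 3, 2, 3, 3]])
print(knn.predict(applicant))  # recommendation shown to the decision maker
```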


2015 ◽  
Vol 4 (1) ◽  
pp. 61-81
Author(s):  
Mohammad Masoud Javidi

Multi-label classification is an extension of conventional classification in which a single instance can be associated with multiple labels. Problems of this type are ubiquitous in everyday life; for example, a movie can be categorized as action, crime, and thriller. Most multi-label classification algorithms are designed for balanced data and do not work well on imbalanced data, yet in real applications most datasets are imbalanced. We therefore focus on improving multi-label classification performance on imbalanced datasets. In this paper, a state-of-the-art multi-label classification algorithm called IBLR-ML is employed. This algorithm combines the k-nearest neighbor and logistic regression algorithms. Its logistic regression part is combined with two ensemble learning algorithms, bagging and boosting; the resulting approach is called IB-ELR. In this paper, for the first time, the bagging ensemble method with a stable learner as the base learner and imbalanced data sets as the training data is examined. Finally, to evaluate the proposed methods, they are implemented in the Java language. Experimental results show the effectiveness of the proposed methods.

Keywords: Multi-label classification, Imbalanced data set, Ensemble learning, Stable algorithm, Logistic regression, Bagging, Boosting
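A rough sketch of the IBLR idea in miniature: use the label statistics of each instance's k nearest neighbors as features for per-label logistic regression, here wrapped in bagging. This is a simplified illustration under assumed data shapes, not the paper's Java implementation.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def knn_label_features(X, Y, k=5):
    """For each instance, the fraction of its k nearest neighbors
    carrying each label (self excluded) -- the instance-based features
    that IBLR-style methods feed into logistic regression."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    return Y[idx[:, 1:]].mean(axis=1)

# Hypothetical multi-label data: Y is an n x L binary indicator matrix.
rng = np.random.default_rng(0)
X = rng.random((100, 4))
Y = (rng.random((100, 3)) > 0.7).astype(int)

F = knn_label_features(X, Y)
# One bagged logistic regression per label: the IB-ELR idea in miniature.
models = [BaggingClassifier(LogisticRegression(), n_estimators=10).fit(F, Y[:, j])
          for j in range(Y.shape[1])]
```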


Author(s):  
I Wayan Agus Surya Darma

Balinese script is an important vessel of Balinese culture and has continued to develop alongside technological advances. Balinese script consists of three types, (1) Wrésastra, (2) Swalalita, and (3) Modre, which have different characters. The Wrésastra and Swalalita scripts are the types used for writing in everyday life. In this research, the zoning method is implemented in the feature extraction process to produce the distinctive features of Balinese script, which are then used in the classification process to recognize Balinese script characters. The zoning method divides the character-image area of a Balinese script into several regions to enrich the features of each script. The results of feature extraction are stored as training data for the classification process. K-Nearest Neighbors is implemented to classify these distinctive features of Balinese script characters. Based on the test results, the highest accuracy, 97.5%, was obtained using K=3 and reference=10.
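A minimal sketch of zoning feature extraction followed by K=3 classification; the 4×4 grid, image size, and synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def zoning_features(img, grid=(4, 4)):
    """Split a binary character image into a grid of zones and use the
    ink density (mean pixel value) of each zone as one feature."""
    h, w = img.shape
    gh, gw = grid
    return np.array([img[i*h//gh:(i+1)*h//gh, j*w//gw:(j+1)*w//gw].mean()
                     for i in range(gh) for j in range(gw)])

# Stand-in for preprocessed character images and their script labels.
rng = np.random.default_rng(0)
imgs = rng.integers(0, 2, size=(50, 32, 32))
labels = rng.integers(0, 5, size=50)

X = np.array([zoning_features(im) for im in imgs])
knn = KNeighborsClassifier(n_neighbors=3).fit(X, labels)  # K = 3, as reported
```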


2013 ◽  
Vol 8-9 ◽  
pp. 508-515
Author(s):  
Raul Malutan ◽  
Pedro Gómez Vilda ◽  
Monica Borda

Data classification has an important role in analyzing high-dimensional data. In this paper the Gene Shaving algorithm was used for an initial supervised classification, and once the cluster information was obtained, the data were classified again with supervised algorithms like Support Vector Machine (SVM) and k-Nearest Neighbor (k-NN) for an optimal clustering. These algorithms have proven useful when the classes of the training data and the attributes of each class are well established. The algorithms were run on several data sets; the quality of the obtained clusters was observed to depend on the number of clusters specified.
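A rough sketch of the two-stage pipeline, with KMeans standing in for Gene Shaving (whose implementation is not shown here): derive cluster labels first, then train and compare supervised classifiers on them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((300, 20))                                 # stand-in data
labels = KMeans(n_clusters=4, n_init=10).fit_predict(X)   # stage 1: clusters

# Stage 2: supervised classifiers trained and scored on the cluster labels.
for clf in (SVC(), KNeighborsClassifier()):
    acc = clf.fit(X[:200], labels[:200]).score(X[200:], labels[200:])
    print(type(clf).__name__, acc)
```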


2005 ◽  
Vol 17 (3) ◽  
pp. 731-740 ◽  
Author(s):  
Amir F. Atiya

In many pattern classification problems, an estimate of the posterior probabilities (rather than only a classification) is required. This is usually the case when some confidence measure in the classification is needed. In this article, we propose a new posterior probability estimator. The proposed estimator considers the K nearest neighbors. It attaches a weight to each neighbor that contributes in an additive fashion to the posterior probability estimate. The weights corresponding to the K nearest neighbors (which sum to 1) are estimated from the data using a maximum likelihood approach. Simulation studies confirm the effectiveness of the proposed estimator.
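The additive form of the estimator is easy to sketch: the jth nearest neighbor contributes its weight w_j to the probability of its own class. The weights are assumed given here; the paper fits them from the data by maximum likelihood, which is not reproduced in this sketch.

```python
import numpy as np

def weighted_knn_posterior(X_train, y_train, x, weights):
    """Posterior estimate P(c | x) = sum of w_j over the j-th nearest
    neighbors whose label is c, with the w_j nonnegative and summing
    to 1 (passed in here; the paper estimates them by ML)."""
    K = len(weights)
    order = np.argsort(np.linalg.norm(X_train - x, axis=1))[:K]
    post = {c: 0.0 for c in np.unique(y_train)}
    for w, i in zip(weights, order):
        post[y_train[i]] += w
    return post
```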


Mathematics ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 779
Author(s):  
Ruriko Yoshida

A tropical ball is a ball defined by the tropical metric over the tropical projective torus. In this paper we show several properties of tropical balls over the tropical projective torus and also over the space of phylogenetic trees with a given set of leaf labels. Then we discuss their application to the K nearest neighbors (KNN) algorithm, a supervised learning method used to classify a high-dimensional vector into given categories by looking at a ball centered at the vector that contains its K nearest vectors in the space.
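For concreteness, a minimal sketch of KNN under the tropical metric d_tr(x, y) = max_i(x_i − y_i) − min_i(x_i − y_i) on the tropical projective torus; the choice of k and any data fed to it are illustrative.

```python
import numpy as np

def tropical_distance(x, y):
    """Tropical metric on the tropical projective torus:
    d_tr(x, y) = max_i (x_i - y_i) - min_i (x_i - y_i).
    Invariant under adding a constant to all coordinates."""
    d = x - y
    return d.max() - d.min()

def tropical_knn(X_train, y_train, x, k=3):
    """Classify x by majority vote among the k training points that
    fall in the smallest tropical ball around x."""
    dists = np.array([tropical_distance(xi, x) for xi in X_train])
    idx = np.argsort(dists)[:k]
    vals, counts = np.unique(y_train[idx], return_counts=True)
    return vals[np.argmax(counts)]
```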

