MODIFIED CORRELATION WEIGHT K-NEAREST NEIGHBOR CLASSIFIER USING TRAINING DATASET CLEANING METHOD

2021 ◽  
Vol 32 (2) ◽  
pp. 20-25
Author(s):  
Efraim Kurniawan Dairo Kette

In pattern recognition, the k-Nearest Neighbor (kNN) algorithm is the simplest non-parametric algorithm. Because of this simplicity, kNN classification performance is usually influenced by the choice of model and by the quality of the training data itself. This article therefore proposes a sparse correlation weight model combined with a Training Data Set Cleaning (TDC) method based on Classification Ability Ranking (CAR), called the CAR classification method based on Coefficient-Weighted kNN (CAR-CWKNN), to improve kNN classifier performance. Correlation weights derived from Sparse Representation (SR) have been shown to increase classification accuracy. Because SR exposes the 'neighborhood' structure of the data, it is well suited to nearest-neighbor classification. In the cleaning stage, a Classification Ability (CA) function ranks the training samples and retains the best ones. The CA function applies a Leave-One-Out (LV1) scheme to remove samples from the original training data that are likely to yield wrong classification results, thereby reducing the influence of training data quality on kNN classification performance. Experiments on four public UCI classification data sets show that the CAR-CWKNN method provides better performance in terms of accuracy.
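A minimal sketch of the cleaning idea, assuming a plain leave-one-out correctness filter stands in for the paper's CA ranking and a distance-weighted kNN stands in for the sparse correlation weights (the exact CAR-CWKNN formulation is not reproduced); the Iris data set is used only for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def clean_training_set(X, y, k=5):
    """Drop training samples whose own neighbors misclassify them (LOO filter)."""
    keep = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        mask = np.arange(len(X)) != i          # leave sample i out
        knn = KNeighborsClassifier(n_neighbors=k, weights="distance")
        knn.fit(X[mask], y[mask])
        keep[i] = knn.predict(X[i:i + 1])[0] == y[i]
    return X[keep], y[keep]

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
X_clean, y_clean = clean_training_set(X_tr, y_tr)

# Distance-weighted kNN fitted on the cleaned training set.
model = KNeighborsClassifier(n_neighbors=5, weights="distance")
model.fit(X_clean, y_clean)
print("accuracy on held-out data:", model.score(X_te, y_te))
```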

2015 ◽  
Vol 4 (1) ◽  
pp. 61-81
Author(s):  
Mohammad Masoud Javidi

Multi-label classification is an extension of conventional classification in which a single instance can be associated with multiple labels. Problems of this type are ubiquitous in everyday life; for example, a movie can be categorized as action, crime, and thriller. Most multi-label classification algorithms are designed for balanced data and do not work well on imbalanced data, yet in real applications most datasets are imbalanced. We therefore focus on improving multi-label classification performance on imbalanced datasets. In this paper, a state-of-the-art multi-label classification algorithm called IBLR_ML is employed. This algorithm combines the k-nearest neighbor and logistic regression algorithms. The logistic regression part is further combined with two ensemble learning algorithms, bagging and boosting; the resulting approach is called IB-ELR. In this paper, for the first time, the bagging ensemble method with a stable learner as the base learner and imbalanced data sets as the training data is examined. Finally, the proposed methods are implemented in the Java language for evaluation. Experimental results show the effectiveness of the proposed methods. Keywords: multi-label classification, imbalanced data set, ensemble learning, stable algorithm, logistic regression, bagging, boosting
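A rough sketch of the ensemble idea under a binary-relevance simplification: one bagged logistic-regression ensemble per label, with synthetic multi-label data standing in for a real imbalanced corpus (this is not the exact IBLR_ML/IB-ELR formulation, which also uses nearest-neighbor label statistics):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

# Synthetic multi-label data (stand-in for a real imbalanced corpus).
X, Y = make_multilabel_classification(n_samples=1000, n_features=20,
                                      n_classes=5, n_labels=2, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.3, random_state=0)

# One bagged logistic-regression ensemble per label (binary relevance).
base = BaggingClassifier(LogisticRegression(max_iter=1000),
                         n_estimators=25, random_state=0)
model = MultiOutputClassifier(base)
model.fit(X_tr, Y_tr)

print("micro F1:", f1_score(Y_te, model.predict(X_te), average="micro"))
```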


2021 ◽  
Vol 87 (6) ◽  
pp. 445-455
Author(s):  
Yi Ma ◽  
Zezhong Zheng ◽  
Yutang Ma ◽  
Mingcang Zhu ◽  
Ran Huang ◽  
...  

Many manifold learning algorithms conduct an eigenvector analysis on a data-similarity matrix of size N×N, where N is the number of data points, so the memory complexity of the analysis is no less than O(N²). In this article we present an incremental manifold learning approach to handle large hyperspectral data sets for land use identification. In our method, the number of dimensions for the high-dimensional hyperspectral-image data set is estimated from the training data set. A local curvature variation algorithm is used to sample a subset of data points as landmarks, and a manifold skeleton is then identified from these landmarks. Our method is validated on three AVIRIS hyperspectral data sets, outperforming the comparison algorithms with a k-nearest-neighbor classifier and achieving the second-best performance with a support vector machine.
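An illustrative sketch of landmark-based manifold learning, using farthest-point sampling as a stand-in for the paper's local curvature variation criterion and the bundled digits data as a stand-in for hyperspectral pixels; the eigen-analysis is restricted to the landmarks, so the memory cost scales with the landmark count rather than N:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def farthest_point_landmarks(X, n_landmarks, seed=0):
    """Greedy farthest-point sampling: a simple stand-in for curvature-based landmarks."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(X)))]
    d = euclidean_distances(X, X[idx]).ravel()
    for _ in range(n_landmarks - 1):
        idx.append(int(d.argmax()))
        d = np.minimum(d, euclidean_distances(X, X[idx[-1:]]).ravel())
    return np.array(idx)

X, y = load_digits(return_X_y=True)            # stand-in for hyperspectral pixels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

landmarks = farthest_point_landmarks(X_tr, n_landmarks=300)
embed = Isomap(n_neighbors=10, n_components=10).fit(X_tr[landmarks])

# Embed all points via the landmark skeleton, then classify with kNN.
knn = KNeighborsClassifier(n_neighbors=5).fit(embed.transform(X_tr), y_tr)
print("accuracy:", knn.score(embed.transform(X_te), y_te))
```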


Diagnostics ◽  
2019 ◽  
Vol 9 (3) ◽  
pp. 104 ◽  
Author(s):  
Ahmed ◽  
Yigit ◽  
Isik ◽  
Alpkocak

Leukemia is a fatal cancer with two main types, acute and chronic, each of which has two subtypes, lymphoid and myeloid; in total, there are therefore four subtypes of leukemia. This study proposes a new approach for diagnosing all subtypes of leukemia from microscopic blood cell images using convolutional neural networks (CNN), which require a large training data set. We therefore also investigated the effect of data augmentation, which synthetically increases the number of training samples. We used two publicly available leukemia data sources, ALL-IDB and the ASH Image Bank, and applied seven different image transformation techniques as data augmentation. We designed a CNN architecture capable of recognizing all subtypes of leukemia, and we also explored other well-known machine learning algorithms such as naive Bayes, support vector machine, k-nearest neighbor, and decision tree. To evaluate our approach, we set up a series of experiments with 5-fold cross-validation. The results show that our CNN model achieves 88.25% and 81.74% accuracy in leukemia-versus-healthy and multiclass classification of all subtypes, respectively. Finally, we also show that the CNN model outperforms the other well-known machine learning algorithms.
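A toy sketch of the augmentation-plus-CNN pipeline (assuming TensorFlow/Keras 2.6 or later for the augmentation layers); random arrays stand in for the ALL-IDB and ASH Image Bank images, and the architecture is illustrative rather than the one proposed in the paper:

```python
import numpy as np
import tensorflow as tf

# Toy stand-in for microscopic blood-cell images: 4 leukemia subtypes.
num_classes = 4
x = np.random.rand(64, 128, 128, 3).astype("float32")
y = np.random.randint(0, num_classes, size=64)

# Flip/rotation layers stand in for the seven augmentation transforms
# described in the abstract; a small CNN follows.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(128, 128, 3)),
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x, y, epochs=2, batch_size=16, verbose=0)
```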


2019 ◽  
Vol 2 (1) ◽  
pp. 57 ◽  
Author(s):  
Irma Handayani

The vertebral column, as part of the backbone, plays an important role in the human body. Trauma to the vertebral column can affect the spinal cord's ability to send and receive messages between the brain and the body systems that control sensory and motor function. Disk hernia and spondylolisthesis are examples of pathologies of the vertebral column. Research on classifying pathologies or damage to the bones and joints of the skeletal system is rare, even though such a classification system could serve radiologists as a second opinion and improve their productivity and diagnostic consistency. This research used the Vertebral Column dataset from the UCI Machine Learning Repository, which has three classes (disk hernia, spondylolisthesis, and normal). The K-NN algorithm was applied to classify disk hernia and spondylolisthesis in the vertebral column, with the data grouped into two different but related classes: "normal" and "abnormal". The K-NN algorithm classifies data by using selected training samples as references and producing a vertebral column classification based on the learning process. The results showed that the accuracy of the K-NN classifier was 83%, and the average time needed for classification was 0.000212303 seconds.
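A short sketch of the binary normal/abnormal task, assuming the UCI Vertebral Column file has been downloaded locally (the file name and column names below are illustrative assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumes the UCI Vertebral Column file (six biomechanical features plus a
# class column, e.g. column_2C.dat) is available locally.
cols = ["pelvic_incidence", "pelvic_tilt", "lumbar_lordosis_angle",
        "sacral_slope", "pelvic_radius", "spondylolisthesis_grade", "class"]
df = pd.read_csv("column_2C.dat", sep=r"\s+", names=cols)

X, y = df[cols[:-1]], df["class"]           # "AB" (abnormal) vs "NO" (normal)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)
print("accuracy:", model.score(X_te, y_te))
```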


SinkrOn ◽  
2020 ◽  
Vol 4 (2) ◽  
pp. 42
Author(s):  
Rizki Muliono ◽  
Juanda Hakim Lubis ◽  
Nurul Khairina

Higher education plays a major role in improving the quality of education in Indonesia. The BAN-PT institution established by the government sets the standards for higher education and study program accreditation. The 4.0-based accreditation instrument encourages university leaders to improve the quality of their education. One indicator that determines the accreditation of a study program is the on-time graduation of its students. This study uses the K-Nearest Neighbor algorithm to predict student graduation times. Students' GPA through the seventh semester is used as training data, and data on students who have graduated are used as sample data. K-Nearest Neighbor works according to the given sample data. Prediction tests on 60 records for students of the 2015-2016 cohort achieved the highest accuracy, 98.5%, at k = 3. Prediction results depend on the pattern of the input data: the more sample and training data used, the more accurate the K-Nearest Neighbor calculation becomes.
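A hypothetical sketch of the graduation-time prediction, with synthetic per-semester GPA records standing in for the real student data and k = 3 as reported in the abstract:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical records: GPA for semesters 1-7; label 1 = graduated on time.
rng = np.random.default_rng(0)
gpa = rng.uniform(2.0, 4.0, size=(300, 7))
on_time = (gpa.mean(axis=1) > 3.0).astype(int)   # synthetic labelling rule

X_tr, X_te, y_tr, y_te = train_test_split(gpa, on_time, test_size=0.2,
                                          random_state=0)
model = KNeighborsClassifier(n_neighbors=3)      # k = 3 as in the abstract
model.fit(X_tr, y_tr)
print("accuracy:", model.score(X_te, y_te))
```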


2019 ◽  
Vol 8 (4) ◽  
pp. 407-417
Author(s):  
Inas Hasimah ◽  
Moch. Abdul Mukid ◽  
Hasbi Yasin

House credit (KPR) is a credit facility for buying a house or meeting other consumptive needs, with the house as collateral. For a KPR, the collateral is the house to be purchased; for a KPR multiguna take over, the collateral is a house already owned by the debtor, who then transfers the KPR to another financial institution. Granting credit to a prospective debtor involves a process of credit application and credit analysis, which establishes the debtor's ability to repay the credit. The final decision on a credit application is classified as approved or refused. k-Nearest Neighbor with attribute weighting by the Global Gini Diversity Index is a statistical method that can be used to classify the credit decision for a prospective debtor. This research uses 2443 records of prospective KPR multiguna take over debtors from 2018, with the credit decision as the dependent variable and four selected independent variables: home ownership status, job, loan amount, and income. The best classification result of k-NN with Global Gini Diversity Index weighting is obtained with an 80% training set and a 20% testing set at k = 7, giving an APER of 0.0798 and an accuracy of 92.02%. Keywords: KPR multiguna take over, classification, k-NN with Global Gini Diversity Index weighting, evaluation of classification
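A simplified sketch of attribute-weighted kNN: Gini-based feature importances from a random forest stand in for the Global Gini Diversity Index weights (which are not reproduced here), and a bundled binary data set stands in for the KPR records. Scaling each standardized feature by the square root of its weight makes ordinary Euclidean kNN behave as a weighted-distance kNN:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)      # binary stand-in for approved/refused
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Stand-in attribute weights: Gini-based importances from a random forest.
w = RandomForestClassifier(random_state=0).fit(X_tr_s, y_tr).feature_importances_
sqrt_w = np.sqrt(w)     # Euclidean distance on w-scaled features = weighted distance

model = KNeighborsClassifier(n_neighbors=7)      # k = 7 as in the abstract
model.fit(X_tr_s * sqrt_w, y_tr)
print("accuracy:", model.score(X_te_s * sqrt_w, y_te))
```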


Author(s):  
Yeni Kustiyahningsih

A large cattle population increases the potential for cattle disease to develop. Lack of knowledge about the various kinds of cattle diseases and how to handle them is one cause of decreasing cattle productivity. The aim of this research is to classify cattle diseases quickly and accurately, helping cattle breeders accelerate the detection and handling of disease. This study uses the K-Nearest Neighbour (KNN) classification method with F-Score feature selection. The KNN method classifies disease based on the distance between training data and test data, while F-Score feature selection reduces the attribute dimensions to retain only the relevant attributes. The data set comprises cattle disease records from Madura, with a total of 350 records consisting of 21 features and 7 classes. The data were partitioned with k-fold cross-validation using k = 5. Based on the test results, the best performance was obtained with 18 features and KNN (k = 3), which yielded an accuracy of 94.28571%, a recall of 0.942857, and a precision of 0.942857.
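A compact sketch of the pipeline, with ANOVA F-values standing in for the paper's F-Score ranking and a bundled data set standing in for the (non-public) Madura cattle disease data; 5-fold cross-validation and k = 3 follow the abstract:

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled data set used only for illustration of the feature-selection + KNN pipeline.
X, y = load_wine(return_X_y=True)

pipe = make_pipeline(StandardScaler(),
                     SelectKBest(f_classif, k=8),     # keep the top-ranked features
                     KNeighborsClassifier(n_neighbors=3))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print("5-fold accuracy: %.4f +/- %.4f" % (scores.mean(), scores.std()))
```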


Land ◽  
2018 ◽  
Vol 7 (4) ◽  
pp. 154 ◽  
Author(s):  
Odile Close ◽  
Beaumont Benjamin ◽  
Sophie Petit ◽  
Xavier Fripiat ◽  
Eric Hallot

Due to its cost-effectiveness and repeatability of observations, high resolution optical satellite remote sensing has become a major technology for land use and land cover mapping. However, inventory compilers for the Land Use, Land Use Change, and Forestry (LULUCF) sector still rely mostly on annual censuses and periodic surveys for such inventories. This study proposes a new approach based on per-pixel supervised classification of Sentinel-2 imagery from 2016 for mapping greenhouse gas emissions and removals associated with the LULUCF sector in Wallonia, Belgium. The Land Use/Cover Area frame statistical Survey (LUCAS) of 2015 was used as training data and as reference data to validate the map produced. We then investigated the performance of four widely used classifiers (maximum likelihood, random forest, k-nearest neighbor, and minimum distance) on different training sample sizes. We also studied the use of the rich spectral information of Sentinel-2 data as well as single-date and multitemporal classification. Our study illustrates how open source data can be used effectively for land use and land cover classification. This classification, based on Sentinel-2 and LUCAS, offers new opportunities for LULUCF greenhouse gas inventories on a European scale.
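A schematic comparison of the four classifier families on synthetic multiband pixels (quadratic discriminant analysis stands in for a maximum likelihood classifier, and the random data stand in for Sentinel-2 spectra with LUCAS labels):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid

# Synthetic 10-band "pixels" stand in for Sentinel-2 spectra; 5 land cover classes.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=8,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

classifiers = {
    "maximum likelihood (QDA stand-in)": QuadraticDiscriminantAnalysis(),
    "random forest": RandomForestClassifier(random_state=0),
    "k-nearest neighbor": KNeighborsClassifier(n_neighbors=5),
    "minimum distance": NearestCentroid(),
}
for name, clf in classifiers.items():
    print(name, clf.fit(X_tr, y_tr).score(X_te, y_te))
```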


2021 ◽  
Vol 7 (1) ◽  
Author(s):  
Yi Li ◽  
Chance M. Nowak ◽  
Uyen Pham ◽  
Khai Nguyen ◽  
Leonidas Bleris

Herein, we implement and assess machine learning architectures to ascertain models that differentiate healthy from apoptotic cells using exclusively forward-scatter (FSC) and side-scatter (SSC) flow cytometry information. To generate training data, colorectal cancer HCT116 cells were subjected to miR-34a treatment and then classified using a conventional Annexin V/propidium iodide (PI)-staining assay. Apoptotic cells were defined as Annexin V-positive cells, which include early and late apoptotic cells, necrotic cells, and other dying or dead cells. In addition to the fluorescent signal, we collected cell size and granularity information from the FSC and SSC parameters. Both parameters are subdivided into area, height, and width, providing a total of six numerical features that informed and trained our models. A collection of logistic regression, random forest, k-nearest neighbor, multilayer perceptron, and support vector machine classifiers was trained and tested for performance in predicting cell states using only these six numerical features. Out of 1046 candidate models, a multilayer perceptron was chosen, with 0.91 live precision, 0.93 live recall, 0.92 live F value, and 0.97 live area under the ROC curve when applied to standardized data. We discuss and highlight differences in classifier performance and compare the results to the standard practice of forward- and side-scatter gating, typically performed to select cells based on size and/or complexity. We demonstrate that our model, a ready-to-use module for any flow cytometry-based analysis, can provide automated, reliable, and stain-free classification of healthy and apoptotic cells using exclusively size and granularity information.
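A minimal sketch of the stain-free classification idea: a multilayer perceptron on standardized six-dimensional scatter features, with synthetic Gaussian events standing in for real FSC/SSC measurements and Annexin V-derived labels:

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical events with six scatter features (FSC/SSC area, height, width);
# label 1 = live, 0 = apoptotic, as defined by the Annexin V/PI reference assay.
rng = np.random.default_rng(0)
live = rng.normal(loc=1.0, scale=0.2, size=(500, 6))
apoptotic = rng.normal(loc=0.7, scale=0.3, size=(500, 6))
X = np.vstack([live, apoptotic])
y = np.array([1] * 500 + [0] * 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = make_pipeline(StandardScaler(),
                      MLPClassifier(hidden_layer_sizes=(32, 16),
                                    max_iter=1000, random_state=0))
model.fit(X_tr, y_tr)
print(classification_report(y_te, model.predict(X_te), digits=2))
```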

