nearest neighbor classifiers Latest Research Papers

Knowledge extraction within a healthcare field is a very challenging task since we are having many problems such as noise and imbalanced datasets. They are obtained from clinical studies where uncertainty and variability are popular. Lately, a wide number of machine learning algorithms are considered and evaluated to check their validity of being used in the medical field. Usually, the classification algorithms are compared against medical experts who are specialized in certain disease diagnoses and provide an effective methodological evaluation of classifiers by applying performance metrics. The performance metrics contain four criteria: accuracy, sensitivity, and specificity forming the confusion matrix of each used algorithm. We have utilized eight different well-known machine learning algorithms to evaluate their performances in six different medical datasets. Based on the experimental results we conclude that the XGBoost and K-Nearest Neighbor classifiers were the best overall among the used datasets and signs can be used for diagnosing various diseases.

Download Full-text

Shape-Based Classification of Partially Observed Curves, With Applications to Anthropology

Frontiers in Applied Mathematics and Statistics ◽

10.3389/fams.2021.759622 ◽

2021 ◽

Vol 7 ◽

Author(s):

Gregory J. Matthews ◽

Karthik Bharath ◽

Sebastian Kurtek ◽

Juliet K. Brophy ◽

George K. Thiruvathukal ◽

...

Keyword(s):

Computational Methods ◽

Nearest Neighbor ◽

Training Data ◽

Paleoenvironmental Reconstruction ◽

Shape Information ◽

Partially Observed ◽

Nearest Neighbor Classifiers ◽

Imputation Procedure

We consider the problem of classifying curves when they are observed only partially on their parameter domains. We propose computational methods for (i) completion of partially observed curves; (ii) assessment of completion variability through a nonparametric multiple imputation procedure; (iii) development of nearest neighbor classifiers compatible with the completion techniques. Our contributions are founded on exploiting the geometric notion of shape of a curve, defined as those aspects of a curve that remain unchanged under translations, rotations and reparameterizations. Explicit incorporation of shape information into the computational methods plays the dual role of limiting the set of all possible completions of a curve to those with similar shape while simultaneously enabling more efficient use of training data in the classifier through shape-informed neighborhoods. Our methods are then used for taxonomic classification of partially observed curves arising from images of fossilized Bovidae teeth, obtained from a novel anthropological application concerning paleoenvironmental reconstruction.

Download Full-text

SUBiNN: a stacked uni- and bivariate kNN sparse ensemble

Advances in Data Analysis and Classification ◽

10.1007/s11634-021-00462-7 ◽

2021 ◽

Author(s):

Tiffany Elsten ◽

Mark de Rooij

Keyword(s):

Random Forests ◽

Nearest Neighbor ◽

Ensemble Methods ◽

Predictive Performance ◽

Ensemble Classifier ◽

Support Vector ◽

Data Sets ◽

Vector Machines ◽

Lasso Method ◽

Nearest Neighbor Classifiers

AbstractNearest Neighbor classification is an intuitive distance-based classification method. It has, however, two drawbacks: (1) it is sensitive to the number of features, and (2) it does not give information about the importance of single features or pairs of features. In stacking, a set of base-learners is combined in one overall ensemble classifier by means of a meta-learner. In this manuscript we combine univariate and bivariate nearest neighbor classifiers that are by itself easily interpretable. Furthermore, we combine these classifiers by a Lasso method that results in a sparse ensemble of nonlinear main and pairwise interaction effects. We christened the new method SUBiNN: Stacked Uni- and Bivariate Nearest Neighbors. SUBiNN overcomes the two drawbacks of simple nearest neighbor methods. In extensive simulations and using benchmark data sets, we evaluate the predictive performance of SUBiNN and compare it to other nearest neighbor ensemble methods as well as Random Forests and Support Vector Machines. Results indicate that SUBiNN often outperforms other nearest neighbor methods, that SUBiNN is well capable of identifying noise features, but that Random Forests is often, but not always, the best classifier.

Download Full-text

Semantics of Voids within Data: Ignorance-Aware Machine Learning

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi10040246 ◽

2021 ◽

Vol 10 (4) ◽

pp. 246

Author(s):

Vagan Terziyan ◽

Anton Nikulin

Keyword(s):

Spatial Data ◽

Information Science ◽

Nearest Neighbor ◽

Geographical Information ◽

Two Dimensional ◽

Prototype Selection ◽

Data Space ◽

Important Concern ◽

Traditional Classification ◽

Nearest Neighbor Classifiers

Operating with ignorance is an important concern of geographical information science when the objective is to discover knowledge from the imperfect spatial data. Data mining (driven by knowledge discovery tools) is about processing available (observed, known, and understood) samples of data aiming to build a model (e.g., a classifier) to handle data samples that are not yet observed, known, or understood. These tools traditionally take semantically labeled samples of the available data (known facts) as an input for learning. We want to challenge the indispensability of this approach, and we suggest considering the things the other way around. What if the task would be as follows: how to build a model based on the semantics of our ignorance, i.e., by processing the shape of “voids” within the available data space? Can we improve traditional classification by also modeling the ignorance? In this paper, we provide some algorithms for the discovery and visualization of the ignorance zones in two-dimensional data spaces and design two ignorance-aware smart prototype selection techniques (incremental and adversarial) to improve the performance of the nearest neighbor classifiers. We present experiments with artificial and real datasets to test the concept of the usefulness of ignorance semantics discovery.

Download Full-text

Similarities: Nearest-Neighbor Classifiers

10.1007/978-3-030-81935-4_3 ◽

2021 ◽

pp. 41-64

Author(s):

Miroslav Kubat

Keyword(s):

Nearest Neighbor ◽

Nearest Neighbor Classifiers

Download Full-text

Nearest neighbor classifiers over incomplete information

Proceedings of the VLDB Endowment ◽

10.14778/3430915.3430917 ◽

2020 ◽

Vol 14 (3) ◽

pp. 255-267

Author(s):

Bojan Karlaš ◽

Peng Li ◽

Renzhi Wu ◽

Nezihe Merve Gürel ◽

Xu Chu ◽

...

Keyword(s):

Machine Learning ◽

Incomplete Information ◽

Possible Worlds ◽

Missing Values ◽

Nearest Neighbor ◽

Efficient Solutions ◽

Classification Problems ◽

Database Research ◽

Real World Datasets ◽

Nearest Neighbor Classifiers

Machine learning (ML) applications have been thriving recently, largely attributed to the increasing availability of data. However, inconsistency and incomplete information are ubiquitous in real-world datasets, and their impact on ML applications remains elusive. In this paper, we present a formal study of this impact by extending the notion of Certain Answers for Codd tables , which has been explored by the database research community for decades, into the field of machine learning. Specifically, we focus on classification problems and propose the notion of "Certain Predictions" (CP) --- a test data example can be certainly predicted (CP'ed) if all possible classifiers trained on top of all possible worlds induced by the incompleteness of data would yield the same prediction. We study two fundamental CP queries: (Q1) checking query that determines whether a data example can be CP'ed; and (Q2) counting query that computes the number of classifiers that support a particular prediction (i.e., label). Given that general solutions to CP queries are, not surprisingly, hard without assumption over the type of classifier, we further present a case study in the context of nearest neighbor (NN) classifiers, where efficient solutions to CP queries can be developed --- we show that it is possible to answer both queries in linear or polynomial time over exponentially many possible worlds. We demonstrate one example use case of CP in the important application of "data cleaning for machine learning (DC for ML)." We show that our proposed CPClean approach built based on CP can often significantly outperform existing techniques, particularly on datasets with systematic missing values. For example, on 5 datasets with systematic missingness, CPClean (with early termination) closes 100% gap on average by cleaning 36% of dirty data on average, while the best automatic cleaning approach BoostClean can only close 14% gap on average.

Download Full-text