Nearest neighbor classifiers over incomplete information

2020 ◽  
Vol 14 (3) ◽  
pp. 255-267
Author(s):  
Bojan Karlaš ◽  
Peng Li ◽  
Renzhi Wu ◽  
Nezihe Merve Gürel ◽  
Xu Chu ◽  
...  

Machine learning (ML) applications have been thriving recently, largely owing to the increasing availability of data. However, inconsistency and incomplete information are ubiquitous in real-world datasets, and their impact on ML applications remains elusive. In this paper, we present a formal study of this impact by extending the notion of Certain Answers for Codd tables, which has been explored by the database research community for decades, into the field of machine learning. Specifically, we focus on classification problems and propose the notion of "Certain Predictions" (CP): a test data example can be certainly predicted (CP'ed) if all possible classifiers trained on top of all possible worlds induced by the incompleteness of the data would yield the same prediction. We study two fundamental CP queries: (Q1) a checking query that determines whether a data example can be CP'ed; and (Q2) a counting query that computes the number of classifiers that support a particular prediction (i.e., label). Given that general solutions to CP queries are, not surprisingly, hard without assumptions about the type of classifier, we further present a case study in the context of nearest neighbor (NN) classifiers, where efficient solutions to CP queries can be developed: we show that it is possible to answer both queries in linear or polynomial time over exponentially many possible worlds. We demonstrate one example use case of CP in the important application of "data cleaning for machine learning (DC for ML)." We show that our proposed CPClean approach, built on CP, can often significantly outperform existing techniques, particularly on datasets with systematic missing values. For example, on 5 datasets with systematic missingness, CPClean (with early termination) closes 100% of the gap on average by cleaning 36% of the dirty data on average, while the best automatic cleaning approach, BoostClean, closes only 14% of the gap on average.
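
To make the CP checking query (Q1) concrete, the sketch below answers it by brute force for a 1-NN classifier over a toy table with one missing cell. The dataset, candidate repairs, and query point are all hypothetical; note that the paper's contribution is precisely answering such queries *without* enumerating the exponentially many possible worlds, which this naive enumeration does not attempt.

```python
from itertools import product

# Training rows: [feature_1, feature_2, label]; None marks a missing cell.
train = [
    [1.0, None, "a"],
    [4.0, 5.0, "b"],
    [0.5, 1.0, "a"],
]
# Candidate repairs for each missing cell, keyed by (row, column).
candidates = {(0, 1): [1.5, 6.0]}

def one_nn(rows, x):
    """Label of the training row closest to x (squared Euclidean distance)."""
    return min(rows, key=lambda r: sum((a - b) ** 2 for a, b in zip(r[:-1], x)))[-1]

def certain_prediction(rows, cands, x):
    """Q1 by brute force: does 1-NN predict the same label in every possible world?"""
    cells = list(cands)
    labels = set()
    for repair in product(*(cands[c] for c in cells)):
        world = [row[:] for row in rows]      # materialize one possible world
        for (i, j), v in zip(cells, repair):
            world[i][j] = v
        labels.add(one_nn(world, x))
    return len(labels) == 1, labels

cp, labels = certain_prediction(train, candidates, x=(0.8, 1.2))
print("certainly predicted:", cp, "labels seen:", labels)
```

Here both repairs of the missing cell leave the query point nearest to a row labeled "a", so the example is CP'ed regardless of which world is the true one.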

2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Gazi Mohammed Ifraz ◽  
Muhammad Hasnath Rashid ◽  
Tahia Tazin ◽  
Sami Bourouis ◽  
Mohammad Monirujjaman Khan

Chronic kidney disease (CKD) is a major burden on the healthcare system because of its increasing prevalence, high risk of progression to end-stage renal disease, and poor morbidity and mortality prognosis. It is rapidly becoming a global health crisis. Unhealthy dietary habits and insufficient water consumption are significant contributors to this disease. Without functioning kidneys, a person can survive for only about 18 days on average, making kidney transplantation or dialysis necessary. It is critical to have reliable techniques for predicting CKD in its early stages, and machine learning (ML) techniques are well suited to this task. The current study offers a methodology for predicting CKD status using clinical data, which incorporates data preprocessing, a technique for managing missing values, data aggregation, and feature extraction. A number of physiological variables, together with ML techniques such as logistic regression (LR), decision tree (DT) classification, and K-nearest neighbor (KNN), were used in this work to train three distinct models for reliable prediction. The LR classification method was found to be the most accurate in this role, with an accuracy of about 97 percent in this study. The technique was developed on the publicly available CKD dataset. Compared to prior research, the accuracy rate of the models employed in this study is considerably greater, implying that they are also more trustworthy than the models used in previous studies. A large number of model comparisons have demonstrated their resilience, and the scheme may be inferred from the study's results.
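
The modeling pipeline the abstract describes (handle missing values, then train and compare LR, DT, and KNN) can be sketched as follows. The CKD dataset itself is not reproduced here; a synthetic binary-classification table with injected missing values stands in for it, and mean imputation is only one of many reasonable choices for the missing-value step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the clinical table, with ~5% of cells set to NaN.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
models = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
scores = {}
for name, clf in models.items():
    # Impute missing values, standardize, then fit the classifier.
    pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(), clf)
    scores[name] = pipe.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: test accuracy {scores[name]:.3f}")
```

On the real CKD data the study reports LR as the strongest of the three; on other data the ranking can differ, which is the point of comparing all three in one pipeline.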


2021 ◽  
Vol 11 (19) ◽  
pp. 8884
Author(s):  
Oscar Camacho-Urriolagoitia ◽  
Itzamá López-Yáñez ◽  
Yenny Villuendas-Rey ◽  
Oscar Camacho-Nieto ◽  
Cornelio Yáñez-Márquez

The presence of machine learning, data mining and related disciplines is increasingly evident in everyday environments. The support for the applications of learning techniques in topics related to economic risk assessment, among other financial topics of interest, is relevant for us as human beings. The content of this paper consists of a proposal of a new supervised learning algorithm and its application in real world datasets related to finance, called D1-NN (Dynamic 1-Nearest Neighbor). The D1-NN performance is competitive against the main state of the art algorithms in solving finance-related problems. The effectiveness of the new D1-NN classifier was compared against five supervised classifiers of the most important approaches (Bayes, nearest neighbors, support vector machines, classifier ensembles, and neural networks), with superior results overall.


2021 ◽  
Author(s):  
Tlamelo Emmanuel ◽  
Thabiso Maupong ◽  
Dimane Mpoeleng ◽  
Thabo Semong ◽  
Mphago Banyatsang ◽  
...  

Abstract Machine learning has been a cornerstone of analysing and extracting information from data, and a problem of missing values is often encountered. Missing values arise under various mechanisms: missing completely at random, missing at random, or missing not at random. All of these may result from system malfunction during data collection or human error during data pre-processing. Nevertheless, it is important to deal with missing values before analysing data, since ignoring or omitting them may result in biased or misinformed analysis. In the literature there have been several proposals for handling missing values. In this paper we survey some of the literature on missing data, focusing particularly on machine learning techniques. We also give insight into how the machine learning approaches work by highlighting the key features of the proposed techniques, how they perform, their limitations, and the kind of data they are most suitable for. Finally, we experiment with the k-nearest neighbor and random forest imputation techniques on novel power plant induced fan data and offer some possible future research directions.


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Tlamelo Emmanuel ◽  
Thabiso Maupong ◽  
Dimane Mpoeleng ◽  
Thabo Semong ◽  
Banyatsang Mphago ◽  
...  

Abstract Machine learning has been a cornerstone of analysing and extracting information from data, and a problem of missing values is often encountered. Missing values arise under various mechanisms: missing completely at random, missing at random, or missing not at random. All of these may result from system malfunction during data collection or human error during data pre-processing. Nevertheless, it is important to deal with missing values before analysing data, since ignoring or omitting them may result in biased or misinformed analysis. In the literature there have been several proposals for handling missing values. In this paper, we survey some of the literature on missing data, focusing particularly on machine learning techniques. We also give insight into how the machine learning approaches work by highlighting the key features of missing value imputation techniques, how they perform, their limitations, and the kind of data they are most suitable for. We propose and evaluate two methods, the k-nearest neighbor and an iterative imputation method (missForest) based on the random forest algorithm. Evaluation is performed on the Iris data and on novel power plant fan data with induced missing values at missingness rates of 5% to 20%. We show that both missForest and the k-nearest neighbor can successfully handle missing values, and we offer some possible future research directions.
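
The two imputers evaluated above can be sketched with scikit-learn: `KNNImputer` for the k-nearest-neighbor method, and `IterativeImputer` with a random forest regressor as a missForest-style iterative imputer. The power-plant fan data is not public, so this sketch induces ~10% missingness on Iris only and scores both imputers by RMSE on the masked cells; the exact rates and estimator settings are illustrative choices, not the paper's.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

X = load_iris().data
rng = np.random.default_rng(42)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.10] = np.nan  # induce ~10% missingness

# k-nearest-neighbor imputation: fill each gap from the 5 most similar rows.
knn_filled = KNNImputer(n_neighbors=5).fit_transform(X_miss)
# missForest-style: iteratively regress each column on the others with a forest.
forest_filled = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5, random_state=0,
).fit_transform(X_miss)

mask = np.isnan(X_miss)  # compare only the cells that were actually removed
rmses = {}
for name, filled in [("KNN", knn_filled), ("missForest-style", forest_filled)]:
    rmses[name] = float(np.sqrt(np.mean((filled[mask] - X[mask]) ** 2)))
    print(f"{name} imputation RMSE on masked cells: {rmses[name]:.3f}")
```

Because the true values were held out before masking, the RMSE directly measures how well each imputer reconstructs them, mirroring the paper's evaluation protocol.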


2015 ◽  
Vol 2015 ◽  
pp. 1-12 ◽  
Author(s):  
Feng Hu ◽  
Jin Shi

The problem of classification in incomplete information systems is a hot issue in intelligent information processing. The hypergraph is a recent intelligent method for machine learning. However, it is hard for the traditional hypergraph to process incomplete information systems, for two reasons: (1) hyperedges are generated randomly in the traditional hypergraph model; and (2) existing methods are unsuitable for incomplete information systems because of their missing values. In this paper, we propose a novel classification algorithm for incomplete information systems based on the hypergraph model and rough set theory. First, we initialize the hypergraph. Second, we classify the training set by the neighborhood hypergraph. Third, under the guidance of rough sets, we replace the poor hyperedges. After that, we obtain a good classifier. The proposed approach is tested on 15 datasets from the UCI machine learning repository and compared with several existing methods, such as C4.5, SVM, Naive Bayes, and KNN. The experimental results show that the proposed algorithm has better performance in terms of precision, recall, AUC, and F-measure.


2021 ◽  
Vol 14 (11) ◽  
pp. 1964-1978
Author(s):  
Mengzhao Wang ◽  
Xiaoliang Xu ◽  
Qiang Yue ◽  
Yuxiang Wang

Approximate nearest neighbor search (ANNS) constitutes an important operation in a multitude of applications, including recommendation systems, information retrieval, and pattern recognition. In the past decade, graph-based ANNS algorithms have been the leading paradigm in this domain, with dozens of graph-based ANNS algorithms proposed. Such algorithms aim to provide effective, efficient solutions for retrieving the nearest neighbors of a given query. Nevertheless, these efforts focus on developing and optimizing algorithms with different approaches, so there is a real need for a comprehensive survey of the approaches' relative performance, strengths, and pitfalls. Here we provide a thorough comparative analysis and experimental evaluation of 13 representative graph-based ANNS algorithms via a new taxonomy and fine-grained pipeline. We compared each algorithm in a uniform test environment on eight real-world datasets and 12 synthetic datasets with varying sizes and characteristics. Our study yields novel discoveries, offering several useful principles for improving algorithms, which we used to design an optimized method that outperforms the state-of-the-art algorithms. This effort also helped us pinpoint the algorithms' working portions, along with rule-of-thumb recommendations about promising research directions and suitable algorithms for practitioners in different fields.
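
All graph-based ANNS methods share a common skeleton: build a proximity graph offline, then greedily route a query from an entry point to ever-closer neighbors. The toy below shows only that shared skeleton on a brute-force k-NN graph; real systems in the survey (HNSW, NSG, and others) differ precisely in how they construct the graph and refine the search, so this is an illustration of the paradigm, not of any surveyed algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((200, 8))  # toy base dataset: 200 points in 8 dimensions

def knn_graph(X, k=8):
    """Offline stage: connect every point to its k exact nearest neighbors."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]  # neighbor indices per node

def greedy_search(X, graph, query, entry=0):
    """Online stage: hop to the closest neighbor until no neighbor improves."""
    cur = entry
    cur_d = np.linalg.norm(X[cur] - query)
    while True:
        nbrs = graph[cur]
        dists = np.linalg.norm(X[nbrs] - query, axis=1)
        best = int(np.argmin(dists))
        if dists[best] >= cur_d:
            return cur                   # local minimum reached
        cur, cur_d = int(nbrs[best]), dists[best]

graph = knn_graph(data)
q = rng.random(8)
approx = greedy_search(data, graph, q)
exact = int(np.argmin(np.linalg.norm(data - q, axis=1)))
print("greedy result:", approx, "exact nearest neighbor:", exact)
```

The greedy walk can stop at a local minimum that is not the true nearest neighbor, which is why the field's algorithms add beam search, long-range edges, or better graph topologies; the survey's taxonomy organizes exactly those design choices.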


2012 ◽  
Vol 11 (01) ◽  
pp. 1250007
Author(s):  
Ali Mirza Mahmood ◽  
Mrithyumjaya Rao Kuppa

Many traditional pruning methods assume that all datasets are equally probable and equally important, so they apply equal pruning to all of them. However, in real-world classification problems, all datasets are not equal, and applying an equal pruning rate tends to generate a decision tree with a large size and high misclassification rate. In this paper, we present a practical algorithm for the data-specific classification problem that arises when datasets have different properties. Another key motivation for the data-specific pruning in this paper is "trading accuracy and size". A new algorithm called Expert Knowledge Based Pruning (EKBP) is proposed to solve this dilemma. We propose integrating error rate, missing values, and expert judgment as factors for determining data-specific pruning for each dataset. We show by analysis and experiments that, using this pruning, we can balance both accuracy and generalisation of the generated tree. Moreover, the method can be very effective for high-dimensional datasets. We conduct an extensive experimental study on 40 openly available real-world datasets from the UCI repository. In all these experiments, the proposed approach shows a considerable reduction in tree size with equal or better accuracy compared to several benchmark decision tree methods proposed in the literature.
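
EKBP itself is not reproduced here; the sketch below only illustrates the "trading accuracy and size" dilemma the abstract targets, using scikit-learn's cost-complexity pruning as a stand-in pruning mechanism. Raising `ccp_alpha` prunes more aggressively, and the point of data-specific pruning is that the best strength differs per dataset rather than being fixed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Sweep the pruning strength: more pruning -> smaller tree, possibly
# at the cost of (or sometimes improving) test accuracy.
results = []
for alpha in [0.0, 0.005, 0.02]:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)
    results.append((tree.tree_.node_count, tree.score(X_te, y_te)))
    print(f"ccp_alpha={alpha}: {results[-1][0]} nodes, "
          f"test accuracy {results[-1][1]:.3f}")
```

A per-dataset rule like EKBP would pick the pruning strength from dataset properties (error rate, missing values, expert judgment) instead of sweeping a single global knob.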


Knowledge extraction in the healthcare field is a very challenging task because of problems such as noise and imbalanced datasets, which are obtained from clinical studies where uncertainty and variability are common. Lately, a wide range of machine learning algorithms have been considered and evaluated to check their validity for use in the medical field. Usually, the classification algorithms are compared against medical experts who are specialized in certain disease diagnoses, and a methodological evaluation of the classifiers is provided by applying performance metrics. These metrics include accuracy, sensitivity, and specificity, all derived from the confusion matrix of each algorithm. We utilized eight different well-known machine learning algorithms and evaluated their performance on six different medical datasets. Based on the experimental results, we conclude that the XGBoost and K-nearest neighbor classifiers were the best overall across the datasets used and can be applied to diagnosing various diseases.
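
The evaluation described above rests on the binary confusion matrix. A minimal helper showing how accuracy, sensitivity (recall on the positive class), and specificity (recall on the negative class) fall out of its four cells:

```python
def confusion_metrics(y_true, y_pred):
    """Derive the three metrics from the binary confusion matrix (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,  # true positive rate
        "specificity": tn / (tn + fp) if tn + fp else 0.0,  # true negative rate
    }

m = confusion_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
print(m)  # accuracy 4/6, sensitivity 2/3, specificity 2/3
```

Sensitivity and specificity matter in this setting precisely because medical datasets are imbalanced: accuracy alone can look high while the classifier misses most positive (diseased) cases.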


2021 ◽  
Vol 13 (8) ◽  
pp. 193
Author(s):  
Diego Lopez-Bernal ◽  
David Balderas ◽  
Pedro Ponce ◽  
Arturo Molina

One of the main focuses of Education 4.0 is to provide students with knowledge of disruptive technologies, such as Machine Learning (ML), as well as the skills to apply this knowledge to solve real-life problems. Therefore, both students and professors require teaching and learning tools that facilitate the introduction to such topics. Consequently, this study aims to contribute to the development of those tools by introducing the basic theory behind three machine learning classification algorithms: K-Nearest-Neighbor (KNN), Linear Discriminant Analysis (LDA), and the simple perceptron; and by discussing the advantages and disadvantages of each method. Moreover, we analyze how these methods behave under different conditions by implementing them on a test bench. Thus, in addition to the description of each algorithm, we discuss their application to three different binary classification problems using three different datasets and compare their performance in these specific case studies. The findings of this study can be used by teachers to provide students with the basic knowledge of the KNN, LDA, and perceptron algorithms, and, at the same time, can serve as a guide to learning how to apply them to real-life problems beyond the presented datasets.
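
The three algorithms the study introduces can be compared in a few lines with scikit-learn. The study's own test-bench datasets are not reproduced here, so one of sklearn's bundled binary datasets stands in; the feature scaling step is an illustrative (and for KNN and the perceptron, usually important) choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in binary classification problem.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for name, clf in [("KNN", KNeighborsClassifier(n_neighbors=5)),
                  ("LDA", LinearDiscriminantAnalysis()),
                  ("Perceptron", Perceptron(random_state=0))]:
    pipe = make_pipeline(StandardScaler(), clf).fit(X_tr, y_tr)
    scores[name] = pipe.score(X_te, y_te)
    print(f"{name}: test accuracy {scores[name]:.3f}")
```

Running the same loop on different datasets reproduces the study's point: which of the three wins depends on the data, since KNN is local and nonparametric while LDA and the perceptron both fit linear decision boundaries.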


Author(s):  
Anantvir Singh Romana

Accurate diagnostic detection of disease in a patient is critical, as it may alter the subsequent treatment and increase the chances of survival. Machine learning techniques have been instrumental in disease detection and are currently being used in various classification problems due to their accurate prediction performance. Different techniques may provide different accuracies, and it is therefore imperative to use the most suitable method, the one that provides the best results. This research provides a comparative analysis of Support Vector Machine, Naïve Bayes, J48 decision tree, and neural network classifiers on breast cancer and diabetes datasets.

