A Comparative Study of Nearest Neighbor Regression and Nadaraya Watson Regression

2021 ◽  
Vol 10 (2) ◽  
pp. 180-188
Author(s):  
Sarwar A. Hamad ◽  
Kawa S. Mohamed Ali

Two non-parametric statistical methods are studied in this work: nearest neighbor regression and the Nadaraya-Watson kernel smoothing technique. We prove that, under a precise condition, the nearest neighbor estimator and the Nadaraya-Watson smoother produce smoothed data with the same error level, which means they have the same performance. A further result of the paper is that the nearest neighbor estimator performs better locally, but graphically shows a weakness when a large data set is considered on a global scale.
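For concreteness, the two estimators compared in this abstract can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, bandwidth h, neighborhood size k, and the toy sine data are all assumptions made here.

```python
import numpy as np

def knn_regression(x_train, y_train, x0, k=5):
    """k-nearest neighbor estimate: average of the k closest responses."""
    idx = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[idx].mean()

def nadaraya_watson(x_train, y_train, x0, h=0.5):
    """Nadaraya-Watson estimate with a Gaussian kernel of bandwidth h."""
    w = np.exp(-0.5 * ((x_train - x0) / h) ** 2)
    return np.sum(w * y_train) / np.sum(w)

# Toy comparison on noisy sine data
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + rng.normal(0, 0.2, x.size)
grid = np.linspace(0, 2 * np.pi, 50)
knn_fit = np.array([knn_regression(x, y, g) for g in grid])
nw_fit = np.array([nadaraya_watson(x, y, g) for g in grid])
print("kNN MSE:", np.mean((knn_fit - np.sin(grid)) ** 2))
print("NW  MSE:", np.mean((nw_fit - np.sin(grid)) ** 2))
```

With comparable choices of k and h, the two fits typically reach a similar error level on such data, which is the kind of behaviour the paper formalizes.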

Author(s):  
V. Suresh Babu ◽  
P. Viswanath ◽  
Narasimha M. Murty

Non-parametric methods like the nearest neighbor classifier (NNC) and Parzen-window based density estimation (Duda, Hart & Stork, 2000) are more general than parametric methods because they make no assumptions about the form of the probability distribution. Further, they show good performance in practice with large data sets. These methods, either explicitly or implicitly, estimate the probability density at a given point in a feature space by counting the number of points that fall in a small region around that point. Popular classifiers which use this approach are the NNC and its variants like the k-nearest neighbor classifier (k-NNC) (Duda, Hart & Stork, 2000), while DBSCAN is a popular density based clustering method (Han & Kamber, 2001) that uses the same approach. These methods show good performance, especially with larger data sets: the asymptotic error rate of the NNC is less than twice the Bayes error (Cover & Hart, 1967), and DBSCAN can find arbitrarily shaped clusters while also detecting noisy outliers (Ester, Kriegel & Xu, 1996). The most prominent difficulty in applying non-parametric methods to large data sets is their computational burden. The space and classification time complexities of the NNC and k-NNC are O(n), where n is the training set size, and the time complexity of DBSCAN is O(n²), so these methods do not scale to large data sets. Some remedies to reduce this burden are as follows. (1) Reduce the training set size by editing techniques that eliminate training patterns which are redundant in some sense (Dasarathy, 1991); the condensed NNC (Hart, 1968) is of this type. (2) Use only a few selected prototypes from the data set; the Leaders-subleaders method and the l-DBSCAN method are of this type (Vijaya, Murthy & Subramanian, 2004; Viswanath & Rajwala, 2006). These two remedies can reduce the computational burden, but can also degrade the performance of the method. Using enriched prototypes can improve the performance, as done in (Asharaf & Murthy, 2003), where the prototypes are derived using adaptive rough fuzzy set theory, and in (Suresh Babu & Viswanath, 2007), where the prototypes are used along with their relative weights. Prototypes can be derived by employing a clustering method like the leaders method (Spath, 1980) or the k-means method (Jain, Dubes & Chen, 1987), which find a partition of the data set in which each block (cluster) is represented by a prototype such as a leader or centroid. But these prototypes cannot be used to estimate the probability density, since the density information present in the data set is lost while deriving the prototypes. The chapter proposes a modified leader clustering method, called the counted-leader method, which along with deriving the leaders preserves the crucial density information in the form of a count that can be used in estimating densities. The chapter presents a fast and efficient nearest prototype based classifier, the counted k-nearest leader classifier (ck-NLC), which is on par with the conventional k-NNC but considerably faster. The chapter also presents a density based clustering method called l-DBSCAN, which is shown to be a faster and scalable version of DBSCAN (Viswanath & Rajwala, 2006).
Formally, under some assumptions, it is shown that the number of leaders is upper-bounded by a constant that is independent of the data set size and of the distribution from which the data set is drawn.
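A minimal sketch of the counted-leader idea follows: leaders are found in a single threshold-based pass, each leader keeping a per-class count of the patterns it absorbs, and a test pattern is then classified by a count-weighted vote of its k nearest leaders. The distance threshold tau, the per-class counting and the weighting scheme are assumptions for illustration, not the chapter's exact algorithm.

```python
import numpy as np
from collections import defaultdict

def counted_leaders(X, y, tau):
    """One pass over the data: a pattern farther than tau from every existing
    leader becomes a new leader; otherwise the nearest leader's per-class
    count is incremented, preserving the density information."""
    leaders, counts = [], []
    for xi, yi in zip(X, y):
        if leaders:
            d = np.linalg.norm(np.array(leaders) - xi, axis=1)
            j = int(np.argmin(d))
            if d[j] <= tau:
                counts[j][yi] = counts[j].get(yi, 0) + 1
                continue
        leaders.append(xi)
        counts.append({yi: 1})
    return np.array(leaders), counts

def ck_nlc(x, leaders, counts, k=3):
    """Counted k-nearest leader classification: vote of the k nearest
    leaders, each vote weighted by the stored per-class counts."""
    d = np.linalg.norm(leaders - x, axis=1)
    votes = defaultdict(float)
    for j in np.argsort(d)[:k]:
        for cls, c in counts[j].items():
            votes[cls] += c
    return max(votes, key=votes.get)

# Toy two-class data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
L, C = counted_leaders(X, y, tau=0.8)
print(len(L), "leaders; predicted class:", ck_nlc(np.array([2.5, 2.5]), L, C))
```

Because classification only touches the leaders rather than all n training patterns, the cost per query is governed by the (bounded) number of leaders, which is the scalability argument made above.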


2013 ◽  
Vol 9 (S298) ◽  
pp. 92-97
Author(s):  
Andreas Ritter

The advent of large spectroscopic surveys of the Galaxy offers the possibility to compare Galactic models to actual measurements for the first time. I have developed a tool for the comprehensive comparison of any large data set to the predictions made by models of the Galaxy using sophisticated statistical methods, and for visualising the results for any given direction. This enables us to point out systematic differences between the model and the measurements, as well as to identify new (sub-)structures in the Galaxy. These results can then be used to improve the models, which in turn will allow us to find even more substructures such as stellar streams, moving groups, or clusters. In this paper I show the potential of this tool by applying it to the RAdial Velocity Experiment (RAVE; Steinmetz 2003) and the Besançon model of the Galaxy (Robin et al. 2003).


This paper analyzes the feature selection process for agricultural data using different statistical methods, viz. correlation, gain ratio, information gain, OneR, Chi-square, and a MapReduce-based Fisher's exact test. In the recent past, Fisher's exact test was commonly used for feature selection; however, it is suitable only for small data sets. To handle large data sets, Chi-square, one of the most popular statistical methods, is used, but it also selects irrelevant attributes, and thus the resulting accuracy is not as expected. As a novelty, Fisher's exact test is combined with the MapReduce model to handle large data sets. In addition, the simulation results show that the proposed Fisher's exact test finds the significant attributes with higher accuracy and reduced time complexity when compared to other existing methods.
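As a rough illustration of combining a per-feature significance test with a MapReduce-style split, the sketch below partitions the rows, builds a 2x2 contingency table per partition (map) and sums the partial tables before applying SciPy's Fisher's exact test (reduce). The binary feature/class encoding, the partitioning into three "mappers" and the single-feature focus are assumptions for illustration, not the paper's pipeline.

```python
import numpy as np
from scipy.stats import fisher_exact

def map_counts(rows, feature_col, class_col):
    """Map step: 2x2 contingency counts for one partition of the data
    (binary feature vs. binary class, assumed here for illustration)."""
    t = np.zeros((2, 2), dtype=int)
    for r in rows:
        t[int(r[feature_col]), int(r[class_col])] += 1
    return t

def reduce_and_test(partition_tables):
    """Reduce step: sum the partial tables and run Fisher's exact test."""
    total = np.sum(partition_tables, axis=0)
    odds, p = fisher_exact(total)
    return p

# Toy data: one binary feature (column 0) and a binary class label (column 1)
rng = np.random.default_rng(2)
data = np.column_stack([rng.integers(0, 2, 300), rng.integers(0, 2, 300)])
parts = np.array_split(data, 3)                 # three "mappers"
tables = [map_counts(p, 0, 1) for p in parts]
print("p-value for the feature:", reduce_and_test(tables))
```

Because contingency counts are additive, only small 2x2 tables travel from mappers to the reducer, which is what makes the test workable on a large data set.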


Genetics ◽  
2003 ◽  
Vol 164 (2) ◽  
pp. 731-740 ◽  
Author(s):  
Elizabeth A Greene ◽  
Christine A Codomo ◽  
Nicholas E Taylor ◽  
Jorja G Henikoff ◽  
Bradley J Till ◽  
...  

Chemical mutagenesis has been the workhorse of traditional genetics, but it has not been possible to determine underlying rates or distributions of mutations from phenotypic screens. However, reverse-genetic screens can be used to provide an unbiased ascertainment of mutation statistics. Here we report a comprehensive analysis of ∼1900 ethyl methanesulfonate (EMS)-induced mutations in 192 Arabidopsis thaliana target genes from a large-scale TILLING reverse-genetic project, about two orders of magnitude larger than previous such efforts. From this large data set, we are able to draw strong inferences about the occurrence and randomness of chemically induced mutations. We provide evidence that we have detected the large majority of mutations in the regions screened and confirm the robustness of the high-throughput TILLING method; therefore, any deviations from randomness can be attributed to selectional or mutational biases. Overall, we detect twice as many heterozygotes as homozygotes, as expected; however, for mutations that are predicted to truncate an encoded protein, we detect a ratio of 3.6:1, indicating selection against homozygous deleterious mutations. As expected for alkylation of guanine by EMS, >99% of mutations are G/C-to-A/T transitions. A nearest-neighbor bias around the mutated base pair suggests that mismatch repair counteracts alkylation damage.


2018 ◽  
Vol 11 (2) ◽  
pp. 11 ◽  
Author(s):  
Sadia Yeasmin ◽  
Muhammad Abrar Hussain ◽  
Noor Yazdani Sikder ◽  
Rashedur M Rahman

Over the decades there has been high demand for a tool to identify the nutritional needs of the people of Bangladesh, since the country has an alarming rate of undernutrition compared to other countries of the world. This analysis focuses on the dissimilarity of diseases caused by malnutrition in different districts of Bangladesh. Among the 64 districts, not a single one was found where people have developed proper nutritional food habits. Low income and limited knowledge are the triggering factors, and the situation is worse in rural areas. In this research, a distributed framework for processing large data sets is employed using big data models. Fuzzy logic is able to model the nutrition problem, helping people to calculate the suitability between food calories and a user's profile. A MapReduce-based k-nearest neighbor (mrK-NN) classifier is applied in this research to classify the data. We have designed a balanced model applying fuzzy logic and big data analysis on Hadoop, concerning food habit, food nutrition and disease, especially for the rural people.
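A schematic of a MapReduce-style k-NN classification of the kind referred to above: each mapper returns the k best local candidates from its partition, and the reducer merges them into a global top-k vote. The partitioning, Euclidean distance and majority vote are generic assumptions, not the authors' Hadoop implementation.

```python
import heapq
from collections import Counter

import numpy as np

def mapper(part_X, part_y, query, k):
    """Map: the k nearest (distance, label) candidates within one data split."""
    d = np.linalg.norm(part_X - query, axis=1)
    idx = np.argsort(d)[:k]
    return list(zip(d[idx], part_y[idx]))

def reducer(candidate_lists, k):
    """Reduce: merge local candidates, keep the global k nearest, majority vote."""
    best = heapq.nsmallest(k, (c for lst in candidate_lists for c in lst))
    return Counter(label for _, label in best).most_common(1)[0][0]

# Toy data split across four "mappers"
rng = np.random.default_rng(3)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
splits = np.array_split(np.arange(600), 4)
cands = [mapper(X[s], y[s], np.ones(4), k=5) for s in splits]
print("predicted class:", reducer(cands, k=5))
```

The key property is that each mapper only ships k candidates to the reducer, so the shuffle volume stays small even when the partitions are large.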


2013 ◽  
Vol 37 (4) ◽  
pp. 70-86 ◽  
Author(s):  
Patrick J. Donnelly ◽  
John W. Sheppard

In this article, we explore the use of Bayesian networks for identifying the timbre of musical instruments. Peak spectral amplitude in ten frequency windows is extracted for each of 20 time windows to be used as features. Over a large data set of 24,000 audio examples covering the full musical range of 24 different common orchestral instruments, four different Bayesian network structures, including naive Bayes, are examined and compared with two support vector machines and a k-nearest neighbor classifier. Classification accuracy is examined by instrument, instrument family, and data set size. Bayesian networks with conditional dependencies in the time and frequency dimensions achieved 98 percent accuracy in the instrument classification task and 97 percent accuracy in the instrument family identification task. These results demonstrate a significant improvement over the previous approaches in the literature on this data set. Additionally, we tested our Bayesian approach on the widely used Iowa musical instrument data set, with similar results.
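The 10-by-20 peak-amplitude feature layout described above can be approximated as in the sketch below. The frame splitting, FFT-based magnitude spectrum and equal-width frequency bands are assumptions for illustration; the original work's exact windowing may differ.

```python
import numpy as np

def peak_spectral_features(signal, n_time=20, n_freq=10):
    """Split the signal into n_time frames; within each frame take the
    magnitude spectrum and record the peak amplitude in n_freq equal
    frequency bands, giving an (n_time, n_freq) feature matrix."""
    frames = np.array_split(signal, n_time)
    feats = np.zeros((n_time, n_freq))
    for t, frame in enumerate(frames):
        spec = np.abs(np.fft.rfft(frame))
        bands = np.array_split(spec, n_freq)
        feats[t] = [b.max() for b in bands]
    return feats

# Toy example: a 1-second 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)
print(peak_spectral_features(x).shape)   # (20, 10)
```

Each audio example then becomes a fixed-size 20x10 matrix, which is the kind of feature grid over which the time and frequency conditional dependencies of the Bayesian networks are defined.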


2018 ◽  
Vol 15 (3) ◽  
pp. 705-731
Author(s):  
Christopher Haubeck ◽  
Alexander Pokahr ◽  
Kim Reichert ◽  
Till Hohenberger ◽  
Winfried Lamersdorf

Anomaly detection is the process of identifying non-conforming behaviour. Many approaches, from machine learning to statistical methods, exist to detect behaviour that deviates from the norm. These non-conformances to specifications can stem from failures in the system or from undocumented changes of the system during its evolution. However, no generic solution exists for classifying and identifying these non-conformances. In this paper, we present the CRI-Model (Cause, Reaction, Impact), a taxonomy based on a study of anomaly types in the literature, an analysis of system outages at major cloud companies, and evolution scenarios which describe and implement changes in Cyber-Physical Production Systems. The goal of the taxonomy is to be usable for different objectives, such as discovering gaps in the detection process, determining the components most often affected by a particular anomaly type, or describing system evolution. While the dimensions of the taxonomy are fixed, the categories can be adapted to different domains. We show and validate the applicability of the taxonomy to distributed cloud systems, using a large data set of anomaly reports, and to cyber-physical production systems, by categorizing common changes of an evolution benchmarking plant.
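One way to make the fixed-dimensions/adaptable-categories distinction concrete is a small data structure such as the one below; the category names are illustrative placeholders, not the CRI-Model's actual catalogue.

```python
from dataclasses import dataclass, field

# The three dimensions are fixed by the CRI-Model; the category lists are
# meant to be adapted per domain (these entries are assumptions, for
# illustration only).
DIMENSIONS = ("cause", "reaction", "impact")

@dataclass
class CRITaxonomy:
    categories: dict = field(default_factory=lambda: {
        "cause": ["hardware failure", "software change", "misconfiguration"],
        "reaction": ["restart", "rollback", "manual intervention"],
        "impact": ["outage", "degraded performance", "data loss"],
    })

    def classify(self, report: dict) -> dict:
        """Keep only the entries of an anomaly report that fall into the
        fixed dimensions and the currently configured categories."""
        return {d: report[d] for d in DIMENSIONS
                if d in report and report[d] in self.categories[d]}

tax = CRITaxonomy()
print(tax.classify({"cause": "software change", "impact": "outage", "note": "x"}))
```

Adapting the taxonomy to a new domain then amounts to replacing the category lists while leaving the three dimensions untouched.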


2020 ◽  
Vol 39 (5) ◽  
pp. 6419-6430
Author(s):  
Dusan Marcek

To forecast time series data, two methodological frameworks, statistical and computational intelligence modelling, are considered. The statistical approach is based on the theory of invertible ARIMA (Auto-Regressive Integrated Moving Average) models with Maximum Likelihood (ML) estimation. As a competitive tool to the statistical forecasting models, we use the popular classic neural network (NN) of perceptron type. To train the NN, the Back-Propagation (BP) algorithm and heuristics such as the genetic and micro-genetic algorithms (GA and MGA) are implemented on a large data set. A comparative analysis of the selected learning methods is performed and evaluated. From the experiments performed we find that the optimal population size is likely 20, giving the lowest training time of all NNs trained by the evolutionary algorithms, while the prediction accuracy is somewhat lower but still acceptable to managers.
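To make the evolutionary training idea concrete, here is a minimal genetic-algorithm sketch that evolves the weights of a single-layer perceptron for one-step-ahead forecasting. The population size of 20 echoes the finding above, but the selection/mutation scheme, fitness function and toy series are assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(4)

def forecast(weights, window):
    """Single-layer perceptron: linear combination of the last p values plus a bias."""
    return window @ weights[:-1] + weights[-1]

def fitness(weights, series, p):
    """Negative mean squared one-step-ahead forecast error (higher is better)."""
    errs = [series[t] - forecast(weights, series[t - p:t])
            for t in range(p, len(series))]
    return -np.mean(np.square(errs))

def genetic_train(series, p=4, pop_size=20, generations=50, sigma=0.1):
    """Evolve weight vectors: keep the best half, add mutated copies."""
    pop = rng.normal(size=(pop_size, p + 1))
    for _ in range(generations):
        scores = np.array([fitness(w, series, p) for w in pop])
        parents = pop[np.argsort(scores)[-pop_size // 2:]]          # selection
        children = parents + rng.normal(0, sigma, parents.shape)    # mutation
        pop = np.vstack([parents, children])
    return pop[np.argmax([fitness(w, series, p) for w in pop])]

# Toy drifting series
series = np.cumsum(rng.normal(0, 0.1, 400)) + np.sin(np.arange(400) / 10)
w = genetic_train(series)
print("best fitness:", fitness(w, series, 4))
```

A micro-genetic variant would simply use a much smaller population that is periodically restarted around the current best individual, trading exploration for training time.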

