Partition Selection for Large-Scale Data Management Using KNN Join Processing

As datasets grow ever larger, the k nearest neighbors (KNN) algorithm becomes a particularly expensive operation for both classification and regression problems. To predict the values of new data points, it calculates the feature similarity between each object in the test dataset and each object in the training dataset. Because of this computational cost, a single computer cannot handle large-scale datasets. In this paper, we propose an adaptive vKNN algorithm, which is based on the Voronoi diagram under the MapReduce parallel framework and makes full use of the advantages of parallel computing in processing large-scale data. In the partition selection process, we design a new predictive strategy that finds the optimal relevant partition for each sample point. We can then effectively identify irrelevant data, reduce KNN join computation, and improve efficiency. Finally, we conduct extensive experiments on a cluster using 54-dimensional datasets. The experimental results show that the proposed method is effective and scalable while preserving accuracy.
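
The abstract does not spell out the vKNN algorithm, so the following is only a minimal single-machine sketch of the general idea of Voronoi-style partition pruning for KNN, not the authors' method. It assumes Euclidean distance and pivots chosen by the caller; the names `build_partitions` and `knn_query` are hypothetical. Each training point is assigned to its nearest pivot, and a query skips any cell whose triangle-inequality lower bound already exceeds the current k-th best distance.

```python
import heapq
import math


def build_partitions(train, pivots):
    """Assign each training point to its nearest pivot (its Voronoi cell)."""
    cells = [[] for _ in pivots]
    radii = [0.0] * len(pivots)  # largest pivot-to-member distance per cell
    for p in train:
        i = min(range(len(pivots)), key=lambda j: math.dist(p, pivots[j]))
        cells[i].append(p)
        radii[i] = max(radii[i], math.dist(p, pivots[i]))
    return cells, radii


def knn_query(q, k, pivots, cells, radii):
    """Exact k nearest neighbours of q, skipping cells that cannot help."""
    # Visit cells from the nearest pivot outwards so the bound tightens early.
    order = sorted(range(len(pivots)), key=lambda j: math.dist(q, pivots[j]))
    best = []  # max-heap emulated with negated distances: (-dist, point)
    for i in order:
        # Triangle inequality: every point in cell i is at least this far away.
        lower_bound = math.dist(q, pivots[i]) - radii[i]
        if len(best) == k and lower_bound >= -best[0][0]:
            continue  # irrelevant partition, no distance computations needed
        for p in cells[i]:
            d = math.dist(q, p)
            if len(best) < k:
                heapq.heappush(best, (-d, p))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, p))
    return sorted((-nd, p) for nd, p in best)
```

In a MapReduce setting, one would expect the cell assignment to run in the map phase and the per-cell search in the reduce phase, but that mapping is an assumption here rather than something the abstract states.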

Comparative Study on Kinds of Feature Subset Selection for Inconsistent Large-scale Data

International Journal of Advancements in Computing Technology ◽

10.4156/ijact.vol5.issue9.57 ◽

2013 ◽

Vol 5 (9) ◽

pp. 482-489

Author(s):

Dongsong Zheng ◽

Changsheng Zhang

Keyword(s):

Comparative Study ◽

Large Scale ◽

Subset Selection ◽

Feature Subset Selection ◽

Feature Subset ◽

Large Scale Data ◽

Selection For ◽

Scale Data

PB-Spline Hybrid Surface Fitting Technique

Volume 1: 22nd Biennial Conference on Mechanical Vibration and Noise, Parts A and B ◽

10.1115/detc2009-87195 ◽

2009 ◽

Cited By ~ 1

Author(s):

Antonio Carminelli ◽

Giuseppe Catania

Keyword(s):

Large Scale ◽

Free Form ◽

Rectangular Array ◽

B Spline ◽

Large Scale Data ◽

Spline Surface ◽

Spline Basis ◽

Data Points ◽

Free Form Surfaces ◽

Scale Data

This work considers the fitting of data points organized in a rectangular array to parametric spline surfaces. Point Based (PB) splines, a generalization of tensor product splines, are adopted. The basic idea of this paper is to fit large scale data with a tensorial B-spline surface and to refine the surface until a specified tolerance is met. Since some isolated domains exceeding tolerance may result, detail features on these domains are modeled by a tensorial B-spline basis with a finer resolution, superimposed by employing the PB-spline approach. The present method leads to an efficient model of free form surfaces, since both large scale data and local geometrical details can be efficiently fitted. Two application examples are presented. The first one concerns the fitting of a set of data points sampled from an interior car trim with a central geometrical detail. The second one refers to the modification of the tensorial B-spline surface representation of a mould in order to create a local adjustment. Considerations regarding strengths and limits of the approach then follow.
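
For orientation, the standard tensor-product B-spline surface that the abstract builds on can be written as below. The symbols are assumptions introduced here, not taken from the paper: $N_{i,p}$ and $N_{j,q}$ are B-spline basis functions of degrees $p$ and $q$, $P_{ij}$ are control points, $Q_k$ are the data points with parameter values $(u_k, v_k)$, and $\varepsilon$ is the tolerance. The stopping criterion is one plausible reading of "until a specified tolerance is met"; the paper may measure the error differently.

```latex
S(u,v) = \sum_{i=0}^{n} \sum_{j=0}^{m} N_{i,p}(u)\, N_{j,q}(v)\, P_{ij},
\qquad
\max_{k} \left\| S(u_k, v_k) - Q_k \right\| \le \varepsilon .
```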

High-Performance Machine Learning for Large-Scale Data Classification considering Class Imbalance

Scientific Programming ◽

10.1155/2020/1953461 ◽

2020 ◽

Vol 2020 ◽

pp. 1-16

Author(s):

Yang Liu ◽

Xiang Li ◽

Xianbang Chen ◽

Xi Wang ◽

Huaqiang Li

Keyword(s):

Neural Network ◽

Machine Learning ◽

High Performance ◽

Large Scale ◽

Class Imbalance ◽

Data Classification ◽

Training Dataset ◽

Large Scale Data ◽

Input Layer ◽

Scale Data

Currently, data classification is one of the most important ways to analyze data. However, with the development of data collection, transmission, and storage technologies, the scale of data has increased sharply. Additionally, because datasets contain multiple classes with imbalanced distributions, the class imbalance issue has become increasingly prominent. Traditional machine learning algorithms lack the ability to handle these issues, so classification efficiency and precision may be significantly affected. This paper therefore presents an improved artificial neural network that enables high-performance classification of imbalanced, large-volume data. First, the Borderline-SMOTE (synthetic minority oversampling technique) algorithm is employed to balance the training dataset, with the aim of improving the training of the back propagation neural network (BPNN). Then, zero-mean normalization, batch normalization, and the rectified linear unit (ReLU) are employed to optimize the input layer and hidden layers of the BPNN. Finally, an ensemble learning-based parallelization of the improved BPNN is implemented using the Hadoop framework. The experimental results support several positive conclusions. Thanks to Borderline-SMOTE, the imbalanced training dataset can be balanced, which improves training performance and classification accuracy. The improvements to the input layer and hidden layers also enhance training performance in terms of convergence. The parallelization and ensemble learning techniques enable the BPNN to perform high-performance large-scale data classification. The experimental results show the effectiveness of the presented classification algorithm.
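
The balancing step can be illustrated with a small, unoptimized sketch of the Borderline-SMOTE idea. This is not the paper's implementation: the function name `borderline_smote` and its parameters (`m`, `k`, `n_new`, `seed`) are hypothetical, Euclidean distance is assumed, and points are plain tuples. A minority sample counts as "borderline" when at least half, but not all, of its `m` nearest neighbours belong to the majority class; synthetic samples are then interpolated between such a sample and one of its `k` nearest minority neighbours.

```python
import math
import random


def borderline_smote(minority, majority, m=5, k=3, n_new=10, seed=0):
    """Return n_new synthetic minority points generated near the class border."""
    rng = random.Random(seed)
    labelled = [(p, False) for p in minority] + [(p, True) for p in majority]
    danger = []  # indices of borderline minority samples
    for i, p in enumerate(minority):
        others = labelled[:i] + labelled[i + 1:]  # exclude the sample itself
        others.sort(key=lambda item: math.dist(p, item[0]))
        n_major = sum(is_major for _, is_major in others[:m])
        # Borderline: at least half, but not all, neighbours are majority.
        if m / 2 <= n_major < m:
            danger.append(i)
    synthetic = []
    if not danger or len(minority) < 2:
        return synthetic  # nothing to interpolate from
    for _ in range(n_new):
        i = rng.choice(danger)
        p = minority[i]
        mates = sorted(minority[:i] + minority[i + 1:],
                       key=lambda q: math.dist(p, q))[:k]
        q = rng.choice(mates)
        gap = rng.random()  # position along the segment from p towards q
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(p, q)))
    return synthetic
```

Because every synthetic point is a convex combination of two minority samples, the new points stay inside the region the minority class already occupies, which is the property that makes the oversampling comparatively safe.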

Greedy column subset selection for large-scale data sets

Knowledge and Information Systems ◽

10.1007/s10115-014-0801-8 ◽

2014 ◽

Vol 45 (1) ◽

pp. 1-34 ◽

Cited By ~ 19

Author(s):

Ahmed K. Farahat ◽

Ahmed Elgohary ◽

Ali Ghodsi ◽

Mohamed S. Kamel

Keyword(s):

Large Scale ◽

Subset Selection ◽

Data Sets ◽

Large Scale Data ◽

Column Subset Selection ◽

Selection For ◽

Scale Data ◽

Large Scale Data Sets

Effective and efficient feature selection for large-scale data using Bayes’ theorem

International Journal of Automation and Computing ◽

10.1007/s11633-009-0062-2 ◽

2009 ◽

Vol 6 (1) ◽

pp. 62-71 ◽

Cited By ~ 13

Author(s):

Subramanian Appavu Alias Balamurugan ◽

Ramasamy Rajaram

Keyword(s):

Feature Selection ◽

Large Scale ◽

Bayes Theorem ◽

Large Scale Data ◽

Selection For ◽

Scale Data

Feature selection for large-scale data sets in GrC

2012 IEEE International Conference on Granular Computing ◽

10.1109/grc.2012.6468708 ◽

2012 ◽

Author(s):

Jiye Liang

Keyword(s):

Feature Selection ◽

Large Scale ◽

Data Sets ◽

Large Scale Data ◽

Selection For ◽

Scale Data ◽

Large Scale Data Sets

Feature Selection for Large Scale Data Using Class Association Rule Mining

Journal of Convergence Information Technology ◽

10.4156/jcit.vol6.issue11.42 ◽

2011 ◽

Vol 6 (11) ◽

pp. 371-377

Author(s):

J. Alamelu Mangai ◽

S. Sameen Fathima

Keyword(s):

Feature Selection ◽

Association Rule ◽

Association Rule Mining ◽

Large Scale ◽

Rule Mining ◽

Large Scale Data ◽

Selection For ◽

Class Association Rule ◽

Scale Data

Comparative Study of Classification Techniques For Large Scale Data - Case Study

Kurdistan Journal of Applied Research ◽

10.24017/science.2017.3.2 ◽

2017 ◽

Vol 2 (3) ◽

pp. 56-61

Author(s):

Nigar M.Shafiq Surameery ◽

Dana Lattef Hussein

Keyword(s):

Structure Prediction ◽

Large Scale ◽

Attribute Selection ◽

Massive Datasets ◽

Classification Techniques ◽

Large Scale Data ◽

Large Scale Dataset ◽

Test Sets ◽

Scale Data

The massive datasets generated in many applications provide various opportunities and challenges. In particular, scalable mining of such large-scale datasets is a challenging issue that has attracted recent research. The present study focuses on analysing classification techniques using the WEKA machine learning workbench. A large-scale dataset from the protein structure prediction field was used. It had already been partitioned into training and test sets using the ten-fold cross-validation methodology. In this experiment, nine different methods were tested. The results showed that it is not practical to test more than one classifier from the tree family in the same experiment. On the other hand, using the NaiveBayes classifier with the default properties of the attribute selection filter is very time-consuming. Finally, varying the parameters of the attribute selection should be prioritized for more accurate results.
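
The study itself relies on WEKA and a pre-partitioned dataset, so the following is only a generic sketch of how a ten-fold cross-validation split can be produced. The helper `k_fold_indices` and its parameters are hypothetical and not part of WEKA. Each sample index lands in exactly one test fold, and the remaining nine folds form the matching training set.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

A classifier would then be trained and evaluated once per pair, with the ten scores averaged.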

A Fast Multiscale Clustering Approach Based on DBSCAN

Wireless Communications and Mobile Computing ◽

10.1155/2021/4071177 ◽

2021 ◽

Vol 2021 ◽

pp. 1-11

Author(s):

Runzi Chen ◽

Shuliang Zhao ◽

Meishe Liang

Keyword(s):

Large Scale ◽

Computational Cost ◽

Scaling Up ◽

Practical Significance ◽

Small Scale ◽

Large Scale Data ◽

Traditional Algorithm ◽

Multiscale Clustering ◽

Clustering Approach ◽

Scale Data

Multiscale analysis lets people observe objects or problems from different perspectives, so clustering on multiscale data has practical significance. At present, there is little research on clustering large-scale data when clustering results for small-scale datasets have already been obtained. Clustering large-scale datasets with traditional methods has two disadvantages: (1) the clustering results of small-scale datasets are not utilized, and (2) traditional methods incur more running overhead. To address these shortcomings, this paper proposes a multiscale clustering framework based on DBSCAN. The framework uses DBSCAN to cluster small-scale datasets, then introduces the Scaling-Up Cluster Centers (SUCC) algorithm, which generates cluster centers of large-scale datasets by merging the clustering results of small-scale datasets rather than mining the raw large-scale datasets. We show experimentally that, compared to the traditional algorithm DBSCAN and the leading algorithms DBSCAN++ and HDBSCAN, SUCC not only provides competitive performance but also reduces computational cost. In addition, under the guidance of experts, SUCC is more competitive in accuracy.

Large-Scale Data Learning Method for Anomaly Detection using Machine Learning for Monitoring Vibration in Vehicle Equipment

IEEJ Transactions on Industry Applications ◽

10.1541/ieejias.140.480 ◽

2020 ◽

Vol 140 (6) ◽

pp. 480-487

Author(s):

Minoru Kondo

Keyword(s):

Machine Learning ◽

Anomaly Detection ◽

Large Scale ◽

Learning Method ◽

Large Scale Data ◽

Scale Data
