Partition Selection for Large-Scale Data Management Using KNN Join Processing

2020, Vol. 2020, pp. 1-14
Author(s): Yue Hu, Ge Peng, Zehua Wang, Yanrong Cui, Hang Qin

With the avalanche of data generated in large datasets, the k nearest neighbors (KNN) algorithm is a particularly expensive operation for both classification and regression predictive problems. To predict the values of new data points, it must calculate the feature similarity between each object in the test dataset and each object in the training dataset. Owing to this computational cost, a single computer cannot cope with large-scale datasets. In this paper, we propose an adaptive vKNN algorithm, which builds on the Voronoi diagram under the MapReduce parallel framework and makes full use of the advantages of parallel computing in processing large-scale data. In the partition-selection step, we design a new predictive strategy that finds the optimal relevant partition for each sample point. We can thereby prune irrelevant data, reduce the KNN join computation, and improve operational efficiency. Finally, we conduct extensive experiments on a cluster using large 54-dimensional datasets. The experimental results show that our proposed method is effective and scalable while preserving accuracy.
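The partition-selection idea lends itself to a compact illustration. Below is a minimal single-machine sketch in Python, assuming Euclidean distance, randomly chosen pivots, and an illustrative pruning bound; the paper itself runs the equivalent steps as MapReduce jobs, and vKNN's exact predictive strategy may differ.

```python
# Single-machine sketch of Voronoi-based partition selection for a KNN join.
# Pivot choice, the pruning bound, and k are illustrative assumptions; the
# paper's MapReduce implementation and predictive strategy are not reproduced.
import numpy as np

def build_partitions(train, pivots):
    """Assign each training point to its nearest pivot (its Voronoi cell)."""
    d = np.linalg.norm(train[:, None, :] - pivots[None, :, :], axis=2)
    cell = d.argmin(axis=1)                      # Voronoi cell index per point
    radius = np.array([d[cell == i, i].max() if (cell == i).any() else 0.0
                       for i in range(len(pivots))])
    return cell, radius

def knn_join(query, train, pivots, cell, radius, k=5):
    """For each query point, search only the cells that could hold a k-NN."""
    results = []
    for q in query:
        dp = np.linalg.norm(pivots - q, axis=1)  # distance to each pivot
        home = dp.argmin()
        # Pivots are drawn from train, so the home cell is never empty; its
        # k-th nearest member gives an upper bound on the true k-NN distance.
        home_pts = train[cell == home]
        bound = np.sort(np.linalg.norm(home_pts - q, axis=1))[min(k, len(home_pts)) - 1]
        # Keep only cells whose closest possible point can beat the bound.
        keep = dp - radius <= bound              # prune irrelevant partitions
        cand = train[np.isin(cell, np.where(keep)[0])]
        dists = np.linalg.norm(cand - q, axis=1)
        results.append(cand[np.argsort(dists)[:k]])
    return results

rng = np.random.default_rng(0)
train = rng.normal(size=(10_000, 54))            # 54-dimensional, as in the paper
pivots = train[rng.choice(len(train), 32, replace=False)]
cell, radius = build_partitions(train, pivots)
neighbours = knn_join(rng.normal(size=(5, 54)), train, pivots, cell, radius)
```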

Author(s): Antonio Carminelli, Giuseppe Catania

This work considers fitting data points organized in a rectangular array with parametric spline surfaces. Point-Based (PB) splines, a generalization of tensor-product splines, are adopted. The basic idea is to fit the large-scale data with a tensorial B-spline surface and to refine the surface until a specified tolerance is met. Since some isolated domains may still exceed the tolerance, detail features on these domains are modeled by a tensorial B-spline basis with finer resolution, superimposed via the PB-spline approach. The method yields an efficient model of free-form surfaces, since both large-scale data and local geometrical details can be fitted efficiently. Two application examples are presented. The first concerns fitting a set of data points sampled from an interior car trim with a central geometrical detail. The second refers to modifying the tensorial B-spline surface representation of a mould to create a local adjustment. Considerations regarding the strengths and limits of the approach then follow.
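The two-stage idea (a coarse tensor-product fit followed by detection of out-of-tolerance domains) can be sketched briefly. The illustration below covers only the first stage, using SciPy's RectBivariateSpline on synthetic gridded data with an assumed tolerance; the PB-spline superposition of a finer local basis, the paper's core step, is not reproduced.

```python
# First stage only: fit gridded data with a tensor-product B-spline and flag
# the sub-domains whose residual exceeds a tolerance. The surface, smoothing
# factor, and tolerance are illustrative assumptions.
import numpy as np
from scipy.interpolate import RectBivariateSpline

x = np.linspace(0.0, 1.0, 60)
y = np.linspace(0.0, 1.0, 80)
X, Y = np.meshgrid(x, y, indexing="ij")
# Smooth base surface plus a sharp local "detail" feature in the middle.
Z = np.sin(2 * np.pi * X) * np.cos(np.pi * Y)
Z += 0.5 * np.exp(-((X - 0.5) ** 2 + (Y - 0.5) ** 2) / 0.001)

# Coarse tensor-product B-spline fit (cubic in both directions, smoothed).
surf = RectBivariateSpline(x, y, Z, kx=3, ky=3, s=1.0)
residual = np.abs(surf(x, y) - Z)

tol = 0.05
mask = residual > tol                     # isolated domains exceeding tolerance
print(f"{mask.sum()} of {mask.size} samples exceed tol;")
print("a finer local basis would be superimposed over these domains.")
```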


2020, Vol. 2020, pp. 1-16
Author(s): Yang Liu, Xiang Li, Xianbang Chen, Xi Wang, Huaqiang Li

Currently, data classification is one of the most important ways to analyze data. However, with the development of data collection, transmission, and storage technologies, the scale of data has increased sharply. Moreover, because datasets contain multiple classes with imbalanced distributions, the class-imbalance issue has become increasingly prominent. Traditional machine learning algorithms cannot handle these issues well, so classification efficiency and precision may be significantly impacted. Therefore, this paper presents an improved artificial neural network enabling high-performance classification of imbalanced, large-volume data. First, the Borderline-SMOTE (synthetic minority oversampling technique) algorithm is employed to balance the training dataset, which aims to improve the training of the back-propagation neural network (BPNN); then zero-mean normalization, batch normalization, and the rectified linear unit (ReLU) are employed to optimize the input and hidden layers of the BPNN. Finally, ensemble learning-based parallelization of the improved BPNN is implemented using the Hadoop framework. The experimental results support positive conclusions. Benefitting from Borderline-SMOTE, the imbalanced training dataset is balanced, which improves training performance and classification accuracy. The improvements to the input and hidden layers also enhance training convergence. The parallelization and ensemble learning techniques enable the BPNN to perform high-performance large-scale data classification, confirming the effectiveness of the presented algorithm.
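The preprocessing-plus-network pipeline can be outlined in a few lines. The sketch below uses imbalanced-learn's BorderlineSMOTE and a scikit-learn MLPClassifier with ReLU hidden layers as a stand-in for the paper's BPNN; the batch-normalization layers and the Hadoop-based ensemble parallelization are omitted, and the toy dataset is an assumption.

```python
# Minimal sketch of the balance-then-train pipeline, assuming scikit-learn
# and imbalanced-learn. MLPClassifier (ReLU) stands in for the paper's BPNN;
# batch normalization and the Hadoop ensemble stage are omitted.
from imblearn.over_sampling import BorderlineSMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Imbalanced multi-class toy data (90/5/5 split across three classes).
X, y = make_classification(n_samples=5000, n_classes=3, n_informative=8,
                           weights=[0.9, 0.05, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 1) Balance only the training set with Borderline-SMOTE.
X_bal, y_bal = BorderlineSMOTE(random_state=0).fit_resample(X_tr, y_tr)

# 2) Zero-mean, unit-variance inputs (fitted on the balanced training data).
scaler = StandardScaler().fit(X_bal)

# 3) Back-propagation network with ReLU hidden layers.
clf = MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                    max_iter=300, random_state=0)
clf.fit(scaler.transform(X_bal), y_bal)
print("test accuracy:", clf.score(scaler.transform(X_te), y_te))
```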


2014, Vol. 45 (1), pp. 1-34
Author(s): Ahmed K. Farahat, Ahmed Elgohary, Ali Ghodsi, Mohamed S. Kamel

2017, Vol. 2 (3), pp. 56-61
Author(s): Nigar M. Shafiq Surameery, Dana Lattef Hussein

The massive datasets generated in many applications present both opportunities and challenges. In particular, scalable mining of such large-scale datasets is a challenging issue that has attracted recent research. The present study focuses on analysing classification techniques using the WEKA machine learning workbench on a large-scale dataset from the protein structure prediction field, already partitioned into training and test sets using the ten-fold cross-validation methodology. In this experiment, nine different methods were tested. The results showed that it is not practical to test more than one classifier from the tree family in the same experiment, and that the NaiveBayes classifier with the default properties of the attribute-selection filter is very time-consuming. Finally, varying the parameters of the attribute selection should be prioritized for more accurate results.
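The experimental protocol, ten-fold cross-validation over several classifier families, is easy to reproduce in outline. The study itself uses the WEKA workbench and a protein-structure dataset; the sketch below substitutes scikit-learn and synthetic data, with classifier choices that merely approximate WEKA's J48, NaiveBayes, and IBk.

```python
# Illustrative stand-in for the WEKA protocol: ten-fold cross-validation over
# several classifier families on a synthetic dataset (the protein-structure
# data is not reproduced here).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
classifiers = {
    "J48-like tree": DecisionTreeClassifier(random_state=0),
    "NaiveBayes": GaussianNB(),
    "IBk-like KNN": KNeighborsClassifier(n_neighbors=5),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)   # ten-fold cross-validation
    print(f"{name:14s} mean accuracy = {scores.mean():.3f}")
```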


2021, Vol. 2021, pp. 1-11
Author(s): Runzi Chen, Shuliang Zhao, Meishe Liang

Multiscale analysis brings great benefits, allowing objects or problems to be observed from different perspectives, and it has practical significance for clustering multiscale data. At present, there is little research on clustering large-scale data when clustering results for small-scale datasets have already been obtained. Clustering large-scale datasets with traditional methods has two disadvantages: (1) the clustering results of the small-scale datasets are not reused, and (2) traditional methods incur greater running overhead. To address these shortcomings, this paper proposes a multiscale clustering framework based on DBSCAN. The framework uses DBSCAN to cluster the small-scale datasets and then introduces the Scaling-Up Cluster Centers (SUCC) algorithm, which generates the cluster centers of the large-scale datasets by merging the clustering results of the small-scale datasets rather than mining the raw large-scale data. We show experimentally that, compared with the traditional DBSCAN algorithm and the leading algorithms DBSCAN++ and HDBSCAN, SUCC not only provides competitive performance but also reduces computational cost. In addition, under the guidance of experts, the accuracy of SUCC is even more competitive.
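The framework's flow, though not SUCC's internals (which the abstract does not specify), can be sketched as follows: DBSCAN clusters each small-scale chunk, per-chunk cluster centers are extracted, and nearby centers are merged without revisiting the raw large-scale data. The chunking and the merge radius are illustrative assumptions.

```python
# Sketch of the framework's flow, not the SUCC algorithm itself: DBSCAN on
# small-scale chunks, then a merge of the chunk-level cluster centers. The
# chunk count, eps values, and merge radius are illustrative assumptions.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=6000, centers=4, cluster_std=0.6, random_state=0)
chunks = np.array_split(X, 6)                 # stand-in "small-scale datasets"

centers = []
for chunk in chunks:
    labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(chunk)
    for lab in set(labels) - {-1}:            # skip DBSCAN noise points
        centers.append(chunk[labels == lab].mean(axis=0))
centers = np.asarray(centers)

# Merge per-chunk centers that describe the same large-scale cluster.
merged = DBSCAN(eps=1.0, min_samples=1).fit_predict(centers)
print(f"{len(centers)} chunk-level centers -> {merged.max() + 1} merged centers")
```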

