Large-scale data classification based on hierarchical clustering and re-sampling

2013 ◽  
Vol 33 (10) ◽  
pp. 2801-2803
Author(s):  
Yong ZHANG ◽  
Panpan FU ◽  
Yuting ZHANG
Author(s):  
Bing Xu

In the process of e-commerce transactions, a large amount of data is generated, and its effective classification is one of the current research hotspots. An improved feature selection method was proposed based on the characteristics of the Bayesian classification algorithm. Because training and testing large-scale data on a single computer is time-consuming, a data classification algorithm based on Naive Bayes was designed and implemented on the Hadoop distributed platform. The experimental results showed that the improved algorithm effectively improved classification accuracy, and that the designed parallel Bayesian data classification algorithm was highly efficient and well suited to the processing and analysis of massive data.
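The abstract does not give the authors' implementation, but the core of a Naive Bayes classifier is count aggregation (class priors and per-class word counts), which is exactly the kind of work that parallelizes naturally as a MapReduce job on Hadoop. A minimal single-machine sketch, with illustrative function and variable names:

```python
from collections import Counter, defaultdict
import math

def train_nb(docs):
    """Count-based training: class priors and per-class word counts.
    These counts are what a MapReduce job would emit and aggregate."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        class_counts[label] += 1
        for w in words:
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def classify(words, class_counts, word_counts, vocab):
    """Argmax of log prior + log likelihoods, with Laplace smoothing."""
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label, count in class_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

# Toy training set (hypothetical data, for illustration only).
docs = [(["cheap", "deal", "buy"], "spam"),
        (["meeting", "schedule"], "ham"),
        (["cheap", "buy", "now"], "spam"),
        (["project", "schedule", "review"], "ham")]
model = train_nb(docs)
print(classify(["cheap", "buy"], *model))  # → spam
```

In a Hadoop version, the mappers would emit `(label, word) → 1` pairs and the reducers would sum them; classification itself is embarrassingly parallel over documents.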


2020 ◽  
Vol 2020 ◽  
pp. 1-16
Author(s):  
Yang Liu ◽  
Xiang Li ◽  
Xianbang Chen ◽  
Xi Wang ◽  
Huaqiang Li

Currently, data classification is one of the most important ways to analyze data. However, with the development of data collection, transmission, and storage technologies, the scale of data has increased sharply. Additionally, because datasets often contain multiple classes with imbalanced distributions, the class-imbalance issue has become increasingly prominent. Traditional machine learning algorithms lack the ability to handle these issues, so classification efficiency and precision may be significantly impacted. Therefore, this paper presents an improved artificial neural network that enables high-performance classification of imbalanced, large-volume data. First, the Borderline-SMOTE (synthetic minority oversampling technique) algorithm is employed to balance the training dataset, which improves the training of the back-propagation neural network (BPNN); then zero-mean normalization, batch normalization, and the rectified linear unit (ReLU) are employed to optimize the input and hidden layers of the BPNN. Finally, an ensemble learning-based parallelization of the improved BPNN is implemented using the Hadoop framework. Positive conclusions can be drawn from the experimental results. Benefiting from Borderline-SMOTE, the imbalanced training dataset is balanced, which improves both training performance and classification accuracy. The improvements to the input and hidden layers also enhance training in terms of convergence. Parallelization and ensemble learning enable the BPNN to perform high-performance large-scale data classification. The experimental results show the effectiveness of the presented classification algorithm.
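The key idea of Borderline-SMOTE, as used above, is to oversample only the minority points that lie near the class boundary: a minority point is "in danger" when at least half of its k nearest neighbours belong to other classes, and synthetic points are then interpolated between such points and their minority-class neighbours. A minimal pure-Python sketch of this idea (not the paper's implementation; names and parameters are illustrative):

```python
import math
import random

def borderline_smote(X, y, minority, k=3, n_new=4, seed=0):
    """Simplified Borderline-SMOTE sketch:
    1. A minority point is 'borderline' (in the DANGER set) if at least
       half of its k nearest neighbours belong to other classes, but not
       all of them (all-majority neighbours would mark it as noise).
    2. Synthetic points are linear interpolations between a borderline
       point and a randomly chosen minority neighbour."""
    rng = random.Random(seed)
    min_idx = [i for i, lab in enumerate(y) if lab == minority]
    danger = []
    for i in min_idx:
        neigh = sorted((j for j in range(len(X)) if j != i),
                       key=lambda j: math.dist(X[i], X[j]))[:k]
        n_maj = sum(1 for j in neigh if y[j] != minority)
        if k / 2 <= n_maj < k:          # borderline, but not pure noise
            danger.append(i)
    synth = []
    for _ in range(n_new):
        if not danger:
            break
        i = rng.choice(danger)
        j = rng.choice([m for m in min_idx if m != i])
        gap = rng.random()              # interpolation factor in [0, 1)
        synth.append(tuple(a + gap * (b - a) for a, b in zip(X[i], X[j])))
    return synth

# Toy 2-D dataset: class 0 is the majority, class 1 the minority.
X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
     (1.0, 1.0), (0.9, 1.1), (2.0, 2.0)]
y = [0, 0, 0, 1, 1, 1]
synth = borderline_smote(X, y, minority=1)
print(synth)
```

Because each synthetic point is a convex combination of two minority samples, its coordinates stay inside the minority class's bounding box, which is why the oversampled training set remains plausible input for the downstream BPNN.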


2006 ◽  
Vol 10 (5) ◽  
pp. 604-616 ◽  
Author(s):  
G. Folino ◽  
C. Pizzuti ◽  
G. Spezzano

IEEE Access ◽  
2017 ◽  
pp. 1-1 ◽  
Author(s):  
Yongkweon Jeon ◽  
Jaeyoon Yoo ◽  
Jongsun Lee ◽  
Sungroh Yoon

2018 ◽  
Vol 23 (11) ◽  
pp. 3793-3801 ◽  
Author(s):  
Tinglong Tang ◽  
Shengyong Chen ◽  
Meng Zhao ◽  
Wei Huang ◽  
Jake Luo
