Research on parallel data processing of data mining platform in the background of cloud computing

Lingrui Bu; Hui Zhang; Haiyan Xing; Lijun Wu

doi:10.1515/jisys-2020-0113

Journal of Intelligent Systems ◽

10.1515/jisys-2020-0113 ◽

2021 ◽

Vol 30 (1) ◽

pp. 479-486

Author(s):

Lingrui Bu ◽

Hui Zhang ◽

Haiyan Xing ◽

Lijun Wu

Keyword(s):

Data Mining ◽

Data Processing ◽

Parallel Algorithm ◽

Large Scale ◽

File System ◽

Large Data ◽

Distributed File System ◽

Data Set ◽

Traditional Algorithm ◽

Hadoop Distributed File System

Abstract: The efficient processing of large-scale data has significant practical value. In this study, a data mining platform based on the Hadoop Distributed File System (HDFS) was designed, and the K-means algorithm was improved using the max-min distance idea. The improved algorithm was parallelized with MapReduce on the HDFS platform. Finally, its data processing performance was analyzed with the Iris data set. The results showed that the parallel algorithm assigned more samples correctly than the traditional algorithm. In a single-machine environment, the parallel algorithm took longer to run. On large data sets, the traditional algorithm ran out of memory, whereas the parallel algorithm completed the calculation task. The acceleration ratio (speedup) of the parallel algorithm rose as the cluster size and data set size grew, showing a good parallel effect. The experimental results verify the reliability of the parallel algorithm in big data processing and contribute to further improving the efficiency of data mining.
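
The abstract does not spell out how the max-min distance idea is used, so the following is only a minimal sketch of the common form of that heuristic for choosing initial K-means centers: each new center is the point whose distance to its nearest already-chosen center is largest. The function name `max_min_centers`, the choice of the first point as the seed, and the sample points are assumptions for illustration, not the authors' implementation.

```python
import math

def max_min_centers(points, k):
    """Pick k initial centers with the max-min distance heuristic."""
    # Assumption: the first point seeds the selection.
    centers = [points[0]]
    # min_dist[i] is the distance from points[i] to its nearest chosen center.
    min_dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # The next center is the point farthest from all chosen centers.
        idx = max(range(len(points)), key=min_dist.__getitem__)
        centers.append(points[idx])
        # Refresh the nearest-center distances against the new center.
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[idx]))
    return centers

# Hypothetical 2-D sample with three well-separated groups.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (0.0, 5.0)]
centers = max_min_centers(points, 3)
```

Spreading the initial centers apart in this way is usually intended to make K-means less sensitive to a poor random start; whether that accounts for the accuracy gain reported above is not stated in the abstract.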

HDFS Security Approaches and Visualization Tracking

Journal of Engineering & Technological Advances ◽

10.35934/segi.v3i1.49 ◽

2018 ◽

Vol 3 (1) ◽

pp. 49-60

Author(s):

M. Elshayeb ◽

Leelavathi Rajamanickam

Keyword(s):

Big Data ◽

Data Processing ◽

Large Scale ◽

File System ◽

Leading Edge ◽

Distributed File System ◽

Complex Data ◽

Processing Technologies ◽

Big Data Visualization ◽

Hadoop Distributed File System

Big Data refers to large-scale information management and analysis technologies that exceed the capability of traditional data processing technologies. To analyse complex data and identify patterns, it is very important to securely store, manage, and share large amounts of complex data. In recent years, databases have grown in size across various forms (text, images, and videos), in huge volumes and at high velocity, and internet services that rely on big data (data-intensive services) have come to the leading edge. Apache's Hadoop Distributed File System (HDFS) is emerging as an outstanding software component for cloud computing, together with integrated pieces such as MapReduce. Hadoop is an open-source implementation of Google's MapReduce that includes a distributed file system and presents software programmers with the abstraction of map and reduce. This research presents security approaches for the Hadoop Distributed File System and identifies the best security solution. It also aims to help businesses through big data visualization, which supports better data analysis. In today's data-centric world, big-data processing and analytics have become critical to most enterprise and government applications.

Applying the K-Means Algorithm in Big Raw Data Sets with Hadoop and MapReduce

Business Intelligence ◽

10.4018/978-1-4666-9562-7.ch062 ◽

2016 ◽

pp. 1220-1243

Author(s):

Ilias K. Savvas ◽

Georgia N. Sofianidou ◽

M-Tahar Kechadi

Keyword(s):

Big Data ◽

Clustering Algorithm ◽

File System ◽

Large Data ◽

Large Data Sets ◽

Distributed File System ◽

Data Sets ◽

Raw Data ◽

Hadoop Distributed File System ◽

Access To Data

Big data refers to data sets whose size is beyond the capabilities of most current hardware and software technologies. The Apache Hadoop software library is a framework for distributed processing of large data sets, HDFS is a distributed file system that provides high-throughput access to data-driven applications, and MapReduce is a software framework for distributed computing on large data sets. Huge collections of raw data require fast and accurate mining processes in order to extract useful knowledge. One of the most popular data mining techniques is the K-means clustering algorithm. In this study, the authors develop a distributed version of the K-means algorithm using the MapReduce framework on the Hadoop Distributed File System. The theoretical and experimental results demonstrate the efficiency of the technique; thus, HDFS and MapReduce can be applied to big data with very promising results.
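
One common way to express a K-means iteration in MapReduce terms is sketched below in plain Python. It does not use the Hadoop API and is not the authors' code: the map step emits the index of the nearest center for each point, a dictionary stands in for the shuffle phase, and the reduce step averages the points grouped under each center. All function names and the sample data are assumptions for illustration.

```python
import math
from collections import defaultdict

def kmeans_map(point, centers):
    """Map step: emit (index of the nearest center, point)."""
    nearest = min(range(len(centers)),
                  key=lambda j: math.dist(point, centers[j]))
    return nearest, point

def kmeans_reduce(assigned_points):
    """Reduce step: average the points assigned to one center."""
    n = len(assigned_points)
    return tuple(sum(coord) / n for coord in zip(*assigned_points))

def kmeans_iteration(points, centers):
    """Run one map, shuffle, and reduce round in a single process."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for p in points:
        key, value = kmeans_map(p, centers)
        groups[key].append(value)
    # A center that attracted no points is kept unchanged.
    return [kmeans_reduce(groups[j]) if j in groups else centers[j]
            for j in range(len(centers))]

# Hypothetical data: two obvious clusters and two rough starting centers.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
new_centers = kmeans_iteration(points, [(1.0, 1.0), (9.0, 9.0)])
```

In a real cluster the map calls would run on separate data blocks and the iteration would be repeated as a chain of jobs until the centers stop moving; this single-process version only shows how the work divides between the two steps.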

Big Data Clustering and Hadoop Distributed File System Architecture

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2019.8256 ◽

2019 ◽

Vol 16 (9) ◽

pp. 3824-3829

Author(s):

Deepak Ahlawat ◽

Deepali Gupta

Keyword(s):

Big Data ◽

Clustering Algorithm ◽

File System ◽

Early Stage ◽

Large Data ◽

Data File ◽

Distributed File System ◽

Hadoop Distributed File System ◽

Data Files ◽

Technological World

Due to advances in the technological world, there has been a great surge in data. The main sources of such large amounts of data are social websites, internet sites, etc. Large data files are combined to create a big data architecture. Managing data files at such a volume is not easy, so modern techniques have been developed to manage bulk data. To arrange and utilize such big data, the Hadoop Distributed File System (HDFS) architecture from Hadoop was presented in the early stage of 2015. This architecture is used when traditional methods are insufficient to manage the data. In this paper, the concepts and frames of Big Data are studied, and a novel clustering algorithm based on K-means and cosine-based similarity clustering is developed to manage large amounts of data. The developed clustering algorithm is evaluated using precision and recall. The results obtained show that the algorithm successfully manages the big data issue.
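
The abstract names cosine-based similarity and the precision and recall measures without giving formulas. The sketch below shows their standard definitions only; it is not the paper's algorithm, and the function names and sample values are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption: neither vector is all zeros.
    return dot / (norm_a * norm_b)

def precision_recall(predicted, relevant):
    """Precision and recall of a predicted set against a reference set."""
    true_positives = len(predicted & relevant)
    return true_positives / len(predicted), true_positives / len(relevant)

# Hypothetical example: items placed in a cluster vs. items that belong there.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 3, 5})
similarity = cosine_similarity((1.0, 2.0), (2.0, 4.0))
```

Cosine similarity compares direction rather than magnitude, which is why it is often preferred over Euclidean distance for sparse, high-dimensional data such as text.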

Large-Scale Web Traffic Log Analyzer on Hadoop Distributed File System

Proceedings of The 3rd International Conference on Intelligent Systems and Image Processing 2015 ◽

10.12792/icisip2015.012 ◽

2015 ◽

Author(s):

Choopan Rattanapoka ◽

Prasertsak Tiawongsombat

Keyword(s):

Large Scale ◽

File System ◽

Distributed File System ◽

Web Traffic ◽

Hadoop Distributed File System

3A1-T03 Large-scale database using the Hadoop Distributed File System and RT-Middleware (RT Middleware and Open Systems)

The Proceedings of JSME annual Conference on Robotics and Mechatronics (Robomec) ◽

10.1299/jsmermd.2014._3a1-t03_1 ◽

2014 ◽

Vol 2014 (0) ◽

pp. _3A1-T03_1-_3A1-T03_2

Author(s):

Isao HARA ◽

Seisho IRIE ◽

Mamoru SEKIYAMA ◽

Tamio TANIKAWA

Keyword(s):

Large Scale ◽

File System ◽

Open Systems ◽

Distributed File System ◽

Hadoop Distributed File System

An Efficient Block Assignment Policy in Hadoop Distributed File System for Multimedia Data Processing

IEICE Transactions on Information and Systems ◽

10.1587/transinf.2019edl8016 ◽

2019 ◽

Vol E102.D (8) ◽

pp. 1569-1571

Author(s):

Cheolgi KIM ◽

Daechul LEE ◽

Jaehyun LEE ◽

Jaehwan LEE

Keyword(s):

Data Processing ◽

File System ◽

Multimedia Data ◽

Distributed File System ◽

Hadoop Distributed File System

The File System Recommendations to Reduce the Space and Time Parameters in Hadoop File Storage and Map Reduce Processing of Big Data Applications

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.j7579.0891020 ◽

2020 ◽

Vol 9 (10) ◽

pp. 353-356

Keyword(s):

Big Data ◽

Data Processing ◽

Data Storage ◽

File System ◽

Distributed File System ◽

Map Reduce ◽

Space And Time ◽

File Storage ◽

Hadoop Distributed File System ◽

Hadoop Framework

Hadoop Distributed File System (HDFS) and Map Reduce (MR) are the key components of the Hadoop framework. Big data scenarios such as Facebook (FB) data processing, or Twitter analytics such as storing and processing tweets, can depend on the Hadoop framework for storage and processing, on which further analytics can be built. Processing such huge amounts of data leads to high space and time consumption in the Hadoop framework. Both the space usage and the processing time need to be reduced to get the fastest response from the framework. This effort is important because the other ecosystem tools also depend on HDFS and MR for data storage and processing. An alternative architecture can improve space usage and utilize resources more effectively, reducing the time requirements of the framework. The outcome of the work is faster data processing and lower space utilization in MR processing, along with other ecosystem tools such as Hive, Flume, Sqoop, and Pig Latin. The work proposes an alternative framework to HDFS and MR, named Unified Space Allocation and Data Processing with Metadata based Distributed File System (USAMDFS).

A comparative study of Distributed Large Scale Data Mining Algorithms

BSSS Journal of Computer ◽

10.51767/jc1102 ◽

2020 ◽

Author(s):

Isha Sood ◽

Varsha Sharma

Keyword(s):

Data Mining ◽

Large Scale ◽

Large Data ◽

Data Sets ◽

Data Set ◽

Data Mining Algorithms ◽

Large Scale Data ◽

Mapreduce Model ◽

Mining Algorithms ◽

Scale Data

Essentially, data mining concerns the computation of data and the identification of patterns and trends in the information to support decisions or judgments. Data mining concepts have been in use for years, but with the emergence of big data they have become even more common. In particular, the scalable mining of such large data sets is a difficult issue that has attracted several recent studies. A few of these recent works use the MapReduce methodology to construct data mining models across the data set. In this article, we examine current approaches to large-scale data mining and compare their output to the MapReduce model. Based on our research, a system for data mining that combines MapReduce and sampling is implemented and discussed.
