Issues of K Means Clustering While Migrating to Map Reduce Paradigm with Big Data: A Survey

Author(s):  
Khyati R Nirmal ◽  
K.V.V. Satyanarayana

In recent times, Big Data analysis has emerged as an essential area in the field of computer science. Extracting significant information from Big Data by separating the data into distinct groups is a crucial task, and it is beyond the capacity of a commonly used personal machine. It is necessary to adopt a distributed environment such as the MapReduce paradigm and to migrate data mining algorithms to it. In data mining, partition-based K-means clustering is one of the most widely used algorithms for grouping data according to the degree of similarity between data points. It requires the number of clusters K and the initial cluster centroids as input. These parameters, whether chosen by the algorithm or supplied by the user, influence the behavior of the algorithm. It is therefore necessary to migrate K-means clustering to MapReduce and to predict the value of K using a machine learning approach. An efficient method for selecting the initial cluster centroids must also be devised and combined with it. This paper surveys several methods for predicting the value of K in K-means clustering and surveys different methodologies for finding the initial cluster centers. Along with the initial value of K and initial centroid selection, the objective of the proposed work is to deal with the analysis of categorical data.
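To make the map and reduce roles in K-means concrete, the following is a minimal single-machine sketch, not the method of any surveyed paper. It assumes numeric points stored as tuples and Euclidean distance; the function names `map_phase`, `reduce_phase`, and `kmeans` are hypothetical. The map step assigns each point to its nearest centroid, and the reduce step averages the points in each group to produce the new centroids.

```python
from collections import defaultdict
from math import dist


def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for every point."""
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        yield nearest, p


def reduce_phase(pairs, centroids):
    """Reduce: average the points assigned to each centroid index."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    # Assumption: a centroid with no assigned points keeps its old position.
    new_centroids = list(centroids)
    for idx, members in groups.items():
        new_centroids[idx] = tuple(sum(c) / len(members) for c in zip(*members))
    return new_centroids


def kmeans(points, centroids, max_iter=100):
    """Repeat map and reduce until the centroids stop changing."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

Both inputs named in the abstract appear as explicit arguments here: K is the length of `centroids`, and the initial centroids are whatever the caller passes in, so a different starting list can lead to a different final clustering.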



Author(s):  
Shalin Eliabeth S. ◽  
Sarju S.

Big data privacy preservation is one of the most pressing issues in industry today. Data privacy problems often go unidentified when input data is published to a cloud environment. Data privacy preservation in Hadoop deals with hiding sensitive information before publishing the input dataset to the distributed environment. This paper investigates the problem of big data anonymization for privacy preservation from the perspectives of scalability and execution time. At present, many cloud applications that rely on big data anonymization face the same kinds of problems. To address them, a data anonymization algorithm called Two-Phase Top-Down Specialization (TPTDS) is introduced and implemented in Hadoop. For the anonymization experiments, 45,222 records of adult information with 15 attribute values were taken as the input data. The proposed TPTDS algorithm is implemented in Hadoop using multidimensional anonymization in the MapReduce framework, which increases the efficiency of the big data processing system. Experiments with the TPTDS algorithm on Hadoop in both one-dimensional and multidimensional MapReduce frameworks show that multidimensional anonymization gives the better result on the input adult dataset. The datasets are generalized in a top-down manner, and the better result of the multidimensional MapReduce framework is reflected in the better IGPL values generated by the algorithm. The anonymization is performed with specialization operations on a taxonomy tree. The experiments show that the solution improves the IGPL values and the anonymity parameter and decreases the execution time of big data privacy preservation compared to the existing algorithm. These experimental results point to useful applications in distributed environments.


Author(s):  
Trupti Vishwambhar Kenekar ◽  
Ajay R. Dani

Because Big Data is a collection of structured, unstructured, and semi-structured data gathered from various sources, it is important both to mine it and to protect the privacy of individual data. Differential privacy is one of the best measures, providing a strong privacy guarantee. The chapter proposes differentially private frequent itemset mining using MapReduce, which requires less time for privately mining large datasets. The chapter discusses the problem of preserving data privacy, the challenges of preserving data privacy in a big data environment, and data privacy techniques and their applications to unstructured data. Analyses of experimental results on structured and unstructured datasets are also presented.


2021 ◽  
Vol 2021 ◽  
pp. 1-8
Author(s):  
Zhihao Zeng

Aiming at the problems of the multimedia computer-aided industrial system, this paper proposes applying a big data mining algorithm to multimedia computer-aided industrial system design and analyzes in detail the impact of multimedia technology on industrial quality. The paper introduces the advantages of using a big data mining algorithm in a multimedia computer technology course, describes the operating environment that the multimedia computer-aided industrial system requires, follows the guiding principles of the overall design learning theory and artistic conception cognition theory, supplements these with specific industrial examples, and discusses multimedia industrial design.


An advanced incremental processing technique is proposed for data analysis so that clustering results are kept up to date. Data arrives continuously from different sources such as social networks, online shopping, sensors, and e-commerce [1]. Because of this Big Data, the results of data mining applications become stale and obsolete over time. Cloud intelligence applications often perform iterative computations (e.g., PageRank) on constantly changing datasets. Although previous studies extend MapReduce for efficient iterative computations, it is too expensive to run an entirely new large-scale MapReduce iterative job to accommodate new changes to the underlying datasets in a timely manner. Our implementation of MapReduce runs [4] on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes several terabytes of data on thousands of machines. Programmers find the system easy to use. Across many MapReduce applications, we observe that in many cases the changes affect only a very small fraction of the dataset, and the newly converged state is very close to the previously converged state. I2MapReduce clustering exploits this observation to save re-computation by starting from the previously converged state [2] and by performing incremental updates on the changing data. The approach improves the job running time and reduces the time needed to refresh Big Data results.
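The idea of restarting an iterative job from the previously converged state can be illustrated with PageRank, the example the abstract names. The sketch below is a single-machine illustration under stated assumptions, not the I2MapReduce implementation. The function `pagerank` and its parameters are hypothetical, the graph is an adjacency dictionary, and every node is assumed to have at least one outgoing edge to a node that is itself a key.

```python
def pagerank(graph, ranks=None, damping=0.85, tol=1e-10, max_iter=1000):
    """Power iteration; `ranks` optionally seeds the run with an earlier result."""
    nodes = sorted(graph)
    n = len(nodes)
    if ranks is None:
        # Cold start: uniform ranks.
        ranks = {v: 1.0 / n for v in nodes}
    iteration = 0
    for iteration in range(1, max_iter + 1):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # Each node spreads its damped rank evenly over its out-links.
            share = damping * ranks[v] / len(graph[v])
            for w in graph[v]:
                new[w] += share
        delta = sum(abs(new[v] - ranks[v]) for v in nodes)
        ranks = new
        if delta < tol:
            break
    return ranks, iteration


# Converge on the original graph, then apply a small change (one new edge).
old_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
old_ranks, _ = pagerank(old_graph)
new_graph = {"a": ["b", "c"], "b": ["c", "a"], "c": ["a"]}

cold_ranks, cold_iters = pagerank(new_graph)                   # from scratch
warm_ranks, warm_iters = pagerank(new_graph, ranks=old_ranks)  # from old state
```

Both runs converge to the same ranks. When the change to the graph is small, the warm start usually begins closer to the new fixed point and so tends to need fewer iterations, although this sketch does not guarantee it for every graph.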


2021 ◽  
Vol 2021 ◽  
pp. 1-7
Author(s):  
Kai Ji

Wireless personal communication networks are easily affected by intrusion data during communication, which makes it impossible to ensure the security of personal information in wireless communication. Therefore, this paper proposes a malicious intrusion data mining algorithm based on legitimate big data in wireless personal communication networks. A clustering algorithm is used to iteratively obtain the central point of the malicious intrusion data and determine its expected membership. Noise in the malicious intrusion data is removed using an objective function, and the membership degree of the communication data is calculated. The change factor of the neighborhood center of gravity of the malicious intrusion data in the wireless personal communication network is determined, the similarity between the characteristics of the malicious intrusion data is determined using the Markov distance, and the mining of malicious intrusion data in the wireless personal communication network supported by legitimate big data is completed. The experimental results show that the accuracy of mining malicious data is high and the mining time is short.


2014 ◽  
Vol 16 (6) ◽  
pp. 37-40 ◽  
Author(s):  
Shital Suryawanshi ◽  
Prof. V.S Wadne

2018 ◽  
Vol 7 (3.4) ◽  
pp. 13
Author(s):  
Gourav Bathla ◽  
Himanshu Aggarwal ◽  
Rinkle Rani

Data mining is one of the most researched fields in computer science. Many studies have been carried out to extract and analyse important information from raw data. Traditional data mining algorithms such as classification, clustering and statistical analysis can process small-scale data with great efficiency and accuracy. Social networking interactions, business transactions and other communications result in Big data: data at a scale that traditional data mining techniques cannot handle. It is observed that traditional data mining algorithms are not capable of storing and processing large-scale data, and where some algorithms are capable, their response time is very high. Big data contains hidden information which, if analysed in an intelligent manner, can be highly beneficial for business organizations. In this paper, we analyse the advancement from traditional data mining algorithms to Big data mining algorithms. Applications of traditional data mining algorithms can be incorporated into Big data mining algorithms in a straightforward way. Several studies have compared traditional data mining with Big data mining, but very few have analysed the most important algorithms within one research work, which is the core motive of our paper. Readers can easily observe the differences between these algorithms along with their pros and cons. Mathematical concepts are applied in data mining algorithms: means and Euclidean distance calculation in Kmeans, vectors and margins in SVM, and Bayes' theorem and conditional probability in the Naïve Bayes algorithm are real examples. Classification and clustering are the most important applications of data mining. In this paper, the Kmeans, SVM and Naïve Bayes algorithms are analysed in detail to observe accuracy and response time from both conceptual and empirical perspectives. Big data technologies such as Hadoop and MapReduce are used for implementing Big data mining algorithms. Performance evaluation metrics such as speedup, scaleup and response time are used to compare traditional mining with Big data mining.


Algorithms ◽  
2018 ◽  
Vol 11 (12) ◽  
pp. 194
Author(s):  
Yaron Gonen ◽  
Ehud Gudes ◽  
Kirill Kandalov

The Map-Reduce (MR) framework has become a popular framework for developing new parallel algorithms for Big Data. Developing efficient algorithms for data mining of big data and distributed databases has become an important problem. In this paper we focus on algorithms producing association rules and frequent itemsets. After reviewing the most recent algorithms that perform this task within the MR framework, we present two new algorithms: one for producing closed frequent itemsets, and a second for producing frequent itemsets when the database is updated and new data is added to the old database. Both algorithms include novel optimizations that are suited to the MR framework as well as to other parallel architectures. A detailed experimental evaluation shows the effectiveness and advantages of the algorithms over existing methods when it comes to large distributed databases.
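For readers unfamiliar with how frequent itemset counting fits the MR model, here is a minimal sketch of a single support-counting pass. It is a basic baseline rather than either of the two algorithms the paper presents (closed itemsets and incremental updates). The names `map_itemsets`, `reduce_counts`, and `frequent_itemsets` are hypothetical, and everything runs in one process instead of on a cluster.

```python
from collections import Counter
from itertools import combinations


def map_itemsets(transaction, size):
    """Map: emit (itemset, 1) for every itemset of the given size in a transaction."""
    for itemset in combinations(sorted(set(transaction)), size):
        yield itemset, 1


def reduce_counts(pairs):
    """Reduce: sum the counts emitted for each itemset."""
    counts = Counter()
    for itemset, n in pairs:
        counts[itemset] += n
    return counts


def frequent_itemsets(transactions, size, min_support):
    """Keep the itemsets that occur in at least `min_support` transactions."""
    pairs = (kv for t in transactions for kv in map_itemsets(t, size))
    counts = reduce_counts(pairs)
    return {s: c for s, c in counts.items() if c >= min_support}
```

With the transactions `[["a", "b", "c"], ["a", "b"], ["a", "b", "d"]]`, a size of 2, and a minimum support of 2, only the pair `("a", "b")` survives, with a count of 3. In a real deployment the map output would be partitioned by itemset key so that each reducer sums a disjoint set of keys.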

