Improvement of K-Means Algorithm for Accelerated Big Data Clustering

Author(s):  
Chunqiong Wu ◽  
Bingwen Yan ◽  
Rongrui Yu ◽  
Zhangshu Huang ◽  
Baoqin Yu ◽  
...  

With the rapid development of computing, and especially in recent years, “Internet +”, cloud platforms, and similar technologies have been adopted across industries, and data of many types has grown in volume. These large volumes of data often contain very rich information, and traditional data retrieval, analysis, and management methods can no longer meet the need to acquire and manage it. Data mining has therefore become one of the ways to obtain useful information quickly. Efficient clustering of large-scale data is one of the important research directions in data mining. The k-means algorithm is the simplest and most basic method for clustering large-scale data. It is simple to operate, fast, and scales well to large data, but it also has serious defects in data processing. To address some of the defects of the traditional k-means algorithm, this paper improves and analyzes the algorithm in two respects.

2013 ◽  
Vol 441 ◽  
pp. 691-694
Author(s):  
Yi Qun Zeng ◽  
Jing Bin Wang

With the rapid development of information technology, data is growing explosively, and handling large-scale data is becoming more and more important. Based on the characteristics of RDF data, we propose compressing it. We construct an index structure called the PAR-Tree Index, then execute queries using the MapReduce parallel computing framework and the PAR-Tree Index. Experimental results show that the algorithm can improve the efficiency of queries over large data.


2020 ◽  
Vol 7 (3) ◽  
pp. 230
Author(s):  
Saifullah Saifullah ◽  
Nani Hidayati

Data mining is a method often needed in large-scale data processing, so it plays an important role in many fields, including industry, finance, weather, science, and technology. Data mining techniques include classification, clustering, regression, variable selection, and market basket analysis. Illiteracy is one of the factors that hinder the quality of human resources, and eradicating illiteracy in the community is one of the basic requirements for improving that quality. The purpose of this study is to cluster illiteracy in Indonesia by province. The study clusters illiteracy data for the 15-44 age group into a high group with 1 node, a medium group with 6 nodes, and a low group with 27 nodes. The results serve as input for the government in setting province-level policies to eradicate illiteracy in Indonesia.

Keywords: Illiteracy, Data Mining, K-Means Clustering


Author(s):  
Anisa Anisa ◽  
Mesran Mesran

Data mining is the discovery of information: the process of searching very large data sets for patterns and trends that support future decision-making. Classification techniques determine patterns from a set of collected records (the training set). The C4.5 method uses an induction algorithm to build a decision tree over the class attribute, and the resulting tree can be minimised. Using graduate employment data from alumni questionnaires, the study is expected to generate information about graduates' interests, talents, and work. Work patterns are sought in large-scale data and analyzed with the C4.5 algorithm, which finds interrelated rules from the classification of objects into different classes or attribute categories that shape those work patterns. The application used is Tanagra, data mining software for academic and research purposes, which provides data mining methods ranging from data analysis to classification.

Keywords: analysis, Data Mining, method C4.5, Tanagra, patterns of work


2014 ◽  
Vol 687-691 ◽  
pp. 1157-1160
Author(s):  
Xin Yu Zhen ◽  
Yong Xia

Hadoop is becoming a necessary part of large-scale data mining systems. This paper is therefore a practical study of data mining tasks on the Hadoop distributed system. The main task is to build a distributed cluster computing environment using Hadoop and implement a data mining task in that environment. We select data clustering as a representative task and study the K-means clustering algorithm in depth.
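As a rough illustration of how K-means fits a map/reduce style of computation, the sketch below expresses one iteration as a map step (assign each point to its nearest centroid) and a reduce step (average the points in each cluster). It is plain single-process Python, not Hadoop code and not the authors' implementation; the function names `map_step`, `reduce_step`, and `kmeans` are hypothetical, and keeping the old centroid for an empty cluster is an assumption made for simplicity.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_step(points, centroids):
    """Map: emit one (cluster index, point) pair per input point."""
    for point in points:
        yield nearest(point, centroids), point


def reduce_step(pairs, centroids):
    """Reduce: average the points assigned to each cluster to get new centroids."""
    groups = defaultdict(list)
    for idx, point in pairs:
        groups[idx].append(point)
    new_centroids = []
    for i, old in enumerate(centroids):
        members = groups.get(i)
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
        else:
            new_centroids.append(
                tuple(sum(dim) / len(members) for dim in zip(*members))
            )
    return new_centroids


def kmeans(points, centroids, max_iter=100):
    """Repeat map and reduce until the centroids stop changing or max_iter is reached."""
    for _ in range(max_iter):
        updated = reduce_step(map_step(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans(data, [(0.0, 0.0), (10.0, 10.0)]))
```

In a distributed setting the map step would typically run in parallel over partitions of the data, with the current centroids shared with every worker; only the per-cluster sums and counts need to reach the reducers.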


2021 ◽  
Author(s):  
Arina Prima Silalahi

Data mining is a process that combines statistics, artificial intelligence, mathematics, and machine learning to extract information from large-scale data in a database. It analyzes large-scale data stored in the database to find meaningful relationships, patterns, or rules. The increasing volume of available data is often not used to produce new knowledge, so large data accumulates without meaning. The purpose of this research is to extract information and produce knowledge through a decision tree, and to show the accuracy or influence of the Iterative Dichotomiser 3 (ID3) algorithm when it is used to predict a situation. The classes or attributes in ID3 are repeatedly split into relative categories. A fuzzy shoulder curve is used as the function that forms the categories of each attribute value. With the fuzzy shoulder curve, the dataset is processed using a decision tree, which is useful for exploring large amounts of data and finding hidden links between multiple potential input variables and a target variable. The result of this study is a decision tree that provides predictions with the ID3 algorithm.
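To make the role of the shoulder curve concrete, the sketch below shows the standard left and right shoulder membership functions and uses them to turn a numeric attribute value into a category label, which a decision tree learner such as ID3 could then split on. The breakpoints `a` and `b`, the labels "low" and "high", and the function names are hypothetical choices for illustration; the abstract does not specify how the study defines its categories.

```python
def left_shoulder(x, a, b):
    """Membership is 1 up to a, falls linearly to 0 at b, and is 0 beyond b."""
    if x <= a:
        return 1.0
    if x >= b:
        return 0.0
    return (b - x) / (b - a)


def right_shoulder(x, a, b):
    """Membership is 0 up to a, rises linearly to 1 at b, and is 1 beyond b."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


def categorize(x, a, b):
    """Return the label whose membership is highest (hypothetical two-category scheme)."""
    memberships = {
        "low": left_shoulder(x, a, b),
        "high": right_shoulder(x, a, b),
    }
    return max(memberships, key=memberships.get)


if __name__ == "__main__":
    # Assumed breakpoints: fully "low" at or below 30, fully "high" at or above 60.
    for value in (20, 35, 50, 70):
        print(value, categorize(value, 30, 60))
```

Choosing the label with the highest membership is only one way to discretize; a fuzzy decision tree could instead carry the membership degrees through the split.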


2020 ◽  
Author(s):  
Isha Sood ◽  
Varsha Sharma

Essentially, data mining concerns computing over data and identifying patterns and trends in it so that we can make decisions or judgments. Data mining concepts have been in use for years, but with the emergence of big data they have become even more common. In particular, scalable mining of such large data sets is a difficult problem that has attracted several recent works. A few of these works use the MapReduce methodology to construct data mining models across the data set. In this article, we examine current approaches to large-scale data mining and compare their output to the MapReduce model. Based on our research, a data mining system that combines MapReduce and sampling is implemented and discussed.


Author(s):  
Krzysztof Jurczuk ◽  
Marcin Czajkowski ◽  
Marek Kretowski

This paper concerns the evolutionary induction of decision trees (DT) for large-scale data. Such a global approach is one of the alternatives to top-down inducers. It searches for the tree structure and tests simultaneously and thus, in many situations, improves the prediction and size of the resulting classifiers. However, it is a population-based, iterative approach that can be too computationally demanding to apply directly to big data mining. The paper demonstrates that this barrier can be overcome by smart distributed/parallel processing. Moreover, we ask whether the global approach can truly compete with greedy systems on large-scale data. For this purpose, we propose a novel multi-GPU approach. It combines knowledge of global DT induction and evolutionary algorithm parallelization with efficient use of GPU memory and computing resources. The searches for the tree structure and tests are performed simultaneously on a CPU, while the fitness calculations are delegated to GPUs. A data-parallel decomposition strategy and the CUDA framework are applied. Experimental validation is performed on both artificial and real-life datasets. In both cases, the obtained acceleration is very satisfactory. The solution is able to process even billions of instances in a few hours on a single workstation equipped with 4 GPUs. The impact of data characteristics (size and dimension) on the convergence and speedup of the evolutionary search is also shown. As the number of GPUs grows, nearly linear scalability is observed, which suggests that data size boundaries for evolutionary DT mining are fading.


2018 ◽  
Vol 3 (1) ◽  
pp. 1-18
Author(s):  
Kislaya Kunjan ◽  
Huanmei Wu ◽  
Tammy R. Toscos ◽  
Bradley N. Doebbeling
