Computational analysis of incremental clustering approaches for Large Data

International Journal of Computers and Communications ◽

10.46300/91013.2021.15.3 ◽

2021 ◽

Vol 15 ◽

pp. 14-18

Author(s):

Arun Pratap Singh Kushwah ◽

Shailesh Jaloree ◽

Ramjeevan Singh Thakur

Keyword(s):

Data Mining ◽

Clustering Algorithm ◽

Computational Analysis ◽

Large Data ◽

Distance Functions ◽

Spatial Density ◽

Incremental Clustering ◽

Clustering Method ◽

Density Method ◽

Incremental Approach

Clustering is a data mining approach that helps find the hidden structure underlying a dataset. K-means is a clustering method that uses distance functions to measure the similarity or dissimilarity between instances. DBSCAN is a clustering algorithm that discovers clusters of arbitrary shape and size in huge volumes of data using a spatial density method. These two approaches are the classical methods for efficient clustering, but they underperform when the data in the database is updated frequently, so incremental (gradual) clustering approaches are preferred in such environments. In this paper, an incremental approach to clustering is introduced that uses K-means and DBSCAN to handle new data that is dynamically added to the database at intervals.
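As a rough illustration of the incremental idea in the abstract above, the sketch below folds newly arrived points into existing K-means clusters without re-clustering the old data. It is not the authors' algorithm: the function names `nearest` and `incremental_update`, the running-mean update, and the sample data are assumptions made for this example, and the DBSCAN side is not covered.

```python
import math

def nearest(centroids, x):
    """Return the index of the centroid closest to x (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(centroids[j], x))

def incremental_update(centroids, counts, new_points):
    """Fold newly arrived points into existing clusters without re-clustering.

    centroids: list of cluster means (lists of floats), updated in place
    counts:    number of points already in each cluster, updated in place
    Returns the cluster index assigned to each new point.
    """
    labels = []
    for x in new_points:
        j = nearest(centroids, x)
        counts[j] += 1
        # Running-mean update: mean_new = mean_old + (x - mean_old) / n_new
        centroids[j] = [m + (xi - m) / counts[j] for m, xi in zip(centroids[j], x)]
        labels.append(j)
    return labels

# Hypothetical state: two clusters holding one point each, then a new batch arrives.
centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]
labels = incremental_update(centroids, counts, [[2.0, 0.0], [10.0, 12.0]])
print(labels, centroids, counts)
```

Only the centroids and per-cluster counts are kept between batches, so the cost of an update depends on the size of the new batch rather than on the size of the whole database.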


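Teknik Data Mining Dalam Clustering Produksi Susu Segar Di Indonesia Dengan Algoritma K-Means [Data Mining Techniques for Clustering Fresh Milk Production in Indonesia with the K-Means Algorithm]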

BRAHMANA: Jurnal Penerapan Kecerdasan Buatan ◽

10.30645/brahmana.v1i1.5 ◽

2019 ◽

Vol 1 (1) ◽

pp. 31-39

Author(s):

Ilham Safitra Damanik ◽

Sundari Retno Andani ◽

Dedi Sehendro

Keyword(s):

Data Mining ◽

Milk Production ◽

Clustering Algorithm ◽

Clustering Method ◽

Data Mining Techniques ◽

Low Level ◽

Fresh Milk ◽

Nutritional Needs ◽

High Level ◽

Level Cluster

Milk is an important part of the diet for meeting nutritional needs, and it is consumed by both children and adults. Indonesia has many producers of fresh milk, but their output is not sufficient to meet national demand. Data mining is a field of computer science that is widely used in research. One data mining technique is clustering, a method that groups data. Clustering works better when a large amount of data is available. The data used are provincial data for Indonesia from 2000 to 2017, obtained from the Central Statistics Agency. The study produces two clusters of milk-producing regions: high-level producers and low-level producers. From 27 data records on fresh milk production in Indonesia, two provinces fall in the high-level cluster: West Java and East Java. The other 25 fall in the low-level cluster, along with 7 provinces that were not included in the K-Means clustering calculation.


AN EFFICIENT CLUSTERING METHOD FOR DBSCAN GEOGRAPHIC SPATIO-TEMPORAL LARGE DATA WITH IMPROVED PARAMETER OPTIMIZATION

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-archives-xlii-3-w10-581-2020 ◽

2020 ◽

Vol XLII-3/W10 ◽

pp. 581-584

Author(s):

J. W. Li ◽

X. Q. Han ◽

J. W. Jiang ◽

Y. Hu ◽

L. Liu

Keyword(s):

Parameter Optimization ◽

Clustering Algorithm ◽

Optimal Solution ◽

Large Data ◽

Parameter Selection ◽

Physical Analysis ◽

Clustering Method ◽

K Value ◽

Dbscan Clustering ◽

Spatio Temporal

Abstract. How to establish an effective method for analysing large geographic spatio-temporal data, and how to quickly and accurately find the hidden value behind geographic information, has become a current research focus. Researchers have found that clustering analysis methods from the data mining field can effectively mine the knowledge and information hidden in complex and massive spatio-temporal data, and density-based clustering is one of the most important clustering methods. However, the traditional DBSCAN clustering algorithm has drawbacks in parameter selection that are difficult to overcome. For example, its two important parameters, the Eps neighborhood and the MinPts density, need to be set manually. Suitable parameters that yield reasonable clustering results cannot be selected by following the guiding principles for parameter setting in the traditional DBSCAN algorithm, so it cannot produce accurate clustering results. To solve the problems of misclassification and density sparsity caused by unreasonable parameter selection in the DBSCAN clustering algorithm, this paper proposes an efficient DBSCAN-based density clustering method with improved parameter optimization. Its evaluation index function (Optimal Distance) is obtained by cycling through k-clustering in turn, and the optimal solution is selected. The optimal k-value in k-clustering is used to cluster the samples. Through mathematical and physical analysis, the appropriate Eps and MinPts parameters can be determined. Finally, the clustering results are obtained by DBSCAN clustering. Experiments show that this method can select reasonable parameters for DBSCAN clustering, which demonstrates the advantage of the method described in this paper.
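To make the Eps selection problem concrete, the sketch below computes the sorted k-distance list, a widely used heuristic for picking Eps by eye. This is not the Optimal Distance method proposed in the paper, whose details are not given in the abstract; the function name `k_distances` and the sample points are assumptions for illustration only.

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbour, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Hypothetical data: a tight group of four points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
kd = k_distances(points, k=2)
print(kd)
```

In the sorted list, a sharp jump between consecutive values hints at the boundary between dense points and noise, and Eps is typically chosen just below that jump. This brute-force version is quadratic in the number of points, so it is only a sketch for small samples.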


Cloud4NFICA-Nearness Factor-Based Incremental Clustering Algorithm Using Microsoft Azure for the Analysis of Intelligent Meter Data

International Journal of Information Retrieval Research ◽

10.4018/ijirr.2020040102 ◽

2020 ◽

Vol 10 (2) ◽

pp. 21-39

Author(s):

Archana Yashodip Chaudhari ◽

Preeti Mulay

Keyword(s):

Smart Grids ◽

Clustering Algorithm ◽

Incremental Clustering ◽

Clustering Method ◽

Data Types ◽

Dynamic Data ◽

Real Dataset ◽

Processing Power ◽

Microsoft Azure ◽

Socioeconomic Data

Intelligent electricity meters (IEMs) form a key infrastructure necessary for the growth of smart grids. IEMs generate a considerable amount of electricity data incrementally. However, when new data arrives, traditional clustering re-clusters all of the data from scratch. Incremental clustering is an essential way to solve the problem of clustering dynamic data. Given the volume of IEM data and the number of data types involved, incremental clustering is highly complex. Microsoft Azure provides the processing power necessary to handle incremental clustering analytics. The proposed Cloud4NFICA is a scalable platform for a nearness factor-based incremental clustering algorithm. This research uses a real dataset of Irish households collected by IEMs, together with related socioeconomic data. Cloud4NFICA is incremental in nature and hence accommodates the influx of new data. Cloud4NFICA was designed as an infrastructure as a service. The study shows that the developed system performs well in terms of scalability.


Clustering Penerima Beasiswa Yayasan Untuk Mahasiswa Menggunakan Metode K-Means [Clustering Foundation Scholarship Recipients Among University Students Using the K-Means Method]

JURNAL MEDIA INFORMATIKA BUDIDARMA ◽

10.30865/mib.v5i1.2670 ◽

2021 ◽

Vol 5 (1) ◽

pp. 258

Author(s):

Bernadus Gunawan Sudarsono ◽

Sri Poedji Lestari

Keyword(s):

Data Mining ◽

Clustering Algorithm ◽

Clustering Method ◽

Scholarship Recipients

Scholarship recipients are grouped by accumulated score using clustering, so that recipients can be awarded scholarships of different amounts, because foundation scholarships are limited and are distributed in tiers. Students who receive foundation scholarships are divided into groups using the clustering method of data mining, where clustering is the task of grouping items; the clustering algorithm used is K-means. The clustering results divide the students into four groups according to the criteria, and the grouping shows the number of recipients and the foundation's decision on granting scholarships to students.


A dynamic K-means clustering for data mining

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v13.i2.pp521-526 ◽

2019 ◽

Vol 13 (2) ◽

pp. 521

Author(s):

Md. Zakir Hossain ◽

Md.Nasim Akhtar ◽

R.B. Ahmad ◽

Mostafijur Rahman

Keyword(s):

Data Mining ◽

Clustering Algorithm ◽

Large Data ◽

Threshold Value ◽

Specific Pattern ◽

Large Data Sets ◽

Data Sets ◽

Data Set ◽

Number Of Clusters ◽

Data Points

Data mining is the process of finding structure in large data sets. With this process, decision makers can make informed decisions about real-world problems. Several data clustering techniques are used in data mining to find specific patterns in data. The K-means method is one of the familiar clustering techniques for clustering large data sets. The K-means clustering method partitions the data set based on the assumption that the number of clusters is fixed. The main problem of this method is that if the number of clusters is chosen too small, there is a higher probability of placing dissimilar items in the same group. On the other hand, if the number of clusters is chosen too high, there is a higher chance of placing similar items in different groups. In this paper, we address this issue by proposing a new K-Means clustering algorithm. The proposed method performs data clustering dynamically. It initially calculates a threshold value as a centroid of K-Means, and the clusters are formed based on this value. At each iteration of K-Means, if the Euclidean distance between two points is less than or equal to the threshold value, then these two data points will be in the same group. Otherwise, the proposed method will create a new cluster with the dissimilar data point. The results show that the proposed method outperforms the original K-Means method.
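The threshold rule described in this abstract can be sketched as a single pass over the data. This is a simplified illustration rather than the paper's method: the abstract does not say how the threshold is computed, so `threshold` is passed in as a parameter, and comparing each point against cluster centroids (rather than against individual points) is an assumption, as is the function name `threshold_clustering`.

```python
import math

def threshold_clustering(points, threshold):
    """Single pass: join the nearest cluster if within threshold, else start a new one."""
    centroids, members = [], []
    for x in points:
        if centroids:
            j = min(range(len(centroids)), key=lambda c: math.dist(centroids[c], x))
            if math.dist(centroids[j], x) <= threshold:
                members[j].append(x)
                n = len(members[j])
                # Recompute the centroid as the mean of the cluster's members.
                centroids[j] = [sum(col) / n for col in zip(*members[j])]
                continue
        # No cluster is close enough: the point seeds a new cluster.
        centroids.append(list(x))
        members.append([x])
    return centroids, members

# Hypothetical data: two well-separated groups, visited in mixed order.
points = [(0.0, 0.0), (1.0, 0.0), (8.0, 8.0), (9.0, 8.0), (0.0, 1.0)]
centroids, members = threshold_clustering(points, threshold=2.0)
print(len(centroids), [len(m) for m in members])
```

With this rule the number of clusters is an outcome of the threshold rather than an input, which is the contrast with fixed-k K-means that the abstract draws.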


An advanced ILRCPSD technique for bridging the competency and cognitive skills of students in higher education

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i1.3.8984 ◽

2017 ◽

Vol 7 (1.3) ◽

pp. 37

Author(s):

Joy Christy A.

Keyword(s):

Data Mining ◽

Cognitive Skills ◽

Clustering Algorithm ◽

Descriptive Analysis ◽

Clustering Algorithms ◽

Large Data ◽

Global Optimum ◽

Optimum Number ◽

Data Objects ◽

Alternate Solution

Data mining refers to the extraction of meaningful knowledge from large data sources, which may contain hidden potential facts. In general, data mining analysis can be either predictive or descriptive. Predictive analysis draws inferences from existing results in order to identify future outputs, while descriptive analysis interprets the intrinsic characteristics or nature of the data. Clustering is a descriptive analysis technique that groups similar objects in such a way that objects in a cluster are closer to each other than to the objects of other clusters. K-means is the most popular and widely used clustering algorithm. It starts by selecting k random initial centroids, where k is the number of clusters given by the user. It then computes the distance between the initial centroids and the remaining data objects and assigns each data object to the cluster centroid at minimum distance. This process is repeated until there is no change in the cluster centroids or cluster members. However, k-means still suffers from several issues, such as choosing the optimum k, random initial centroids, an unknown number of iterations, reaching globally optimal clusters and, more importantly, creating meaningful clusters when analysing datasets from various domains. The accuracy of clustering should never be compromised. Thus, in this paper, a novel classification-via-clustering algorithm called Iterative Linear Regression Clustering with Percentage Split Distribution (ILRCPSD) is introduced as an alternative solution to the problems encountered in traditional clustering algorithms. The proposed algorithm is examined on an educational dataset to identify hidden groups of students with similar cognitive and competency skills. The performance of the proposed algorithm is compared with the accuracy of traditional k-means clustering in terms of building meaningful clusters, to demonstrate its real-time usefulness.
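The traditional k-means procedure summarised in this abstract (random initial centroids, nearest-centroid assignment, repeat until memberships stop changing) can be sketched as follows. This is a plain baseline for reference, not ILRCPSD; the function name `kmeans`, the `seed` and `max_iter` parameters, and the sample data are assumptions for the example.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means: random initial centroids, then assign/update until stable."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(centroids[j], p)) for p in points
        ]
        if new_labels == labels:
            break  # memberships did not change, so the centroids will not either
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            cluster = [p for p, lab in zip(points, labels) if lab == j]
            if cluster:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(cluster) for col in zip(*cluster)]
    return centroids, labels

# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Which group receives label 0 depends on the random initial centroids, and on less cleanly separated data different seeds can also lead to different partitions, which is the sensitivity to initialisation that the abstract lists among the issues of k-means.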


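DATA MINING DALAM PENGELOMPOKAN JENIS DAN JUMLAH PEMBAGIAN ZAKAT DENGAN MENGGUNAKAN METODE CLUSTERING K-MEANS (STUDI KASUS: BADAN AMIL ZAKAT KOTA BENGKULU) [Data Mining for Grouping the Types and Amounts of Zakat Distribution Using the K-Means Clustering Method (Case Study: Badan Amil Zakat Kota Bengkulu)]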

JURNAL TEKNOLOGI INFORMASI ◽

10.36294/jurti.v1i2.298 ◽

2018 ◽

Vol 1 (2) ◽

pp. 211

Author(s):

Prahasti Prahasti

Keyword(s):

Data Mining ◽

Data Processing ◽

Clustering Algorithm ◽

Test Results ◽

Clustering Method ◽

Center Point

Abstract - This research applies data mining to group the types and recipients of zakat. The k-means clustering algorithm is applied, with the input data grouped by education and type of work in the distribution of zakat. Clusters are then formed using centroid values to determine the nearest center point for each data item. In the k-means clustering algorithm, processing stops at the iteration in which the grouped data no longer changes (fixed data). Testing is done with the RapidMiner software, using the k-means clustering method, which consists of input, data processing, and output units; k-means clustering groups the data as 1-2-1-1, 1-2-1-2 and 3-4-3-4. The tests group the distribution of zakat into clusters that differ from one another. The test results are displayed in a scatter graph. Keywords - Data Mining, K-Means Clustering, Zakat


Dengue Disease Detection using K-Means, Hierarchical, Kohonen-SOM Clustering

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.j9066.0881019 ◽

2019 ◽

Vol 8 (10) ◽

pp. 904-907

Keyword(s):

Data Mining ◽

Clustering Algorithm ◽

Research Work ◽

Data Mining Algorithm ◽

Clustering Method ◽

Dengue Disease ◽

Related Data ◽

Som Algorithm ◽

Som Clustering ◽

Kohonen Som

Data mining is the process of extracting useful information and finding new information in existing databases. It is the procedure of mining facts from data, and it deals with the kinds of patterns that can be mined. This work aims to detect and categorize the illness of people affected by dengue using data mining techniques, mainly clustering. Clustering is a method of finding related groups of data in a dataset and is used to split related data into sub-classes. In this research, clustering is used to categorize the age groups of people affected by this mosquito-borne viral infection, using the K-Means, Hierarchical Clustering and Kohonen-SOM algorithms implemented in the Tanagra tool. Scientists use data mining algorithms to help prevent and defend against diseases such as dengue. This paper applies these clustering algorithms to dengue fever data in the Tanagra tool to determine which algorithm gives the best results.


An effective and efficient hierarchical K-means clustering algorithm

International Journal of Distributed Sensor Networks ◽

10.1177/1550147717728627 ◽

2017 ◽

Vol 13 (8) ◽

pp. 155014771772862 ◽

Cited By ~ 8

Author(s):

Jianpeng Qi ◽

Yanwei Yu ◽

Lihong Wang ◽

Jinglei Liu ◽

Yingjie Wang

Keyword(s):

Data Mining ◽

Clustering Algorithm ◽

Hierarchical Optimization ◽

Clustering Method ◽

Number Of Clusters ◽

Computation Cost ◽

Optimization Principle ◽

Pruning Strategy ◽

Efficiency And Effectiveness ◽

Synthetic Datasets

K-means plays an important role in different fields of data mining. However, k-means is often sensitive to its random seed selection. Motivated by this, this article proposes an optimized k-means clustering method, named k*-means, along with three optimization principles. First, we propose a hierarchical optimization principle initialized by k* seeds ([Formula: see text]) to reduce the risk of random seed selection, and then use the proposed “top-n nearest clusters merging” to merge the nearest clusters in each round until the number of clusters reaches [Formula: see text]. Second, we propose an “optimized update principle” that incrementally updates the clusters using only the moved points, instead of recalculating the mean and [Formula: see text] of each cluster in every k-means iteration, to minimize computation cost. Third, we propose a “cluster pruning strategy” to improve the efficiency of k-means. This strategy omits the farther clusters to shrink the adjustable space in each iteration. Experiments performed on real UCI and synthetic datasets verify the efficiency and effectiveness of our proposed algorithm.
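The idea behind the "optimized update principle" can be illustrated with the standard incremental mean formulas: when a point moves between clusters, both means can be corrected from the old means and cluster sizes alone. The sketch below is a generic illustration under that assumption, not the k*-means implementation; it covers only the mean (the second quantity is elided in the abstract), and the function name `move_point` and the sample values are hypothetical.

```python
def move_point(x, src_mean, src_n, dst_mean, dst_n):
    """Move point x from a source cluster to a destination cluster.

    Updates both means in O(d) from the old means and sizes alone,
    instead of re-averaging every member. Requires src_n >= 2.
    """
    # Removing x from a cluster of size n: mean - (x - mean) / (n - 1)
    new_src = [m - (xi - m) / (src_n - 1) for m, xi in zip(src_mean, x)]
    # Adding x to a cluster of size n: mean + (x - mean) / (n + 1)
    new_dst = [m + (xi - m) / (dst_n + 1) for m, xi in zip(dst_mean, x)]
    return new_src, src_n - 1, new_dst, dst_n + 1

# Hypothetical clusters: source {(0,0), (2,0), (7,0)} with mean (3,0),
# destination {(9,0)} with mean (9,0). The point (7,0) moves across.
new_src, src_n, new_dst, dst_n = move_point(
    (7.0, 0.0), [3.0, 0.0], 3, [9.0, 0.0], 1
)
print(new_src, src_n, new_dst, dst_n)
```

Because only the moved points are touched, the cost of an iteration scales with the number of reassignments rather than with the full cluster sizes.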


Incremental Clustering Algorithm for Earth Science Data Mining

Lecture Notes in Computer Science - Computational Science – ICCS 2009 ◽

10.1007/978-3-642-01973-9_42 ◽

2009 ◽

pp. 375-384 ◽

Cited By ~ 1

Author(s):

Ranga Raju Vatsavai

Keyword(s):

Data Mining ◽

Clustering Algorithm ◽

Earth Science ◽

Incremental Clustering ◽

Science Data ◽

Earth Science Data
