Computational analysis of incremental clustering approaches for Large Data

2021 ◽  
Vol 15 ◽  
pp. 14-18
Author(s):  
Arun Pratap Singh Kushwah ◽  
Shailesh Jaloree ◽  
Ramjeevan Singh Thakur

Clustering is a data mining approach that helps find the underlying hidden structure in a dataset. K-means is a clustering method that uses distance functions to measure the similarity or dissimilarity between instances. DBSCAN is a clustering algorithm that discovers clusters of arbitrary shape and size in large volumes of data using a spatial density method. These two approaches are classical methods for efficient clustering, but they underperform when the data in the database is updated frequently, so incremental (gradual) clustering approaches are preferred in such environments. In this paper, an incremental clustering approach using K-means and DBSCAN is introduced to handle new data that is added to the database dynamically at intervals.
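As a rough illustration of the incremental idea (not the authors' algorithm, which the abstract does not spell out), the sketch below folds newly arrived points into existing K-means clusters with a running-mean update instead of re-clustering everything. The names `nearest` and `add_batch` and the sample values are hypothetical.

```python
import math


def nearest(centroids, point):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], point))


def add_batch(centroids, counts, batch):
    """Absorb a batch of new points into existing clusters without re-clustering.

    centroids: list of coordinate lists, updated in place
    counts: number of points each centroid already represents, updated in place
    Returns the cluster index assigned to each new point.
    """
    labels = []
    for p in batch:
        i = nearest(centroids, p)
        counts[i] += 1
        # Running-mean update: c <- c + (p - c) / n
        centroids[i] = [c + (x - c) / counts[i] for c, x in zip(centroids[i], p)]
        labels.append(i)
    return labels


# Two existing clusters, each assumed to hold one point so far.
centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]
labels = add_batch(centroids, counts, [[2.0, 0.0], [10.0, 12.0]])
```

Each new point costs one distance computation per centroid, so the work grows with the size of the update rather than the size of the whole database.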

2019 ◽  
Vol 1 (1) ◽  
pp. 31-39
Author(s):  
Ilham Safitra Damanik ◽  
Sundari Retno Andani ◽  
Dedi Sehendro

Milk is an important part of the diet for meeting nutritional needs, and it is consumed by both children and adults. Indonesia has many producers of fresh milk, but their output is not sufficient for national milk needs. Data mining is a field of computer science that is widely used in research. One of the data mining techniques is clustering, a method of grouping data. Clustering works better when a large amount of data is used. The data used here are provincial data in Indonesia from 2000 to 2017, obtained from the Central Statistics Agency. The study produces clusters for 2 milk-producing groups, namely high-producing and low-producing regions. From the 27 data records on fresh milk production in Indonesia, two provinces fall into the high-level cluster: West Java and East Java. The other 25, together with 7 provinces that were not included in the K-Means clustering calculation, fall into the low-level cluster.


Author(s):  
J. W. Li ◽  
X. Q. Han ◽  
J. W. Jiang ◽  
Y. Hu ◽  
L. Liu

Abstract. How to establish an effective method for analysing large spatio-temporal geographic data, and how to find the hidden value behind geographic information quickly and accurately, has become a current research focus. Researchers have found that clustering analysis methods from the data mining field can effectively mine the knowledge and information hidden in complex and massive spatio-temporal data, and density-based clustering is one of the most important clustering methods. However, the traditional DBSCAN clustering algorithm has drawbacks in parameter selection that are difficult to overcome. For example, its two important parameters, the Eps neighborhood and the MinPts density, need to be set manually. Suitable parameters for reasonable clustering results cannot be selected by following the parameter-setting guidelines of the traditional DBSCAN algorithm, so it cannot produce accurate clustering results. To solve the problem of misclassification and density sparsity caused by unreasonable parameter selection in DBSCAN, this paper proposes a DBSCAN-based, data-efficient density clustering method with improved parameter optimization. Its evaluation index function (Optimal Distance) is obtained by running k-clustering for successive values of k, and the optimal solution is selected. The optimal k-value in k-clustering is used to cluster the samples. Through mathematical and physical analysis, appropriate values of Eps and MinPts can be determined. Finally, the clustering results are obtained by DBSCAN clustering. Experiments show that this method can select reasonable parameters for DBSCAN clustering, which supports the superiority of the method described in this paper.
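To make the role of the two parameters concrete, here is a minimal sketch of the standard DBSCAN core-point test: a point is a core point when its Eps-neighborhood contains at least MinPts points. It does not implement the paper's Optimal Distance procedure. The function names and sample coordinates are hypothetical.

```python
import math


def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (the point itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def core_points(points, eps, min_pts):
    """Indices of points whose Eps-neighborhood holds at least min_pts points."""
    return [i for i in range(len(points))
            if len(region_query(points, i, eps)) >= min_pts]


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]

# A small change in Eps changes which points count as dense.
tight = core_points(points, eps=1.0, min_pts=3)
loose = core_points(points, eps=1.5, min_pts=3)
```

With `eps=1.0` only the first point is a core point, while `eps=1.5` makes the first three points core points, which illustrates why manual parameter choice affects the clustering result.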


2020 ◽  
Vol 10 (2) ◽  
pp. 21-39
Author(s):  
Archana Yashodip Chaudhari ◽  
Preeti Mulay

Intelligent electricity meters (IEMs) form a key infrastructure necessary for the growth of smart grids. IEMs generate a considerable amount of electricity data incrementally. However, on an influx of new data, traditional clustering tasks re-cluster all of the data from scratch. Incremental clustering is an essential way to solve the problem of clustering dynamic data. Given the volume of IEM data and the number of data types involved, an incremental clustering method is highly complex. Microsoft Azure provides the processing power necessary to handle incremental clustering analytics. The proposed Cloud4NFICA is a scalable platform for a nearness factor-based incremental clustering algorithm. This research uses a real dataset of Irish households collected by IEMs, together with related socioeconomic data. Cloud4NFICA is incremental in nature and hence accommodates the influx of new data. Cloud4NFICA was designed as an infrastructure as a service. The study shows that the developed system performs well in terms of scalability.


2021 ◽  
Vol 5 (1) ◽  
pp. 258
Author(s):  
Bernadus Gunawan Sudarsono ◽  
Sri Poedji Lestari

Scholarship recipients are grouped by accumulated value using clustering, so that recipients are given scholarships of different amounts and sizes, because scholarships from foundations are limited and are distributed in levels. Students who receive scholarships from foundations are divided into groups with a data mining clustering method, that is, a method for grouping items, using the K-means algorithm. The clustering divides the students into four groups based on the number of criteria, and the grouping results show the numbers and the foundation's decision on granting foundation scholarships to students.


Author(s):  
Md. Zakir Hossain ◽  
Md.Nasim Akhtar ◽  
R.B. Ahmad ◽  
Mostafijur Rahman

Data mining is the process of finding the structure of data in large data sets. With this process, decision makers can make specific decisions for the further development of real-world problems. Several data clustering techniques are used in data mining for finding specific patterns in data. The K-means method is one of the familiar clustering techniques for clustering large data sets. The K-means clustering method partitions the data set based on the assumption that the number of clusters is fixed. The main problem of this method is that if the number of clusters is chosen too small, there is a higher probability of adding dissimilar items to the same group. On the other hand, if the number of clusters is chosen too high, there is a higher chance of adding similar items to different groups. In this paper, we address this issue by proposing a new K-Means clustering algorithm. The proposed method performs data clustering dynamically. It initially calculates a threshold value as a centroid of K-Means, and the number of clusters is formed based on this value. At each iteration of K-Means, if the Euclidean distance between two points is less than or equal to the threshold value, then these two data points are placed in the same group. Otherwise, the proposed method creates a new cluster with the dissimilar data point. The results show that the proposed method outperforms the original K-Means method.
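The threshold rule described above can be sketched as follows. This is only an approximation under stated assumptions: the abstract does not say how the threshold is computed, so here it is passed in as a plain parameter, and each point is compared against the nearest existing centroid. The function name `threshold_clusters` is hypothetical.

```python
import math


def threshold_clusters(points, threshold):
    """Group points so that each joins the nearest cluster if it is close enough.

    A point whose distance to the nearest centroid exceeds threshold starts a
    new cluster, so the number of clusters is not fixed in advance.
    """
    centroids, members = [], []
    for p in points:
        best = min(range(len(centroids)),
                   key=lambda i: math.dist(centroids[i], p),
                   default=None)
        if best is not None and math.dist(centroids[best], p) <= threshold:
            members[best].append(p)
            n = len(members[best])
            # Recompute the centroid as the mean of the cluster's members.
            centroids[best] = tuple(sum(q[d] for q in members[best]) / n
                                    for d in range(len(p)))
        else:
            # Dissimilar point: open a new cluster around it.
            centroids.append(tuple(p))
            members.append([p])
    return centroids, members


data = [(0.0, 0.0), (1.0, 0.0), (8.0, 8.0), (9.0, 8.0)]
centroids, members = threshold_clusters(data, threshold=2.0)
```

In this sketch the result depends on the order in which points arrive, which is a known property of single-pass threshold schemes and not a claim about the paper's method.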


2017 ◽  
Vol 7 (1.3) ◽  
pp. 37
Author(s):  
Joy Christy A.

Data mining refers to the extraction of meaningful knowledge from large data sources, which may contain hidden potential facts. In general, data mining analysis can be either predictive or descriptive. Predictive analysis draws inferences from existing results in order to identify future outputs, while descriptive analysis interprets the intrinsic characteristics or nature of the data. Clustering is one of the descriptive analysis techniques of data mining; it groups objects of similar types in such a way that objects in a cluster are closer to each other than to the objects of other clusters. K-means is the most popular and widely used clustering algorithm. It starts by selecting k random initial centroids, where k is the number of clusters given by the user. It then computes the distance between the initial centroids and the remaining data objects and assigns each data object to the cluster whose centroid is at minimum distance. This process is repeated until there is no change in the cluster centroids or cluster members. However, k-means still suffers from several issues, such as choosing the optimum value of k, random initial centroids, an unknown number of iterations, globally optimal cluster solutions and, more importantly, the creation of meaningful clusters when analysing datasets from various domains. The accuracy of clustering should never be compromised. Thus, in this paper, a novel classification-via-clustering algorithm called Iterative Linear Regression Clustering with Percentage Split Distribution (ILRCPSD) is introduced as an alternative solution to the problems encountered in traditional clustering algorithms. The proposed algorithm is examined on an educational dataset to identify hidden groups of students having similar cognitive and competency skills. The performance of the proposed algorithm is compared with the accuracy of traditional k-means clustering in terms of building meaningful clusters, in order to demonstrate its real-time usefulness.
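The traditional k-means procedure summarized in this abstract (random initial centroids, nearest-centroid assignment, repeat until membership stops changing) can be sketched as below. It illustrates the baseline only, not ILRCPSD. The `seed` and `max_iter` parameters are assumptions added for reproducibility and safety.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means: returns (centroids, labels) for a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k random initial centroids
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_labels = [min(range(k), key=lambda i: math.dist(centroids[i], p))
                      for p in points]
        if new_labels == labels:  # no change in cluster members: converged
            break
        labels = new_labels
        # Move each centroid to the mean of its members.
        for i in range(k):
            cluster = [p for p, lab in zip(points, labels) if lab == i]
            if cluster:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centroids, labels


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, labels = kmeans(points, k=2)
```

On this small, well-separated sample every choice of initial centroids leads to the same two groups, but in general the result depends on the random start, which is one of the issues the abstract lists.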


2018 ◽  
Vol 1 (2) ◽  
pp. 211
Author(s):  
Prahasti Prahasti

Abstract - This research applies data mining by grouping the types and recipients of zakat. The k-means clustering algorithm is applied to input data grouped by education and type of work in the distribution of zakat. Clusters are then formed using the centroid value to determine the closest center point by distance between data. In the k-means clustering algorithm, data processing stops at the iteration in which the grouped data no longer change (fixed data). The test is carried out as an experiment in the RapidMiner software using the k-means clustering method, which consists of input units, data processing units and output units, with k-means clustering grouping data 1-2-1-1, 1-2-1-2 and 3-4-3-4. The tests yield groupings of the distribution of zakat in which the clusters differ from one another. The test results are displayed in a scatter graph.  Keywords - Data Mining, K-Means Clustering, Zakat


Data mining is the process of extracting useful information, that is, finding new information in pre-existing databases. It is the procedure of mining facts from data and deals with the kinds of patterns that can be mined. This proposed work aims to detect and categorize the illness of people affected by Dengue through data mining techniques, mainly clustering. Clustering is the method of finding related groups of data in a dataset and is used to split related data into groups of sub-classes. In this research work, clustering is used to categorize the age groups of people affected by this mosquito-borne viral infection using the K-Means, Hierarchical Clustering and Kohonen-SOM algorithms, implemented in the Tanagra tool. Scientists use data mining algorithms for preventing and defending against diseases such as Dengue. This paper applies these algorithms to the clustering of Dengue fever data in the Tanagra tool to determine which of them gives the best results.


2017 ◽  
Vol 13 (8) ◽  
pp. 155014771772862 ◽  
Author(s):  
Jianpeng Qi ◽  
Yanwei Yu ◽  
Lihong Wang ◽  
Jinglei Liu ◽  
Yingjie Wang

K-means plays an important role in different fields of data mining. However, k-means is often sensitive to its random seed selection. Motivated by this, this article proposes an optimized k-means clustering method, named k*-means, along with three optimization principles. First, we propose a hierarchical optimization principle initialized by k* seeds ([Formula: see text]) to reduce the risk of random seed selection, and then use the proposed "top-n nearest clusters merging" to merge the nearest clusters in each round until the number of clusters reaches [Formula: see text]. Second, we propose an "optimized update principle" that updates incrementally from the moved points instead of recalculating the mean and [Formula: see text] of each cluster in every k-means iteration, to minimize computation cost. Third, we propose a "cluster pruning strategy" to improve the efficiency of k-means. This strategy omits the farther clusters to shrink the adjustable space in each iteration. Experiments performed on real UCI and synthetic datasets verify the efficiency and effectiveness of our proposed algorithm.
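The idea of updating from moved points can be illustrated with a small sketch that keeps a per-cluster coordinate sum and count, so moving one point adjusts two clusters in time proportional to the dimension. This is an assumption about how such an update might look, not the k*-means implementation; the `Cluster` class and `move` function are hypothetical, and the second quantity elided in the abstract is not modelled.

```python
class Cluster:
    """Cluster summary that supports constant-time add/remove of single points."""

    def __init__(self, points):
        self.count = len(points)
        self.total = [sum(c) for c in zip(*points)]  # per-dimension sums

    def add(self, p):
        self.count += 1
        self.total = [s + x for s, x in zip(self.total, p)]

    def remove(self, p):
        self.count -= 1
        self.total = [s - x for s, x in zip(self.total, p)]

    def mean(self):
        return [s / self.count for s in self.total]


def move(p, src, dst):
    """Reassign point p from src to dst without rescanning either cluster."""
    src.remove(p)
    dst.add(p)


a = Cluster([(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)])
b = Cluster([(10.0, 0.0)])
move((4.0, 0.0), a, b)
mean_a, mean_b = a.mean(), b.mean()
```

Only points that change cluster trigger any work, so iterations in which few points move become cheap compared with recomputing every mean from scratch.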

