Pengelompokan Komentar Dataset Sentipol dengan Modified K-Means Clustering

Clustering is a data mining technique that groups data sets into clusters of similar data. One of the most commonly used clustering algorithms is K-Means. However, K-Means has several weaknesses, one of which is the random factor in initial centroid selection, so the cluster results are inconsistent even when the algorithm is tested on exactly the same data. The Modified K-Means algorithm focuses on selecting the initial centroids to overcome this inconsistency. The test was conducted on the Sentipol dataset and focused only on comment data. The number of clusters was set to 3, based on the number of existing comment labels (positive, negative, and neutral). The test results show that Modified K-Means produces a better purity value than K-Means: an average purity of 0.42 versus 0.391. In the tests of the random factor, run 5 times with the same attributes and test data, the cluster results of Modified K-Means did not change, so the resulting purity value was also the same. With K-Means, by contrast, the cluster results changed in every test, so the purity value is also likely to change.
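The two ideas in this abstract, purity as the evaluation measure and the effect of fixing the initial centroids, can be illustrated with a minimal Python sketch. It is not the paper's Modified K-Means: the abstract does not say how the initial centroids are chosen, so the sketch simply takes them as an argument. The function names `purity` and `kmeans` and the toy points and labels are assumptions made for illustration only.

```python
import math
from collections import Counter


def purity(cluster_ids, labels):
    """Fraction of items that carry the majority label of their cluster."""
    by_cluster = {}
    for c, y in zip(cluster_ids, labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)


def kmeans(points, init_centroids, max_iter=100):
    """Plain Lloyd iterations that start from the given centroids."""
    centroids = [tuple(c) for c in init_centroids]
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (Euclidean distance).
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the run has converged
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[k] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment


# Hypothetical toy data standing in for vectorised comments and their labels.
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0), (9.8, 0.3)]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

# With a fixed (non-random) initialisation, repeated runs give the same clusters.
fixed_init = [points[0], points[2], points[4]]
run_a = kmeans(points, fixed_init)
run_b = kmeans(points, fixed_init)
print(run_a == run_b, purity(run_a, labels))
```

Because nothing in `kmeans` is random once the initial centroids are given, repeated runs return identical assignments and therefore identical purity, which is the behaviour the abstract reports for the modified algorithm. Drawing `init_centroids` at random on each run is what allows the result to vary.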


Data mining techniques

Acta Numerica ◽

10.1017/s0962492901000058 ◽

2001 ◽

Vol 10 ◽

pp. 313-355 ◽

Cited By ~ 16

Author(s):

Markus Hegland

Keyword(s):

Data Mining ◽

Efficient Algorithms ◽

Data Sets ◽

Mathematical Framework ◽

Similar Data ◽

Functional Relationships ◽

New Methods ◽

The Common ◽

Data Objects ◽

Data Collections

Methods for knowledge discovery in data bases (KDD) have been studied for more than a decade. New methods are required owing to the size and complexity of data collections in administration, business and science. They include procedures for data query and extraction, for data cleaning, data analysis, and methods of knowledge representation. The part of KDD dealing with the analysis of the data has been termed data mining. Common data mining tasks include the induction of association rules, the discovery of functional relationships (classification and regression) and the exploration of groups of similar data objects in clustering. This review provides a discussion of and pointers to efficient algorithms for the common data mining tasks in a mathematical framework. Because of the size and complexity of the data sets, efficient algorithms and often crude approximations play an important role.


A dynamic K-means clustering for data mining

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v13.i2.pp521-526 ◽

2019 ◽

Vol 13 (2) ◽

pp. 521-526

Author(s):

Md. Zakir Hossain ◽

Md. Nasim Akhtar ◽

R.B. Ahmad ◽

Mostafijur Rahman

Keyword(s):

Data Mining ◽

Clustering Algorithm ◽

Large Data ◽

Threshold Value ◽

Specific Pattern ◽

Large Data Sets ◽

Data Sets ◽

Data Set ◽

Number Of Clusters ◽

Data Points

Data mining is the process of finding structure in data from large data sets. With this process, decision makers can make a particular decision for further development of real-world problems. Several data clustering techniques are used in data mining to find a specific pattern in the data. The K-means method is one of the familiar techniques for clustering large data sets. It partitions the data set on the assumption that the number of clusters is fixed. The main problem with this method is that if the chosen number of clusters is small, there is a higher probability of putting dissimilar items into the same group. On the other hand, if the chosen number of clusters is high, there is a higher chance of putting similar items into different groups. In this paper, we address this issue by proposing a new K-Means clustering algorithm that performs data clustering dynamically. The proposed method initially calculates a threshold value as a centroid of K-Means, and the number of clusters is formed based on this value. At each iteration of K-Means, if the Euclidean distance between two points is less than or equal to the threshold value, these two data points are placed in the same group. Otherwise, the proposed method creates a new cluster for the dissimilar data point. The results show that the proposed method outperforms the original K-Means method.
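The threshold rule described in this abstract can be sketched roughly as follows. This is an illustrative single-pass reading, not the authors' algorithm. The abstract does not give the formula for the threshold, so it is passed in as a parameter. Comparing each point against the nearest running centroid, rather than against another point, is also an assumption. The name `threshold_clustering` is hypothetical.

```python
import math


def threshold_clustering(points, threshold):
    """Join the nearest cluster if within `threshold`, otherwise open a new one."""
    centroids = []   # running mean of each cluster
    members = []     # points assigned to each cluster so far
    assignment = []  # cluster index for every input point, in input order
    for p in points:
        if centroids:
            # Find the closest existing centroid by Euclidean distance.
            k = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            if math.dist(p, centroids[k]) <= threshold:
                members[k].append(p)
                centroids[k] = tuple(sum(x) / len(members[k]) for x in zip(*members[k]))
                assignment.append(k)
                continue
        # No cluster is close enough: the point starts a new cluster.
        centroids.append(tuple(p))
        members.append([p])
        assignment.append(len(centroids) - 1)
    return assignment, centroids


# Hypothetical one-dimensional layout embedded in 2-D for illustration.
data = [(0, 0), (1, 0), (10, 0), (11, 0)]
loose_assignment, loose_centroids = threshold_clustering(data, threshold=2)
tight_assignment, tight_centroids = threshold_clustering(data, threshold=0.5)
print(len(loose_centroids), len(tight_centroids))
```

In this sketch the number of clusters is an outcome of the threshold rather than an input: a larger threshold merges more points into one group, and a smaller one splits them apart.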


AK-means: an automatic clustering algorithm based on K-means

Journal of Advanced Computer Science & Technology ◽

10.14419/jacst.v4i2.4749 ◽

2015 ◽

Vol 4 (2) ◽

pp. 231 ◽

Cited By ~ 1

Author(s):

Omar Kettani ◽

Faical Ramdani ◽

Benaissa Tadili

Keyword(s):

Data Mining ◽

Fast Algorithm ◽

Clustering Algorithm ◽

Data Sets ◽

Number Of Clusters ◽

Correct Number ◽

Standard Data ◽

Exact Number ◽

Automatic Clustering ◽

Clustering Problems

In data mining, K-means is a simple and fast algorithm for solving clustering problems, but it requires the user to provide the exact number of clusters (k) in advance, which is often not obvious. This paper aims to overcome this problem by proposing a parameter-free algorithm for automatic clustering. It is based on successive adequate restarting of the K-means algorithm. Experiments conducted on several standard data sets demonstrate that the proposed approach is effective and outperforms the related well-known algorithm G-means in terms of clustering accuracy and estimation of the correct number of clusters.


GPGC: Genetic programming for automatic clustering using a flexible non-hyper-spherical graph-based approach

10.26686/wgtn.13058750 ◽

2020 ◽

Author(s):

Andrew Lensen ◽

Bing Xue ◽

Mengjie Zhang

Keyword(s):

Data Mining ◽

Genetic Programming ◽

Data Sets ◽

Number Of Clusters ◽

Automatic Clustering ◽

Graph Based Clustering

© 2017 ACM. Genetic programming (GP) has been shown to be very effective for performing data mining tasks. Despite this, it has seen relatively little use in clustering. In this work, we introduce a new GP approach for performing graph-based (GPGC) non-hyper-spherical clustering where the number of clusters is not required to be set in advance. The proposed GPGC approach is compared with a number of well known methods on a large number of data sets with a wide variety of shapes and sizes. Our results show that GPGC is the most generalisable of the tested methods, achieving good performance across all datasets. GPGC significantly outperforms all existing methods on the hardest ellipsoidal datasets, without needing the user to pre-define the number of clusters. To our knowledge, this is the first work which proposes using GP for graph-based clustering.


GPGC: Genetic programming for automatic clustering using a flexible non-hyper-spherical graph-based approach

10.26686/wgtn.13058750.v1 ◽

2020 ◽

Author(s):

Andrew Lensen ◽

Bing Xue ◽

Mengjie Zhang

Keyword(s):

Data Mining ◽

Genetic Programming ◽

Data Sets ◽

Number Of Clusters ◽

Automatic Clustering ◽

Graph Based Clustering


Pemanfaatan Metode K-Means Dalam Penentuan Persediaan Barang

PIKSEL : Penelitian Ilmu Komputer Sistem Embedded and Logic ◽

10.33558/piksel.v6i1.1398 ◽

2018 ◽

Vol 6 (1) ◽

pp. 41-48

Author(s):

Santoso Setiawan

Keyword(s):

Data Mining ◽

Stock Management ◽

Transaction Data ◽

Business People ◽

Cluster Data ◽

Data Clusters ◽

Inventory Strategy ◽

Transactional Data

Inaccurate stock management leads to high and uneconomical storage costs, because certain products may run out or be overstocked. This poses a serious risk for businesses. The K-Means method is one technique that can help in designing an effective inventory strategy by using the sales transaction data already available in the company. The K-Means algorithm groups the products sold into several clusters of transaction data, which are generally large, and is therefore expected to help entrepreneurs design stock inventory strategies. Keywords: inventory, k-means, product transaction data, rapidminer, data mining.


A Survey on Preparing Data Sets for Data Mining Analysis using Horizontal Aggregations in SQL

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse/v7i4/0199 ◽

2017 ◽

Vol 7 (5) ◽

pp. 172-176

Author(s):

Prashant B. Rajole

Keyword(s):

Data Mining ◽

Data Sets ◽

Data Mining Analysis


Foreign Language Optical Character Recognition, Phase II: Arabic and Persian Training and Test Data Sets.

10.21236/ada325444 ◽

1997 ◽

Author(s):

Robert B. Davidson ◽

Richard L. Hopely

Keyword(s):

Foreign Language ◽

Phase Ii ◽

Test Data ◽

Character Recognition ◽

Optical Character Recognition ◽

Data Sets ◽

Optical Character ◽

Recognition Phase


PCA for heterogeneous data sets in a distributed data mining

Proceedings of the Fourth Annual ACM Bangalore Conference on - COMPUTE '11 ◽

10.1145/1980422.1980451 ◽

2011 ◽

Author(s):

E. Chandra ◽

P. Ajitha

Keyword(s):

Data Mining ◽

Heterogeneous Data ◽

Distributed Data Mining ◽

Data Sets ◽

Distributed Data


Understanding Power System Behavior through Mining Archived Operational Data

International Journal of Emerging Electric Power Systems ◽

10.2202/1553-779x.2211 ◽

2009 ◽

Vol 10 (1) ◽

Cited By ~ 1

Author(s):

Sarasij Das ◽

Nagendra Rao P S

Keyword(s):

Data Mining ◽

Power System ◽

System Level ◽

Data Sets ◽

Total System ◽

System Behavior ◽

Average System ◽

Recorded Data ◽

Operational Data ◽

Southern Regional

This paper is the outcome of an attempt to mine recorded power system operational data in order to gain new insight into practical power system behavior. Data mining, in general, is essentially finding new relations between data sets by analyzing well-known or recorded data. In this effort we make use of the recorded data of the Southern regional grid of India. Some interesting relations at the total system level between frequency, total MW/MVAr generation, and average system voltage have been obtained. The aim of this work is to highlight the potential of data mining for power system applications, and also some of the concerns that need to be addressed to make such efforts more useful.
