KMEANS Algorithm Clustering for Massive AIS Data Based on the Spark Platform

Clustering is a data mining technique that groups a data set into clusters of similar items. One of the most commonly used clustering algorithms is K-Means. However, K-Means has several weaknesses, one of which is the random selection of the initial centroids, so the clustering result can be inconsistent even when the algorithm is run on exactly the same data. The Modified K-Means algorithm focuses on the selection of the initial centroids to overcome this inconsistency. The test was conducted using the sentipol dataset and focused only on comment data. The number of clusters was set to 3, based on the number of existing comment labels (positive, negative, and neutral). The test results show that the Modified K-Means algorithm produces a better purity value than the K-Means algorithm: an average purity of 0.42 versus 0.391. In addition, in a test of the random factor repeated 5 times with the same attributes and test data, the clusters produced by the Modified K-Means algorithm did not change, so the resulting purity value was also the same each time. With the K-Means algorithm, by contrast, the cluster results changed in every test, so the purity value is also likely to change.
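The purity measure reported in this abstract can be illustrated with a minimal sketch. It assumes the usual definition of purity (the fraction of items that carry the majority label of their cluster); the function name `purity` and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def purity(cluster_ids, true_labels):
    """Return the fraction of items that carry the majority label of their cluster.

    Assumes the standard purity definition; the paper's exact formula is not
    given in the abstract.
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("inputs must be non-empty")

    # Count how often each true label occurs inside each cluster.
    label_counts = {}
    for cluster, label in zip(cluster_ids, true_labels):
        label_counts.setdefault(cluster, Counter())[label] += 1

    # Sum the size of the majority label in every cluster.
    majority_total = sum(max(counts.values()) for counts in label_counts.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: 8 comments, 3 clusters, 3 sentiment labels.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
labels = ["positive", "positive", "negative",
          "negative", "negative",
          "neutral", "neutral", "positive"]
score = purity(clusters, labels)
print(score)
```

Each cluster in the toy data has a majority label covering 2 of its items, so 6 of the 8 items count toward purity. Because purity depends only on the final cluster assignment, an initialization that always yields the same clusters will always yield the same purity, which matches the behaviour the abstract reports for Modified K-Means.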


A Method to Enhance the Remote Sensing Images Based on the Local Approach Using KMeans Algorithm

Advances in Information and Communication Technology - Advances in Intelligent Systems and Computing ◽ 10.1007/978-3-319-49073-1_7 ◽ 2016 ◽ pp. 41-52

Author(s): Trung Nguyen Tu ◽ Duc Dang Van ◽ Huy Ngo Hoang ◽ Thoa Vu Van

Keyword(s): Remote Sensing ◽ Remote Sensing Images ◽ Local Approach ◽ Kmeans Algorithm


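Optimasi Proses Klasterisasi di MySQL DBMS dengan Mengintegrasikan Algoritme MIC-Kmeans Menggunakan Bahasa SQL dalam Stored Procedure (Optimizing the Clustering Process in the MySQL DBMS by Integrating the MIC-Kmeans Algorithm Using SQL in a Stored Procedure)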

Jurnal Teknologi Informasi dan Ilmu Komputer ◽ 10.25126/jtiik.2020702639 ◽ 2020 ◽ Vol 7 (2) ◽ pp. 391

Author(s): Issa Arwani

Keyword(s): Clustering Algorithms ◽ Optimal Solution ◽ Test Results ◽ Stored Procedure ◽ Average Value ◽ Silhouette Coefficient ◽ Algorithm Mapping ◽ Kmeans Algorithm ◽ Time Required ◽ Mapping Result

Clustering data held in a DBMS is more efficient when it is performed directly inside the DBMS, because the DBMS supports good data management. SQL-Kmeans is a method that has previously been used to integrate the K-means clustering algorithm into a DBMS using SQL. However, this method inherits the weaknesses of the K-means algorithm itself: the length of iteration needed to reach convergence and clustering accuracy that is not yet optimal, both caused by the random initialization of the initial centroids. The Median Initial Centroid (MIC)-Kmeans algorithm is a development of the K-means algorithm that can provide an optimal solution for determining the initial centroids, which affects both the accuracy and the duration of iterations. Because of these advantages, this study chose MIC-Kmeans as the alternative algorithm to integrate into clustering performed directly in the DBMS using SQL. The integration process has 4 stages: initializing the dataset table, mapping the MIC-Kmeans algorithm onto SQL and the dataset table, designing the SQL for each mapping result, and implementing the SQL design in a MySQL stored procedure. The test results show that the SQL MIC-Kmeans method reduces the number of iterations by 43% and the time required to reach convergence by 39% compared with the SQL-Kmeans method. In addition, the average silhouette coefficient of the SQL MIC-Kmeans method is 0.79, which falls in the strong structure category (values from 0.7 to 1), whereas the average silhouette coefficient of the SQL-Kmeans method is 0.68, which falls in the medium structure category (values from 0.5 to 0.7).
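The silhouette figures quoted above can be made concrete with a small sketch. It assumes the standard silhouette definition with Euclidean distance, which the abstract does not spell out; the names `mean_silhouette` and `structure_category` and the toy points are hypothetical. Only the strong (0.7 to 1) and medium (0.5 to 0.7) ranges come from the abstract; assigning exactly 0.7 to the medium category and the wording for lower values are assumptions.

```python
import math


def mean_silhouette(points, cluster_ids):
    """Average silhouette coefficient over all points (Euclidean distance).

    For each point: a = mean distance to the other points in its own cluster,
    b = smallest mean distance to the points of any other cluster,
    s = (b - a) / max(a, b). Points in singleton clusters contribute 0.
    """
    clusters = {}
    for idx, cluster in enumerate(cluster_ids):
        clusters.setdefault(cluster, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, point in enumerate(points):
        own = clusters[cluster_ids[i]]
        if len(own) == 1:
            continue  # singleton cluster: s = 0 by convention
        a = sum(math.dist(point, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(math.dist(point, points[j]) for j in members) / len(members)
            for cluster, members in clusters.items()
            if cluster != cluster_ids[i]
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def structure_category(score):
    """Map an average silhouette value to the categories quoted in the abstract."""
    if score > 0.7:
        return "strong structure"
    if score > 0.5:
        return "medium structure"
    # Ranges below 0.5 are not described in the abstract.
    return "below the ranges quoted in the abstract"


# Hypothetical toy data: two well-separated pairs of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
assignment = [0, 0, 1, 1]
score = mean_silhouette(points, assignment)
print(round(score, 3), structure_category(score))
print(structure_category(0.79), "|", structure_category(0.68))
```

Applied to the reported averages, the helper places 0.79 in the strong structure category and 0.68 in the medium structure category, as stated in the abstract.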
