KMEANS Algorithm Clustering for Massive AIS Data Based on the Spark Platform

Author(s):  
Xiumin Chu ◽  
Jinyu Lei ◽  
Xinglong Liu ◽  
Zhiyuan Wang
Author(s):  
Ruddy Cahyanto ◽  
Antonius Rachmat Chrismanto ◽  
Danny Sebastian

Clustering is a data mining technique that groups a data set into clusters of similar items. One of the most commonly used clustering algorithms is K-Means. However, K-Means has several weaknesses; one is the random selection of initial centroids, which makes the clustering result inconsistent even when the algorithm is run on exactly the same data. The Modified K-Means algorithm focuses on selecting the initial centroids so as to overcome this inconsistency. Testing was conducted on the sentipol dataset and focused only on comment data. The number of clusters was set to 3, matching the number of existing comment labels (positive, negative, and neutral). The test results show that Modified K-Means produces a better purity value than K-Means: an average purity of 0.42, compared with 0.391 for K-Means. In a test of the random factor, repeated 5 times with the same attributes and test data, the clusters produced by Modified K-Means did not change, so the resulting purity value was also the same each time. With K-Means, by contrast, the cluster results changed in every test, so the purity value is also likely to change.
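The purity measure reported in this abstract can be illustrated with a minimal sketch. It assumes the usual definition of purity (each cluster is credited with its most frequent true label, and the credited counts are summed and divided by the number of items); the abstract does not spell out the formula. The function name `purity` and the toy data are hypothetical and are not taken from the sentipol dataset.

```python
from collections import Counter


def purity(cluster_ids, labels):
    """Return clustering purity under the standard definition (an assumption here).

    Each cluster contributes the count of its most frequent true label;
    the contributions are summed and divided by the number of items.
    """
    if len(cluster_ids) != len(labels):
        raise ValueError("cluster_ids and labels must have the same length")
    if not labels:
        raise ValueError("at least one item is required")

    # Count how often each true label occurs inside each cluster.
    label_counts = {}
    for cluster, label in zip(cluster_ids, labels):
        label_counts.setdefault(cluster, Counter())[label] += 1

    # Credit each cluster with its majority label.
    majority_total = sum(max(counts.values()) for counts in label_counts.values())
    return majority_total / len(labels)


# Toy example with 3 clusters and 3 labels (made-up data).
toy_clusters = [0, 0, 0, 1, 1, 2, 2, 2]
toy_labels = ["positive", "positive", "negative",
              "negative", "negative",
              "neutral", "neutral", "positive"]
toy_purity = purity(toy_clusters, toy_labels)
```

In the toy data each cluster has a majority label covering 2 of its items, so 6 of the 8 items are credited. Because purity depends only on the final assignment, a clustering that is identical across runs necessarily yields the same purity each time, which matches the behaviour the abstract reports for Modified K-Means.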


2020 ◽  
Vol 7 (2) ◽  
pp. 391
Author(s):  
Issa Arwani

<p class="Judul2"><strong><em>Abstract</em></strong></p><p class="Judul2"><em>Clustering data held in a DBMS is more efficient when it is performed directly inside the DBMS, because the DBMS already provides good data management. SQL-Kmeans is a method that has previously been used to integrate the K-means clustering algorithm into a DBMS using SQL. However, it inherits the weaknesses of the K-means algorithm itself: the number of iterations needed to reach convergence is large, and the clustering accuracy is not optimal, because the initial centroids are chosen at random. The Median Initial Centroid (MIC)-Kmeans algorithm is a development of K-means that can provide an optimal solution for determining the initial centroids, which affects both accuracy and the number of iterations. Because of these advantages, MIC-Kmeans was chosen in this study as the alternative algorithm to be integrated into data clustering performed directly in the DBMS using SQL. The integration process has 4 stages: initializing the dataset table, mapping the MIC-Kmeans algorithm onto SQL and the dataset table, designing the SQL for each mapping result, and implementing the SQL design in a MySQL stored procedure. The test results show that the SQL MIC-Kmeans method reduces the number of iterations by 43% and the time required to reach convergence by 39% compared with the SQL-Kmeans method. In addition, the average silhouette coefficient of the SQL MIC-Kmeans method is 0.79, which falls in the strong structure category (values from 0.7 to 1), whereas the average silhouette coefficient of the SQL-Kmeans method is 0.68, which falls in the medium structure category (values from 0.5 to 0.7).</em></p>
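The silhouette coefficient used to compare the two methods can be sketched as follows. This assumes the standard definition with Euclidean distance: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the score is (b - a) / max(a, b). The abstract does not state which distance was used. The names `mean_silhouette` and `structure_category` are hypothetical, the category boundaries are the two ranges quoted in the abstract, and how the shared boundary value 0.7 is assigned is an assumption.

```python
import math


def mean_silhouette(points, labels):
    """Average silhouette coefficient with Euclidean distance (an assumption)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself is included in the sum).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def structure_category(score):
    """Map a score to the two ranges quoted in the abstract.

    Assigning exactly 0.7 to the medium range is an assumption.
    """
    if score > 0.7:
        return "strong structure"
    if score > 0.5:
        return "medium structure"
    return "outside the ranges quoted in the abstract"


# Two well-separated toy clusters (made-up data).
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
toy_labels = [0, 0, 1, 1]
toy_score = mean_silhouette(toy_points, toy_labels)
```

Applied to the averages reported above, `structure_category(0.79)` gives the strong category and `structure_category(0.68)` the medium category, in line with the abstract.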

