cluster quality
Recently Published Documents

TOTAL DOCUMENTS: 78 (FIVE YEARS: 28)
H-INDEX: 9 (FIVE YEARS: 1)

2022 ◽  
Vol 12 (1) ◽  
pp. 0-0

Centroid-based clustering algorithms depend on the number of clusters, the initial centroids, the distance measure, and the statistical approach to central tendencies. The initial centroid initialization algorithm determines convergence speed, computing efficiency, execution time, scalability, memory utilization, and overall performance for big data clustering. Various researchers have proposed cluster initialization techniques; some reduce the number of iterations at the cost of cluster quality, while others improve cluster quality at the cost of more iterations. For these reasons, this study proposes an initial centroid initialization method, the Maxmin Data Range Heuristic (MDRH), for K-Means (KM) clustering that reduces execution time and iterations while improving quality for big data clustering. The proposed MDRH method was compared against the classical KM and KM++ algorithms on four real datasets, achieving better effectiveness and efficiency on the RS, DB, CH, SC, IS, and CT quantitative measures.
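The abstract does not spell out the MDRH procedure itself; a minimal sketch of the classic maxmin (farthest-point) initialization family it belongs to, which is also the idea behind KM++-style seeding, looks like this (function and variable names are illustrative, not from the paper):

```python
import random

def maxmin_init(points, k, seed=0):
    """Maxmin (farthest-point) centroid initialization.

    The first centroid is chosen at random; each subsequent centroid is
    the point whose distance to its nearest already-chosen centroid is
    largest, spreading the seeds across the data range.
    """
    rng = random.Random(seed)
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # For every point, squared distance to the closest centroid so far.
        def d_nearest(p):
            return min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
        centroids.append(max(points, key=d_nearest))
    return centroids

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0)]
print(maxmin_init(points, 3))
```

Because each new seed maximizes its distance to the existing seeds, this kind of initialization tends to cut the number of K-Means iterations needed to converge, which is the trade-off the abstract discusses.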


Author(s):  
Gary Reyes ◽  
Laura Lanzarini ◽  
Waldo Hasperué ◽  
Aurelio F. Bariviera

Given the large volume of georeferenced information generated and stored by many types of devices, the study and improvement of techniques capable of operating on these data is an area of great interest. Analyzing vehicular trajectories to form clusters and identify emerging patterns is very useful for characterizing and analyzing transportation flows in cities. This paper presents a new trajectory clustering method capable of identifying clusters of vehicular sub-trajectories in various sectors of a city. The proposed method uses an auxiliary structure to determine the correct location of the centroid of each group of sub-trajectories throughout the adaptive process. The method was applied to three real databases and compared with other relevant methods, achieving satisfactory results and good cluster quality according to the Silhouette index.
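Several of the papers on this page report the Silhouette index as their quality measure. For reference, a minimal pure-Python version of the standard definition, s(p) = (b - a) / max(a, b), where a is the mean distance to a point's own cluster and b the mean distance to the nearest other cluster:

```python
def silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q is not p]
        if not own:            # singleton cluster: silhouette defined as 0
            scores.append(0.0)
            continue
        a = sum(dist(p, q) for q in own) / len(own)
        b = min(sum(dist(p, q) for q in clusters[m]) / len(clusters[m])
                for m in clusters if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(round(silhouette(pts, [0, 0, 1, 1]), 3))
```

Values near 1 indicate tight, well-separated clusters; values near 0 or below indicate overlapping or misassigned points.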


2021 ◽  
Vol 4 (2) ◽  
pp. 174-183
Author(s):  
Hadian Mandala Putra ◽  
◽  
Taufik Akbar ◽  
Ahwan Ahmadi ◽  
Muhammad Iman Darmawan ◽  
...  

Big data is a collection of data that is large and complex, consists of various data types, is obtained from various sources, and grows quickly. Among the problems that arise when processing big data are storage and access: it consists of various data types with high complexity that the relational model cannot handle. One technology that can solve the problem of storing and accessing big data is Hadoop, which stores and processes big data by distributing it into several partitions (data blocks). Problems arise when an analysis requires all of the scattered data as one entity, for example in data clustering. One alternative is to analyze the data in parallel where it resides, then perform a centralized analysis of the partial results. This study examines and analyzes two methods, K-Medoids with MapReduce and K-Modes without MapReduce. The dataset, describing cars, consists of 3.5 million rows (about 400 MB) distributed across a Hadoop cluster (consisting of more than one machine). Hadoop's MapReduce feature consists of two functions, map and reduce: the map function selects and emits key-value pairs, and the reduce function then combines all key-value pairs sharing a key across the map outputs. Cluster quality is evaluated using the Silhouette Coefficient. For the car dataset, the K-Medoids MapReduce algorithm gives a silhouette value of 0.99 with two clusters.
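The map/reduce pattern the abstract describes can be sketched in a few lines of in-memory Python; the record fields here are hypothetical, and a real Hadoop job would run the map phase on each data block in parallel before the shuffle:

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, mapper, reducer):
    """Minimal in-memory sketch of the MapReduce pattern."""
    # Map phase: each record may emit any number of (key, value) pairs.
    pairs = [kv for rec in records for kv in mapper(rec)]
    # Shuffle phase: group values by key.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: combine all values that share a key.
    return {k: reducer(k, [v for _, v in grp])
            for k, grp in groupby(pairs, key=itemgetter(0))}

# Toy example: count cars per manufacturer (hypothetical field name).
cars = [{"make": "ford"}, {"make": "bmw"}, {"make": "ford"}]
counts = map_reduce(cars,
                    mapper=lambda rec: [(rec["make"], 1)],
                    reducer=lambda k, vals: sum(vals))
print(counts)  # {'bmw': 1, 'ford': 2}
```

The same shape carries over to clustering: mappers can assign each point of a data block to its nearest medoid, and reducers aggregate per-cluster statistics.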


2021 ◽  
pp. 008117502110142
Author(s):  
Matthias Studer

In this article, the author proposes a methodology for validating sequence analysis typologies using parametric bootstraps, following the framework proposed by Hennig and Lin (2015). The method compares the cluster quality of an observed typology with the quality obtained by clustering similar but nonclustered data. The author proposes several models to test the different structuring aspects of sequences that matter in life-course research, namely sequencing, timing, and duration. This strategy identifies the key structural aspects captured by the observed typology. The usefulness of the proposed methodology is illustrated through an analysis of professional and coresidence trajectories in Switzerland. The methodology is available in the WeightedCluster R library.
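The core of such a parametric bootstrap test can be sketched generically: compute a quality statistic on the observed clustering, then compare it with the statistics obtained by clustering data simulated from a null model without cluster structure. The quality statistic and null model below are deliberately toy stand-ins, not the sequence-specific models from the article:

```python
import random

def bootstrap_validate(observed_quality, simulate_null, cluster_and_score,
                       b=100, seed=1):
    """Parametric bootstrap in the spirit of Hennig & Lin (2015):
    how often does nonclustered null data look at least as 'clustered'
    as the observed data?"""
    rng = random.Random(seed)
    null_scores = [cluster_and_score(simulate_null(rng)) for _ in range(b)]
    p = sum(s >= observed_quality for s in null_scores) / b
    return p, null_scores

# Toy quality statistic: gap between the two halves of sorted 1-D data,
# a crude two-cluster separation score.
def gap_quality(xs):
    xs = sorted(xs)
    mid = len(xs) // 2
    return xs[mid] - xs[mid - 1]

observed = [0.1, 0.2, 0.3, 5.1, 5.2, 5.3]                  # clearly two groups
null = lambda rng: [rng.uniform(0, 5.5) for _ in range(6)]  # unstructured null
p, _ = bootstrap_validate(gap_quality(observed), null, gap_quality)
print(p)  # a small p suggests the observed structure is real
```

In the article's setting, `simulate_null` would generate sequences that preserve some aspects (e.g., state distributions) while destroying others (sequencing, timing, or duration), which is what lets the test isolate which aspect the typology captures.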


2021 ◽  
Author(s):  
Shakira Banu Kaleel

Social media data carries abundant hidden occurrences of real-time events, which raises the demand for efficient event detection and trending systems. The Locality Sensitive Hashing (LSH) technique is capable of processing large-scale datasets. In this thesis, a novel framework is proposed for detecting and trending events from tweet clusters present in a Twitter dataset, discovered using LSH. Experimental results showed that LSH took only 12.99% of the running time required by K-means to find all of the tweet clusters. Key challenges include: 1) constructing a dictionary using incremental TF-IDF over high-dimensional data to create tweet feature vectors; 2) leveraging LSH to find truly interesting events; 3) trending the behavior of events based on time, geo-location, and cluster size; and 4) speeding up the cluster-discovery process while retaining cluster quality.
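The thesis does not detail its LSH variant in the abstract; for TF-IDF tweet vectors compared by cosine similarity, the standard choice is random-hyperplane LSH, sketched below (parameters such as the number of planes are illustrative):

```python
import random

def hyperplane_hashes(vec, planes):
    """Signature bits: which side of each random hyperplane vec falls on."""
    return tuple(int(sum(v * p for v, p in zip(vec, plane)) >= 0)
                 for plane in planes)

def lsh_buckets(vectors, n_planes=8, seed=3):
    """Random-hyperplane LSH for cosine similarity: vectors with a small
    angle between them tend to share a signature and land in the same
    bucket, so only bucket-mates need exact comparison."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
    buckets = {}
    for i, v in enumerate(vectors):
        buckets.setdefault(hyperplane_hashes(v, planes), []).append(i)
    return buckets

vecs = [(1.0, 0.1), (0.9, 0.12), (-1.0, 0.1)]   # first two nearly parallel
buckets = lsh_buckets(vecs)
```

This is what makes LSH so much faster than K-means here: candidate pairs are found by hashing rather than by comparing every tweet vector with every centroid on every iteration.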



2021 ◽  
Vol 6 (2) ◽  
pp. 70-77
Author(s):  
Fatimah Defina Setiti Alhamdani ◽  
Ananda Ayu Dianti ◽  
Yufis Azhar

A credit card is one of the payment media banks provide for conducting transactions. Credit cards generate revenue for banks through the interest cardholders pay, but also cause losses when cardholders fail to pay their credit card bills. To support such lending decisions, a cluster model is needed. This study proposes a segmentation system for credit card users to inform marketing strategies, using the K-Means clustering method and experimenting with four methods: K-Means, Agglomerative Clustering, GMM, and DBSCAN. Clustering was performed on data from 9,000 active credit card users at banks, described by 18 characteristic features. The cluster quality obtained with the K-Means method is 0.207014 with 3 clusters. Based on the results of the four methods, the best method for this case is K-Means.
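Since K-Means is the winning method here, a compact sketch of plain Lloyd's algorithm may help; the data below is synthetic and two-dimensional rather than the paper's 18-feature customer records:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = {i: [] for i in range(k)}
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            groups[i].append(p)
        new = [tuple(sum(col) / len(col) for col in zip(*g)) if g else centroids[i]
               for i, g in groups.items()]
        if new == centroids:   # converged: assignments can no longer change
            break
        centroids = new
    labels = [min(range(k),
                  key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
              for p in points]
    return centroids, labels

pts = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]
cents, labels = kmeans(pts, 2)
```

A segmentation study like this one would run such a loop for several values of k and keep the k with the best silhouette score.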


Author(s):  
Ch. Raja Ramesh, Et. al.

A group of data objects classified as similar objects is known as a cluster. Clustering is the process of finding homogeneous data items such as patterns and documents and grouping them together; other groups may contain dissimilar items. Most clustering methods are either crisp or fuzzy, and member allocation to the respective clusters is strictly based on similarity measures and membership functions. Both kinds of methods have limitations in terms of membership: one strictly decides that a sample must belong to a single cluster, while the other is fuzzy, i.e., probabilistic. Finally, measures like Quality and Purity are applied to understand how well clusters were created. But there is a grey area in between: "boundary points" and "moderately far" points from the cluster centre. We took cluster quality [18], processing time, and relevant feature identification as the basis for our problem statement and implemented zone-based clustering using the MapReduce concept. We implemented a process that finds far points across the clusters and generates a new cluster from them, repeating this until the number of clusters stabilizes. This process improves both cluster quality and processing time.
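One step of the far-point idea described above might look like the following sketch; the `zone_factor` threshold is an assumed knob, not a parameter from the paper:

```python
def split_far_zone(points, centroids, zone_factor=2.0):
    """Points whose distance to their nearest centroid exceeds
    zone_factor times the average such distance are treated as the
    'far zone' and split off into a new cluster of their own."""
    def d(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5
    nearest = [min(d(p, c) for c in centroids) for p in points]
    avg = sum(nearest) / len(nearest)
    far = [p for p, dist in zip(points, nearest) if dist > zone_factor * avg]
    if not far:
        return centroids               # stable: no far zone left
    new_centroid = tuple(sum(col) / len(col) for col in zip(*far))
    return centroids + [new_centroid]

pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
cents = split_far_zone(pts, [(0.3, 0.3)])
print(len(cents))  # 2
```

Iterating this step until `split_far_zone` returns its input unchanged matches the paper's "repeat until the cluster quantity is stabilized" loop, and the per-point distance computation is exactly the part that parallelizes naturally under MapReduce.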


Author(s):  
Pratama Ryan Harnanda ◽  
Natalia Damastuti ◽  
Tresna Maulana Fahrudin

The blood needs of PMI (Indonesian Red Cross) in the Surabaya City area are sometimes erratic; the problem arises because blood demand keeps increasing while the blood supply runs low. The main objective of this research was to apply data mining to cluster the blood donor data at the UTD-PMI Surabaya City Center, both to distinguish potential from non-potential donors and to visualize the pattern of donor distribution in a Geographic Information System (GIS). Agglomerative Hierarchical Clustering was applied to the records of 8,757 donors. The experiment showed that cluster quality was quite good, reaching a Silhouette Coefficient of 0.6065410. One interesting finding is that private male employees with blood type O who live in the eastern part of Surabaya City are the most potential donors.
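The agglomerative procedure itself is simple to sketch; the single-linkage choice below is an assumption, since the abstract does not state which linkage the authors used:

```python
def agglomerative(points, k):
    """Single-linkage agglomerative clustering: start with every point in
    its own cluster and repeatedly merge the two closest clusters until
    k clusters remain."""
    def d(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Pair of clusters with the smallest single-linkage distance,
        # i.e. the smallest distance between any two of their members.
        i, j = min(((i, j)
                    for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: min(d(p, q)
                                      for p in clusters[ij[0]]
                                      for q in clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
print(agglomerative(pts, 3))
```

This naive version is cubic in the number of points, so a study at the scale of 8,757 donors would in practice use an optimized library implementation rather than this loop.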


2020 ◽  
pp. 1-20
Author(s):  
Shreya Chandrasekharan ◽  
Mariam Zaka ◽  
Stephen Gallo ◽  
Wenxi Zhao ◽  
Dmitriy Korobskiy ◽  
...  

Understanding the nature and organization of scientific communities is of broad interest. The “Invisible College” is a historical metaphor for one such type of community that refers to a small group of scientists working on a problem of common interest. The scientific and social behavior of such colleges has been the subject of case studies that have examined limited samples of the scientific enterprise. We introduce a meta-method for large-scale discovery that consists of a pipeline to select themed article clusters, whose authors can then be analyzed. A sample of article clusters produced by this pipeline was reviewed by experts, who inferred significant thematic relatedness within clusters, suggesting that authors linked to such clusters may represent valid communities of practice. We explore properties of the author communities identified by our pipeline, and the publication and citation practices of both typical and highly influential authors. Our study reveals that popular domain-independent criteria for graphical cluster quality must be carefully interpreted in the context of searching for author communities, and also suggests a role for contextual criteria.
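One common domain-independent criterion for graphical cluster quality of the kind the study cautions about is conductance; the paper does not say it used this particular measure, but a sketch shows the flavor:

```python
def conductance(edges, cluster):
    """Conductance of a node set in an undirected graph: edges leaving
    the cluster, divided by the smaller of the cluster's and its
    complement's total degree. Lower values indicate a better-separated
    cluster by this purely structural criterion."""
    cluster = set(cluster)
    cut = sum(1 for u, v in edges if (u in cluster) != (v in cluster))
    vol_in = sum((u in cluster) + (v in cluster) for u, v in edges)
    vol_out = sum((u not in cluster) + (v not in cluster) for u, v in edges)
    return cut / min(vol_in, vol_out)

# Toy co-authorship graph: two tight triangles joined by one bridge edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
print(conductance(edges, {"a", "b", "c"}))  # 1/7 ≈ 0.143
```

The study's point is precisely that a low score on such a structural criterion does not by itself establish that the authors behind a cluster form a genuine community of practice; thematic review is still needed.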

