An efficient trajectory-clustering algorithm based on an index tree

With the development of location-based services, such as the Global Positioning System and Radio Frequency Identification, a great deal of trajectory data can be collected. How to mine knowledge from these data has therefore become an attractive topic. In this paper, we propose an efficient trajectory-clustering algorithm based on an index tree. First, an index tree is proposed to store trajectories and their similarity matrix, with which trajectories can be retrieved efficiently. Second, a new concept of trajectory structure is introduced to analyse both the internal and external features of trajectories. Third, trajectories are partitioned into trajectory segments according to their corners. Fourth, the similarity between every pair of trajectory segments is computed with a structural similarity function. Finally, trajectory segments are grouped into clusters according to their location in the different levels of the index tree. Experimental results on real data sets demonstrate the efficiency and effectiveness of our algorithm, as well as its flexibility: feature sensitivity can be adjusted through parameters, and the clustering results are more practically significant.
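The corner-based partitioning step can be illustrated with a minimal sketch. The abstract does not define how a corner is detected, so the turning-angle test, the 30-degree threshold, and the function names below are assumptions made for illustration, not the authors' method.

```python
import math

def turning_angle(p, q, r):
    """Angle in degrees between the heading p->q and the heading q->r."""
    a1 = math.atan2(q[1] - p[1], q[0] - p[0])
    a2 = math.atan2(r[1] - q[1], r[0] - q[0])
    diff = abs(a2 - a1) % (2 * math.pi)
    return math.degrees(min(diff, 2 * math.pi - diff))

def partition_at_corners(points, angle_threshold_deg=30.0):
    """Split a polyline into segments wherever the heading turns sharply.

    The threshold is a hypothetical tuning parameter.
    """
    segments, start = [], 0
    for i in range(1, len(points) - 1):
        if turning_angle(points[i - 1], points[i], points[i + 1]) > angle_threshold_deg:
            segments.append(points[start:i + 1])  # the corner point closes this segment
            start = i                             # and opens the next one
    segments.append(points[start:])
    return segments

# An L-shaped trajectory with a single 90-degree corner at (2, 0).
trajectory = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
segments = partition_at_corners(trajectory)
```

Lowering the threshold makes the partition more sensitive to small heading changes, which is one way a parameter could adjust feature sensitivity.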

Self-Adaptive K-Means Based on a Covering Algorithm

Complexity ◽

10.1155/2018/7698274 ◽

2018 ◽

Vol 2018 ◽

pp. 1-16 ◽

Cited By ~ 1

Author(s):

Yiwen Zhang ◽

Yuanyuan Zhou ◽

Xing Guo ◽

Jintao Wu ◽

Qiang He ◽

...

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

Real Data ◽

Second Phase ◽

Data Sets ◽

Number Of Clusters ◽

Large Scale Data ◽

Long Time ◽

Two Phases ◽

Selection Of

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has been studied by researchers in numerous fields for a long time. However, the value of the clustering number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm not only produces efficient and accurate clustering results but also self-adaptively provides a reasonable number of clusters based on the data features. It includes two phases: the initialization by the covering algorithm (CA) and the Lloyd iteration of K-means. The first phase executes the CA. The CA self-organizes and recognizes the number of clusters k based on the similarities in the data, and it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. Therefore, it has a “blind” feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm combines the advantages of CA and K-means. Experiments are carried out on the Spark platform, and the results verify the good scalability of the C-K-means algorithm. This algorithm can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the accuracy and efficiency of the C-K-means algorithm exceed those of the existing algorithms under both sequential and parallel conditions.
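The second phase, the Lloyd iteration, can be sketched as follows. The abstract does not describe the covering algorithm, so the initial centers here are hard-coded stand-ins for its output; the function name `lloyd` and the sample data are illustrative assumptions.

```python
def lloyd(points, centers, max_iter=100):
    """Plain Lloyd iteration: assign points to the nearest center, then recompute centers."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])),
            )
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else old
            for members, old in zip(clusters, centers)
        ]
        if new_centers == centers:  # converged: assignments no longer move the centers
            break
        centers = new_centers
    return centers, clusters

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
initial_centers = [(0.0, 0.0), (10.0, 10.0)]  # stand-in for the covering phase output
centers, clusters = lloyd(points, initial_centers)
```

Because the number of centers passed in fixes k, a first phase that chooses both k and the centers removes the need to preselect either.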

Clustering Mashups by Integrating Structural and Semantic Similarities Using Fuzzy AHP

International Journal of Web Services Research ◽

10.4018/ijwsr.2021010103 ◽

2021 ◽

Vol 18 (1) ◽

pp. 34-57

Author(s):

Weifeng Pan ◽

Xinxin Xu ◽

Hua Ming ◽

Carl K. Chang

Keyword(s):

Semantic Similarity ◽

Clustering Algorithm ◽

Latent Dirichlet Allocation ◽

Fuzzy AHP ◽

Real Data ◽

Structural Similarity ◽

Analytic Hierarchy ◽

Data Set ◽

Novel Approach ◽

Hierarchy Process

Mashup technology has become a promising way to develop and deliver applications on the web. Automatically organizing Mashups into functionally similar clusters helps improve the performance of Mashup discovery. Although many approaches aim to cluster Mashups, they focus solely on semantic similarities to guide the clustering process and are unable to use both the structural and semantic information in Mashup profiles. In this paper, a novel approach to clustering Mashups is proposed, which integrates structural similarity and semantic similarity using fuzzy AHP (fuzzy analytic hierarchy process). The structural similarity is computed from usage histories between Mashups and Web APIs using the SimRank algorithm. The semantic similarity is computed from the descriptions and tags of Mashups using LDA (latent Dirichlet allocation). A clustering algorithm based on the genetic algorithm is employed to cluster Mashups. Comprehensive experiments are performed on a real data set collected from ProgrammableWeb. The results show the effectiveness of the approach when compared with two kinds of conventional approaches.
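The structural-similarity step can be illustrated with a basic SimRank iteration on a small usage graph. This is a sketch under assumptions: the graph is treated as undirected (Mashups linked to the Web APIs they use), the decay factor of 0.8 and the node names are illustrative, and the fuzzy AHP integration with the semantic similarity is not shown.

```python
def simrank(neighbors, decay=0.8, iterations=10):
    """Basic SimRank: two nodes are similar if their neighbors are similar."""
    nodes = list(neighbors)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                elif not neighbors[a] or not neighbors[b]:
                    new[a][b] = 0.0
                else:
                    total = sum(sim[i][j] for i in neighbors[a] for j in neighbors[b])
                    new[a][b] = decay * total / (len(neighbors[a]) * len(neighbors[b]))
        sim = new
    return sim

# Hypothetical usage history: m1 and m2 use the same two APIs, m3 uses a different one.
usage = {
    "m1": ["api1", "api2"], "m2": ["api1", "api2"], "m3": ["api3"],
    "api1": ["m1", "m2"], "api2": ["m1", "m2"], "api3": ["m3"],
}
sim = simrank(usage)
```

In this toy graph `m1` and `m2` receive a positive similarity because they share APIs, while `m1` and `m3` stay at zero because no chain of shared usage connects them.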

A new stochastic gradient descent possibilistic clustering algorithm

AI Communications ◽

10.3233/aic-210125 ◽

2021 ◽

pp. 1-18

Author(s):

Angeliki Koutsimpela ◽

Konstantinos D. Koutroumbas

Keyword(s):

Cost Function ◽

Gradient Descent ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Real Data ◽

Stochastic Gradient ◽

Stochastic Gradient Descent ◽

Data Sets ◽

Convergence Results ◽

Possibilistic Clustering

Several well-known clustering algorithms have their own online counterparts, in order to deal effectively with the big data issue, as well as with the case where the data become available in a streaming fashion. However, very few of them follow the stochastic gradient descent philosophy, despite the fact that the latter enjoys certain practical advantages (such as the possibility of (a) running faster than the batch processing counterparts and (b) escaping from local minima of the associated cost function), while, in addition, strong theoretical convergence results have been established for it. In this paper, a novel stochastic gradient descent possibilistic clustering algorithm, called O-PCM2, is introduced. The algorithm is presented in detail, and it is rigorously proved that the gradient of the associated cost function tends to zero in the L2 sense, based on general convergence results established for the family of stochastic gradient descent algorithms. Furthermore, an additional discussion is provided on the nature of the points to which the algorithm may converge. Finally, the performance of the proposed algorithm is tested against other related algorithms, on the basis of both synthetic and real data sets.
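The general idea of an online, stochastic-gradient update for possibilistic clustering can be sketched as below. This is not the algorithm of the paper: the exponential membership form, the fixed scale parameters `etas`, and the constant learning rate are assumptions chosen to keep the example small.

```python
import math

def sgd_possibilistic_step(x, centers, etas, lr=0.1):
    """Move every center toward the sample x, scaled by its membership degree.

    Assumed membership: u = exp(-||x - theta||^2 / eta). Each cluster is
    updated independently, as is typical of possibilistic clustering.
    """
    updated = []
    for theta, eta in zip(centers, etas):
        d2 = sum((xi - ti) ** 2 for xi, ti in zip(x, theta))
        u = math.exp(-d2 / eta)
        updated.append(tuple(ti + lr * u * (xi - ti) for xi, ti in zip(x, theta)))
    return updated

centers = [(0.0, 0.0), (5.0, 5.0)]
etas = [1.0, 1.0]
for sample in [(1.0, 1.0)] * 50:  # a toy stream that repeats one point
    centers = sgd_possibilistic_step(sample, centers, etas)
```

The nearby center drifts toward the streamed point, while the distant center barely moves because its membership degree is close to zero.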

Extended Classification Course Improves Road Intersection Detection from Low-Frequency GPS Trajectory Data

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi9030181 ◽

2020 ◽

Vol 9 (3) ◽

pp. 181

Author(s):

Banqiao Chen ◽

Chibiao Ding ◽

Wenjuan Ren ◽

Guangluan Xu

Keyword(s):

Clustering Algorithm ◽

Recall Rate ◽

Location Based Services ◽

Trajectory Data ◽

Intersection Detection ◽

Map Generation ◽

GPS Trajectory Data ◽

Road Intersections ◽

Road Maps ◽

GPS Trajectory

The requirements of location-based services have generated an increasing need for up-to-date digital road maps. However, traditional methods are expensive and time-consuming, requiring many skilled operators. The feasibility of using massive GPS trajectory data provides a cheap and quick means for generating and updating road maps. The detection of road intersections, being the critical component of a road map, is a key problem in map generation. Unfortunately, low sampling rates and high disparities are ubiquitous among floating car data (FCD), making road intersection detection from such GPS trajectories very challenging. In this paper, we extend a point clustering-based road intersection detection framework to include a post-classification course, which utilizes the geometric features of road intersections. First, we propose a novel turn-point position compensation algorithm, in order to improve the concentration of selected turn-points under low sampling rates. The initial detection results given by the clustering algorithm are recall-focused. Then, we rule out false detections in an extended classification course based on an image thinning algorithm. The detection results of the proposed method are quantitatively evaluated by matching with intersections from OpenStreetMap using a variety of distance thresholds. Compared with other methods, our approach can achieve a much higher recall rate and better overall performance, thereby better supporting map generation and other similar applications.
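The evaluation step, matching detected intersections against reference intersections under a distance threshold, can be sketched as follows. The greedy nearest-neighbor matching, the planar coordinates in meters, and the sample points are assumptions for illustration; the abstract does not specify the matching procedure.

```python
import math

def match_intersections(detected, reference, threshold):
    """Greedily match each detection to the nearest unused reference point.

    Returns (precision, recall). Coordinates are assumed to be planar,
    in the same unit as the threshold.
    """
    unmatched = list(reference)
    true_positives = 0
    for d in detected:
        best = min(unmatched, key=lambda r: math.dist(d, r), default=None)
        if best is not None and math.dist(d, best) <= threshold:
            unmatched.remove(best)  # each reference point can be matched only once
            true_positives += 1
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

detected = [(0, 0), (100, 0), (500, 500)]
reference = [(3, 4), (100, 10), (900, 900), (200, 200)]
precision, recall = match_intersections(detected, reference, threshold=20)
```

Repeating the call with several thresholds gives the kind of threshold-dependent precision and recall curve that such an evaluation reports.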

Algorithm to forming a rule base for a fuzzy classifier designed on the basis of the K-means clustering algorithm and the whale optimization algorithm

10.21293/1818-0442-2021-24-1-42-47 ◽

2021 ◽

Vol 24 (1) ◽

pp. 42-47

Author(s):

N. P. Koryshev ◽

I. A. Hodashinsky ◽

Keyword(s):

Data Clustering ◽

Clustering Algorithm ◽

Performance Testing ◽

Real Data ◽

Rule Base ◽

Data Sets ◽

Fuzzy Classifier ◽

Whale Optimization ◽

Clustering Quality ◽

Using Data

The article presents a description of the algorithm for generating fuzzy rules for a fuzzy classifier using data clustering, metaheuristic, and the clustering quality index, as well as the results of performance testing on real data sets.

A new approach to the fuzzy c-means clustering algorithm by automatic weights and local clustering

10.24271/psr.18 ◽

2021 ◽

Vol 3 (1) ◽

pp. 1-7

Author(s):

Yadgar Sirwan Abdulrahman

Keyword(s):

Clustering Algorithm ◽

Similarity Criterion ◽

Real Data ◽

Well Being ◽

Classical Solutions ◽

Data Sets ◽

Data Set ◽

New Approach ◽

Fuzzy C Means Clustering ◽

Global And Local

Clustering is one of the essential strategies in data analysis. In classical solutions, all features are assumed to contribute equally to the data clustering. In real data sets, however, some features are more important than others. As a result, essential features have a more significant impact on identifying optimal clusters than other features. In this article, a fuzzy clustering algorithm with local automatic weighting is presented. The proposed algorithm has several advantages: 1) the feature weights are local, meaning that each cluster's weights differ from those of the other clusters; 2) the distance between samples is calculated using a non-Euclidean similarity criterion to reduce the effect of noise; 3) the feature weights are obtained comparatively during the learning process. In this study, mathematical analyses were carried out to obtain the clustering centers' well-being and the features' weights. Experiments were carried out on a range of data sets to demonstrate the proposed algorithm's efficiency compared with other algorithms that use global and local features.
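The effect of cluster-local feature weights on fuzzy memberships can be illustrated with a small sketch. It uses the standard fuzzy c-means membership formula with a weighted squared Euclidean distance as a stand-in; the paper's non-Euclidean criterion and its weight-update rule are not given in the abstract, so the weights below are fixed, hypothetical values.

```python
def weighted_memberships(x, centers, weights, m=2.0):
    """Fuzzy c-means memberships of x, with one feature-weight vector per cluster."""
    d2 = [
        sum(w * (xf - cf) ** 2 for w, xf, cf in zip(ws, x, c))
        for c, ws in zip(centers, weights)
    ]
    if any(d == 0.0 for d in d2):
        # x coincides with at least one center: share full membership among those.
        hits = [1.0 if d == 0.0 else 0.0 for d in d2]
        return [h / sum(hits) for h in hits]
    power = 1.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** power for dk in d2) for dj in d2]

x = (1.0, 0.0)
centers = [(0.0, 0.0), (2.0, 5.0)]
weights = [(1.0, 0.0), (0.5, 0.5)]  # cluster 1 ignores the second feature entirely
u = weighted_memberships(x, centers, weights)
```

Here the weighted squared distances are 1 and 13, so the memberships are 13/14 and 1/14; changing one cluster's weights changes only that cluster's distance.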

GRAPH BASED CLUSTERING WITH CONSTRAINTS AND ACTIVE LEARNING

Journal of Computer Science and Cybernetics ◽

10.15625/1813-9663/37/1/15773 ◽

2021 ◽

Vol 37 (1) ◽

pp. 71-89

Author(s):

Vu-Tuan Dang ◽

Viet-Vu Vu ◽

Hong-Quan Do ◽

Thi Kieu Oanh Le

Keyword(s):

Active Learning ◽

Clustering Algorithm ◽

Side Information ◽

Clustering Algorithms ◽

Real Data ◽

Data Sets ◽

Data Set ◽

Supervised Clustering ◽

Class Labels ◽

Graph Based Clustering

During the past few years, semi-supervised clustering has emerged as an interesting new direction in machine learning research. In a semi-supervised clustering algorithm, the clustering results can be significantly improved by using side information, which is available or collected from users. There are two main kinds of side information that can be used in semi-supervised clustering algorithms: class labels (called seeds) and pairwise constraints. The first semi-supervised clustering algorithm was introduced in 2000, and since then many algorithms have been presented in the literature. However, it is not easy to use both types of side information in the same algorithm. To address this problem, this paper proposes a semi-supervised graph-based clustering algorithm, called MCSSGC, that uses both seeds and constraints in the clustering process. Moreover, we introduce a simple but efficient active learning method, named KMMFFQS, to collect the constraints that can boost the performance of MCSSGC. To verify the effectiveness of the proposed algorithm, we conducted a series of experiments not only on real data sets from UCI but also on a document data set used in information extraction from Vietnamese documents. The results show that the proposed algorithm can significantly improve the clustering process compared with some recent algorithms.
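The two kinds of side information can be represented very simply, as the sketch below shows. It only checks a given labelling against seeds and pairwise constraints; it does not implement MCSSGC or KMMFFQS, and the data structures and function name are illustrative assumptions.

```python
def count_violations(labels, seeds, must_link, cannot_link):
    """Count how often a clustering disagrees with the side information.

    labels: cluster id per object; seeds: {object index: required cluster id};
    must_link / cannot_link: lists of index pairs.
    """
    seed_errors = sum(1 for i, cls in seeds.items() if labels[i] != cls)
    ml_errors = sum(1 for a, b in must_link if labels[a] != labels[b])
    cl_errors = sum(1 for a, b in cannot_link if labels[a] == labels[b])
    return seed_errors, ml_errors, cl_errors

labels = [0, 0, 1, 1, 0]
seeds = {0: 0, 3: 0}             # object 3 is labelled 1 but seeded as 0
must_link = [(0, 1), (2, 4)]     # objects 2 and 4 end up in different clusters
cannot_link = [(1, 2), (0, 4)]   # objects 0 and 4 end up in the same cluster
violations = count_violations(labels, seeds, must_link, cannot_link)
```

A count like this could serve as a sanity check on a clustering result, whichever algorithm produced it.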

Trajectory Clustering by Sampling and Density

Marine Technology Society Journal ◽

10.4031/mtsj.48.6.8 ◽

2014 ◽

Vol 48 (6) ◽

pp. 74-85 ◽

Cited By ~ 5

Author(s):

Jiacai Pan ◽

Qingshan Jiang ◽

Zheping Shao

Keyword(s):

Traffic Flow ◽

Clustering Algorithm ◽

Moving Objects ◽

Distance Measure ◽

Classical Method ◽

Trajectory Clustering ◽

Trajectory Data ◽

Density Based Clustering ◽

Clustering Quality ◽

Parameter Values

The trajectory data of moving objects contain huge amounts of information pertaining to traffic flow. It is important to extract valuable knowledge from this particular kind of data. Trajectory clustering is one of the most widely used approaches to this extraction. However, the current practice of trajectory clustering always groups similar subtrajectories that are partitioned from the trajectories; these methods thus lose important information about the trajectory as a whole. To deal with this problem, this paper introduces a new trajectory-clustering algorithm based on sampling and density, which groups similar traffic movement tracks (car, ship, airplane, etc.) for further analysis of the characteristics of traffic flow. In particular, this paper proposes a novel technique for measuring distances between trajectories using point sampling. This distance measure does not divide the trajectory and thus conserves the integrated knowledge of these trajectories. This trajectory clustering approach is a new adaptation of a density-based clustering algorithm to the trajectories of moving objects. This paper then adopts entropy theory as the heuristic for selecting the parameter values of this algorithm and the sum of squared errors for measuring the clustering quality. Experiments on real ship trajectory data show that this algorithm is superior to the classical method TRACLUS in run time and that it works well in discovering traffic flow patterns.
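A whole-trajectory distance based on point sampling can be sketched as follows. The abstract does not give the sampling scheme, so this example assumes resampling each trajectory to the same number of points, equally spaced by arc length, and averaging the pointwise distances; the function names are hypothetical.

```python
import math

def resample(traj, n):
    """Return n >= 2 points equally spaced by arc length along the polyline traj."""
    cum = [0.0]
    for p, q in zip(traj, traj[1:]):
        cum.append(cum[-1] + math.dist(p, q))
    total = cum[-1]
    if total == 0.0:
        return [traj[0]] * n
    out, seg = [], 0
    for k in range(n):
        target = total * k / (n - 1)
        while seg < len(traj) - 2 and cum[seg + 1] < target:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        t = (target - cum[seg]) / span if span > 0 else 0.0
        p, q = traj[seg], traj[seg + 1]
        out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return out

def sampled_distance(a, b, n=10):
    """Mean distance between corresponding sample points of two trajectories."""
    pa, pb = resample(a, n), resample(b, n)
    return sum(math.dist(p, q) for p, q in zip(pa, pb)) / n

track_a = [(0.0, 0.0), (4.0, 0.0)]
track_b = [(0.0, 3.0), (2.0, 3.0), (4.0, 3.0)]  # parallel to track_a, 3 units away
distance = sampled_distance(track_a, track_b)
```

Because both trajectories are compared as wholes, the measure does not require partitioning them into subtrajectories. Note that this version is sensitive to travel direction, since samples are paired in order.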

A Global-Relationship Dissimilarity Measure for the k-Modes Clustering Algorithm

Computational Intelligence and Neuroscience ◽

10.1155/2017/3691316 ◽

2017 ◽

Vol 2017 ◽

pp. 1-7 ◽

Cited By ~ 3

Author(s):

Hongfang Zhou ◽

Yihui Zhang ◽

Yibin Liu

Keyword(s):

Categorical Data ◽

Clustering Algorithm ◽

Real Data ◽

Dissimilarity Measure ◽

Data Sets ◽

Dissimilarity Measures

The k-modes clustering algorithm has been widely used to cluster categorical data. In this paper, we first analyze the k-modes algorithm and its dissimilarity measure. Based on this analysis, we then propose a novel dissimilarity measure, named GRD. GRD considers not only the relationships between the object and all cluster modes but also the differences between attributes. Finally, experiments were conducted on four real data sets from UCI. The results show that GRD achieves better performance than two existing dissimilarity measures used in k-modes and Cao’s algorithms.
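For context, the baseline that such a measure is compared against can be sketched briefly. The code below shows the simple matching dissimilarity and the mode computation used in standard k-modes; it does not implement GRD, whose definition is not given in the abstract, and the sample objects are made up.

```python
from collections import Counter

def simple_matching(x, mode):
    """Number of attributes on which object x and the cluster mode disagree."""
    return sum(1 for a, b in zip(x, mode) if a != b)

def cluster_mode(objects):
    """Most frequent category per attribute across the objects of one cluster."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*objects))

cluster = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "round"),
]
mode = cluster_mode(cluster)
dissimilarity = simple_matching(("blue", "large", "round"), mode)
```

Simple matching compares an object with one mode at a time and treats every attribute alike, which are the two limitations the abstract says GRD addresses.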

An Efficient MapReduce-Based Parallel Clustering Algorithm for Distributed Traffic Subarea Division

Discrete Dynamics in Nature and Society ◽

10.1155/2015/793010 ◽

2015 ◽

Vol 2015 ◽

pp. 1-18 ◽

Cited By ~ 7

Author(s):

Dawen Xia ◽

Binfeng Wang ◽

Yantao Li ◽

Zhuobo Rong ◽

Zili Zhang

Keyword(s):

Intelligent Transportation Systems ◽

Large Scale ◽

Clustering Algorithm ◽

Transportation Systems ◽

Division Problem ◽

Data Sets ◽

Trajectory Data ◽

Computing Platform ◽

Distributed Computing Platform ◽

Parallel Clustering

Traffic subarea division is vital for traffic system management and traffic network analysis in intelligent transportation systems (ITSs). Since existing methods may not be suitable for big traffic data processing, this paper presents a MapReduce-based Parallel Three-Phase K-Means (Par3PKM) algorithm for solving the traffic subarea division problem on the widely adopted Hadoop distributed computing platform. Specifically, we first modify the distance metric and initialization strategy of K-Means and then employ the MapReduce paradigm to redesign the optimized K-Means algorithm for parallel clustering of large-scale taxi trajectories. Moreover, we propose a boundary identification method to connect the borders of the clustering results for each cluster. Finally, using the proposed approach, we divide Beijing into traffic subareas based on real-world trajectory data sets generated by 12,000 taxis over a period of one month. Experimental evaluation results indicate that, compared with K-Means, Par2PK-Means, and ParCLARA, Par3PKM achieves higher efficiency, higher accuracy, and better scalability and can effectively divide traffic subareas with big taxi trajectory data.
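How one K-Means iteration decomposes into map and reduce phases can be sketched in plain Python. This runs in a single process and uses ordinary squared Euclidean distance; it does not reproduce Par3PKM's modified distance metric, its initialization strategy, or any Hadoop API, and the function names are illustrative.

```python
from collections import defaultdict

def map_phase(points, centers):
    """Map: emit (nearest center index, (point, 1)) for every point."""
    for p in points:
        nearest = min(
            range(len(centers)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])),
        )
        yield nearest, (p, 1)

def reduce_phase(pairs, centers):
    """Reduce: sum points and counts per key, then average to get new centers."""
    sums, counts = {}, defaultdict(int)
    for j, (p, n) in pairs:
        previous = sums.get(j, (0.0,) * len(p))
        sums[j] = tuple(s + v for s, v in zip(previous, p))
        counts[j] += n
    # A center that received no points is left unchanged.
    return [
        tuple(s / counts[j] for s in sums[j]) if j in sums else c
        for j, c in enumerate(centers)
    ]

points = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (11.0, 9.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
new_centers = reduce_phase(map_phase(points, centers), centers)
```

Since the map step needs only the current centers and the reduce step needs only per-key sums and counts, the points can be split across workers without sharing any other state.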
