Self-Adaptive K-Means Based on a Covering Algorithm

Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-16 ◽  
Author(s):  
Yiwen Zhang ◽  
Yuanyuan Zhou ◽  
Xing Guo ◽  
Jintao Wu ◽  
Qiang He ◽  
...  

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has been studied by researchers in numerous fields for a long time. However, the value of the clustering number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm can not only acquire efficient and accurate clustering results but also self-adaptively provide a reasonable number of clusters based on the data features. It includes two phases: the initialization of the covering algorithm (CA) and the Lloyd iteration of K-means. The first phase executes the CA, which self-organizes and recognizes the number of clusters k based on the similarities in the data; it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. It therefore has a "blind" feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm thus combines the advantages of CA and K-means. Experiments carried out on the Spark platform verify the good scalability of the C-K-means algorithm, which can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the C-K-means algorithm outperforms existing algorithms in both accuracy and efficiency, under both sequential and parallel conditions.
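The abstract does not spell out the covering algorithm itself, so the following sketch only illustrates the two-phase skeleton: a simple radius-based covering pass stands in for CA and self-determines k, and its cover centroids seed a standard Lloyd iteration. The `radius` parameter is an assumed stand-in for whatever similarity threshold CA derives from the data.

```python
import numpy as np

def covering_init(X, radius):
    """Phase 1 (illustrative stand-in for CA): each point outside every
    existing cover of the given radius seeds a new cover, so k emerges
    from the data rather than being prespecified."""
    centers, members = [], []
    for x in X:
        d = [np.linalg.norm(x - c) for c in centers]
        if centers and min(d) <= radius:
            members[int(np.argmin(d))].append(x)
        else:
            centers.append(x)
            members.append([x])
    return np.array([np.mean(m, axis=0) for m in members])

def lloyd(X, centers, n_iter=100):
    """Phase 2: standard Lloyd iterations from the covering-derived centers."""
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# Usage: k is never specified; it emerges from the data and the cover radius.
X = np.random.rand(500, 2)
centers = covering_init(X, radius=0.2)
labels, centers = lloyd(X, centers)
print(f"self-determined k = {len(centers)}")
```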

2019 ◽  
Vol 31 (2) ◽  
pp. 329-338 ◽  
Author(s):  
Jian Hu ◽  
Haiwan Zhu ◽  
Yimin Mao ◽  
Canlong Zhang ◽  
Tian Liang ◽  
...  

Landslide hazard prediction is a difficult, time-consuming process when traditional methods are used. This paper presents a method that uses machine learning to predict landslide hazard levels automatically. Due to the difficulty of obtaining and effectively processing rainfall data in landslide hazard prediction, and to the limitations of the M-chameleon algorithm in dealing with large-scale data sets, a new method based on an uncertain DM-chameleon algorithm (developed M-chameleon) is proposed to build the landslide susceptibility model. First, the method designs a new two-phase clustering algorithm based on M-chameleon, which effectively processes large-scale data sets. Second, a new E-H distance formula is designed by combining the Euclidean and Hausdorff distances, enabling the method to manage uncertain data effectively; an uncertain data model is presented at the same time to quantify triggering factors. Finally, the model for predicting landslide hazards is constructed and verified using data from the Baota district of the city of Yan'an, China. The experimental results show that the uncertain DM-chameleon algorithm can effectively improve the accuracy of landslide prediction and has high feasibility. Furthermore, the relationships between hazard factors and landslide hazard levels can be extracted from the clustering results.
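The abstract names the E-H distance but not its exact combination rule, so the sketch below is only one plausible reading: uncertain objects are represented as sample sets, and a convex mix of the Euclidean distance between the sets' means and the Hausdorff distance between the full sets is used. The mixing weight `alpha` is an assumption.

```python
import numpy as np

def hausdorff(U, V):
    """Symmetric Hausdorff distance between two sample sets (rows = samples)."""
    d = np.linalg.norm(U[:, None, :] - V[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def eh_distance(U, V, alpha=0.5):
    """Illustrative E-H distance: Euclidean distance between set means,
    mixed with the Hausdorff distance between the sets themselves.
    `alpha` is an assumed weight; the paper's exact rule is not given."""
    euclid = np.linalg.norm(U.mean(axis=0) - V.mean(axis=0))
    return alpha * euclid + (1 - alpha) * hausdorff(U, V)

# Two uncertain objects, each quantified by 30 noisy samples of its factors.
rng = np.random.default_rng(0)
U = rng.normal([0, 0], 0.3, size=(30, 2))
V = rng.normal([2, 1], 0.5, size=(30, 2))
print(eh_distance(U, V))
```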


2014 ◽  
Vol 687-691 ◽  
pp. 1342-1345 ◽  
Author(s):  
Jie Ding ◽  
Li Peng Zhu ◽  
Bin Hu ◽  
Ren Long Hang ◽  
Yu Bao Sun

With the rapid advance of data collection and storage techniques, it is easy to acquire data sets with tens of millions or even billions of instances. How to explore and exploit the useful or interesting information in these data sets has become an urgent issue. The traditional k-means clustering algorithm has been widely used in the data mining community. First, k clustering centres are randomly initialized. Then, all instances are assigned to k different classes according to their distances to the clustering centres. Lastly, each clustering centre is updated to the mean of its constituent instances. This whole process is iterated until convergence. Obviously, at each iteration, the distance matrix from all instances to the k clustering centres must be calculated, which is very time-consuming for large-scale data sets. To address this issue, this paper proposes a fast optimization algorithm based on stochastic gradient descent (SGD): at each iteration, one instance is chosen at random, its corresponding clustering centre is found, and that centre is updated immediately. Experimental results show that the proposed method achieves competitive clustering results at a lower time cost.
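The update rule described matches the classic online (SGD) variant of k-means; a minimal sketch, assuming the common 1/n_j per-centre learning rate:

```python
import numpy as np

def sgd_kmeans(X, k, n_steps=100_000, seed=0):
    """Online (SGD) k-means: per step, draw one instance, find its nearest
    centre, and move that centre toward it with a per-centre decaying rate.
    Avoids computing the full instance-to-centre distance matrix each pass."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    counts = np.zeros(k)                              # per-centre update counts
    for _ in range(n_steps):
        x = X[rng.integers(len(X))]                   # randomly choose one instance
        j = np.argmin(((centers - x) ** 2).sum(axis=1))
        counts[j] += 1
        centers[j] += (x - centers[j]) / counts[j]    # immediate update, lr = 1/n_j
    return centers
```

With the 1/n_j rate, each centre converges to the running mean of the instances assigned to it, mirroring the batch update without ever forming the full distance matrix.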


2019 ◽  
Vol 48 (4) ◽  
pp. 673-681
Author(s):  
Shufen Zhang ◽  
Zhiyu Liu ◽  
Xuebin Chen ◽  
Changyin Luo

In order to solve the problems of the traditional K-Means clustering algorithm in dealing with large-scale data sets, a Hadoop K-Means (HKM) clustering algorithm is proposed. Firstly, the algorithm eliminates the effects of noise points in the data set according to sample density. Secondly, it optimizes the selection of the initial center points using the max-min distance principle. Finally, it uses the MapReduce programming model to realize parallelization. Experimental results show that the proposed algorithm not only has high accuracy and stability in clustering results but can also solve the scalability problems encountered by traditional clustering algorithms in dealing with large-scale data.
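The max-min distance idea for seeding centers is a standard farthest-first traversal; a minimal sequential sketch, leaving aside the density-based noise filtering and the MapReduce parallelization:

```python
import numpy as np

def max_min_init(X, k, seed=0):
    """Max-min distance initialization: pick the first centre at random, then
    repeatedly pick the point whose distance to its nearest chosen centre is
    largest, spreading the k initial centres across the data."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    d = np.linalg.norm(X - centers[0], axis=1)        # distance to nearest centre
    for _ in range(k - 1):
        centers.append(X[np.argmax(d)])               # farthest point so far
        d = np.minimum(d, np.linalg.norm(X - centers[-1], axis=1))
    return np.array(centers)
```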


2021 ◽  
Vol 2021 ◽  
pp. 1-14
Author(s):  
Li Guo ◽  
Kunlin Zhu ◽  
Ruijun Duan

In order to explore economic development trends in the postepidemic era, this paper improves the traditional clustering algorithm and constructs an intelligent-algorithm-based model for analyzing postepidemic economic development trends. To solve the clustering problem for large-scale data sets of nonuniform density, the paper proposes an adaptive nonuniform-density clustering algorithm based on balanced iterative reduction and uses it to further cluster the compressed data sets. For large-scale data sets, the clustering results accurately reflect the class characteristics of the data set as a whole, and the algorithm greatly improves the time efficiency of clustering. The research results show that the improved clustering algorithm is effective for analyzing economic development trends in the postepidemic era and can continue to play a role in subsequent economic analysis.
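"Balanced iterative reduction" suggests a BIRCH-style compression step; purely as an illustration of that family (not the paper's adaptive nonuniform-density variant), a sketch using scikit-learn's Birch:

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# BIRCH compresses a large data set into CF-tree subclusters, which a global
# clusterer then refines; the threshold controls how aggressively it compresses.
X, _ = make_blobs(n_samples=100_000, centers=5,
                  cluster_std=[0.4, 0.8, 1.2, 0.5, 1.0])   # nonuniform density
model = Birch(threshold=0.5, n_clusters=5)
labels = model.fit_predict(X)
print(len(model.subcluster_centers_), "compressed subclusters")
```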


Author(s):  
D T Pham ◽  
S S Dimov ◽  
C D Nguyen

The K-means algorithm is a popular data-clustering algorithm. However, one of its drawbacks is the requirement for the number of clusters, K, to be specified before the algorithm is applied. This paper first reviews existing methods for selecting the number of clusters for the algorithm. Factors that affect this selection are then discussed, and a new measure to assist the selection is proposed. The paper concludes with an analysis of the results of using the proposed measure to determine the number of clusters for the K-means algorithm on different data sets.
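The abstract does not reproduce the proposed measure. The sketch below implements the evaluation function f(K) as it is commonly cited from this paper, with S_K the total within-cluster distortion for K clusters; the weight constants are quoted from memory and should be checked against the original.

```python
import numpy as np
from sklearn.cluster import KMeans

def f_K(X, k_max=10):
    """f(K) = S_K / (alpha_K * S_{K-1}), where S_K is the total within-cluster
    distortion (inertia) with K clusters; values well below 1 suggest a good K.
    The alpha_K weights below follow the commonly cited recursion."""
    Nd = X.shape[1]                               # number of dimensions
    f, S_prev, alpha = [1.0], None, None          # f(1) = 1 by definition
    for K in range(1, k_max + 1):
        S = KMeans(n_clusters=K, n_init=10).fit(X).inertia_
        if K >= 2:
            alpha = 1 - 3 / (4 * Nd) if K == 2 else alpha + (1 - alpha) / 6
            f.append(S / (alpha * S_prev) if S_prev > 0 else 1.0)
        S_prev = S
    return f   # the K minimizing f (where f < ~0.85) is the suggested choice
```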


2016 ◽  
Vol 2016 ◽  
pp. 1-9
Author(s):  
Size Bi ◽  
Xiaoyu Han ◽  
Jing Tian ◽  
Xiao Liang ◽  
Yang Wang ◽  
...  

This paper investigates a homotopy-based method for embedding data sets with hundreds of thousands of items that yields a parallel algorithm suitable for running on a distributed system. Current eigenvalue-based embedding algorithms attempt to use a sparsification of the distance matrix to approximate a low-dimensional representation when handling large-scale data sets; the main reason for this approximation is that the embedding process is still hindered by the eigendecomposition bottleneck for high-dimensional matrices. In this study, a homotopy continuation algorithm is applied to improve this embedding model by parallelizing the corresponding eigendecomposition. The eigenvalue problem is converted into ordinary differential equations with initialized values, and all isolated positive eigenvalues and their corresponding eigenvectors can be obtained in parallel by tracking the predicted eigenpaths. Experiments on real data sets show that the homotopy-based approach has the potential to scale to millions of data items.
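The abstract's ODE formulation is not given explicitly; the following sketch shows the general idea for a symmetric matrix under a linear homotopy H(t) = (1 - t)D + tA, tracking one eigenpair by integrating the standard perturbation derivatives. It assumes simple (non-crossing) eigenpaths; each path is independent, which is what makes the parallelization possible.

```python
import numpy as np
from scipy.integrate import solve_ivp

def track_eigenpair(A, i):
    """Track the i-th smallest eigenpair of H(t) = (1 - t)*D + t*A from t = 0
    to t = 1, where D = diag(diag(A)) has trivially known eigenpairs at t = 0.
    Assumes the eigenpath stays simple (no crossings along the way)."""
    n = A.shape[0]
    D = np.diag(np.diag(A))
    Hdot = A - D                                  # dH/dt is constant here

    idx = np.argsort(np.diag(A))[i]               # starting eigenpair at t = 0
    lam0, x0 = A[idx, idx], np.zeros(n)
    x0[idx] = 1.0

    def rhs(t, y):
        lam, x = y[0], y[1:] / np.linalg.norm(y[1:])
        H = (1 - t) * D + t * A
        dlam = x @ Hdot @ x                       # lambda'(t) = x^T H'(t) x
        # x'(t) solves (H - lam*I) x' = -(H' - lam'*I) x, orthogonally to x
        dx, *_ = np.linalg.lstsq(H - lam * np.eye(n),
                                 -(Hdot - dlam * np.eye(n)) @ x, rcond=None)
        dx -= (x @ dx) * x                        # keep the eigenvector normalized
        return np.concatenate(([dlam], dx))

    sol = solve_ivp(rhs, (0.0, 1.0), np.concatenate(([lam0], x0)), rtol=1e-8)
    lam, x = sol.y[0, -1], sol.y[1:, -1]
    return lam, x / np.linalg.norm(x)

# Each eigenpath is independent, so all n calls can run in parallel.
A = np.array([[4.0, 1.0, 0.5], [1.0, 3.0, 0.2], [0.5, 0.2, 1.0]])
print(track_eigenpair(A, 2)[0], np.linalg.eigvalsh(A)[2])
```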


2021 ◽  
Author(s):  
Abdelhamid ZAIDI

Object datasets used in the construction of object detectors are typically manually annotated with horizontal or rotated bounding rectangles. An annotation is optimal when it fulfils two conditions: (i) the rectangle covers the whole object, and (ii) the area of the rectangle is minimal. Building a large-scale object dataset requires annotators of equal manual dexterity to carry out this tedious work. When an object is horizontal, it is easy for the annotator to reach the optimal bounding box within a reasonable time. However, if the object is rotated, the annotator needs additional time to decide whether the object should be annotated with a horizontal or a rotated rectangle. Moreover, in both cases, the final decision is not based on any objective argument, and the annotation is generally not optimal. In this study, we propose a new method of annotation by rectangles, called Robust Semi-Automatic Annotation, which combines speed and robustness. The method has two phases. The first phase invites the annotator to click on the most relevant points located on the contour of the object. The second phase runs an algorithm we develop, called RANGE-MBR, which determines, from the selected contour points, a rectangle enclosing them in linear time. The rectangle returned by RANGE-MBR always satisfies optimality condition (i). We prove that optimality condition (ii) is always satisfied for objects with isotropic shapes; for objects with anisotropic shapes, we study condition (ii) by simulations. We show that the rectangle returned by RANGE-MBR is quasi-optimal for condition (ii) and that its performance increases with dilated objects, which is the case for most objects appearing in images collected by aerial photography.
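RANGE-MBR itself is not reproducible from the abstract. As a point of comparison, a standard rotating-calipers-style baseline for the same task (the minimum-area enclosing rectangle always has one side collinear with a convex-hull edge) runs in O(n log n) rather than linear time:

```python
import numpy as np
from scipy.spatial import ConvexHull

def min_area_rect(points):
    """Classic baseline, not the authors' RANGE-MBR: the minimum-area
    enclosing rectangle has a side collinear with a convex-hull edge,
    so try each hull-edge direction and keep the smallest rectangle."""
    hull = points[ConvexHull(points).vertices]
    best = (np.inf, None)
    for i in range(len(hull)):
        e = hull[(i + 1) % len(hull)] - hull[i]
        e = e / np.linalg.norm(e)
        R = np.array([[e[0], e[1]], [-e[1], e[0]]])   # rotate edge onto x-axis
        P = hull @ R.T
        w, h = P.max(axis=0) - P.min(axis=0)
        if w * h < best[0]:
            best = (w * h, (R, P.min(axis=0), w, h))  # area + axis-aligned box
    return best

# Usage: clicked contour points in, enclosing rectangle out (condition (i) holds).
pts = np.random.rand(40, 2) @ np.array([[2.0, 0.5], [0.0, 1.0]])
print("min area:", min_area_rect(pts)[0])
```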

