Self-Adaptive K-Means Based on a Covering Algorithm

Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-16 ◽  
Author(s):  
Yiwen Zhang ◽  
Yuanyuan Zhou ◽  
Xing Guo ◽  
Jintao Wu ◽  
Qiang He ◽  
...  

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has been studied by researchers in numerous fields for a long time. However, the value of the clustering number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm can not only acquire efficient and accurate clustering results but also self-adaptively provide a reasonable number of clusters based on the data features. It includes two phases: the initialization of the covering algorithm (CA) and the Lloyd iteration of K-means. The first phase executes the CA, which self-organizes and recognizes the number of clusters k based on the similarities in the data; it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. It therefore has a "blind" feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm thus combines the advantages of CA and K-means. Experiments carried out on the Spark platform verify the good scalability of the C-K-means algorithm, which can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the C-K-means algorithm outperforms existing algorithms in both accuracy and efficiency, under both sequential and parallel conditions.
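The abstract does not spell out the covering algorithm itself, so the following sketch only illustrates the two-phase skeleton: a simple radius-based covering pass stands in for CA and self-determines k, and its cover centroids seed a standard Lloyd iteration. The `radius` parameter is an assumed stand-in for whatever similarity threshold CA derives from the data.

```python
import numpy as np

def covering_init(X, radius):
    """Phase 1 (illustrative stand-in for CA): each point outside every
    existing cover of the given radius seeds a new cover, so k emerges
    from the data rather than being prespecified."""
    centers, members = [], []
    for x in X:
        d = [np.linalg.norm(x - c) for c in centers]
        if centers and min(d) <= radius:
            members[int(np.argmin(d))].append(x)
        else:
            centers.append(x)
            members.append([x])
    return np.array([np.mean(m, axis=0) for m in members])

def lloyd(X, centers, n_iter=100):
    """Phase 2: standard Lloyd iterations from the covering-derived centers."""
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# Usage: k is never specified; it emerges from the data and the cover radius.
X = np.random.rand(500, 2)
centers = covering_init(X, radius=0.2)
labels, centers = lloyd(X, centers)
print(f"self-determined k = {len(centers)}")
```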

2019 ◽  
Vol 31 (2) ◽  
pp. 329-338 ◽  
Author(s):  
Jian Hu ◽  
Haiwan Zhu ◽  
Yimin Mao ◽  
Canlong Zhang ◽  
Tian Liang ◽  
...  

Landslide hazard prediction is a difficult, time-consuming process when traditional methods are used. This paper presents a method that uses machine learning to predict landslide hazard levels automatically. Due to the difficulty of obtaining and effectively processing rainfall data in landslide hazard prediction, and to the limitations of the M-chameleon algorithm in dealing with large-scale data sets, a new method based on an uncertain DM-chameleon algorithm (developed M-chameleon) is proposed to build the landslide susceptibility model. First, the method designs a new two-phase clustering algorithm based on M-chameleon, which effectively processes large-scale data sets. Second, a new E-H distance formula is designed by combining the Euclidean and Hausdorff distances, enabling the method to manage uncertain data effectively; an uncertain data model is presented at the same time to quantify triggering factors. Finally, the model for predicting landslide hazards is constructed and verified using data from the Baota district of the city of Yan'an, China. The experimental results show that the uncertain DM-chameleon algorithm can effectively improve the accuracy of landslide prediction and has high feasibility. Furthermore, the relationships between hazard factors and landslide hazard levels can be extracted from the clustering results.
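The abstract names the E-H distance but not its exact combination rule, so the sketch below is only one plausible reading: uncertain objects are represented as sample sets, and a convex mix of the Euclidean distance between the sets' means and the Hausdorff distance between the full sets is used. The mixing weight `alpha` is an assumption.

```python
import numpy as np

def hausdorff(U, V):
    """Symmetric Hausdorff distance between two sample sets (rows = samples)."""
    d = np.linalg.norm(U[:, None, :] - V[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def eh_distance(U, V, alpha=0.5):
    """Illustrative E-H distance: Euclidean distance between set means,
    mixed with the Hausdorff distance between the sets themselves.
    `alpha` is an assumed weight; the paper's exact rule is not given."""
    euclid = np.linalg.norm(U.mean(axis=0) - V.mean(axis=0))
    return alpha * euclid + (1 - alpha) * hausdorff(U, V)

# Two uncertain objects, each quantified by 30 noisy samples of its factors.
rng = np.random.default_rng(0)
U = rng.normal([0, 0], 0.3, size=(30, 2))
V = rng.normal([2, 1], 0.5, size=(30, 2))
print(eh_distance(U, V))
```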


2014 ◽  
Vol 687-691 ◽  
pp. 1342-1345 ◽  
Author(s):  
Jie Ding ◽  
Li Peng Zhu ◽  
Bin Hu ◽  
Ren Long Hang ◽  
Yu Bao Sun

With the rapid advance of data collection and storage techniques, it is easy to acquire data sets with tens of millions or even billions of instances. How to explore and exploit the useful or interesting information in these data sets has become an urgent issue. The traditional k-means clustering algorithm has been widely used in the data mining community. First, k clustering centres are randomly initialized. Then, all instances are assigned to k different classes according to their distances to the clustering centres. Lastly, each clustering centre is updated to the mean of its constituent instances. This whole process is iterated until convergence. Obviously, at each iteration, the distance matrix from all instances to the k clustering centres must be calculated, which is very time-consuming for large-scale data sets. To address this issue, this paper proposes a fast optimization algorithm based on stochastic gradient descent (SGD): at each iteration, one instance is chosen at random, its corresponding clustering centre is found, and that centre is updated immediately. Experimental results show that the proposed method achieves competitive clustering results at a lower time cost.
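The update rule described matches the classic online (SGD) variant of k-means; a minimal sketch, assuming the common 1/n_j per-centre learning rate:

```python
import numpy as np

def sgd_kmeans(X, k, n_steps=100_000, seed=0):
    """Online (SGD) k-means: per step, draw one instance, find its nearest
    centre, and move that centre toward it with a per-centre decaying rate.
    Avoids computing the full instance-to-centre distance matrix each pass."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    counts = np.zeros(k)                              # per-centre update counts
    for _ in range(n_steps):
        x = X[rng.integers(len(X))]                   # randomly choose one instance
        j = np.argmin(((centers - x) ** 2).sum(axis=1))
        counts[j] += 1
        centers[j] += (x - centers[j]) / counts[j]    # immediate update, lr = 1/n_j
    return centers
```

With the 1/n_j rate, each centre converges to the running mean of the instances assigned to it, mirroring the batch update without ever forming the full distance matrix.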


2019 ◽  
Vol 48 (4) ◽  
pp. 673-681
Author(s):  
Shufen Zhang ◽  
Zhiyu Liu ◽  
Xuebin Chen ◽  
Changyin Luo

In order to solve the problems of the traditional K-Means clustering algorithm in dealing with large-scale data sets, a Hadoop K-Means (HKM) clustering algorithm is proposed. Firstly, the algorithm eliminates the effects of noise points in the data set according to sample density. Secondly, it optimizes the selection of the initial center points using the max-min distance principle. Finally, it uses the MapReduce programming model to realize parallelization. Experimental results show that the proposed algorithm not only has high accuracy and stability in clustering results but can also solve the scalability problems encountered by traditional clustering algorithms in dealing with large-scale data.
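The max-min distance idea for seeding centers is a standard farthest-first traversal; a minimal sequential sketch, leaving aside the density-based noise filtering and the MapReduce parallelization:

```python
import numpy as np

def max_min_init(X, k, seed=0):
    """Max-min distance initialization: pick the first centre at random, then
    repeatedly pick the point whose distance to its nearest chosen centre is
    largest, spreading the k initial centres across the data."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    d = np.linalg.norm(X - centers[0], axis=1)        # distance to nearest centre
    for _ in range(k - 1):
        centers.append(X[np.argmax(d)])               # farthest point so far
        d = np.minimum(d, np.linalg.norm(X - centers[-1], axis=1))
    return np.array(centers)
```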


2021 ◽  
Vol 2021 ◽  
pp. 1-14
Author(s):  
Li Guo ◽  
Kunlin Zhu ◽  
Ruijun Duan

In order to explore economic development trends in the postepidemic era, this paper improves the traditional clustering algorithm and constructs an intelligent-algorithm-based model for analyzing postepidemic economic development trends. To solve the clustering problem for large-scale data sets of nonuniform density, the paper proposes an adaptive nonuniform-density clustering algorithm based on balanced iterative reduction and uses it to further cluster the compressed data sets. For large-scale data sets, the clustering results accurately reflect the class characteristics of the data set as a whole, and the algorithm greatly improves the time efficiency of clustering. The research results show that the improved clustering algorithm is effective for analyzing economic development trends in the postepidemic era and can continue to play a role in subsequent economic analysis.
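"Balanced iterative reduction" suggests a BIRCH-style compression step; purely as an illustration of that family (not the paper's adaptive nonuniform-density variant), a sketch using scikit-learn's Birch:

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# BIRCH compresses a large data set into CF-tree subclusters, which a global
# clusterer then refines; the threshold controls how aggressively it compresses.
X, _ = make_blobs(n_samples=100_000, centers=5,
                  cluster_std=[0.4, 0.8, 1.2, 0.5, 1.0])   # nonuniform density
model = Birch(threshold=0.5, n_clusters=5)
labels = model.fit_predict(X)
print(len(model.subcluster_centers_), "compressed subclusters")
```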


Author(s):  
D T Pham ◽  
S S Dimov ◽  
C D Nguyen

The K-means algorithm is a popular data-clustering algorithm. However, one of its drawbacks is the requirement for the number of clusters, K, to be specified before the algorithm is applied. This paper first reviews existing methods for selecting the number of clusters for the algorithm. Factors that affect this selection are then discussed, and a new measure to assist the selection is proposed. The paper concludes with an analysis of the results of using the proposed measure to determine the number of clusters for the K-means algorithm on different data sets.
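The abstract does not reproduce the proposed measure. The sketch below implements the evaluation function f(K) as it is commonly cited from this paper, with S_K the total within-cluster distortion for K clusters; the weight constants are quoted from memory and should be checked against the original.

```python
import numpy as np
from sklearn.cluster import KMeans

def f_K(X, k_max=10):
    """f(K) = S_K / (alpha_K * S_{K-1}), where S_K is the total within-cluster
    distortion (inertia) with K clusters; values well below 1 suggest a good K.
    The alpha_K weights below follow the commonly cited recursion."""
    Nd = X.shape[1]                               # number of dimensions
    f, S_prev, alpha = [1.0], None, None          # f(1) = 1 by definition
    for K in range(1, k_max + 1):
        S = KMeans(n_clusters=K, n_init=10).fit(X).inertia_
        if K >= 2:
            alpha = 1 - 3 / (4 * Nd) if K == 2 else alpha + (1 - alpha) / 6
            f.append(S / (alpha * S_prev) if S_prev > 0 else 1.0)
        S_prev = S
    return f   # the K minimizing f (where f < ~0.85) is the suggested choice
```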


2016 ◽  
Vol 2016 ◽  
pp. 1-9
Author(s):  
Size Bi ◽  
Xiaoyu Han ◽  
Jing Tian ◽  
Xiao Liang ◽  
Yang Wang ◽  
...  

This paper investigates a homotopy-based method for embedding data sets with hundreds of thousands of items that yields a parallel algorithm suitable for running on a distributed system. Current eigenvalue-based embedding algorithms attempt to use a sparsification of the distance matrix to approximate a low-dimensional representation when handling large-scale data sets; the main reason for this approximation is that the embedding process is still hindered by the eigendecomposition bottleneck for high-dimensional matrices. In this study, a homotopy continuation algorithm is applied to improve this embedding model by parallelizing the corresponding eigendecomposition. The eigenvalue problem is converted into ordinary differential equations with initialized values, and all isolated positive eigenvalues and their corresponding eigenvectors can be obtained in parallel by tracking the predicted eigenpaths. Experiments on real data sets show that the homotopy-based approach has the potential to scale to millions of data items.
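The abstract's ODE formulation is not given explicitly; the following sketch shows the general idea for a symmetric matrix under a linear homotopy H(t) = (1 - t)D + tA, tracking one eigenpair by integrating the standard perturbation derivatives. It assumes simple (non-crossing) eigenpaths; each path is independent, which is what makes the parallelization possible.

```python
import numpy as np
from scipy.integrate import solve_ivp

def track_eigenpair(A, i):
    """Track the i-th smallest eigenpair of H(t) = (1 - t)*D + t*A from t = 0
    to t = 1, where D = diag(diag(A)) has trivially known eigenpairs at t = 0.
    Assumes the eigenpath stays simple (no crossings along the way)."""
    n = A.shape[0]
    D = np.diag(np.diag(A))
    Hdot = A - D                                  # dH/dt is constant here

    idx = np.argsort(np.diag(A))[i]               # starting eigenpair at t = 0
    lam0, x0 = A[idx, idx], np.zeros(n)
    x0[idx] = 1.0

    def rhs(t, y):
        lam, x = y[0], y[1:] / np.linalg.norm(y[1:])
        H = (1 - t) * D + t * A
        dlam = x @ Hdot @ x                       # lambda'(t) = x^T H'(t) x
        # x'(t) solves (H - lam*I) x' = -(H' - lam'*I) x, orthogonally to x
        dx, *_ = np.linalg.lstsq(H - lam * np.eye(n),
                                 -(Hdot - dlam * np.eye(n)) @ x, rcond=None)
        dx -= (x @ dx) * x                        # keep the eigenvector normalized
        return np.concatenate(([dlam], dx))

    sol = solve_ivp(rhs, (0.0, 1.0), np.concatenate(([lam0], x0)), rtol=1e-8)
    lam, x = sol.y[0, -1], sol.y[1:, -1]
    return lam, x / np.linalg.norm(x)

# Each eigenpath is independent, so all n calls can run in parallel.
A = np.array([[4.0, 1.0, 0.5], [1.0, 3.0, 0.2], [0.5, 0.2, 1.0]])
print(track_eigenpair(A, 2)[0], np.linalg.eigvalsh(A)[2])
```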


2021 ◽  
Author(s):  
Abdelhamid ZAIDI

Object datasets used in the construction of object detectors are typically manually annotated with horizontal or rotated bounding rectangles. An annotation is optimal when it fulfils two conditions: (i) the rectangle covers the whole object, and (ii) the area of the rectangle is minimal. Building a large-scale object dataset requires annotators of equal manual dexterity to carry out this tedious work. When an object is horizontal, it is easy for the annotator to reach the optimal bounding box within a reasonable time. However, if the object is rotated, the annotator needs additional time to decide whether the object should be annotated with a horizontal or a rotated rectangle. Moreover, in both cases, the final decision is not based on any objective argument, and the annotation is generally not optimal. In this study, we propose a new method of annotation by rectangles, called Robust Semi-Automatic Annotation, which combines speed and robustness. The method has two phases. The first phase invites the annotator to click on the most relevant points located on the contour of the object. The second phase runs an algorithm we develop, called RANGE-MBR, which determines, from the selected contour points, a rectangle enclosing them in linear time. The rectangle returned by RANGE-MBR always satisfies optimality condition (i). We prove that optimality condition (ii) is always satisfied for objects with isotropic shapes; for objects with anisotropic shapes, we study condition (ii) by simulations. We show that the rectangle returned by RANGE-MBR is quasi-optimal for condition (ii) and that its performance increases with dilated objects, which is the case for most objects appearing in images collected by aerial photography.
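RANGE-MBR itself is not reproducible from the abstract. As a point of comparison, a standard rotating-calipers-style baseline for the same task (the minimum-area enclosing rectangle always has one side collinear with a convex-hull edge) runs in O(n log n) rather than linear time:

```python
import numpy as np
from scipy.spatial import ConvexHull

def min_area_rect(points):
    """Classic baseline, not the authors' RANGE-MBR: the minimum-area
    enclosing rectangle has a side collinear with a convex-hull edge,
    so try each hull-edge direction and keep the smallest rectangle."""
    hull = points[ConvexHull(points).vertices]
    best = (np.inf, None)
    for i in range(len(hull)):
        e = hull[(i + 1) % len(hull)] - hull[i]
        e = e / np.linalg.norm(e)
        R = np.array([[e[0], e[1]], [-e[1], e[0]]])   # rotate edge onto x-axis
        P = hull @ R.T
        w, h = P.max(axis=0) - P.min(axis=0)
        if w * h < best[0]:
            best = (w * h, (R, P.min(axis=0), w, h))  # area + axis-aligned box
    return best

# Usage: clicked contour points in, enclosing rectangle out (condition (i) holds).
pts = np.random.rand(40, 2) @ np.array([[2.0, 0.5], [0.0, 1.0]])
print("min area:", min_area_rect(pts)[0])
```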

