A FAST k-MEANS IMPLEMENTATION USING CORESETS

2008 ◽  
Vol 18 (06) ◽  
pp. 605-625 ◽  
Author(s):  
GEREON FRAHLING ◽  
CHRISTIAN SOHLER

In this paper we develop an efficient implementation of a k-means clustering algorithm. The algorithm is based on a combination of Lloyd's algorithm with random swapping of centers to avoid local minima, an approach proposed by Mount et al. [30]. The novel feature of our algorithm is the use of coresets to speed up the computation. A coreset is a small weighted set of points that approximates the original point set with respect to the considered problem. We use the coreset construction described in [12]. Our algorithm first computes a solution on a very small coreset. Then, in each iteration, the previous solution is used as a starting solution on a refined, i.e. larger, coreset. To evaluate the performance of our algorithm we compare it with the algorithm KMHybrid [30] on typical 3D data sets for an image compression application and on artificially created instances. Our data sets consist of 300,000 to 4.9 million points. Our algorithm outperforms KMHybrid on most of these input instances. Additionally, the quality of the solutions computed by our algorithm varies significantly less than that of KMHybrid. We conclude that the use of coresets has two effects: first, it can speed up algorithms significantly; second, in variants of Lloyd's algorithm, it reduces the dependency on the starting solution and thus makes the algorithm more stable. Finally, we propose the use of coresets as a heuristic to approximate the average silhouette coefficient of clusterings. The average silhouette coefficient is a measure of the quality of a clustering that is independent of the number of clusters k; hence, it can be used to compare the quality of clusterings for different values of k. To show the applicability of our approach we computed clusterings and approximate average silhouette coefficients for k = 1,…,100 on our input instances and discuss the performance of our algorithm in detail.
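The coarse-to-fine loop the abstract describes can be sketched in a few lines. The sketch below substitutes a simple lightweight-coreset-style sampler for the grid construction of [12], and the names (`build_coreset`, `coreset_kmeans`) are illustrative, not the authors' code:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_coreset(X, m, rng):
    """Sample a weighted coreset of size m. Simple sensitivity-style
    (lightweight-coreset) sampling, standing in for the grid-based
    construction of [12] used in the paper."""
    d2 = ((X - X.mean(axis=0)) ** 2).sum(axis=1)
    p = 0.5 / len(X) + 0.5 * d2 / d2.sum()   # sampling probabilities
    idx = rng.choice(len(X), size=m, p=p)
    return X[idx], 1.0 / (m * p[idx])        # points and weights

def coreset_kmeans(X, k, sizes=(1000, 4000, 16000), seed=0):
    """Solve k-means on successively larger coresets, warm-starting
    each round with the previous centers (coarse-to-fine refinement)."""
    rng = np.random.default_rng(seed)
    centers = "k-means++"                    # first round: standard seeding
    for m in sizes:
        C, w = build_coreset(X, min(m, len(X)), rng)
        km = KMeans(n_clusters=k, init=centers, n_init=1).fit(
            C, sample_weight=w)
        centers = km.cluster_centers_        # warm start for next round
    return centers
```

Warm-starting each round with the previous centers is what makes the refinement cheap: most Lloyd iterations happen on the smallest coresets, and later rounds only polish the solution.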

2011 ◽  
pp. 24-32 ◽  
Author(s):  
Nicoleta Rogovschi ◽  
Mustapha Lebbah ◽  
Younès Bennani

Most traditional clustering algorithms are limited to handling data sets that contain either continuous or categorical variables. However, data sets with mixed types of variables are common in the data mining field. In this paper we introduce a weighted self-organizing map for clustering, analysis, and visualization of mixed data (continuous/binary). The weights and prototypes are learned simultaneously, ensuring an optimized clustering of the data: the higher a variable's weight, the more the clustering algorithm takes into account the information carried by that variable. The learning of these topological maps is combined with a weighting process over the different variables, computing weights that influence the quality of the clustering. We illustrate the power of this method with data sets taken from a public repository: a handwritten digit data set, the Zoo data set, and three other mixed data sets. The results show a good quality of the topological ordering and homogeneous clustering.
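A minimal sketch of the core idea, simultaneous learning of prototypes and variable weights, follows. It assumes continuous variables only and a Gaussian grid neighborhood, so it is a simplified stand-in for the paper's mixed-data (continuous/binary) update rules:

```python
import numpy as np

def train_weighted_som(X, grid=(5, 5), epochs=20, lr=0.5, sigma=1.0, seed=0):
    """Toy weighted SOM: prototypes and per-variable weights are updated
    together, so informative (low-dispersion) variables gain influence
    on the winner search. A simplified stand-in for the paper's rule."""
    rng = np.random.default_rng(seed)
    n_units, d = grid[0] * grid[1], X.shape[1]
    W = rng.normal(size=(n_units, d))              # prototypes
    alpha = np.ones(d) / d                         # variable weights
    pos = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])])
    for _ in range(epochs):
        for x in rng.permutation(X):
            # weighted distance picks the best-matching unit (BMU)
            bmu = np.argmin(((W - x) ** 2 * alpha).sum(axis=1))
            h = np.exp(-((pos - pos[bmu]) ** 2).sum(axis=1) / (2 * sigma**2))
            W += lr * h[:, None] * (x - W)         # prototype update
            # variables with low neighborhood dispersion get larger weights
            disp = ((W - x) ** 2 * h[:, None]).sum(axis=0) + 1e-12
            alpha = (1.0 / disp) / (1.0 / disp).sum()
    return W, alpha
```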


2016 ◽  
Vol 16 (6) ◽  
pp. 27-42 ◽  
Author(s):  
Minghan Yang ◽  
Xuedong Gao ◽  
Ling Li

Although the Clustering Algorithm Based on Sparse Feature Vector (CABOSFV) and its related algorithms are efficient for high-dimensional sparse data clustering, they have several imperfections, such as subjective parameter designation and order sensitivity of the clustering process, which ultimately degrade the time complexity and quality of the algorithm. This paper proposes a parameter adjustment method for Bidirectional CABOSFV for optimization purposes. By optimizing the Parameter Vector (PV) and Parameter Selection Vector (PSV) with clustering validity as the objective function, an improved Bidirectional CABOSFV algorithm using simulated annealing is proposed, which circumvents the need for initial parameter determination. Experiments on UCI data sets show that the proposed algorithm, which can perform multi-adjustment clustering, achieves higher accuracy than single-adjustment clustering, along with decreased time complexity through iterations.
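The annealing layer is independent of CABOSFV's internals, so it can be sketched generically. In the sketch below, `validity` and `neighbor` are placeholder callbacks for a clustering-validity objective and a PV perturbation; neither is from the paper:

```python
import numpy as np

def anneal_parameters(pv0, validity, neighbor, T0=1.0, cooling=0.95,
                      steps=200, seed=0):
    """Generic simulated-annealing loop over a parameter vector (PV),
    standing in for the paper's Bidirectional CABOSFV adjustment.
    `validity(pv)` scores the clustering the parameters induce (higher
    is better); `neighbor(pv, rng)` proposes a perturbed PV."""
    rng = np.random.default_rng(seed)
    pv, f = pv0, validity(pv0)
    best, f_best = pv, f
    T = T0
    for _ in range(steps):
        cand = neighbor(pv, rng)
        fc = validity(cand)
        # accept improvements always, worse moves with Boltzmann probability
        if fc > f or rng.random() < np.exp((fc - f) / T):
            pv, f = cand, fc
            if f > f_best:
                best, f_best = pv, f
        T *= cooling                     # geometric cooling schedule
    return best, f_best
```

Because worse PVs are sometimes accepted at high temperature, the search can escape the locally optimal parameter settings that a single manual designation would be stuck with.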


2016 ◽  
Vol 25 (3) ◽  
pp. 431-440 ◽  
Author(s):  
Archana Purwar ◽  
Sandeep Kumar Singh

The quality of data is an important issue in data mining: the validity of mining algorithms is reduced if the data are not of good quality. Data quality can be assessed in terms of missing values (MV) as well as noise present in the data set. Various imputation techniques have been studied for MV, but little attention has been given to noise in earlier work. Moreover, to the best of our knowledge, no one has used density-based spatial clustering of applications with noise (DBSCAN) for MV imputation. This paper proposes a novel density-based imputation technique (DBSCANI), built on density-based clustering, to deal with incomplete values in the presence of noise. The density-based clustering algorithm proposed by Kriegel et al. groups objects according to their density in spatial databases: high-density regions are known as clusters, and low-density regions correspond to the noise objects in the data set. Extensive experiments were performed on the Iris data set from the life-science domain and on Jain's (2D) data set from the shape data sets. The performance of the proposed method is evaluated using root mean square error (RMSE) and compared with the existing K-means imputation (KMI). Results show that our method is more noise-resistant than KMI on the data sets under study.
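A minimal sketch of the density-based imputation idea using scikit-learn's DBSCAN follows; it clusters rows on their fully observed columns and fills each hole from the donor cluster, which is a simplification of DBSCANI, not the paper's exact procedure:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_impute(X, eps=0.5, min_samples=5):
    """Impute NaNs by clustering rows on their observed columns with
    DBSCAN, then filling each hole with its cluster's column mean.
    Noise rows (label -1) fall back to the global column mean.
    Assumes at least one fully observed column; a simplified sketch
    of the DBSCANI idea, not the paper's exact method."""
    X = np.asarray(X, dtype=float)
    complete_cols = ~np.isnan(X).any(axis=0)       # columns with no holes
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(
        X[:, complete_cols])
    out = X.copy()
    col_mean = np.nanmean(X, axis=0)               # global fallback
    for i, j in zip(*np.where(np.isnan(X))):
        donors = X[(labels == labels[i]) & (labels != -1), j]
        donors = donors[~np.isnan(donors)]
        out[i, j] = donors.mean() if donors.size else col_mean[j]
    return out
```

Routing noise rows to the global fallback rather than to a cluster mean is what gives the density-based scheme its noise resistance relative to KMI, which forces every row into some cluster.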


2011 ◽  
Vol 20 (01) ◽  
pp. 139-177 ◽  
Author(s):  
YAN ZHOU ◽  
OLEKSANDR GRYGORASH ◽  
THOMAS F. HAIN

We propose two clustering algorithms based on Euclidean minimum spanning trees: one k-constrained, the other unconstrained. Our k-constrained clustering algorithm produces a k-partition of a set of points for any given k: it constructs a minimum spanning tree of a set of representative points and removes edges that satisfy a predefined criterion, repeating the process until k clusters are produced. Our unconstrained clustering algorithm partitions a point set into a group of clusters by maximally reducing the overall standard deviation of the edges in the Euclidean minimum spanning tree constructed from the given point set, without prescribing the number of clusters. We present experimental results comparing our proposed algorithms with k-means, X-means, CURE, Chameleon, and the Expectation-Maximization (EM) algorithm on both artificial data and benchmark data from the UCI repository. We also apply our algorithms to image color clustering and compare them with the standard minimum spanning tree clustering algorithm as well as CURE, Chameleon, and X-means.
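For the k-constrained variant, a compact sketch using SciPy is shown below; it uses the classic criterion of cutting the k-1 longest MST edges, whereas the paper's removal criterion and its use of representative points are more refined:

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def emst_kcluster(X, k):
    """k-constrained EMST clustering sketch: build the Euclidean minimum
    spanning tree, cut its k-1 longest edges, and report the connected
    components as clusters. Cutting longest edges is the classic
    criterion; the paper's removal rule is more refined."""
    D = squareform(pdist(X))                      # pairwise distances
    mst = minimum_spanning_tree(D).tocoo()        # n-1 weighted edges
    keep = np.ones(mst.nnz, dtype=bool)
    keep[np.argsort(mst.data)[::-1][:k - 1]] = False   # drop longest edges
    pruned = coo_matrix((mst.data[keep], (mst.row[keep], mst.col[keep])),
                        shape=mst.shape)
    n_comp, labels = connected_components(pruned, directed=False)
    return labels                                 # n_comp == k components
```

Removing k-1 edges from a spanning tree always yields exactly k components, which is why the edge-removal loop terminates with a k-partition.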


2018 ◽  
Vol 2018 ◽  
pp. 1-8 ◽  
Author(s):  
Yaping Li ◽  
Zhiwei Ni ◽  
Feifei Jin ◽  
Jingming Li ◽  
Fenggang Li

As an important data analysis method in data mining, clustering analysis has been researched extensively and in depth. To address the limitation that the K-means clustering algorithm is sensitive to the distribution of the initial cluster centers, the Glowworm Swarm Optimization (GSO) algorithm is introduced to solve clustering problems. Firstly, this paper introduces the basic ideas of the GSO algorithm, the K-means algorithm, and the good-point set, and analyzes the feasibility of combining them for clustering optimization. Next, it designs a clustering method based on an improved GSO algorithm with a good-point set, which combines the GSO algorithm with the classical K-means algorithm: the improved GSO algorithm searches the data object space and provides initial cluster centers for the K-means algorithm, thus obtaining better clustering results. The major improvement to the GSO algorithm is to optimize the initial distribution of the glowworm swarm by introducing the theory and method of the good-point set. Finally, the new clustering algorithm is applied to UCI data sets of different categories and sizes for clustering tests. The advantages of the improved clustering algorithm in terms of sum of squared errors (SSE), clustering accuracy, and robustness are demonstrated through comparison and analysis.
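The good-point-set initialization can be sketched independently of the full GSO loop (luciferin update, probabilistic neighbor movement), which is omitted here. The construction below is one common variant of the good-point set; the paper may use a different one:

```python
import numpy as np

def good_point_set(n, d):
    """One common good-point-set construction: take the smallest prime
    p >= 2d + 3, set r_j = 2*cos(2*pi*j / p), and let the i-th point be
    frac(i * r). The points are low-discrepancy, i.e. spread evenly
    over the unit cube [0, 1)^d."""
    p = 2 * d + 3
    while any(p % q == 0 for q in range(2, int(p ** 0.5) + 1)):
        p += 1                                     # smallest prime >= 2d+3
    r = 2 * np.cos(2 * np.pi * np.arange(1, d + 1) / p)
    i = np.arange(1, n + 1)[:, None]
    return np.mod(i * r, 1.0)                      # n points in [0, 1)^d

def init_glowworms(X, n_glowworms):
    """Scale a good-point set to the data's bounding box, giving an
    evenly spread initial glowworm swarm (and hence better candidate
    initial centers for K-means) than uniform random placement."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return lo + good_point_set(n_glowworms, X.shape[1]) * (hi - lo)
```

The even spread is the point: with uniform random initialization, glowworms can cluster by chance and leave regions of the data space unexplored, which is exactly the sensitivity the paper sets out to reduce.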


2013 ◽  
Vol 3 (2) ◽  
pp. 58-77
Author(s):  
Marlene Goncalves ◽  
Maria-Esther Vidal

Criteria that induce a Skyline naturally represent a user's preference conditions, useful to discard irrelevant data in large datasets. However, in the presence of high-dimensional Skyline spaces, the size of the Skyline can still be very large, making it infeasible for users to process this set of points. To identify the best points among the Skyline, the Top-k Skyline approach has been proposed. Top-k Skyline uses discriminatory criteria to induce a total order on the points that comprise the Skyline, and recognizes the best or top-k points based on these criteria. In this article the authors model queries as multi-dimensional points that represent bounds of VPT (Vertically Partitioned Table) property values, and datasets as sets of multi-dimensional points; the problem is to locate the k best tuples in the dataset whose distance to the query is minimized. A tuple is among the k best tuples whenever there is no other tuple that is better in all dimensions and closer to the query point, i.e., the k best tuples correspond to the k nearest points to the query that are incomparable, or belong to the skyline. The authors name these tuples the k nearest neighbors in the skyline. The authors propose a hybrid approach that combines Skyline and Top-k solutions and develop two algorithms: TKSI and k-NNSkyline. The proposed algorithms identify, among the skyline tuples, the k ones with the lowest values of the distance metric, i.e., the k nearest neighbors to the multi-dimensional query that are incomparable. Empirically, the authors study the performance and quality of TKSI and k-NNSkyline. Their experimental results show that TKSI is able to speed up the computation of the Top-k Skyline by at least 50% with respect to state-of-the-art solutions whenever k is smaller than the size of the Skyline. Additionally, the results suggest that k-NNSkyline outperforms existing solutions by up to three orders of magnitude.
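A naive sketch of the k-nearest-neighbors-in-the-skyline semantics is given below, using an O(n^2) skyline test under minimization; TKSI and k-NNSkyline compute the same answer far more efficiently:

```python
import numpy as np

def skyline(P):
    """Naive O(n^2) skyline under minimization: keep the points not
    dominated by any other (dominated = worse-or-equal in every
    dimension and strictly worse in at least one)."""
    P = np.asarray(P, dtype=float)
    keep = []
    for i, p in enumerate(P):
        dominated = any((q <= p).all() and (q < p).any()
                        for j, q in enumerate(P) if j != i)
        if not dominated:
            keep.append(i)
    return np.array(keep)

def knn_skyline(P, query, k):
    """k nearest neighbors in the skyline: among the incomparable
    (Pareto-optimal) tuples, return the k with the smallest Euclidean
    distance to the query point. A naive sketch of the semantics behind
    TKSI / k-NNSkyline, not the paper's optimized algorithms."""
    P = np.asarray(P, dtype=float)
    idx = skyline(P)
    dist = np.linalg.norm(P[idx] - np.asarray(query, float), axis=1)
    return idx[np.argsort(dist)[:k]]
```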


2018 ◽  
Vol 37 (1) ◽  
pp. 71
Author(s):  
Eduardo Sant'Ana Da Silva ◽  
Anderson Santos ◽  
Helio Pedrini

Surface approximation plays an important role in several application fields, such as computer-aided design, computer graphics, remote sensing, computer vision, robotics, architecture, and manufacturing. A common problem in these areas is to develop efficient methods for generating, processing, analyzing, and visualizing large amounts of 3D data. Triangular meshes constitute a flexible representation of sampled points that are not regularly distributed in space, such that the model can be adaptively adjusted to the data density. The choice of metrics for building the triangular meshes is crucial for producing high-quality models. This paper proposes and evaluates different measures to incrementally refine a Delaunay triangular mesh for image surface approximation until either a certain accuracy is obtained or a maximum number of iterations is reached. Experiments on several data sets are performed to compare the quality of the resulting meshes.
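A greedy refinement loop of the kind the paper evaluates can be sketched with SciPy's Delaunay machinery via piecewise-linear interpolation; the error measure below is plain absolute intensity error, just one of the several measures a paper like this would compare:

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

def refine_mesh(img, tol=5.0, max_iter=500):
    """Greedy Delaunay refinement sketch for image surface approximation:
    start from the four image corners, repeatedly insert the pixel where
    the piecewise-linear approximation errs most, and stop when the
    maximum error drops below `tol` or `max_iter` insertions are reached.
    Rebuilds the triangulation each step for clarity; a real
    implementation would update it incrementally."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
    values = img.ravel().astype(float)
    pts = [(0, 0), (w - 1, 0), (0, h - 1), (w - 1, h - 1)]  # corners
    vals = [float(img[y, x]) for x, y in pts]
    for _ in range(max_iter):
        interp = LinearNDInterpolator(np.array(pts, dtype=float), vals)
        err = np.abs(interp(coords) - values)
        worst = np.nanargmax(err)              # pixel with largest error
        if err[worst] < tol:
            break                              # accuracy target reached
        x, y = coords[worst]
        pts.append((x, y))
        vals.append(float(img[int(y), int(x)]))
    return np.array(pts), np.array(vals)
```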


Author(s):  
Douglas L. Dorset

The quantitative use of electron diffraction intensity data for the determination of crystal structures represents the pioneering achievement in the electron crystallography of organic molecules, an effort largely begun by B. K. Vainshtein and his co-workers. However, despite numerous representative structure analyses yielding results consistent with X-ray determinations, this entire effort was viewed with considerable mistrust by many crystallographers. This was no doubt due to the rather high crystallographic R-factors reported for some structures and, more importantly, the failure to convince many skeptics that the measured intensity data were adequate for ab initio structure determinations. We have recently demonstrated the utility of these data sets for structure analyses by direct phase determination based on the probabilistic estimate of three- and four-phase structure invariant sums. Examples include the structure of diketopiperazine using Vainshtein's 3D data, a similar 3D analysis of the room-temperature structure of thiourea, and a zonal determination of the urea structure, the latter also based on data collected by the Moscow group.
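As a toy illustration of the three-phase (Sigma-2) invariants mentioned above: for strong normalized reflections, phi_h is approximately phi_k + phi_(h-k), with reliability growing with |E_h E_k E_(h-k)|. The sketch below takes the weighted circular mean over already-phased pairs; it illustrates the relation only and is not the authors' phasing procedure:

```python
import numpy as np

def triplet_phase_estimate(E, phases, h, n_atoms):
    """Toy Sigma-2 estimate of a phase from three-phase structure
    invariants: phi_h is approximated by the A-weighted circular mean
    of phi_k + phi_(h-k) over pairs of already-phased reflections,
    with reliability A = (2 / sqrt(N)) |E_h E_k E_(h-k)|. `E` maps
    Miller-index tuples (including h) to normalized structure-factor
    magnitudes; `phases` maps known reflections to phases in radians."""
    z = 0j
    for k, phi_k in phases.items():
        hk = tuple(h_i - k_i for h_i, k_i in zip(h, k))
        if hk in phases and hk in E and k in E:
            A = 2.0 / np.sqrt(n_atoms) * abs(E[h] * E[k] * E[hk])
            z += A * np.exp(1j * (phi_k + phases[hk]))  # weighted vote
    return np.angle(z)        # consensus phase estimate for reflection h
```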


2003 ◽  
Vol 42 (05) ◽  
pp. 215-219
Author(s):  
G. Platsch ◽  
A. Schwarz ◽  
K. Schmiedehausen ◽  
B. Tomandl ◽  
W. Huk ◽  
...  

Summary: Aim: Although the fusion of images from different modalities may improve diagnostic accuracy, it is rarely used in clinical routine work due to logistic problems. Therefore, we evaluated the performance of, and the time needed for, fusing MRI and SPECT images using semiautomated dedicated software. Patients, material and method: In 32 patients, regional cerebral blood flow was measured using 99mTc ethyl cysteinate dimer (ECD) and the three-headed SPECT camera MultiSPECT 3. MRI scans of the brain were performed using either a 0.2 T Open or a 1.5 T Sonata scanner. Twelve of the MRI data sets were acquired using a 3D T1-weighted MPRAGE sequence, 20 with a 2D acquisition technique and different echo sequences. Image fusion was performed on a Syngo workstation using an entropy-minimizing algorithm, operated by an experienced user of the software. The fusion results were classified. We measured the time needed for the automated fusion procedure and, where necessary, that for manual realignment after automated but insufficient fusion. Results: The mean time of the automated fusion procedure was 123 s; it was significantly shorter for the 2D than for the 3D MRI data sets. For four of the 2D data sets and two of the 3D data sets an optimal fit was reached using the automated approach. The remaining 26 data sets required manual correction. The sum of the time required for automated fusion and that needed for manual correction averaged 320 s (50-886 s). Conclusion: The fusion of 3D MRI data sets took significantly longer than that of the 2D MRI data. The automated fusion tool delivered an optimal fit in 20% of cases; in 80%, manual correction was necessary. Nevertheless, each of the 32 SPECT data sets could be merged with the corresponding MRI data in less than 15 min, which seems acceptable for clinical routine use.
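The quantity such an entropy-minimizing alignment drives down can be sketched directly: the joint entropy of the two images' intensity histogram (equivalently, mutual information is maximized). The sketch below computes the objective only; the rigid-transform optimizer of the Syngo tool is not reproduced:

```python
import numpy as np

def joint_entropy(a, b, bins=32):
    """Joint Shannon entropy of two aligned images' intensities.
    Registration by entropy minimization searches for the transform of
    `b` that minimizes this value; well-aligned images have a sharply
    peaked joint histogram and hence low joint entropy."""
    hist, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]                                  # avoid log(0)
    return -(p * np.log2(p)).sum()

def mutual_information(a, b, bins=32):
    """MI(a, b) = H(a) + H(b) - H(a, b); higher means better alignment."""
    def entropy(x):
        h, _ = np.histogram(x.ravel(), bins=bins)
        p = h / h.sum()
        p = p[p > 0]
        return -(p * np.log2(p)).sum()
    return entropy(a) + entropy(b) - joint_entropy(a, b, bins)
```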


2018 ◽  
Vol 26 (2) ◽  
pp. 131-143
Author(s):  
Marlinawati Marlinawati ◽  
Dewi Kusuma Wardani

The purpose of this research is to examine the influence of the quality of human resources, the utilization of information technology, and the internal control system on the timeliness of village government financial reporting in Gunungkidul Regency. This is causative research. The population is the village governments in Gunungkidul Regency, especially in Gedangsari subdistrict; the respondents were village heads and village apparatus. We used a questionnaire to collect data and multiple regression with SPSS version 16.0 to analyze it. We find that the quality of human resources and the internal control system have a positive influence on the timeliness of village government financial reporting. On the other hand, the utilization of information technology does not influence the timeliness of village government financial reporting. This implies that the quality of human resources and the internal control system can speed up the preparation of village government financial reporting.

