New K-means Clustering Method Using Minkowski’s Distance as its Metric

Cluster analysis is an unsupervised learning method that classifies data points, usually multidimensional, into groups (called clusters) such that members of one cluster are more similar (in some sense) to each other than to those in other clusters. In this paper, we propose a new k-means clustering method that uses Minkowski’s distance, a generalization of both the Euclidean distance and the Manhattan distance, as its metric in a normed vector space. The existing k-means clustering methods discussed in this paper are Forgy’s method, Lloyd’s method, MacQueen’s method, Hartigan and Wong’s method, Likas’ method and Faber’s method, all of which use the usual Euclidean distance. On simulated and real-life data sets, the new k-means clustering method performed favourably in comparison with the existing methods in terms of minimizing the total intra-cluster variance.
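
As a rough illustration of the idea, the sketch below runs a Lloyd-style k-means loop in which the assignment step uses the Minkowski distance of order p (p = 1 gives the Manhattan distance, p = 2 the Euclidean distance). It is not the algorithm from the paper: the function names, the Forgy-style initialisation, and the use of the coordinate-wise mean as the center update are assumptions made for this example, and the mean is only the exact minimiser when p = 2.

```python
import random


def minkowski_distance(x, y, p=2.0):
    """Minkowski distance of order p >= 1 between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def kmeans_minkowski(points, k, p=2.0, n_iter=100, seed=0):
    """Lloyd-style k-means using the Minkowski distance in the assignment step."""
    rng = random.Random(seed)
    # Forgy-style initialisation (assumption): k distinct data points as centers.
    centers = [list(c) for c in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: index of the nearest center under the Minkowski distance.
        labels = [
            min(range(k), key=lambda j: minkowski_distance(pt, centers[j], p))
            for pt in points
        ]
        # Update step: coordinate-wise mean of each cluster's members.
        new_centers = []
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # leave an empty cluster's center alone
        if new_centers == centers:
            break  # centers did not move, so the assignment is stable
        centers = new_centers
    return centers, labels
```

Calling `kmeans_minkowski(points, k=2, p=1.0)` on a list of coordinate tuples would cluster with the Manhattan distance; changing `p` changes only the geometry of the assignment step.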

CLUSTERING USING SIMULATED ANNEALING WITH PROBABILISTIC REDISTRIBUTION

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001401000927 ◽

2001 ◽

Vol 15 (02) ◽

pp. 269-285 ◽

Cited By ~ 54

Author(s):

SANGHAMITRA BANDYOPADHYAY ◽

UJJWAL MAULIK ◽

MALAY KUMAR PAKHIRA

Keyword(s):

Simulated Annealing ◽

Clustering Algorithm ◽

Minimum Energy ◽

Real Life ◽

Feature Space ◽

Cluster Center ◽

Data Sets ◽

Partitional Clustering ◽

Real Life Data ◽

Data Points

This article proposes an efficient partitional clustering technique, called SAKM-clustering, that integrates the power of simulated annealing for obtaining a minimum energy configuration with the searching capability of the K-means algorithm. The clustering methodology is used to search for appropriate clusters in multidimensional feature space such that a similarity metric of the resulting clusters is optimized. Data points are redistributed among the clusters probabilistically, so that points that are farther away from the cluster center have higher probabilities of migrating to other clusters than those which are closer to it. The superiority of the SAKM-clustering algorithm over the widely used K-means algorithm is extensively demonstrated for artificial and real life data sets.
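
The probabilistic redistribution step can be pictured with the minimal sketch below. The abstract does not give the migration formula, so the Metropolis-style rule used here, `exp(-(d_other - d_own) / temperature)`, is an assumption chosen only because it has the stated property: the farther a point lies from its own center, the more likely it is to migrate. The names `migration_probability` and `redistribute` are hypothetical.

```python
import math
import random


def migration_probability(d_own, d_other, temperature):
    """Assumed rule: certain move if another center is closer, else a
    probability that decays with the extra distance and grows with temperature."""
    gap = max(d_other - d_own, 0.0)
    return math.exp(-gap / temperature)


def redistribute(points, centers, labels, temperature, rng):
    """One redistribution pass over all points (needs at least two centers)."""
    new_labels = []
    for pt, own in zip(points, labels):
        dists = [math.dist(pt, c) for c in centers]
        # Nearest center other than the one the point currently belongs to.
        other = min(
            (j for j in range(len(centers)) if j != own),
            key=dists.__getitem__,
        )
        prob = migration_probability(dists[own], dists[other], temperature)
        new_labels.append(other if rng.random() < prob else own)
    return new_labels
```

In an annealing loop one would call `redistribute` repeatedly while lowering `temperature`; as the temperature approaches zero the pass reduces to the ordinary nearest-center assignment of K-means.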

Hierarchical clustering with concave data sets

Advances in Methodology and Statistics ◽

10.51936/mylp9878 ◽

2005 ◽

Vol 2 (2) ◽

Author(s):

Matej Francetič ◽

Mateja Nagode ◽

Bojan Nastav

Keyword(s):

Hierarchical Clustering ◽

Data Structures ◽

Real Life ◽

Data Sets ◽

Clustering Methods ◽

Cluster Membership ◽

Real Life Data ◽

Essential Knowledge ◽

Bootstrap Application ◽

Simple Convex

Clustering methods are among the most widely used methods in multivariate analysis. Two main groups of clustering methods can be distinguished: hierarchical and non-hierarchical. Due to the nature of the problem examined, this paper focuses on hierarchical methods such as the nearest neighbour, the furthest neighbour, Ward's method, between-groups linkage, within-groups linkage, centroid and median clustering. The goal is to assess the performance of different clustering methods on concave sets of data, and to determine for which types of data structures these methods can reveal and correctly assign group membership. The simulations were run in two- and three-dimensional space. Each of the two original shapes was further modified by using different standard deviations of points around the skeleton. In this manner various shapes of sets with different inter-cluster distances were generated. Generating the data sets provides the essential knowledge of cluster membership for comparing the clustering methods' performances. The conclusions are important and interesting, since real life data seldom follow a simple convex-shaped structure, but they need further work, such as the bootstrap application, the inclusion of dendrogram-based analysis, or other data structures. This paper can therefore serve as a basis for further study of hierarchical clustering performance with concave sets.
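
To make the concave-data setting concrete, the sketch below applies nearest-neighbour (single-linkage) agglomerative clustering to two concentric rings, a shape that no convex partition can separate. The implementation merges clusters in order of increasing pairwise distance using a union-find structure; the ring generator and all names are assumptions for this example and do not reproduce the paper's simulation design.

```python
import math
from itertools import combinations


def single_linkage(points, n_clusters):
    """Nearest-neighbour agglomerative clustering, stopped at n_clusters."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Every pair of points, sorted by Euclidean distance (shortest first).
    edges = sorted(
        combinations(range(len(points)), 2),
        key=lambda e: math.dist(points[e[0]], points[e[1]]),
    )
    remaining = len(points)
    for a, b in edges:
        if remaining <= n_clusters:
            break
        ra, rb = find(a), find(b)
        if ra != rb:  # merge the two clusters joined by the shortest link
            parent[ra] = rb
            remaining -= 1
    roots = [find(i) for i in range(len(points))]
    relabel = {r: n for n, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]


def ring(radius, n):
    """n evenly spaced points on a circle of the given radius."""
    return [
        (radius * math.cos(2 * math.pi * t / n), radius * math.sin(2 * math.pi * t / n))
        for t in range(n)
    ]


points = ring(1.0, 20) + ring(3.0, 60)
labels = single_linkage(points, n_clusters=2)
```

Because neighbouring points on each ring are much closer to each other than the two rings are to one another, the nearest-neighbour criterion chains along each ring before it ever links the two. Criteria that favour compact clusters, such as Ward's method, may behave differently on such shapes.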

Rough ISODATA Algorithm

International Journal of Fuzzy System Applications ◽

10.4018/ijfsa.2013100101 ◽

2013 ◽

Vol 3 (4) ◽

pp. 1-14 ◽

Cited By ~ 2

Author(s):

S. Sampath ◽

B. Ramya

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Real Life ◽

Vital Role ◽

Data Sets ◽

Clustering Method ◽

Data Set ◽

Number Of Clusters ◽

Real Life Data ◽

Nonparametric Statistical

Cluster analysis is a branch of data mining, which plays a vital role in bringing out hidden information in databases. Clustering algorithms help medical researchers in identifying the presence of natural subgroups in a data set. Different types of clustering algorithms are available in the literature. The most popular among them is k-means clustering. Even though k-means clustering is popular and widely used, its application requires knowledge of the number of clusters present in the given data set. Several solutions are available in the literature to overcome this limitation. The k-means clustering method creates a disjoint and exhaustive partition of the data set. However, in some situations one can come across objects that belong to more than one cluster. In this paper, a clustering algorithm is proposed that is capable of producing rough clusters automatically, without requiring the user to supply the number of clusters to be produced. The efficiency of the algorithm in detecting the number of clusters present in the data set has been studied with the help of some real life data sets. Further, a nonparametric statistical analysis of the results of the experimental study has been carried out, with the help of a rough version of the Davies-Bouldin index, in order to analyze the efficiency of the proposed algorithm in automatically detecting the number of clusters in the data set.
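
For orientation, the sketch below computes the standard (crisp) Davies-Bouldin index, which averages, over clusters, the worst ratio of within-cluster scatter to between-centroid distance; lower values indicate more compact, better separated clusters. The rough version used in the paper is not reproduced here, and the function name is an assumption for this example.

```python
import math


def davies_bouldin(points, labels):
    """Standard Davies-Bouldin index for a hard partition.

    Requires at least two clusters with distinct centroids.
    """
    clusters = {}
    for pt, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(pt)
    ids = sorted(clusters)
    # Centroid of each cluster (coordinate-wise mean).
    centroids = {
        c: [sum(col) / len(clusters[c]) for col in zip(*clusters[c])] for c in ids
    }
    # Scatter: mean distance of a cluster's points to its centroid.
    scatter = {
        c: sum(math.dist(pt, centroids[c]) for pt in clusters[c]) / len(clusters[c])
        for c in ids
    }
    # For each cluster, the worst (largest) similarity ratio to any other cluster.
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in ids
            if j != i
        )
        for i in ids
    ]
    return sum(worst) / len(worst)
```

One common way to use such an index for choosing the number of clusters is to compute it for several candidate partitions and prefer the partition with the smallest value.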

A Generalized Modification of the Kumaraswamy Distribution for Modeling and Analyzing Real-Life Data

Statistics Optimization & Information Computing ◽

10.19139/soic-2310-5070-869 ◽

2020 ◽

Vol 8 (2) ◽

pp. 521-548

Author(s):

Rafid Alshkaki

Keyword(s):

Real Life ◽

Estimation Method ◽

Simulated Data ◽

Likelihood Estimation ◽

Data Sets ◽

Kumaraswamy Distribution ◽

Life Data ◽

Real Life Data ◽

Beta Power ◽

Special Cases

In this paper, a generalized modification of the Kumaraswamy distribution is proposed, and its distributional and characterizing properties are studied. This distribution is closed under scaling and exponentiation, and has some well-known distributions as special cases, such as the generalized uniform, triangular, beta, power function, Minimax, and some other Kumaraswamy related distributions. The moment generating function, the Lorenz and Bonferroni curves, and its moments, consisting of the mean, variance, moments about the origin, and harmonic, incomplete, probability weighted, L, and trimmed L moments, are derived. The maximum likelihood estimation method is used for estimating its parameters and is applied to six different simulated data sets from this distribution, in order to check the performance of the estimation method through the mean squared errors of the estimated parameters computed for the different simulated sample sizes. Finally, four real-life data sets are used to illustrate the usefulness and the flexibility of this distribution in application to real-life data.
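
As background, the sketch below implements the standard two-parameter Kumaraswamy distribution on (0, 1), whose density is a·b·x^(a-1)·(1 - x^a)^(b-1) and whose distribution function is 1 - (1 - x^a)^b. The generalized modification proposed in the paper is not given in the abstract and is not reproduced here; the function names are assumptions for this example. Simulated data sets of the kind mentioned above can be drawn by inverting the distribution function.

```python
import random


def kumaraswamy_pdf(x, a, b):
    """Density of the standard Kumaraswamy(a, b) distribution, 0 < x < 1."""
    return a * b * x ** (a - 1) * (1 - x ** a) ** (b - 1)


def kumaraswamy_cdf(x, a, b):
    """Distribution function of the standard Kumaraswamy(a, b) distribution."""
    return 1 - (1 - x ** a) ** b


def kumaraswamy_quantile(u, a, b):
    """Inverse of the distribution function, for 0 < u < 1."""
    return (1 - (1 - u) ** (1 / b)) ** (1 / a)


def kumaraswamy_sample(n, a, b, seed=0):
    """Draw n values by inverse-transform sampling."""
    rng = random.Random(seed)
    return [kumaraswamy_quantile(rng.random(), a, b) for _ in range(n)]
```

With a = b = 1 the density is constant and the distribution reduces to the uniform distribution on (0, 1).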

On modified Kies distribution and its applications

10.47302/jsr.2017510103 ◽

2017 ◽

Vol 51 (1) ◽

pp. 41-60

Author(s):

C. SATHEESH KUMAR ◽

S. H. S. DHARMAJA

Keyword(s):

Maximum Likelihood ◽

Hazard Function ◽

Real Life ◽

Simulated Data ◽

Maximum Likelihood Estimators ◽

Data Sets ◽

Life Data ◽

Real Life Data ◽

Method Of Maximum Likelihood ◽

Simulated Data Sets

In this paper, we consider a class of distributions with bathtub-shaped hazard functions, obtained by modifying the Kies distribution, and investigate some of its important properties by deriving expressions for its percentile function, raw moments, stress-strength reliability measure, etc. The parameters of the distribution are estimated by the method of maximum likelihood, and some of its reliability applications are discussed with the help of certain real life data sets. In addition, the asymptotic behavior of the maximum likelihood estimators of the parameters of the distribution is examined by using simulated data sets.

An Empirical Study for the Estimation of Autoregressive Hilbertian Processes by Wavelet Packet Method

Nonlinear Analysis Modelling and Control ◽

10.15388/na.2007.12.1.14722 ◽

2007 ◽

Vol 12 (1) ◽

pp. 65-75 ◽

Cited By ~ 1

Author(s):

A. Laukaitis

Keyword(s):

Wavelet Packet ◽

Real Life ◽

Simulated Data ◽

Partial Sums ◽

Data Sets ◽

Time Prediction ◽

Real Life Data ◽

Partial Sums Processes ◽

Operator Kernel ◽

Autoregressive Hilbertian Processes

In this paper, wavelet packet bases are used to estimate the operator of autoregressive Hilbertian processes. We assume that the integral operator kernel can have some singular structures and estimate them by projecting functional processes on suitable bases. Linear methods for continuous-time prediction using Hilbert-valued autoregressive processes are compared with the suggested method on simulated data and on real-life data sets. Statistics of residual partial sums processes and ex post prediction are used to check the model.

Distance Learning in Discriminative Vector Quantization

Neural Computation ◽

10.1162/neco.2009.10-08-892 ◽

2009 ◽

Vol 21 (10) ◽

pp. 2942-2969 ◽

Cited By ~ 53

Author(s):

Petra Schneider ◽

Michael Biehl ◽

Barbara Hammer

Keyword(s):

Vector Quantization ◽

Euclidean Distance ◽

Distance Measure ◽

Real Life ◽

Data Driven ◽

Data Sets ◽

Life Data ◽

Real Life Data ◽

Metric Structures ◽

The Given

Discriminative vector quantization schemes such as learning vector quantization (LVQ) and extensions thereof offer efficient and intuitive classifiers based on the representation of classes by prototypes. The original methods, however, rely on the Euclidean distance corresponding to the assumption that the data can be represented by isotropic clusters. For this reason, extensions of the methods to more general metric structures have been proposed, such as relevance adaptation in generalized LVQ (GLVQ) and matrix learning in GLVQ. In these approaches, metric parameters are learned based on the given classification task such that a data-driven distance measure is found. In this letter, we consider full matrix adaptation in advanced LVQ schemes. In particular, we introduce matrix learning to a recent statistical formalization of LVQ, robust soft LVQ, and we compare the results on several artificial and real-life data sets to matrix learning in GLVQ, a derivation of LVQ-like learning based on a (heuristic) cost function. In all cases, matrix adaptation allows a significant improvement of the classification accuracy. Interestingly, however, the principled behavior of the models with respect to prototype locations and extracted matrix dimensions shows several characteristic differences depending on the data sets.
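
The move from the Euclidean distance to a learned metric can be sketched as follows. In matrix-learning variants of LVQ, the distance between an input x and a prototype w is commonly written as (x - w)^T Ω^T Ω (x - w), where the matrix Ω is adapted during training. The code below shows only this distance and nearest-prototype classification with a fixed Ω; how Ω and the prototypes are learned is not shown, and the function names and data layout are assumptions for this example.

```python
def adaptive_distance(x, w, omega):
    """Squared distance (x - w)^T Omega^T Omega (x - w).

    omega is a list of rows; the identity matrix gives the squared
    Euclidean distance.
    """
    diff = [a - b for a, b in zip(x, w)]
    # Apply the linear map Omega to the difference vector.
    projected = [sum(r * d for r, d in zip(row, diff)) for row in omega]
    return sum(v * v for v in projected)


def classify(x, prototypes, omega):
    """Label of the nearest prototype; prototypes is a list of (vector, label)."""
    return min(prototypes, key=lambda pl: adaptive_distance(x, pl[0], omega))[1]


prototypes = [((0.0, 0.0), "a"), ((1.0, 10.0), "b")]
identity = [[1.0, 0.0], [0.0, 1.0]]
first_axis_only = [[1.0, 0.0], [0.0, 0.0]]  # a metric that ignores the second feature
```

With `identity`, the point (0.9, 0.0) is closest to prototype "a"; with `first_axis_only`, the second feature no longer contributes and the same point is assigned to "b". This is the sense in which a data-driven matrix can change which features matter for classification.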

New Weighted Lomax (NWL) Distribution with Applications to Real and Simulated Data

Mathematical Problems in Engineering ◽

10.1155/2021/8558118 ◽

2021 ◽

Vol 2021 ◽

pp. 1-12

Author(s):

Huda M. Alshanbari ◽

Muhammad Ijaz ◽

Syed Muhammad Asim ◽

Abd Al-Aziz Hosni El-Bagoury ◽

Javid Gani Dar

Keyword(s):

Probability Distribution ◽

Hazard Rate ◽

Probability Distributions ◽

Real Life ◽

Simulated Data ◽

Rate Function ◽

Data Sets ◽

Unknown Parameters ◽

Life Data ◽

Real Life Data

The rationale of the paper is to present a new probability distribution that can model both monotonic and nonmonotonic hazard rate shapes and that is more flexible than other probability distributions available in the literature. The proposed probability distribution is called the New Weighted Lomax (NWL) distribution. Various statistical properties have been studied, including the estimation of the unknown parameters. To achieve the basic objectives, applications of NWL are presented by means of two real-life data sets as well as a simulated data set. It is verified that NWL performs better for both monotonic and nonmonotonic hazard rate functions than the Lomax (L), Power Lomax (PL), Exponential Lomax (EL), and Weibull Lomax (WL) distributions.

A simple clustering technique to extract subsets of data for function approximation

Journal of Hydroinformatics ◽

10.2166/hydro.2015.065 ◽

2015 ◽

Vol 17 (5) ◽

pp. 719-732

Author(s):

Dulakshi Santhusitha Kumari Karunasingha ◽

Shie-Yui Liong

Keyword(s):

Function Approximation ◽

Prediction Models ◽

Data Extraction ◽

Single Parameter ◽

Subtractive Clustering ◽

Data Sets ◽

Clustering Methods ◽

Clustering Method ◽

Data Set ◽

Functional Relationships

A simple clustering method is proposed for extracting representative subsets from lengthy data sets. The main purpose of the extracted subset of data is to build prediction models (in the form of approximating functional relationships) without using the entire large data set. Such smaller subsets of data are often required in the exploratory analysis stages of studies that involve resource-consuming investigations. A few recent studies have used a subtractive clustering method (SCM) for such data extraction, in the absence of clustering methods for function approximation. SCM, however, requires several parameters to be specified. This study proposes a clustering method that requires only a single parameter to be specified, yet is shown to be as effective as the SCM. A method to find suitable values for the parameter is also proposed. Because it has only a single parameter, the proposed clustering method is shown to be orders of magnitude more efficient to use than SCM. The effectiveness of the proposed method is demonstrated on phase space prediction of three univariate time series and prediction of two multivariate data sets. Some drawbacks of SCM when applied for data extraction are identified, and the proposed method is shown to be a solution for them.

Periodic Streaming Data Reduction Using Flexible Adjustment of Time Section Size

Data Warehousing and Mining ◽

10.4018/978-1-59904-951-9.ch070 ◽

2008 ◽

pp. 1231-1249

Author(s):

Jaehoon Kim ◽

Seong Park

Keyword(s):

Data Stream ◽

Estimation Error ◽

Real Life ◽

Streaming Data ◽

Data Sets ◽

Storage Allocation ◽

Time Section ◽

Proper Size ◽

Real Life Data ◽

Past Data

Much of the research regarding streaming data has focused only on real-time querying and analysis of recent data streams that fit in memory. However, as data stream mining, or tracking of past data streams, is often required, it becomes necessary to store large volumes of streaming data in stable storage. Moreover, as stable storage has restricted capacity, past data streams must be summarized. The summarization must be performed periodically because streaming data flows continuously, quickly, and endlessly. Therefore, in this paper, we propose an efficient periodic summarization method with flexible storage allocation. It improves the overall estimation error by flexibly adjusting the size of the summarized data of each local time section. Additionally, as the processing overhead of compression and the disk I/O cost of decompression can be important factors for quick summarization, we also consider setting the proper size of the data stream to be summarized at a time. Experimental results with artificial data sets as well as real life data show that our flexible approach is more efficient than the existing fixed approach.