A Modified MinMax k-Means Algorithm Based on PSO

2016 ◽  
Vol 2016 ◽  
pp. 1-13 ◽  
Author(s):  
Xiaoyan Wang ◽  
Yanping Bai

The MinMax k-means algorithm is widely used to tackle the effect of bad initialization by minimizing the maximum intra-cluster error. Two parameters, the exponent parameter and the memory parameter, are involved in its execution. Since different parameter values lead to different clustering errors, choosing appropriate values is crucial. The original algorithm provides a practical framework that extends MinMax k-means to adapt the exponent parameter to the data set automatically. It has been believed that, once the maximum exponent parameter is set, the program can reach the lowest intra-cluster error. However, our experiments show that this is not always the case. In this paper, we modify the MinMax k-means algorithm with PSO to determine parameter values that allow the algorithm to attain the lowest clustering error. The proposed clustering method is tested on several popular data sets under different initializations and is compared with the k-means algorithm and the original MinMax k-means algorithm. The experimental results indicate that our proposed algorithm reaches the lowest clustering error automatically.
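
As a hedged illustration of the underlying objective (not the authors' PSO-modified version), a minimal NumPy sketch of MinMax k-means with exponent parameter p and memory parameter beta might look as follows; the function name and defaults are illustrative assumptions:

```python
import numpy as np

def minmax_kmeans(X, k, p=0.5, beta=0.3, n_iter=100, seed=0):
    """Sketch of MinMax k-means: cluster weights penalize large
    intra-cluster errors so the maximum error is driven down.
    p is the exponent parameter, beta the memory parameter."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    w = np.full(k, 1.0 / k)                      # cluster weights
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = (w ** p * d2).argmin(axis=1)    # weighted assignment
        E = np.array([d2[labels == j, j].sum() for j in range(k)])
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
        w_new = (E + 1e-12) ** (1.0 / (1.0 - p)) # larger error -> larger weight
        w = beta * w + (1.0 - beta) * w_new / w_new.sum()
    return labels, centers, E
```

In this reading, the paper's PSO component would search over (p, beta) pairs, scoring each candidate by the resulting clustering error.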

2015 ◽  
Vol 17 (5) ◽  
pp. 719-732
Author(s):  
Dulakshi Santhusitha Kumari Karunasingha ◽  
Shie-Yui Liong

A simple clustering method is proposed for extracting representative subsets from lengthy data sets. The main purpose of the extracted subset is to build prediction models (in the form of approximating functional relationships) instead of using the entire large data set. Such smaller subsets are often required in the exploratory stages of studies that involve resource-consuming investigations. A few recent studies have used a subtractive clustering method (SCM) for such data extraction, in the absence of clustering methods designed for function approximation. SCM, however, requires several parameters to be specified. This study proposes a clustering method that requires only a single parameter to be specified, yet is shown to be as effective as SCM. A method to find suitable values for this parameter is also proposed. Owing to its single parameter, the proposed clustering method is shown to be orders of magnitude more efficient to use than SCM. The effectiveness of the proposed method is demonstrated on phase-space prediction of three univariate time series and on prediction of two multivariate data sets. Some drawbacks of SCM when applied to data extraction are identified, and the proposed method is shown to resolve them.
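
The abstract does not reproduce the authors' exact procedure; the following leader-style sketch only conveys the general idea of subset extraction controlled by a single radius parameter r (an assumed stand-in for the paper's parameter):

```python
import numpy as np

def extract_representatives(X, r):
    """Single-parameter subset extraction (illustrative stand-in):
    keep a point only if it lies farther than r from every point
    already kept, so r alone controls the subset's size/coverage."""
    reps = [X[0]]
    for x in X[1:]:
        if min(np.linalg.norm(x - p) for p in reps) > r:
            reps.append(x)
    return np.array(reps)
```

A single pass like this costs O(n·m) for m representatives, which hints at why a one-parameter scheme can be far cheaper than SCM's potential-based iterations.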


Author(s):  
V. Suresh Babu ◽  
P. Viswanath ◽  
Narasimha M. Murty

Non-parametric methods like the nearest neighbor classifier (NNC) and Parzen-window based density estimation (Duda, Hart & Stork, 2000) are more general than parametric methods because they make no assumptions about the form of the probability distribution, and they show good performance in practice with large data sets. These methods, either explicitly or implicitly, estimate the probability density at a given point in feature space by counting the number of points that fall in a small region around it. Popular classifiers that use this approach are the NNC and its variants like the k-nearest neighbor classifier (k-NNC) (Duda, Hart & Stork, 2000), while DBSCAN is a popular density-based clustering method (Han & Kamber, 2001) that uses the same approach. The asymptotic error rate of NNC is less than twice the Bayes error (Cover & Hart, 1967), and DBSCAN can find arbitrarily shaped clusters along with detecting noisy outliers (Ester, Kriegel & Xu, 1996).

The most prominent difficulty in applying non-parametric methods to large data sets is their computational burden. The space and classification-time complexities of NNC and k-NNC are O(n), where n is the training set size, and the time complexity of DBSCAN is O(n^2), so these methods do not scale to large data sets. Some remedies to reduce this burden are as follows. (1) Reduce the training set size by editing techniques that eliminate training patterns which are redundant in some sense (Dasarathy, 1991); the condensed NNC (Hart, 1968) is of this type. (2) Use only a few selected prototypes from the data set; the Leaders-subleaders method and the l-DBSCAN method are of this type (Vijaya, Murthy & Subramanian, 2004; Viswanath & Rajwala, 2006). These two remedies reduce the computational burden, but they can also degrade the method's performance. Using enriched prototypes can improve the performance, as done in (Asharaf & Murthy, 2003), where the prototypes are derived using adaptive rough-fuzzy set theory, and in (Suresh Babu & Viswanath, 2007), where the prototypes are used along with their relative weights.

Prototypes can be derived by employing a clustering method like the leaders method (Spath, 1980) or the k-means method (Jain, Dubes & Chen, 1987), which finds a partition of the data set in which each block (cluster) is represented by a prototype called a leader, centroid, etc. But such prototypes cannot be used to estimate the probability density, since the density information present in the data set is lost while deriving them. The chapter proposes a modified leader clustering method, called the counted-leader method, which, along with deriving the leaders, preserves the crucial density information in the form of a count that can be used in estimating densities. The chapter presents a fast and efficient nearest-prototype classifier, the counted k-nearest leader classifier (ck-NLC), which is on par with the conventional k-NNC but considerably faster. The chapter also presents a density-based clustering method, l-DBSCAN, which is shown to be a faster and scalable version of DBSCAN (Viswanath & Rajwala, 2006). Formally, under some assumptions, it is shown that the number of leaders is upper-bounded by a constant that is independent of the data set size and of the distribution from which the data set is drawn.
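
A minimal sketch of the counted-leader idea and the count-weighted vote, reconstructed from the description above (the distance threshold tau and the assumption that each leader carries a class label are illustrative):

```python
import numpy as np
from collections import Counter

def counted_leaders(X, tau):
    """Leader clustering that also records how many points each
    leader absorbs; the counts preserve the density information
    that plain leader clustering discards."""
    leaders, counts = [X[0]], [1]
    for x in X[1:]:
        d = [np.linalg.norm(x - l) for l in leaders]
        j = int(np.argmin(d))
        if d[j] <= tau:
            counts[j] += 1                 # absorb point, keep its density
        else:
            leaders.append(x)
            counts.append(1)
    return np.array(leaders), np.array(counts)

def ck_nlc(x, leaders, counts, labels, k=5):
    """Counted k-nearest leader classifier (sketch): vote among the
    k nearest leaders, weighting each vote by its stored count."""
    d = np.linalg.norm(leaders - x, axis=1)
    votes = Counter()
    for j in np.argsort(d)[:k]:
        votes[labels[j]] += counts[j]
    return votes.most_common(1)[0][0]
```

Since classification touches only the leaders rather than all n training points, the cost drops from O(n) per query to O(m) for m leaders, which matches the chapter's bounded-leader-count argument.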


Robotics ◽  
2019 ◽  
Vol 8 (3) ◽  
pp. 58
Author(s):  
Yusuke Adachi ◽  
Masahide Ito ◽  
Tadashi Naruse

This paper addresses a strategy learning problem in the RoboCupSoccer Small Size League (SSL). We propose a novel method based on action sequences to cluster an opponent's strategies online. The proposed method consists of three steps: (1) extracting typical actions from geometric data to form action sequences, (2) calculating the dissimilarity between the sequences, and (3) clustering the sequences using that dissimilarity. Handling action sequences instead of raw geometric data reduces the amount of data used in the clustering process and makes actions easier to search. As a result, the proposed clustering method is feasible online and is also applicable to countering an opponent's strategy. The effectiveness of the proposed method was validated by experimental results.
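
The paper's specific dissimilarity measure is not given in this abstract; a common choice for symbolic action sequences is edit distance followed by hierarchical clustering on the precomputed dissimilarity matrix, sketched here with hypothetical action names:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def edit_distance(a, b):
    """Levenshtein distance between two action sequences."""
    m, n = len(a), len(b)
    D = np.zeros((m + 1, n + 1), dtype=int)
    D[:, 0] = np.arange(m + 1)
    D[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i, j] = min(D[i - 1, j] + 1, D[i, j - 1] + 1,
                          D[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return D[m, n]

# Hypothetical opponent action sequences extracted from geometric data.
seqs = [["pass", "shoot"], ["pass", "dribble", "shoot"], ["clear"]]
n = len(seqs)
d = np.array([[edit_distance(seqs[i], seqs[j]) for j in range(n)]
              for i in range(n)], dtype=float)

# Average-linkage clustering on the condensed dissimilarity matrix.
clusters = fcluster(linkage(squareform(d), method="average"),
                    t=2, criterion="maxclust")
```

Because each sequence is a handful of symbols rather than thousands of raw coordinates, the dissimilarity matrix stays small enough to update online between plays.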


2013 ◽  
Vol 3 (4) ◽  
pp. 1-14 ◽  
Author(s):  
S. Sampath ◽  
B. Ramya

Cluster analysis is a branch of data mining that plays a vital role in bringing out hidden information in databases. Clustering algorithms help medical researchers identify natural subgroups in a data set. Different types of clustering algorithms are available in the literature, the most popular among them being k-means. Although k-means clustering is widely used, its application requires knowledge of the number of clusters present in the given data set; several solutions are available in the literature to overcome this limitation. The k-means method also creates a disjoint and exhaustive partition of the data set, whereas in some situations one can come across objects that belong to more than one cluster. In this paper, a clustering algorithm is proposed that produces rough clusters automatically, without requiring the user to supply the number of clusters as input. The efficiency of the algorithm in detecting the number of clusters present in the data set has been studied on some real-life data sets. Further, a nonparametric statistical analysis of the experimental results has been carried out to assess the efficiency of the proposed algorithm in automatically detecting the number of clusters, with the help of a rough version of the Davies-Bouldin index.
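
Two ingredients of the description can be sketched generically: choosing the number of clusters by a Davies-Bouldin criterion, and allowing overlapping (rough) memberships. The sketch below uses the classical index and a simple distance-ratio rule, not the paper's rough-set formulation; the threshold eps is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def pick_k(X, k_max=10):
    """Choose k by minimizing the (classical) Davies-Bouldin index
    over candidate cluster counts."""
    scores = {}
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        scores[k] = davies_bouldin_score(X, labels)
    return min(scores, key=scores.get)

def rough_assign(X, centers, eps=1.2):
    """Rough clustering sketch: a point joins the upper approximation
    of every cluster whose center lies within eps times the distance
    to its nearest center, so points may belong to several clusters."""
    d = np.linalg.norm(X[:, None] - centers[None], axis=2)
    return [np.where(row <= eps * row.min())[0] for row in d]
```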


Geophysics ◽  
2019 ◽  
Vol 84 (1) ◽  
pp. V81-V96 ◽  
Author(s):  
Tiago A. Coimbra ◽  
Jorge H. Faccipieri ◽  
João H. Speglich ◽  
Leiv-J. Gelius ◽  
Martin Tygel

Exploiting the redundancy contained in a seismic data set assures enhancement of images that are based on stacking results. This enhancement is the goal of developing multiparametric traveltime equations that can approximate reflection and diffraction events in general source-receiver configurations. The main challenge in using these equations is to estimate a large number of parameters in a computationally feasible, reliable, and fast way. To obtain a better fit for diffraction traveltime events than those in the literature, we have derived a finite-offset (FO) double-square-root (DSR) diffraction traveltime equation (which depends on 10 parameters in three dimensions and four parameters in two dimensions). Moreover, to reduce the number of parameters, we have developed a simplified FO-DSR diffraction traveltime equation (which depends on five parameters in three dimensions and two parameters in two dimensions) that delivers similar performance. We have developed operators that use the simplified FO-DSR traveltime equation to construct so-called diffraction-only data volumes (or, more simply, D-volumes), assuring enhancement of the diffraction extraction process. The D-volume construction has two steps: first, a stacking procedure that separates the diffraction events from the input data set, and second, a spreading procedure that enhances the quality of these diffractions. As a proof of concept, our approach has been tested on 2D/3D synthetic and 2D field data sets with successful results.
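
For orientation, the classical constant-velocity double-square-root diffraction traveltime, of which the authors' FO-DSR equations are multiparametric generalizations, has the textbook 2D form (this is not the paper's 10-parameter expression):

```latex
t(x_s, x_r) \;=\;
\sqrt{\frac{t_0^2}{4} + \frac{(x_s - x_d)^2}{v^2}}
\;+\;
\sqrt{\frac{t_0^2}{4} + \frac{(x_d - x_r)^2}{v^2}},
```

where $x_s$ and $x_r$ are the source and receiver positions, $x_d$ is the lateral position of the diffractor, $t_0$ is its two-way zero-offset traveltime, and $v$ is the medium velocity. One square root per ray branch is what makes the form apt for diffractions at general source-receiver offsets.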


2004 ◽  
Vol 14 (06) ◽  
pp. 355-371 ◽  
Author(s):  
HAMED HAMID MUHAMMED

A new, more efficient variant of a recently developed algorithm for unsupervised fuzzy clustering is introduced. A Weighted Incremental Neural Network (WINN) is introduced and used for this purpose; the new approach is called FC-WINN (Fuzzy Clustering using WINN). The WINN algorithm produces a net of nodes connected by edges, which reflects and preserves the topology of the input data set. Additional weights, proportional to the local densities in input space, are associated with the resulting nodes and edges to store useful information about the topological relations in the input data set. A fuzziness factor, proportional to the connectedness of the net, is introduced into the system. A watershed-like procedure is used to cluster the resulting net, and this procedure determines the number of resulting clusters. Only two parameters must be chosen by the user of the FC-WINN algorithm, determining the resolution and the connectedness of the net. The other parameters that must be specified are those required by the underlying incremental neural network, which is a modified version of the Growing Neural Gas (GNG) algorithm. The FC-WINN algorithm is computationally efficient compared with other approaches for clustering large high-dimensional data sets.
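
The WINN net itself is beyond a short sketch, but the watershed-like clustering of a density-weighted net can be illustrated: each node follows its densest neighbor uphill, so clusters grow from local density maxima. This is one plausible reading of the procedure, not the paper's exact algorithm:

```python
def watershed_cluster(weights, edges):
    """Cluster a density-weighted net: attach each node to its
    densest neighbor; nodes that are local density maxima become
    cluster seeds. weights[i] is node i's density weight; edges
    is a list of (i, j) index pairs."""
    nbrs = {i: set() for i in range(len(weights))}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    parent = {}
    for i in range(len(weights)):
        best = max(nbrs[i] | {i}, key=lambda j: weights[j])
        # point uphill only on a strict increase, so maxima self-root
        parent[i] = best if weights[best] > weights[i] else i
    def root(i):
        while parent[i] != i:
            i = parent[i]
        return i
    return [root(i) for i in range(len(weights))]
```

The number of distinct roots returned is the number of clusters, matching the abstract's claim that the procedure itself determines the cluster count.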


Symmetry ◽  
2020 ◽  
Vol 12 (4) ◽  
pp. 613
Author(s):  
Guillermo Martínez-Flórez ◽  
Roger Tovar-Falón ◽  
Marvin Jimémez-Narváez

This paper introduces a new family of asymmetric distributions that can fit unimodal as well as bimodal and trimodal data sets. The model extends the normal model by introducing two parameters that control the shape and the asymmetry of the distribution. Basic properties of this new distribution are studied in detail. Parameter estimation is addressed via the maximum likelihood method, and the Fisher information matrix is derived. A small Monte Carlo simulation study is conducted to examine the performance of the resulting estimators. Finally, two data sets are considered to illustrate the developed methodology.
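
The paper's density is not reproduced in this abstract; the maximum likelihood step can still be sketched with SciPy, using the skew-normal as a stand-in asymmetric model:

```python
import numpy as np
from scipy import stats, optimize

def fit_asymmetric(x):
    """MLE fit of an asymmetric model. The skew-normal is a stand-in
    for the paper's two-shape-parameter family, whose density is not
    given in the abstract."""
    def nll(theta):
        shape, loc, scale = theta
        if scale <= 0:
            return np.inf                      # keep scale positive
        return -np.sum(stats.skewnorm.logpdf(x, shape, loc, scale))
    theta0 = np.array([0.1, np.mean(x), np.std(x)])
    res = optimize.minimize(nll, theta0, method="Nelder-Mead")
    return res.x
```

A numerical Hessian of the negative log-likelihood at the optimum would give the observed information matrix, the empirical counterpart of the Fisher information matrix derived in the paper.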


Author(s):  
Yubo Liu ◽  
Yihua Luo ◽  
Qiaoming Deng ◽  
Xuanxing Zhou

This paper aims to explore the idea and method of using deep learning with small sample sets to realize campus layout generation. From the architect's perspective, we construct two small campus layout data sets through manual screening guided by the preferences of specific architects. These data sets are used to train a Pix2Pix model to automatically generate a campus layout given the campus boundary and surrounding roads. Analysis of the experimental results shows that, provided the collected samples are screened effectively, deep learning can achieve good results even with a small sample data set.
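
For reference, the standard Pix2Pix generator objective combines a conditional adversarial term with an L1 reconstruction term; a minimal PyTorch sketch, with the boundary map as the condition and illustrative variable names (the paper's exact training setup is not given here):

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # adversarial term
l1 = nn.L1Loss()               # pixel reconstruction term
lam = 100.0                    # L1 weight used in the original Pix2Pix paper

def generator_loss(D, boundary, fake_layout, real_layout):
    """Pix2Pix generator loss: fool the conditional discriminator
    while staying close to the reference layout in L1."""
    pred = D(torch.cat([boundary, fake_layout], dim=1))
    adv = bce(pred, torch.ones_like(pred))
    return adv + lam * l1(fake_layout, real_layout)
```

The strong L1 term is one reason Pix2Pix can remain usable with small, carefully screened data sets: it anchors the generator to paired supervision rather than relying on the adversarial signal alone.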


2006 ◽  
Vol 12 (4) ◽  
pp. 283-288
Author(s):  
Jolita Bernatavičienė ◽  
Gintautas Dzemyda ◽  
Olga Kurasova ◽  
Virginijus Marcinkevičius

In this paper, a method for visualizing large multidimensional data that combines multidimensional scaling (MDS) with clustering is modified and investigated. In the original algorithm, the visualization process is divided into three steps: the basis vector set is constructed using the k-means clustering method; this set is projected onto the plane using the MDS algorithm; and the remaining data are visualized using the relative MDS algorithm. We propose a modification that differs from the original algorithm in the strategy for selecting the basis vectors: in our modification, the set of basis vectors consists of vectors selected from the k clusters in a new way. The experimental investigation showed that the modification outperforms the original algorithm in both visualization quality and computational expense.
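
A sketch of the three-step pipeline, using cluster centers as the basis for simplicity (the paper selects actual data vectors from each cluster, and its relative-MDS details may differ):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import MDS
from scipy.optimize import minimize

def visualize(X, k=10):
    """MDS-with-clustering sketch: embed a k-means basis with MDS,
    then place each remaining vector by minimizing its stress
    relative to the fixed basis positions (relative MDS)."""
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    basis = km.cluster_centers_                    # basis vector per cluster
    Y_basis = MDS(n_components=2,
                  dissimilarity="euclidean").fit_transform(basis)

    def place(x):
        d = np.linalg.norm(basis - x, axis=1)      # target distances
        stress = lambda y: np.sum(
            (np.linalg.norm(Y_basis - y, axis=1) - d) ** 2)
        return minimize(stress, Y_basis.mean(axis=0)).x

    return Y_basis, np.array([place(x) for x in X])
```

Only the small basis set goes through full MDS; every other vector is placed by a cheap 2-variable optimization, which is what makes the scheme viable for large data.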


2021 ◽  
Vol 257 ◽  
pp. 01032
Author(s):  
Dong Hong Huang ◽  
Dan Liu ◽  
Ming Wen ◽  
Xin Li Dong ◽  
Min Wen ◽  
...  

For the design and planning of a gas-fired boiler system, the boiler load is an important piece of basic data. Load clustering analysis, which applies data mining technology to the gas boiler system, uncovers the load patterns hidden in large numbers of disordered, irregular loads and classifies them, helping to solve many problems in gas boiler systems. Current load clustering methods all have shortcomings to some degree. The proposed method first applies PCA dimensionality reduction to the huge volume of gas data and then carries out cluster analysis. In practical applications of gas-fired boilers, the data objects faced are usually unbalanced data sets; to solve this sample-imbalance problem, we use the FCM-SMOTE algorithm to oversample the clustered data, turning it into a balanced data set.
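
A sketch of the described pipeline with scikit-learn and imbalanced-learn; plain SMOTE stands in for the paper's FCM-SMOTE variant, and k-means stands in for its clustering step, so treat this as the shape of the workflow rather than the method itself:

```python
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from imblearn.over_sampling import SMOTE

def load_pipeline(X, n_components=3, n_clusters=4):
    """PCA dimensionality reduction, then clustering, then
    oversampling so minority load-pattern clusters are balanced.
    (SMOTE needs each cluster to have more members than its
    k_neighbors setting.)"""
    Z = PCA(n_components=n_components).fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Z)
    Z_bal, labels_bal = SMOTE().fit_resample(Z, labels)
    return Z_bal, labels_bal
```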

