Clustering functional data using forward search based on functional spatial ranks with medical applications

Cluster analysis of functional data is finding increasing application in medical research and statistics. Here we introduce a functional version of the forward search methodology for clustering functional data. The proposed forward search algorithm is based on functional spatial ranks and is a data-driven, non-parametric method. It requires neither preprocessing of the functional data nor dimension reduction before clustering. The Forward Search Based on Functional Spatial Rank (FSFSR) algorithm identifies the number of clusters in the curves and provides the basis for the accurate assignment of each curve to its cluster. We apply it to three simulated datasets and two real medical datasets, and compare it with six other standard methods. On both simulated and real data, the FSFSR algorithm identifies the correct number of clusters. Furthermore, when compared with six standard methods used for clustering and classification, it records the lowest misclassification rate. We conclude that the FSFSR algorithm has the potential to cluster and classify functional data.
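
The abstract does not define functional spatial ranks, so the following is only a minimal sketch under one common definition: the spatial rank of a curve is the average of the unit-norm differences between that curve and every curve in the sample. Curves are assumed to be discretized on a shared grid, and the function name `spatial_rank_norm` is hypothetical. The forward search steps of FSFSR are not reproduced here.

```python
import math

def spatial_rank_norm(curve, sample):
    """Norm of the empirical spatial rank of `curve` relative to `sample`.

    Assumption: curves are lists of values on a common grid, so the
    functional norm is approximated by the Euclidean norm of the values.
    """
    total = [0.0] * len(curve)
    for other in sample:
        diff = [a - b for a, b in zip(curve, other)]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm == 0.0:
            continue  # an identical curve contributes a zero direction
        for j, d in enumerate(diff):
            total[j] += d / norm
    mean = [t / len(sample) for t in total]
    return math.sqrt(sum(m * m for m in mean))

# Five vertically shifted sine curves plus one curve far above them.
grid = [i / 10 for i in range(11)]
sample = [[math.sin(2 * math.pi * t) + shift for t in grid]
          for shift in (-0.2, -0.1, 0.0, 0.1, 0.2)]
outlier = [math.sin(2 * math.pi * t) + 3.0 for t in grid]

central_rank = spatial_rank_norm(sample[2], sample)  # middle curve
outlier_rank = spatial_rank_norm(outlier, sample)    # extreme curve
```

Under this definition the rank norm lies between 0 and 1: it is close to 0 for a curve in the middle of the sample and close to 1 for a curve that lies on one side of all the others.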

Automatic Data Clustering Using Parameter Adaptive Harmony Search Algorithm and Its Application to Image Segmentation

Journal of Intelligent Systems ◽ 10.1515/jisys-2015-0004 ◽ 2016 ◽ Vol 25 (4) ◽ pp. 595-610 ◽ Cited By ~ 8

Author(s): Vijay Kumar ◽ Jitender Kumar Chhabra ◽ Dinesh Kumar

Keyword(s): Data Clustering ◽ Euclidean Distance ◽ Search Algorithm ◽ Harmony Search ◽ Real Life ◽ Optimal Number ◽ Optimization Strategy ◽ Number Of Clusters ◽ Automatic Data ◽ Parameter Adaptive

In this paper, the problem of automatic data clustering is treated as a search for the optimal number of clusters such that the resulting partitions are optimized. The automatic data clustering technique uses a recently developed parameter adaptive harmony search (PAHS) as its underlying optimization strategy. It uses a real-coded, variable-length harmony vector, which is able to detect the number of clusters automatically. Newly developed concepts of “threshold setting” and “cutoff” are used to refine the optimization strategy. Data points are assigned to cluster centers based on a newly developed weighted Euclidean distance instead of the Euclidean distance. The developed approach is able to detect clusters of any type, irrespective of their geometric shape. It is compared with four well-established clustering techniques. It is further applied to automatic segmentation of grayscale and color images, and its performance is compared with other existing techniques. For real-life datasets, a statistical analysis is carried out. The results demonstrate the effectiveness and usefulness of the technique.
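
The abstract does not say how the weighted Euclidean distance is defined, so the sketch below assumes the simplest form, one non-negative weight per feature, and shows only the assignment step. The function names and the example weights are hypothetical, not taken from the paper.

```python
import math

def weighted_euclidean(x, center, weights):
    """Weighted Euclidean distance with one (assumed) weight per feature."""
    return math.sqrt(sum(w * (a - b) ** 2
                         for a, b, w in zip(x, center, weights)))

def assign(points, centers, weights):
    """Assign each point to the index of its nearest center."""
    labels = []
    for p in points:
        dists = [weighted_euclidean(p, c, weights) for c in centers]
        labels.append(dists.index(min(dists)))
    return labels

# Illustrative data: the second feature is noisy, so it gets a small weight.
points = [(0.0, 0.0), (0.2, 5.0), (4.0, 0.1), (4.1, 5.2)]
centers = [(0.0, 2.5), (4.0, 2.5)]
weights = (1.0, 0.01)
labels = assign(points, centers, weights)
```

With these weights the first feature dominates the distance, so the points split by their first coordinate regardless of the spread in the second.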

Self-Adaptive K-Means Based on a Covering Algorithm

Complexity ◽ 10.1155/2018/7698274 ◽ 2018 ◽ Vol 2018 ◽ pp. 1-16 ◽ Cited By ~ 1

Author(s): Yiwen Zhang ◽ Yuanyuan Zhou ◽ Xing Guo ◽ Jintao Wu ◽ Qiang He ◽ ...

Keyword(s): Large Scale ◽ Clustering Algorithm ◽ Real Data ◽ Second Phase ◽ Data Sets ◽ Number Of Clusters ◽ Large Scale Data ◽ Long Time ◽ Two Phases ◽ Selection Of

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has long been studied by researchers in numerous fields. However, the number of clusters k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm not only produces efficient and accurate clustering results but also self-adaptively provides a reasonable number of clusters based on the data features. It includes two phases: the initialization of the covering algorithm (CA) and the Lloyd iteration of the K-means. The first phase executes the CA. CA self-organizes and recognizes the number of clusters k based on the similarities in the data, and it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. Therefore, it has a “blind” feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm combines the advantages of CA and K-means. Experiments are carried out on the Spark platform, and the results verify the good scalability of the C-K-means algorithm. This algorithm can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the C-K-means algorithm outperforms the existing algorithms in accuracy and efficiency under both sequential and parallel conditions.
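
The second phase described above is the standard Lloyd iteration, which can be sketched in a few lines. The covering algorithm itself is not described in enough detail to reproduce, so the initial centers here are simply passed in by hand as a stand-in for the output of the first phase. The function name `lloyd` and the example data are illustrative only.

```python
def lloyd(points, initial_centers, max_iter=100):
    """Plain Lloyd iteration: assign points, recompute means, repeat."""
    centers = [tuple(c) for c in initial_centers]

    def nearest(p):
        # Index of the center with the smallest squared Euclidean distance.
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        return dists.index(min(dists))

    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p)].append(p)
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else c  # keep the old center if a cluster is empty
            for c, members in zip(centers, clusters)
        ]
        if new_centers == centers:
            break  # centers stopped moving
        centers = new_centers

    labels = [nearest(p) for p in points]
    return centers, labels

# Initial centers chosen by hand, standing in for the first-phase result.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels = lloyd(points, [(1.0, 1.0), (9.0, 9.0)])
```

Because the number of centers is fixed by the initialization, any method that supplies k and the starting centers, such as the covering phase in the abstract, can be placed in front of this loop unchanged.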

Detection of Anomalies in Water Networks by Functional Data Analysis

Mathematical Problems in Engineering ◽ 10.1155/2018/5129735 ◽ 2018 ◽ Vol 2018 ◽ pp. 1-13 ◽ Cited By ~ 8

Author(s): Laura Millán-Roures ◽ Irene Epifanio ◽ Vicente Martínez

Keyword(s): Data Analysis ◽ Outlier Detection ◽ Functional Data Analysis ◽ Functional Data ◽ Real Data ◽ Water Networks ◽ Archetypal Analysis ◽ Detection Techniques ◽ Second Stage ◽ Two Phases

A functional data analysis (FDA) based methodology for detecting anomalous flows in urban water networks is introduced. Primary hydraulic variables are recorded in real time by telecontrol systems, so they are functional data (FD). In the first stage, the data are validated (false data are detected) and reconstructed, since the records may contain not only false data but also missing and noisy data. FDA tools such as tolerance bands for FD and smoothing for dense and sparse FD are used. In the second stage, functional outlier detection tools are used in two phases. In Phase I, the data are cleared of anomalies to ensure that they are representative of the in-control system. The objective of Phase II is system monitoring. A new functional outlier detection method based on archetypal analysis is also proposed. The methodology is applied and illustrated with real data. A simulation study is also carried out to assess the performance of the outlier detection techniques, including our proposal. The results are very promising.

A Fuzzy Crow Search Algorithm for Solving Data Clustering Problem

Trends in Artificial Intelligence Theory and Applications. Artificial Intelligence Practices - Lecture Notes in Computer Science ◽ 10.1007/978-3-030-55789-8_67 ◽ 2020 ◽ pp. 782-791

Author(s): Ze-Xue Wu ◽ Ko-Wei Huang ◽ Chu-Sing Yang

Keyword(s): Data Clustering ◽ Search Algorithm ◽ Clustering Problem

Functional Data Clustering Analysis via the Learning of Gaussian Processes with Wasserstein Distance

Neural Information Processing - Lecture Notes in Computer Science ◽ 10.1007/978-3-030-63833-7_33 ◽ 2020 ◽ pp. 393-403

Author(s): Tao Li ◽ Jinwen Ma

Keyword(s): Gaussian Processes ◽ Functional Data ◽ Clustering Analysis ◽ Data Clustering ◽ Wasserstein Distance

Finding the Number of Clusters in Data and Better Initial Centers for K-means Algorithm

International Journal of Intelligent Systems and Applications ◽ 10.5815/ijisa.2020.06.01 ◽ 2020 ◽ Vol 12 (6) ◽ pp. 1-20

Author(s): Ahmed Fahim

Keyword(s): Data Clustering ◽ Linear Time ◽ Original Data ◽ Local Minima ◽ Expected Number ◽ Open Problems ◽ Number Of Clusters ◽ Benchmark Datasets ◽ Selection Of

The k-means algorithm is the best-known algorithm for data clustering in data mining. Its most important advantages are its simplicity and fast convergence to local minima, in addition to its linear time complexity. The most important open problems in this algorithm are the selection of initial centers and the determination of the exact number of clusters in advance. This paper proposes a solution to both problems together by adding a preprocessing step that obtains the expected number of clusters in the data and better initial centers. Many studies address each of these problems separately, but none solves both together. The preprocessing step requires O(n log n) time, where n is the size of the dataset. This step aims to obtain an initial partitioning of the data without determining the number of clusters in advance, and then computes the means of the initial clusters. After that, we apply k-means to the original data, using the information from the preprocessing step to obtain the final clusters. We use many benchmark datasets to test the proposed method. The experimental results show the efficiency of the proposed method.

Mixture Based Outlier Filtration

Acta Polytechnica ◽ 10.14311/816 ◽ 2006 ◽ Vol 46 (2)

Author(s): P. Pecherková ◽ I. Nagy

Keyword(s): Model Identification ◽ Real Data ◽ Control Algorithms ◽ Linear Quadratic ◽ Process Data ◽ Linear Quadratic Gaussian ◽ Standard Methods ◽ Two Component ◽ Reconstruction Filter

The success or failure of adaptive control algorithms, especially those designed using the Linear Quadratic Gaussian criterion, depends on the quality of the process data used for model identification. Outliers, i.e. ‘wrong data’ lying far away from the range of real data, are among the most harmful types of process data corruption. The presence of outliers in the data negatively affects the estimation of the dynamics of the system. This effect is magnified when the outliers are grouped into blocks. In this paper, we propose an algorithm for outlier detection and removal. It is based on modelling the corrupted data by a two-component probabilistic mixture. The first component of the mixture models uncorrupted process data, while the second models outliers. When the outlier component is detected to be active, a prediction from the uncorrupted data component is computed and used as a reconstruction of the observed data. The resulting reconstruction filter is compared to standard methods on simulated and real data. The filter exhibits excellent properties, especially in the case of blocks of outliers.
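
The idea of the two-component mixture can be illustrated with a heavily simplified sketch. It assumes a static scalar signal, Gaussian components with known parameters, and a fixed outlier weight, none of which comes from the paper, whose filter works with identified dynamic models. The names `gauss_pdf` and `filter_outliers` are hypothetical.

```python
import math

def gauss_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def filter_outliers(data, mean, clean_std, outlier_std, outlier_weight=0.05):
    """Replace points that the outlier component explains better.

    Assumption: both components share the same mean; the outlier
    component is simply much wider than the clean one.
    """
    cleaned = []
    for x in data:
        p_clean = (1 - outlier_weight) * gauss_pdf(x, mean, clean_std)
        p_out = outlier_weight * gauss_pdf(x, mean, outlier_std)
        if p_out > p_clean:
            # Outlier component is active: use the clean component's
            # prediction (here just its mean) as the reconstruction.
            cleaned.append(mean)
        else:
            cleaned.append(x)
    return cleaned

data = [0.1, -0.2, 8.0, 0.05, -7.5]
cleaned = filter_outliers(data, mean=0.0, clean_std=0.5, outlier_std=10.0)
```

In this toy setting the two values far from zero are replaced by the clean component's mean, while the remaining observations pass through unchanged.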

Fast Search Algorithm for Determining the Optimal Number of Clusters using Cluster Validity Index

The Journal of the Korea Contents Association ◽ 10.5392/jkca.2009.9.9.080 ◽ 2009 ◽ Vol 9 (9) ◽ pp. 80-89 ◽ Cited By ~ 1

Author(s): Sang-Wook Lee

Keyword(s): Search Algorithm ◽ Optimal Number ◽ Cluster Validity ◽ Cluster Validity Index ◽ Validity Index ◽ Number Of Clusters ◽ Fast Search ◽ Fast Search Algorithm ◽ Optimal Number Of Clusters

Categorical Functional Data Analysis. The cfda R Package

Mathematics ◽ 10.3390/math9233074 ◽ 2021 ◽ Vol 9 (23) ◽ pp. 3074

Author(s): Cristian Preda ◽ Quentin Grimonprez ◽ Vincent Vandewalle

Keyword(s): Functional Data ◽ Multiple Correspondence Analysis ◽ Real Data ◽ Jump Process ◽ R Package ◽ Finite Basis ◽ Data Set ◽ Stochastic Jump ◽ Finite Set ◽ Infinite Set

Categorical functional data represented by paths of a stochastic jump process with continuous time and a finite set of states are considered. As an extension of the multiple correspondence analysis to an infinite set of variables, optimal encodings of states over time are approximated using an arbitrary finite basis of functions. This allows dimension reduction, optimal representation, and visualisation of data in lower dimensional spaces. The methodology is implemented in the cfda R package and is illustrated using a real data set in the clustering framework.

A new algorithm for data clustering based on gravitational search algorithm and genetic operators

2015 The International Symposium on Artificial Intelligence and Signal Processing (AISP) ◽ 10.1109/aisp.2015.7123532 ◽ 2015 ◽ Cited By ~ 4

Author(s): Hamed Nikbakht ◽ Hamid Mirvaziri

Keyword(s): Data Clustering ◽ Search Algorithm ◽ Gravitational Search Algorithm ◽ Genetic Operators ◽ Gravitational Search
