scholarly journals Clustering functional data using forward search based on functional spatial ranks with medical applications

2021 ◽  
pp. 096228022110028
Author(s):  
Mohammed Baragilly ◽  
Hend Gabr ◽  
Brian H Willis

Cluster analysis of functional data is finding increasing application in the field of medical research and statistics. Here we introduce a functional version of the forward search methodology for the purpose of functional data clustering. The proposed forward search algorithm is based on the functional spatial ranks and is a data-driven non-parametric method. It does not require any preprocessing functional data steps, nor does it require any dimension reduction before clustering. The Forward Search Based on Functional Spatial Rank (FSFSR) algorithm identifies the number of clusters in the curves and provides the basis for the accurate assignment of each curve to its cluster. We apply it to three simulated datasets and two real medical datasets, and compare it with six other standard methods. Based on both simulated and real data, the FSFSR algorithm identifies the correct number of clusters. Furthermore, when compared with six standard methods used for clustering and classification, it records the lowest misclassification rate. We conclude that the FSFSR algorithm has the potential to cluster and classify functional data.

2016 ◽  
Vol 25 (4) ◽  
pp. 595-610 ◽  
Author(s):  
Vijay Kumar ◽  
Jitender Kumar Chhabra ◽  
Dinesh Kumar

AbstractIn this paper, the problem of automatic data clustering is treated as the searching of optimal number of clusters so that the obtained partitions should be optimized. The automatic data clustering technique utilizes a recently developed parameter adaptive harmony search (PAHS) as an underlying optimization strategy. It uses real-coded variable length harmony vector, which is able to detect the number of clusters automatically. The newly developed concepts regarding “threshold setting” and “cutoff” are used to refine the optimization strategy. The assignment of data points to different cluster centers is done based on the newly developed weighted Euclidean distance instead of Euclidean distance. The developed approach is able to detect any type of cluster irrespective of their geometric shape. It is compared with four well-established clustering techniques. It is further applied for automatic segmentation of grayscale and color images, and its performance is compared with other existing techniques. For real-life datasets, statistical analysis is done. The technique shows its effectiveness and the usefulness.


Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-16 ◽  
Author(s):  
Yiwen Zhang ◽  
Yuanyuan Zhou ◽  
Xing Guo ◽  
Jintao Wu ◽  
Qiang He ◽  
...  

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has been studied by researchers in numerous fields for a long time. However, the value of the clustering number k in the K-means algorithm is not always easy to be determined, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm can not only acquire efficient and accurate clustering results but also self-adaptively provide a reasonable numbers of clusters based on the data features. It includes two phases: the initialization of the covering algorithm (CA) and the Lloyd iteration of the K-means. The first phase executes the CA. CA self-organizes and recognizes the number of clusters k based on the similarities in the data, and it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. Therefore, it has a “blind” feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm combines the advantages of CA and K-means. Experiments are carried out on the Spark platform, and the results verify the good scalability of the C-K-means algorithm. This algorithm can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the accuracy and efficiency of the C-K-means algorithm outperforms the existing algorithms under both sequential and parallel conditions.


2018 ◽  
Vol 2018 ◽  
pp. 1-13 ◽  
Author(s):  
Laura Millán-Roures ◽  
Irene Epifanio ◽  
Vicente Martínez

A functional data analysis (FDA) based methodology for detecting anomalous flows in urban water networks is introduced. Primary hydraulic variables are recorded in real-time by telecontrol systems, so they are functional data (FD). In the first stage, the data are validated (false data are detected) and reconstructed, since there could be not only false data, but also missing and noisy data. FDA tools are used such as tolerance bands for FD and smoothing for dense and sparse FD. In the second stage, functional outlier detection tools are used in two phases. In Phase I, the data are cleared of anomalies to ensure that data are representative of the in-control system. The objective of Phase II is system monitoring. A new functional outlier detection method is also proposed based on archetypal analysis. The methodology is applied and illustrated with real data. A simulated study is also carried out to assess the performance of the outlier detection techniques, including our proposal. The results are very promising.


Author(s):  
Ahmed Fahim ◽  

The k-means is the most well-known algorithm for data clustering in data mining. Its simplicity and speed of convergence to local minima are the most important advantages of it, in addition to its linear time complexity. The most important open problems in this algorithm are the selection of initial centers and the determination of the exact number of clusters in advance. This paper proposes a solution for these two problems together; by adding a preprocess step to get the expected number of clusters in data and better initial centers. There are many researches to solve each of these problems separately, but there is no research to solve both problems together. The preprocess step requires o(n log n); where n is size of the dataset. This preprocess step aims to get initial portioning of data without determining the number of clusters in advance, then computes the means of initial clusters. After that we apply k-means on original data using the resulting information from the preprocess step to get the final clusters. We use many benchmark datasets to test the proposed method. The experimental results show the efficiency of the proposed method.


10.14311/816 ◽  
2006 ◽  
Vol 46 (2) ◽  
Author(s):  
P. Pecherková ◽  
I. Nagy

Success/failure of adaptive control algorithms – especially those designed using the Linear Quadratic Gaussian criterion – depends on the quality of the process data used for model identification. One of the most harmful types of process data corruptions are outliers, i.e. ‘wrong data’ lying far away from the range of real data. The presence of outliers in the data negatively affects an estimation of the dynamics of the system. This effect is magnified when the outliers are grouped into blocks. In this paper, we propose an algorithm for outlier detection and removal. It is based on modelling the corrupted data by a two-component probabilistic mixture. The first component of the mixture models uncorrupted process data, while the second models outliers. When the outlier component is detected to be active, a prediction from the uncorrupted data component is computed and used as a reconstruction of the observed data. The resulting reconstruction filter is compared to standard methods on simulated and real data. The filter exhibits excellent properties, especially in the case of blocks of outliers. 


Mathematics ◽  
2021 ◽  
Vol 9 (23) ◽  
pp. 3074
Author(s):  
Cristian Preda ◽  
Quentin Grimonprez ◽  
Vincent Vandewalle

Categorical functional data represented by paths of a stochastic jump process with continuous time and a finite set of states are considered. As an extension of the multiple correspondence analysis to an infinite set of variables, optimal encodings of states over time are approximated using an arbitrary finite basis of functions. This allows dimension reduction, optimal representation, and visualisation of data in lower dimensional spaces. The methodology is implemented in the cfda R package and is illustrated using a real data set in the clustering framework.


Sign in / Sign up

Export Citation Format

Share Document