A Support Based Initialization Algorithm for Categorical Data Clustering

2018 ◽  
Vol 11 (2) ◽  
pp. 53-67
Author(s):  
Ajay Kumar ◽  
Shishir Kumar

Several initial center selection algorithms have been proposed in the literature for numerical data, but because the values of categorical data are unordered, these methods are not applicable to categorical data sets. This article investigates the initial center selection process for categorical data and then presents a new support-based initial center selection algorithm. The proposed algorithm measures the weight of each unique value of an attribute using its support and then aggregates these weights along the rows to obtain the support of every row. The data object with the largest support is chosen as the first initial center, followed by finding the other centers at the greatest distance from the initially selected center. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu's method and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.
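A minimal sketch of the support-based selection idea described above, assuming a categorical matrix `X` (rows are objects, columns are attributes) and a plain Hamming distance; the exact weighting and distance used in the paper may differ:

```python
import numpy as np

def support_based_centers(X, k):
    """Pick k initial centers from categorical data X (n objects x m attributes).

    Sketch: score every row by summing the supports (relative frequencies) of
    its attribute values, take the highest-scoring row as the first center,
    then repeatedly add the row farthest (by Hamming distance) from it.
    """
    X = np.asarray(X, dtype=object)
    n, m = X.shape

    # Support of an attribute value = its relative frequency in that column.
    row_support = np.zeros(n)
    for j in range(m):
        values, counts = np.unique(X[:, j], return_counts=True)
        freq = dict(zip(values, counts / n))
        row_support += np.array([freq[v] for v in X[:, j]])

    centers = [int(np.argmax(row_support))]          # row with the largest support
    hamming = lambda a, b: np.sum(a != b)            # simple categorical distance

    while len(centers) < k:
        # Distance of each row to its nearest already-chosen center.
        d = np.array([min(hamming(X[i], X[c]) for c in centers) for i in range(n)])
        d[centers] = -1                              # never re-pick a chosen center
        centers.append(int(np.argmax(d)))
    return [X[c] for c in centers]

# Example: five objects with three categorical attributes, two initial centers.
data = [["a", "x", "p"], ["a", "y", "p"], ["b", "y", "q"], ["b", "x", "q"], ["a", "x", "q"]]
print(support_based_centers(data, k=2))
```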

2012 ◽  
Vol 8 (4) ◽  
pp. 82-107 ◽  
Author(s):  
Renxia Wan ◽  
Yuelin Gao ◽  
Caixia Li

Several algorithms for clustering large data sets have been presented to date. Most clustering approaches for large data sets are crisp ones, which are not well suited to the fuzzy case. In this paper, the authors explore a single-pass approach to fuzzy possibilistic clustering over large data sets. The basic idea of the proposed approach (weighted fuzzy-possibilistic c-means, WFPCM) is to use a modified possibilistic c-means (PCM) algorithm to cluster the weighted data points and centroids, with one data segment processed as a unit. Experimental results on both synthetic and real data sets show that WFPCM saves significant memory compared with the fuzzy c-means (FCM) algorithm and the possibilistic c-means (PCM) algorithm. Furthermore, the proposed algorithm shows excellent immunity to noise, avoids splitting or merging true clusters into inaccurate ones, and preserves the integrity and purity of the natural classes.
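The core step the abstract describes, clustering weighted points with a PCM-style update on one data segment, can be sketched as follows. This is not the authors' WFPCM: the scale parameter `eta` and the random initialization are simplified placeholders.

```python
import numpy as np

def weighted_pcm(X, w, c, m=2.0, n_iter=50, seed=0):
    """Possibilistic c-means on weighted points for one data segment (schematic).

    X : (n, d) points of the current segment
    w : (n,) point weights (e.g. how many raw points each summary point represents)
    c : number of clusters
    """
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), c, replace=False)]            # initial centroids
    for _ in range(n_iter):
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)   # (c, n) squared distances
        eta = d2.mean(axis=1, keepdims=True) + 1e-12           # crude per-cluster scale
        T = 1.0 / (1.0 + (d2 / eta) ** (1.0 / (m - 1.0)))      # typicality matrix
        num = (w * T ** m) @ X                                 # weight-adjusted numerator
        den = (w * T ** m).sum(axis=1, keepdims=True)
        V = num / den                                          # updated centroids
    return V, T
```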


2019 ◽  
Vol 8 (2) ◽  
pp. 159
Author(s):  
Morteza Marzjarani

Heteroscedasticity plays an important role in data analysis. In this article, this issue, along with several approaches for handling heteroscedasticity, is presented. First, an iteratively reweighted least squares (IRLS) procedure and an iterative feasible generalized least squares (IFGLS) procedure are deployed, and proper weights for reducing heteroscedasticity are determined. Next, a new approach for handling heteroscedasticity is introduced. In this approach, a multiple linear regression (MLR) model or a general linear model (GLM) is fitted to a sufficiently large data set, and the data is divided into two parts through inspection of the residuals, based on the results of testing for heteroscedasticity or via simulations. The first part contains the records where the absolute values of the residuals can be assumed small enough that heteroscedasticity is ignorable. Under this assumption, the error variances are small and close to those of their neighboring points; such error variances can be assumed known (but not necessarily equal). The second, remaining portion of the data is categorized as heteroscedastic. Using real data sets, it is concluded that this approach reduces the number of unusual (e.g., influential) data points suggested for further inspection and, more importantly, lowers the root mean square error (RMSE), resulting in a more robust set of parameter estimates.
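A rough sketch of the splitting idea with statsmodels: fit an OLS model, test for heteroscedasticity, and separate records by the magnitude of their residuals. The quantile threshold and the WLS weights used here are illustrative choices, not the article's exact criteria.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(0, 10, n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.3 * x, size=n)    # error variance grows with x

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()

# Breusch-Pagan test on the residuals: a small p-value indicates heteroscedasticity.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")

# Split on absolute residual size (illustrative 70% quantile threshold).
threshold = np.quantile(np.abs(ols.resid), 0.70)
near_homoscedastic = np.abs(ols.resid) <= threshold       # treated as ignorable
heteroscedastic = ~near_homoscedastic                      # handled separately, e.g. via WLS

wls = sm.WLS(y[heteroscedastic],
             X[heteroscedastic],
             weights=1.0 / (x[heteroscedastic] ** 2)).fit()   # weight choice is an assumption
print(ols.params, wls.params)
```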


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Tressy Thomas ◽  
Enayat Rajabi

Purpose: The primary aim of this study is to review the studies from different dimensions, including the type of methods, experimentation setup and evaluation metrics used in the novel approaches proposed for data imputation, particularly in the machine learning (ML) area. This ultimately provides an understanding of how well the proposed frameworks are evaluated and what type and ratio of missingness are addressed in the proposals. The review questions in this study are: (1) What ML-based imputation methods were studied and proposed during 2010–2020? (2) How are the experimentation setup, characteristics of data sets and missingness employed in these studies? (3) What metrics were used for the evaluation of imputation methods?

Design/methodology/approach: The review process went through the standard identification, screening and selection process. The initial search on electronic databases for missing value imputation (MVI) based on ML algorithms returned a large number of papers, totaling 2,883. Most of the papers at this stage did not describe an MVI technique relevant to this study. The papers were first screened by title for relevance, and 306 were identified as appropriate. Upon reviewing the abstracts, 151 papers not eligible for this study were dropped. This resulted in 155 research papers suitable for full-text review. Of these, 117 papers are used in the assessment of the review questions.

Findings: This study shows that clustering- and instance-based algorithms are the most frequently proposed MVI methods. Percentage of correct prediction (PCP) and root mean square error (RMSE) are the most used evaluation metrics in these studies. For experimentation, the majority of the studies sourced the data sets from publicly available repositories. A common approach is to take a complete data set as the baseline and evaluate the effectiveness of imputation on test data sets with artificially induced missingness. The data set size and missingness ratio varied across the experiments, while the missing data type and mechanism pertain to the capability of the imputation method. Computational expense is a concern, and experimentation using large data sets appears to be a challenge.

Originality/value: It is understood from the review that there is no single universal solution to the missing data problem. Variants of ML approaches work well with missingness depending on the characteristics of the data set. Most of the methods reviewed lack generalization with regard to applicability. Another concern related to applicability is the complexity of the formulation and implementation of the algorithm. Imputation based on k-nearest neighbors (kNN) and clustering algorithms, which are simple and easy to implement, is popular across various domains.
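As the review notes, a common experimental pattern is to take a complete data set, induce missingness artificially, impute, and score against the known ground truth with RMSE. A minimal sketch of that pattern using scikit-learn's KNNImputer; the data set, missingness ratio and neighbor count here are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import KNNImputer

rng = np.random.default_rng(42)
X_complete = load_iris().data                       # complete baseline data set

# Induce 10% missingness completely at random (MCAR).
X_missing = X_complete.copy()
mask = rng.random(X_missing.shape) < 0.10
X_missing[mask] = np.nan

# kNN imputation, then RMSE against the known ground truth on the masked entries.
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X_missing)
rmse = np.sqrt(np.mean((X_imputed[mask] - X_complete[mask]) ** 2))
print(f"RMSE on imputed entries: {rmse:.3f}")
```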


2018 ◽  
Vol 8 (2) ◽  
pp. 377-406
Author(s):  
Almog Lahav ◽  
Ronen Talmon ◽  
Yuval Kluger

Abstract A fundamental question in data analysis, machine learning and signal processing is how to compare data points. The choice of distance metric is particularly challenging for high-dimensional data sets, where the problem of meaningfulness is more prominent (e.g. the Euclidean distance between images). In this paper, we propose to exploit a property of high-dimensional data that is usually ignored: the structure stemming from the relationships between the coordinates. Specifically, we show that organizing similar coordinates in clusters can be exploited for the construction of the Mahalanobis distance between samples. When the observable samples are generated by a nonlinear transformation of hidden variables, the Mahalanobis distance allows the recovery of the Euclidean distances in the hidden space. We illustrate the advantage of our approach on a synthetic example, where the discovery of clusters of correlated coordinates improves the estimation of the principal directions of the samples. Our method was applied to real gene expression data for lung adenocarcinomas (lung cancer). Using the proposed metric, we found a partition of subjects into risk groups with good separation between their Kaplan–Meier survival plots.
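A toy sketch of the ingredient the abstract emphasizes: group correlated coordinates into clusters and use the resulting covariance estimate inside a Mahalanobis distance. Hierarchical clustering of the correlation matrix and the block-diagonal covariance are illustrative stand-ins, not the authors' construction.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform, mahalanobis

def block_mahalanobis(X, n_coord_clusters=3):
    """Mahalanobis distance with a block-diagonal covariance built from
    clusters of correlated coordinates (toy version of the idea)."""
    n, d = X.shape
    corr = np.corrcoef(X, rowvar=False)

    # Hierarchical clustering of coordinates with 1 - |correlation| as dissimilarity.
    dissim = squareform(1.0 - np.abs(corr), checks=False)
    labels = fcluster(linkage(dissim, method="average"), n_coord_clusters, criterion="maxclust")

    # Covariance estimated block by block; cross-cluster entries are left at zero.
    cov = np.zeros((d, d))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        cov[np.ix_(idx, idx)] = np.cov(X[:, idx], rowvar=False)
    VI = np.linalg.inv(cov + 1e-8 * np.eye(d))           # regularized inverse covariance

    return lambda a, b: mahalanobis(a, b, VI)

# Example usage on random data.
X = np.random.default_rng(0).normal(size=(200, 6))
dist = block_mahalanobis(X)
print(dist(X[0], X[1]))
```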


2021 ◽  
Vol 87 (6) ◽  
pp. 445-455
Author(s):  
Yi Ma ◽  
Zezhong Zheng ◽  
Yutang Ma ◽  
Mingcang Zhu ◽  
Ran Huang ◽  
...  

Many manifold learning algorithms conduct an eigenvector analysis on a data-similarity matrix of size N×N, where N is the number of data points. Thus, the memory complexity of the analysis is no less than O(N²). We present in this article an incremental manifold learning approach to handle large hyperspectral data sets for land use identification. In our method, the number of dimensions for the high-dimensional hyperspectral-image data set is obtained with the training data set. A local curvature variation algorithm is utilized to sample a subset of data points as landmarks. Then a manifold skeleton is identified based on the landmarks. Our method is validated on three AVIRIS hyperspectral data sets, outperforming the comparison algorithms with a k-nearest-neighbor classifier and achieving the second best performance with a support vector machine.
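A schematic of the landmark idea, not the paper's algorithm: landmarks are sampled uniformly at random here as a stand-in for the local-curvature-variation sampling, the "skeleton" is an Isomap embedding of the landmarks only, and the remaining points are placed by averaging the embeddings of their nearest landmarks.

```python
import numpy as np
from sklearn.manifold import Isomap
from sklearn.neighbors import NearestNeighbors

def landmark_embedding(X, n_landmarks=400, n_components=2, seed=0):
    """Embed a large data set via a landmark subset (schematic sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n_landmarks, len(X)), replace=False)
    landmarks = X[idx]

    # Manifold "skeleton": low-dimensional embedding of the landmarks only.
    Y_land = Isomap(n_neighbors=10, n_components=n_components).fit_transform(landmarks)

    # Out-of-sample points: average the embeddings of their nearest landmarks.
    nn = NearestNeighbors(n_neighbors=5).fit(landmarks)
    _, neigh = nn.kneighbors(X)
    return Y_land[neigh].mean(axis=1)

# Toy usage on a swiss roll instead of a hyperspectral cube.
from sklearn.datasets import make_swiss_roll
X, _ = make_swiss_roll(n_samples=3000, random_state=0)
print(landmark_embedding(X).shape)    # (3000, 2)
```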


2018 ◽  
Vol 2018 ◽  
pp. 1-12 ◽  
Author(s):  
Suleman Nasiru

The need to develop generalizations of existing statistical distributions to make them more flexible in modeling real data sets is vital in parametric statistical modeling and inference. This study therefore develops a new class of distributions, called the extended odd Fréchet family of distributions, for modifying existing standard distributions. Two special models, named the extended odd Fréchet Nadarajah-Haghighi and extended odd Fréchet Weibull distributions, are proposed using the developed family. The densities and hazard rate functions of the two special distributions exhibit different kinds of monotonic and nonmonotonic shapes. The maximum likelihood method is used to develop estimators for the parameters of the new class of distributions. The application of the special distributions is illustrated by means of a real data set. The results reveal that the special distributions developed from the new family can provide a reasonable parametric fit to the given data set compared to other existing distributions.
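The estimation step, maximum likelihood via numerical optimization, can be sketched as below. The two-parameter Weibull density is only a stand-in; the extended odd Fréchet densities defined in the paper would replace `neg_log_lik`.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

def neg_log_lik(params, data):
    """Negative log-likelihood of a two-parameter Weibull (stand-in density)."""
    shape, scale = params
    if shape <= 0 or scale <= 0:
        return np.inf
    return -np.sum(weibull_min.logpdf(data, c=shape, scale=scale))

# Simulated data set, then a maximum likelihood fit by numerical minimization.
data = weibull_min.rvs(c=1.7, scale=2.5, size=300, random_state=0)
res = minimize(neg_log_lik, x0=[1.0, 1.0], args=(data,), method="Nelder-Mead")
print("MLE (shape, scale):", res.x)
```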


Author(s):  
Antonia J. Jones ◽  
Dafydd Evans ◽  
Steve Margetts ◽  
Peter J. Durrant

The Gamma Test is a non-linear modelling analysis tool that allows us to quantify the extent to which a numerical input/output data set can be expressed as a smooth relationship. In essence, it allows us to efficiently calculate that part of the variance of the output that cannot be accounted for by the existence of any smooth model based on the inputs, even though this model is unknown. A key aspect of this tool is its speed: the Gamma Test has time complexity O(M log M), where M is the number of data points. For data sets consisting of a few thousand points and a reasonable number of attributes, a single run of the Gamma Test typically takes a few seconds. In this chapter we will show how the Gamma Test can be used in the construction of predictive models and classifiers for numerical data. In doing so, we will demonstrate the use of this technique for feature selection, and for the selection of the embedding dimension when dealing with a time series.
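A minimal sketch of the Gamma Test computation, assuming the usual nearest-neighbor formulation: pair the mean squared k-th nearest-neighbor distance in input space (delta) with half the mean squared difference of the corresponding outputs (gamma) for k = 1..p, then regress gamma on delta; the intercept estimates the noise variance no smooth model can explain.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def gamma_test(X, y, p=10):
    """Gamma Test estimate of the output noise variance (schematic)."""
    nn = NearestNeighbors(n_neighbors=p + 1).fit(X)       # +1: the nearest point is itself
    dist, idx = nn.kneighbors(X)
    delta = np.array([np.mean(dist[:, k] ** 2) for k in range(1, p + 1)])
    gamma = np.array([np.mean((y[idx[:, k]] - y) ** 2) / 2.0 for k in range(1, p + 1)])
    slope, intercept = np.polyfit(delta, gamma, 1)
    return intercept                                       # the Gamma statistic

# Noisy smooth relationship: the estimate should approach the true noise variance (0.25).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.5, size=2000)
print(gamma_test(X, y))
```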


Fractals ◽  
2001 ◽  
Vol 09 (01) ◽  
pp. 105-128 ◽  
Author(s):  
TAYFUN BABADAGLI ◽  
KAYHAN DEVELI

This paper presents an evaluation of the methods applied to calculate the fractal dimension of fracture surfaces. Variogram analysis (applicable to 1D self-affine sets) and power spectral density analysis (applicable to 2D self-affine sets) are selected to calculate the fractal dimension of synthetic 2D data sets generated using fractional Brownian motion (fBm). The calculated values are then compared with the actual fractal dimensions assigned in the generation of the synthetic surfaces. The main factor considered is the size of the 2D data set (number of data points). The critical sample size that yields the best agreement between the calculated and actual values is defined for each method. Limitations and the proper use of each method are clarified after an extensive analysis. The two methods are also applied to synthetically and naturally developed fracture surfaces of different types of rocks. The methods yield inconsistent fractal dimensions for natural fracture surfaces, and the reasons for this are discussed. The anisotropy of the fractal dimension, which may link the fracturing mechanism to the multifractality of the fracture surfaces, is also addressed.
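For the one-dimensional profile case that the variogram method targets, the fractal dimension can be read off the scaling of the variogram: for a self-affine profile with Hurst exponent H, gamma(h) scales as h^(2H) and D = 2 - H. A minimal sketch, assuming an evenly sampled profile; it is not the paper's 2D procedure.

```python
import numpy as np

def variogram_fractal_dimension(z, max_lag=50):
    """Estimate the fractal dimension of a 1D self-affine profile z.

    The variogram gamma(h) = 0.5 * mean[(z(x+h) - z(x))^2] of a self-affine
    profile scales as h^(2H); the fractal dimension is D = 2 - H.
    """
    lags = np.arange(1, max_lag + 1)
    gamma = np.array([0.5 * np.mean((z[h:] - z[:-h]) ** 2) for h in lags])
    slope, _ = np.polyfit(np.log(lags), np.log(gamma), 1)   # slope = 2H
    return 2.0 - slope / 2.0

# Synthetic test: an ordinary Brownian profile has H = 0.5, i.e. D close to 1.5.
z = np.cumsum(np.random.default_rng(0).normal(size=10_000))
print(variogram_fractal_dimension(z))
```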


Author(s):  
Carlos A. P. Bengaly ◽  
Uendert Andrade ◽  
Jailson S. Alcaniz

Abstract We address the $\simeq 4.4\sigma$ tension between local and CMB measurements of the Hubble Constant using simulated Type Ia Supernova (SN) data sets. We probe its directional dependence by means of a hemispherical comparison across the entire celestial sphere as an estimator of the $H_0$ cosmic variance. We perform Monte Carlo simulations assuming isotropic and non-uniform distributions of data points, the latter coinciding with the real data. This allows us to incorporate observational features, such as sample incompleteness, in our estimation. We find that this tension can be alleviated to $3.4\sigma$ for isotropic realizations, and to $2.7\sigma$ for non-uniform ones. We also find that the $H_0$ variance is largely reduced if the data sets are augmented to 4 and 10 times the current size. Future surveys will be able to tell whether the Hubble Constant tension happens due to unaccounted cosmic variance, or whether it is an actual indication of physics beyond the standard cosmological model.
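A toy sketch of a hemispherical-comparison estimator, assuming per-supernova sky positions and per-object $H_0$ estimates are already in hand (e.g. $cz/d_L$ at low redshift); the axis scan and the summary statistic here are simplified illustrations, not the paper's pipeline.

```python
import numpy as np

def hemispherical_h0_variance(directions, h0_per_sn, n_axes=2000, seed=0):
    """Scan hemisphere axes and record the north-south difference in mean H0.

    directions : (n, 3) unit vectors of the supernova sky positions
    h0_per_sn  : (n,) per-object H0 estimates
    """
    rng = np.random.default_rng(seed)
    axes = rng.normal(size=(n_axes, 3))
    axes /= np.linalg.norm(axes, axis=1, keepdims=True)

    deltas = []
    for axis in axes:
        north = directions @ axis > 0
        deltas.append(h0_per_sn[north].mean() - h0_per_sn[~north].mean())
    return np.max(np.abs(deltas))        # maximal hemispherical asymmetry

# Toy usage with an isotropic mock sample.
rng = np.random.default_rng(1)
dirs = rng.normal(size=(1000, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
h0 = rng.normal(70.0, 2.0, size=1000)
print(hemispherical_h0_variance(dirs, h0))
```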


1994 ◽  
Vol 1 (2/3) ◽  
pp. 182-190 ◽  
Author(s):  
M. Eneva

Abstract. Using finite data sets and study volumes of limited size may result in significant spurious effects when estimating the scaling properties of various physical processes. These effects are examined with an example featuring the spatial distribution of induced seismic activity in Creighton Mine (northern Ontario, Canada). The events studied in the present work occurred during a three-month period, March-May 1992, within a volume of approximate size 400 x 400 x 180 m³. Two sets of microearthquake locations are studied: Data Set 1 (14,338 events) and Data Set 2 (1,654 events). Data Set 1 includes the more accurately located events and amounts to about 30 per cent of all recorded data. Data Set 2 represents the portion of the first data set formed by the most accurately located and strongest microearthquakes. The spatial distribution of events in the two data sets is examined for scaling behaviour using the method of generalized correlation integrals featuring various moments q. From these, generalized correlation dimensions are estimated using the slope method. Similar estimates are made for randomly generated point sets using the same numbers of events and the same study volumes as for the real data. Uniform and monofractal random distributions are used for these simulations. In addition, samples from the real data are randomly extracted and their dimension spectra are examined as well. The spectra for the uniform and monofractal random generations show spurious multifractality due only to the use of finite numbers of data points and the limited size of the study volume. Comparing these with the spectra of dimensions for Data Set 1 and Data Set 2 allows us to estimate the bias likely to be present in the estimates for the real data. The strong multifractality suggested by the spectrum for Data Set 2 appears to be largely spurious; the spatial distribution, while different from uniform, could originate from a monofractal process. The spatial distribution of microearthquakes in Data Set 1 is either monofractal as well, or only weakly multifractal. In all similar studies, comparisons of results from real data and simulated point sets may help distinguish between genuine and artificial multifractality, without necessarily resorting to large numbers of data points.
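A minimal sketch of the slope method on generalized correlation integrals, assuming the standard definition C_q(r) = (mean_i [p_i(r)]^(q-1))^(1/(q-1)) for q != 1, where p_i(r) is the fraction of other points within distance r of point i; D_q is the slope of log C_q(r) versus log r. The radii and sample sizes below are arbitrary.

```python
import numpy as np
from scipy.spatial.distance import cdist

def generalized_dimension(points, q, radii):
    """Estimate the generalized correlation dimension D_q by the slope method."""
    D = cdist(points, points)
    np.fill_diagonal(D, np.inf)                      # exclude self-pairs
    n = len(points)
    Cq = []
    for r in radii:
        p_i = (D < r).sum(axis=1) / (n - 1)          # local point fractions
        Cq.append(np.mean(p_i ** (q - 1)) ** (1.0 / (q - 1)))
    slope, _ = np.polyfit(np.log(radii), np.log(Cq), 1)
    return slope

# Uniform random points in a unit cube: D_q should be close to 3 for all q
# (edge effects and the finite sample bias the estimate slightly low).
pts = np.random.default_rng(0).uniform(size=(2000, 3))
radii = np.logspace(-1.3, -0.6, 10)
for q in (2, 4, 6):
    print(q, round(generalized_dimension(pts, q, radii), 2))
```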

