Iterative Approaches to Handling Heteroscedasticity With Partially Known Error Variances

2019 ◽  
Vol 8 (2) ◽  
pp. 159
Author(s):  
Morteza Marzjarani

Heteroscedasticity plays an important role in data analysis. In this article, this issue is examined along with a few different approaches for handling it. First, an iteratively reweighted least squares (IRLS) procedure and an iterative feasible generalized least squares (IFGLS) procedure are deployed, and proper weights for reducing heteroscedasticity are determined. Next, a new approach for handling heteroscedasticity is introduced. In this approach, through fitting a multiple linear regression (MLR) model or a general linear model (GLM) to a sufficiently large data set, the data is divided into two parts through inspection of the residuals, based on the results of testing for heteroscedasticity or via simulations. The first part contains the records where the absolute values of the residuals can be assumed small enough that heteroscedasticity is ignorable. Under this assumption, the error variances are small and close to those of neighboring points; such error variances can be assumed known (but not necessarily equal). The second, remaining portion of the data is categorized as heteroscedastic. Using real data sets, it is concluded that this approach reduces the number of unusual (e.g., influential) data points flagged for further inspection and, more importantly, lowers the root mean square error (RMSE), resulting in a more robust set of parameter estimates.
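The IRLS idea can be sketched in a few lines: fit by least squares, estimate per-observation variances from the squared residuals, and refit with inverse-variance weights until the coefficients stabilize. A minimal numpy sketch under simple assumptions (the raw-squared-residual variance estimate and all names are illustrative, not the article's exact procedure):

```python
import numpy as np

def irls(X, y, n_iter=10, tol=1e-8):
    """Iteratively reweighted least squares (sketch).

    Weights are re-estimated from the squared residuals of the
    previous fit, a common remedy when error variances differ
    across observations (heteroscedasticity).
    """
    w = np.ones(len(y))                 # start from ordinary least squares
    beta = None
    for _ in range(n_iter):
        sw = np.sqrt(w)
        beta_new, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        resid = y - X @ beta_new
        # crude variance-function estimate: inverse squared residuals,
        # floored to avoid division by zero (a feasible GLS variant would
        # instead model the variance as a function of the predictors)
        w = 1.0 / np.maximum(resid ** 2, 1e-6)
        if beta is not None and np.max(np.abs(beta_new - beta)) < tol:
            break
        beta = beta_new
    return beta_new, w
```

In practice the squared residuals are smoothed, e.g., by regressing log r² on the predictors, before inverting them into weights; the raw version above is only meant to show the iteration structure.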

2012 ◽  
Vol 8 (4) ◽  
pp. 82-107 ◽  
Author(s):  
Renxia Wan ◽  
Yuelin Gao ◽  
Caixia Li

To date, several algorithms for clustering large data sets have been presented. Most clustering approaches are crisp ones, which are not well suited to the fuzzy case. In this paper, the authors explore a single-pass approach to fuzzy possibilistic clustering over large data sets. The basic idea of the proposed approach (weighted fuzzy-possibilistic c-means, WFPCM) is to use a modified possibilistic c-means (PCM) algorithm to cluster the weighted data points and centroids, processing one data segment at a time. Experimental results on both synthetic and real data sets show that WFPCM saves significant memory compared with the fuzzy c-means (FCM) and possibilistic c-means (PCM) algorithms. Furthermore, the proposed algorithm shows excellent immunity to noise, avoids splitting or merging the true clusters into inaccurate ones, and preserves the integrity and purity of the natural classes.
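The segment-at-a-time idea behind a single-pass scheme can be illustrated as follows: cluster one chunk, compress it into weighted centroids, and carry those centroids (with their accumulated weight) into the next chunk. The sketch below uses a plain weighted fuzzy c-means rather than the authors' modified PCM, so it shows the data flow only; all names are illustrative.

```python
import numpy as np

def weighted_fcm(X, w, k, m=2.0, iters=50, eps=1e-9):
    """Fuzzy c-means on weighted points (stand-in for the modified PCM)."""
    rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1) + eps
        u = 1.0 / (d2 ** (1.0 / (m - 1.0)))
        u /= u.sum(1, keepdims=True)        # memberships, rows sum to 1
        um = (u ** m) * w[:, None]          # point weights enter here
        centers = (um.T @ X) / um.sum(0)[:, None]
    return centers, u

def single_pass(segments, k):
    """Process the data one segment at a time, carrying weighted centroids."""
    carry_X = np.empty((0, segments[0].shape[1]))
    carry_w = np.empty(0)
    for seg in segments:
        X = np.vstack([carry_X, seg])
        w = np.concatenate([carry_w, np.ones(len(seg))])
        centers, u = weighted_fcm(X, w, k)
        carry_X = centers                               # compress the segment
        carry_w = ((u ** 2.0) * w[:, None]).sum(0)      # mass per centroid (m=2)
    return carry_X
```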


2018 ◽  
Vol 11 (2) ◽  
pp. 53-67
Author(s):  
Ajay Kumar ◽  
Shishir Kumar

Several initial center selection algorithms have been proposed in the literature for numerical data, but because the values of categorical data are unordered, these methods are not applicable to categorical data sets. This article investigates the initial center selection process for categorical data and then presents a new support-based initial center selection algorithm. The proposed algorithm measures the weight of each unique value of an attribute by its support and then sums these weights along each row to obtain the support of every row. A data object having the largest support is chosen as the initial center, and further centers are those at the greatest distance from the centers already selected. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu's method, and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.
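Under the stated description, the support computation reduces to value frequencies summed along rows. A hedged numpy sketch (Hamming distance stands in for the paper's distance, and all names are illustrative):

```python
import numpy as np

def support_init(X, k):
    """Pick k initial centers for categorical data (sketch).

    X: (n, m) array of categorical codes.  The support of a value is
    its frequency within its attribute; a row's support is the sum of
    its values' supports.  The first center is the row with the
    largest support; later centers maximize the distance to the
    centers already chosen.
    """
    n, m = X.shape
    row_support = np.zeros(n)
    for j in range(m):
        vals, counts = np.unique(X[:, j], return_counts=True)
        freq = dict(zip(vals, counts))
        row_support += np.array([freq[v] for v in X[:, j]])
    centers = [int(np.argmax(row_support))]
    while len(centers) < k:
        # Hamming distance from each row to its nearest chosen center
        d = np.min([(X != X[c]).sum(1) for c in centers], axis=0)
        centers.append(int(np.argmax(d)))
    return X[centers]
```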


2021 ◽  
Vol 3 (1) ◽  
pp. 1-7
Author(s):  
Yadgar Sirwan Abdulrahman

Clustering is one of the essential strategies in data analysis. Classical solutions assume that all features contribute equally to the clustering, but in real data sets some features are more important than others, so essential features should have a greater impact on identifying the optimal clusters. In this article, a fuzzy clustering algorithm with local automatic feature weighting is presented. The proposed algorithm has several advantages: 1) the feature weights act locally, meaning that each cluster's weights differ from the rest; 2) the distance between samples is calculated with a non-Euclidean similarity criterion to reduce the effect of noise; 3) the feature weights are obtained adaptively during the learning process. Mathematical analyses are given to derive the cluster centers and the feature weights. Experiments on a range of data sets demonstrate the proposed algorithm's efficiency compared with other algorithms that use global or local feature weighting.
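The abstract gives the algorithm's properties rather than its update rules, but a generic locally feature-weighted fuzzy c-means, in which each cluster maintains its own weight vector, conveys the structure. The sketch below uses standard attribute-weighted updates with a plain Euclidean distance, not the paper's non-Euclidean criterion; exponents and names are illustrative.

```python
import numpy as np

def local_weight_fcm(X, k, m=2.0, q=2.0, iters=60, eps=1e-9):
    """Fuzzy c-means with one feature-weight vector per cluster (sketch)."""
    rng = np.random.default_rng(1)
    n, p = X.shape
    C = X[rng.choice(n, k, replace=False)]       # cluster centers
    W = np.full((k, p), 1.0 / p)                 # local feature weights
    for _ in range(iters):
        diff2 = (X[:, None, :] - C[None, :, :]) ** 2          # (n, k, p)
        D = (diff2 * (W ** q)[None, :, :]).sum(-1) + eps      # weighted distances
        U = 1.0 / (D ** (1.0 / (m - 1.0)))
        U /= U.sum(1, keepdims=True)                          # memberships
        Um = U ** m
        C = (Um.T @ X) / Um.sum(0)[:, None]                   # new centers
        # per-cluster, per-feature dispersion drives the local weights:
        # features that vary little inside a cluster get more weight there
        V = np.einsum('nk,nkp->kp', Um, diff2) + eps
        W = 1.0 / (V ** (1.0 / (q - 1.0)))
        W /= W.sum(1, keepdims=True)
    return C, U, W
```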


Author(s):  
Md. Zakir Hossain ◽  
Md.Nasim Akhtar ◽  
R.B. Ahmad ◽  
Mostafijur Rahman

Data mining is the process of finding structure in large data sets. With this process, decision makers can make particular decisions for further development of real-world problems. Several data clustering techniques are used in data mining for finding specific patterns in data. The K-means method is one of the most familiar clustering techniques for large data sets. The K-means clustering method partitions the data set under the assumption that the number of clusters is fixed. The main problem of this method is that if the number of clusters is chosen too small, there is a higher probability of placing dissimilar items in the same group; on the other hand, if the number of clusters is chosen too high, there is a higher chance of placing similar items in different groups. In this paper, we address this issue by proposing a new K-means clustering algorithm that performs the clustering dynamically. The proposed method initially calculates a threshold value as a centroid for K-means, and based on this value the clusters are formed. At each iteration of K-means, if the Euclidean distance between two points is less than or equal to the threshold value, the two data points are placed in the same group; otherwise, the method creates a new cluster for the dissimilar data point. The results show that the proposed method outperforms the original K-means method.
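The threshold rule can be sketched as a leader-style clustering loop: a point joins the nearest centroid if it is within the threshold, otherwise it seeds a new cluster. The default threshold below (mean distance to the global centroid) is an illustrative stand-in, not the paper's exact rule.

```python
import numpy as np

def dynamic_kmeans(X, threshold=None, iters=20):
    """K-means variant that grows the number of clusters on demand (sketch)."""
    if threshold is None:
        threshold = np.linalg.norm(X - X.mean(0), axis=1).mean()
    centroids = [X[0].copy()]
    for _ in range(iters):
        labels = np.empty(len(X), dtype=int)
        for i, x in enumerate(X):
            d = np.linalg.norm(np.array(centroids) - x, axis=1)
            j = int(np.argmin(d))
            if d[j] <= threshold:
                labels[i] = j                    # close enough: join
            else:
                centroids.append(x.copy())       # too far: open a new cluster
                labels[i] = len(centroids) - 1
        # recompute centroids, dropping any that lost all members
        keep = [j for j in range(len(centroids)) if np.any(labels == j)]
        centroids = [X[labels == j].mean(0) for j in keep]
        labels = np.array([keep.index(l) for l in labels])
    return np.array(centroids), labels
```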


Author(s):  
Stefan Bamberger ◽  
Felix Krahmer

Abstract Johnson–Lindenstrauss embeddings are widely used to reduce the dimension and thus the processing time of data. To reduce the total complexity, fast algorithms for applying these embeddings are also necessary. To date, such fast algorithms are only available either for a non-optimal embedding dimension or up to a certain threshold on the number of data points. We address a variant of this problem where one aims to simultaneously embed larger subsets of the data set. Our method follows an approach by Nelson et al. (New constructions of RIP matrices with fast multiplication and fewer rows. In: Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1515–1528, 2014): a subsampled Hadamard transform maps points into a space of lower, but not optimal, dimension; subsequently, a random matrix with independent entries projects to an optimal embedding dimension. For subsets whose size scales at least polynomially in the ambient dimension, the complexity of this method comes close to the number of operations needed just to read the data, under mild assumptions on the size of the data set that are considerably less restrictive than in previous works. We also prove a lower bound showing that subsampled Hadamard matrices alone cannot reach an optimal embedding dimension; hence, the second embedding cannot be omitted.
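The two-stage construction can be sketched directly: flip random signs, apply a row-subsampled Hadamard transform to an intermediate dimension, then project with an independent Gaussian matrix to the target dimension. The sketch below uses a dense Hadamard matrix for clarity (the input dimension must be a power of two); a real implementation would use an O(d log d) fast Walsh–Hadamard transform, and all constants are illustrative.

```python
import numpy as np
from scipy.linalg import hadamard

def two_stage_jl(X, m_mid, m_out, seed=0):
    """Two-stage Johnson-Lindenstrauss embedding (sketch):
    subsampled randomized Hadamard transform to m_mid coordinates,
    then an independent Gaussian projection to m_out coordinates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape                                   # d: power of two
    signs = rng.choice([-1.0, 1.0], size=d)          # random diagonal
    rows = rng.choice(d, size=m_mid, replace=False)  # subsampled rows
    H = hadamard(d).astype(float)
    Y = (X * signs) @ H[rows].T / np.sqrt(m_mid)     # stage 1: SRHT
    G = rng.normal(size=(m_mid, m_out)) / np.sqrt(m_out)
    return Y @ G                                     # stage 2: Gaussian
```

The lower bound in the paper is what makes the second stage necessary: the subsampled Hadamard transform alone cannot reach the optimal embedding dimension, so the Gaussian projection cannot be dropped.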


2019 ◽  
Author(s):  
Martin Papenberg ◽  
Gunnar W. Klau

Numerous applications in psychological research require that a pool of elements is partitioned into multiple parts. While many applications seek groups that are well-separated, i.e., dissimilar from each other, others require the different groups to be as similar as possible. Examples include the assignment of students to parallel courses, assembling stimulus sets in experimental psychology, splitting achievement tests into parts of equal difficulty, and dividing a data set for cross validation. We present anticlust, an easy-to-use and free software package for solving these problems fast and in an automated manner. The package anticlust is an open source extension to the R programming language and implements the methodology of anticlustering. Anticlustering divides elements into similar parts, ensuring similarity between groups by enforcing heterogeneity within groups. Thus, anticlustering is the direct reversal of cluster analysis that aims to maximize homogeneity within groups and dissimilarity between groups. Our package anticlust implements two anticlustering criteria, reversing the clustering methods k-means and cluster editing, respectively. In a simulation study, we show that anticlustering returns excellent results and outperforms alternative approaches like random assignment and matching. In three example applications, we illustrate how to apply anticlust on real data sets. We demonstrate how to assign experimental stimuli to equivalent sets based on norming data, how to divide a large data set for cross validation, and how to split a test into parts of equal item difficulty and discrimination.
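The package is written in R, but the core of k-means anticlustering, maximizing rather than minimizing the within-group sum of squares, fits in a short exchange heuristic. The toy sketch below illustrates the principle and is not the anticlust implementation:

```python
import numpy as np

def anticluster(X, k, iters=2000, seed=0):
    """Exchange heuristic for k-means anticlustering (toy sketch):
    keep swaps that increase the within-group sum of squares, so the
    k groups end up as similar to one another as possible."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(k), -(-len(X) // k))[: len(X)]
    rng.shuffle(labels)                              # balanced random start

    def within_ss(lab):
        return sum(((X[lab == g] - X[lab == g].mean(0)) ** 2).sum()
                   for g in range(k))

    best = within_ss(labels)
    for _ in range(iters):
        i, j = rng.choice(len(X), 2, replace=False)
        if labels[i] == labels[j]:
            continue
        labels[i], labels[j] = labels[j], labels[i]  # tentative swap
        new = within_ss(labels)
        if new > best:
            best = new                               # keep the improvement
        else:
            labels[i], labels[j] = labels[j], labels[i]  # revert
    return labels
```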


2020 ◽  
Vol 37 (10) ◽  
pp. 3061-3075 ◽  
Author(s):  
Veronika Boskova ◽  
Tanja Stadler

Abstract Next-generation sequencing of pathogen quasispecies within a host yields data sets of tens to hundreds of unique sequences. However, the full data set often contains thousands of sequences, because many of those unique sequences have multiple identical copies. Data sets of this size represent a computational challenge for currently available Bayesian phylogenetic and phylodynamic methods. Through simulations, we explore how large data sets with duplicate sequences affect the speed and accuracy of phylogenetic and phylodynamic analysis within BEAST 2. We show that using unique sequences only leads to biases, and using a random subset of sequences yields imprecise parameter estimates. To overcome these shortcomings, we introduce PIQMEE, a BEAST 2 add-on that produces reliable parameter estimates from full data sets with increased computational efficiency as compared with the currently available methods within BEAST 2. The principle behind PIQMEE is to resolve the tree structure of the unique sequences only, while simultaneously estimating the branching times of the duplicate sequences. Distinguishing between unique and duplicate sequences allows our method to perform well even for very large data sets. Although the classic method converges poorly for data sets of 6,000 sequences when allowed to run for 7 days, our method converges in slightly more than 1 day. In fact, PIQMEE can handle data sets of around 21,000 sequences with 20 unique sequences in 14 days. Finally, we apply the method to a real, within-host HIV sequencing data set with several thousand sequences per patient.


1997 ◽  
Vol 9 (8) ◽  
pp. 1805-1842 ◽  
Author(s):  
Marcelo Blatt ◽  
Shai Wiseman ◽  
Eytan Domany

We present a new approach to clustering, based on the physical properties of an inhomogeneous ferromagnet. No assumption is made regarding the underlying distribution of the data. We assign a Potts spin to each data point and introduce an interaction between neighboring points, whose strength is a decreasing function of the distance between the neighbors. This magnetic system exhibits three phases. At very low temperatures, it is completely ordered; all spins are aligned. At very high temperatures, the system does not exhibit any ordering, and in an intermediate regime, clusters of relatively strongly coupled spins become ordered, whereas different clusters remain uncorrelated. This intermediate phase is identified by a jump in the order parameters. The spin-spin correlation function is used to partition the spins and the corresponding data points into clusters. We demonstrate on three synthetic and three real data sets how the method works. Detailed comparison to the performance of other techniques clearly indicates the relative success of our method.
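The data-dependent part of the construction, a neighbor graph with distance-decaying couplings, is easy to write down; the Gaussian form of the coupling follows the description above, while the k-nearest-neighbor rule and the constants are illustrative. The clusters themselves are then read off from spin-spin correlations estimated by Monte Carlo simulation of the Potts system, which is omitted here.

```python
import numpy as np

def potts_couplings(X, k=10):
    """Interaction strengths for superparamagnetic clustering (sketch):
    each point couples to its k nearest neighbors with a strength that
    decays as a Gaussian in the distance between them."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    nn = np.argsort(D, axis=1)[:, 1:k + 1]       # k nearest neighbors
    a = D[np.arange(n)[:, None], nn].mean()      # mean neighbor distance
    J = np.zeros((n, n))
    for i in range(n):
        for j in nn[i]:
            J[i, j] = J[j, i] = np.exp(-D[i, j] ** 2 / (2 * a ** 2)) / k
    return J
```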


2013 ◽  
Vol 46 (4) ◽  
pp. 960-971 ◽  
Author(s):  
Katja Jöchen ◽  
Thomas Böhlke

Experimental techniques [e.g. electron backscatter diffraction (EBSD)] yield detailed crystallographic information on the grain scale. In both two- and three-dimensional applications of EBSD, large data sets in the range of 10^5–10^9 single-crystal orientations are obtained. With regard to the precise but efficient micromechanical computation of the polycrystalline material response, small representative sets of crystallographic orientation data are required. This paper describes two methods to systematically reduce experimentally measured orientation data. Inspired by the work of Gao, Przybyla & Adams [Metall. Mater. Trans. A (2006), 37, 2379–2387], who used a tessellation of the orientation space in order to compute correlation functions, one method in this work uses a similar procedure to partition the orientation space into boxes, but with the aim of extracting the mean orientation of the data points of each box. The second method to reduce crystallographic texture data is based on a clustering technique. It is shown that, in terms of representativity of the reduced data, both methods deliver equally good results. While the clustering technique is computationally more costly, it works particularly well when the measured data set shows pronounced clusters in the orientation space. The quality of the results and the performance of the tessellation method are independent of the examined data set.
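The tessellation method can be sketched for orientations stored as unit quaternions: hash each quaternion into a coarse box of orientation space and return one average orientation per box. The quaternion mean below is the standard dominant-eigenvector construction; the bin resolution, the component-wise boxing, and all names are illustrative rather than the paper's exact tessellation.

```python
import numpy as np

def reduce_orientations(quats, bins=10):
    """Partition orientation space into boxes and keep one mean
    orientation per box (sketch of the tessellation method).

    quats: (n, 4) unit quaternions.
    """
    q = np.where(quats[:, :1] < 0, -quats, quats)    # identify q with -q
    keys = np.floor((q + 1.0) / 2.0 * bins).astype(int)
    keys = np.minimum(keys, bins - 1)                # box index per component
    boxes = {}
    for i, key in enumerate(map(tuple, keys)):
        boxes.setdefault(key, []).append(i)
    means = []
    for idx in boxes.values():
        Q = q[idx]
        # quaternion average: dominant eigenvector of sum of q q^T
        vals, vecs = np.linalg.eigh(Q.T @ Q)
        means.append(vecs[:, -1])                    # largest eigenvalue
    return np.array(means)
```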


Author(s):  
Lior Shamir

Abstract Several recent observations using large data sets of galaxies showed a non-random distribution of the spin directions of spiral galaxies, even when the galaxies are too far from each other to interact gravitationally. Here, a data set of $\sim8.7\cdot10^3$ spiral galaxies imaged by the Hubble Space Telescope (HST) is used to test and profile a possible asymmetry between galaxy spin directions. The asymmetry between galaxies with opposite spin directions is compared to the asymmetry of galaxies from the Sloan Digital Sky Survey (SDSS). The two data sets contain different galaxies at different redshift ranges, and each data set was annotated using a different annotation method. The results show that both data sets exhibit a similar asymmetry in the COSMOS field, which is covered by both telescopes. Fitting the asymmetry of the galaxies to a cosine dependence yields a dipole axis with significance of $\sim2.8\sigma$ and $\sim7.38\sigma$ in HST and SDSS, respectively. The most likely dipole axis identified in the HST galaxies is at $(\alpha=78^{\circ},\delta=47^{\circ})$, well within the $1\sigma$ error range of the location of the most likely dipole axis in the SDSS galaxies with $z>0.15$, identified at $(\alpha=71^{\circ},\delta=61^{\circ})$.
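Schematically, the dipole fit scans candidate axes and, for each one, regresses the spin sign of every galaxy against the cosine of its angular distance from the axis; the axis maximizing the fitted amplitude relative to its error is the most likely dipole. The sketch below simplifies the coordinate handling and uses a plain t-statistic as the significance proxy; the grid step and names are illustrative.

```python
import numpy as np

def unit_vec(ra_deg, dec_deg):
    """Unit vector on the celestial sphere from RA/Dec in degrees."""
    ra, dec = np.radians(ra_deg), np.radians(dec_deg)
    return np.stack([np.cos(dec) * np.cos(ra),
                     np.cos(dec) * np.sin(ra),
                     np.sin(dec)], axis=-1)

def dipole_scan(ra, dec, spin, step=5):
    """Scan candidate dipole axes; at each, fit spin ~ A*cos(angle)
    through the origin and keep the axis with the largest |A|/sigma_A."""
    v = unit_vec(ra, dec)                        # galaxy positions (n, 3)
    best = (-np.inf, None)
    for a_ra in range(0, 360, step):
        for a_dec in range(-90, 91, step):
            c = v @ unit_vec(a_ra, a_dec)        # cos(angular distance)
            A = (c * spin).sum() / (c * c).sum() # least-squares amplitude
            resid = spin - A * c
            sigma = np.sqrt((resid ** 2).sum()
                            / ((len(spin) - 1) * (c * c).sum()))
            if abs(A) / sigma > best[0]:
                best = (abs(A) / sigma, (a_ra, a_dec))
    return best                                  # (significance, axis)
```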

