Searching for clusters in population data
This article continues a series of publications developing a new, statistically motivated approach to data clustering. The proposed method is applied to the search for clusters of increased or decreased event frequencies in sets of neighboring cells in two-dimensional tessellations of the plane. Such cells may correspond to administrative regions, counties, etc. The case of simple frequency tables (histograms) with rectangular cells was considered earlier. The observed distribution of event frequencies across cells can be compared either with an expected one (for instance, uniform) or with the distribution at a previous moment in time. Groups of neighboring cells with the same direction of change are merged into clusters, which are then tested for statistical significance with allowance for multiple comparisons. Each group of cells is characterized by two parameters: its size (the number of cells) and the intensity of change. If the size of a group, its intensity, or both are sufficiently extreme, the group is considered a statistically significant cluster. No a priori assumptions are made about the number, size, or shape of potentially existing clusters. The method can be used to cluster any multidimensional array of p-values that are independent and uniformly distributed under the null hypothesis, where the alternative is that there exist sets of neighboring cells in which the p-values are close to 0 or to 1.
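The grouping step can be illustrated with a minimal sketch. It is not the implementation used in this series and rests on several assumptions: the cells form a rectangular grid, neighbors are the four edge-adjacent cells, the direction of a cell is decided by whether its p-value lies below 0.5, and the intensity of a group is taken to be the sum of negative log p-values (or of negative log(1 - p) for the upper tail). The function name find_groups and the sample grid are hypothetical. The significance test with allowance for multiple comparisons is not shown.

```python
import math
from collections import deque

EPS = 1e-300  # guard against log(0); an assumption of this sketch


def find_groups(pvals):
    """Merge 4-connected cells whose p-values lie on the same side of 0.5.

    Returns a list of dicts with the direction, size, illustrative
    intensity, and member cells of each group.
    """
    rows, cols = len(pvals), len(pvals[0])
    seen = [[False] * cols for _ in range(rows)]
    groups = []
    for r0 in range(rows):
        for c0 in range(cols):
            if seen[r0][c0]:
                continue
            low = pvals[r0][c0] < 0.5  # direction of the seed cell
            seen[r0][c0] = True
            queue = deque([(r0, c0)])
            cells = []
            # Breadth-first search over neighbors with the same direction.
            while queue:
                r, c = queue.popleft()
                cells.append((r, c))
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not seen[nr][nc]
                            and (pvals[nr][nc] < 0.5) == low):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            # Illustrative intensity: how far the p-values sit in the tail.
            intensity = sum(
                -math.log(max(pvals[r][c] if low else 1.0 - pvals[r][c], EPS))
                for r, c in cells
            )
            groups.append({
                "direction": "low" if low else "high",
                "size": len(cells),
                "intensity": intensity,
                "cells": cells,
            })
    return groups


# Hypothetical 3x3 array of p-values.
grid = [
    [0.01, 0.02, 0.60],
    [0.03, 0.70, 0.99],
    [0.55, 0.45, 0.98],
]
groups = find_groups(grid)
for g in groups:
    print(g["direction"], g["size"], round(g["intensity"], 3))
```

Each resulting group carries the two parameters named above, size and intensity. Deciding which groups are statistically significant clusters would require the null distribution of these parameters, which this sketch does not attempt to reproduce.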