Revealing Structure in Visualizations of Dense 2D and 3D Parallel Coordinates

2006 ◽  
Vol 5 (2) ◽  
pp. 125-136 ◽  
Author(s):  
Jimmy Johansson ◽  
Patric Ljung ◽  
Mikael Jern ◽  
Matthew Cooper

Parallel coordinates is a well-known technique for the visualization of multivariate data. As the size of the data set increases, the parallel coordinates display becomes far too cluttered to perceive any structure. We tackle this problem by constructing high-precision textures to represent the data. By using transfer functions that operate on the high-precision textures, it is possible to highlight different aspects of the entire data set or of clusters within it. Our methods are implemented in both standard 2D parallel coordinates and 3D multi-relational parallel coordinates. Furthermore, when visualizing a larger number of clusters, a technique called ‘feature animation’ may be used as guidance by presenting various cluster statistics. A case study illustrates the analysis process when analyzing large multivariate data sets with our proposed techniques.
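
A minimal illustrative sketch (not the authors' implementation) of the core idea: polyline segments are accumulated into a high-precision floating-point texture, and a transfer function is then applied to reveal structure in the dense display. The resolutions and the gamma-style transfer function below are assumptions for illustration.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(size=(5000, 4))                          # 5000 samples, 4 variables
data = (data - data.min(0)) / (data.max(0) - data.min(0))  # map each axis to [0, 1]

H, W = 256, 512                                # texture resolution
tex = np.zeros((H, W), dtype=np.float32)       # high-precision accumulator
xs = np.linspace(0, W - 1, data.shape[1])      # x-position of each axis

for row in data:                               # rasterize every polyline
    for a in range(len(row) - 1):
        x = np.arange(int(xs[a]), int(xs[a + 1]))
        t = (x - xs[a]) / (xs[a + 1] - xs[a])
        y = ((1 - t) * row[a] + t * row[a + 1]) * (H - 1)
        tex[y.astype(int), x] += 1.0           # accumulate line density

img = (tex / tex.max()) ** 0.3                 # transfer function: compress range
plt.imshow(img, origin="lower", aspect="auto", cmap="magma")
plt.title("Density-based parallel coordinates (sketch)")
plt.show()

Changing the exponent, or swapping in a lookup-table transfer function, emphasizes either the densest line bundles or sparse outliers, which is the role transfer functions play in the technique described above.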

1999 ◽  
Vol 09 (03) ◽  
pp. 195-202 ◽  
Author(s):  
JOSÉ ALFREDO FERREIRA COSTA ◽  
MÁRCIO LUIZ DE ANDRADE NETTO

Determining the structure of data without prior knowledge of the number of clusters or any information about their composition is a problem of interest in many fields, such as image analysis, astrophysics and biology. Partitioning a set of n patterns in a p-dimensional feature space must be done such that those in a given cluster are more similar to each other than to the rest. As there are approximately $K^n/K!$ possible ways of partitioning the patterns among K clusters, finding the best solution is very hard when n is large. The search space grows further when the number of partitions is not known a priori. Although the self-organizing feature map (SOM) can be used to visualize clusters, automating knowledge discovery with the SOM is a difficult task. This paper proposes region-based image processing methods to post-process the U-matrix obtained after the unsupervised learning performed by the SOM. Mathematical morphology is applied to identify regions of neurons that are similar. The number of regions and their labels are found automatically and are related to the number of clusters in a multivariate data set. New data can be classified by labeling them according to the best-matching neuron. Simulations using data sets drawn from finite mixtures of p-variate normal densities are presented, along with the advantages and drawbacks of the method.
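
A hedged sketch of the pipeline's shape: train a SOM, threshold the U-matrix, clean it with mathematical morphology, and count connected regions. It uses the third-party minisom package, and a median threshold plus binary opening stand in for the authors' full region-based method.

import numpy as np
from minisom import MiniSom
from scipy import ndimage

rng = np.random.default_rng(1)
# two-component mixture of bivariate normals, cf. the paper's simulations
data = np.vstack([rng.normal(0, 0.3, (200, 2)),
                  rng.normal(2, 0.3, (200, 2))])

som = MiniSom(15, 15, 2, sigma=1.5, learning_rate=0.5, random_seed=1)
som.train_random(data, 2000)

u = som.distance_map()                        # U-matrix, scaled to [0, 1]
valleys = u < np.median(u)                    # low distances = cluster interiors
valleys = ndimage.binary_opening(valleys)     # morphology: remove small specks
regions, n_clusters = ndimage.label(valleys)  # connected regions ~ clusters
print("estimated number of clusters:", n_clusters)

bmu = som.winner(np.array([1.9, 2.1]))        # classify new data by its
print("assigned region:", regions[bmu])       # best-matching unit's region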


2020 ◽  
Vol 11 (3) ◽  
pp. 42-67
Author(s):  
Soumeya Zerabi ◽  
Souham Meshoul ◽  
Samia Chikhi Boucherkha

Cluster validation aims both to evaluate the results of clustering algorithms and to predict the number of clusters. It is usually achieved using several indexes. Traditional internal cluster validation indexes (CVIs) are mainly based on computing pairwise distances, which results in quadratic complexity of the related algorithms. Existing CVIs cannot handle large data sets properly and need to be revisited to take account of the ever-increasing data set volume; parallel and distributed implementations of these indexes are therefore required. To cope with this issue, the authors propose two parallel and distributed models of internal CVIs, namely the Silhouette and Dunn indexes, using the MapReduce framework under Hadoop. The proposed models, termed MR_Silhouette and MR_Dunn, have been tested both to evaluate clustering results and to identify the optimal number of clusters. The results of the experimental study are very promising and show that the proposed parallel and distributed models achieve the expected tasks successfully.
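
The decomposition can be sketched in plain Python: a map phase computes a per-object silhouette value and a reduce phase averages them. Here multiprocessing stands in for Hadoop, and the function names are illustrative; the actual MR_Silhouette job is not reproduced.

import numpy as np
from multiprocessing import Pool

def point_silhouette(args):
    """Map step: silhouette s(i) = (b - a) / max(a, b) for one object."""
    i, X, labels = args
    d = np.linalg.norm(X - X[i], axis=1)               # distances to all objects
    own = (labels == labels[i]) & (np.arange(len(X)) != i)
    a = d[own].mean()                                  # mean intra-cluster distance
    b = min(d[labels == c].mean()                      # nearest other cluster
            for c in set(labels) if c != labels[i])
    return (b - a) / max(a, b)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(3, 0.5, (100, 2))])
    labels = np.array([0] * 100 + [1] * 100)
    with Pool() as pool:                               # map phase, in parallel
        s = pool.map(point_silhouette,
                     [(i, X, labels) for i in range(len(X))])
    print("Silhouette index:", np.mean(s))             # reduce phase: global mean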


2016 ◽  
Vol 39 (11) ◽  
pp. 1477-1501 ◽  
Author(s):  
Victoria Goode ◽  
Nancy Crego ◽  
Michael P. Cary ◽  
Deirdre Thornlow ◽  
Elizabeth Merwin

Researchers need to evaluate the strengths and weaknesses of data sets when choosing a secondary data set for a health care study. This research method review informs the reader of the major issues investigators must consider when incorporating secondary data into their repertoire of potential research designs, and shows the range of approaches investigators may take to answer nursing research questions in a variety of content areas. The researcher requires expertise in locating and judging data sets and must develop complex data-management skills for handling large numbers of records. Important considerations, such as firm knowledge of the research question supported by the conceptual framework and the selection of appropriate databases, guide the researcher in delineating the unit of analysis. Other, more complex issues to consider when conducting secondary data research include data access, management and security, and complex variable construction.


Sensors ◽  
2019 ◽  
Vol 19 (1) ◽  
pp. 166 ◽  
Author(s):  
Rahim Khan ◽  
Ihsan Ali ◽  
Saleh M. Altowaijri ◽  
Muhammad Zakarya ◽  
Atiq Ur Rahman ◽  
...  

Multivariate data sets are common in various application areas, such as wireless sensor networks (WSNs) and DNA analysis, and a robust mechanism is required to compute their similarity indexes regardless of the environment and problem domain. This study describes the usefulness of a non-metric-based approach, the longest common subsequence (LCS), in computing similarity indexes. Several non-metric-based algorithms are available in the literature; the most robust and reliable is the dynamic programming-based technique. However, dynamic programming-based techniques are considered inefficient, particularly for multivariate data sets, and the classical approaches are not powerful enough for sensor data or when the similarity indexes are extremely high or low. To address this issue, we propose an efficient algorithm for measuring the similarity indexes of multivariate data sets using a non-metric methodology. The proposed algorithm performs exceptionally well on numerous multivariate data sets compared with the classical dynamic programming-based algorithms. Performance is evaluated on several benchmark data sets and on a dynamic multivariate data set obtained from a WSN deployed at the Ghulam Ishaq Khan (GIK) Institute of Engineering Sciences and Technology. Our evaluation suggests that the proposed algorithm can be approximately 39.9% more efficient than its counterparts in computational time across various data sets.
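
For context, a minimal sketch of the classical dynamic-programming LCS baseline the paper improves upon, extended to multivariate sequences by calling two vectors a match when they lie within a tolerance eps; the paper's exact matching criterion and its efficiency optimizations are not reproduced here.

import numpy as np

def lcs_similarity(A, B, eps=0.1):
    """Normalized LCS length for multivariate sequences A (n x d), B (m x d)."""
    n, m = len(A), len(B)
    L = np.zeros((n + 1, m + 1), dtype=int)         # DP table, O(n*m) time/space
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if np.linalg.norm(A[i - 1] - B[j - 1]) <= eps:
                L[i, j] = L[i - 1, j - 1] + 1       # vectors match: extend LCS
            else:
                L[i, j] = max(L[i - 1, j], L[i, j - 1])
    return L[n, m] / min(n, m)                      # similarity index in [0, 1]

rng = np.random.default_rng(3)
A = rng.random((50, 3))                             # e.g. readings from 3 sensors
B = A + rng.normal(0, 0.02, A.shape)                # noisy copy of the same signal
print("similarity:", lcs_similarity(A, B))

The quadratic DP table is precisely why such techniques are considered inefficient on large multivariate data sets.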


Author(s):  
Md. Zakir Hossain ◽  
Md.Nasim Akhtar ◽  
R.B. Ahmad ◽  
Mostafijur Rahman

Data mining is the process of finding structure in large data sets. With this process, decision makers can make informed decisions for the further development of real-world problems. Several data clustering techniques are used in data mining to find specific patterns in data. The K-means method is one of the most familiar techniques for clustering large data sets. It partitions the data set under the assumption that the number of clusters is fixed. The main problem with this method is that if the number of clusters is chosen too small, there is a higher probability of adding dissimilar items to the same group; if it is chosen too high, there is a higher chance of placing similar items in different groups. In this paper, we address this issue by proposing a new K-means clustering algorithm that performs clustering dynamically. The proposed method initially calculates a threshold value as a centroid of K-means, and based on this value the clusters are formed. At each iteration of K-means, if the Euclidean distance between two points is less than or equal to the threshold value, the two points are placed in the same group; otherwise, the method creates a new cluster for the dissimilar data point. The results show that the proposed method outperforms the original K-means method.
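
A minimal sketch of the threshold-driven idea described above. The paper does not spell out its threshold formula here, so the heuristic below (mean distance to the global centroid) is an assumption for illustration.

import numpy as np

def dynamic_kmeans(X, threshold=None):
    """Join the nearest cluster if within `threshold`, else open a new one."""
    if threshold is None:           # assumed heuristic, not the paper's formula
        threshold = np.linalg.norm(X - X.mean(0), axis=1).mean()
    centroids, members = [X[0]], [[0]]
    for i, x in enumerate(X[1:], start=1):
        d = [np.linalg.norm(x - c) for c in centroids]
        k = int(np.argmin(d))
        if d[k] <= threshold:                       # similar: join cluster k
            members[k].append(i)
            centroids[k] = X[members[k]].mean(0)    # refresh the centroid
        else:                                       # dissimilar: new cluster
            centroids.append(x)
            members.append([i])
    return centroids, members

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.4, (50, 2)), rng.normal(5, 0.4, (50, 2))])
cents, groups = dynamic_kmeans(X)
print("clusters found:", len(cents))                # expected: 2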


2019 ◽  
Author(s):  
Attila Lengyel ◽  
David W. Roberts ◽  
Zoltán Botta-Dukát

Abstract

Aims: To introduce REMOS, a new iterative reallocation method (with two variants) for vegetation classification, and to compare its performance with OPTSIL. We test (1) how effectively REMOS and OPTSIL maximize mean silhouette width and minimize the number of negative silhouette widths when run on classifications with different structure; (2) how the methods differ in runtime with different sample sizes; and (3) whether classifications by the reallocation methods differ in the number of diagnostic species, a surrogate for interpretability.

Study area: Simulation; example data sets from grasslands in Hungary and forests in Wyoming and Utah, USA.

Methods: We classified random subsets of simulated data with the flexible-beta algorithm for different values of beta. These classifications were subsequently optimized by REMOS and OPTSIL and compared for mean silhouette width and the proportion of negative silhouette widths. Then we classified three vegetation data sets of different sizes into two to ten clusters, optimized them with the reallocation methods, and compared their runtimes, mean silhouette widths, numbers of negative silhouette widths, and numbers of diagnostic species.

Results: In terms of mean silhouette width, OPTSIL performed best when the initial classification already had high mean silhouette width. The REMOS algorithms reached slightly lower mean silhouette widths than were maximally achievable with OPTSIL, but their efficiency was consistent across different initial classifications; thus, REMOS was significantly superior to OPTSIL when the initial classification had low mean silhouette width. REMOS yielded zero or a negligible number of negative silhouette widths across all classifications; OPTSIL performed similarly when the initial classification was effective but could not reach as low a proportion of misclassified objects when it was not. The REMOS algorithms were typically more than an order of magnitude faster to calculate than OPTSIL. There was no clear difference between REMOS and OPTSIL in the number of diagnostic species.

Conclusions: The REMOS algorithms may be preferable to OPTSIL when (1) the primary objective is to reduce or eliminate negative silhouette widths in a classification, (2) the initial classification has low mean silhouette width, or (3) the time efficiency of the algorithm matters because of the size of the data set or a high number of clusters.
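
A simplified sketch in the spirit of silhouette-based reallocation: objects with negative silhouette width are moved, worst first, to the cluster whose members are on average closest. This greedy loop is an illustration only, not the published REMOS-1/REMOS-2 or OPTSIL algorithms.

import numpy as np
from sklearn.metrics import silhouette_samples

def reallocate(X, labels, max_iter=200):
    labels = labels.copy()
    for _ in range(max_iter):
        s = silhouette_samples(X, labels)
        bad = np.where(s < 0)[0]                    # misclassified objects
        if len(bad) == 0:
            break
        i = bad[np.argmin(s[bad])]                  # worst offender first
        d = {c: np.linalg.norm(X[labels == c] - X[i], axis=1).mean()
             for c in np.unique(labels) if c != labels[i]}
        labels[i] = min(d, key=d.get)               # move to nearest cluster
    return labels

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.6, (60, 2)), rng.normal(2.5, 0.6, (60, 2))])
labels = rng.integers(0, 2, len(X))                 # deliberately poor partition
print("negatives before:", (silhouette_samples(X, labels) < 0).sum())
print("negatives after: ", (silhouette_samples(X, reallocate(X, labels)) < 0).sum())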


Author(s):  
Tiziano Cattaneo ◽  
Roberto De Lotto ◽  
Elisabetta Maria Venco

Regional and urban planning, like design actions, usually involves different themes and disciplines, especially when the goal is to improve, restore, and re-functionalize existing minor settlements in a rural-urban context. It is therefore necessary to define integrated methodologies able to address inter-scalar issues and interdisciplinary themes. The authors propose a framework for a decision support system based on the treatment of geographical data and on the integration of data sets that have dissimilar origins, diverse formats (not necessarily digital), and different meanings. This composite data set spans various disciplines, and specific knowledge can be deduced from it through analytical passages and assessment steps. In the paper, the authors describe a methodological approach to support planning activities, technical support for seeking a (dynamic) balance between urban density and rural fragmentation, and a Best Practices database to support scenarios in the rural-urban context. The authors first present the application field, then the logical framework of the whole process, then some related spatial analysis applications, and finally a comprehensive case study of the whole procedure.


2013 ◽  
Vol 3 (4) ◽  
pp. 1-14 ◽  
Author(s):  
S. Sampath ◽  
B. Ramya

Cluster analysis is a branch of data mining that plays a vital role in revealing hidden information in databases. Clustering algorithms help medical researchers identify natural subgroups in a data set. Different types of clustering algorithms are available in the literature; the most popular among them is k-means. Although k-means clustering is widely used, its application requires knowledge of the number of clusters present in the given data set. Several solutions are available in the literature to overcome this limitation. The k-means method creates a disjoint and exhaustive partition of the data set; however, in some situations one can come across objects that belong to more than one cluster. In this paper, a clustering algorithm is proposed that is capable of producing rough clusters automatically, without requiring the user to specify the number of clusters. The efficiency of the algorithm in detecting the number of clusters present in the data set has been studied with the help of some real-life data sets. Further, a nonparametric statistical analysis of the results of the experimental study has been carried out to assess the efficiency of the proposed algorithm in the automatic detection of the number of clusters, using a rough version of the Davies-Bouldin index.
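
A minimal sketch of the rough-clustering notion the paper builds on: an object whose two nearest centroids are nearly equally close is placed in the upper approximation (boundary) of both clusters instead of being forced into one. The fixed centroids, the ratio threshold, and the known K are all illustrative simplifications; the paper's algorithm determines the number of clusters automatically.

import numpy as np

def rough_assign(X, centroids, ratio=1.2):
    lower = {k: [] for k in range(len(centroids))}   # certain members
    upper = {k: [] for k in range(len(centroids))}   # possible members
    for i, x in enumerate(X):
        d = np.linalg.norm(centroids - x, axis=1)
        near, second = np.argsort(d)[:2]
        if d[second] / max(d[near], 1e-12) <= ratio:  # ambiguous object:
            upper[near].append(i)                     # belongs to the upper
            upper[second].append(i)                   # approximation of both
        else:
            lower[near].append(i)
            upper[near].append(i)
    return lower, upper

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 0.7, (80, 2)), rng.normal(2, 0.7, (80, 2))])
lower, upper = rough_assign(X, np.array([[0.0, 0.0], [2.0, 2.0]]))
print("objects shared by both clusters:", len(set(upper[0]) & set(upper[1])))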


2011 ◽  
Vol 5 (1) ◽  
pp. 271-290 ◽  
Author(s):  
C. Nuth ◽  
A. Kääb

Abstract. An increasing number of digital elevation models (DEMs) are available worldwide for deriving elevation differences over time, including vertical changes on glaciers. Most of these DEMs are heavily post-processed or merged, so that physical error modelling becomes difficult and statistical error modelling is required instead. We propose a three-step methodological framework for assessing and correcting DEMs to quantify glacier elevation changes: (i) remove DEM shifts, (ii) check for elevation-dependent biases, and (iii) check for higher-order, sensor-specific biases. A simple, analytic and robust method to co-register elevation data is presented for regions where stable terrain is either plentiful (case study New Zealand) or limited (case study Svalbard). The method is demonstrated using the three global elevation data sets available to date, SRTM, ICESat and the ASTER GDEM, and with automatically generated DEMs from the satellite stereo instruments ASTER and SPOT5-HRS. After 3-D co-registration, significant elevation-related biases were found in some of the stereoscopic DEMs. Biases related to the satellite acquisition geometry (along/cross track) were detected at two frequencies in the automatically generated ASTER DEMs. The higher-frequency bias seems to be related to satellite jitter, most apparent in the back-looking pass of the satellite. The origin of the more significant lower-frequency bias is uncertain. ICESat-derived elevations are found to be the most consistent globally available elevation data set so far. Before performing regional-scale glacier elevation change studies or mosaicking DEMs from multiple individual tiles (e.g. ASTER GDEM), we recommend co-registering all elevation data to ICESat as a global vertical reference system.
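
The analytic co-registration step can be sketched as a curve fit. Over stable terrain, the elevation difference dh between two horizontally shifted DEMs follows dh / tan(alpha) = a*cos(b - psi) + c, with slope alpha, aspect psi, shift magnitude a, shift direction b, and a vertical-bias term c. The synthetic data below is only a stand-in; a real workflow would sample dh, slope and aspect from the DEMs over a stable-terrain mask and iterate the shift until it converges.

import numpy as np
from scipy.optimize import curve_fit

def model(psi, a, b, c):
    return a * np.cos(b - psi) + c

rng = np.random.default_rng(7)
psi = rng.uniform(0, 2 * np.pi, 2000)           # aspect of stable-terrain cells
alpha = rng.uniform(0.05, 0.5, 2000)            # slope (radians)
dh = model(psi, 8.0, np.deg2rad(120), 1.5) * np.tan(alpha) \
     + rng.normal(0, 0.5, psi.size)             # synthetic, noisy differences

y = dh / np.tan(alpha)                          # normalize slope dependence out
(a, b, c), _ = curve_fit(model, psi, y, p0=[1.0, 0.0, 0.0])
if a < 0:                                       # resolve the cosine sign ambiguity
    a, b = -a, b + np.pi
print(f"shift: {a:.1f} m towards {np.rad2deg(b) % 360:.0f} deg, "
      f"vertical bias term {c:.2f} m")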

