Clusterdv, a simple density-based clustering method that is robust, general and automatic

Mapping Intimacies ◽

10.1101/224840 ◽

2017 ◽

Author(s):

João C. Marques ◽

Michael B. Orger

Keyword(s):

Clustering Algorithm ◽

Underlying Structure ◽

Data Sets ◽

Natural Phenomena ◽

Cluster Number ◽

Data Set ◽

Density Peaks ◽

Wide Range ◽

Cluster Shape ◽

Fully Automatic

AbstractHow to partition a data set into a set of distinct clusters is a ubiquitous and challenging problem. The fact that data varies widely in features such as cluster shape, cluster number, density distribution, background noise, outliers and degree of overlap, makes it difficult to find a single algorithm that can be broadly applied. One recent method, clusterdp, based on search of density peaks, can be applied successfully to cluster many kinds of data, but it is not fully automatic, and fails on some simple data distributions. We propose an alternative approach, clusterdv, which estimates density dips between points, and allows robust determination of cluster number and distribution across a wide range of data, without any manual parameter adjustment. We show that this method is able to solve a range of synthetic and experimental data sets, where the underlying structure is known, and identifies consistent and meaningful clusters in new behavioral data.Author summarIt is common that natural phenomena produce groupings, or clusters, in data, that can reveal the underlying processes. However, the form of these clusters can vary arbitrarily, making it challenging to find a single algorithm that identifies their structure correctly, without prior knowledge of the number of groupings or their distribution. We describe a simple clustering algorithm that is fully automatic and is able to correctly identify the number and shape of groupings in data of many types. We expect this algorithm to be useful in finding unknown natural phenomena present in data from a wide range of scientific fields.

Get full-text (via PubEx)

An Improved Random Seed Searching Clustering Algorithm Based on Shared Nearest Neighbor

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.719-720.1160 ◽

2015 ◽

Vol 719-720 ◽

pp. 1160-1165 ◽

Cited By ~ 1

Author(s):

Ya Ran Su ◽

Xi Xian Niu

Keyword(s):

Clustering Algorithm ◽

Nearest Neighbor ◽

Local Maximum ◽

Distribution Data ◽

Data Sets ◽

Cluster Number ◽

Data Set ◽

Maximum Cluster ◽

Data Objects ◽

Shared Nearest Neighbor

Clustering analysis continually consider as a hot field in Data Mining. For different types data sets and application purposes, the relevant researchers concern on various aspect, such as the adaptability to fit density and shape, noise detection, outliers identification, cluster number determination, accuracy and optimization. Lots of related works focus on the Shared Nearest Neighbor measure method, due to its best and wide adaptability to deal with complex distribution data set. Based on Shared Nearest Neighbor, an improved algorithm is proposed in this paper, it mainly target on the problems solution of natural distribute density, arbitrary shape and cluster number determination. The new algorithm start with random selected seed, follow the direction of its nearest neighbors, search and find its neighbors which have the greatest similar features, form the local maximum cluster, dynamically adjust the data objects’ affiliation to realize the local optimization at the same time, and then end the clustering procedure until identify all the data objects. Experiments verify the new algorithm has the advanced ability to fit the problems such as different density, shape, noise, cluster number and so on, and can realize fast optimization searching.

Get full-text (via PubEx)

A Visual and VAE Based Hierarchical Indoor Localization Method

Sensors ◽

10.3390/s21103406 ◽

2021 ◽

Vol 21 (10) ◽

pp. 3406

Author(s):

Jie Jiang ◽

Yin Zou ◽

Lidong Chen ◽

Yujie Fang

Keyword(s):

Image Retrieval ◽

Indoor Localization ◽

Data Sets ◽

Indoor Environments ◽

Global Features ◽

Data Set ◽

Data Annotation ◽

Wide Range ◽

Annotation Costs ◽

Global And Local

Precise localization and pose estimation in indoor environments are commonly employed in a wide range of applications, including robotics, augmented reality, and navigation and positioning services. Such applications can be solved via visual-based localization using a pre-built 3D model. The increase in searching space associated with large scenes can be overcome by retrieving images in advance and subsequently estimating the pose. The majority of current deep learning-based image retrieval methods require labeled data, which increase data annotation costs and complicate the acquisition of data. In this paper, we propose an unsupervised hierarchical indoor localization framework that integrates an unsupervised network variational autoencoder (VAE) with a visual-based Structure-from-Motion (SfM) approach in order to extract global and local features. During the localization process, global features are applied for the image retrieval at the level of the scene map in order to obtain candidate images, and are subsequently used to estimate the pose from 2D-3D matches between query and candidate images. RGB images only are used as the input of the proposed localization system, which is both convenient and challenging. Experimental results reveal that the proposed method can localize images within 0.16 m and 4° in the 7-Scenes data sets and 32.8% within 5 m and 20° in the Baidu data set. Furthermore, our proposed method achieves a higher precision compared to advanced methods.

Get full-text (via PubEx)

A SELF-ORGANIZING MAP FOR MIXED CONTINUOUS AND CATEGORICAL DATA

International Journal of Computing ◽

10.47839/ijc.10.1.733 ◽

2011 ◽

pp. 24-32 ◽

Cited By ~ 1

Author(s):

Nicoleta Rogovschi ◽

Mustapha Lebbah ◽

Younès Bennani

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Mixed Data ◽

Categorical Variables ◽

Data Sets ◽

Self Organizing Map ◽

Data Set ◽

Public Data ◽

Self Organizing

Most traditional clustering algorithms are limited to handle data sets that contain either continuous or categorical variables. However data sets with mixed types of variables are commonly used in data mining field. In this paper we introduce a weighted self-organizing map for clustering, analysis and visualization mixed data (continuous/binary). The learning of weights and prototypes is done in a simultaneous manner assuring an optimized data clustering. More variables has a high weight, more the clustering algorithm will take into account the informations transmitted by these variables. The learning of these topological maps is combined with a weighting process of different variables by computing weights which influence the quality of clustering. We illustrate the power of this method with data sets taken from a public data set repository: a handwritten digit data set, Zoo data set and other three mixed data sets. The results show a good quality of the topological ordering and homogenous clustering.

Get full-text (via PubEx)

Global whole-rock geochemical database compilation

10.5194/essd-2019-50 ◽

2019 ◽

Cited By ~ 2

Author(s):

Matthew Gard ◽

Derrick Hasterok ◽

Jacqueline Halpin

Keyword(s):

Database Management ◽

Temporal Trends ◽

Physical Property ◽

Database Management System ◽

Geochemical Data ◽

Data Sets ◽

Unique Contribution ◽

Data Set ◽

Wide Range ◽

Geochemical Indices

Abstract. Dissemination and collation of geochemical data are critical to promote rapid, creative and accurate research and place new results in an appropriate global context. To this end, we have assembled a global whole-rock geochemical database, with other associated sample information and properties, sourced from various existing databases and supplemented with numerous individual publications and corrections. Currently the database stands at 1,023,490 samples with varying amounts of associated information including major and trace element concentrations, isotopic ratios, and location data. The distribution both spatially and temporally is quite heterogeneous, however temporal distributions are enhanced over some previous database compilations, particularly in terms of ages older than ~ 1000 Ma. Also included are a wide range of computed geochemical indices, physical property estimates and naming schema on a major element normalized version of the geochemical data for quick reference. This compilation will be useful for geochemical studies requiring extensive data sets, in particular those wishing to investigate secular temporal trends. The addition of physical properties, estimated by sample chemistry, represents a unique contribution to otherwise similar geochemical databases. The data is published in .csv format for the purposes of simple distribution but exists in a format acceptable for database management systems (e.g. SQL). One can either manipulate this data using conventional analysis tools such as MATLAB®, Microsoft® Excel, or R, or upload to a relational database management system for easy querying and management of the data as unique keys already exist. This data set will continue to grow, and we encourage readers to contact us or other database compilations contained within about any data that is yet to be included. The data files described in this paper are available at https://doi.org/10.5281/zenodo.2592823 (Gard et al., 2019).

Get full-text (via PubEx)

Panoramic stitching of heterogeneous single-cell transcriptomic data

10.1101/371179 ◽

2018 ◽

Cited By ~ 17

Author(s):

Brian Hie ◽

Bryan Bryson ◽

Bonnie Berger

Keyword(s):

Single Cell ◽

Cell Types ◽

Data Sets ◽

Cell Type ◽

Data Set ◽

Wide Range ◽

Data Set Integration ◽

Biological Patterns ◽

Insight Into ◽

Comprehensive Reference

AbstractResearchers are generating single-cell RNA sequencing (scRNA-seq) profiles of diverse biological systems1–4 and every cell type in the human body.5 Leveraging this data to gain unprecedented insight into biology and disease will require assembling heterogeneous cell populations across multiple experiments, laboratories, and technologies. Although methods for scRNA-seq data integration exist6,7, they often naively merge data sets together even when the data sets have no cell types in common, leading to results that do not correspond to real biological patterns. Here we present Scanorama, inspired by algorithms for panorama stitching, that overcomes the limitations of existing methods to enable accurate, heterogeneous scRNA-seq data set integration. Our strategy identifies and merges the shared cell types among all pairs of data sets and is orders of magnitude faster than existing techniques. We use Scanorama to combine 105,476 cells from 26 diverse scRNA-seq experiments across 9 different technologies into a single comprehensive reference, demonstrating how Scanorama can be used to obtain a more complete picture of cellular function across a wide range of scRNA-seq experiments.

Get full-text (via PubEx)

The Application of Fuzzy Clustering Number Algorithm in Network Intrusion Detection

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.760-762.2220 ◽

2013 ◽

Vol 760-762 ◽

pp. 2220-2223

Author(s):

Lang Guo

Keyword(s):

Intrusion Detection ◽

Fuzzy Clustering ◽

Clustering Algorithm ◽

Local Optimum ◽

Cluster Number ◽

Data Set ◽

Network Intrusion ◽

Correlation Degree ◽

Indicator Data ◽

Detection Effect

In view of the defects of K-means algorithm in intrusion detection: the need of preassign cluster number and sensitive initial center and easy to fall into local optimum, this paper puts forward a fuzzy clustering algorithm. The fuzzy rules are utilized to express the invasion features, and standardized matrix is adopted to further process so as to reflect the approximation degree or correlation degree between the invasion indicator data and establish a similarity matrix. The simulation results of KDD CUP1999 data set show that the algorithm has better intrusion detection effect and can effectively detect the network intrusion data.

Get full-text (via PubEx)

Characterising RDF data sets

Journal of Information Science ◽

10.1177/0165551516677945 ◽

2017 ◽

Vol 44 (2) ◽

pp. 203-229 ◽

Cited By ~ 6

Author(s):

Javier D Fernández ◽

Miguel A Martínez-Prieto ◽

Pablo de la Fuente Redondo ◽

Claudio Gutiérrez

Keyword(s):

Data Structures ◽

Large Scale ◽

Open Data ◽

Structural Features ◽

Data Sets ◽

Data Set ◽

Wide Range ◽

Rdf Data ◽

Description Framework ◽

Resource Description

The publication of semantic web data, commonly represented in Resource Description Framework (RDF), has experienced outstanding growth over the last few years. Data from all fields of knowledge are shared publicly and interconnected in active initiatives such as Linked Open Data. However, despite the increasing availability of applications managing large-scale RDF information such as RDF stores and reasoning tools, little attention has been given to the structural features emerging in real-world RDF data. Our work addresses this issue by proposing specific metrics to characterise RDF data. We specifically focus on revealing the redundancy of each data set, as well as common structural patterns. We evaluate the proposed metrics on several data sets, which cover a wide range of designs and models. Our findings provide a basis for more efficient RDF data structures, indexes and compressors.

Get full-text (via PubEx)

The Application of Probabilistic Neural Network in Speech Recognition Based on Partition Clustering

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.263-266.2173 ◽

2012 ◽

Vol 263-266 ◽

pp. 2173-2178

Author(s):

Xin Guang Li ◽

Min Feng Yao ◽

Li Rui Jian ◽

Zhen Jiang Li

Keyword(s):

Neural Network ◽

Speech Recognition ◽

Clustering Algorithm ◽

Probabilistic Neural Network ◽

Back Propagation ◽

Back Propagation Neural Network ◽

Data Sets ◽

Data Set ◽

Proposed Model ◽

Partition Clustering

A probabilistic neural network (PNN) speech recognition model based on the partition clustering algorithm is proposed in this paper. The most important advantage of PNN is that training is easy and instantaneous. Therefore, PNN is capable of dealing with real time speech recognition. Besides, in order to increase the performance of PNN, the selection of data set is one of the most important issues. In this paper, using the partition clustering algorithm to select data is proposed. The proposed model is tested on two data sets from the field of spoken Arabic numbers, with promising results. The performance of the proposed model is compared to single back propagation neural network and integrated back propagation neural network. The final comparison result shows that the proposed model performs better than the other two neural networks, and has an accuracy rate of 92.41%.

Get full-text (via PubEx)

Data Fusion Using a Multi-Sensor Sparse-Based Clustering Algorithm

Remote Sensing ◽

10.3390/rs12234007 ◽

2020 ◽

Vol 12 (23) ◽

pp. 4007

Author(s):

Kasra Rafiezadeh Shahi ◽

Pedram Ghamisi ◽

Behnood Rasti ◽

Robert Jackisch ◽

Paul Scheunders ◽

...

Keyword(s):

Clustering Algorithm ◽

Spatial Information ◽

Clustering Algorithms ◽

Hyperspectral Data ◽

Sensor Data ◽

Data Sets ◽

Data Types ◽

Data Set ◽

Multiple Data Sets ◽

Imaging Sensors

The increasing amount of information acquired by imaging sensors in Earth Sciences results in the availability of a multitude of complementary data (e.g., spectral, spatial, elevation) for monitoring of the Earth’s surface. Many studies were devoted to investigating the usage of multi-sensor data sets in the performance of supervised learning-based approaches at various tasks (i.e., classification and regression) while unsupervised learning-based approaches have received less attention. In this paper, we propose a new approach to fuse multiple data sets from imaging sensors using a multi-sensor sparse-based clustering algorithm (Multi-SSC). A technique for the extraction of spatial features (i.e., morphological profiles (MPs) and invariant attribute profiles (IAPs)) is applied to high spatial-resolution data to derive the spatial and contextual information. This information is then fused with spectrally rich data such as multi- or hyperspectral data. In order to fuse multi-sensor data sets a hierarchical sparse subspace clustering approach is employed. More specifically, a lasso-based binary algorithm is used to fuse the spectral and spatial information prior to automatic clustering. The proposed framework ensures that the generated clustering map is smooth and preserves the spatial structures of the scene. In order to evaluate the generalization capability of the proposed approach, we investigate its performance not only on diverse scenes but also on different sensors and data types. The first two data sets are geological data sets, which consist of hyperspectral and RGB data. The third data set is the well-known benchmark Trento data set, including hyperspectral and LiDAR data. Experimental results indicate that this novel multi-sensor clustering algorithm can provide an accurate clustering map compared to the state-of-the-art sparse subspace-based clustering algorithms.

Get full-text (via PubEx)

Seasonal and intra-diurnal variability of small-scale gravity waves in OH airglow at two Alpine stations

Atmospheric Measurement Techniques ◽

10.5194/amt-12-457-2019 ◽

2019 ◽

Vol 12 (1) ◽

pp. 457-469 ◽

Cited By ~ 2

Author(s):

Patrick Hannawald ◽

Carsten Schmidt ◽

René Sedlak ◽

Sabine Wüst ◽

Michael Bittner

Keyword(s):

Gravity Waves ◽

High Frequency ◽

Phase Speed ◽

Propagation Direction ◽

Field Of View ◽

Small Scale ◽

Data Sets ◽

Data Set ◽

Diurnal Variability ◽

Fully Automatic

Abstract. Between December 2013 and August 2017 the instrument FAIM (Fast Airglow IMager) observed the OH airglow emission at two Alpine stations. A year of measurements was performed at Oberpfaffenhofen, Germany (48.09∘ N, 11.28∘ E) and 2 years at Sonnblick, Austria (47.05∘ N, 12.96∘ E). Both stations are part of the network for the detection of mesospheric change (NDMC). The temporal resolution is two frames per second and the field-of-view is 55 km × 60 km and 75 km × 90 km at the OH layer altitude of 87 km with a spatial resolution of 200 and 280 m per pixel, respectively. This resulted in two dense data sets allowing precise derivation of horizontal gravity wave parameters. The analysis is based on a two-dimensional fast Fourier transform with fully automatic peak extraction. By combining the information of consecutive images, time-dependent parameters such as the horizontal phase speed are extracted. The instrument is mainly sensitive to high-frequency small- and medium-scale gravity waves. A clear seasonal dependency concerning the meridional propagation direction is found for these waves in summer in the direction to the summer pole. The zonal direction of propagation is eastwards in summer and westwards in winter. Investigations of the data set revealed an intra-diurnal variability, which may be related to tides. The observed horizontal phase speed and the number of wave events per observation hour are higher in summer than in winter.

Get full-text (via PubEx)