Structure Identification-Based Clustering According to Density Consistency

2011
Vol 2011
pp. 1-14
Author(s):  
Chunzhong Li ◽  
Zongben Xu

The structure of a data set is of critical importance in identifying clusters, especially the density-difference feature. In this paper, we present a clustering algorithm based on density consistency, which uses a filtering process to identify points sharing the same structural feature and classify them into the same cluster. The method is not restricted by cluster shape or by high-dimensional data sets, and it is robust to noise and outliers. Extensive experiments on synthetic and real-world data sets validate the proposed clustering algorithm.
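The abstract does not include pseudocode, so the sketch below is only one plausible reading of density-consistency filtering, with assumed details: local density is estimated as the inverse mean k-nearest-neighbour distance, and neighbouring points whose densities agree within a tolerance `tol` (a hypothetical parameter) are filtered into the same cluster.

```python
# Illustrative sketch (not the authors' exact algorithm): link neighbouring
# points whose local densities are consistent within a tolerance.
import numpy as np
from scipy.spatial import cKDTree

def density_consistency_clusters(X, k=10, tol=2.0):
    tree = cKDTree(X)
    dist, idx = tree.query(X, k=k + 1)                   # k neighbours plus self
    density = 1.0 / (dist[:, 1:].mean(axis=1) + 1e-12)   # inverse mean k-NN distance

    # Union-find over neighbour edges whose endpoint densities are consistent.
    parent = np.arange(len(X))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(X)):
        for j in idx[i, 1:]:
            ratio = max(density[i], density[j]) / min(density[i], density[j])
            if ratio <= tol:                              # densities agree: same structure
                parent[find(i)] = find(j)

    return np.array([find(i) for i in range(len(X))])     # cluster label per point
```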

Author(s):  
Drew Levin ◽  
Patrick Finley

Objective
To develop a spatially accurate biosurveillance synthetic data generator for the testing, evaluation, and comparison of new outbreak detection techniques.
Introduction
Development of new methods for the rapid detection of emerging disease outbreaks is a research priority in the field of biosurveillance. Because real-world data are often proprietary in nature, scientists must utilize synthetic data generation methods to evaluate new detection methodologies. Colizza et al. have shown that epidemic spread is dependent on the airline transportation network [1], yet current data generators do not operate over network structures. Here we present a new spatial data generator that models the spread of contagion across a network of cities connected by airline routes. The generator is developed in the R programming language and produces data compatible with the popular 'surveillance' software package.
Methods
Colizza et al. demonstrate the power-law relationships between city population, air traffic, and degree distribution [1]. We generate a transportation network as a Chung-Lu random graph [2] that preserves these scale-free relationships (Figure 1). First, given a power-law exponent and a desired number of cities, a probability mass function (PMF) is generated that mirrors the expected degree distribution for the given power-law relationship. Values are then sampled from this PMF to generate an expected degree (number of connected cities) for each city in the network. Edges (airline connections) are added to the network probabilistically as described in [2]. Unconnected graph components are each joined to the largest component using linear preferential attachment. Finally, city sizes are calculated based on an observed three-quarter power-law scaling relationship with the sampled degree distribution. Each city is represented as a customizable stochastic compartmental SIR model. Transportation between cities is modeled similarly to [2]. An infection is initialized in a single random city and infection counts are recorded in each city for a fixed period of time. A consistent fraction of the modeled infection cases are recorded as daily clinic visits. These counts are then added onto statically generated baseline data for each city to produce a full synthetic data set. Alternatively, data sets can be generated using real-world networks, such as the one maintained by the International Air Transport Association.
Results
Dynamics such as the number of cities, degree distribution power-law exponent, traffic flow, and disease kinetics can be customized. In the presented example (Figure 2) the outbreak spreads over a 20-city transportation network. Infection spreads rapidly once the more populated hub cities are infected. Cities that are multiple flights away from the initially infected city are infected late in the process. The generator is capable of creating data sets of arbitrary size, length, and connectivity to better mirror a diverse set of observed network types.
Conclusions
New computational methods for outbreak detection and surveillance must be compared to established approaches. Outbreak mitigation strategies require a realistic model of human transportation behavior to best evaluate impact. These actions require test data that accurately reflect the complexity of the real-world data they would be applied to. The outbreak data generated here represent the complexity of modern transportation networks and are made to be easily integrated with established software packages to allow for rapid testing and deployment.
Figure 1: Randomly generated scale-free transportation network with a power-law degree exponent of λ = 1.8. City and link sizes are scaled to reflect their weight.
Figure 2: An example of observed daily outbreak-related clinic visits across a randomly generated network of 20 cities. Each city is colored by the number of flights required to reach the city from the initial infection location. These generated counts are then added onto baseline data to create a synthetic data set for experimentation.
Keywords
Simulation; Network; Spatial; Synthetic; Data
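The published generator is implemented in R and interoperates with the 'surveillance' package; purely as an illustration of the network construction described above, here is a minimal Python sketch of sampling power-law expected degrees and wiring a Chung-Lu graph. Constants such as the base population of 1000 are hypothetical.

```python
# Minimal sketch of the Chung-Lu construction described in the Methods:
# sample power-law expected degrees, then add each edge probabilistically.
import numpy as np

rng = np.random.default_rng(0)

def chung_lu_network(n_cities=20, gamma=1.8, max_degree=None):
    max_degree = max_degree or n_cities - 1
    # Power-law PMF over possible degrees 1..max_degree: P(d) ~ d^(-gamma).
    degrees = np.arange(1, max_degree + 1)
    pmf = degrees.astype(float) ** -gamma
    pmf /= pmf.sum()
    # Sample an expected degree for each city.
    w = rng.choice(degrees, size=n_cities, p=pmf)
    total = w.sum()
    # Add edge (i, j) with probability min(1, w_i * w_j / total).
    adj = np.zeros((n_cities, n_cities), dtype=bool)
    for i in range(n_cities):
        for j in range(i + 1, n_cities):
            if rng.random() < min(1.0, w[i] * w[j] / total):
                adj[i, j] = adj[j, i] = True
    # City sizes follow the observed three-quarter power-law scaling.
    population = 1000 * w.astype(float) ** 0.75
    return adj, population
```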


2021
Vol 39 (15_suppl)
pp. e18725-e18725
Author(s):  
Ravit Geva ◽  
Barliz Waissengrin ◽  
Dan Mirelman ◽  
Felix Bokstein ◽  
Deborah T. Blumenthal ◽  
...  

e18725 Background: Healthcare data sharing is important for the creation of diverse and large data sets, supporting clinical decision making, and accelerating efficient research to improve patient outcomes. This is especially vital in the case of real-world data analysis. However, stakeholders are reluctant to share their data without ensuring patients’ privacy and proper protection of their data sets and the ways they are being used. Homomorphic encryption is a cryptographic capability that can address these issues by enabling computation on encrypted data without ever decrypting it, so that analytics results are obtained without revealing the raw data. The aim of this study is to prove the accuracy of the analytics results and the practical efficiency of the technology. Methods: A real-world data set of colorectal cancer patients’ survival data following two different treatment interventions, including 623 patients and 24 variables and amounting to 14,952 items of data, was encrypted using leveled homomorphic encryption implemented in the PALISADE software library. Statistical analysis of key oncological endpoints was blindly performed on both the raw data and the homomorphically encrypted data using descriptive statistics and survival analysis with Kaplan-Meier curves. Results were then compared with an accuracy goal of two decimal places. Results: For all variables analyzed, the difference between the results on the raw data and on the homomorphically encrypted data was within the pre-determined accuracy goal; the practical efficiency of the encrypted computation, measured by run time, is presented in the table. Conclusions: This study demonstrates that data encrypted with homomorphic encryption can be statistically analyzed with a precision of at least two decimal places, allowing safe clinical conclusions to be drawn while preserving patients’ privacy and protecting data owners’ data assets. Homomorphic encryption allows efficient computation on encrypted data to be performed non-interactively and without requiring decryption during computation time. Utilizing the technology will empower large-scale cross-institution and cross-stakeholder collaboration, allowing safe international collaborations. Clinical trial information: 0048-19-TLV. [Table: see text]
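The study used leveled homomorphic encryption from the PALISADE C++ library; the toy Python sketch below illustrates only the core idea of computing on ciphertexts, using the additively homomorphic Paillier scheme with deliberately tiny, insecure parameters.

```python
# Toy sketch of additively homomorphic encryption (Paillier) to illustrate
# computing on encrypted data. NOT the study's scheme: PALISADE implements
# leveled HE in C++; the tiny primes here offer no security.
import random
from math import gcd

p, q = 293, 433                                 # toy primes (insecure)
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)    # Carmichael lambda(n)
mu = pow(lam, -1, n)                            # modular inverse (Python 3.8+)

def encrypt(m):
    r = random.randrange(2, n)
    while gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n * mu) % n

# Homomorphic property: multiplying ciphertexts adds plaintexts, so an
# analyst can compute an encrypted sum (hence a mean) without decrypting.
survival_months = [14, 23, 8, 31]               # made-up example values
enc_sum = 1
for c in (encrypt(m) for m in survival_months):
    enc_sum = (enc_sum * c) % n2
assert decrypt(enc_sum) == sum(survival_months)
mean = decrypt(enc_sum) / len(survival_months)  # decrypted only at the end
```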


2020
Author(s):  
Renato Cordeiro de Amorim

In a real-world data set there is always the possibility, rather high in our opinion, that different features may have different degrees of relevance. Most machine learning algorithms deal with this fact by either selecting or deselecting features in the data preprocessing phase. However, we maintain that even among relevant features there may be different degrees of relevance, and this should be taken into account during the clustering process. With over 50 years of history, K-Means is arguably the most popular partitional clustering algorithm there is. The first K-Means-based clustering algorithm to compute feature weights was designed just over 30 years ago. Various such algorithms have been designed since, but there has not been, to our knowledge, a survey integrating empirical evidence of cluster recovery ability, common flaws, and possible directions for future research. This paper elaborates on the concept of feature weighting and addresses these issues by critically analysing some of the most popular, or innovative, feature weighting mechanisms based on K-Means.
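As one concrete example of such a mechanism, the sketch below follows the spirit of the weighted K-Means family this survey covers (e.g. W-k-means): assignments use feature-weighted distances, and each weight is recomputed from the dispersion its feature contributes. The weight update and the exponent β are standard for that family; everything else (initialization, iteration count) is an illustrative assumption.

```python
# Minimal sketch of feature-weighted K-Means in the W-k-means spirit.
import numpy as np

def weighted_kmeans(X, k=3, beta=2.0, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, k, replace=False)].astype(float)
    weights = np.full(d, 1.0 / d)
    for _ in range(n_iter):
        # Assign points using feature-weighted squared distances.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2 * weights ** beta).sum(-1)
        labels = dist.argmin(1)
        # Update centers as within-cluster means.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
        # Per-feature dispersion summed over clusters.
        D = np.array([((X[labels == j] - centers[j]) ** 2).sum(0)
                      for j in range(k)]).sum(0)
        D = np.maximum(D, 1e-12)
        # W-k-means update: w_v = 1 / sum_u (D_v / D_u)^(1 / (beta - 1)).
        weights = 1.0 / ((D[:, None] / D[None, :]) ** (1.0 / (beta - 1))).sum(1)
    return labels, centers, weights
```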


2020
Vol 267 (S1)
pp. 185-196
Author(s):  
J. Gerb ◽  
S. A. Ahmadi ◽  
E. Kierig ◽  
B. Ertl-Wagner ◽  
M. Dieterich ◽  
...  

Abstract Background Objective and volumetric quantification is a necessary step in the assessment and comparison of endolymphatic hydrops (ELH) results. Here, we introduce a novel tool for automatic volumetric segmentation of the endolymphatic space (ELS) for ELH detection in delayed intravenous gadolinium-enhanced magnetic resonance imaging of the inner ear (iMRI) data. Methods The core component is a novel algorithm based on Volumetric Local Thresholding (VOLT). The study included three different data sets: a real-world data set (D1) to develop the novel ELH detection algorithm and two validating data sets, one artificial (D2) and one entirely unseen prospective real-world data set (D3). D1 included 210 inner ears of 105 patients (50 male; mean age 50.4 ± 17.1 years), and D3 included 20 inner ears of 10 patients (5 male; mean age 46.8 ± 14.4 years) with episodic vertigo attacks of different etiology. D1 and D3 did not differ significantly concerning age, gender, the grade of ELH, or data quality. As an artificial data set, D2 provided a known ground truth and consisted of an 8-bit cuboid volume using the same voxel size and grid as the real-world data, with different-sized cylindrical and cuboid-shaped cutouts (signal) whose grayscale values matched the real-world data set D1 (mean 68.7 ± 7.8; range 48.9–92.8). The evaluation included segmentation accuracy using the Sørensen-Dice overlap coefficient and segmentation precision by comparing the volume of the ELS. Results VOLT resulted in a high level of performance and accuracy in comparison with the respective gold standard. In the case of the artificial data set, VOLT outperformed the gold standard at higher noise levels. Data processing steps are fully automated and run without further user input in less than 60 s. ELS volume measured by automatic segmentation correlated significantly with the clinical grading of the ELS (p < 0.01). Conclusion VOLT enables open-source, reproducible, reliable, and automatic volumetric quantification of the inner ears’ fluid space using MR volumetric assessment of endolymphatic hydrops. This tool constitutes an important step towards comparable and systematic big-data analyses of the ELS in patients with the frequent syndrome of episodic vertigo attacks. A generic version of our three-dimensional thresholding algorithm has been made available to the scientific community via GitHub as an ImageJ plugin.
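For reference, the Sørensen-Dice overlap used to score segmentation accuracy has a one-line definition; the sketch below (with made-up masks standing in for ELS segmentations) shows how it compares a binary segmentation against ground truth.

```python
# Minimal sketch of the Sorensen-Dice overlap coefficient: twice the
# intersection of two binary volumes divided by the sum of their sizes.
import numpy as np

def dice_coefficient(seg, truth):
    seg, truth = seg.astype(bool), truth.astype(bool)
    intersection = np.logical_and(seg, truth).sum()
    denom = seg.sum() + truth.sum()
    return 2.0 * intersection / denom if denom else 1.0

# Example with hypothetical masks: an automatic segmentation vs ground truth.
auto_mask = np.zeros((64, 64, 64), dtype=bool)
auto_mask[20:40, 20:40, 20:40] = True
gt_mask = np.zeros((64, 64, 64), dtype=bool)
gt_mask[22:42, 20:40, 20:40] = True
print(f"Dice: {dice_coefficient(auto_mask, gt_mask):.3f}")
```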


2009
Vol 2009
pp. 1-16
Author(s):  
David J. Miller ◽  
Carl A. Nelson ◽  
Molly Boeka Cannon ◽  
Kenneth P. Cannon

Fuzzy clustering algorithms are helpful when there exists a data set with subgroupings of points having indistinct boundaries and overlap between the clusters. Traditional methods have been extensively studied and used on real-world data, but require users to have some knowledge of the outcome a priori in order to determine how many clusters to look for. Additionally, iterative algorithms choose the optimal number of clusters based on one of several performance measures. In this study, the authors compare the performance of three algorithms (fuzzy c-means, Gustafson-Kessel, and an iterative version of Gustafson-Kessel) when clustering a traditional data set as well as real-world geophysics data collected from an archaeological site in Wyoming. Areas of interest in the data were identified using a crisp cutoff value as well as a fuzzy α-cut to determine which provided better elimination of noise and non-relevant points. Results indicate that the α-cut method eliminates more noise than the crisp cutoff values and that the iterative version of the fuzzy clustering algorithm is able to select an optimum number of subclusters within a point set (in both the traditional and real-world data), leading to proper indication of regions of interest for further expert analysis.
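The comparison of a crisp cutoff with a fuzzy α-cut is easy to make concrete. The sketch below implements plain fuzzy c-means plus an α-cut that relabels low-membership points as noise; the Gustafson-Kessel variants used in the study additionally adapt a per-cluster covariance norm, which this sketch omits.

```python
# Minimal sketch of fuzzy c-means with a fuzzy alpha-cut: points whose top
# membership falls below alpha are treated as noise rather than assigned.
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(1, keepdims=True)                 # memberships sum to 1 per point
    for _ in range(n_iter):
        Um = U ** m
        centers = Um.T @ X / Um.sum(0)[:, None]  # membership-weighted centers
        dist = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-12
        ratio = dist[:, :, None] / dist[:, None, :]
        U = 1.0 / (ratio ** (2.0 / (m - 1))).sum(2)   # standard FCM update
    return U, centers

def alpha_cut(U, alpha=0.6):
    labels = U.argmax(1)
    labels[U.max(1) < alpha] = -1                # ambiguous points become noise
    return labels
```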


2015
Vol 2015
pp. 1-11
Author(s):  
Yuhua Zhang ◽  
Kun Wang ◽  
Min Gao ◽  
Zhiyou Ouyang ◽  
Siguang Chen

Mobile sensor networks (MSNs), consisting of mobile nodes, are sensitive to network attacks. An intrusion detection system (IDS) is an active network security technology that protects a network from attacks. In the data-gathering phase of an IDS, the high-dimensional data collected in multidimensional space puts great pressure on the subsequent data analysis and response phases. Therefore, traditional methods for intrusion detection are no longer applicable in MSNs. To improve the performance of data analysis, we apply the K-means algorithm to high-dimensional data clustering analysis. Thus, an improved K-means clustering algorithm based on linear discriminant analysis (LDA), called the LKM algorithm, is proposed. In this algorithm, we first apply LDA to reduce the high-dimensional data set to a 2-dimensional one; then we use the K-means algorithm for clustering analysis of the dimension-reduced data. Simulation results show that the LKM algorithm shortens the sample feature extraction time and improves the accuracy of the K-means clustering algorithm, both of which prove that the LKM algorithm enhances the performance of high-dimensional data analysis and the abnormal detection rate of an IDS in MSNs.
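A minimal sketch of the LDA-then-K-means pipeline using scikit-learn is shown below on synthetic stand-in data. Note that LDA is a supervised projection, so this assumes labeled training records are available to fit the 2-dimensional reduction; the data set and parameters here are hypothetical.

```python
# Minimal sketch of the two-step LKM idea: LDA reduces the records to 2-D,
# then K-means clusters the reduced data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in for high-dimensional, labeled network traffic records.
X, y = make_classification(n_samples=500, n_features=30, n_informative=10,
                           n_classes=3, random_state=0)

# Step 1: LDA projects the 30-dimensional records down to 2 dimensions.
X_2d = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

# Step 2: K-means clusters the dimension-reduced data.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)
```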


2013
Vol 2013
pp. 1-12
Author(s):  
Singh Vijendra ◽  
Sahoo Laxman

Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. In this paper, we present a robust multi-objective subspace clustering (MOSCL) algorithm for the challenging problem of high-dimensional clustering. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions and their locations in the data set. After detecting dense regions, it eliminates outliers. MOSCL then discovers subspaces in dense regions of the data set and produces subspace clusters. In thorough experiments on synthetic and real-world data sets, we demonstrate that MOSCL is superior to the PROCLUS clustering algorithm for subspace clustering. Additionally, we investigate the effects of the first phase, detecting dense regions, on the results of subspace clustering. Our results indicate that removing outliers improves the accuracy of subspace clustering. The clustering results are validated by clustering error (CE) distance on various data sets. MOSCL can discover clusters in all subspaces with high quality, and in efficiency it outperforms PROCLUS.
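The abstract does not spell out the first phase, so the sketch below is only one plausible illustration of per-attribute dense-region detection: histogram each dimension, mark bins denser than a uniform baseline, and treat points that fall in sparse bins of every attribute as outlier candidates. The `factor` threshold is a hypothetical parameter, not one from the paper.

```python
# Illustrative sketch (not the authors' exact procedure) of a first-phase
# dense/sparse region analysis over individual attributes.
import numpy as np

def dense_region_mask(X, n_bins=10, factor=1.5):
    n, d = X.shape
    in_dense = np.zeros((n, d), dtype=bool)
    expected = n / n_bins                       # uniform-density baseline per bin
    for j in range(d):
        counts, edges = np.histogram(X[:, j], bins=n_bins)
        dense_bins = counts > factor * expected
        # Map each value to its histogram bin (0 .. n_bins - 1).
        bin_idx = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, n_bins - 1)
        in_dense[:, j] = dense_bins[bin_idx]
    # Points lying in sparse bins of every attribute are outlier candidates.
    outlier_candidates = ~in_dense.any(axis=1)
    return in_dense, outlier_candidates
```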


2018
Vol 30 (6)
pp. 1624-1646
Author(s):  
Qidong Liu ◽  
Ruisheng Zhang ◽  
Zhili Zhao ◽  
Zhenghai Wang ◽  
Mengyao Jiao ◽  
...  

Minimax similarity stresses the connectedness of points via mediating elements rather than favoring high mutual similarity. This grouping principle yields superior clustering results when mining arbitrarily shaped clusters in data. However, it is not robust against noise and outliers in the data. There are two main problems with the grouping principle: first, a single object that is far away from all other objects defines a separate cluster, and second, two connected clusters would be regarded as two parts of one cluster. In order to solve such problems, we propose a robust minimum spanning tree (MST)-based clustering algorithm in this letter. First, we separate the connected objects by applying a density-based coarsening phase, resulting in a low-rank matrix in which each element denotes a supernode formed by combining a set of nodes. Then a greedy method is presented to partition those supernodes by working on the low-rank matrix. Instead of removing the longest edges from the MST, our algorithm groups the data set based on the minimax similarity. Finally, the assignment of all data points can be achieved through their corresponding supernodes. Experimental results on many synthetic and real-world data sets show that our algorithm consistently outperforms the compared clustering algorithms.
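Minimax similarity itself has a convenient MST characterization: the minimax distance between two points equals the largest edge on their unique MST path. The sketch below computes it directly with SciPy; the algorithm's density-based coarsening and greedy partitioning phases are not reproduced here.

```python
# Minimal sketch of minimax distances via the MST: for each pair, take the
# maximum edge weight on the unique MST path connecting them.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def minimax_distances(X):
    n = len(X)
    dist = squareform(pdist(X))
    mst = minimum_spanning_tree(dist).toarray()
    mst = np.maximum(mst, mst.T)                 # make the MST symmetric
    minimax = np.zeros((n, n))
    for src in range(n):                         # DFS from each node over the MST
        stack, seen = [(src, 0.0)], {src}
        while stack:
            node, best = stack.pop()
            minimax[src, node] = best            # max edge seen on the path so far
            for nxt in np.nonzero(mst[node])[0]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, max(best, mst[node, nxt])))
    return minimax
```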


2012
Vol 22 (04)
pp. 305-325
Author(s):  
MRIDUL AANJANEYA ◽  
FREDERIC CHAZAL ◽  
DANIEL CHEN ◽  
MARC GLISSE ◽  
LEONIDAS GUIBAS ◽  
...  

Many real-world data sets can be viewed as noisy samples of special types of metric spaces called metric graphs [19]. Building on the notions of correspondence and Gromov-Hausdorff distance in metric geometry, we describe a model for such data sets as an approximation of an underlying metric graph. We present a novel algorithm that takes such a data set as input and outputs a metric graph that is homeomorphic to the underlying metric graph and has bounded distortion of distances. We also implement the algorithm and evaluate its performance on a variety of real-world data sets.


2020
Vol 41 (Supplement_2)
Author(s):  
V McLaughlin ◽  
K Chin ◽  
N.H Kim ◽  
M Flynn ◽  
R Ong ◽  
...  

Abstract Introduction Restrictive inclusion criteria can exclude some CHD-PAH patients from clinical trials. The OPsumit® USers (OPUS) Registry and the OPsumit® Historical USers (OrPHeUS) data sets provide real-world data on PAH patients newly started on macitentan, including patients with CHD-PAH regardless of defect type. Purpose To describe the characteristics, safety, and clinical outcomes of CHD-PAH patients newly treated with macitentan. Methods OPUS is a prospective, US, multicentre, observational drug registry ongoing since April 2014. OrPHeUS was a retrospective, US, multicentre chart review with an observation period of Oct 2013–Mar 2017. This analysis reports information on CHD-PAH patients in the combined OPUS/OrPHeUS data set, descriptively compared with idiopathic/heritable PAH (I/HPAH) patients. Results As of Sept 2019, there were 4268 PAH patients with follow-up data, of whom 264 (6%) had CHD-PAH and 2396 (56%) had I/HPAH. For CHD-PAH and I/HPAH patients, respectively, at macitentan initiation: median age (Q1, Q3) was 48 (36, 62) and 65 (53, 73) years; 199 (75%) and 1748 (73%) were female; 67/114 (59%) and 802/1301 (62%) were WHO functional class III/IV; median (Q1, Q3) 6-minute walk distances were 350 (274, 420) and 289 (195, 375) m for the 82 and 840 patients with measurements; median (Q1, Q3) time from PAH diagnosis to macitentan initiation was 37.3 (4.5, 113.1) and 7.4 (1.4, 38.3) months; and 99 (38%) and 1056 (44%) initiated macitentan as monotherapy. The number of patients with ≥1 hepatic adverse event (HAE) was similar for CHD-PAH and I/HPAH (22 [8%] and 184 [8%]), as were the adverse event (AE) profiles (collected in OPUS only). Exposure, discontinuations, outcomes, and the most common AEs are shown in the table. Conclusions In general, CHD-PAH patients were younger than I/HPAH patients and a greater proportion had prevalent disease. Safety and outcomes were similar between the groups. Funding Acknowledgement Type of funding source: Other. Main funding source(s): Actelion Pharmaceuticals Ltd

