Self-Training Algorithm Based on Density Peaks Combining Globally Adaptive Multi-Local Noise Filter

Mapping Intimacies ◽

10.21203/rs.3.rs-1048816/v1 ◽

2021 ◽

Author(s):

Shuaijun Li ◽

Jia Lu

Keyword(s):

Data Sets ◽

Training Algorithm ◽

Real World Data ◽

Density Peak ◽

Data Set ◽

Density Peaks ◽

Noise Filter ◽

Density Peak Clustering ◽

A Current ◽

Noise Filters

Abstract Self-training algorithms can quickly train a supervised classifier from a few labeled samples and many unlabeled samples. However, self-training is often affected by mislabeled samples, and local noise filters have been proposed to detect them. Nevertheless, current local noise filters have two problems: (a) they ignore the spatial distribution of the nearest neighbors in different classes; (b) they perform poorly when mislabeled samples are located in the overlapping areas of different classes. To address these problems, a new self-training algorithm based on density peaks combined with a globally adaptive multi-local noise filter (STDP-GAMNF) is proposed. First, the spatial structure of the data set is revealed by density peak clustering and is used to help self-training label the unlabeled samples. Meanwhile, after each epoch of labeling, GAMLNF comprehensively judges, across multiple classes, whether a sample is mislabeled, which effectively reduces the influence of edge samples. Experimental results on eighteen real-world data sets demonstrate that GAMLNF is not sensitive to the value of the neighbor parameter k and can adaptively find an appropriate number of neighbors for each class.

Density Peak Clustering Based on Relative Density Optimization

Mathematical Problems in Engineering ◽

10.1155/2020/2816102 ◽

2020 ◽

Vol 2020 ◽

pp. 1-8

Author(s):

Chunzhong Li ◽

Yunong Zhang

Keyword(s):

Relative Density ◽

Clustering Algorithms ◽

Real Data ◽

Classification Problem ◽

Data Sets ◽

Density Peak ◽

Data Set ◽

Density Peaks ◽

Assignment Strategy ◽

Density Peak Clustering

Among the many clustering algorithms, clustering by fast search and find of density peaks (DPC) is favoured because it is less affected by the shapes and density structures of the data set. However, DPC still shows limitations when clustering data sets with heterogeneous clusters, and it easily makes mistakes when assigning the remaining points. A new algorithm, density peak clustering based on relative density optimization (RDO-DPC), is proposed to address these problems and obtain better results. Using the neighborhood information of sample points, the proposed algorithm defines the relative density of the sample data and identifies density peaks of the nonhomogeneous distribution as cluster centers. A new assignment strategy is proposed to solve the abundance classification problem. Experiments on synthetic and real data sets show good performance of the proposed algorithm.

Auto-sharing parameters for transfer learning based on multi-objective optimization

Integrated Computer-Aided Engineering ◽

10.3233/ica-210655 ◽

2021 ◽

pp. 1-13

Author(s):

Hailin Liu ◽

Fangqing Gu ◽

Zixian Lin

Keyword(s):

Transfer Learning ◽

Optimization Problem ◽

Data Sets ◽

Multi Objective Optimization ◽

Particle Swarm Optimizer ◽

Real World Data ◽

Data Set ◽

Target Task ◽

Main Research ◽

Multi Objective

Transfer learning methods exploit similarities between different datasets to improve the performance of the target task by transferring knowledge from source tasks to the target task. “What to transfer” is a main research issue in transfer learning. Existing transfer learning methods generally need human knowledge to determine the shared parameters. However, in many real applications, which parameters can be shared is unknown beforehand. A transfer learning model is essentially a special multi-objective optimization problem. Consequently, this paper proposes a novel auto-sharing parameter technique for transfer learning based on multi-objective optimization and solves the optimization problem using a multi-swarm particle swarm optimizer. Each task objective is simultaneously optimized by a sub-swarm. The current best particle from the sub-swarm of the target task is used to guide the search of particles of the source tasks and vice versa. The target task and source task are jointly solved by sharing the information of the best particle, which works as an inductive bias. Experiments on several synthetic data sets and two real-world data sets, a school data set and a landmine data set, show that the proposed algorithm is effective.

Improving Density Peak Clustering by Automatic Peak Selection and Single Linkage Clustering

Symmetry ◽

10.3390/sym12071168 ◽

2020 ◽

Vol 12 (7) ◽

pp. 1168

Author(s):

Jun-Lin Lin ◽

Jen-Chieh Kuo ◽

Hsing-Wang Chuang

Keyword(s):

Clustering Algorithm ◽

Academic Community ◽

Performance Study ◽

Potential Density ◽

Cluster Assignment ◽

Density Peak ◽

Single Linkage ◽

Density Peaks ◽

Assignment Strategy ◽

Density Peak Clustering

Density peak clustering (DPC) is a density-based clustering method that has attracted much attention in the academic community. DPC works by first searching density peaks in the dataset, and then assigning each data point to the same cluster as its nearest higher-density point. One problem with DPC is the determination of the density peaks, where poor selection of the density peaks could yield poor clustering results. Another problem with DPC is its cluster assignment strategy, which often makes incorrect cluster assignments for data points that are far from their nearest higher-density points. This study modifies DPC and proposes a new clustering algorithm to resolve the above problems. The proposed algorithm uses the radius of the neighborhood to automatically select a set of the likely density peaks, which are far from their nearest higher-density points. Using the potential density peaks as the density peaks, it then applies DPC to yield the preliminary clustering results. Finally, it uses single-linkage clustering on the preliminary clustering results to reduce the number of clusters, if necessary. The proposed algorithm avoids the cluster assignment problem in DPC because the cluster assignments for the potential density peaks are based on single-linkage clustering, not based on DPC. Our performance study shows that the proposed algorithm outperforms DPC for datasets with irregularly shaped clusters.
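
The two baseline DPC steps named in this abstract (search for density peaks, then assign each point to the cluster of its nearest higher-density point) can be sketched as follows. This is a minimal illustration of plain DPC only, not the modified algorithm this paper proposes. The function name, the cutoff-kernel density, the `d_c` and `n_clusters` parameters, and the choice of peaks by the largest product of density and distance are all assumptions made for the example.

```python
import numpy as np

def density_peak_clustering(X, d_c, n_clusters):
    """Baseline DPC sketch (hypothetical helper): returns (labels, peak indices)."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Local density: number of other points closer than the cutoff d_c.
    rho = (dist < d_c).sum(axis=1) - 1
    # Visit points from highest to lowest density; ties are broken by index.
    order = np.argsort(-rho, kind="stable")
    delta = np.zeros(n)            # distance to the nearest higher-density point
    parent = np.full(n, -1)        # index of that nearest higher-density point
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = dist[i].max()   # convention for the densest point
            continue
        higher = order[:rank]
        j = higher[np.argmin(dist[i, higher])]
        delta[i], parent[i] = dist[i, j], j
    # Assumed peak rule: the n_clusters largest values of rho * delta.
    gamma = rho * delta
    gamma[order[0]] = np.inf       # the densest point is always a peak
    peaks = np.argsort(-gamma, kind="stable")[:n_clusters]
    labels = np.full(n, -1)
    labels[peaks] = np.arange(n_clusters)
    # Each remaining point inherits the label of its nearest higher-density point.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, peaks
```

Because points are visited in order of decreasing density, every point's nearest higher-density neighbor is already labeled when the point is reached. A single wrong link in that chain propagates to every point below it, which is the assignment weakness the abstract describes.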

Data segmentation based on the local intrinsic dimension

Scientific Reports ◽

10.1038/s41598-020-72222-0 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Michele Allegra ◽

Elena Facco ◽

Francesco Denti ◽

Alessandro Laio ◽

Antonietta Mira

Keyword(s):

High Dimensional Data ◽

Large Data ◽

Large Data Sets ◽

High Dimensional ◽

Data Sets ◽

Imaging Data ◽

Unsupervised Segmentation ◽

Real World Data ◽

Data Set ◽

Intrinsic Dimension

Abstract One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.

Structure Identification-Based Clustering According to Density Consistency

Mathematical Problems in Engineering ◽

10.1155/2011/890901 ◽

2011 ◽

Vol 2011 ◽

pp. 1-14 ◽

Cited By ~ 1

Author(s):

Chunzhong Li ◽

Zongben Xu

Keyword(s):

High Dimension ◽

Real World ◽

Clustering Algorithm ◽

Density Difference ◽

Structure Identification ◽

Data Sets ◽

Critical Importance ◽

Real World Data ◽

Data Set ◽

High Dimension Data

The structure of a data set, especially its density difference feature, is of critical importance in identifying clusters. In this paper, we present a clustering algorithm based on density consistency, a filtering process that identifies points with the same structure feature and assigns them to the same cluster. The method is not restricted by cluster shape or by high-dimensional data sets, and it is robust to noise and outliers. Extensive experiments on synthetic and real-world data sets validate the proposed clustering algorithm.

EFFECTS OF NONSINGULAR PREPROCESSING ON FEEDFORWARD NETWORK TRAINING

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001405004022 ◽

2005 ◽

Vol 19 (02) ◽

pp. 217-247 ◽

Cited By ~ 6

Author(s):

CHANGHUA YU ◽

MICHAEL T. MANRY ◽

JIANG LI

Keyword(s):

Back Propagation ◽

Original Data ◽

Data Sets ◽

Training Algorithm ◽

Feedforward Network ◽

Data Set ◽

Network Training ◽

The Neural Network ◽

Hidden Layer ◽

Theoretical Analyses

In the neural network literature, many preprocessing techniques, such as feature de-correlation, input unbiasing and normalization, are suggested to accelerate multilayer perceptron training. In this paper, we show that a network trained with an original data set and one trained with a linear transformation of the original data will go through the same training dynamics, as long as they start from equivalent states. Thus preprocessing techniques may not be helpful and are merely equivalent to using a different weight set to initialize the network. Theoretical analyses of such preprocessing approaches are given for conjugate gradient, back propagation and the Newton method. In addition, an efficient Newton-like training algorithm is proposed for hidden layer training. Experiments on various data sets confirm the theoretical analyses and verify the improvement of the new algorithm.
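
The notion of "equivalent states" can be illustrated numerically. The sketch below assumes a one-hidden-layer perceptron with tanh units and a purely linear preprocessing x' = A x. The network shape, the matrix A, and the helper names are illustrative choices, not taken from the paper. It checks only that the two networks compute the same outputs when the input weights are set to W1 A^{-1}; it does not reproduce the paper's analysis of training dynamics.

```python
import numpy as np

def mlp_forward(X, W1, b1, W2, b2):
    """One-hidden-layer perceptron: tanh hidden units, linear outputs."""
    hidden = np.tanh(X @ W1.T + b1)
    return hidden @ W2.T + b2

def equivalent_input_weights(W1, A):
    """Input weights for data preprocessed as x' = A x, chosen so W1' x' = W1 x."""
    return W1 @ np.linalg.inv(A)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                    # original inputs, one row per sample
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])                # nonsingular (determinant 7)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

X_pre = X @ A.T                                # preprocessed inputs x' = A x
W1_pre = equivalent_input_weights(W1, A)       # equivalent starting weights

out_original = mlp_forward(X, W1, b1, W2, b2)
out_preprocessed = mlp_forward(X_pre, W1_pre, b1, W2, b2)
print(np.allclose(out_original, out_preprocessed))
```

The equality holds because W1 A^{-1} A x = W1 x for every input, so the hidden-layer activations, and everything downstream of them, are identical in the two networks.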

Multiclass Contour-Preserving Classification with Support Vector Machine (SVM)

Journal of Intelligent Systems ◽

10.1515/jisys-2015-0087 ◽

2017 ◽

Vol 26 (2) ◽

pp. 323-334 ◽

Cited By ~ 1

Author(s):

Piyabute Fuangkhon

Keyword(s):

Support Vector Machine ◽

Classification Accuracy ◽

University Of California ◽

Support Vector ◽

Data Sets ◽

Feed Forward Neural Network ◽

Real World Data ◽

Data Set ◽

The University ◽

Training Sets

Abstract Multiclass contour-preserving classification (MCOV) has been used to preserve the contour of the data set and improve the classification accuracy of a feed-forward neural network. It synthesizes two types of new instances, called fundamental multiclass outpost vector (FMCOV) and additional multiclass outpost vector (AMCOV), in the middle of the decision boundary between consecutive classes of data. This paper compares the generalization obtained with a support vector machine (SVM) when FMCOVs, AMCOVs, or both MCOVs are included in the final training sets. The experiments were carried out using MATLAB R2015a and LIBSVM v3.20 on seven types of final training sets generated from each of the synthetic and real-world data sets from the University of California Irvine machine learning repository and the ELENA project. The experimental results confirm that including FMCOVs in final training sets that contain raw data can improve the SVM classification accuracy significantly.

A Spatial Biosurveillance Synthetic Data Generator in R

Online Journal of Public Health Informatics ◽

10.5210/ojphi.v9i1.7583 ◽

2017 ◽

Vol 9 (1) ◽

Cited By ~ 1

Author(s):

Drew Levin ◽

Patrick Finley

Keyword(s):

Power Law ◽

Real World ◽

Degree Distribution ◽

Transportation Network ◽

Synthetic Data ◽

Data Sets ◽

Real World Data ◽

Data Set ◽

Scale Free ◽

Data Generator

Objective: To develop a spatially accurate biosurveillance synthetic data generator for the testing, evaluation, and comparison of new outbreak detection techniques.

Introduction: Development of new methods for the rapid detection of emerging disease outbreaks is a research priority in the field of biosurveillance. Because real-world data are often proprietary in nature, scientists must utilize synthetic data generation methods to evaluate new detection methodologies. Colizza et al. have shown that epidemic spread is dependent on the airline transportation network [1], yet current data generators do not operate over network structures. Here we present a new spatial data generator that models the spread of contagion across a network of cities connected by airline routes. The generator is developed in the R programming language and produces data compatible with the popular 'surveillance' software package.

Methods: Colizza et al. demonstrate the power-law relationships between city population, air traffic, and degree distribution [1]. We generate a transportation network as a Chung-Lu random graph [2] that preserves these scale-free relationships (Figure 1). First, given a power-law exponent and a desired number of cities, a probability mass function (PMF) is generated that mirrors the expected degree distribution for the given power-law relationship. Values are then sampled from this PMF to generate an expected degree (number of connected cities) for each city in the network. Edges (airline connections) are added to the network probabilistically as described in [2]. Unconnected graph components are each joined to the largest component using linear preferential attachment. Finally, city sizes are calculated based on an observed three-quarter power-law scaling relationship with the sampled degree distribution. Each city is represented as a customizable stochastic compartmental SIR model. Transportation between cities is modeled similarly to [2]. An infection is initialized in a single random city and infection counts are recorded in each city for a fixed period of time. A consistent fraction of the modeled infection cases are recorded as daily clinic visits. These counts are then added onto statically generated baseline data for each city to produce a full synthetic data set. Alternatively, data sets can be generated using real-world networks, such as the one maintained by the International Air Transport Association.

Results: Dynamics such as the number of cities, degree distribution power-law exponent, traffic flow, and disease kinetics can be customized. In the presented example (Figure 2) the outbreak spreads over a 20-city transportation network. Infection spreads rapidly once the more populated hub cities are infected. Cities that are multiple flights away from the initially infected city are infected late in the process. The generator is capable of creating data sets of arbitrary size, length, and connectivity to better mirror a diverse set of observed network types.

Conclusions: New computational methods for outbreak detection and surveillance must be compared to established approaches. Outbreak mitigation strategies require a realistic model of human transportation behavior to best evaluate impact. These actions require test data that accurately reflect the complexity of the real-world data they would be applied to. The outbreak data generated here represent the complexity of modern transportation networks and are made to be easily integrated with established software packages to allow for rapid testing and deployment.

Figure 1: Randomly generated scale-free transportation network with a power-law degree exponent of λ=1.8. City and link sizes are scaled to reflect their weight.

Figure 2: An example of observed daily outbreak-related clinic visits across a randomly generated network of 20 cities. Each city is colored by the number of flights required to reach the city from the initial infection location. These generated counts are then added onto baseline data to create a synthetic data set for experimentation.

Keywords: Simulation; Network; Spatial; Synthetic; Data
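
The first network-generation steps (build a power-law PMF, sample an expected degree per city, add edges probabilistically) can be sketched as below. The sketch is in Python rather than the R used by the authors, and it uses the textbook Chung-Lu rule, in which cities i and j are linked with probability min(1, w_i w_j / Σw). Whether reference [2] uses exactly this rule is an assumption, and all function names, the `max_degree` cap, and the parameter values are hypothetical. Joining disconnected components, city sizing, and the SIR dynamics are omitted.

```python
import numpy as np

def power_law_pmf(exponent, max_degree):
    """PMF over degrees 1..max_degree, proportional to degree ** (-exponent)."""
    degrees = np.arange(1, max_degree + 1)
    weights = degrees.astype(float) ** (-exponent)
    return degrees, weights / weights.sum()

def sample_expected_degrees(n_cities, exponent, max_degree, rng):
    """Draw one expected degree (number of connected cities) per city."""
    degrees, pmf = power_law_pmf(exponent, max_degree)
    return rng.choice(degrees, size=n_cities, p=pmf)

def chung_lu_edges(expected_degrees, rng):
    """Link cities i < j with probability min(1, w_i * w_j / sum(w))."""
    w = np.asarray(expected_degrees, dtype=float)
    total = w.sum()
    edges = []
    for i in range(len(w)):
        for j in range(i + 1, len(w)):
            if rng.random() < min(1.0, w[i] * w[j] / total):
                edges.append((i, j))
    return edges

rng = np.random.default_rng(42)
# 20 cities and exponent 1.8 mirror the figures; max_degree=19 is an assumed cap.
expected = sample_expected_degrees(n_cities=20, exponent=1.8, max_degree=19, rng=rng)
edges = chung_lu_edges(expected, rng)
print(len(edges), "airline connections among", len(expected), "cities")
```

Under this rule, the expected degree of each city approximately matches its sampled value whenever no pair probability is clipped at 1, which is how the generated graph preserves the sampled degree distribution on average.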

Providing Data With High Utility And No Disclosure Risk For The Public and Researchers: An Evaluation By Advanced Statistical Disclosure Risk Methods

Austrian Journal of Statistics ◽

10.17713/ajs.v43i4.43 ◽

2014 ◽

Vol 43 (4) ◽

pp. 247-254

Author(s):

Matthias Templ

Keyword(s):

Data Privacy ◽

Data Sets ◽

Real World Data ◽

Data Set ◽

The Public ◽

Disclosure Control ◽

High Data ◽

Disclosure Risk ◽

Statistical Disclosure ◽

High Utility

The demand for data from surveys, registers or other data sets containing sensitive information on people or enterprises has increased significantly over the last years. However, before providing data to the public or to researchers, confidentiality has to be respected for any data set containing sensitive individual information. Confidentiality can be achieved by applying statistical disclosure control (SDC) methods to the data. Research on SDC methods has become more and more important in the last years because of increased awareness of data privacy and because more and more data are provided to the public or to researchers. However, for legal reasons this is only visible when the released data has (very) low disclosure risk. In this contribution existing disclosure risk methods are reviewed and summarized. These methods are finally applied to a popular real-world data set, the Structural Earnings Survey (SES) of Austria. It is shown that the application of a few selected anonymisation methods leads to well-protected anonymised data with high data utility and low information loss.

Clusterdv, a simple density-based clustering method that is robust, general and automatic

10.1101/224840 ◽

2017 ◽

Author(s):

João C. Marques ◽

Michael B. Orger

Keyword(s):

Clustering Algorithm ◽

Underlying Structure ◽

Data Sets ◽

Natural Phenomena ◽

Cluster Number ◽

Data Set ◽

Density Peaks ◽

Wide Range ◽

Cluster Shape ◽

Fully Automatic

Abstract How to partition a data set into a set of distinct clusters is a ubiquitous and challenging problem. The fact that data varies widely in features such as cluster shape, cluster number, density distribution, background noise, outliers and degree of overlap, makes it difficult to find a single algorithm that can be broadly applied. One recent method, clusterdp, based on search of density peaks, can be applied successfully to cluster many kinds of data, but it is not fully automatic, and fails on some simple data distributions. We propose an alternative approach, clusterdv, which estimates density dips between points, and allows robust determination of cluster number and distribution across a wide range of data, without any manual parameter adjustment. We show that this method is able to solve a range of synthetic and experimental data sets, where the underlying structure is known, and identifies consistent and meaningful clusters in new behavioral data.

Author summary It is common that natural phenomena produce groupings, or clusters, in data, that can reveal the underlying processes. However, the form of these clusters can vary arbitrarily, making it challenging to find a single algorithm that identifies their structure correctly, without prior knowledge of the number of groupings or their distribution. We describe a simple clustering algorithm that is fully automatic and is able to correctly identify the number and shape of groupings in data of many types. We expect this algorithm to be useful in finding unknown natural phenomena present in data from a wide range of scientific fields.
