Robust Self-Weighted Multi-View Projection Clustering

2020 ◽  
Vol 34 (04) ◽  
pp. 6110-6117
Author(s):  
Beilei Wang ◽  
Yun Xiao ◽  
Zhihui Li ◽  
Xuanhong Wang ◽  
Xiaojiang Chen ◽  
...  

Many real-world applications involve high-dimensional data collected from different views. Furthermore, multi-view data inevitably contains noise. Clustering this kind of high-dimensional, noisy multi-view data remains a challenge because of the curse of dimensionality and the ineffective de-noising and integration of multiple views. To address this problem, we propose Robust Self-weighted Multi-view Projection Clustering (RSwMPC), a method based on the ℓ2,1-norm that simultaneously reduces dimensionality, suppresses noise, and learns a local-structure graph. The resulting optimal graph can be used directly for clustering, with no further processing required. In addition, a new method is introduced to automatically learn the optimal weight of each view without introducing additional parameters to adjust the weights. Extensive experimental results on synthetic and real-world datasets demonstrate that the proposed algorithm outperforms other state-of-the-art methods in clustering performance and robustness.
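As a small illustration of the ℓ2,1-norm mentioned in the abstract above, the sketch below computes it for a toy residual matrix. It assumes the common convention that the ℓ2,1-norm is the sum of the Euclidean norms of the rows (one row per sample); the abstract does not state which convention RSwMPC uses, and the function names and data are hypothetical. The squared Frobenius norm is shown only for comparison.

```python
import math


def l21_norm(matrix):
    """Sum of the Euclidean norms of the rows (assumed row-wise convention)."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in matrix)


def squared_frobenius_norm(matrix):
    """Sum of squared entries, shown only as a point of comparison."""
    return sum(x * x for row in matrix for x in row)


# Hypothetical residual matrix: two clean samples and one corrupted sample.
residual = [
    [0.1, 0.0],
    [0.0, 0.1],
    [3.0, 4.0],  # corrupted row with Euclidean norm 5
]

l21_value = l21_norm(residual)
frobenius_sq_value = squared_frobenius_norm(residual)
```

Under this convention a corrupted row contributes its norm once rather than squared, which is one common reason ℓ2,1-based losses are described as less sensitive to outlying samples.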

2021 ◽  
Vol 7 ◽  
pp. e604
Author(s):  
Peter Gnip ◽  
Liberios Vokorokos ◽  
Peter Drotár

Challenges posed by imbalanced data are encountered in many real-world applications. One possible approach to improving classifier performance on imbalanced data is oversampling. In this paper, we propose a new selective oversampling approach (SOA) that first isolates the most representative samples from minority classes by using an outlier detection technique and then uses these samples for synthetic oversampling. We show that the proposed approach improves the performance of two state-of-the-art oversampling methods, namely the synthetic minority oversampling technique and adaptive synthetic sampling. The prediction performance is evaluated on four synthetic datasets and four real-world datasets, and the proposed SOA methods always achieved the same or better performance than the other existing oversampling methods considered.
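The two-stage idea described above (filter the minority class first, then oversample from what remains) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' SOA: the outlier detector is a simple z-score on distance to the class centroid, the oversampler is a bare nearest-neighbour interpolation in the spirit of SMOTE, and every function name, threshold, and data point is hypothetical.

```python
import math
import random
import statistics


def filter_outliers(samples, z_max=1.5):
    """Keep samples whose distance to the centroid has a z-score <= z_max.

    Stand-in for an outlier detector; the abstract does not name the one used.
    """
    dim = len(samples[0])
    centroid = [statistics.fmean(s[d] for s in samples) for d in range(dim)]
    dists = [math.dist(s, centroid) for s in samples]
    mu, sigma = statistics.fmean(dists), statistics.pstdev(dists)
    if sigma == 0:
        return list(samples)
    return [s for s, d in zip(samples, dists) if (d - mu) / sigma <= z_max]


def interpolate_oversample(samples, n_new, rng):
    """Create points on the segment between a sample and its nearest neighbour.

    Requires at least two samples.
    """
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        neighbour = min(
            (s for s in samples if s is not base),
            key=lambda s: math.dist(s, base),
        )
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority class with one obvious outlier at (9, 9).
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (9.0, 9.0)]
kept = filter_outliers(minority)
new_points = interpolate_oversample(kept, n_new=4, rng=random.Random(0))
```

Because the outlier is removed before interpolation, no synthetic point is generated on a segment leading toward it.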


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
João Lobo ◽  
Rui Henriques ◽  
Sara C. Madeira

Background: Three-way data have gained popularity owing to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions over time, urban dynamics, or complex geophysical phenomena. Triclustering, the subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount. These comparisons are usually performed using real data without a known ground truth, which limits the assessments. In this context, we propose a synthetic data generator, G-Tric, that allows the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real three-way data from biomedical and social data domains, with the additional advantage of providing the ground truth (triclustering solution) as output. Results: G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. Conclusions: Triclustering evaluation using G-Tric makes it possible to combine intrinsic and extrinsic metrics to compare solutions, yielding more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric's potential to advance the triclustering state of the art by easing the evaluation of new triclustering approaches.
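To make the idea of planting a tricluster with a known ground truth concrete, here is a minimal sketch. It is not G-Tric and does not use its interface: the function name, the uniform background, and the constant-valued pattern are all assumptions chosen for brevity.

```python
import random


def generate_with_planted_tricluster(n_obs, n_feat, n_ctx, tri_shape, value, rng):
    """Build an observations x features x contexts array and plant one tricluster.

    Background cells are uniform in [0, 1); planted cells all hold `value`.
    Returns the data and the ground-truth index sets of the planted tricluster.
    """
    data = [
        [[rng.random() for _ in range(n_ctx)] for _ in range(n_feat)]
        for _ in range(n_obs)
    ]
    rows = sorted(rng.sample(range(n_obs), tri_shape[0]))
    cols = sorted(rng.sample(range(n_feat), tri_shape[1]))
    ctxs = sorted(rng.sample(range(n_ctx), tri_shape[2]))
    for i in rows:
        for j in cols:
            for k in ctxs:
                data[i][j][k] = value
    truth = {"observations": rows, "features": cols, "contexts": ctxs}
    return data, truth


data, truth = generate_with_planted_tricluster(
    n_obs=10, n_feat=8, n_ctx=5, tri_shape=(3, 2, 2), value=5.0,
    rng=random.Random(42),
)
planted_cells = sum(cell == 5.0 for plane in data for row in plane for cell in row)
```

Returning the index sets alongside the data is what allows an extrinsic evaluation: a triclustering algorithm's output can be compared directly against `truth`.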


2021 ◽  
Vol 54 (6) ◽  
pp. 1-35
Author(s):  
Ninareh Mehrabi ◽  
Fred Morstatter ◽  
Nripsuta Saxena ◽  
Kristina Lerman ◽  
Aram Galstyan

With the widespread use of artificial intelligence (AI) systems and applications in our everyday lives, accounting for fairness has gained significant importance in the design and engineering of such systems. AI systems can be used in many sensitive environments to make important and life-changing decisions; thus, it is crucial to ensure that these decisions do not reflect discriminatory behavior toward certain groups or populations. More recently, work has been developed in traditional machine learning and deep learning that addresses such challenges in different subdomains. With the commercialization of these systems, researchers are becoming more aware of the biases that these applications can contain and are attempting to address them. In this survey, we investigated different real-world applications that have shown biases in various ways, and we listed different sources of bias that can affect AI applications. We then created a taxonomy of the fairness definitions that machine learning researchers have proposed to avoid the existing bias in AI systems. In addition, we examined different domains and subdomains in AI, showing what researchers have observed with regard to unfair outcomes in state-of-the-art methods and the ways they have tried to address them. There are still many future directions and solutions that can be taken to mitigate the problem of bias in AI systems. We hope that this survey will motivate researchers to tackle these issues in the near future by observing existing work in their respective fields.


2020 ◽  
Vol 68 ◽  
pp. 311-364
Author(s):  
Francesco Trovo ◽  
Stefano Paladino ◽  
Marcello Restelli ◽  
Nicola Gatti

Multi-Armed Bandit (MAB) techniques have been successfully applied to many classes of sequential decision problems in the past decades. However, non-stationary settings, which are very common in real-world applications, have received little attention so far, and theoretical guarantees on the regret are known only for some frequentist algorithms. In this paper, we propose an algorithm, namely Sliding-Window Thompson Sampling (SW-TS), for non-stationary stochastic MAB settings. Our algorithm is based on Thompson Sampling and exploits a sliding-window approach to tackle, in a unified fashion, two different forms of non-stationarity studied separately so far: abruptly changing and smoothly changing. In the former, the reward distributions are constant during sequences of rounds, and their change may be arbitrary and happen at unknown rounds, while in the latter, the reward distributions evolve smoothly over rounds according to unknown dynamics. Under mild assumptions, we provide upper bounds on the dynamic pseudo-regret of SW-TS for the abruptly changing environment, for the smoothly changing one, and for the setting in which both forms of non-stationarity are present. Furthermore, we empirically show that SW-TS dramatically outperforms state-of-the-art algorithms even when the forms of non-stationarity are taken separately, as previously studied in the literature.
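The sliding-window idea described above can be sketched for Bernoulli rewards as follows. This is an illustrative sketch under assumptions, not the paper's exact SW-TS: it assumes Beta(1, 1) priors, a window over the most recent rounds regardless of which arm was pulled, and a hypothetical two-arm environment whose means swap at round 500.

```python
import random
from collections import deque


def sliding_window_thompson(reward_fn, n_arms, horizon, window, rng):
    """Thompson Sampling whose posteriors use only the last `window` rounds."""
    history = deque(maxlen=window)  # (arm, reward) pairs; old rounds drop out
    pulls = []
    for t in range(horizon):
        successes = [0] * n_arms
        failures = [0] * n_arms
        for arm, reward in history:
            if reward:
                successes[arm] += 1
            else:
                failures[arm] += 1
        # One draw per arm from Beta(1 + successes, 1 + failures).
        draws = [
            rng.betavariate(1 + successes[a], 1 + failures[a])
            for a in range(n_arms)
        ]
        arm = max(range(n_arms), key=draws.__getitem__)
        history.append((arm, reward_fn(t, arm, rng)))
        pulls.append(arm)
    return pulls


def abrupt_change(t, arm, rng):
    """Hypothetical environment: the best arm switches from 0 to 1 at t = 500."""
    means = (0.9, 0.1) if t < 500 else (0.1, 0.9)
    return 1 if rng.random() < means[arm] else 0


pulls = sliding_window_thompson(
    abrupt_change, n_arms=2, horizon=1000, window=100, rng=random.Random(0)
)
```

Because observations older than the window are discarded, evidence gathered before the change stops influencing the posteriors after at most `window` rounds, at the cost of wider posteriors in stationary phases.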


2020 ◽  
Vol 34 (04) ◽  
pp. 6837-6844
Author(s):  
Xiaojin Zhang ◽  
Honglei Zhuang ◽  
Shengyu Zhang ◽  
Yuan Zhou

We study a variant of the thresholding bandit problem (TBP) in the context of outlier detection, where the objective is to identify the outliers whose rewards are above a threshold. Distinct from the traditional TBP, the threshold is defined as a function of the rewards of all the arms, which is motivated by the criterion for identifying outliers. The learner needs to explore the rewards of the arms as well as the threshold. We refer to this problem as "double exploration for outlier detection". We construct an adaptively updated confidence interval for the threshold, based on the estimated value of the threshold in the previous rounds. Furthermore, by automatically trading off exploring the individual arms and exploring the outlier threshold, we provide an efficient algorithm in terms of the sample complexity. Experimental results on both synthetic datasets and real-world datasets demonstrate the efficiency of our algorithm.
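To illustrate what it means for the threshold to be a function of the rewards of all arms, the sketch below flags arms whose estimated mean exceeds the mean of all estimates plus k standard deviations. The k-sigma rule is an assumption made for illustration (the abstract does not state the exact threshold function), the adaptive sampling and confidence intervals of the actual algorithm are omitted, and the names and numbers are hypothetical.

```python
import statistics


def outlier_arms(estimated_means, k=2.0):
    """Return (threshold, indices of arms whose estimate exceeds it).

    The threshold depends on every arm: mean of the estimates plus k times
    their population standard deviation (an assumed k-sigma rule).
    """
    mu = statistics.fmean(estimated_means)
    sigma = statistics.pstdev(estimated_means)
    threshold = mu + k * sigma
    flagged = [i for i, m in enumerate(estimated_means) if m > threshold]
    return threshold, flagged


# Hypothetical empirical means for ten arms; the last arm stands out.
means = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13, 0.08, 0.11, 0.10, 0.95]
threshold, flagged = outlier_arms(means)
```

Since every estimate enters the threshold, uncertainty about any single arm also makes the threshold uncertain, which is why the learner has to explore both.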


2020 ◽  
Vol 34 (01) ◽  
pp. 19-26 ◽  
Author(s):  
Chong Chen ◽  
Min Zhang ◽  
Yongfeng Zhang ◽  
Weizhi Ma ◽  
Yiqun Liu ◽  
...  

Recent studies on recommendation have largely focused on exploring state-of-the-art neural networks to improve the expressiveness of models, while typically applying the Negative Sampling (NS) strategy for efficient learning. Despite its effectiveness, two important issues have not been well considered in existing methods: 1) NS suffers from dramatic fluctuation, making it difficult for sampling-based methods to achieve optimal ranking performance in practical applications; 2) although heterogeneous feedback (e.g., view, click, and purchase) is widespread in many online systems, most existing methods leverage only one primary type of user feedback, such as purchase. In this work, we propose a novel non-sampling transfer learning solution, named Efficient Heterogeneous Collaborative Filtering (EHCF), for Top-N recommendation. It can not only model fine-grained user-item relations, but also efficiently learn model parameters from the whole heterogeneous data (including all unlabeled data) with a rather low time complexity. Extensive experiments on three real-world datasets show that EHCF significantly outperforms state-of-the-art recommendation methods in both traditional (single-behavior) and heterogeneous scenarios. Moreover, EHCF shows significant improvements in training efficiency, making it more applicable to real-world large-scale systems. Our implementation has been released to facilitate further developments on efficient whole-data based neural methods.


Entropy ◽  
2020 ◽  
Vol 22 (4) ◽  
pp. 407 ◽  
Author(s):  
Dominik Weikert ◽  
Sebastian Mai ◽  
Sanaz Mostaghim

In this article, we present a new algorithm called Particle Swarm Contour Search (PSCS), a Particle Swarm Optimisation-inspired algorithm for finding object contours in 2D environments. Currently, most contour-finding algorithms are based on image processing and require a complete overview of the search space in which the contour is to be found. However, for real-world applications this would require complete knowledge of the search space, which may not always be feasible or possible. The proposed algorithm removes this requirement and relies only on the local information of the particles to accurately identify a contour. Particles search for the contour of an object and then traverse alongside it, using their known information about positions inside and outside the object. Our experiments show that the proposed PSCS algorithm delivers results comparable to the state of the art.


2008 ◽  
Vol 8 (5-6) ◽  
pp. 545-580 ◽  
Author(s):  
WOLFGANG FABER ◽  
GERALD PFEIFER ◽  
NICOLA LEONE ◽  
TINA DELL'ARMI ◽  
GIUSEPPE IELPA

Abstract: Disjunctive logic programming (DLP) is a very expressive formalism. It allows for expressing every property of finite structures that is decidable in the complexity class Σ^P_2 (= NP^NP). Despite this high expressiveness, there are some simple properties, often arising in real-world applications, which cannot be encoded in a simple and natural manner. In particular, properties that require the use of arithmetic operators (like sum, times, or count) on a set or multiset of elements that satisfy some conditions cannot be naturally expressed in classic DLP. To overcome this deficiency, we extend DLP by aggregate functions in a conservative way. In particular, we avoid the introduction of constructs with disputed semantics by requiring aggregates to be stratified. We formally define the semantics of the extended language (called ), and illustrate how it can be profitably used for representing knowledge. Furthermore, we analyze the computational complexity of , showing that the addition of aggregates does not bring a higher cost in that respect. Finally, we provide an implementation of in DLV, a state-of-the-art DLP system, and report on experiments which confirm the usefulness of the proposed extension also for the efficiency of computation.


Author(s):  
Seiki Ubukata ◽  
Sho Sekiya ◽  
Akira Notsu ◽  
Katsuhiro Honda

In the field of cluster analysis, rough set-based extensions of hard C-means (HCM; k-means), including rough C-means (RCM), rough set C-means (RSCM), and rough membership C-means (RMCM), are promising approaches for dealing with the certainty, possibility, and uncertainty of objects belonging to clusters. Since C-means-type methods are strongly affected by noise, noise clustering approaches have been proposed. In noise clustering approaches, noise objects, which are far from any cluster center, are rejected for robust estimation. In this paper, we introduce noise rejection approaches for rough set-based C-means based on probabilistic memberships and propose noise RCM with membership normalization (NRCM-MN), noise RSCM with membership normalization (NRSCM-MN), and noise RMCM (NRMCM). In addition, the cluster boundaries produced by the proposed methods are visualized on the two-dimensional plane to confirm the characteristics of each method. Furthermore, the clustering performance is verified by numerical experiments using real-world datasets.
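The noise rejection step described above, in which objects far from every cluster center are excluded from the center estimates, can be sketched for plain hard C-means as follows. This is not one of the proposed rough set-based variants: the fixed `noise_distance` parameter, the label -1 for rejected objects, and the data are assumptions made for illustration.

```python
import math


def assign_with_noise_rejection(points, centers, noise_distance):
    """Assign each point to its nearest center, or to noise (-1) if too far."""
    labels = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        nearest = min(range(len(centers)), key=dists.__getitem__)
        labels.append(nearest if dists[nearest] <= noise_distance else -1)
    return labels


def update_centers(points, labels, centers):
    """Recompute each center from its members only; noise objects are ignored."""
    updated = []
    for idx, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == idx]
        if not members:
            updated.append(old)  # keep an empty cluster's center unchanged
        else:
            updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return updated


# Hypothetical data: two compact groups and one distant noise object.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
centers = [(0.0, 0.5), (5.0, 5.5)]
labels = assign_with_noise_rejection(points, centers, noise_distance=3.0)
new_centers = update_centers(points, labels, centers)
```

Without rejection, the object at (20, 20) would be assigned to the second cluster and pull its center away from the two nearby points.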


2018 ◽  
Author(s):  
Aditi Kathpalia ◽  
Nithin Nagaraj

Causality testing methods are widely used in various disciplines of science. Model-free methods for causality estimation are very useful, as the underlying model generating the data is often unknown. However, existing model-free measures assume separability of cause and effect at the level of individual samples of measurements and, unlike model-based methods, do not perform any intervention to learn causal relationships. These measures can thus capture only causality that arises from the associational occurrence of ‘cause’ and ‘effect’ between well-separated samples. In real-world processes, ‘cause’ and ‘effect’ are often inherently inseparable or become inseparable in the acquired measurements. We propose a novel measure that uses an adaptive interventional scheme to capture causality which is not merely associational. The scheme is based on characterizing complexities associated with the dynamical evolution of processes on short windows of measurements. The formulated measure, Compression-Complexity Causality, is rigorously tested on simulated and real datasets, and its performance is compared with that of existing measures such as Granger Causality and Transfer Entropy. The proposed measure is robust to the presence of noise, long-term memory, filtering and decimation, low temporal resolution (including aliasing), non-uniform sampling, finite-length signals, and the presence of common driving variables. Our measure outperforms existing state-of-the-art measures, establishing itself as an effective tool for causality testing in real-world applications.

