Self-weighted Multiview Clustering with Multiple Graphs

In multiview learning, it is essential to assign a reasonable weight to each view according to its importance. Thus, for the multiview clustering task, a wise and elegant method should cluster multiview data while learning the view weights. In this paper, we address this problem by exploring a Laplacian rank constrained graph, which can be approximately regarded as the centroid of the graphs built for each view, each with a different confidence. We start our work with the natural thought that the weights can be learned by introducing a hyperparameter. By analyzing the weakness of this approach, we further propose a new multiview clustering method that is totally self-weighted. Furthermore, once the target graph is obtained in our models, we can directly assign a cluster label to each data point without any postprocessing such as $K$-means in standard spectral clustering. Evaluations on two synthetic datasets demonstrate the effectiveness of our methods. Compared with several representative graph-based multiview clustering approaches on four real-world datasets, the proposed methods achieve better performance, and our new clustering method is more practical to use.
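
The claim that labels can be read directly off the target graph rests on a standard fact: a graph whose Laplacian has rank n − c has exactly c connected components. The following is a minimal Python sketch of that final labeling step only, assuming the learned graph is already available as a symmetric nonnegative weight matrix. The function name `labels_from_graph`, the tolerance `tol`, and the toy matrix are illustrative assumptions, not taken from the paper.

```python
from collections import deque

def labels_from_graph(weights, tol=1e-12):
    """Label each node by the connected component it belongs to.

    `weights` is a symmetric n x n list of lists; entries above `tol`
    are treated as edges. Illustrative helper only; it does not perform
    the paper's graph learning.
    """
    n = len(weights)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # Breadth-first search marks everything reachable from `start`.
        labels[start] = current
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in range(n):
                if labels[j] == -1 and weights[i][j] > tol:
                    labels[j] = current
                    queue.append(j)
        current += 1
    return labels

# Toy graph with two components: {0, 1, 2} and {3, 4}.
toy = [
    [0.0, 0.6, 0.4, 0.0, 0.0],
    [0.6, 0.0, 0.4, 0.0, 0.0],
    [0.4, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
print(labels_from_graph(toy))  # prints [0, 0, 0, 1, 1]
```

Because the components are the clusters, no $K$-means step on spectral embeddings is needed once the graph has the right number of components.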

Adaptive Double-Exploration Tradeoff for Outlier Detection

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6164 ◽

2020 ◽

Vol 34 (04) ◽

pp. 6837-6844

Author(s):

Xiaojin Zhang ◽

Honglei Zhuang ◽

Shengyu Zhang ◽

Yuan Zhou

Keyword(s):

Confidence Interval ◽

Outlier Detection ◽

Real World ◽

Efficient Algorithm ◽

Experimental Results ◽

Sample Complexity ◽

Bandit Problem ◽

Real World Datasets ◽

Synthetic Datasets ◽

The Individual

We study a variant of the thresholding bandit problem (TBP) in the context of outlier detection, where the objective is to identify the outliers whose rewards are above a threshold. Distinct from the traditional TBP, the threshold is defined as a function of the rewards of all the arms, which is motivated by the criterion for identifying outliers. The learner needs to explore the rewards of the arms as well as the threshold. We refer to this problem as "double exploration for outlier detection". We construct an adaptively updated confidence interval for the threshold, based on the estimated value of the threshold in the previous rounds. Furthermore, by automatically trading off exploring the individual arms and exploring the outlier threshold, we provide an efficient algorithm in terms of the sample complexity. Experimental results on both synthetic datasets and real-world datasets demonstrate the efficiency of our algorithm.
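
To make the "double exploration" concrete: because the threshold depends on every arm, uncertainty about any single arm also makes the threshold uncertain. The abstract does not state the threshold function, so the sketch below assumes a common outlier rule, the mean of the arms' estimated rewards plus k standard deviations. The function name `estimate_outliers`, the parameter `k`, and the sample data are hypothetical, and the adaptive sampling strategy itself is not shown.

```python
import statistics

def estimate_outliers(samples_per_arm, k=2.0):
    """Estimate the threshold from all arms, then flag arms above it.

    Assumed rule (not from the abstract): threshold = mean of the
    per-arm mean rewards + k * their population standard deviation.
    """
    means = [statistics.fmean(s) for s in samples_per_arm]
    threshold = statistics.fmean(means) + k * statistics.pstdev(means)
    outliers = [i for i, m in enumerate(means) if m > threshold]
    return threshold, outliers

# Nine ordinary arms with empirical mean 0.1 and one arm with mean 0.9.
samples = [[0.0, 0.2] for _ in range(9)] + [[0.8, 1.0]]
threshold, outliers = estimate_outliers(samples)
print(round(threshold, 2), outliers)  # prints 0.66 [9]
```

Pulling any arm again changes its estimated mean and therefore moves the threshold as well, which is why a learner has to balance sampling individual arms against sharpening the threshold estimate.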

OFCOD: On the Fly Clustering Based Outlier Detection Framework

Data ◽

10.3390/data6010001 ◽

2020 ◽

Vol 6 (1) ◽

pp. 1

Author(s):

Ahmed Elmogy ◽

Hamada Rizk ◽

Amany M. Sarhan

Keyword(s):

Data Mining ◽

Image Processing ◽

Intrusion Detection ◽

Real Time ◽

Outlier Detection ◽

Real World ◽

Medical Data ◽

Experimental Results ◽

Real Time Applications ◽

Real World Datasets

In data mining, outlier detection is a major challenge, as it plays an important role in many applications such as medical data, image processing, fraud detection, and intrusion detection. A wide variety of clustering-based approaches have been developed to detect outliers. However, they are by nature time-consuming, which restricts their use in real-time applications. Furthermore, outlier detection requests are handled one at a time, which means that each request is initiated individually with a particular set of parameters. In this paper, the first clustering-based outlier detection framework, On the Fly Clustering Based Outlier Detection (OFCOD), is presented. OFCOD enables analysts to find outliers effectively, in time and on request, even within huge datasets. The proposed framework has been tested and evaluated using two real-world datasets with different features and applications: one with 699 records and another with five million records. The experimental results show that the proposed framework outperforms existing approaches on several evaluation metrics.

Review Summary Generation in Online Systems: Frameworks for Supervised and Unsupervised Scenarios

ACM Transactions on the Web ◽

10.1145/3448015 ◽

2021 ◽

Vol 15 (3) ◽

pp. 1-33

Author(s):

Wenjun Jiang ◽

Jing Chen ◽

Xiaofei Ding ◽

Jie Wu ◽

Jiawei He ◽

...

Keyword(s):

Decision Making ◽

Real World ◽

Text Summarization ◽

Experimental Results ◽

Product Review ◽

Comprehensive Review ◽

Online Systems ◽

Real World Datasets ◽

Different Characteristics

In online systems, including e-commerce platforms, many users rely on the reviews or comments written by previous consumers for decision making, although they have limited time to read many reviews. Therefore, a review summary that contains all the important features of user-generated reviews is desirable. In this article, we study “how to generate a comprehensive review summary from a large number of user-generated reviews.” This can be implemented by text summarization, which mainly comprises two types of approaches, extractive and abstractive. Both approaches can deal with both supervised and unsupervised scenarios, but the former may generate redundant and incoherent summaries, while the latter can avoid redundancy but usually can only deal with short sequences. Moreover, both approaches may neglect sentiment information. To address these issues, we propose comprehensive Review Summary Generation frameworks for the supervised and unsupervised scenarios. We design two different preprocessing models, re-ranking and selecting, to identify the important sentences while preserving users’ sentiment in the original reviews. These sentences can then be used to generate review summaries with text summarization methods. Experimental results on seven real-world datasets (Idebate, Rotten Tomatoes, Amazon, Yelp, and three unlabelled product review datasets from Amazon) demonstrate that our work performs well in review summary generation. Moreover, the re-ranking and selecting models show different characteristics.

SOCIAL INTEREST FOR USER SELECTING ITEMS IN RECOMMENDER SYSTEMS

International Journal of Modern Physics C ◽

10.1142/s0129183113500228 ◽

2013 ◽

Vol 24 (04) ◽

pp. 1350022 ◽

Cited By ~ 7

Author(s):

DA-CHENG NIE ◽

MING-JING DING ◽

YAN FU ◽

JUN-LIN ZHOU ◽

ZI-KE ZHANG

Keyword(s):

Recommender Systems ◽

Real World ◽

Social Interest ◽

Experimental Results ◽

Simple Method ◽

The Social ◽

Social Interests ◽

Similarity Computation ◽

Real World Datasets

Recommender systems have developed rapidly and successfully. Such a system aims to help users find relevant items from a potentially overwhelming set of choices. However, most existing recommender algorithms focus on traditional user-item similarity computation rather than incorporating social interests into the recommender system. Each user has their own field of preference, and users may influence their friends' preferences in their field of expertise when the social interest in their friends' item collecting is considered. To model this social interest, in this paper we propose a simple method to compute users' social interest in specific items in recommender systems, and then integrate this social interest with similarity preference. Experimental results on two real-world datasets, Epinions and Friendfeed, show that this method can significantly improve not only the algorithmic precision-accuracy but also the diversity-accuracy.

A signal-diffusion-based spectral clustering method for community detection

International Journal of Wavelets Multiresolution and Information Processing ◽

10.1142/s0219691319410194 ◽

2019 ◽

Vol 18 (01) ◽

pp. 1941019

Author(s):

Zheng Qiong

Keyword(s):

Community Detection ◽

Real World ◽

Adjacency Matrix ◽

Spectral Clustering ◽

Time Complexity ◽

Detection Method ◽

Signal Transmission ◽

Transmission Mechanism ◽

Clustering Method ◽

Spectral Clustering Method

Because the traditional spectral community detection method clusters on the adjacency matrix, which can reduce accuracy, we propose a signal-diffusion-based spectral clustering method for community detection. This method solves the problem of an unfixed total signal when using the signal transmission mechanism, optimizes the time complexity of the algorithm, and improves the performance of spectral clustering by constructing the Laplacian based on signal diffusion. Experiments show that the method achieves better performance on real-world networks and on the Lancichinetti–Fortunato–Radicchi (LFR) benchmark.

Even-Sized Clustering Based on Optimization and its Variants

Journal of Advanced Computational Intelligence and Intelligent Informatics ◽

10.20965/jaciii.2018.p0062 ◽

2018 ◽

Vol 22 (1) ◽

pp. 62-69 ◽

Cited By ~ 3

Author(s):

Yasunori Endo ◽

Yukihiro Hamasuna ◽

Tsubasa Hirano ◽

Naohiko Kinoshita ◽

...

Keyword(s):

Linear Programming ◽

Simplex Method ◽

Benchmark Dataset ◽

Experimental Results ◽

Clustering Method ◽

Initial Value ◽

Cluster Number ◽

Clustering Problem ◽

Dataset Size ◽

Synthetic Datasets

A clustering method referred to as K-member clustering classifies a dataset into certain clusters, the size of which is more than a given constant K. Even-sized clustering, which classifies a dataset into even-sized clusters, is also considered along with K-member clustering. In our previous study, we proposed Even-sized Clustering Based on Optimization (ECBO) to output adequate results by formulating an even-sized clustering problem as linear programming. The simplex method is used to calculate the belongingness of each object to clusters in ECBO. In this study, ECBO is extended by introducing ideas that were introduced in K-means or fuzzy c-means to resolve problems of initial-value dependence, robustness against outliers, calculation costs, and nonlinear boundaries of clusters. We also reconsider the relation between the dataset size, the cluster number, and K in ECBO. Moreover, we verify the effectiveness of the variants of ECBO based on experimental results using synthetic datasets and a benchmark dataset.
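
For orientation, one plausible form of such a linear program is sketched below. It is an assumption for illustration and is not taken from the abstract: $x_k$ denotes the $n$ objects, $v_i$ the $c$ cluster centers held fixed during the assignment step, $u_{ki}$ the belongingness of object $k$ to cluster $i$, and the size bounds $K$ and $K+1$ are one possible way to encode "even-sized".

```latex
\begin{aligned}
\min_{u}\quad & \sum_{k=1}^{n}\sum_{i=1}^{c} u_{ki}\,\lVert x_k - v_i\rVert^2 \\
\text{s.t.}\quad & \sum_{i=1}^{c} u_{ki} = 1 \quad (k = 1,\dots,n), \\
& K \le \sum_{k=1}^{n} u_{ki} \le K + 1 \quad (i = 1,\dots,c), \\
& u_{ki} \ge 0 .
\end{aligned}
```

With the centers fixed, the objective and all constraints are linear in $u$, which is what makes a solver such as the simplex method applicable to the assignment step.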

Fast and de-noise support vector machine training method based on fuzzy clustering method for large real world datasets

TURKISH JOURNAL OF ELECTRICAL ENGINEERING & COMPUTER SCIENCES ◽

10.3906/elk-1304-139 ◽

2016 ◽

Vol 24 ◽

pp. 219-233 ◽

Cited By ~ 12

Author(s):

Omid Naghash ALMASI ◽

Modjtaba ROUHANI

Keyword(s):

Support Vector Machine ◽

Fuzzy Clustering ◽

Real World ◽

Support Vector ◽

Training Method ◽

Clustering Method ◽

Support Vector Machine Training ◽

Fuzzy Clustering Method ◽

Real World Datasets

Large-Scale Heterogeneous Feature Embedding

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33013878 ◽

2019 ◽

Vol 33 ◽

pp. 3878-3885 ◽

Cited By ~ 5

Author(s):

Xiao Huang ◽

Qingquan Song ◽

Fan Yang ◽

Xia Hu

Keyword(s):

Real World ◽

Large Scale ◽

Single Type ◽

Heterogeneous Information ◽

Multiview Learning ◽

Efficiency And Effectiveness ◽

Joint Embedding ◽

Real World Datasets ◽

Low Dimensional ◽

Vector Representations

Feature embedding aims to learn a low-dimensional vector representation for each instance to preserve the information in its features. These representations can benefit various off-the-shelf learning algorithms. While embedding models for a single type of features have been well studied, real-world instances often contain multiple types of correlated features or even information in a different modality such as networks. Existing studies such as multiview learning show that it is promising to learn unified vector representations from all sources. However, the high computational cost of incorporating heterogeneous information limits the applications of existing algorithms, since the number of instances and the dimensions of features are often large in practice. To bridge the gap, we propose a scalable framework, FeatWalk, which can model and incorporate instance similarities in terms of different types of features into a unified embedding representation. To enable scalability, FeatWalk does not directly calculate any similarity measure, but provides an alternative way to simulate similarity-based random walks among instances to extract the local instance proximity and preserve it in a set of instance index sequences. These sequences are homogeneous with each other. A scalable word embedding algorithm is applied to them to learn a joint embedding representation of instances. Experiments on four real-world datasets demonstrate the efficiency and effectiveness of FeatWalk.
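
One way to walk between similar instances without ever forming the n × n similarity matrix is to hop from an instance to one of its features and then to another instance that holds that feature. The Python sketch below illustrates that idea under stated assumptions; the abstract does not describe the walk rule, so the hop scheme, the function names `build_feature_index` and `feature_walk`, and the toy data are all hypothetical.

```python
import random

def build_feature_index(features):
    """Map each feature id to the (instance, weight) pairs that hold it."""
    index = {}
    for i, row in enumerate(features):
        for f, w in row.items():
            if w > 0:
                index.setdefault(f, []).append((i, w))
    return index

def feature_walk(features, index, start, length, rng):
    """Random walk over instances that hops through shared features.

    Assumed rule: pick a feature of the current instance in proportion
    to its weight, then pick an instance holding that feature in
    proportion to its weight for that feature.
    """
    walk = [start]
    current = start
    while len(walk) < length:
        row = [(f, w) for f, w in features[current].items() if w > 0]
        if not row:
            break  # an instance with no features ends the walk early
        f = rng.choices([f for f, _ in row], weights=[w for _, w in row])[0]
        holders = index[f]
        current = rng.choices(
            [i for i, _ in holders], weights=[w for _, w in holders]
        )[0]
        walk.append(current)
    return walk

# Instances 0 and 1 share feature "a"; instances 2 and 3 share feature "c".
features = [
    {"a": 1.0, "b": 0.5},
    {"a": 0.8},
    {"c": 1.0},
    {"c": 0.7, "d": 0.2},
]
index = build_feature_index(features)
walk = feature_walk(features, index, start=0, length=6, rng=random.Random(0))
print(walk)
```

Each walk is a sequence of instance indices, so a collection of such walks can be passed to a word embedding algorithm in the same way sentences are.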

Spectral Clustering with Local Projection Distance Measurement

Mathematical Problems in Engineering ◽

10.1155/2015/829514 ◽

2015 ◽

Vol 2015 ◽

pp. 1-13 ◽

Cited By ~ 1

Author(s):

Chen Diao ◽

Ai-Hua Zhang ◽

Bin Wang

Keyword(s):

Spatial Structure ◽

Spectral Clustering ◽

High Performance ◽

Distance Measure ◽

Affinity Matrix ◽

Local Projection ◽

Straight Line ◽

Projection Distance ◽

Real World Datasets ◽

Synthetic Datasets

Constructing a rational affinity matrix is crucial for spectral clustering. In this paper, a novel spectral clustering method based on a local projection distance measure (LPDM) is proposed. In this method, the Local-Projection-Neighborhood (LPN) is defined as a region between a pair of data points, and the other data points in the LPN are projected onto the straight line between the pair. Using the Euclidean distance between the projected points, the local spatial structure of the data can be detected to measure the similarity of objects. The affinity matrix is then obtained from a new similarity measure, which can squeeze or widen the projective distance according to the spatial structure of the data. Experimental results show that the LPDM algorithm achieves desirable results with high performance on synthetic datasets, real-world datasets, and images.
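
The projection step can be illustrated with ordinary orthogonal projection onto the line through a data pair. This is a minimal sketch of that geometric operation only: the abstract does not define the shape of the LPN or the final similarity formula, so neither is modeled here, and the function names are illustrative.

```python
import math

def project_onto_line(p, a, b):
    """Orthogonally project point p onto the line through a and b.

    Returns (t, q) with q = a + t * (b - a); 0 <= t <= 1 means q lies
    on the segment between a and b.
    """
    direction = [bi - ai for ai, bi in zip(a, b)]
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        raise ValueError("a and b must be distinct points")
    t = sum((pi - ai) * d for pi, ai, d in zip(p, a, direction)) / norm_sq
    q = tuple(ai + t * d for ai, d in zip(a, direction))
    return t, q

def projected_distance(p1, p2, a, b):
    """Euclidean distance between the projections of p1 and p2."""
    _, q1 = project_onto_line(p1, a, b)
    _, q2 = project_onto_line(p2, a, b)
    return math.dist(q1, q2)

a, b = (0.0, 0.0), (2.0, 0.0)
print(project_onto_line((1.0, 3.0), a, b))               # prints (0.5, (1.0, 0.0))
print(projected_distance((0.5, 1.0), (1.5, -2.0), a, b))  # prints 1.0
```

Gaps between consecutive projected points along the segment indicate whether the space between the pair is densely or sparsely populated, which is the kind of local structure the abstract refers to.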

Selective oversampling approach for strongly imbalanced data

PeerJ Computer Science ◽

10.7717/peerj-cs.604 ◽

2021 ◽

Vol 7 ◽

pp. e604

Author(s):

Peter Gnip ◽

Liberios Vokorokos ◽

Peter Drotár

Keyword(s):

Outlier Detection ◽

Real World ◽

State Of The Art ◽

Imbalanced Data ◽

Prediction Performance ◽

Classifier Performance ◽

Real World Applications ◽

Real World Datasets ◽

Synthetic Datasets ◽

Representative Samples

Challenges posed by imbalanced data are encountered in many real-world applications. One possible approach to improving classifier performance on imbalanced data is oversampling. In this paper, we propose a new selective oversampling approach (SOA) that first isolates the most representative samples from minority classes by using an outlier detection technique and then uses these samples for synthetic oversampling. We show that the proposed approach improves the performance of two state-of-the-art oversampling methods, namely, the synthetic minority oversampling technique and adaptive synthetic sampling. The prediction performance is evaluated on four synthetic datasets and four real-world datasets, and the proposed SOA methods always achieved the same or better performance than the other oversampling methods considered.
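
The two-stage idea, filter first and interpolate second, can be sketched in a few lines of Python. The abstract does not name the outlier detection technique, so the sketch assumes a simple z-score on distance to the minority centroid, followed by a simplified SMOTE-style interpolation. The function names and the parameters `z_max` and `k` are hypothetical, and this is not the authors' implementation.

```python
import math
import random

def select_representative(points, z_max=1.5):
    """Keep minority samples whose centroid distance is not extreme.

    Assumed detector: z-score of the distance to the minority centroid.
    """
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    dists = [math.dist(p, centroid) for p in points]
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    if std == 0:
        return list(points)
    return [p for p, d in zip(points, dists) if (d - mean) / std <= z_max]

def oversample(points, n_new, k, rng):
    """Interpolate between a sample and one of its k nearest neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        neighbours = sorted(
            (p for p in points if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic

# Four clustered minority samples and one far-away point.
minority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
kept = select_representative(minority)
new_points = oversample(kept, n_new=3, k=2, rng=random.Random(0))
print(kept)
```

Filtering before interpolation keeps the far-away point from seeding synthetic samples, so every generated point in this toy example stays inside the unit square spanned by the retained samples.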
