Semi-Supervised Outlier Detection with Only Positive and Unlabeled Data Based on Fuzzy Clustering

The task of semi-supervised outlier detection is to find the instances that deviate from the rest of the data, using some labeled examples. This task is especially important in applications such as fraud detection and intrusion detection. Most existing techniques are unsupervised, while semi-supervised approaches use both negative and positive instances to detect outliers. However, in many real-world applications, very few positive labeled examples are available. This paper proposes an approach to address this problem. The proposed method works as follows. First, some reliable negative instances are extracted by a kNN-based algorithm. Afterwards, fuzzy clustering using both negative and positive examples is utilized to detect outliers. Experimental results on real data sets demonstrate that the proposed approach outperforms the previous unsupervised state-of-the-art methods in detecting outliers.
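The first step, extracting reliable negatives from unlabeled data, might look roughly like the sketch below. The abstract does not specify the kNN rule, so this assumes one plausible choice: an unlabeled point counts as a reliable negative when the distance to its k-th nearest neighbour is among the smallest, meaning it sits in a dense region. The function names and the `k` and `keep` parameters are hypothetical, and the fuzzy clustering step is not shown.

```python
import math


def kth_neighbour_distance(point, others, k):
    """Distance from `point` to its k-th nearest neighbour among `others`."""
    return sorted(math.dist(point, o) for o in others)[k - 1]


def reliable_negatives(unlabeled, k=2, keep=0.5):
    """Indices of the `keep` fraction of unlabeled points in the densest regions.

    Assumption: a small k-NN distance means "probably normal". This is an
    illustrative rule, not necessarily the one used in the paper.
    """
    scored = []
    for i, p in enumerate(unlabeled):
        others = unlabeled[:i] + unlabeled[i + 1:]
        scored.append((kth_neighbour_distance(p, others, k), i))
    scored.sort()  # smallest k-NN distance first
    n_keep = max(1, int(len(unlabeled) * keep))
    return sorted(i for _, i in scored[:n_keep])


# Four tightly grouped points plus two isolated ones (made-up data).
unlabeled = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0), (9.0, -3.0)]
negatives = reliable_negatives(unlabeled, k=2, keep=0.5)
```

With this toy data the three selected indices all come from the tight group, and the two isolated points are never chosen.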

Restart and Random Walk in Local Search for Maximum Vertex Weight Cliques with Evaluations in Clustering Aggregation

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/87 ◽

2017 ◽

Cited By ~ 4

Author(s):

Yi Fan ◽

Nan Li ◽

Chengqian Li ◽

Zongjie Ma ◽

Longin Jan Latecki ◽

...

Keyword(s):

Random Walk ◽

Local Search ◽

Real World ◽

State Of The Art ◽

Real Data ◽

Experimental Results ◽

Data Sets ◽

Vertex Weight ◽

Real World Applications ◽

Clustering Aggregation

The Maximum Vertex Weight Clique (MVWC) problem is NP-hard and also important in real-world applications. In this paper we propose to use the restart and the random walk strategies to improve local search for MVWC. If a solution is revisited in some particular situation, the search will restart. In addition, when the local search has no other options except dropping vertices, it will use random walk. Experimental results show that our solver outperforms state-of-the-art solvers on DIMACS and finds a new best-known solution. It is also the only solver that is comparable with state-of-the-art methods on both BHOSLIB and large crafted graphs. Furthermore, we evaluated our solver in clustering aggregation. Experimental results on a number of real data sets demonstrate that our solver outperforms the state of the art for solving the derived MVWC problem and helps improve the final clustering results.

OFCOD: On the Fly Clustering Based Outlier Detection Framework

Data ◽

10.3390/data6010001 ◽

2020 ◽

Vol 6 (1) ◽

pp. 1

Author(s):

Ahmed Elmogy ◽

Hamada Rizk ◽

Amany M. Sarhan

Keyword(s):

Data Mining ◽

Image Processing ◽

Intrusion Detection ◽

Real Time ◽

Outlier Detection ◽

Real World ◽

Medical Data ◽

Experimental Results ◽

Real Time Applications ◽

Real World Datasets

In data mining, outlier detection is a major challenge as it has an important role in many applications such as medical data, image processing, fraud detection, intrusion detection, and so forth. A wide variety of clustering-based approaches have been developed to detect outliers. However, they are by nature time consuming, which restricts their use in real-time applications. Furthermore, outlier detection requests are handled one at a time, which means that each request is initiated individually with a particular set of parameters. In this paper, the first clustering-based outlier detection framework, On the Fly Clustering Based Outlier Detection (OFCOD), is presented. OFCOD enables analysts to find outliers effectively and in time, on request, even within huge datasets. The proposed framework has been tested and evaluated using two real-world datasets with different features and applications: one with 699 records, and another with five million records. The experimental results show that the proposed framework outperforms other existing approaches across several evaluation metrics.

Selective oversampling approach for strongly imbalanced data

PeerJ Computer Science ◽

10.7717/peerj-cs.604 ◽

2021 ◽

Vol 7 ◽

pp. e604

Author(s):

Peter Gnip ◽

Liberios Vokorokos ◽

Peter Drotár

Keyword(s):

Outlier Detection ◽

Real World ◽

State Of The Art ◽

Imbalanced Data ◽

Prediction Performance ◽

Classifier Performance ◽

Real World Applications ◽

Real World Datasets ◽

Synthetic Datasets ◽

Representative Samples

Challenges posed by imbalanced data are encountered in many real-world applications. One of the possible approaches to improve the classifier performance on imbalanced data is oversampling. In this paper, we propose the new selective oversampling approach (SOA) that first isolates the most representative samples from minority classes by using an outlier detection technique and then utilizes these samples for synthetic oversampling. We show that the proposed approach improves the performance of two state-of-the-art oversampling methods, namely, the synthetic minority oversampling technique and adaptive synthetic sampling. The prediction performance is evaluated on four synthetic datasets and four real-world datasets, and the proposed SOA methods always achieved the same or better performance than other considered existing oversampling methods.
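The two-stage idea, filtering the minority class and then oversampling only from what remains, can be sketched as follows. The abstract does not name the outlier detection technique, so this assumes a simple stand-in: drop the minority points with the largest mean distance to the others. The oversampling step is a SMOTE-like interpolation written from scratch, not the reference SMOTE or ADASYN implementation. All function and parameter names are hypothetical.

```python
import math
import random


def select_representatives(minority, drop_fraction=0.2):
    """Keep the minority points with the smallest mean distance to the rest.

    Assumption: mean pairwise distance serves as the outlier score.
    """
    scored = []
    for i, p in enumerate(minority):
        others = minority[:i] + minority[i + 1:]
        score = sum(math.dist(p, o) for o in others) / len(others)
        scored.append((score, i))
    scored.sort()
    n_keep = len(minority) - int(len(minority) * drop_fraction)
    return [minority[i] for i in sorted(i for _, i in scored[:n_keep])]


def smote_like(points, n_new, k, rng):
    """Create n_new synthetic points by interpolating towards a near neighbour."""
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(points)
        neighbours = sorted((p for p in points if p is not a),
                            key=lambda p: math.dist(a, p))[:k]
        b = rng.choice(neighbours)
        t = rng.random()  # position along the segment from a to b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# Made-up minority class with one obvious stray sample at (8, 8).
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (8.0, 8.0)]
representatives = select_representatives(minority, drop_fraction=0.2)
synthetic = smote_like(representatives, n_new=6, k=2, rng=random.Random(0))
```

Because the stray sample is removed before interpolation, every synthetic point stays inside the region spanned by the four remaining samples instead of being stretched towards (8, 8).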

Sampling for Approximate Maximum Search in Factorized Tensor

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/334 ◽

2017 ◽

Author(s):

Zhi Lu ◽

Yang Hu ◽

Bing Zeng

Keyword(s):

Theoretical Analysis ◽

Collaborative Filtering ◽

Real World ◽

State Of The Art ◽

The Other ◽

Data Sets ◽

Real World Data ◽

Parafac Model ◽

The Matrix ◽

Special Case

Factorization models have been extensively used for recovering the missing entries of a matrix or tensor. However, directly computing all of the entries using the learned factorization models is prohibitive when the size of the matrix/tensor is large. On the other hand, in many applications, such as collaborative filtering, we are only interested in a few entries that are the largest among them. In this work, we propose a sampling-based approach for finding the top entries of a tensor which is decomposed by the CANDECOMP/PARAFAC model. We develop an algorithm to sample the entries with probabilities proportional to their values. We further extend it to make the sampling proportional to the $k$-th power of the values, amplifying the focus on the top ones. We provide theoretical analysis of the sampling algorithm and evaluate its performance on several real-world data sets. Experimental results indicate that the proposed approach is orders of magnitude faster than exhaustive computing. When applied to the special case of searching in a matrix, it also requires fewer samples than the other state-of-the-art method.
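Sampling entries with probability proportional to their values can be illustrated without ever materializing the tensor. The sketch below assumes nonnegative factor matrices, which makes a simple two-stage scheme valid: pick a rank-one component with probability proportional to the product of its column sums, then pick one index per mode with probability proportional to that component's column. This is an illustration of the general idea, not the paper's algorithm, and it does not cover the extension to the k-th power. The name `sample_entries` and the toy factors are hypothetical.

```python
import random
from collections import Counter


def sample_entries(factors, n_samples, rng):
    """Draw index tuples with probability proportional to the CP tensor entry.

    factors: one matrix per mode, given as a list of rows with R nonnegative
    values each, so that entry(i, j, ...) = sum_r A[i][r] * B[j][r] * ...
    """
    rank = len(factors[0][0])
    # Total mass of each rank-one component = product of its column sums.
    weights = []
    for r in range(rank):
        w = 1.0
        for F in factors:
            w *= sum(row[r] for row in F)
        weights.append(w)
    samples = []
    for _ in range(n_samples):
        r = rng.choices(range(rank), weights=weights)[0]
        idx = tuple(
            rng.choices(range(len(F)), weights=[row[r] for row in F])[0]
            for F in factors
        )
        samples.append(idx)
    return samples


# Toy rank-2 tensor: entry (0,0,0) equals 27, entry (1,1,1) equals 1, rest are 0.
A = B = C = [[3.0, 0.0], [0.0, 1.0]]
counts = Counter(sample_entries([A, B, C], n_samples=1000, rng=random.Random(0)))
top_index = counts.most_common(1)[0][0]
```

Only the two nonzero entries can ever be drawn here, and the largest one dominates the counts, which is the property that makes this kind of sampling useful for approximate top-entry search.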

Co-GCN for Multi-View Semi-Supervised Learning

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.5901 ◽

2020 ◽

Vol 34 (04) ◽

pp. 4691-4698

Author(s):

Shu Li ◽

Wen-Tao Li ◽

Wei Wang

Keyword(s):

Supervised Learning ◽

Real World ◽

State Of The Art ◽

Data Sets ◽

Real World Data ◽

Convolutional Network ◽

The Past ◽

Real World Applications ◽

Supervised Methods ◽

Disjoint Sets

In many real-world applications, the data have several disjoint sets of features, and each set is called a view. Researchers have developed many multi-view learning methods in the past decade. In this paper, we bring the Graph Convolutional Network (GCN) into multi-view learning and propose a novel multi-view semi-supervised learning method, Co-GCN, which adaptively exploits the graph information from the multiple views with combined Laplacians. Experimental results on real-world data sets verify that Co-GCN can achieve better performance compared with state-of-the-art multi-view semi-supervised methods.

Distributed Pareto Optimization for Subset Selection

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/207 ◽

2018 ◽

Cited By ~ 2

Author(s):

Chao Qian ◽

Guiying Li ◽

Chao Feng ◽

Ke Tang

Keyword(s):

Real World ◽

Large Scale ◽

State Of The Art ◽

Subset Selection ◽

Data Sets ◽

Mapreduce Framework ◽

Real World Data ◽

Real World Applications ◽

Approximation Guarantee ◽

Better Than

The subset selection problem that selects a few items from a ground set arises in many applications such as maximum coverage, influence maximization, sparse regression, etc. The recently proposed POSS algorithm is a powerful approximation solver for this problem. However, POSS requires centralized access to the full ground set, and thus is impractical for large-scale real-world applications, where the ground set is too large to be stored on one single machine. In this paper, we propose a distributed version of POSS (DPOSS) with a bounded approximation guarantee. DPOSS can be easily implemented in the MapReduce framework. Our extensive experiments using Spark, on various real-world data sets with size ranging from thousands to millions, show that DPOSS can achieve competitive performance compared with the centralized POSS, and is almost always better than the state-of-the-art distributed greedy algorithm RandGreeDi.
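The distributed pattern described here (solve on partitions, merge the partial solutions, solve again) can be sketched on a maximum coverage instance. The abstract does not give the POSS update rule, so a plain greedy selector is swapped in as the local solver; the sketch therefore shows the two-round map/reduce structure only and is not DPOSS itself. The function names and the toy instance are hypothetical.

```python
def coverage(sets, chosen):
    """Number of distinct elements covered by the chosen set indices."""
    covered = set()
    for i in chosen:
        covered |= sets[i]
    return len(covered)


def greedy(sets, candidates, k):
    """Stand-in local solver: repeatedly add the candidate with the best gain."""
    chosen = []
    for _ in range(min(k, len(candidates))):
        best = max(
            (c for c in candidates if c not in chosen),
            key=lambda c: (coverage(sets, chosen + [c]), -c),  # ties: lowest index
        )
        chosen.append(best)
    return chosen


def two_round_selection(sets, k, n_machines):
    """Map: solve each partition. Reduce: solve again on the merged picks."""
    ids = list(range(len(sets)))
    partitions = [ids[m::n_machines] for m in range(n_machines)]
    local_solutions = [greedy(sets, part, k) for part in partitions]
    merged = sorted({i for sol in local_solutions for i in sol})
    final = greedy(sets, merged, k)
    # Return the best among the local solutions and the merged solution.
    return max(local_solutions + [final], key=lambda s: coverage(sets, s))


sets = [{1, 2, 3}, {3, 4}, {5, 6, 7, 8}, {1}, {8, 9}, {2, 4}]
selected = two_round_selection(sets, k=2, n_machines=2)
covered = coverage(sets, selected)
```

Each machine only ever sees its own partition plus, in the second round, the small merged candidate pool, which is what removes the need for centralized access to the full ground set.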

DV-DVFS: merging data variety and DVFS technique to manage the energy consumption of big data processing

Journal Of Big Data ◽

10.1186/s40537-021-00437-7 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Hossein Ahmadvand ◽

Fouzhan Foroutan ◽

Mahmood Fathy

Keyword(s):

Big Data ◽

Energy Consumption ◽

Processing Time ◽

Experimental Results ◽

The Other ◽

Data Sets ◽

Multiple Sources ◽

Evaluation Phase ◽

Dynamic Voltage ◽

Processing Resources

Data variety is one of the most important features of Big Data. Data variety is the result of aggregating data from multiple sources and of the uneven distribution of data. This feature of Big Data causes high variation in the consumption of processing resources such as CPU. This issue has been overlooked in previous works. To overcome this problem, in the present work, we used Dynamic Voltage and Frequency Scaling (DVFS) to reduce the energy consumption of computation. To this end, we consider two types of deadlines as our constraint. Before applying the DVFS technique to computer nodes, we estimate the processing time and the frequency needed to meet the deadline. In the evaluation phase, we have used a set of data sets and applications. The experimental results show that our proposed approach surpasses the other scenarios in processing real datasets. Based on the experimental results in this paper, DV-DVFS can achieve up to 15% improvement in energy consumption.
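The step of estimating the frequency needed to meet a deadline can be illustrated with a small calculation. This sketch assumes a CPU-bound task whose run time is roughly the cycle count divided by the clock frequency, and a node that exposes a fixed list of frequency levels. Both assumptions, along with the function name and the numbers, are hypothetical and do not come from the paper.

```python
def pick_frequency(estimated_cycles, deadline_s, available_hz):
    """Lowest available frequency that still meets the deadline.

    Assumes run time = estimated_cycles / frequency. If no level is fast
    enough, fall back to the highest one.
    """
    required_hz = estimated_cycles / deadline_s
    feasible = [f for f in sorted(available_hz) if f >= required_hz]
    return feasible[0] if feasible else max(available_hz)


levels = [1.2e9, 1.6e9, 2.0e9, 2.4e9]      # hypothetical DVFS levels in Hz
relaxed = pick_frequency(3.0e9, 2.0, levels)  # needs 1.5 GHz
tight = pick_frequency(3.0e9, 1.0, levels)    # needs 3.0 GHz, not available
```

A block of data that is cheap to process gets a low frequency and a heavy block gets a higher one, which is how variation in the data translates into energy savings under a fixed deadline.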

Theory and Applications of the Unit Gamma/Gompertz Distribution

Mathematics ◽

10.3390/math9161850 ◽

2021 ◽

Vol 9 (16) ◽

pp. 1850

Author(s):

Rashad A. R. Bantan ◽

Farrukh Jamal ◽

Christophe Chesneau ◽

Mohammed Elgarhy

Keyword(s):

Stochastic Ordering ◽

Real Data ◽

Rate Function ◽

The Other ◽

Likelihood Method ◽

Model Parameters ◽

Data Sets ◽

Gompertz Distribution ◽

Probability And Statistics ◽

Analytical Behavior

Unit distributions are commonly used in probability and statistics to describe useful quantities with values between 0 and 1, such as proportions, probabilities, and percentages. Some unit distributions are defined in a natural analytical manner, and the others are derived through the transformation of an existing distribution defined in a greater domain. In this article, we introduce the unit gamma/Gompertz distribution, founded on the inverse-exponential scheme and the gamma/Gompertz distribution. The gamma/Gompertz distribution is known to be a very flexible three-parameter lifetime distribution, and we aim to transpose this flexibility to the unit interval. First, we check this aspect with the analytical behavior of the primary functions. It is shown that the probability density function can be increasing, decreasing, “increasing-decreasing” and “decreasing-increasing”, with pliant asymmetric properties. On the other hand, the hazard rate function has monotonically increasing, decreasing, or constant shapes. We complete the theoretical part with some propositions on stochastic ordering, moments, quantiles, and the reliability coefficient. Practically, to estimate the model parameters from unit data, the maximum likelihood method is used. We present some simulation results to evaluate this method. Two applications using real data sets, one on trade shares and the other on flood levels, demonstrate the importance of the new model when compared to other unit models.
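The transformation that moves a lifetime distribution onto the unit interval can be written out explicitly. The derivation below is a sketch under the assumption that the "inverse-exponential scheme" means Y = e^{-X} for a lifetime variable X on (0, ∞) with CDF F and density f. The symbols X, Y, F, f, G and g are introduced here for illustration and are not taken from the article.

```latex
\begin{align*}
Y &= e^{-X}, \qquad X \in (0, \infty) \;\Longrightarrow\; Y \in (0, 1), \\
G(y) &= P\left(e^{-X} \le y\right) = P\left(X \ge -\ln y\right) = 1 - F(-\ln y), \qquad 0 < y < 1, \\
g(y) &= \frac{d}{dy}\, G(y) = \frac{f(-\ln y)}{y}.
\end{align*}
```

Under this assumption, any flexibility in the shape of f carries over to g, rescaled by the factor 1/y.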

A New No Equilibrium Fractional Order Chaotic System, Dynamical Investigation, Synchronization, and Its Digital Implementation

Inventions ◽

10.3390/inventions6030049 ◽

2021 ◽

Vol 6 (3) ◽

pp. 49

Author(s):

Zain-Aldeen S. A. Rahman ◽

Basil H. Jasim ◽

Yasir I. A. Al-Yasir ◽

Raed A. Abd-Alhameed ◽

Bilal Naji Alhasnawi

Keyword(s):

Adaptive Control ◽

Fractional Order ◽

Chaotic System ◽

Real World ◽

Experimental Results ◽

Chaotic Attractors ◽

State Variables ◽

Digital Implementation ◽

Dynamical Behaviors ◽

Real World Applications

In this paper, a new fractional order chaotic system without equilibrium is proposed, analytically and numerically investigated, and numerically and experimentally tested. The analytical and numerical investigations were used to describe the system’s dynamical behaviors including the system equilibria, the chaotic attractors, the bifurcation diagrams, and the Lyapunov exponents. Based on the obtained dynamical behaviors, the system can excite hidden chaotic attractors since it has no equilibrium. Then, a synchronization mechanism based on adaptive control theory was developed between two identical new systems (master and slave). The adaptive control laws are derived based on the synchronization error dynamics of the state variables for the master and slave. Consequently, the update laws of the slave parameters are obtained, where the slave parameters are assumed to be uncertain and are estimated from the corresponding master parameters by the synchronization process. Furthermore, Arduino Due boards were used to implement the proposed system in order to demonstrate its practicality in real-world applications. The simulation and experimental results were obtained with MATLAB and the Arduino Due boards, respectively, and show good consistency with each other, indicating that the new fractional order chaotic system can be employed in real-world applications.

A Survey on Bias and Fairness in Machine Learning

ACM Computing Surveys ◽

10.1145/3457607 ◽

2021 ◽

Vol 54 (6) ◽

pp. 1-35

Author(s):

Ninareh Mehrabi ◽

Fred Morstatter ◽

Nripsuta Saxena ◽

Kristina Lerman ◽

Aram Galstyan

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Deep Learning ◽

Real World ◽

State Of The Art ◽

Future Directions ◽

Discriminatory Behavior ◽

Real World Applications ◽

Near Future ◽

Different Sources

With the widespread use of artificial intelligence (AI) systems and applications in our everyday lives, accounting for fairness has gained significant importance in the design and engineering of such systems. AI systems can be used in many sensitive environments to make important and life-changing decisions; thus, it is crucial to ensure that these decisions do not reflect discriminatory behavior toward certain groups or populations. More recently, some work has been developed in traditional machine learning and deep learning that addresses such challenges in different subdomains. With the commercialization of these systems, researchers are becoming more aware of the biases that these applications can contain and are attempting to address them. In this survey, we investigated different real-world applications that have shown biases in various ways, and we listed different sources of biases that can affect AI applications. We then created a taxonomy for fairness definitions that machine learning researchers have defined to avoid the existing bias in AI systems. In addition, we examined different domains and subdomains in AI, showing what researchers have observed with regard to unfair outcomes in the state-of-the-art methods and the ways they have tried to address them. There are still many future directions and solutions that can be taken to mitigate the problem of bias in AI systems. We hope that this survey will motivate researchers to tackle these issues in the near future by observing existing work in their respective fields.
