A Novel Maximum Mean Discrepancy-Based Semi-Supervised Learning Algorithm

Mathematics ◽  
2021 ◽  
Vol 10 (1) ◽  
pp. 39
Author(s):  
Qihang Huang ◽  
Yulin He ◽  
Zhexue Huang

To provide more external knowledge for training semi-supervised learning (SSL) algorithms, this paper proposes a maximum mean discrepancy-based SSL (MMD-SSL) algorithm, which trains a well-performing classifier by iteratively refining it with highly confident unlabeled samples. The MMD-SSL algorithm performs three main steps. First, a multilayer perceptron (MLP) is trained on the labeled samples and then used to assign labels to the unlabeled samples. Second, the unlabeled samples are divided into multiple groups with the k-means clustering algorithm. Third, the maximum mean discrepancy (MMD) criterion is used to measure the distribution consistency between the k-means-clustered samples and the MLP-classified samples. Samples with consistent distributions are labeled as highly confident and used to retrain the MLP. The MMD-SSL algorithm iterates until all unlabeled samples are consistently labeled. We conducted extensive experiments on 29 benchmark data sets to validate the rationality and effectiveness of the MMD-SSL algorithm. Experimental results show that the generalization capability of the MLP gradually improves as the number of labeled samples increases, and statistical analysis demonstrates that the MMD-SSL algorithm achieves better testing accuracy and kappa values than 10 other self-training and co-training SSL algorithms.
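The distribution-consistency test in the third step rests on the MMD criterion. A minimal numpy sketch of a (biased) squared-MMD estimate is shown below; the RBF kernel and its bandwidth `gamma` are illustrative assumptions, not choices stated in the abstract.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise RBF kernel matrix between the rows of X and Y.
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * sq)

def mmd2(X, Y, gamma=1.0):
    # Biased estimate of the squared maximum mean discrepancy:
    # E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)].
    return (rbf_kernel(X, X, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean()
            - 2.0 * rbf_kernel(X, Y, gamma).mean())

rng = np.random.default_rng(0)
same = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)))
diff = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2)))
# Samples from the same distribution score near zero; shifted ones score higher.
```

Cluster/classifier sample groups whose pairwise MMD falls below a threshold would count as distributionally consistent.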

2020 ◽  
Vol 19 (01) ◽  
pp. 283-316 ◽  
Author(s):  
Luis Morales ◽  
José Aguilar ◽  
Danilo Chávez ◽  
Claudia Isaza

This paper proposes a new approach to improve the performance of the Learning Algorithm for Multivariable Data Analysis (LAMDA). This algorithm can be used for supervised and unsupervised learning, based on calculating the Global Adequacy Degree (GAD) of an individual to a class through the contributions of all its descriptors. LAMDA can create new classes after the training stage: if an individual is not sufficiently similar to the preexisting classes, it is evaluated against a threshold called the Non-Informative Class (NIC), this being the novelty of the algorithm. However, LAMDA can make poor classifications, either because the NIC is constant for all classes or because the GAD calculation is unreliable. In this work, its efficiency is improved by two strategies: first, calculating an adaptable NIC for each class, which prevents correctly classified individuals from creating new classes; and second, computing the Higher Adequacy Degree (HAD), which grants more robustness to the algorithm. LAMDA-HAD is validated on different benchmarks and compared with LAMDA and other classifiers, through a statistical analysis to determine the cases in which our algorithm performs better.
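The GAD-versus-NIC decision can be sketched as follows. The fuzzy-binomial marginal adequacy degree and the mixed min/max aggregation are one common LAMDA formulation, and the prototype values, the exigency weight `alpha`, and the constant NIC of 0.5 are illustrative assumptions.

```python
import numpy as np

def mad(x, rho):
    # Marginal adequacy degree of descriptor values x (scaled to [0, 1])
    # against a class prototype rho, via the fuzzy binomial.
    x, rho = np.asarray(x, float), np.asarray(rho, float)
    return rho**x * (1 - rho)**(1 - x)

def gad(x, rho, alpha=0.7):
    # Global adequacy degree: mixed t-norm/t-conorm aggregation of the MADs.
    m = mad(x, rho)
    return alpha * m.min() + (1 - alpha) * m.max()

proto = [0.9, 0.8, 0.85]                 # illustrative class prototype
near = gad([0.88, 0.82, 0.9], proto)     # individual similar to the prototype
far = gad([0.1, 0.2, 0.05], proto)       # dissimilar individual
# With a constant NIC of 0.5, `near` is assigned to the class, while
# `far` falls below the threshold and would spawn a new class.
```

Making the NIC adaptable per class, as proposed here, replaces the fixed 0.5 with a class-specific threshold.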


Author(s):  
C. James Li ◽  
C. Jansuwan

The projection network, being a non-linear dynamic system itself, has been shown to be superior to static classifiers such as neural networks in some applications where noise is significant. However, it is a supervised classifier by nature. To extend its utility to unsupervised classification, this study proposes an unsupervised pattern classifier integrating a clustering algorithm based on DBSCAN with a dynamic classifier based on the projection network. The former is used to form clusters out of unlabeled data and eliminate outliers. Then, clusters that are significant in terms of size are identified. Subsequently, a system of projection networks is established to recognize all the significant clusters. The unsupervised classifier is tested on three well-known benchmark data sets (ignoring data labels during training), including Fisher's iris data, the heart disease data, and the credit screening data, and the results are compared with those of supervised classifiers based on the projection network. The difference in performance is small. However, the ability to classify without supervision comes at the price of a more complex classifier system and the need for data pre-conditioning. The former is because more than one cluster can be formed per class, so more computational units are needed in the classifier; the latter is because the increased similarity of data after clustering raises the chance of numerical instability in the least-squares algorithm used to initialize the classifier.
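The front end of this pipeline, forming clusters and discarding outliers, is standard DBSCAN. A compact numpy sketch (eps, min_pts, and the toy data are illustrative, not the paper's settings):

```python
import numpy as np

def dbscan(X, eps=1.0, min_pts=4):
    # Minimal DBSCAN: returns integer cluster labels; -1 marks noise/outliers.
    n = len(X)
    labels = np.full(n, -1)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    visited = np.zeros(n, bool)
    cluster = 0
    for i in range(n):
        if visited[i] or len(neighbors[i]) < min_pts:
            continue
        # Grow a new cluster from core point i; border points are labeled
        # but only core points expand the frontier.
        stack = [i]
        visited[i] = True
        while stack:
            j = stack.pop()
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                for k in neighbors[j]:
                    if not visited[k]:
                        visited[k] = True
                        stack.append(k)
        cluster += 1
    return labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.1, (20, 2)),   # blob A
               rng.normal([5, 5], 0.1, (20, 2)),   # blob B
               [[10.0, 10.0]]])                    # isolated outlier
labels = dbscan(X, eps=1.0, min_pts=4)
```

The surviving clusters (labels 0, 1, ...) would each be handed to a projection network; the `-1` outliers are dropped.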


Author(s):  
Ting Xie ◽  
Taiping Zhang

As a powerful unsupervised learning technique, clustering is a fundamental task of big data analysis. However, many traditional clustering algorithms perform poorly on big data, which is typically high-dimensional, sparse, and noisy, in terms of both computational efficiency and clustering accuracy. To alleviate these problems, this paper presents the Feature K-means clustering model on the feature space of big data and introduces a fast algorithm for it based on the Alternating Direction Method of Multipliers (ADMM). We show the equivalence of the Feature K-means model in the original space and the feature space and prove the convergence of its iterative algorithm. Computationally, we compare Feature K-means with Spherical K-means and Kernel K-means on several benchmark data sets, including artificial data and four face databases. Experiments show that the proposed approach is comparable to state-of-the-art algorithms in big data clustering.
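For reference, the classical K-means baseline that the Feature K-means model reformulates is plain Lloyd's iteration. A minimal sketch (the explicit initial centers and the toy blobs are illustrative assumptions):

```python
import numpy as np

def kmeans(X, centers, iters=100):
    # Lloyd's iterations: assign each point to its nearest center,
    # then move each center to the mean of its assigned points.
    centers = np.asarray(centers, float)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None, :], axis=2)
        assign = d.argmin(1)
        new = np.array([X[assign == j].mean(0) if np.any(assign == j)
                        else centers[j] for j in range(len(centers))])
        if np.allclose(new, centers):   # converged: centers stopped moving
            break
        centers = new
    return assign, centers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(4, 0.3, (30, 2))])
assign, centers = kmeans(X, [[1.0, 1.0], [3.0, 3.0]])  # illustrative init
```

Feature K-means runs an analogous objective in a feature space and solves it with ADMM updates instead of the plain alternating steps above.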


1995 ◽  
Vol 7 (1) ◽  
pp. 51-71 ◽  
Author(s):  
Eric Saund

This paper presents a formulation for unsupervised learning of clusters reflecting multiple causal structure in binary data. Unlike the “hard” k-means clustering algorithm and the “soft” mixture model, each of which assumes that a single hidden event generates each data point, a multiple cause model accounts for observed data by combining assertions from many hidden causes, each of which can pertain to varying degree to any subset of the observable dimensions. We employ an objective function and iterative gradient descent learning algorithm resembling the conventional mixture model. A crucial issue is the mixing function for combining beliefs from different cluster centers in order to generate data predictions whose errors are minimized both during recognition and learning. The mixing function constitutes a prior assumption about underlying structural regularities of the data domain; we demonstrate a weakness inherent to the popular weighted sum followed by sigmoid squashing, and offer alternative forms of the nonlinearity for two types of data domain. Results are presented demonstrating the algorithm's ability to discover coherent multiple causal representations in several experimental data sets.
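The contrast between mixing functions can be made concrete. Below, the weighted-sum-plus-sigmoid form is compared with noisy-OR, a standard alternative nonlinearity for binary data (the specific matrices and probabilities are illustrative, and noisy-OR stands in for the paper's alternatives rather than reproducing them):

```python
import numpy as np

def sigmoid_mix(m, W):
    # Weighted sum of cause activities m (length K) through weights W
    # (K x D), squashed per observable dimension.
    return 1.0 / (1.0 + np.exp(-(m @ W)))

def noisy_or_mix(m, P):
    # Noisy-OR: cause k independently turns bit d on with probability
    # m[k] * P[k, d]; the bit stays off only if every cause fails.
    return 1.0 - np.prod(1.0 - m[:, None] * P, axis=0)

# Two causes, each responsible for a disjoint half of four binary dims.
P = np.array([[0.9, 0.9, 0.0, 0.0],
              [0.0, 0.0, 0.9, 0.9]])
both = noisy_or_mix(np.array([1.0, 1.0]), P)   # both causes active
one = noisy_or_mix(np.array([1.0, 0.0]), P)    # only the first cause
off = sigmoid_mix(np.zeros(2), P)              # all causes inactive
# noisy-OR cleanly composes the causes' assertions; the sigmoid form
# predicts 0.5 everywhere when no cause is active, illustrating one way
# the weighted-sum nonlinearity can misrepresent binary structure.
```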


2009 ◽  
Vol 19 (01) ◽  
pp. 1-9 ◽  
Author(s):  
JOSHUA MENKE ◽  
TONY MARTINEZ

While no supervised learning algorithm can do well over all functions, we show that it may be possible to adapt a given function to a given supervised learning algorithm so as to allow the learning algorithm to better classify the original function. Although this seems counterintuitive, adapting the problem to the learner may result in an equivalent function that is "easier" for the algorithm to learn. One method of adapting a problem to the learner is to relabel the targets given in the training data. The following presents two problem adaptation methods, SOL-CTR-E and SOL-CTR-P, variants of Self-Oracle Learning with Confidence-based Target Relabeling (SOL-CTR) as a proof of concept for problem adaptation. The SOL-CTR methods produce "easier" target functions for training artificial neural networks (ANNs). Applying SOL-CTR over 41 data sets consistently results in a statistically significant (p < 0.05) improvement in accuracy over 0/1 targets on data sets containing over 10,000 training examples.
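The core idea of relabeling targets can be sketched generically: soften the hard 0/1 targets toward the network's own predictions in proportion to its confidence. This is an illustration of the general mechanism only; the exact SOL-CTR-E and SOL-CTR-P rules (in particular how confidence is estimated) are defined in the paper.

```python
import numpy as np

def relabel(targets, preds, confidence):
    # Confidence-weighted blend of the original hard targets with the
    # model's current predictions. confidence=0 keeps 0/1 targets;
    # confidence=1 trusts the model entirely.
    t = np.asarray(targets, float)
    p = np.asarray(preds, float)
    return (1.0 - confidence) * t + confidence * p

t = np.array([1.0, 0.0, 1.0])     # original 0/1 targets
p = np.array([0.9, 0.3, 0.4])     # current network outputs
soft = relabel(t, p, 0.5)         # softened targets for retraining
```

Training the ANN on `soft` instead of `t` is the sense in which the problem, rather than the learner, is adapted.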


Author(s):  
Jiří Fejfar ◽  
Jiří Šťastný

The clustering of time series is a widely researched area, and there are many methods for dealing with this task. We use the self-organizing map (SOM) with an unsupervised learning algorithm for clustering of time series. After the first experiment (Fejfar, Weinlichová, Šťastný, 2009), it seems that the overall concept of the clustering algorithm is correct, but that we have to perform time series clustering on a much larger dataset to obtain more accurate results and to find the correlation between configured parameters and results more precisely. The second requirement arose from the need for a well-defined evaluation of results. It again seems useful to use sound recordings as instances of time series: there are many recordings available in digital libraries, and many interesting features and patterns can be found in this area. In this experiment we search for recordings with a similar development of information density, which can be used for musical form investigation, cover song detection, and many other applications. The objective of the presented paper is to compare clustering results obtained with different parameters of the feature vectors and of the SOM itself. We describe the time series in a simple way, evaluating standard deviations over separate parts of the recordings. The resulting feature vectors are clustered with the SOM in batch training mode with different topologies, varying from a few neurons to large maps. We also discuss other algorithms usable for finding similarities between time series, present conclusions for further research, and give an overview of the related current literature and projects.
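The described feature extraction, per-segment standard deviations as a profile of how a recording's variability develops, can be sketched directly (the segment count and the synthetic "quiet then loud" signal are illustrative assumptions):

```python
import numpy as np

def segment_std_features(signal, n_segments=8):
    # Describe a recording by the standard deviation of each of
    # n_segments equal parts: a crude profile of how the signal's
    # variability (and hence information density) develops over time.
    parts = np.array_split(np.asarray(signal, float), n_segments)
    return np.array([p.std() for p in parts])

t = np.linspace(0, 1, 8000)
quiet_then_loud = np.concatenate([
    0.1 * np.sin(40 * 2 * np.pi * t[:4000]),   # quiet first half
    1.0 * np.sin(40 * 2 * np.pi * t[4000:]),   # loud second half
])
fv = segment_std_features(quiet_then_loud)
# The first half of the feature vector is roughly 10x smaller than the second.
```

Vectors like `fv` are what the batch-mode SOM then clusters, with the map topology as the main tuned parameter.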


2014 ◽  
Vol 2014 ◽  
pp. 1-14 ◽  
Author(s):  
Yu Zhou ◽  
Lian Tian ◽  
Linfei Liu

The extension neural network (ENN) is a new neural network combining extension theory and the artificial neural network (ANN). The learning algorithm of ENN is a supervised learning algorithm. One of the important issues in the field of classification and recognition with ENN is how to achieve the best possible classifier with a small amount of labeled training data. Training data selection is an effective approach to this issue. In this work, in order to improve the supervised learning performance and expand the engineering application range of ENN, we use a novel data selection method based on shadowed sets to refine the training data set of ENN. First, we use a clustering algorithm to label the data and induce shadowed sets. Then, in the framework of shadowed sets, the samples located around each cluster center (core data) and around the borders between clusters (boundary data) are selected as training data. Lastly, we use the selected data to train ENN. Compared with the traditional ENN, the proposed improved ENN (IENN) has better performance. Moreover, IENN is independent of the supervised learning algorithm and the initial labeled data. Experimental results verify the effectiveness and applicability of our proposed work.
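The shadowed-set selection step amounts to a three-way split of fuzzy memberships into core, shadow (boundary), and excluded regions. A minimal sketch, with fixed thresholds as an illustrative assumption (in the shadowed-set framework the thresholds are instead derived by optimizing a balance criterion over the memberships):

```python
import numpy as np

def shadowed_partition(memberships, lo=0.3, hi=0.7):
    # Three-way split of fuzzy memberships: core (>= hi), shadow
    # (between lo and hi), excluded (< lo). Core samples sit near a
    # cluster center; shadow samples sit near inter-cluster borders.
    m = np.asarray(memberships, float)
    core = m >= hi
    shadow = (m >= lo) & (m < hi)
    return core, shadow

m = np.array([0.95, 0.8, 0.5, 0.4, 0.1])   # illustrative memberships
core, shadow = shadowed_partition(m)
# Core data plus boundary (shadow) data form the refined training set.
```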


2014 ◽  
Vol 2014 ◽  
pp. 1-9 ◽  
Author(s):  
Jiangyuan Mei ◽  
Jian Hou ◽  
Jicheng Chen ◽  
Hamid Reza Karimi

Classification of large data sets is widely used in many industrial applications. It is a challenging task to classify large data sets efficiently, accurately, and robustly, as they always contain numerous instances with a high-dimensional feature space. To deal with this problem, in this paper we present an online Logdet divergence-based metric learning (LDML) model that exploits the power of metric learning. We first generate a Mahalanobis matrix by learning the training data with the LDML model. Meanwhile, we propose a compressed representation of the high-dimensional Mahalanobis matrix to reduce the computational complexity of each iteration. The final Mahalanobis matrix obtained this way measures the distances between instances accurately and serves as the basis of classifiers, for example, the k-nearest neighbors classifier. Experiments on benchmark data sets demonstrate that the proposed algorithm compares favorably with the state-of-the-art methods.
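The two building blocks named here, the Mahalanobis distance under a learned matrix and the LogDet divergence that regularizes its updates, can be sketched in a few lines (the identity-matrix examples are illustrative only):

```python
import numpy as np

def mahalanobis_sq(x, y, M):
    # Squared Mahalanobis distance under a learned positive-definite M.
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(d @ M @ d)

def logdet_div(A, B):
    # LogDet (Burg) divergence between positive-definite matrices:
    # D_ld(A, B) = tr(A B^-1) - log det(A B^-1) - n.
    P = A @ np.linalg.inv(B)
    return float(np.trace(P) - np.linalg.slogdet(P)[1] - len(A))

I = np.eye(2)
# With M = I the metric reduces to squared Euclidean distance,
# and the divergence of a matrix from itself is zero.
d0 = mahalanobis_sq([1.0, 0.0], [0.0, 0.0], I)   # -> 1.0
z = logdet_div(I, I)                             # -> 0.0
```

Online LDML-style updates keep the evolving Mahalanobis matrix close to its previous value in exactly this divergence while satisfying new pairwise constraints.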


2012 ◽  
Vol 490-495 ◽  
pp. 1372-1376
Author(s):  
Qing Feng Liu

The fuzzy C-means (FCM) algorithm is an iterative algorithm in which the desired number of clusters C and the initial clustering seeds have to be pre-defined. The seeds are modified in each stage of the algorithm, and for each object a degree of membership in each of the clusters is estimated. In this paper, an extended FCM clustering algorithm based on an intuitionistic extension index, denoted the E-FCM algorithm, is proposed. Experimental results on three benchmark data sets comparing the two algorithms show that the E-FCM algorithm outperforms the FCM algorithm.
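The membership-estimation step of standard FCM, the part the intuitionistic extension index modifies, is the familiar update below (the fuzzifier m = 2 and the toy centers are illustrative assumptions):

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0):
    # Standard FCM membership update:
    # u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1)),
    # where d_ij is the distance from point i to center j.
    d = np.linalg.norm(X[:, None] - centers[None, :], axis=2) + 1e-12
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

centers = np.array([[0.0, 0.0], [5.0, 5.0]])
X = np.array([[0.1, 0.0],    # near the first center
              [5.0, 4.9],    # near the second center
              [2.5, 2.5]])   # equidistant: ambiguous membership
U = fcm_memberships(X, centers)
```

Each row of `U` sums to one; E-FCM reweights such memberships through the intuitionistic extension index before the seeds are updated.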


The main objective of data mining in agriculture is to improve productivity based on observed data and cultivation timelines. Spatial data mining is key to capturing such data: sensors are deployed at a particular geographical location and various parameters are observed to enhance productivity through statistical analysis of the collected data. In general, data mining anticipates measurements, prognosticates from the various data sets, and transforms them into useful data sets that can be applied in various applications. In this paper, data mining is applied to match soil conditions to suitable crops, enhancing productivity and enabling multiple-crop cultivation based on the acquired data sets. A statistical analysis produced by a backend algorithm over the data sets is displayed as a dashboard with the forecasted productivity. A grid-based clustering algorithm is used at the backend to analyze the collected data sets, yielding crop selectivity and productivity timelines. The geographical analysis forms a grid pattern with multiple data sets as a matrix, resulting in multiple-crop selectivity based on the soil conditions and the analyzed data sets obtained from the various sensor parameters at a particular location. Data visualization is performed after the algorithmic processing at the backend, with the data stored in a cloud server. Spatial surveys and the collected data sets analyzed with the algorithm are used to improve crop selectivity and productivity on a soil, based on biological predictions and on defoliant and manure usage timelines, yielding improved monetary returns.
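The grid-based backend can be sketched as a density-per-cell computation over geo-referenced sensor samples; the cell size, the sample coordinates, and the idea of merging dense adjacent cells into zones are illustrative assumptions, not details given above.

```python
import math

def grid_cells(points, cell=1.0):
    # Grid-based clustering step: hash each (x, y) field sample into a
    # square cell and count per-cell density. Dense adjacent cells would
    # then be merged into zones with similar soil/sensor readings.
    counts = {}
    for x, y in points:
        key = (math.floor(x / cell), math.floor(y / cell))
        counts[key] = counts.get(key, 0) + 1
    return counts

samples = [(0.2, 0.3), (0.8, 0.6), (3.1, 3.4)]   # illustrative sensor locations
cells = grid_cells(samples)
```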

