Evolutionary Feature Manipulation in Unsupervised Learning

2021 ◽  
Author(s):  
Andrew Lensen

Unsupervised learning is a fundamental category of machine learning that works on data for which no pre-existing labels are available. Unlike in supervised learning, which has such labels, methods that perform unsupervised learning must discover intrinsic patterns within data.

The size and complexity of data have increased substantially in recent years, which has necessitated the creation of new techniques for reducing the complexity and dimensionality of data in order to allow humans to understand the knowledge contained within it. This is particularly problematic in unsupervised learning, as the number of possible patterns in a dataset grows exponentially with the number of dimensions. Feature manipulation techniques such as feature selection (FS) and feature construction (FC) are often used in these situations. FS automatically selects the most valuable features (attributes) in a dataset, whereas FC constructs new, more powerful and meaningful features that provide a lower-dimensional space.

Evolutionary computation (EC) approaches have become increasingly recognised for their potential to provide high-quality solutions to data mining problems in a reasonable amount of computational time. Unlike other popular techniques such as neural networks, EC methods have global search ability without needing gradient information, which makes them much more flexible and applicable to a wider range of problems. EC approaches have shown significant potential in feature manipulation tasks, with methods such as Particle Swarm Optimisation (PSO) commonly used for FS and Genetic Programming (GP) for FC.

The use of EC for feature manipulation has, until now, been predominantly restricted to supervised learning problems. This is a notable gap in the research: if unsupervised learning is even more sensitive to high dimensionality, then why is EC-based feature manipulation not used for unsupervised learning problems?

This thesis provides the first comprehensive investigation into the use of evolutionary feature manipulation for unsupervised learning tasks. It clearly shows the ability of evolutionary feature manipulation to improve both the performance of algorithms and the interpretability of solutions in unsupervised learning tasks. A variety of tasks are investigated, including the well-established task of clustering, as well as more recent unsupervised learning problems, such as benchmark dataset creation and manifold learning.

This thesis proposes a new PSO-based approach to performing simultaneous FS and clustering. A number of improvements to the state of the art are made, including the introduction of a new medoid-based representation and an improved fitness function. A sophisticated three-stage algorithm, which takes advantage of heuristic techniques to determine the number of clusters and to fine-tune clustering performance, is also developed. Empirical evaluation on a range of clustering problems demonstrates a decrease in the number of features used, while also improving the clustering performance.

This thesis also introduces two innovative approaches to performing wrapper-based FC in clustering tasks using GP. An initial approach, in which constructed features are directly provided to the k-means clustering algorithm, demonstrates the clear strength of GP-based FC for improving clustering results. A more advanced method is proposed that utilises the functional nature of GP-based FC to evolve more specific, concise, and understandable similarity functions for use in clustering algorithms. These similarity functions provide clear improvements in performance and can be easily interpreted by machine learning practitioners.

This thesis demonstrates the ability of evolutionary feature manipulation to solve unsupervised learning tasks that traditional methods have struggled with. The synthesis of benchmark datasets has long been a technique used for evaluating machine learning techniques, but this research is the first to present an approach that automatically creates diverse and challenging redundant features for a given dataset. This thesis introduces a GP-based FC approach that creates difficult benchmark datasets for evaluating FS algorithms. It also makes the intriguing discovery that a mutual information-based fitness function with GP has the potential to improve supervised learning tasks even when the labels are not utilised.

Manifold learning is an approach to dimensionality reduction that aims to discover the inherent lower-dimensional structure of a dataset. While state-of-the-art manifold learning approaches show impressive performance in reducing data dimensionality, they do so at the cost of removing the ability for humans to understand the data in terms of the original features. By utilising a GP-based approach, this thesis proposes new methods that can perform interpretable manifold learning, which provides deep insight into patterns in the data.

These four contributions clearly support the hypothesis that evolutionary feature manipulation has untapped potential in unsupervised learning. This thesis demonstrates that EC-based feature manipulation can be successfully applied to a variety of unsupervised learning tasks with clear improvements in both performance and interpretability. A number of promising future research directions are also identified, which we hope will lead to further valuable findings in this area.
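To make the wrapper-based FC idea concrete, the following sketch (not the thesis's actual algorithm) approximates a GP individual with two hand-written feature expressions, feeds the constructed features to k-means, and scores the result with silhouette; the dataset, feature trees, and fitness choice are illustrative assumptions.

```python
# Hedged sketch of wrapper-based feature construction for clustering:
# a GP individual is approximated here by fixed arithmetic feature trees.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)            # labels ignored: unsupervised setting

def constructed_features(X):
    # Two hand-written "GP trees" combining original features (illustrative only).
    f1 = X[:, 2] * X[:, 3] - X[:, 0]          # (petal_len * petal_wid) - sepal_len
    f2 = X[:, 0] / (X[:, 1] + 1e-6)           # sepal_len / sepal_wid
    return np.column_stack([f1, f2])

Z = constructed_features(X)                    # lower-dimensional constructed space
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
print("silhouette in constructed space:", silhouette_score(Z, labels))
```

In an evolutionary wrapper, the clustering quality measure would serve as the fitness that drives the search over candidate feature trees.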


2020 ◽  
Vol 15 ◽  
Author(s):  
Shuwen Zhang ◽  
Qiang Su ◽  
Qin Chen

Abstract: Major animal diseases pose a great threat to animal husbandry and to human beings. As globalization deepens and data resources grow, predicting and analysing animal diseases using big data is becoming increasingly important. Machine learning focuses on enabling computers to learn from data and to use that learned experience for analysis and prediction. Firstly, this paper introduces the animal epidemic situation and machine learning. Then it briefly introduces the application of machine learning in animal disease analysis and prediction. Machine learning is mainly divided into supervised learning and unsupervised learning. Supervised learning includes support vector machines, naive Bayes, decision trees, random forests, logistic regression, artificial neural networks, deep learning, and AdaBoost. Unsupervised learning includes the expectation-maximization (EM) algorithm, principal component analysis, hierarchical clustering, and maximum entropy (MaxEnt). This discussion aims to give readers a clearer picture of machine learning and of its application prospects for animal disease analysis.
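As a minimal illustration of the supervised/unsupervised split the abstract describes (the data and model choices below are placeholders, not those of the paper), a random forest can be trained on labelled outbreak records while clustering and PCA explore unlabelled ones:

```python
# Illustrative contrast between supervised and unsupervised learning (hypothetical data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                 # hypothetical animal-health indicators
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # hypothetical outbreak label

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)    # supervised
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)   # unsupervised
X2 = PCA(n_components=2).fit_transform(X)     # unsupervised dimensionality reduction
print(clf.score(X, y), np.bincount(clusters), X2.shape)
```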


Author(s):  
Joseph D. Romano ◽  
Trang T. Le ◽  
Weixuan Fu ◽  
Jason H. Moore

Abstract: Automated machine learning (AutoML) and artificial neural networks (ANNs) have revolutionized the field of artificial intelligence by yielding incredibly high-performing models to solve a myriad of inductive learning tasks. In spite of their successes, little guidance exists on when to use one versus the other. Furthermore, relatively few tools exist that allow the integration of both AutoML and ANNs in the same analysis to yield results combining both of their strengths. Here, we present TPOT-NN—a new extension to the tree-based AutoML software TPOT—and use it to explore the behavior of automated machine learning augmented with neural network estimators (AutoML+NN), particularly when compared to non-NN AutoML in the context of simple binary classification on a number of public benchmark datasets. Our observations suggest that TPOT-NN is an effective tool that achieves greater classification accuracy than standard tree-based AutoML on some datasets, with no loss in accuracy on others. We also provide preliminary guidelines for performing AutoML+NN analyses, and recommend possible future directions for AutoML+NN methods research, especially in the context of TPOT.
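A minimal usage sketch of the neural-network-augmented configuration, assuming the tpot package's documented TPOTClassifier interface and its 'TPOT NN' configuration name; the dataset and hyperparameters are placeholders rather than the paper's benchmark setup.

```python
# Hedged sketch: running TPOT with the neural-network estimator configuration.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier  # assumes the tpot package is installed

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# config_dict='TPOT NN' adds PyTorch-based estimators to the search space (per TPOT docs).
automl = TPOTClassifier(generations=5, population_size=20,
                        config_dict='TPOT NN', verbosity=2, random_state=42)
automl.fit(X_tr, y_tr)
print(automl.score(X_te, y_te))
automl.export('tpot_nn_pipeline.py')   # export the best discovered pipeline as Python code
```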


Algorithms ◽  
2021 ◽  
Vol 14 (2) ◽  
pp. 39
Author(s):  
Carlos Lassance ◽  
Vincent Gripon ◽  
Antonio Ortega

Deep Learning (DL) has attracted a lot of attention for its ability to reach state-of-the-art performance in many machine learning tasks. The core principle of DL methods consists of training composite architectures in an end-to-end fashion, where inputs are associated with outputs trained to optimize an objective function. Because of their compositional nature, DL architectures naturally exhibit several intermediate representations of the inputs, which belong to so-called latent spaces. When treated individually, these intermediate representations are usually left unconstrained during the learning process, as it is unclear which properties should be favored. However, when processing a batch of inputs concurrently, the corresponding set of intermediate representations exhibits relations (what we call a geometry) on which desired properties can be sought. In this work, we show that it is possible to introduce constraints on these latent geometries to address various problems. In more detail, we propose to represent geometries by constructing similarity graphs from the intermediate representations obtained when processing a batch of inputs. By constraining these Latent Geometry Graphs (LGGs), we address the following three problems: (i) reproducing the behavior of a teacher architecture is achieved by mimicking its geometry, (ii) designing efficient embeddings for classification is achieved by targeting specific geometries, and (iii) robustness to deviations on inputs is achieved by enforcing smooth variation of geometry between consecutive latent spaces. Using standard vision benchmarks, we demonstrate the ability of the proposed geometry-based methods to solve the considered problems.
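A minimal sketch of the latent-geometry idea for the distillation case (i): build a similarity graph over a batch of intermediate representations and penalise the student for deviating from the teacher's graph. The tensor shapes and the cosine-similarity choice are illustrative assumptions, not the paper's exact construction.

```python
# Hedged sketch: Latent Geometry Graphs from a batch of intermediate representations.
import torch
import torch.nn.functional as F

def latent_geometry_graph(h: torch.Tensor) -> torch.Tensor:
    """Similarity graph (batch x batch) built from one layer's representations."""
    h = F.normalize(h.flatten(start_dim=1), dim=1)    # one row per input in the batch
    return h @ h.t()                                   # cosine similarities as edge weights

# Toy batch of intermediate representations from a teacher layer and a student layer.
teacher_h = torch.randn(32, 256)
student_h = torch.randn(32, 64, requires_grad=True)

geometry_loss = F.mse_loss(latent_geometry_graph(student_h),
                           latent_geometry_graph(teacher_h))  # mimic the teacher's geometry
geometry_loss.backward()
```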


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Nasser Assery ◽  
Yuan (Dorothy) Xiaohong ◽  
Qu Xiuli ◽  
Roy Kaushik ◽  
Sultan Almalki

Purpose This study aims to propose an unsupervised learning model to evaluate the credibility of disaster-related Twitter data and present a performance comparison with commonly used supervised machine learning models. Design/methodology/approach First, historical tweets on two recent hurricane events are collected via the Twitter API. Then a credibility scoring system is implemented, in which the tweet features are analyzed to give a credibility score and credibility label to each tweet. After that, supervised machine learning classification is implemented using various classification algorithms, and their performances are compared. Findings The proposed unsupervised learning model could enhance the emergency response by providing a fast way to determine the credibility of disaster-related tweets. Additionally, the comparison of the supervised classification models reveals that the Random Forest classifier performs significantly better than the SVM and Logistic Regression classifiers in classifying the credibility of disaster-related tweets. Originality/value In this paper, an unsupervised 10-point scoring model is proposed to evaluate tweets’ credibility based on user-based and content-based features. This technique could be used to evaluate the credibility of disaster-related tweets on future hurricanes and has the potential to enhance emergency response during critical events. The comparative study of different supervised learning methods has revealed effective supervised learning methods for evaluating the credibility of Twitter data.
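The abstract does not list the ten scoring criteria, so the sketch below is only a hypothetical example of a 10-point user/content-based scorer; every feature name and threshold is an assumption made for illustration.

```python
# Hedged sketch: a hypothetical 10-point tweet credibility scorer (illustrative only).
def credibility_score(tweet: dict) -> int:
    checks = [
        tweet.get("verified", False),                  # user-based features (assumed)
        tweet.get("followers", 0) > 1000,
        tweet.get("account_age_days", 0) > 365,
        tweet.get("has_profile_description", False),
        tweet.get("followers", 0) > tweet.get("following", 1),
        "http" in tweet.get("text", ""),               # content-based features (assumed)
        not tweet.get("text", "").isupper(),
        len(tweet.get("text", "")) > 40,
        tweet.get("retweets", 0) > 0,
        tweet.get("has_media", False),
    ]
    return sum(bool(c) for c in checks)                # 0 (least credible) .. 10 (most credible)

example = {"verified": True, "followers": 5000, "text": "Flooding reported on I-10, details: http://..."}
label = "credible" if credibility_score(example) >= 6 else "questionable"
```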


Author(s):  
Greg Ver Steeg

Learning by children and animals occurs effortlessly and largely without obvious supervision. Successes in automating supervised learning have not translated to the more ambiguous realm of unsupervised learning where goals and labels are not provided. Barlow (1961) suggested that the signal that brains leverage for unsupervised learning is dependence, or redundancy, in the sensory environment. Dependence can be characterized using the information-theoretic multivariate mutual information measure called total correlation. The principle of Total Cor-relation Ex-planation (CorEx) is to learn representations of data that "explain" as much dependence in the data as possible. We review some manifestations of this principle along with successes in unsupervised learning problems across diverse domains including human behavior, biology, and language.
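For reference (standard information-theoretic definitions, stated here as background rather than quoted from the article), total correlation over variables $X_1, \dots, X_n$ is

$TC(X) = \sum_{i=1}^{n} H(X_i) - H(X_1, \dots, X_n),$

and CorEx learns latent factors $Y$ that maximise the dependence they explain,

$TC(X; Y) = TC(X) - TC(X \mid Y),$

so that $TC(X \mid Y) = 0$ exactly when the learned representation accounts for all of the redundancy in the data.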


2021 ◽  
Author(s):  
Mouna Hakami

This thesis presents two studies on non-intrusive speech quality assessment methods. The first applies supervised learning methods to speech quality assessment, which is a common approach in machine learning based quality assessment. To outperform existing methods, we concentrate on enhancing the feature set. In the second study, we analyse quality assessment from a different point of view inspired by the biological brain and present the first unsupervised learning based non-intrusive quality assessment method, which removes the need for labelled training data.

Supervised learning based, non-intrusive quality predictors generally involve the development of a regressor that maps signal features to a representation of perceived quality. The performance of the predictor largely depends on 1) how sensitive the features are to the different types of distortion, and 2) how well the model learns the relation between the features and the quality score. We improve the performance of the quality estimation by enhancing the feature set and using a contemporary machine learning model that fits this objective. We propose an augmented feature set that includes raw features that are presumably redundant. The speech quality assessment system benefits from this redundancy, as it reduces the impact of unwanted noise in the input. Feature set augmentation generally leads to the inclusion of features that have non-smooth distributions. We introduce a new pre-processing method and re-distribute the features to facilitate training. Evaluation of the system on the ITU-T Supplement 23 database shows that the proposed system outperforms the popular standards and contemporary methods in the literature.

The unsupervised learning quality assessment approach presented in this thesis is based on a model that is learnt from clean speech signals. Consequently, it does not need to learn the statistics of any corruption that exists in the degraded speech signals and is trained only with unlabelled clean speech samples. Quality is given a new definition, based on the divergence between 1) the distribution of the spectrograms of test signals, and 2) the pre-existing model that represents the distribution of the spectrograms of good quality speech. The distribution of speech spectrograms is complex, and hence comparing them is not trivial. To tackle this problem, we propose to map the spectrograms of speech signals to a simple latent space.

Generative models that map simple latent distributions into complex distributions are excellent platforms for our work. Generative models trained on the spectrograms of clean speech signals learn to map the latent variable $Z$ from a simple distribution $P_Z$ into a spectrogram $X$ from the distribution of good quality speech.

Consequently, an inference model is developed by inverting the pre-trained generator, which maps the spectrogram of the signal under test, $X_t$, into its relevant latent variable, $Z_t$, in the latent space. We postulate that the divergence between the distribution of the latent variable and the prior distribution $P_Z$ is a good measure of the quality of speech.

Generative adversarial nets (GANs) are an effective training method and work well in this application. The proposed system is a novel application of a GAN. The experimental results with the TIMIT and NOIZEUS databases show that the proposed measure correlates positively with the objective quality scores.
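A minimal PyTorch-style sketch of the inversion-and-scoring idea described above, assuming some pre-trained spectrogram generator G is available; the optimiser settings and the simple Gaussian-prior score are illustrative assumptions rather than the thesis's exact procedure.

```python
# Hedged sketch: invert a pre-trained spectrogram generator G and score the recovered
# latent code against the standard-normal prior P_Z (illustrative only).
import torch
import torch.nn.functional as F

def invert_and_score(G, x_t, latent_dim=128, steps=500, lr=0.05):
    z_t = torch.zeros(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(G(z_t), x_t)        # fit G(z_t) to the spectrogram under test
        loss.backward()
        opt.step()
    # Divergence from the prior, here a simple negative log-density under N(0, I).
    prior_nll = 0.5 * (z_t.detach() ** 2).sum()
    return prior_nll.item()                   # larger values suggest lower speech quality
```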


Author(s):  
Jan Žižka ◽  
František Dařena

The automated categorization of unstructured textual documents according to their semantic content plays an important role, particularly given the ever-growing volume of such data originating from the Internet. With a sufficient number of labeled examples, a suitable supervised machine learning-based classifier can be trained. When no labeling is available, an unsupervised learning method can be applied; however, the missing label information often leads to worse classification results. This chapter demonstrates a method based on semi-supervised learning, in which a small set of manually labeled examples improves the categorization process in comparison with clustering, and the results are comparable with the supervised learning output. For illustration, a real-world dataset coming from the Internet is used as the input of the supervised, unsupervised, and semi-supervised learning. The results are shown for different numbers of starting labeled samples used as “seeds” to automatically label the remaining volume of unlabeled items.
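As a rough sketch of the seeded semi-supervised setup (scikit-learn's self-training is used here as a stand-in for the chapter's method, and the documents are placeholders), unlabeled items are marked with -1 and the labeled "seeds" drive the rest:

```python
# Hedged sketch: semi-supervised text categorization with a few labeled "seeds".
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

docs = ["great product, works well", "terrible, broke after a day",
        "very satisfied with the purchase", "awful experience, do not buy"]  # placeholder texts
seeds = [1, 0, -1, -1]            # 1/0 = manually labeled seeds, -1 = unlabeled

X = TfidfVectorizer().fit_transform(docs)
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
model.fit(X, seeds)
print(model.predict(X))            # labels propagated to the unlabeled documents
```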


2020 ◽  
Vol 34 (07) ◽  
pp. 13041-13049 ◽  
Author(s):  
Luowei Zhou ◽  
Hamid Palangi ◽  
Lei Zhang ◽  
Houdong Hu ◽  
Jason Corso ◽  
...  

This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large set of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on. This is controlled by utilizing specific self-attention masks for the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0. The code and the pre-trained models are available at https://github.com/LuoweiZhou/VLP.
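A small sketch of the two self-attention masks the abstract describes, assuming a sequence of image-region tokens followed by text tokens; the layout, sizes, and masking convention are illustrative assumptions, not the released model's exact tensors.

```python
# Hedged sketch: self-attention masks for bidirectional vs. seq2seq masked prediction.
import torch

def vlp_attention_mask(n_img: int, n_txt: int, seq2seq: bool) -> torch.Tensor:
    """1 = position may be attended to, 0 = blocked. Image tokens come first."""
    n = n_img + n_txt
    mask = torch.ones(n, n)
    if seq2seq:
        # Text tokens see all image tokens but only earlier (and current) text tokens.
        txt = torch.arange(n_txt)
        causal = (txt.unsqueeze(0) <= txt.unsqueeze(1)).float()   # lower-triangular over text
        mask[n_img:, n_img:] = causal
        mask[:n_img, n_img:] = 0   # image tokens do not attend to text in the seq2seq objective
    return mask

bidirectional = vlp_attention_mask(4, 6, seq2seq=False)   # full attention everywhere
seq2seq_mask = vlp_attention_mask(4, 6, seq2seq=True)
```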

