Exploiting Synthetically Generated Data with Semi-Supervised Learning for Small and Imbalanced Datasets

Author(s):  
M. Peréz-Ortiz ◽  
P. Tiňo ◽  
R. Mantiuk ◽  
C. Hervás-Martínez

Data augmentation is rapidly gaining attention in machine learning. Synthetic data can be generated by simple transformations or through the data distribution. In the latter case, the main challenge is to estimate the label associated with new synthetic patterns. This paper studies the effect of generating synthetic data by convex combination of patterns and using these as unsupervised information in a semi-supervised learning framework with support vector machines, thus avoiding the need to label synthetic examples. We perform experiments on a total of 53 binary classification datasets. Our results show that this type of data over-sampling supports the well-known cluster assumption in semi-supervised learning, showing outstanding results for small high-dimensional datasets and imbalanced learning problems.
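
The over-sampling step described in this abstract can be illustrated with a minimal sketch. It assumes each synthetic pattern is a convex combination of two randomly chosen patterns with a uniformly drawn mixing weight; the pairing rule, the helper name `convex_combinations`, and the toy values are assumptions for illustration, not details taken from the paper. The resulting points carry no labels and would be passed to a semi-supervised learner as unlabeled data.

```python
import random

def convex_combinations(patterns, n_new, seed=0):
    """Return n_new synthetic patterns, each a convex combination of two
    randomly chosen input patterns (hypothetical helper, not from the paper)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(patterns, 2)   # two distinct source patterns
        lam = rng.random()               # mixing weight in [0, 1)
        # lam * a + (1 - lam) * b lies on the segment between a and b
        synthetic.append([lam * ai + (1.0 - lam) * bi for ai, bi in zip(a, b)])
    return synthetic

# Toy patterns with made-up values; the synthetic points stay unlabeled.
patterns = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
unlabeled = convex_combinations(patterns, n_new=5)
```

Because every synthetic point lies between two existing patterns, it falls inside the convex hull of the data, which is why no label needs to be guessed before handing it to the semi-supervised stage.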

Author(s):  
Diana Benavides-Prado

In our research, we study the problem of learning a sequence of supervised tasks. This is a long-standing challenge in machine learning. Our work relies on transfer of knowledge between hypotheses learned with Support Vector Machines. Transfer occurs in two directions: forward and backward. We have proposed to selectively transfer forward support vector coefficients from previous hypotheses as upper-bounds on support vector coefficients to be learned on a target task. We also proposed a novel method for refining existing hypotheses by transferring backward knowledge from a target hypothesis learned recently. We have improved this method through a hypothesis refinement approach that refines whilst encouraging retention of knowledge. Our contribution is represented in a long-term learning framework for binary classification tasks received sequentially one at a time.


Entropy ◽  
2021 ◽  
Vol 23 (10) ◽  
pp. 1330
Author(s):  
Maxime Haddouche ◽  
Benjamin Guedj ◽  
Omar Rivasplata ◽  
John Shawe-Taylor

We present new PAC-Bayesian generalisation bounds for learning problems with unbounded loss functions. This extends the relevance and applicability of the PAC-Bayes learning framework, where most of the existing literature focuses on supervised learning problems with a bounded loss function (typically assumed to take values in the interval [0;1]). In order to relax this classical assumption, we propose to allow the range of the loss to depend on each predictor. This relaxation is captured by our new notion of HYPothesis-dependent rangE (HYPE). Based on this, we derive a novel PAC-Bayesian generalisation bound for unbounded loss functions, and we instantiate it on a linear regression problem. To make our theory usable by the largest audience possible, we include discussions on actual computation, practicality and limitations of our assumptions.


Energies ◽  
2021 ◽  
Vol 14 (4) ◽  
pp. 1081
Author(s):  
Spyros Theocharides ◽  
Marios Theristis ◽  
George Makrides ◽  
Marios Kynigos ◽  
Chrysovalantis Spanias ◽  
...  

A main challenge for integrating intermittent photovoltaic (PV) power generation remains the accuracy of day-ahead forecasts and the establishment of robustly performing methods. The purpose of this work is to address these technological challenges by evaluating the day-ahead PV production forecasting performance of different machine learning models under different supervised learning regimes and minimal input features. Specifically, the day-ahead forecasting capability of Bayesian neural network (BNN), support vector regression (SVR), and regression tree (RT) models was investigated by employing the same dataset for training and performance verification, thus enabling a valid comparison. The training regime analysis demonstrated that the performance of the investigated models was strongly dependent on the timeframe of the train set, training data sequence, and application of irradiance condition filters. Furthermore, accurate results were obtained utilizing only the measured power output and other calculated parameters for training. Consequently, useful information is provided for establishing a robust day-ahead forecasting methodology that utilizes calculated input parameters and an optimal supervised learning approach. Finally, the obtained results demonstrated that the optimally constructed BNN outperformed all other machine learning models, achieving forecasting errors lower than 5%.


Sensors ◽  
2021 ◽  
Vol 21 (13) ◽  
pp. 4519
Author(s):  
Livia Petrescu ◽  
Cătălin Petrescu ◽  
Ana Oprea ◽  
Oana Mitruț ◽  
Gabriela Moise ◽  
...  

This paper focuses on the binary classification of the emotion of fear, based on the physiological data and subjective responses stored in the DEAP dataset. We performed a mapping between the discrete and dimensional emotional information considering the participants’ ratings and extracted a substantial set of 40 types of features from the physiological data, which represented the input to various machine learning algorithms (Decision Trees, k-Nearest Neighbors, Support Vector Machine and artificial networks), accompanied by dimensionality reduction, feature selection and the tuning of the most relevant hyperparameters, boosting classification accuracy. Our methodology addressed several issues: resolving the problem of having an imbalanced dataset through data augmentation, reducing overfitting, computing various metrics in order to obtain the most reliable classification scores, and applying the Local Interpretable Model-Agnostic Explanations method to explain predictions in a human-understandable manner. The results show that fear can be predicted very well (accuracies ranging from 91.7% using Gradient Boosting Trees to 93.5% using dimensionality reduction and Support Vector Machine) by extracting the most relevant features from the physiological data and by searching for the best parameters which maximize the machine learning algorithms’ classification scores.


Mathematics ◽  
2020 ◽  
Vol 8 (4) ◽  
pp. 512
Author(s):  
Marcos Eduardo Valle

Dilation and erosion are two elementary operations from mathematical morphology, a non-linear lattice computing methodology widely used for image processing and analysis. The dilation-erosion perceptron (DEP) is a morphological neural network obtained by a convex combination of a dilation and an erosion followed by the application of a hard-limiter function for binary classification tasks. A DEP classifier can be trained using a convex-concave procedure along with the minimization of the hinge loss function. As a lattice computing model, the DEP classifier assumes the feature and class spaces are partially ordered sets. In many practical situations, however, there is no natural ordering for the feature patterns. Using concepts from multi-valued mathematical morphology, this paper introduces the reduced dilation-erosion (r-DEP) classifier. An r-DEP classifier is obtained by endowing the feature space with an appropriate reduced ordering. Such reduced ordering can be determined using two approaches: one based on an ensemble of support vector classifiers (SVCs) with different kernels and the other based on a bagging of similar SVCs trained using different samples of the training set. Using several binary classification datasets from the OpenML repository, the ensemble and bagging r-DEP classifiers yielded higher mean balanced accuracy scores than the linear, polynomial, and radial basis function (RBF) SVCs as well as their ensemble and a bagging of RBF SVCs.
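
The DEP decision rule described in this abstract can be sketched as follows. The sketch assumes the common max-plus form of the two operators, where the dilation is the maximum and the erosion the minimum of the shifted feature values; the parameter names `a`, `b`, `beta`, the threshold at zero, and the toy numbers are assumptions for illustration and are not taken from the paper. Training by the convex-concave procedure is not shown.

```python
def dilation(x, a):
    """Max-plus dilation: largest shifted feature value."""
    return max(xj + aj for xj, aj in zip(x, a))

def erosion(x, b):
    """Min-plus erosion: smallest shifted feature value."""
    return min(xj + bj for xj, bj in zip(x, b))

def dep_predict(x, a, b, beta):
    """Convex combination of a dilation and an erosion, then a hard limiter.
    beta is assumed to lie in [0, 1]; ties at zero go to the positive class."""
    tau = beta * dilation(x, a) + (1.0 - beta) * erosion(x, b)
    return 1 if tau >= 0 else -1

# Made-up pattern and parameters: dilation = 1.0, erosion = -2.0.
x = [1.0, -2.0]
label_balanced = dep_predict(x, a=[0.0, 0.0], b=[0.0, 0.0], beta=0.5)  # tau = -0.5
label_dilated = dep_predict(x, a=[0.0, 0.0], b=[0.0, 0.0], beta=0.8)   # tau is about 0.4
```

Both operators compare feature values with max and min, which is why the model presupposes an ordering on the feature space, the limitation that the reduced ordering of the r-DEP classifier is meant to address.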


Processes ◽  
2020 ◽  
Vol 8 (1) ◽  
pp. 105
Author(s):  
Xuqing Jia ◽  
Wende Tian ◽  
Chuankun Li ◽  
Xia Yang ◽  
Zhongjun Luo ◽  
...  

A novel active semi-supervised learning framework using unlabeled data is proposed for fault identification in chemical processes where labeled data are expensive. First, a principal component analysis (PCA) feature selection strategy is used to calculate the weight of the variables. Second, the identification model is trained on the resulting key process variables. Third, the pseudo label confidence of the identification model is dynamically optimized using the mean of the historical, current, and future pseudo label confidences. To raise the upper limit of an identification model that is self-learning with high entropy process data, active learning is used to identify process data and to diagnose fault causes by ontology. Finally, a PCA-dynamic active safe semi-supervised support vector machine (PCA-DAS4VM) for fault identification in such chemical processes is built. The application to the Tennessee Eastman (TE) process shows that this hybrid technology is able to: (i) eliminate chemical process noise and redundant process variables simultaneously, (ii) combine historical pseudo label confidence with future pseudo label confidence to improve the identification accuracy of abnormal working conditions, (iii) efficiently select and diagnose high entropy unlabeled process data, and (iv) fully utilize unlabeled data to enhance the identification performance.


Sensors ◽  
2021 ◽  
Vol 21 (18) ◽  
pp. 6323
Author(s):  
Carlo Dindorf ◽  
Jürgen Konradi ◽  
Claudia Wolf ◽  
Bertram Taetz ◽  
Gabriele Bleser ◽  
...  

Clinical classification models are mostly pathology-dependent and, thus, are only able to detect pathologies they have been trained for. Research is needed regarding pathology-independent classifiers and their interpretation. Hence, our aim is to develop a pathology-independent classifier that provides prediction probabilities and explanations of the classification decisions. Spinal posture data of healthy subjects and various pathologies (back pain, spinal fusion, osteoarthritis), as well as synthetic data, were used for modeling. A one-class support vector machine was used as a pathology-independent classifier. The outputs were transformed into a probability distribution according to Platt’s method. Interpretation was performed using the explainable artificial intelligence tool Local Interpretable Model-Agnostic Explanations. The results were compared with those obtained by commonly used binary classification approaches. The best classification results were obtained for subjects with a spinal fusion. Subjects with back pain were especially challenging to distinguish from the healthy reference group. The proposed method proved useful for the interpretation of the predictions. No clear inferiority of the proposed approach compared to commonly used binary classifiers was demonstrated. The application of dynamic spinal data seems important for future works. The proposed approach could be useful to provide an objective orientation and to individually adapt and monitor therapy measures pre- and post-operatively.
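
The step that turns classifier outputs into probabilities can be illustrated with a minimal sketch of Platt's method, which fits a sigmoid to decision values. This simplified version uses plain gradient descent on the log loss and raw 0/1 targets, whereas Platt's original procedure uses smoothed targets and a more careful optimizer; the decision values, targets, learning rate, and function names are made up for illustration and do not come from the paper.

```python
import math

def platt_probability(score, a, b):
    """Sigmoid mapping of a decision value: P(y=1 | score) = 1 / (1 + exp(a*score + b))."""
    return 1.0 / (1.0 + math.exp(a * score + b))

def fit_platt(scores, targets, lr=0.1, steps=500):
    """Fit a and b by gradient descent on the log loss (simplified, assumed setup)."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for f, t in zip(scores, targets):
            p = platt_probability(f, a, b)
            grad_a += (t - p) * f   # derivative of the log loss with respect to a
            grad_b += (t - p)       # derivative of the log loss with respect to b
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Made-up decision values (positive = inside the reference class) and targets.
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
targets = [0, 0, 0, 1, 1, 1]
a, b = fit_platt(scores, targets)
p_inside = platt_probability(2.0, a, b)
p_outside = platt_probability(-2.0, a, b)
```

The fitted slope `a` comes out negative, so larger decision values map to higher probabilities of belonging to the reference class.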


Mathematics ◽  
2021 ◽  
Vol 9 (18) ◽  
pp. 2336
Author(s):  
Asif Khan ◽  
Hyunho Hwang ◽  
Heung Soo Kim

As failures in rotating machines can have serious implications, the timely detection and diagnosis of faults in these machines is imperative for their smooth and safe operation. Although deep learning offers the advantage of autonomously learning the fault characteristics from the data, the data scarcity from different health states often limits its applicability to only binary classification (healthy or faulty). This work proposes synthetic data augmentation through virtual sensors for the deep learning-based fault diagnosis of a rotating machine with 42 different classes. The original and augmented data were processed in a transfer learning framework and through a deep learning model from scratch. The two-dimensional visualization of the feature space from the original and augmented data showed that the latter’s data clusters are more distinct than the former’s. The proposed data augmentation showed a 6–15% improvement in training accuracy, a 44–49% improvement in validation accuracy, an 86–98% decline in training loss, and a 91–98% decline in validation loss. The improved generalization through data augmentation was verified by a 39–58% improvement in the test accuracy.


2015 ◽  
Vol 2015 ◽  
pp. 1-12
Author(s):  
Zhi-Xia Yang ◽  
Yuan-Hai Shao ◽  
Yao-Lin Jiang

A novel learning framework of nonparallel hyperplanes support vector machines (NPSVMs) is proposed for binary classification and multiclass classification. This framework not only includes twin SVM (TWSVM) and many of its variants but also extends them to multiclass classification problems when different parameters or loss functions are chosen. Concretely, we discuss the linear and nonlinear cases of the framework, using the hinge loss function as an example. Moreover, we also give the primal problems of several extensions of TWSVM’s variants. It is worth mentioning that, in the decision function, the Euclidean distance is replaced by the absolute value $|w^{\top}x+b|$, which keeps the consistency between the decision function and the optimization problem and reduces the computational cost, particularly when the kernel function is introduced. The numerical experiments on several artificial and benchmark datasets indicate that our framework is not only fast but also shows good generalization.
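
The decision rule mentioned in this abstract can be sketched for the linear case. The sketch assumes one hyperplane per class and assigns a point to the class whose hyperplane gives the smallest absolute value of the linear function, skipping the division by the norm of the weight vector that a Euclidean distance would require; the function name and the toy hyperplanes are hypothetical and the training problems are not shown.

```python
def nearest_hyperplane_class(x, hyperplanes):
    """Return the index k that minimises |w_k . x + b_k| (no normalisation by ||w_k||)."""
    def score(w, b):
        return abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return min(range(len(hyperplanes)), key=lambda k: score(*hyperplanes[k]))

# Two made-up nonparallel hyperplanes: x1 = 0 for class 0, x2 = 2 for class 1.
hyperplanes = [([1.0, 0.0], 0.0), ([0.0, 1.0], -2.0)]
class_a = nearest_hyperplane_class([0.1, 0.5], hyperplanes)  # scores 0.1 vs 1.5
class_b = nearest_hyperplane_class([3.0, 1.9], hyperplanes)  # scores 3.0 vs about 0.1
```

The same rule extends to more than two classes by adding one `(w, b)` pair per class to the list.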

