Artificial Data Generation with Language Models for Imbalanced Classification in Maintenance

Author(s):  
Juan Pablo Usuga-Cadavid ◽  
Bernard Grabot ◽  
Samir Lamouri ◽  
Arnaud Fortin
2017 ◽  
Vol 892 ◽  
pp. 012016
Author(s):  
Masitah Abdul Lateh ◽  
Azah Kamilah Muda ◽  
Zeratul Izzah Mohd Yusof ◽  
Noor Azilah Muda ◽  
Mohd Sanusi Azmi

2021 ◽  
Vol 13 (13) ◽  
pp. 2619
Author(s):  
Joao Fonseca ◽  
Georgios Douzas ◽  
Fernando Bacao

In remote sensing, Active Learning (AL) has become an important technique to collect informative ground truth data "on-demand" for supervised classification tasks. Despite its effectiveness, it is still significantly reliant on user interaction, which makes it both expensive and time-consuming to implement. Most of the current literature focuses on the optimization of AL by modifying the selection criteria and the classifiers used. Although improvements in these areas will result in more effective data collection, the use of artificial data sources to reduce human-computer interaction remains unexplored. In this paper, we introduce a new component to the typical AL framework, the data generator: a source of artificial data that reduces the amount of user-labeled data required in AL. The proposed AL framework is implemented using Geometric SMOTE as the data generator. We compare the new AL framework to the original one using similar acquisition functions and classifiers over three AL-specific performance metrics on seven benchmark datasets. We show that this modification of the AL framework significantly reduces the cost and time requirements for a successful AL implementation on all of the datasets used in the experiment.
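A minimal sketch of the idea described above, with imbalanced-learn's SMOTE standing in for the Geometric SMOTE generator; the dataset, classifier, initial labeled set, and least-confidence acquisition function are illustrative assumptions, not the authors' exact setup:

```python
# Sketch: pool-based active learning with an artificial data generator step.
# SMOTE stands in for Geometric SMOTE; acquisition is plain least-confidence
# sampling. All concrete choices here are assumptions for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
labeled = list(np.where(y == 0)[0][:12]) + list(np.where(y == 1)[0][:8])
pool = [i for i in range(len(X)) if i not in labeled]

clf = LogisticRegression(max_iter=1000)
for _ in range(10):                                  # AL iterations
    # Data generator step: augment the small labeled set with synthetic points
    X_aug, y_aug = SMOTE(k_neighbors=3).fit_resample(X[labeled], y[labeled])
    clf.fit(X_aug, y_aug)

    # Acquisition step: query the pool sample the model is least confident about
    proba = clf.predict_proba(X[pool])
    query = pool[int(np.argmin(proba.max(axis=1)))]
    labeled.append(query)                            # a user would label this sample
    pool.remove(query)
```

Any Geometric SMOTE implementation that exposes the same fit_resample interface could be substituted for SMOTE in the generator step.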


Author(s):  
Mehmet Niyazi Çankaya

The asymmetric bimodal exponential power (ABEP) distribution is an extension of the generalized gamma distribution to the real line, obtained by adding two parameters that fit the shape of the peaks in the bimodality. For special values of the peakedness parameters, the distribution reduces to a combination of half-Laplace and half-normal distributions on the real line. The distribution has two parameters fitting the height of the bimodality, which enhances its capacity to model bimodal data. A skewness parameter is added to model asymmetry in the data. The location-scale form of this distribution is proposed. The Fisher information matrix of the parameters of ABEP is obtained explicitly, and the properties of ABEP are examined. Real data examples are given to illustrate the modelling capacity of ABEP. Replicated artificial data, generated from the maximum likelihood estimates of the ABEP parameters and from other distributions that admit an artificial data generation algorithm, are provided to test their similarity with the real data.
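A minimal sketch of the replication check under stated assumptions: ABEP is not available in standard libraries, so SciPy's generalized normal (exponential power) distribution stands in for it, and the "real" data below are simulated; only the workflow (fit by maximum likelihood, draw replicated artificial samples, compare with the observed sample) mirrors the procedure described in the abstract.

```python
# Sketch: fit a distribution by maximum likelihood, draw replicated artificial
# data from the fitted model, and test its similarity with the observed sample.
# scipy.stats.gennorm stands in for ABEP; the "real" data are simulated here.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
real = rng.laplace(loc=0.0, scale=1.0, size=500)     # placeholder "real" data

# Maximum likelihood estimates of the stand-in model's parameters
beta, loc, scale = stats.gennorm.fit(real)

# Replicated artificial data from the fitted model
replicated = stats.gennorm.rvs(beta, loc=loc, scale=scale,
                               size=len(real), random_state=2)

# Similarity check between the real and replicated samples
ks = stats.ks_2samp(real, replicated)
print(f"beta={beta:.2f}, KS statistic={ks.statistic:.3f}, p-value={ks.pvalue:.3f}")
```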


Author(s):  
Radhwane Gherbaoui ◽  
Mohammed Ouali ◽  
Nacéra Benamrane

The ad hoc nature of clustering methods makes simulated data paramount in assessing their performance. Real datasets could be used to evaluate clustering methods, but with the major drawback that many test scenarios would remain unassessed. In this paper, we propose a formal quantification of component overlap. This quantification is derived from a set of theorems that allow us to build an automatic method for artificial data generation. We also derive a method to estimate the parameters of existing models and to evaluate the results of other approaches. Automatic estimation of the overlap rate can also be used as an unsupervised learning approach in data mining to determine the parameters of mixture models from actual observations.
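A minimal sketch of the underlying notion, assuming a two-component Gaussian mixture: component overlap can be quantified, for instance, as the Bayes error of the mixture, estimated here by Monte Carlo. This is an illustrative overlap measure only; the paper's formal quantification and theorems are not reproduced.

```python
# Sketch: generate artificial data from a two-component Gaussian mixture and
# estimate the component overlap as the Bayes error of the mixture, computed
# by Monte Carlo. Illustrative measure, not the paper's formal quantification.
import numpy as np
from scipy.stats import norm

w = np.array([0.5, 0.5])           # mixing weights
mu = np.array([0.0, 2.0])          # component means
sd = np.array([1.0, 1.0])          # component standard deviations

rng = np.random.default_rng(0)
comp = rng.choice(2, size=100_000, p=w)              # true component labels
x = rng.normal(mu[comp], sd[comp])                   # artificial observations

# Weighted component densities at every sample point, shape (n, 2)
dens = w * norm.pdf(x[:, None], loc=mu, scale=sd)
bayes_label = dens.argmax(axis=1)                    # Bayes-optimal assignment

overlap = np.mean(bayes_label != comp)               # Bayes error ~ overlap rate
print(f"Estimated overlap rate: {overlap:.3f}")
```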


2019 ◽  
Vol 0 (9/2019) ◽  
pp. 13-18
Author(s):  
Karol Antczak

The paper discusses the regularization properties of artificial data for deep learning. Artificial datasets make it possible to train neural networks when real data are in short supply. It is demonstrated that the artificial data generation process, described as injecting noise into high-level features, bears several similarities to existing regularization methods for deep neural networks. One can treat this property of artificial data as a kind of “deep” regularization. It is thus possible to regularize the hidden layers of the network by generating the training data in a certain way.
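A minimal sketch of the mechanism the abstract alludes to, assuming a PyTorch model: Gaussian noise is injected into the high-level (hidden) features during training only, which is the kind of noise injection the paper relates to regularization. The architecture and noise level are illustrative assumptions.

```python
# Sketch: injecting Gaussian noise into high-level features during training,
# acting as a regularizer on the hidden layers. Architecture and noise level
# are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn

class NoisyFeatures(nn.Module):
    """Adds zero-mean Gaussian noise to its input, but only in training mode."""
    def __init__(self, sigma=0.1):
        super().__init__()
        self.sigma = sigma

    def forward(self, x):
        if self.training and self.sigma > 0:
            x = x + self.sigma * torch.randn_like(x)
        return x

model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    NoisyFeatures(sigma=0.1),        # "deep" regularization of hidden features
    nn.Linear(64, 2),
)

model.train()                        # noise active during training
out_train = model(torch.randn(8, 20))
model.eval()                         # noise disabled at inference time
out_eval = model(torch.randn(8, 20))
```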


2021 ◽  
Vol 11 (2) ◽  
pp. 869
Author(s):  
Sarang Shaikh ◽  
Sher Muhammad Daudpota ◽  
Ali Shariq Imran ◽  
Zenun Kastrati

Data imbalance is a frequently occurring problem in classification tasks where the number of samples in one category exceeds the number in the others. Quite often, the minority-class data represent the concepts of interest and are challenging to obtain in real-life scenarios and applications. Consider a bank-loan customer dataset: the majority of instances belong to the non-defaulter class and only a small number of customers are labeled as defaulters, yet in such highly imbalanced datasets classification performance on the defaulter label matters more than on the non-defaulter label. A lack of sufficient data samples across the class labels results in data imbalance, causing poor classification performance when training the model. Synthetic data generation and oversampling techniques such as SMOTE and ADASYN can address this issue for statistical data, yet such methods suffer from overfitting and substantial noise. While such techniques have proved useful for synthetic numerical and image data generation using GANs, the effectiveness of approaches proposed for textual data, which can retain grammatical structure, context, and semantic information, has yet to be evaluated. In this paper, we address this issue by assessing text sequence generation algorithms coupled with grammatical validation on domain-specific, highly imbalanced datasets for text classification. We exploit the recently proposed GPT-2 and LSTM-based text generation models to introduce balance into highly imbalanced text datasets. The experiments presented in this paper on three highly imbalanced datasets from different domains show that the performance of the same deep neural network models improves by up to 17% when the datasets are balanced using the generated text.
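A minimal sketch of the balancing step, assuming the Hugging Face transformers text-generation pipeline with the public gpt2 checkpoint; the prompt seeds and sample counts are hypothetical, and the paper's additional steps (fine-tuning the generator on minority-class texts and grammatical validation of the outputs) are omitted here.

```python
# Sketch: generating additional minority-class text with GPT-2 to rebalance a
# dataset. Prompt seeds and sample counts are illustrative assumptions; the
# paper additionally fine-tunes the generator and validates grammar.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

minority_seeds = [                      # hypothetical minority-class prompts
    "The customer defaulted on the loan because",
    "Payment was missed for the third month and",
]

synthetic_minority = []
for seed in minority_seeds:
    outputs = generator(seed, max_length=40, num_return_sequences=3, do_sample=True)
    synthetic_minority.extend(o["generated_text"] for o in outputs)

# These synthetic texts would be appended to the minority class before training.
print(len(synthetic_minority), "synthetic minority-class samples")
```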

