Artificial Data Generation with Language Models for Imbalanced Classification in Maintenance

Author(s):  
Juan Pablo Usuga-Cadavid ◽  
Bernard Grabot ◽  
Samir Lamouri ◽  
Arnaud Fortin
2017 ◽  
Vol 892 ◽  
pp. 012016
Author(s):  
Masitah Abdul Lateh ◽  
Azah Kamilah Muda ◽  
Zeratul Izzah Mohd Yusof ◽  
Noor Azilah Muda ◽  
Mohd Sanusi Azmi

2021 ◽  
Vol 13 (13) ◽  
pp. 2619
Author(s):  
Joao Fonseca ◽  
Georgios Douzas ◽  
Fernando Bacao

In remote sensing, Active Learning (AL) has become an important technique to collect informative ground truth data "on-demand" for supervised classification tasks. Despite its effectiveness, it is still significantly reliant on user interaction, which makes it both expensive and time-consuming to implement. Most of the current literature focuses on the optimization of AL by modifying the selection criteria and the classifiers used. Although improvements in these areas will result in more effective data collection, the use of artificial data sources to reduce human-computer interaction remains unexplored. In this paper, we introduce a new component to the typical AL framework, the data generator: a source of artificial data that reduces the amount of user-labeled data required in AL. The proposed AL framework is implemented using Geometric SMOTE as the data generator. We compare the new AL framework to the original one using similar acquisition functions and classifiers over three AL-specific performance metrics on seven benchmark datasets. We show that this modification of the AL framework significantly reduces the cost and time requirements for a successful AL implementation on all of the datasets used in the experiment.
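A minimal sketch of the idea described above, with imbalanced-learn's SMOTE standing in for the Geometric SMOTE generator; the dataset, classifier, initial labeled set, and least-confidence acquisition function are illustrative assumptions, not the authors' exact setup:

```python
# Sketch: pool-based active learning with an artificial data generator step.
# SMOTE stands in for Geometric SMOTE; acquisition is plain least-confidence
# sampling. All concrete choices here are assumptions for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
labeled = list(np.where(y == 0)[0][:12]) + list(np.where(y == 1)[0][:8])
pool = [i for i in range(len(X)) if i not in labeled]

clf = LogisticRegression(max_iter=1000)
for _ in range(10):                                  # AL iterations
    # Data generator step: augment the small labeled set with synthetic points
    X_aug, y_aug = SMOTE(k_neighbors=3).fit_resample(X[labeled], y[labeled])
    clf.fit(X_aug, y_aug)

    # Acquisition step: query the pool sample the model is least confident about
    proba = clf.predict_proba(X[pool])
    query = pool[int(np.argmin(proba.max(axis=1)))]
    labeled.append(query)                            # a user would label this sample
    pool.remove(query)
```

Any Geometric SMOTE implementation that exposes the same fit_resample interface could be substituted for SMOTE in the generator step.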


Author(s):  
Mehmet Niyazi Çankaya

The asymmetric bimodal exponential power (ABEP) distribution is an extension of the generalized gamma distribution to the real line, obtained by adding two parameters that fit the shape of the peaks in the bimodality. For special values of the peakedness parameters, the distribution reduces to a combination of half-Laplace and half-normal distributions on the real line. The distribution has two parameters fitting the height of the bimodality, which enhances its capacity to model bimodal data. A skewness parameter is added to model asymmetry in the data. The location-scale form of this distribution is proposed. The Fisher information matrix of the parameters of ABEP is obtained explicitly, and the properties of ABEP are examined. Real data examples are given to illustrate the modelling capacity of ABEP. Replicated artificial data, generated from the maximum likelihood estimates of the ABEP parameters and from other distributions that admit an artificial data generation algorithm, are provided to test their similarity with the real data.
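A minimal sketch of the replication check under stated assumptions: ABEP is not available in standard libraries, so SciPy's generalized normal (exponential power) distribution stands in for it, and the "real" data below are simulated; only the workflow (fit by maximum likelihood, draw replicated artificial samples, compare with the observed sample) mirrors the procedure described in the abstract.

```python
# Sketch: fit a distribution by maximum likelihood, draw replicated artificial
# data from the fitted model, and test its similarity with the observed sample.
# scipy.stats.gennorm stands in for ABEP; the "real" data are simulated here.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
real = rng.laplace(loc=0.0, scale=1.0, size=500)     # placeholder "real" data

# Maximum likelihood estimates of the stand-in model's parameters
beta, loc, scale = stats.gennorm.fit(real)

# Replicated artificial data from the fitted model
replicated = stats.gennorm.rvs(beta, loc=loc, scale=scale,
                               size=len(real), random_state=2)

# Similarity check between the real and replicated samples
ks = stats.ks_2samp(real, replicated)
print(f"beta={beta:.2f}, KS statistic={ks.statistic:.3f}, p-value={ks.pvalue:.3f}")
```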


Author(s):  
Radhwane Gherbaoui ◽  
Mohammed Ouali ◽  
Nacéra Benamrane

The ad hoc nature of clustering methods makes simulated data paramount in assessing their performance. Real datasets could be used to evaluate clustering methods, but with the major drawback that many test scenarios would remain unassessed. In this paper, we propose a formal quantification of component overlap. This quantification is derived from a set of theorems that allow us to build an automatic method for artificial data generation. We also derive a method to estimate the parameters of existing models and to evaluate the results of other approaches. Automatic estimation of the overlap rate can also be used as an unsupervised learning approach in data mining to determine the parameters of mixture models from actual observations.
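A minimal sketch of the underlying notion, assuming a two-component Gaussian mixture: component overlap can be quantified, for instance, as the Bayes error of the mixture, estimated here by Monte Carlo. This is an illustrative overlap measure only; the paper's formal quantification and theorems are not reproduced.

```python
# Sketch: generate artificial data from a two-component Gaussian mixture and
# estimate the component overlap as the Bayes error of the mixture, computed
# by Monte Carlo. Illustrative measure, not the paper's formal quantification.
import numpy as np
from scipy.stats import norm

w = np.array([0.5, 0.5])           # mixing weights
mu = np.array([0.0, 2.0])          # component means
sd = np.array([1.0, 1.0])          # component standard deviations

rng = np.random.default_rng(0)
comp = rng.choice(2, size=100_000, p=w)              # true component labels
x = rng.normal(mu[comp], sd[comp])                   # artificial observations

# Weighted component densities at every sample point, shape (n, 2)
dens = w * norm.pdf(x[:, None], loc=mu, scale=sd)
bayes_label = dens.argmax(axis=1)                    # Bayes-optimal assignment

overlap = np.mean(bayes_label != comp)               # Bayes error ~ overlap rate
print(f"Estimated overlap rate: {overlap:.3f}")
```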


2019 ◽  
Vol 0 (9/2019) ◽  
pp. 13-18
Author(s):  
Karol Antczak

The paper discusses the regularization properties of artificial data for deep learning. Artificial datasets make it possible to train neural networks when real data are in short supply. It is demonstrated that the artificial data generation process, described as injecting noise into high-level features, bears several similarities to existing regularization methods for deep neural networks. One can treat this property of artificial data as a kind of “deep” regularization. It is thus possible to regularize the hidden layers of the network by generating the training data in a certain way.
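A minimal sketch of the mechanism the abstract alludes to, assuming a PyTorch model: Gaussian noise is injected into the high-level (hidden) features during training only, which is the kind of noise injection the paper relates to regularization. The architecture and noise level are illustrative assumptions.

```python
# Sketch: injecting Gaussian noise into high-level features during training,
# acting as a regularizer on the hidden layers. Architecture and noise level
# are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn

class NoisyFeatures(nn.Module):
    """Adds zero-mean Gaussian noise to its input, but only in training mode."""
    def __init__(self, sigma=0.1):
        super().__init__()
        self.sigma = sigma

    def forward(self, x):
        if self.training and self.sigma > 0:
            x = x + self.sigma * torch.randn_like(x)
        return x

model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    NoisyFeatures(sigma=0.1),        # "deep" regularization of hidden features
    nn.Linear(64, 2),
)

model.train()                        # noise active during training
out_train = model(torch.randn(8, 20))
model.eval()                         # noise disabled at inference time
out_eval = model(torch.randn(8, 20))
```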


2021 ◽  
Vol 11 (2) ◽  
pp. 869
Author(s):  
Sarang Shaikh ◽  
Sher Muhammad Daudpota ◽  
Ali Shariq Imran ◽  
Zenun Kastrati

Data imbalance is a frequently occurring problem in classification tasks where the number of samples in one category exceeds the number in the others. Quite often, the minority-class data represent the concepts of interest and are challenging to obtain in real-life scenarios and applications. Consider a bank-loan customer dataset: the majority of instances belong to the non-defaulter class and only a small number of customers are labeled as defaulters, yet in such highly imbalanced datasets classification performance on the defaulter label matters more than on the non-defaulter label. A lack of sufficient data samples across the class labels results in data imbalance, causing poor classification performance when training the model. Synthetic data generation and oversampling techniques such as SMOTE and ADASYN can address this issue for statistical data, yet such methods suffer from overfitting and substantial noise. While such techniques have proved useful for synthetic numerical and image data generation using GANs, the effectiveness of approaches proposed for textual data, which can retain grammatical structure, context, and semantic information, has yet to be evaluated. In this paper, we address this issue by assessing text sequence generation algorithms coupled with grammatical validation on domain-specific, highly imbalanced datasets for text classification. We exploit the recently proposed GPT-2 and LSTM-based text generation models to introduce balance into highly imbalanced text datasets. The experiments presented in this paper on three highly imbalanced datasets from different domains show that the performance of the same deep neural network models improves by up to 17% when the datasets are balanced using the generated text.
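A minimal sketch of the balancing step, assuming the Hugging Face transformers text-generation pipeline with the public gpt2 checkpoint; the prompt seeds and sample counts are hypothetical, and the paper's additional steps (fine-tuning the generator on minority-class texts and grammatical validation of the outputs) are omitted here.

```python
# Sketch: generating additional minority-class text with GPT-2 to rebalance a
# dataset. Prompt seeds and sample counts are illustrative assumptions; the
# paper additionally fine-tunes the generator and validates grammar.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

minority_seeds = [                      # hypothetical minority-class prompts
    "The customer defaulted on the loan because",
    "Payment was missed for the third month and",
]

synthetic_minority = []
for seed in minority_seeds:
    outputs = generator(seed, max_length=40, num_return_sequences=3, do_sample=True)
    synthetic_minority.extend(o["generated_text"] for o in outputs)

# These synthetic texts would be appended to the minority class before training.
print(len(synthetic_minority), "synthetic minority-class samples")
```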

