Transfer Learning for Spam Text Classification

Author(s):  
Pratiksha Bongale

Today’s world is largely data-driven, and Machine Learning and Data Mining strategies are used to handle the enormous amounts of data involved. Traditional ML approaches presume that a model is tested on a dataset drawn from the same domain as its training data. Nevertheless, some real-world situations require machines to deliver good results with very little domain-specific training data. This creates room for models that can predict accurately after being trained on easily available data, and Transfer Learning is the key to it: the practice of applying the knowledge gained while learning one task to another task that resembles it in some way. This article focuses on building a model capable of separating text data into two classes, one covering spam and the other non-spam, using BERT’s pre-trained model (bert-base-uncased). This pre-trained model was trained on Wikipedia and BookCorpus data, and the goal of this paper is to highlight its ability to transfer the knowledge learned from that training to distinguishing spam texts from the rest.
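As a rough illustration of the approach this abstract describes, the sketch below fine-tunes bert-base-uncased for binary spam classification with the Hugging Face transformers library; the toy texts, labels, and learning rate are assumptions, not details from the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the pre-trained model named in the abstract and attach a fresh
# binary classification head (spam vs. not spam).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Hypothetical toy data; a real run would use a labeled spam corpus.
texts = ["WINNER!! Claim your free prize now", "Are we still on for lunch?"]
labels = torch.tensor([1, 0])  # 1 = spam, 0 = not spam

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # loss computed internally
outputs.loss.backward()
optimizer.step()
```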

2020 ◽  
Author(s):  
Pathikkumar Patel ◽  
Bhargav Lad ◽  
Jinan Fiaidhi

During the last few years, RNN models have been used extensively, and they have proven to be better suited to sequence and text data. RNNs have achieved state-of-the-art performance in several applications such as text classification, sequence-to-sequence modelling, and time series forecasting. In this article we review different Machine Learning and Deep Learning based approaches for text data and look at the results obtained with these methods. This work also explores the use of transfer learning in NLP and how it affects model performance on a specific application, sentiment analysis.
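For context on the kind of RNN baseline reviewed here, a minimal PyTorch LSTM text classifier might look as follows; the dimensions and architecture details are illustrative assumptions, not the models evaluated in the article.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Embedding -> LSTM -> linear head, a common RNN baseline for text."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return self.head(h_n[-1])  # classify from the final hidden state

model = LSTMClassifier(vocab_size=10_000)
logits = model(torch.randint(1, 10_000, (4, 32)))  # batch of 4, length 32
```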


Electronics ◽  
2021 ◽  
Vol 10 (15) ◽  
pp. 1807
Author(s):  
Sascha Grollmisch ◽  
Estefanía Cano

Including unlabeled data in the training process of neural networks using Semi-Supervised Learning (SSL) has shown impressive results in the image domain, where state-of-the-art results were obtained with only a fraction of the labeled data. The commonality between recent SSL methods is that they rely strongly on the augmentation of unannotated data, which remains largely unexplored for audio data. In this work, SSL using the state-of-the-art FixMatch approach is evaluated on three audio classification tasks, covering music, industrial sounds, and acoustic scenes. The performance of FixMatch is compared to Convolutional Neural Networks (CNNs) trained from scratch, Transfer Learning, and SSL using the Mean Teacher approach. Additionally, a simple yet effective approach for selecting suitable augmentation methods for FixMatch is introduced. FixMatch with the proposed modifications always outperformed Mean Teacher and the CNNs trained from scratch. For the industrial sounds and music datasets, the CNN baseline performance using the full dataset was reached with less than 5% of the initial training data, demonstrating the potential of recent SSL methods for audio data. Transfer Learning outperformed FixMatch only on the most challenging dataset, acoustic scene classification, showing that there is still room for improvement.
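A minimal sketch of FixMatch's unlabeled-data objective, the core of the method evaluated above: confident predictions on weakly augmented inputs become pseudo-labels for strongly augmented versions of the same inputs. The 0.95 threshold is the common FixMatch default, and the augmentation functions themselves (the part this work adapts for audio) are left abstract.

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, weak_batch, strong_batch, threshold=0.95):
    """Pseudo-label confident weakly-augmented examples, then train the
    model to predict those labels on strong augmentations."""
    with torch.no_grad():
        probs = F.softmax(model(weak_batch), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = (conf >= threshold).float()  # keep only confident predictions
    loss = F.cross_entropy(model(strong_batch), pseudo, reduction="none")
    return (loss * mask).mean()
```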


2019 ◽  
Author(s):  
Timofey Arkhangelskiy

Lexicography and corpus studies of grammar have a long history of fruitful interaction. For the most part, however, this has been a one-way relationship. Lexicographers have used corpora extensively to identify previously undetected word senses or to find natural usage examples; using lexicographic materials when conducting data-driven investigations of grammar, on the other hand, is hardly commonplace. In this paper, I present a Beserman Udmurt corpus made out of "artificial" dictionary examples. I argue that, although such a corpus cannot be used for certain kinds of corpus-based research, it is nevertheless a very useful tool for writing a reference grammar of a language. This is particularly important for under-resourced endangered varieties such as Beserman, given the scarcity of available corpus data. The paper describes the process of developing the Beserman usage example corpus, explores how it differs from traditional text corpora, and discusses how those differences can benefit grammar research.


Author(s):  
Wei-Yen Day ◽  
Chun-Yi Chi ◽  
Ruey-Cheng Chen ◽  
Pu-Jen Cheng

Data acquisition is a major concern in text classification. The extensive human effort that conventional methods require to build a quality training collection might not always be available to researchers. In this paper, the authors investigate the possibility of automatically collecting training data by sampling the Web with a set of given class names. The basic idea is to populate appropriate keywords and submit them as queries to search engines to acquire training data. The first of the two methods presented in this paper is based on sampling the concepts common to all classes; the other is based on sampling the concepts discriminative for each class. A series of experiments was carried out independently on two different datasets, and the results show that the proposed methods significantly improve classifier performance even without manually labeled training data. The authors’ strategy for retrieving Web samples substantially helps conventional document classification in terms of accuracy and efficiency.
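A schematic sketch of the query-population idea: pair each class name with candidate concept terms and send the resulting queries to a search engine. Here `search_fn` is a hypothetical wrapper around a search API, and the query template is an assumption, not the authors' exact formulation.

```python
from itertools import islice

def build_queries(class_name, concept_terms, per_class=10):
    """Pair the class name with candidate concept terms to form queries.
    concept_terms could come from common or discriminative concept sampling."""
    return [f'"{class_name}" {term}' for term in islice(concept_terms, per_class)]

def collect_training_data(search_fn, classes_to_terms):
    """search_fn: hypothetical search-engine wrapper returning text snippets."""
    corpus = {}
    for class_name, terms in classes_to_terms.items():
        corpus[class_name] = [
            doc for q in build_queries(class_name, terms) for doc in search_fn(q)
        ]
    return corpus
```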


Corpora ◽  
2008 ◽  
Vol 3 (1) ◽  
pp. 59-81 ◽  
Author(s):  
Stefan Th. Gries ◽  
Martin Hilpert

In this paper, we introduce a data-driven bottom-up clustering method for the identification of stages in diachronic corpus data that differ from each other quantitatively. Much like regular approaches to hierarchical clustering, it is based on identifying and merging the most cohesive groups of data points, but, unlike regular approaches to clustering, it allows for the merging of temporally adjacent data, thus, in effect, preserving the chronological order. We exemplify the method with two case studies, one on verbal complementation of shall, the other on the development of the perfect in English.
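A toy sketch of the adjacency-constrained clustering idea: unlike standard agglomerative clustering, only temporally neighboring groups may merge, so chronological order is preserved. The mean-difference distance below is a placeholder assumption; the paper's actual cohesion measure may differ.

```python
def neighbor_clustering(values):
    """Bottom-up clustering that only merges temporally adjacent groups,
    preserving the chronological order of the time slices."""
    clusters = [[v] for v in values]  # one cluster per time slice
    merges = []
    while len(clusters) > 1:
        mean = lambda c: sum(c) / len(c)
        # Distance between adjacent clusters: difference of their means.
        dists = [abs(mean(clusters[i]) - mean(clusters[i + 1]))
                 for i in range(len(clusters) - 1)]
        i = dists.index(min(dists))  # most cohesive adjacent pair
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
        merges.append((i, min(dists)))
    return merges  # merge history, analogous to a dendrogram

# e.g. yearly frequencies of a construction; the jump after slice 3
# keeps the early and late periods in separate clusters until the end:
history = neighbor_clustering([3.1, 3.0, 2.8, 5.9, 6.1, 6.0])
```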


2020 ◽  
Vol 34 (08) ◽  
pp. 13332-13337
Author(s):  
Neil Mallinar ◽  
Abhishek Shah ◽  
Tin Kam Ho ◽  
Rajendra Ugrani ◽  
Ayush Gupta

Real-world text classification tasks often require many labeled training examples that are expensive to obtain. Recent advancements in machine teaching, specifically the data programming paradigm, facilitate the rapid creation of training data sets via a general framework for building weak models, also known as labeling functions, and denoising them through ensemble learning techniques. We present a fast, simple data programming method for augmenting text data sets by generating neighborhood-based weak models with minimal supervision. Furthermore, our method employs an iterative procedure to identify sparsely distributed examples in large volumes of unlabeled data. These iterative data programming techniques improve newer weak models as more labeled data is confirmed with a human in the loop. We show empirical results on sentence classification tasks, including those from a task of improving intent recognition in conversational agents.
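A minimal sketch of neighborhood-based weak models and their ensemble denoising: each labeling function labels an example like its nearest labeled seed when similarity is high enough, and otherwise abstains; a simple vote then combines the functions. The cosine threshold and vote rule are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def make_neighborhood_lf(seed_vecs, seed_labels, radius=0.8):
    """Weak labeling function: label an example like its nearest labeled
    seed if it falls inside a cosine-similarity neighborhood, else abstain."""
    def lf(vec):
        sims = seed_vecs @ vec / (
            np.linalg.norm(seed_vecs, axis=1) * np.linalg.norm(vec) + 1e-9
        )
        best = int(np.argmax(sims))
        return seed_labels[best] if sims[best] >= radius else -1  # -1 = abstain
    return lf

def majority_vote(lfs, vec):
    """Denoise the weak models with a simple vote over non-abstentions."""
    votes = [v for v in (lf(vec) for lf in lfs) if v != -1]
    return max(set(votes), key=votes.count) if votes else -1
```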


2022 ◽  
Vol 13 (1) ◽  
pp. 1-14
Author(s):  
Shuteng Niu ◽  
Yushan Jiang ◽  
Bowen Chen ◽  
Jian Wang ◽  
Yongxin Liu ◽  
...  

In the past decades, information from all kinds of data has been increasing rapidly. With state-of-the-art performance, machine learning algorithms have been beneficial for information management. However, insufficient supervised training data is still an obstacle in many real-world applications. Transfer learning (TL) was therefore proposed to address this issue. This article studies an important but under-investigated TL problem termed cross-modality transfer learning (CMTL). The topic is closely related to distant domain transfer learning (DDTL) and negative transfer. In general, conventional TL disciplines assume that the source domain and the target domain are in the same modality. DDTL aims to make efficient transfers even when the domains or the tasks are entirely different. As an extension of DDTL, CMTL aims to make efficient transfers between two different data modalities, such as from image to text. As the main focus of this study, we aim to improve the performance of image classification by transferring knowledge from text data. A few CMTL algorithms have previously been proposed to deal with image classification problems; however, most existing algorithms are very task-specific and unstable in convergence. This study makes four main contributions. First, we propose a novel heterogeneous CMTL algorithm that requires only a tiny set of unlabeled target data and labeled source data with associated text tags. Second, we introduce a latent semantic information extraction method to connect the information learned from the image data and the text data. Third, the proposed method can effectively handle information transfer across different modalities (text-image). Fourth, we evaluated our algorithm on a public dataset, Office-31; it achieved up to 5% higher classification accuracy than “non-transfer” algorithms and up to 9% higher than existing CMTL algorithms.
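A generic sketch of one way to connect the two modalities, assuming the common shared-latent-space design: separate projections map image and text features into one space, where paired embeddings are pulled together. The dimensions and cosine-distance loss are assumptions for illustration, not the authors' algorithm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedLatentSpace(nn.Module):
    """Project image and text features into one latent space so knowledge
    learned from labeled text can transfer to unlabeled images."""
    def __init__(self, img_dim=2048, txt_dim=768, latent_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, latent_dim)
        self.txt_proj = nn.Linear(txt_dim, latent_dim)

    def forward(self, img_feats, txt_feats):
        z_img = F.normalize(self.img_proj(img_feats), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return z_img, z_txt

def alignment_loss(z_img, z_txt):
    # Pull each image embedding toward the embedding of its text tag.
    return (1 - (z_img * z_txt).sum(dim=-1)).mean()  # cosine distance
```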


2020 ◽  
Vol 2020 ◽  
pp. 1-13
Author(s):  
Zeming Fan ◽  
Mudasir Jamil ◽  
Muhammad Tariq Sadiq ◽  
Xiwei Huang ◽  
Xiaojun Yu

Due to the rapid spread of COVID-19 and the deaths it has caused worldwide, it is imperative to develop a reliable tool for the early detection of this disease. Chest X-ray is currently accepted as one of the reliable means for such detection. However, most of the available methods require large training data, and detection accuracy still needs improvement given the limited boundary segments of the acquired images available for symptom identification. In this study, a robust and efficient method based on transfer learning techniques is proposed to identify normal and COVID-19 patients using small training data. Transfer learning builds accurate models in a time-saving way. First, data augmentation was performed to help the network capture image details. Next, five state-of-the-art transfer learning models, AlexNet, MobileNetv2, ShuffleNet, SqueezeNet, and Xception, with three optimizers, Adam, SGDM, and RMSProp, were implemented at various learning rates, 1e-4, 2e-4, 3e-4, and 4e-4, to reduce the probability of overfitting. All experiments were performed on publicly available datasets, with several analytical measurements obtained after execution with a 10-fold cross-validation method. The results suggest that MobileNetv2 with the Adam optimizer at a learning rate of 3e-4 provides an average accuracy, recall, precision, and F-score of 97%, 96.5%, 97.5%, and 97%, respectively, which are higher than those of all other combinations. The proposed method is competitive with the available literature, demonstrating that it could be used for the early detection of COVID-19 patients.
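The best-performing configuration reported above (MobileNetv2, Adam, learning rate 3e-4) can be sketched with torchvision as follows; the random tensors stand in for preprocessed chest X-ray batches, and the augmentation pipeline and 10-fold cross-validation are omitted.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from ImageNet weights and replace the classifier head for the
# binary task (normal vs. COVID-19).
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)
model.classifier[1] = nn.Linear(model.last_channel, 2)

# Adam at 3e-4: the best combination reported in the abstract.
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 224, 224)   # stand-in for an X-ray batch
labels = torch.randint(0, 2, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```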


Author(s):  
Aleksandra Edwards ◽  
Jose Camacho-Collados ◽  
Hélène De Ribaupierre ◽  
Alun Preece

2021 ◽  
Author(s):  
Youngmahn Han ◽  
Aeri Lee

The COVID-19 pandemic is ongoing because of the high transmission rate and the emergence of SARS-CoV-2 variants. The P272L mutation in the SARS-CoV-2 S-protein is known to be highly relevant to the viral escape associated with the second pandemic wave in Europe. Epitope-specific T-cell receptor (TCR) recognition is a key factor in determining the T-cell immunogenicity of a SARS-CoV-2 epitope. Although several data-driven methods for predicting epitope-specific TCR recognition have been proposed, the task remains challenging owing to the enormous diversity of TCRs and the lack of available training data. Self-supervised transfer learning has recently been demonstrated to be powerful for extracting useful information from unlabeled protein sequences and increasing the predictive performance of fine-tuned models in downstream tasks. Here, we present a predictive model based on Bidirectional Encoder Representations from Transformers (BERT), employing self-supervised transfer learning, to predict SARS-CoV-2 T-cell epitope-specific TCR recognition. The fine-tuned model showed notably high predictive performance in independent evaluations using SARS-CoV-2 epitope-specific TCR CDR3β sequence datasets. In particular, by interpreting the output attention weights of our model, we found that the proline at position 4, corresponding to the P272L mutation in the SARS-CoV-2 S-protein 269-277 epitope (YLQPRTFLL), may contribute substantially to TCR recognition of the epitope. We anticipate that our findings will provide new directions for constructing a reliable data-driven model to predict immunogenic T-cell epitopes using limited training data and help accelerate the development of an effective vaccine in response to SARS-CoV-2 variants.
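A hypothetical sketch of the attention-inspection step: run the epitope through a BERT encoder with output_attentions enabled and aggregate the attention each residue receives. bert-base-uncased and the letter-by-letter tokenization are only stand-ins; the paper's model is pre-trained on protein sequences and fine-tuned on TCR CDR3β data.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Stand-in encoder; the actual model uses a protein-sequence vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer(" ".join("YLQPRTFLL"), return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions: one tensor per layer, shape (batch, heads, seq, seq).
last = out.attentions[-1].mean(dim=1)[0]  # average heads, drop batch dim
per_token = last.sum(dim=0)               # attention received per token
# Note: [CLS]/[SEP] shift token positions relative to residue numbering;
# the proline of interest is residue 4 (P in YLQPRTFLL).
print(per_token)
```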

