Transfer Learning for Spam Text Classification

Author(s):  
Pratiksha Bongale

Today’s world is largely data-driven, and Machine Learning and Data Mining strategies are used to handle the enormous amounts of data involved. Traditional ML approaches presume that a model is tested on a dataset drawn from the same domain as its training data. Nevertheless, some real-world situations require machines to deliver good results with very little domain-specific training data. This creates room for models that can predict accurately after being trained on easily available data, and Transfer Learning is the key to it: the practice of applying the knowledge gained while learning one task to another task that resembles it in some way. This article focuses on building a model capable of separating text data into two classes, one covering spam and the other non-spam, using BERT’s pre-trained model (bert-base-uncased). This pre-trained model was trained on Wikipedia and BookCorpus data, and the goal of this paper is to highlight its ability to transfer the knowledge learned from that training to distinguishing spam texts from the rest.
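As a rough illustration of the approach this abstract describes, the sketch below fine-tunes bert-base-uncased for binary spam classification with the Hugging Face transformers library; the toy texts, labels, and learning rate are assumptions, not details from the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the pre-trained model named in the abstract and attach a fresh
# binary classification head (spam vs. not spam).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Hypothetical toy data; a real run would use a labeled spam corpus.
texts = ["WINNER!! Claim your free prize now", "Are we still on for lunch?"]
labels = torch.tensor([1, 0])  # 1 = spam, 0 = not spam

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # loss computed internally
outputs.loss.backward()
optimizer.step()
```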

2020 ◽  
Author(s):  
Pathikkumar Patel ◽  
Bhargav Lad ◽  
Jinan Fiaidhi

During the last few years, RNN models have been used extensively, and they have proven to be better suited to sequence and text data. RNNs have achieved state-of-the-art performance in several applications such as text classification, sequence-to-sequence modelling, and time series forecasting. In this article we review different Machine Learning and Deep Learning based approaches for text data and look at the results obtained with these methods. This work also explores the use of transfer learning in NLP and how it affects model performance on a specific application, sentiment analysis.
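For context on the kind of RNN baseline reviewed here, a minimal PyTorch LSTM text classifier might look as follows; the dimensions and architecture details are illustrative assumptions, not the models evaluated in the article.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Embedding -> LSTM -> linear head, a common RNN baseline for text."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return self.head(h_n[-1])  # classify from the final hidden state

model = LSTMClassifier(vocab_size=10_000)
logits = model(torch.randint(1, 10_000, (4, 32)))  # batch of 4, length 32
```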


Electronics ◽  
2021 ◽  
Vol 10 (15) ◽  
pp. 1807
Author(s):  
Sascha Grollmisch ◽  
Estefanía Cano

Including unlabeled data in the training process of neural networks using Semi-Supervised Learning (SSL) has shown impressive results in the image domain, where state-of-the-art results were obtained with only a fraction of the labeled data. The commonality between recent SSL methods is that they rely strongly on the augmentation of unannotated data, which remains largely unexplored for audio data. In this work, SSL using the state-of-the-art FixMatch approach is evaluated on three audio classification tasks, covering music, industrial sounds, and acoustic scenes. The performance of FixMatch is compared to Convolutional Neural Networks (CNNs) trained from scratch, Transfer Learning, and SSL using the Mean Teacher approach. Additionally, a simple yet effective approach for selecting suitable augmentation methods for FixMatch is introduced. FixMatch with the proposed modifications always outperformed Mean Teacher and the CNNs trained from scratch. For the industrial sounds and music datasets, the CNN baseline performance using the full dataset was reached with less than 5% of the initial training data, demonstrating the potential of recent SSL methods for audio data. Transfer Learning outperformed FixMatch only on the most challenging dataset, acoustic scene classification, showing that there is still room for improvement.
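A minimal sketch of FixMatch's unlabeled-data objective, the core of the method evaluated above: confident predictions on weakly augmented inputs become pseudo-labels for strongly augmented versions of the same inputs. The 0.95 threshold is the common FixMatch default, and the augmentation functions themselves (the part this work adapts for audio) are left abstract.

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, weak_batch, strong_batch, threshold=0.95):
    """Pseudo-label confident weakly-augmented examples, then train the
    model to predict those labels on strong augmentations."""
    with torch.no_grad():
        probs = F.softmax(model(weak_batch), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = (conf >= threshold).float()  # keep only confident predictions
    loss = F.cross_entropy(model(strong_batch), pseudo, reduction="none")
    return (loss * mask).mean()
```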


2019 ◽  
Author(s):  
Timofey Arkhangelskiy

Lexicography and corpus studies of grammar have a long history of fruitful interaction. For the most part, however, this has been a one-way relationship. Lexicographers have used corpora extensively to identify previously undetected word senses or to find natural usage examples; using lexicographic materials when conducting data-driven investigations of grammar, on the other hand, is hardly commonplace. In this paper, I present a Beserman Udmurt corpus made out of "artificial" dictionary examples. I argue that, although such a corpus cannot be used for certain kinds of corpus-based research, it is nevertheless a very useful tool for writing a reference grammar of a language. This is particularly important for under-resourced endangered varieties such as Beserman, given the scarcity of available corpus data. The paper describes the process of developing the Beserman usage example corpus, explores how it differs from traditional text corpora, and discusses how those differences can benefit grammar research.


Author(s):  
Wei-Yen Day ◽  
Chun-Yi Chi ◽  
Ruey-Cheng Chen ◽  
Pu-Jen Cheng

Data acquisition is a major concern in text classification. The extensive human effort that conventional methods require to build a quality training collection might not always be available to researchers. In this paper, the authors investigate the possibility of automatically collecting training data by sampling the Web with a set of given class names. The basic idea is to populate appropriate keywords and submit them as queries to search engines to acquire training data. The first of the two methods presented in this paper is based on sampling the concepts common to all classes; the other is based on sampling the concepts discriminative for each class. A series of experiments was carried out independently on two different datasets, and the results show that the proposed methods significantly improve classifier performance even without manually labeled training data. The authors’ strategy for retrieving Web samples substantially helps conventional document classification in terms of accuracy and efficiency.
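A schematic sketch of the query-population idea: pair each class name with candidate concept terms and send the resulting queries to a search engine. Here `search_fn` is a hypothetical wrapper around a search API, and the query template is an assumption, not the authors' exact formulation.

```python
from itertools import islice

def build_queries(class_name, concept_terms, per_class=10):
    """Pair the class name with candidate concept terms to form queries.
    concept_terms could come from common or discriminative concept sampling."""
    return [f'"{class_name}" {term}' for term in islice(concept_terms, per_class)]

def collect_training_data(search_fn, classes_to_terms):
    """search_fn: hypothetical search-engine wrapper returning text snippets."""
    corpus = {}
    for class_name, terms in classes_to_terms.items():
        corpus[class_name] = [
            doc for q in build_queries(class_name, terms) for doc in search_fn(q)
        ]
    return corpus
```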


Corpora ◽  
2008 ◽  
Vol 3 (1) ◽  
pp. 59-81 ◽  
Author(s):  
Stefan Th. Gries ◽  
Martin Hilpert

In this paper, we introduce a data-driven bottom-up clustering method for the identification of stages in diachronic corpus data that differ from each other quantitatively. Much like regular approaches to hierarchical clustering, it is based on identifying and merging the most cohesive groups of data points, but, unlike regular approaches to clustering, it allows for the merging of temporally adjacent data, thus, in effect, preserving the chronological order. We exemplify the method with two case studies, one on verbal complementation of shall, the other on the development of the perfect in English.
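A toy sketch of the adjacency-constrained clustering idea: unlike standard agglomerative clustering, only temporally neighboring groups may merge, so chronological order is preserved. The mean-difference distance below is a placeholder assumption; the paper's actual cohesion measure may differ.

```python
def neighbor_clustering(values):
    """Bottom-up clustering that only merges temporally adjacent groups,
    preserving the chronological order of the time slices."""
    clusters = [[v] for v in values]  # one cluster per time slice
    merges = []
    while len(clusters) > 1:
        mean = lambda c: sum(c) / len(c)
        # Distance between adjacent clusters: difference of their means.
        dists = [abs(mean(clusters[i]) - mean(clusters[i + 1]))
                 for i in range(len(clusters) - 1)]
        i = dists.index(min(dists))  # most cohesive adjacent pair
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
        merges.append((i, min(dists)))
    return merges  # merge history, analogous to a dendrogram

# e.g. yearly frequencies of a construction; the jump after slice 3
# keeps the early and late periods in separate clusters until the end:
history = neighbor_clustering([3.1, 3.0, 2.8, 5.9, 6.1, 6.0])
```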


2020 ◽  
Vol 34 (08) ◽  
pp. 13332-13337
Author(s):  
Neil Mallinar ◽  
Abhishek Shah ◽  
Tin Kam Ho ◽  
Rajendra Ugrani ◽  
Ayush Gupta

Real-world text classification tasks often require many labeled training examples that are expensive to obtain. Recent advancements in machine teaching, specifically the data programming paradigm, facilitate the rapid creation of training data sets via a general framework for building weak models, also known as labeling functions, and denoising them through ensemble learning techniques. We present a fast, simple data programming method for augmenting text data sets by generating neighborhood-based weak models with minimal supervision. Furthermore, our method employs an iterative procedure to identify sparsely distributed examples in large volumes of unlabeled data. These iterative data programming techniques improve newer weak models as more labeled data is confirmed with a human in the loop. We show empirical results on sentence classification tasks, including those from a task of improving intent recognition in conversational agents.
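A minimal sketch of neighborhood-based weak models and their ensemble denoising: each labeling function labels an example like its nearest labeled seed when similarity is high enough, and otherwise abstains; a simple vote then combines the functions. The cosine threshold and vote rule are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def make_neighborhood_lf(seed_vecs, seed_labels, radius=0.8):
    """Weak labeling function: label an example like its nearest labeled
    seed if it falls inside a cosine-similarity neighborhood, else abstain."""
    def lf(vec):
        sims = seed_vecs @ vec / (
            np.linalg.norm(seed_vecs, axis=1) * np.linalg.norm(vec) + 1e-9
        )
        best = int(np.argmax(sims))
        return seed_labels[best] if sims[best] >= radius else -1  # -1 = abstain
    return lf

def majority_vote(lfs, vec):
    """Denoise the weak models with a simple vote over non-abstentions."""
    votes = [v for v in (lf(vec) for lf in lfs) if v != -1]
    return max(set(votes), key=votes.count) if votes else -1
```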


2022 ◽  
Vol 13 (1) ◽  
pp. 1-14
Author(s):  
Shuteng Niu ◽  
Yushan Jiang ◽  
Bowen Chen ◽  
Jian Wang ◽  
Yongxin Liu ◽  
...  

In the past decades, information from all kinds of data has been increasing rapidly. With state-of-the-art performance, machine learning algorithms have been beneficial for information management. However, insufficient supervised training data is still an obstacle in many real-world applications. Transfer learning (TL) was therefore proposed to address this issue. This article studies an important but under-investigated TL problem termed cross-modality transfer learning (CMTL). The topic is closely related to distant domain transfer learning (DDTL) and negative transfer. In general, conventional TL disciplines assume that the source domain and the target domain are in the same modality. DDTL aims to make efficient transfers even when the domains or the tasks are entirely different. As an extension of DDTL, CMTL aims to make efficient transfers between two different data modalities, such as from image to text. As the main focus of this study, we aim to improve the performance of image classification by transferring knowledge from text data. A few CMTL algorithms have previously been proposed to deal with image classification problems; however, most existing algorithms are very task-specific and unstable in convergence. This study makes four main contributions. First, we propose a novel heterogeneous CMTL algorithm that requires only a tiny set of unlabeled target data and labeled source data with associated text tags. Second, we introduce a latent semantic information extraction method to connect the information learned from the image data and the text data. Third, the proposed method can effectively handle information transfer across different modalities (text-image). Fourth, we evaluated our algorithm on a public dataset, Office-31; it achieved up to 5% higher classification accuracy than “non-transfer” algorithms and up to 9% higher than existing CMTL algorithms.
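A generic sketch of one way to connect the two modalities, assuming the common shared-latent-space design: separate projections map image and text features into one space, where paired embeddings are pulled together. The dimensions and cosine-distance loss are assumptions for illustration, not the authors' algorithm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedLatentSpace(nn.Module):
    """Project image and text features into one latent space so knowledge
    learned from labeled text can transfer to unlabeled images."""
    def __init__(self, img_dim=2048, txt_dim=768, latent_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, latent_dim)
        self.txt_proj = nn.Linear(txt_dim, latent_dim)

    def forward(self, img_feats, txt_feats):
        z_img = F.normalize(self.img_proj(img_feats), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return z_img, z_txt

def alignment_loss(z_img, z_txt):
    # Pull each image embedding toward the embedding of its text tag.
    return (1 - (z_img * z_txt).sum(dim=-1)).mean()  # cosine distance
```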


2020 ◽  
Vol 2020 ◽  
pp. 1-13
Author(s):  
Zeming Fan ◽  
Mudasir Jamil ◽  
Muhammad Tariq Sadiq ◽  
Xiwei Huang ◽  
Xiaojun Yu

Due to the rapid spread of COVID-19 and the deaths it has caused worldwide, it is imperative to develop a reliable tool for the early detection of this disease. Chest X-ray is currently accepted as one of the reliable means for such detection. However, most of the available methods require large training data, and detection accuracy still needs improvement given the limited boundary segments of the acquired images available for symptom identification. In this study, a robust and efficient method based on transfer learning techniques is proposed to identify normal and COVID-19 patients using small training data. Transfer learning builds accurate models in a time-saving way. First, data augmentation was performed to help the network capture image details. Next, five state-of-the-art transfer learning models, AlexNet, MobileNetv2, ShuffleNet, SqueezeNet, and Xception, with three optimizers, Adam, SGDM, and RMSProp, were implemented at various learning rates, 1e-4, 2e-4, 3e-4, and 4e-4, to reduce the probability of overfitting. All experiments were performed on publicly available datasets, with several analytical measurements obtained after execution with a 10-fold cross-validation method. The results suggest that MobileNetv2 with the Adam optimizer at a learning rate of 3e-4 provides an average accuracy, recall, precision, and F-score of 97%, 96.5%, 97.5%, and 97%, respectively, which are higher than those of all other combinations. The proposed method is competitive with the available literature, demonstrating that it could be used for the early detection of COVID-19 patients.
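The best-performing configuration reported above (MobileNetv2, Adam, learning rate 3e-4) can be sketched with torchvision as follows; the random tensors stand in for preprocessed chest X-ray batches, and the augmentation pipeline and 10-fold cross-validation are omitted.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from ImageNet weights and replace the classifier head for the
# binary task (normal vs. COVID-19).
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)
model.classifier[1] = nn.Linear(model.last_channel, 2)

# Adam at 3e-4: the best combination reported in the abstract.
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 224, 224)   # stand-in for an X-ray batch
labels = torch.randint(0, 2, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```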


Author(s):  
Aleksandra Edwards ◽  
Jose Camacho-Collados ◽  
Hélène De Ribaupierre ◽  
Alun Preece

2021 ◽  
Author(s):  
Youngmahn Han ◽  
Aeri Lee

The COVID-19 pandemic is ongoing because of the high transmission rate and the emergence of SARS-CoV-2 variants. The P272L mutation in the SARS-CoV-2 S-protein is known to be highly relevant to the viral escape associated with the second pandemic wave in Europe. Epitope-specific T-cell receptor (TCR) recognition is a key factor in determining the T-cell immunogenicity of a SARS-CoV-2 epitope. Although several data-driven methods for predicting epitope-specific TCR recognition have been proposed, the task remains challenging owing to the enormous diversity of TCRs and the lack of available training data. Self-supervised transfer learning has recently been demonstrated to be powerful for extracting useful information from unlabeled protein sequences and increasing the predictive performance of fine-tuned models in downstream tasks. Here, we present a predictive model based on Bidirectional Encoder Representations from Transformers (BERT), employing self-supervised transfer learning, to predict SARS-CoV-2 T-cell epitope-specific TCR recognition. The fine-tuned model showed notably high predictive performance in independent evaluations using SARS-CoV-2 epitope-specific TCR CDR3β sequence datasets. In particular, by interpreting the output attention weights of our model, we found that the proline at position 4, corresponding to the P272L mutation in the SARS-CoV-2 S-protein 269-277 epitope (YLQPRTFLL), may contribute substantially to TCR recognition of the epitope. We anticipate that our findings will provide new directions for constructing a reliable data-driven model to predict immunogenic T-cell epitopes using limited training data and help accelerate the development of an effective vaccine in response to SARS-CoV-2 variants.
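A hypothetical sketch of the attention-inspection step: run the epitope through a BERT encoder with output_attentions enabled and aggregate the attention each residue receives. bert-base-uncased and the letter-by-letter tokenization are only stand-ins; the paper's model is pre-trained on protein sequences and fine-tuned on TCR CDR3β data.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Stand-in encoder; the actual model uses a protein-sequence vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer(" ".join("YLQPRTFLL"), return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions: one tensor per layer, shape (batch, heads, seq, seq).
last = out.attentions[-1].mean(dim=1)[0]  # average heads, drop batch dim
per_token = last.sum(dim=0)               # attention received per token
# Note: [CLS]/[SEP] shift token positions relative to residue numbering;
# the proline of interest is residue 4 (P in YLQPRTFLL).
print(per_token)
```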

