Text Data Augmentation for Deep Learning

2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Connor Shorten ◽  
Taghi M. Khoshgoftaar ◽  
Borko Furht

Abstract: Natural Language Processing (NLP) is one of the most captivating applications of Deep Learning. In this survey, we consider how the Data Augmentation training strategy can aid in its development. We begin with the major motifs of Data Augmentation summarized into strengthening local decision boundaries, brute force training, causality and counterfactual examples, and the distinction between meaning and form. We follow these motifs with a concrete list of augmentation frameworks that have been developed for text data. Deep Learning generally struggles with the measurement of generalization and characterization of overfitting. We highlight studies that cover how augmentations can construct test sets for generalization. NLP is at an early stage in applying Data Augmentation compared to Computer Vision. We highlight the key differences and promising ideas that have yet to be tested in NLP. For the sake of practical implementation, we describe tools that facilitate Data Augmentation such as the use of consistency regularization, controllers, and offline and online augmentation pipelines, to preview a few. Finally, we discuss interesting topics around Data Augmentation in NLP such as task-specific augmentations, the use of prior knowledge in self-supervised learning versus Data Augmentation, intersections with transfer and multi-task learning, and ideas for AI-GAs (AI-Generating Algorithms). We hope this paper inspires further research interest in Text Data Augmentation.
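Token-level operations such as random deletion and random swap are among the simplest augmentation frameworks a survey like this covers. A minimal sketch in the spirit of Easy Data Augmentation (the operation choices and rates are illustrative, not taken from the paper):

```python
import random

def augment(sentence, p_delete=0.1, n_swaps=1, seed=0):
    """Produce one augmented variant of a sentence via random
    word deletion and random word swaps (EDA-style operations)."""
    rng = random.Random(seed)
    words = sentence.split()
    # Random deletion: drop each word with probability p_delete,
    # but never delete everything.
    kept = [w for w in words if rng.random() > p_delete] or words[:1]
    # Random swap: exchange two random positions n_swaps times.
    for _ in range(n_swaps):
        if len(kept) > 1:
            i, j = rng.sample(range(len(kept)), 2)
            kept[i], kept[j] = kept[j], kept[i]
    return " ".join(kept)

print(augment("data augmentation expands a small training corpus"))
```

Each call with a different seed yields a new label-preserving variant, which is the offline-augmentation use case the abstract mentions.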


2020 ◽  
Author(s):  
Olaide N. Oyelade ◽  
Absalom E. Ezugwu

Abstract: The novel coronavirus, also known as COVID-19, is a pandemic that has weighed heavily on the socio-economic affairs of the world. Although research into the production of a vaccine is advancing, there is still a need for a computational solution to aid quick detection of the disease. Different computational solutions comprising natural language processing, knowledge engineering and deep learning have been adopted for this task; however, deep learning solutions have shown interesting performance compared to other methods. This paper therefore aims to advance the application of deep learning techniques to the problem of characterization and detection of the novel coronavirus. The approach adopted in this study proposes a convolutional neural network (CNN) model that is further enhanced using data augmentation, the motive being to investigate whether data augmentation can further improve the performance of deep learning models in the detection of coronavirus. The proposed model is applied to the COVID-19 X-ray dataset used in this study, the National Institutes of Health (NIH) Chest X-Ray dataset obtained from Kaggle, for the purpose of promoting early detection and screening of coronavirus disease. Results showed that our approach achieved 100% accuracy, recall/precision of 0.85, an F-measure of 0.9, and specificity of 1.0. The proposed CNN model and data augmentation solution may be adopted for pre-screening suspected cases of COVID-19 to support the well-known RT-PCR testing.
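The kind of geometric augmentation typically applied to X-ray arrays can be sketched with plain numpy; the flip and small-shift operations below are illustrative assumptions, not the paper's exact augmentation settings:

```python
import numpy as np

def augment_xray(img, shift=2, seed=0):
    """Return simple augmented variants of a 2-D image array:
    a horizontal flip and a small random translation."""
    rng = np.random.default_rng(seed)
    flipped = img[:, ::-1]                     # mirror left-right
    dx = int(rng.integers(-shift, shift + 1))  # random pixel shift
    shifted = np.roll(img, dx, axis=1)
    return [flipped, shifted]

img = np.arange(16, dtype=float).reshape(4, 4)
variants = augment_xray(img)
print([v.shape for v in variants])  # both variants keep the input shape
```

Feeding such variants alongside the originals multiplies the effective training set size without new image acquisition.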


2021 ◽  
Vol 11 (7) ◽  
pp. 3119
Author(s):  
Cristina L. Saratxaga ◽  
Jorge Bote ◽  
Juan F. Ortega-Morán ◽  
Artzai Picón ◽  
Elena Terradillos ◽  
...  

(1) Background: Clinicians demand new tools for early diagnosis and improved detection of colon lesions that are vital for patient prognosis. Optical coherence tomography (OCT) allows microscopical inspection of tissue and might serve as an optical biopsy method that could lead to in-situ diagnosis and treatment decisions; (2) Methods: A database of murine (rat) healthy, hyperplastic and neoplastic colonic samples with more than 94,000 images was acquired. A methodology that includes a data augmentation processing strategy and a deep learning model for automatic classification (benign vs. malignant) of OCT images is presented and validated over this dataset. Comparative evaluation is performed both over individual B-scan images and C-scan volumes; (3) Results: A model was trained and evaluated with the proposed methodology using six different data splits to present statistically significant results. Considering this, 0.9695 (±0.0141) sensitivity and 0.8094 (±0.1524) specificity were obtained when diagnosis was performed over B-scan images. On the other hand, 0.9821 (±0.0197) sensitivity and 0.7865 (±0.205) specificity were achieved when diagnosis was made considering all the images in the whole C-scan volume; (4) Conclusions: The proposed methodology based on deep learning showed great potential for the automatic characterization of colon polyps and future development of the optical biopsy paradigm.
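The B-scan versus C-scan comparison amounts to aggregating per-image predictions into a per-volume decision before computing sensitivity and specificity. A small sketch, assuming a simple majority vote over B-scans (the paper's aggregation rule may differ):

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP/(TP+FN), specificity = TN/(TN+FP);
    labels: 1 = malignant, 0 = benign."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def volume_decision(bscan_preds):
    """Aggregate per-B-scan predictions into one C-scan label
    by majority vote."""
    return int(sum(bscan_preds) * 2 >= len(bscan_preds))

# Two toy volumes: a malignant one with 3/4 positive B-scans,
# a benign one with 1/4; dict keys are the true volume labels.
volumes = {1: [1, 1, 0, 1], 0: [0, 0, 1, 0]}
y_true = list(volumes)
y_pred = [volume_decision(p) for p in volumes.values()]
print(sensitivity_specificity(y_true, y_pred))  # (1.0, 1.0)
```

Voting over a whole volume can tolerate some misclassified B-scans, which is consistent with the higher volume-level sensitivity reported above.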


2021 ◽  
Author(s):  
Khloud Al Jallad

Abstract: New attacks are used by attackers every day, but many of them are not detected by intrusion detection systems, as most IDS ignore raw packet information and only consider basic statistical features extracted from PCAP files. Using networking programs to extract fixed statistical features from packets is useful, but may not be enough to meet today's challenges. We think it is time to utilize big data and deep learning for automatic, dynamic feature extraction from packets, and to take inspiration from the pre-trained deep learning models of computer vision and natural language processing, so that security deep learning solutions will have their own models pre-trained on big datasets for use in future research. In this paper, we propose a new approach for embedding packets based on character-level embeddings, inspired by the success of FastText on text data. We call this approach FastPacket. Results are measured on subsets of the CIC-IDS-2017 dataset, but we expect promising results from big-data pre-trained models. We suggest building a pre-trained FastPacket model on the large MAWI dataset and making it available to the community, similar to FastText, in order to outperform currently used NIDS and start a new era of packet-level NIDS that can better detect complex attacks.
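The character-level embedding idea behind FastPacket can be sketched with FastText-style n-gram hashing over a packet's raw bytes; the bucket count, vector dimension, and n-gram length below are illustrative assumptions, not the paper's values:

```python
import zlib

import numpy as np

def packet_embedding(packet: bytes, table: np.ndarray, n: int = 3) -> np.ndarray:
    """Embed a raw packet as the mean of hashed byte n-gram vectors,
    in the spirit of FastText's subword (character n-gram) embeddings."""
    buckets, dim = table.shape
    grams = [packet[i:i + n] for i in range(max(1, len(packet) - n + 1))]
    idx = [zlib.crc32(g) % buckets for g in grams]  # hashing trick: n-gram -> bucket
    return table[idx].mean(axis=0)                  # average the bucket vectors

rng = np.random.default_rng(0)
table = rng.normal(size=(1024, 16))  # 1024 hash buckets, 16-dim vectors
vec = packet_embedding(b"\x45\x00\x00\x54\x08\x00", table)
print(vec.shape)  # (16,)
```

In a trained model the table rows would be learned parameters; hashing keeps the vocabulary of byte n-grams bounded, just as FastText bounds its subword vocabulary.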


A lot of research has gone into natural language processing, and state-of-the-art deep learning algorithms now help convert English text into a data structure without loss of meaning. The advent of neural networks for learning word representations as vectors has also helped revolutionize automatic feature extraction from text corpora. Combining word embeddings with a deep learning algorithm such as a convolutional neural network improves accuracy for text classification. In this era of the Internet of Things, with voluminous amounts of data overwhelming users, determining the veracity of that data is a very challenging task. There are many truth discovery algorithms in the literature that help resolve the conflicts arising from multiple sources of data; these algorithms estimate the trustworthiness of the data and the reliability of the sources. In this paper, a convolution-based truth discovery method with multitasking is proposed to estimate the genuineness of the data in a given text corpus. The proposed algorithm has been tested on the Quora questions dataset, and experimental results showed improved accuracy and speed over other existing approaches.
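The word-embedding-plus-convolution pipeline described here can be sketched in a few lines of numpy; the embedding dimension, filter width, and max-over-time pooling are illustrative choices, not the proposed algorithm itself:

```python
import numpy as np

def conv1d_features(embeddings, filters):
    """Slide each filter over a (seq_len, dim) embedding matrix and
    max-pool over positions, as in a basic text-classification CNN."""
    seq_len, dim = embeddings.shape
    n_filters, width, _ = filters.shape
    feats = np.empty(n_filters)
    for f in range(n_filters):
        acts = [np.sum(embeddings[i:i + width] * filters[f])
                for i in range(seq_len - width + 1)]
        feats[f] = max(acts)  # max-over-time pooling
    return feats

rng = np.random.default_rng(0)
emb = rng.normal(size=(7, 8))         # 7 tokens, 8-dim word embeddings
filters = rng.normal(size=(4, 3, 8))  # 4 learned filters of width 3
print(conv1d_features(emb, filters).shape)  # (4,)
```

The pooled feature vector is what a classifier head would consume; each filter acts as a learned detector for a short word pattern.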


Author(s):  
Zhiwen Xiong

Abstract: Machine learning is a branch of artificial intelligence. Deep learning is a complex machine learning approach that has unique advantages in image recognition, speech recognition, natural language processing, and industrial process control, and it is widely used in the field of wireless communication. Prediction of geological disasters such as landslides is currently a difficult problem. Because landslides are difficult to detect in their early stage, this paper proposes a GPS-based continuous detection system over a wireless communication network and applies it to landslide deformation monitoring to achieve early treatment and prevention. The article introduces a GPS multi-antenna detection system based on deep learning and wireless communication, together with the time series analysis method and its application. Test results show that the GPS multi-antenna detection system of the wireless communication network has great advantages in response time, with high accuracy and small error: horizontal accuracy is within 0–2 mm and vertical accuracy is about 1 mm. The analysis method is simple and efficient, and obtains good results for short-term deformation prediction.
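A short-term deformation forecast from a displacement time series can be sketched with a simple incremental rule; the readings and window size below are illustrative, not the paper's method:

```python
import numpy as np

def forecast_next(series, window=3):
    """Predict the next deformation reading as the last value plus the
    mean increment over a sliding window (a simple time-series rule)."""
    s = np.asarray(series, dtype=float)
    increments = np.diff(s)[-window:]  # most recent step-to-step changes
    return s[-1] + increments.mean()

# Displacement readings in mm from a monitored point (illustrative values).
readings = [0.0, 0.4, 0.9, 1.3, 1.8]
print(round(forecast_next(readings), 2))  # 2.27
```

A monitoring system would compare such a forecast against the next GPS reading and flag accelerating deformation.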


Author(s):  
Yuchen Hou ◽  
Lawrence B. Holder

Abstract: Deep learning has been successful in various domains, including image recognition, speech recognition and natural language processing. However, research on its application in graph mining is still at an early stage. Here we present Model R, a neural network model created to provide a deep learning approach to the link weight prediction problem. This model uses a node embedding technique that extracts node embeddings (knowledge of nodes) from the known links' weights (relations between nodes) and uses this knowledge to predict the unknown links' weights. We demonstrate the power of Model R through experiments and compare it with the stochastic block model and its derivatives. Model R shows that deep learning can be successfully applied to link weight prediction, and it outperforms the stochastic block model and its derivatives by up to 73% in terms of prediction accuracy. We analyze the node embeddings to confirm that closeness in embedding space correlates with stronger relationships as measured by link weight. We anticipate this new approach will provide effective solutions to more graph mining tasks.
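Model R's core idea, learning node vectors from known link weights and then predicting unknown weights from those vectors, can be sketched with a tiny dot-product model trained by SGD (Model R itself is a deeper network; this is a simplified stand-in):

```python
import numpy as np

def fit_embeddings(links, n_nodes, dim=4, lr=0.05, epochs=2000, seed=0):
    """Learn node embeddings so that dot(e[u], e[v]) approximates the
    known weight of each link (u, v, w), via squared-error SGD."""
    rng = np.random.default_rng(seed)
    emb = rng.normal(scale=0.1, size=(n_nodes, dim))
    for _ in range(epochs):
        for u, v, w in links:
            err = emb[u] @ emb[v] - w       # prediction error on this link
            gu, gv = err * emb[v], err * emb[u]
            emb[u] -= lr * gu
            emb[v] -= lr * gv
    return emb

links = [(0, 1, 2.0), (1, 2, 1.0), (0, 2, 3.0)]  # known weighted links
emb = fit_embeddings(links, n_nodes=3)
print(round(float(emb[0] @ emb[1]), 2))  # close to the target weight 2.0
```

An unknown link's weight is then predicted from the same dot product of its endpoint embeddings, which is also why closeness in embedding space tracks link strength.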


Author(s):  
Vincent Karas ◽  
Björn W. Schuller

Sentiment analysis is an important area of natural language processing that can help inform business decisions by extracting sentiment information from documents. The purpose of this chapter is to introduce the reader to selected concepts and methods of deep learning and show how deep models can be used to increase performance in sentiment analysis. It discusses the latest advances in the field and covers topics including traditional sentiment analysis approaches, the fundamentals of sentence modelling, popular neural network architectures, autoencoders, attention modelling, transformers, data augmentation methods, the benefits of transfer learning, the potential of adversarial networks, and perspectives on explainable AI. The authors' intent is that through this chapter, the reader can gain an understanding of recent developments in this area as well as current trends and potentials for future research.


2020 ◽  
Author(s):  
vinayakumar R

Social media is a platform on which tons of text are generated every day. The data is so large that it cannot be easily understood, which has paved the way for a new field in information technology: natural language processing. In this paper, the text data used for classification consists of tweets, which determine the state of a person according to sentiment: positive, negative or neutral. Emotions are the expression of a person's feelings and have a high influence on decision-making tasks. We propose text representations, Term Frequency-Inverse Document Frequency (TF-IDF) and Keras embeddings, along with machine learning and deep learning algorithms for the classification of sentiment. Among these, the Logistic Regression machine-learning method performs well when a limited number of features is taken; as the number of features increases, the Support Vector Machine (SVM), another machine learning algorithm, performs well, setting a benchmark accuracy for this dataset of 75.8%. The dataset has been made publicly available for research purposes.
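The Logistic Regression baseline named here can be sketched on a toy bag-of-words matrix; the vocabulary and examples are invented for illustration, not taken from the tweet dataset:

```python
import numpy as np

def train_logreg(X, y, lr=0.5, epochs=200):
    """Binary logistic regression by batch gradient descent:
    w <- w - lr * X^T (sigmoid(Xw) - y) / n."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# Toy bag-of-words: columns = ["good", "bad", "love", "hate"]
X = np.array([[1, 0, 1, 0],   # "good ... love"  -> positive
              [0, 1, 0, 1],   # "bad ... hate"   -> negative
              [1, 0, 0, 0],   # "good"           -> positive
              [0, 1, 0, 0]])  # "bad"            -> negative
y = np.array([1.0, 0.0, 1.0, 0.0])
w = train_logreg(X, y)
pred = (1.0 / (1.0 + np.exp(-X @ w)) > 0.5).astype(int)
print(pred.tolist())  # [1, 0, 1, 0] on the training examples
```

Swapping the binary counts for TF-IDF weights gives the feature representation the paragraph describes; the training loop is unchanged.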


2022 ◽  
Author(s):  
Ms. Aayushi Bansal ◽  
Dr. Rewa Sharma ◽  
Dr. Mamta Kathuria

Recent advancements in deep learning architectures have increased their utility in real-life applications. Deep learning models require a large amount of training data, yet in many application domains, such as marketing, computer vision, and medical science, only a limited set of data is available, because collecting new data is either not feasible or requires substantial resources; without enough data, these models suffer from overfitting. One of the data-space solutions to the problem of limited data is data augmentation. This study focuses on various data augmentation techniques that can be used to further improve the accuracy of a neural network. Augmenting available data saves the cost and time required to collect new data for training deep neural networks; it also regularizes the model and improves its capability of generalization. The need for large datasets in different fields such as computer vision, natural language processing, security and healthcare is also covered in this survey paper. The goal of this paper is to provide a comprehensive survey of recent advancements in data augmentation techniques and their application in various domains.

