A Survey on Contrastive Self-Supervised Learning

Technologies ◽  
2020 ◽  
Vol 9 (1) ◽  
pp. 2
Author(s):  
Ashish Jaiswal ◽  
Ashwin Ramesh Babu ◽  
Mohammad Zaki Zadeh ◽  
Debapriya Banerjee ◽  
Fillia Makedon

Self-supervised learning has gained popularity because of its ability to avoid the cost of annotating large-scale datasets. It is capable of adopting self-defined pseudolabels as supervision and using the learned representations for several downstream tasks. Specifically, contrastive learning has recently become a dominant component in self-supervised learning for computer vision, natural language processing (NLP), and other domains. It aims at embedding augmented versions of the same sample close to each other while trying to push away embeddings from different samples. This paper provides an extensive review of self-supervised methods that follow the contrastive approach. The work explains commonly used pretext tasks in a contrastive learning setup, followed by different architectures that have been proposed so far. Next, we present a performance comparison of different methods for multiple downstream tasks such as image classification, object detection, and action recognition. Finally, we conclude with the limitations of the current methods and the need for further techniques and future directions to make meaningful progress.
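
As a hedged illustration of the contrastive objective described above (not code from the survey itself), the following minimal sketch implements an NT-Xent-style loss in PyTorch, assuming two augmented views per sample produced by an arbitrary encoder; all names and the temperature value are illustrative.

```python
# Minimal sketch of a contrastive (NT-Xent-style) loss in PyTorch.
# Assumes z1, z2 are embeddings of two augmented views of the same batch;
# names and hyperparameters are illustrative, not from the surveyed papers.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """Pulls the two views of each sample together and pushes apart embeddings of different samples."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2n, d), unit-norm
    sim = torch.matmul(z, z.T) / temperature              # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                      # a sample is never its own negative
    # The positive for index i is its counterpart in the other view.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Usage sketch: z1, z2 = encoder(aug1(x)), encoder(aug2(x)); loss = nt_xent_loss(z1, z2)
```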

2021 ◽  
Vol 14 (2) ◽  
pp. 201-214
Author(s):  
Danilo Croce ◽  
Giuseppe Castellucci ◽  
Roberto Basili

In recent years, Deep Learning methods have become very popular in classification tasks for Natural Language Processing (NLP); this is mainly due to their ability to reach high performance by relying on very simple input representations, i.e., raw tokens. One of the drawbacks of deep architectures is the large amount of annotated data required for effective training. Usually, in Machine Learning this problem is mitigated by using semi-supervised methods or, more recently, by using Transfer Learning in the context of deep architectures. One recent promising method to enable semi-supervised learning in deep architectures has been formalized within Semi-Supervised Generative Adversarial Networks (SS-GANs) in the context of Computer Vision. In this paper, we adopt the SS-GAN framework to enable semi-supervised learning in the context of NLP. We demonstrate how an SS-GAN can boost the performance of simple architectures when operating in expressive low-dimensional embeddings; these are derived by combining the unsupervised approximation of linguistic Reproducing Kernel Hilbert Spaces and the so-called Universal Sentence Encoders. We experimentally evaluate the proposed approach over a semantic classification task, i.e., Question Classification, by considering different sizes of training material and different numbers of target classes. By applying such an adversarial schema to a simple Multi-Layer Perceptron, a classifier trained over a subset derived from 1% of the original training material achieves 92% accuracy. Moreover, when considering a complex classification schema, e.g., involving 50 classes, the proposed method outperforms state-of-the-art alternatives such as BERT.
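
To make the SS-GAN idea concrete, here is a hedged sketch of the standard semi-supervised discriminator objective (in the spirit of Salimans et al.), where a K-class problem reserves an extra output for "fake"; the MLP architecture, dimensions and names are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of a semi-supervised GAN (SS-GAN) discriminator objective.
# Class index num_classes is reserved for "fake"; shapes and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    def __init__(self, input_dim: int, num_classes: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes + 1),  # K real classes + 1 fake class
        )

    def forward(self, x):
        return self.net(x)  # unnormalized logits over K+1 classes

def discriminator_loss(logits_labeled, labels, logits_unlabeled, logits_fake, num_classes):
    # Supervised term: labeled real examples should be assigned their true class.
    loss_sup = F.cross_entropy(logits_labeled, labels)
    # Unsupervised terms: unlabeled real examples should not land in the fake class,
    # generated examples should land in the fake class.
    p_fake_real = F.softmax(logits_unlabeled, dim=1)[:, num_classes]
    p_fake_gen = F.softmax(logits_fake, dim=1)[:, num_classes]
    loss_unsup = -torch.log(1 - p_fake_real + 1e-8).mean() - torch.log(p_fake_gen + 1e-8).mean()
    return loss_sup + loss_unsup
```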


Author(s):  
Yi Ge ◽  
Peter J. Jin ◽  
Tianya Zhang ◽  
Jonathan Martinez

This paper explores the cloud- versus server-based deployment scenarios of an enhanced computer vision platform for potential deployment on low-resolution 511 traffic video streams. An existing computer vision algorithm based on a spatial–temporal map and designed for high-angle traffic video like that of NGSIM (Next Generation SIMulation) is enhanced for roadside CCTV traffic camera angles. Because of the lower visual angle, determining the directions, splitting vehicles from occlusions, and identifying lane changes become difficult. A motion-flow-based direction determination method, a bisection occlusion detection and splitting algorithm, and a lane-change tracking method are proposed. The model evaluation is conducted by using videos from multiple cameras from the New Jersey Department of Transportation’s 511 traffic video surveillance system. The results show promising performance in both accuracy and computational efficiency for potential large-scale cloud deployment. The cost analysis reveals that at the current pricing model of cloud computing, the cloud-based deployment is more convenient and cost-effective for an on-demand network assessment. In contrast, the dedicated-server-based deployment is more economical for long-term traffic detection deployment.
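
As a rough illustration of what motion-flow-based direction determination could look like (this is an assumption-laden sketch, not the authors' algorithm), the snippet below uses OpenCV's dense Farneback optical flow to compute the dominant motion angle inside a region of interest, which can then be thresholded to label traffic direction.

```python
# Minimal sketch of motion-flow-based direction determination using OpenCV.
# ROI format, thresholds and the decision rule are illustrative assumptions.
import cv2
import numpy as np

def dominant_flow_angle(prev_gray: np.ndarray, curr_gray: np.ndarray, roi: tuple) -> float:
    """Return the median motion angle (radians) inside a rectangular ROI (x, y, w, h);
    the caller can map angle ranges to inbound/outbound traffic directions."""
    x, y, w, h = roi
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx = flow[y:y + h, x:x + w, 0]
    fy = flow[y:y + h, x:x + w, 1]
    mag = np.hypot(fx, fy)
    moving = mag > 1.0                      # ignore near-static pixels
    if not np.any(moving):
        return float("nan")
    return float(np.median(np.arctan2(fy[moving], fx[moving])))
```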


2021 ◽  
Vol 11 (7) ◽  
pp. 3094
Author(s):  
Vitor Fortes Rey ◽  
Kamalveer Kaur Garewal ◽  
Paul Lukowicz

Human activity recognition (HAR) using wearable sensors has benefited much less from recent advances in Deep Learning than fields such as computer vision and natural language processing. This is, to a large extent, due to the lack of large-scale (as compared to computer vision) repositories of labeled training data for sensor-based HAR tasks. Thus, for example, ImageNet has images for around 100,000 categories (based on WordNet) with on average 1000 images per category (therefore up to 100,000,000 samples). The Kinetics-700 video activity data set has 650,000 video clips covering 700 different human activities (in total over 1800 h). By contrast, the total length of all sensor-based HAR data sets in the popular UCI machine learning repository is less than 63 h, with around 38 h of those consisting of simple modes of locomotion such as walking, standing or cycling. In our research we aim to facilitate the use of online videos, which exist in ample quantities for most activities and are much easier to label than sensor data, to simulate labeled wearable motion sensor data. In previous work we already demonstrated some preliminary results in this direction, focusing on very simple, activity-specific simulation models and a single sensor modality (acceleration norm). In this paper, we show how we can train a regression model on generic motions for both accelerometer and gyro signals and then apply it to videos of the target activities to generate synthetic Inertial Measurement Unit (IMU) data (acceleration and gyro norms) that can be used to train and/or improve HAR models. We demonstrate that systems trained on simulated data generated by our regression model can come to within around 10% of the mean F1 score of a system trained on real sensor data. Furthermore, we show that by either including a small amount of real sensor data for model calibration or simply leveraging the fact that (in general) we can easily generate much more simulated data from video than we can collect real sensor data, the advantage of real data can eventually be equalized.
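
The abstract does not specify the regression model's inputs or architecture, so the following is only a hypothetical sketch of the general idea: a small network that maps a window of video-derived 2D pose keypoints to per-frame acceleration and gyroscope norms. Keypoint count, window size and layer sizes are all illustrative assumptions.

```python
# Hedged sketch of regressing IMU signal norms from video-derived pose trajectories.
# Input/output shapes and architecture are illustrative, not the paper's model.
import torch
import torch.nn as nn

class Pose2IMU(nn.Module):
    """Maps a window of 2D keypoint coordinates (e.g. from a pose estimator run on
    video frames) to acceleration and gyroscope norms for each frame in the window."""
    def __init__(self, num_keypoints: int = 17, window: int = 30):
        super().__init__()
        in_dim = num_keypoints * 2 * window        # x, y per keypoint per frame
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 2 * window),              # accel norm + gyro norm per frame
        )

    def forward(self, keypoints: torch.Tensor) -> torch.Tensor:
        # keypoints: (batch, window, num_keypoints, 2)
        return self.net(keypoints.flatten(start_dim=1))

# Training sketch: fit on videos with synchronized real IMU recordings of generic
# motions, then apply to unlabeled videos of target activities to produce synthetic
# IMU data for training HAR models.
```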


Author(s):  
Brahim Jabir ◽  
Noureddine Falih ◽  
Khalid Rahmani

In agriculture, weeds cause direct damage to the crop and primarily affect its yield potential. Manual and mechanical weeding methods consume a lot of energy and time and do not give efficient results. Chemical weed control is still the best way to control weeds; however, the widespread and large-scale use of herbicides is harmful to the environment. Our study's objective is to propose an efficient model for a smart system that detects weeds in crops in real time using computer vision. Our experimental dataset contains images of two weed species that are well known and widespread in our temperate-climate region: Phalaris paradoxa and Convolvulus. The images were manually captured with a professional camera in fields under different lighting conditions (from morning to afternoon, in sunny and cloudy weather). Weed and crop detection was evaluated with four recent pre-configured open-source computer vision models for object detection: Detectron2, EfficientDet, YOLO, and Faster R-CNN. The performance comparison of the weed detection models is carried out on the OpenCV and Keras platforms using the Python language.
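
The abstract does not detail how the detectors are scored, so the following framework-agnostic sketch only illustrates one common way such a comparison can be made: matching predicted boxes to ground truth by IoU and reporting precision and recall. Box format and thresholds are assumptions.

```python
# Illustrative sketch of comparing object detectors by IoU-based matching.
# Boxes are assumed to be (x1, y1, x2, y2); the 0.5 IoU threshold is a convention.
import numpy as np

def iou(box_a, box_b):
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall(pred_boxes, gt_boxes, iou_thr=0.5):
    matched, tp = set(), 0
    for p in pred_boxes:
        best = max(range(len(gt_boxes)), key=lambda i: iou(p, gt_boxes[i]), default=None)
        if best is not None and best not in matched and iou(p, gt_boxes[best]) >= iou_thr:
            matched.add(best)
            tp += 1
    precision = tp / max(len(pred_boxes), 1)
    recall = tp / max(len(gt_boxes), 1)
    return precision, recall
```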


Author(s):  
Mohamed Nadif ◽  
François Role

Biomedical scientific literature is growing at a very rapid pace, which makes it increasingly difficult for human experts to spot the most relevant results hidden in the papers. Automated information extraction tools based on text mining techniques are therefore needed to assist them in this task. In the last few years, deep neural network-based techniques have significantly contributed to advancing the state of the art in this research area. Although the contribution to this progress made by supervised methods is relatively well known, this is less so for other kinds of learning, namely unsupervised and self-supervised learning. Unsupervised learning is a kind of learning that does not require the cost of creating labels, which is very useful in the exploratory stages of a biomedical study where agile techniques are needed to rapidly explore many paths. In particular, clustering techniques applied to biomedical text mining make it possible to gather large sets of documents into more manageable groups. Deep learning techniques have made it possible to produce new clustering-friendly representations of the data. On the other hand, self-supervised learning is a kind of supervised learning where the labels do not have to be manually created by humans, but are automatically derived from relations found in the input texts. In combination with innovative network architectures (e.g. transformer-based architectures), self-supervised techniques have made it possible to design increasingly effective vector-based word representations (word embeddings). We show in this survey how word representations obtained in this way have proven to successfully interact with common supervised modules (e.g. classification networks), to whose performance they greatly contribute.
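
As a small illustration of the clustering workflow the survey mentions (not code from the survey), the sketch below groups a collection of abstracts into manageable clusters using scikit-learn; the TF-IDF plus SVD pipeline stands in for any "clustering-friendly" representation, and all parameter values are illustrative.

```python
# Illustrative sketch: cluster biomedical abstracts into manageable groups.
# TF-IDF + truncated SVD is a stand-in for any clustering-friendly representation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

def cluster_documents(texts, n_clusters=10, n_components=100):
    """Return one cluster label per document."""
    pipeline = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        TruncatedSVD(n_components=n_components),   # dense, low-dimensional representation
    )
    embeddings = pipeline.fit_transform(texts)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
```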


Sensors ◽  
2021 ◽  
Vol 21 (6) ◽  
pp. 1940
Author(s):  
Wentao Zhao ◽  
Dalin Zhou ◽  
Xinguo Qiu ◽  
Wei Jiang

The goal of large-scale automatic painting analysis is to classify and retrieve images using machine learning techniques. Traditional methods apply computer vision techniques to the paintings to enable computers to represent the art content. In this work, we propose using a graph convolutional network and artistic comments, rather than the painting colors, to classify the type, school, timeframe and author of paintings by applying natural language processing (NLP) techniques. First, we build a single artistic comment graph based on co-occurrence relations and document-word relations and then train an art graph convolutional network (ArtGCN) on the entire corpus. The nodes, which include the words and documents in the topological graph, are initialized using a one-hot representation; the embeddings are then learned jointly for both words and documents, supervised by the known-class training labels of the paintings. Through extensive experiments on different classification tasks using different input sources, we demonstrate that the proposed methods achieve state-of-the-art performance. In addition, ArtGCN can learn word and painting embeddings, and we find that they play a major role in describing labels and retrieving paintings, respectively.
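
To make the graph construction step more tangible, here is a hedged sketch of building a word-document graph in the general text-GCN style the abstract suggests: document-word edges weighted by TF-IDF and word-word edges weighted by co-occurrence PMI. Window size and the dense adjacency matrix are simplifying assumptions, not ArtGCN's exact recipe.

```python
# Hedged sketch of a word-document graph over artistic comments.
# Doc-word edges: TF-IDF; word-word edges: positive PMI over sliding windows.
import numpy as np
from collections import Counter
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer

def build_graph(comments, window=10):
    tfidf = TfidfVectorizer()
    doc_word = tfidf.fit_transform(comments).toarray()        # (n_docs, n_words)
    index = {w: i for i, w in enumerate(tfidf.get_feature_names_out())}

    # Count word co-occurrences within sliding windows over each comment.
    pair_counts, word_counts, n_windows = Counter(), Counter(), 0
    for doc in comments:
        tokens = [t for t in doc.lower().split() if t in index]
        for start in range(max(1, len(tokens) - window + 1)):
            win = set(tokens[start:start + window])
            n_windows += 1
            word_counts.update(win)
            pair_counts.update(combinations(sorted(win), 2))

    n_docs, n_words = doc_word.shape
    adj = np.eye(n_docs + n_words)                             # self-loops
    adj[:n_docs, n_docs:] = doc_word                           # doc-word edges
    adj[n_docs:, :n_docs] = doc_word.T
    for (w1, w2), c in pair_counts.items():
        pmi = np.log(c * n_windows / (word_counts[w1] * word_counts[w2]))
        if pmi > 0:
            i, j = n_docs + index[w1], n_docs + index[w2]
            adj[i, j] = adj[j, i] = pmi
    return adj
```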


Author(s):  
Jianzhou Feng ◽  
Jinman Cui ◽  
Qikai Wei ◽  
Zhengji Zhou ◽  
Yuxiong Wang

Text classification is a research hotspot in the field of natural language processing. Existing text classification models based on supervised learning, especially deep learning models, have made great progress on public datasets, but most of these methods rely on large amounts of training data, and the coverage of these datasets is limited. In a legal intelligent question-answering system, accurate classification of legal consulting questions is a necessary prerequisite for intelligent question answering. However, sufficient annotated data are lacking and labeling is costly, which leads to poor performance of traditional supervised learning methods under sparse labeling. In response to these problems, we construct a few-shot legal consulting questions dataset and propose a prototypical networks model based on multi-attention. For instances of the same category, the model first highlights the key features in the instances as much as possible through instance-dimension-level attention. It then classifies legal consulting questions with prototypical networks. Experimental results show that our model achieves state-of-the-art results compared with baseline models. The code and dataset are released at https://github.com/cjm0824/MAPN.
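
For readers unfamiliar with prototypical networks, the following minimal sketch shows the classification step the abstract relies on: class prototypes are the means of support-set embeddings and queries are scored by negative squared Euclidean distance. The attention module and encoder are omitted; names are illustrative.

```python
# Minimal sketch of the prototypical-network classification step.
# The multi-attention module and the text encoder from the paper are not shown.
import torch
import torch.nn.functional as F

def prototypical_logits(support: torch.Tensor, support_labels: torch.Tensor,
                        query: torch.Tensor, num_classes: int) -> torch.Tensor:
    """support: (n_support, d) and query: (n_query, d) embeddings from any encoder."""
    prototypes = torch.stack([
        support[support_labels == c].mean(dim=0) for c in range(num_classes)
    ])                                               # (num_classes, d)
    distances = torch.cdist(query, prototypes) ** 2  # squared Euclidean distances
    return -distances                                # higher logit = closer prototype

# Training sketch: loss = F.cross_entropy(prototypical_logits(...), query_labels)
```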


Author(s):  
Tommaso Pasini

Word Sense Disambiguation (WSD) is the task of identifying the meaning of a word in a given context. It lies at the base of Natural Language Processing as it provides semantic information for words. In the last decade, great strides have been made in this field and much effort has been devoted to mitigating the knowledge acquisition bottleneck problem, i.e., the problem of semantically annotating texts at a large scale and in different languages. This issue is ubiquitous in WSD as it hinders the creation of both multilingual knowledge bases and manually-curated training sets. In this work, we first introduce the reader to the task of WSD through a short historical digression and then take stock of the advancements made to alleviate the knowledge acquisition bottleneck problem. In doing so, we survey the literature on manual, semi-automatic and automatic approaches to creating English and multilingual corpora tagged with sense annotations and present a clear overview of supervised models for WSD. Finally, we provide our view of the future directions that we foresee for the field.


Author(s):  
Omiros Pantazis ◽  
Gabriel Brostow ◽  
Kate Jones ◽  
Oisin Mac Aodha

Recent years have ushered in a vast array of different types of low cost and reliable sensors that are capable of capturing large quantities of audio and visual information from the natural world. In the case of biodiversity monitoring, camera traps, i.e., remote cameras that take images when movement is detected (Kays et al. 2009), have shown themselves to be particularly effective tools for the automated monitoring of the presence and activity of different animal species. However, this ease of deployment comes at a cost, as even a small-scale camera trapping project can result in hundreds of thousands of images that need to be reviewed. Until recently, this review process was an extremely time-consuming endeavor. It required domain experts to manually inspect each image to determine if it contained a species of interest and identify, where possible, which species was present. Fortunately, in the last five years, advances in machine learning have resulted in a new suite of algorithms that are capable of automatically performing image classification tasks like species classification. The effectiveness of deep neural networks (Norouzzadeh et al. 2018), coupled with transfer learning (tuning a model that is pretrained on a larger dataset; Willi et al. 2018), has resulted in high levels of accuracy on camera trap images. However, camera trap images exhibit unique challenges that are typically not present in standard benchmark datasets used in computer vision. For example, objects of interest are often heavily occluded, the appearance of a scene can change dramatically over time due to changes in weather and lighting, and while the overall number of images can be large, the variation in locations is often limited (Schneider et al. 2020). These challenges combined mean that in order to reach high performance on species classification it is necessary to collect a large amount of annotated data to train the deep models. This again takes a significant amount of time for each project, and this time could be better spent addressing the ecological or conservation questions of interest. Self-supervised learning is a paradigm in machine learning that attempts to forgo the need for manual supervision by instead learning informative representations from images directly, e.g. transforming an image in two different ways without impacting the semantics of the included object, and learning by imposing similarity between the two transformations. This is a tantalizing proposition for camera trap data, as it has the potential to drastically reduce the amount of time required to annotate data. The current performance of these methods on standard computer vision benchmarks is encouraging, as it suggests that self-supervised models have begun to reach the accuracy of their fully supervised counterparts for tasks like classifying everyday objects in images (Chen et al. 2020). However, existing self-supervised methods can struggle when applied to tasks that contain highly similar, i.e. fine-grained, object categories such as different species of plants and animals (Van Horn et al. 2021). To this end, we explore the effectiveness of self-supervised learning when applied to camera trap imagery.
We show that these methods can be used to train image classifiers with a significant reduction in manual supervision. Furthermore, we extend this analysis by showing, with some careful design considerations, that off-the-shelf self-supervised methods can be made to learn even more effective image representations for automated species classification. We show that exploiting cues at training time related to where and when a given image was captured can result in further improvements in classification performance. We demonstrate, across several different camera trapping datasets, that it is possible to achieve similar, and sometimes even superior, accuracy to fully supervised transfer learning-based methods using ten times less manual supervision. Finally, we discuss some of the limitations of the outlined approaches and their implications for automated species classification from images.
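
One plausible way to exploit the "where and when" cues mentioned above (a hedged sketch of the general idea, not the authors' exact procedure) is to treat two images captured by the same camera within a short time window as a positive pair for contrastive training; the metadata fields and the five-minute gap below are illustrative assumptions.

```python
# Hedged sketch: context-based positive pair selection for camera trap images.
# metadata: list of dicts with 'camera_id' and 'timestamp' (datetime) per image.
from datetime import timedelta
import random

def sample_context_positive(index, metadata, max_gap=timedelta(minutes=5)):
    """Return the index of an image from the same camera captured within max_gap
    of the anchor image, to be used as a contrastive positive; None if no candidate."""
    anchor = metadata[index]
    candidates = [
        j for j, m in enumerate(metadata)
        if j != index
        and m["camera_id"] == anchor["camera_id"]
        and abs(m["timestamp"] - anchor["timestamp"]) <= max_gap
    ]
    return random.choice(candidates) if candidates else None
```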


2021 ◽  
Vol 3 ◽  
Author(s):  
Yannick Frommherz ◽  
Alessandra Zarcone

Despite their increasing success, user interactions with smart speech assistants (SAs) are still very limited compared to human-human dialogue. One way to make SA interactions more natural is to train the underlying natural language processing modules on data which reflects how humans would talk to a SA if it were capable of understanding and producing natural dialogue given a specific task. Such data can be collected by applying a Wizard-of-Oz (WOz) approach, where the user and system sides are played by humans. WOz allows researchers to simulate human-machine interaction while benefitting from the fact that all participants are human and thus dialogue-competent. More recent approaches have leveraged simple templates specifying a dialogue scenario for crowdsourcing large-scale datasets. Template-based collection efforts, however, come at the cost of data diversity and naturalness. We present a method to crowdsource dialogue data for the SA domain in the WOz framework, which aims at limiting researcher-induced bias in the data while still allowing for a low-resource, scalable data collection. Our method can also be applied to languages other than English (in our case German), for which fewer crowd-workers may be available. We collected data asynchronously, relying only on existing functionalities of Amazon Mechanical Turk, by formulating the task as a dialogue continuation task. Coherence in dialogues is ensured, as crowd-workers always read the dialogue history, and as a unifying scenario is provided for each dialogue. In order to limit bias in the data, rather than using template-based scenarios, we handcrafted situated scenarios which aimed at not pre-scripting the task into every single detail and not priming the participants' lexical choices. Our scenarios cued people's knowledge of common situations and entities relevant to our task, without directly mentioning them, instead relying on vague language and circumlocutions. We compare our data (which we publish as the CROWDSS corpus; n = 113 dialogues) with data from MultiWOZ, showing that our scenario approach led to considerably less scripting and priming and thus more ecologically valid dialogue data. This suggests that small investments in the collection setup can go a long way in improving data quality, even in a low-resource setup.

