Video Interactive Captioning with Human Prompts

Author(s):  
Aming Wu ◽  
Yahong Han ◽  
Yi Yang

Video captioning aims at generating a proper sentence to describe the video content. As a video often includes rich visual content and semantic details, different people may be interested in different views, so the generated sentence often fails to meet ad hoc expectations. In this paper, we make a new attempt: we launch a round of interaction between a human and a captioning agent. After generating an initial caption, the agent asks the human for a short prompt as a clue to their expectation. Then, based on the prompt, the agent can generate a more accurate caption. We name this process a new task of video interactive captioning (ViCap). Taking a video and an initial caption as input, we devise a ViCap agent which consists of a video encoder, an initial caption encoder, and a refined caption generator. We show that ViCap can be trained in a fully supervised way (with ground-truth captions) or a weakly supervised way (with only prompts). For the evaluation of ViCap, we first extend the MSRVTT dataset with interaction ground truth. Experimental results not only show that the prompts help generate more accurate captions, but also demonstrate the good performance of the proposed method.
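The agent described above combines a video encoder, an initial caption encoder, and a refined caption generator conditioned on the human prompt. The following is a minimal sketch of such an architecture, assuming pre-extracted frame features and integer-tokenized text; the module choices (GRU encoders, greedy decoding), dimensions, and the way the prompt is fused are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a ViCap-style agent: a video encoder, an initial-caption
# encoder, and a refined-caption generator conditioned on a human prompt.
import torch
import torch.nn as nn


class ViCapAgent(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, hidden=512, embed_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.video_enc = nn.GRU(feat_dim, hidden, batch_first=True)    # video encoder
        self.text_enc = nn.GRU(embed_dim, hidden, batch_first=True)    # initial-caption / prompt encoder
        self.decoder = nn.GRUCell(embed_dim + 2 * hidden, hidden)      # refined caption generator
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frames, init_caption, prompt, max_len=20, bos_id=1):
        # frames: (B, T, feat_dim); init_caption, prompt: (B, L) token ids
        _, v = self.video_enc(frames)                                  # (1, B, hidden)
        _, c = self.text_enc(self.embed(init_caption))
        _, p = self.text_enc(self.embed(prompt))
        ctx = torch.cat([v.squeeze(0), c.squeeze(0) + p.squeeze(0)], dim=-1)
        h = torch.zeros(frames.size(0), self.out.in_features, device=frames.device)
        tok = torch.full((frames.size(0),), bos_id, dtype=torch.long, device=frames.device)
        logits = []
        for _ in range(max_len):
            h = self.decoder(torch.cat([self.embed(tok), ctx], dim=-1), h)
            step = self.out(h)
            logits.append(step)
            tok = step.argmax(dim=-1)                                  # greedy decoding for the sketch
        return torch.stack(logits, dim=1)                              # (B, max_len, vocab_size)
```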

Author(s):  
Bin Zhao ◽  
Xuelong Li ◽  
Xiaoqiang Lu

Visual features play an important role in the video captioning task. Since video content is mainly composed of the activities of salient objects, the caption quality of current approaches is limited when they focus only on global frame features and pay less attention to the salient objects. To tackle this problem, in this paper, we design an object-aware feature for video captioning, denoted as the tube feature. Firstly, Faster-RCNN is employed to extract object regions in frames, and a tube generation method is developed to connect regions from different frames that belong to the same object. After that, an encoder-decoder architecture is constructed for video caption generation. Specifically, the encoder is a bi-directional LSTM, which is utilized to capture the dynamic information of each tube. The decoder is a single LSTM extended with an attention model, which enables our approach to adaptively attend to the most correlated tubes when generating the caption. We evaluate our approach on two benchmark datasets: MSVD and Charades. The experimental results demonstrate the effectiveness of the tube feature in the video captioning task.
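As a rough illustration of how per-frame detections can be connected into tubes, the sketch below greedily links boxes across consecutive frames by IoU; the threshold and the greedy matching strategy are assumptions for illustration, not the authors' tube generation method.

```python
# Greedy IoU-based linking of per-frame detections into object "tubes".
# Boxes are (x1, y1, x2, y2); detections[t] is the list of boxes in frame t.
def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def link_tubes(detections, iou_thr=0.5):
    tubes = [[b] for b in detections[0]]                 # start one tube per first-frame box
    for frame_boxes in detections[1:]:
        unmatched = list(frame_boxes)
        for tube in tubes:
            if not unmatched:
                break
            best = max(unmatched, key=lambda b: iou(tube[-1], b))
            if iou(tube[-1], best) >= iou_thr:           # extend the tube with the best-overlapping box
                tube.append(best)
                unmatched.remove(best)
        tubes.extend([b] for b in unmatched)             # boxes with no match start new tubes
    return tubes
```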


2020 ◽  
Vol 2020 (4) ◽  
pp. 116-1-116-7
Author(s):  
Raphael Antonius Frick ◽  
Sascha Zmudzinski ◽  
Martin Steinebach

In recent years, the number of forged videos circulating on the Internet has increased immensely. Software and services to create such forgeries have become more and more accessible to the public, and with this, the risk of malicious use of forged videos has risen. This work proposes an approach, based on the ghost effect known from image forensics, for detecting forgeries in videos that replace faces in video sequences or alter facial expressions. The experimental results show that the proposed approach is able to identify forgeries in high-quality encoded video content.
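For context, the ghost effect from image forensics (the JPEG ghost) locates regions that were previously compressed at a different quality by recompressing an image at a range of qualities and looking for local minima in the reconstruction error. Below is a minimal per-frame sketch, assuming frames are available as PIL images; the block size and quality range are illustrative choices, and this is not the authors' detector.

```python
# Per-frame JPEG-ghost map: recompress at several qualities and record the
# blockwise squared difference; spliced regions tend to show a distinct minimum.
import io
import numpy as np
from PIL import Image


def jpeg_ghost_maps(frame, qualities=range(50, 100, 5), block=16):
    orig = np.asarray(frame.convert("L"), dtype=np.float64)
    maps = {}
    for q in qualities:
        buf = io.BytesIO()
        frame.convert("RGB").save(buf, format="JPEG", quality=q)
        recompressed = np.asarray(Image.open(buf).convert("L"), dtype=np.float64)
        diff = (orig - recompressed) ** 2
        h = diff.shape[0] // block * block
        w = diff.shape[1] // block * block
        blocks = diff[:h, :w].reshape(h // block, block, w // block, block)
        maps[q] = blocks.mean(axis=(1, 3))               # mean squared error per block
    return maps                                          # inspect per-block minima across qualities
```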


Author(s):  
Li-Ming Chen ◽  
Bao-Xin Xiu ◽  
Zhao-Yun Ding

For short text classification, insufficient labeled data, data sparsity, and imbalanced classification have become three major challenges. To address these, we propose multiple weak supervision, which can label unlabeled data automatically. Different from prior work, the proposed method generates probabilistic labels through a conditionally independent model. Moreover, experiments were conducted to verify the effectiveness of multiple weak supervision. According to experimental results on public, real, and synthetic datasets, the unlabeled imbalanced short text classification problem can be solved effectively by multiple weak supervision. Notably, recall and F1-score can be improved by adding distant supervision clustering without reducing precision, which can be used to meet different application needs.
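As a simplified illustration of how several weak labelers can be combined into a probabilistic label under a conditional-independence assumption, the sketch below performs a naive-Bayes-style combination with assumed labeler accuracies; it is not the authors' model, only a small worked example of the idea.

```python
# Combine votes from several weak labelers into a probabilistic label for the
# positive class, assuming labelers are conditionally independent given the
# true class and have known (or estimated) accuracies. -1 denotes abstain.
import numpy as np


def probabilistic_label(votes, accuracies, prior_pos=0.5):
    log_pos = np.log(prior_pos)
    log_neg = np.log(1.0 - prior_pos)
    for v, acc in zip(votes, accuracies):
        if v == -1:                       # abstaining labelers contribute nothing
            continue
        if v == 1:
            log_pos += np.log(acc)
            log_neg += np.log(1.0 - acc)
        else:
            log_pos += np.log(1.0 - acc)
            log_neg += np.log(acc)
    odds = np.exp(log_pos - log_neg)
    return odds / (1.0 + odds)            # P(y = 1 | votes)


# Two positive votes, one abstention, one negative vote from labelers of varying accuracy.
print(probabilistic_label(votes=[1, 1, -1, 0], accuracies=[0.8, 0.7, 0.9, 0.6]))
```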


2021 ◽  
Author(s):  
Jason Meil

The data preparation process generally consumes up to 80% of a data scientist's time, with 60% of that attributed to cleaning and labeling data [1]. Our solution is to use automated pipelines to prepare, annotate, and catalog data. The first step upon ingestion, especially in the case of real-world, unstructured and unlabeled datasets, is to leverage Snorkel, a tool specifically designed around a paradigm to rapidly create, manage, and model training data. Configured properly, Snorkel can be leveraged to temper this labeling bottleneck through a process called weak supervision. Weak supervision uses programmatic labeling functions (heuristics, distant supervision, subject-matter expertise, or knowledge bases) scripted in Python to generate "noisy labels". The functions traverse the entirety of the dataset and feed the labeled data into a generative, conditionally probabilistic model. The purpose of this model is to output the distribution of each response variable and predict the conditional probability based on a joint probability distribution algorithm. This is done by comparing the various labeling functions and the degree to which their outputs are congruent with each other. A labeling function that has a high degree of congruence with the other labeling functions will have a high learned accuracy, that is, the fraction of predictions that the model got right. Conversely, a labeling function that has a low degree of congruence with the other functions will have a low learned accuracy. The predictions are then combined by their estimated weighted accuracy, whereby the predictions of functions with higher learned accuracy are counted multiple times. The result is a transformation from a binary classification of 0 or 1 to a fuzzy label between 0 and 1: there is probability "x" that, based on heuristic "n", the response variable is "y". As data is added to this generative model, multi-class inference is made on the response variables positive, negative, or abstain, assigning probabilistic labels to potentially millions of data points. Thus, we have generated a discriminative ground truth for all further labeling efforts and have improved the scalability of our models. Labeling functions can be applied to unlabeled data to further machine learning efforts.

Once our datasets are labeled and a ground truth is established, we need to persist the data into our delta lake, since it combines the most performant aspects of a warehouse with the low-cost storage of a data lake. In addition, the lake can accept unstructured, semi-structured, or structured data sources, and those sources can be further aggregated into raw-ingestion, cleaned, and feature-engineered data layers. By sectioning off the data sources into these "layers", the data engineering portion is abstracted away from the data scientist, who can access model-ready data at any time. Data can be ingested via batch or stream.

The design of the entire ecosystem is to eliminate as much technical debt in machine learning paradigms as possible in terms of configuration, data collection, verification, governance, extraction, analytics, process management, resource management, infrastructure, monitoring, and post verification.
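The Snorkel workflow described above can be sketched as follows, assuming a pandas DataFrame of unlabeled text and the Snorkel 0.9-style labeling API; the two keyword heuristics and the toy data are hypothetical examples, not part of the described pipeline.

```python
# Programmatic labeling with Snorkel: heuristic labeling functions produce noisy
# votes, and the generative LabelModel combines them into probabilistic labels.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


@labeling_function()
def lf_contains_refund(x):
    # Keyword heuristic (hypothetical rule): mentions of refunds suggest a complaint.
    return POSITIVE if "refund" in x.text.lower() else ABSTAIN


@labeling_function()
def lf_contains_thanks(x):
    return NEGATIVE if "thank" in x.text.lower() else ABSTAIN


df_train = pd.DataFrame({"text": ["Please refund my order",
                                  "Thanks, great service",
                                  "Item arrived broken"]})

applier = PandasLFApplier(lfs=[lf_contains_refund, lf_contains_thanks])
L_train = applier.apply(df=df_train)                     # label matrix: one column per labeling function

label_model = LabelModel(cardinality=2, verbose=False)   # generative model over labeling functions
label_model.fit(L_train=L_train, n_epochs=200, seed=42)
probs = label_model.predict_proba(L=L_train)             # fuzzy labels in [0, 1] per class
print(probs)
```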


Author(s):  
Niraj Doshi ◽  
Gerald Schaefer

Nailfold capillaroscopy (NC) is a non-invasive imaging technique employed to assess the condition of blood capillaries in the nailfold. It is particularly useful for early detection of scleroderma spectrum disorders and evaluation of Raynaud's phenomenon. While automated approaches to analysing NC images are relatively rare, they are typically based on extraction and analysis of individual capillaries from the images in order to assign a patient to one of the commonly employed scleroderma patterns. In this chapter, we present a different approach that does not rely on individual capillaries but performs interpretation in a holistic way based on information gathered from an image or a selected image region. In particular, our algorithm employs texture analysis to characterise the underlying patterns, coupled with a classification stage to first identify patterns in fingers, and then, through a voting strategy, reach a decision for a patient. Experimental results on a set of NC images with known ground truth demonstrate the efficacy of the proposed approach.
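The per-patient decision described above (classify each finger image from its texture features, then vote) can be sketched as follows; the feature vectors and the classifier are generic placeholders (any fitted scikit-learn-style estimator would do), and the plain majority vote is an assumption, not the chapter's specific voting strategy.

```python
# Holistic patient-level classification: classify each finger region from its
# texture features, then reach a patient-level decision by majority vote.
from collections import Counter


def classify_patient(finger_feature_vectors, classifier):
    # classifier.predict returns one scleroderma-pattern label per finger region
    finger_labels = classifier.predict(finger_feature_vectors)
    votes = Counter(finger_labels)
    return votes.most_common(1)[0][0]                    # the most frequent pattern wins
```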


2018 ◽  
Vol 4 (12) ◽  
pp. 139 ◽  
Author(s):  
Alessandro Fedeli ◽  
Manuela Maffongelli ◽  
Ricardo Monleone ◽  
Claudio Pagnamenta ◽  
Matteo Pastorino ◽  
...  

A new prototype of a tomographic system for microwave imaging is presented in this paper. The target being tested is surrounded by an ad-hoc 3D-printed structure, which supports sixteen custom antenna elements. The transmission measurements between each pair of antennas are acquired through a vector network analyzer connected to a modular switching matrix. The collected data are inverted by a hybrid nonlinear procedure combining qualitative and quantitative reconstruction algorithms. Preliminary experimental results, showing the capabilities of the developed system, are reported.


2020 ◽  
Vol 36 (Supplement_1) ◽  
pp. i300-i308 ◽  
Author(s):  
Dan Guo ◽  
Melanie Christine Föll ◽  
Veronika Volkmann ◽  
Kathrin Enderle-Ammour ◽  
Peter Bronsert ◽  
...  

Motivation: Mass spectrometry imaging (MSI) characterizes the molecular composition of tissues at spatial resolution, and has a strong potential for distinguishing tissue types, or disease states. This can be achieved by supervised classification, which takes as input MSI spectra, and assigns class labels to subtissue locations. Unfortunately, developing such classifiers is hindered by the limited availability of training sets with subtissue labels as the ground truth. Subtissue labeling is prohibitively expensive, and only rough annotations of the entire tissues are typically available. Classifiers trained on data with approximate labels have sub-optimal performance.

Results: To alleviate this challenge, we contribute a semi-supervised approach, mi-CNN. mi-CNN implements multiple instance learning with a convolutional neural network (CNN). The multiple instance aspect enables weak supervision from tissue-level annotations when classifying subtissue locations. The convolutional architecture of the CNN captures contextual dependencies between the spectral features. Evaluations on simulated and experimental datasets demonstrated that mi-CNN improved the subtissue classification as compared to traditional classifiers. We propose mi-CNN as an important step toward accurate subtissue classification in MSI, enabling rapid distinction between tissue types and disease states.

Availability and implementation: The data and code are available at https://github.com/Vitek-Lab/mi-CNN_MSI.
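As an illustration of the multiple-instance idea (tissue-level labels supervising subtissue spectra), the sketch below scores each spectrum with a small 1-D CNN and pools the per-spectrum logits within a tissue "bag" by taking the maximum, so the loss needs only a tissue-level label; the architecture and pooling choice are assumptions for illustration, not the published mi-CNN.

```python
# Multiple instance learning over MSI spectra: a small 1-D CNN scores each
# subtissue spectrum, and max-pooling over the bag yields a tissue-level logit
# that can be trained with only tissue-level labels.
import torch
import torch.nn as nn


class MilSpectraCNN(nn.Module):
    def __init__(self, n_channels=1, hidden=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, hidden, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(hidden, 1)

    def forward(self, bag):
        # bag: (n_spectra, 1, n_mz) spectra from one tissue
        instance_logits = self.classifier(self.features(bag).squeeze(-1))  # (n_spectra, 1)
        return instance_logits.max(dim=0).values          # tissue-level logit via max pooling


model = MilSpectraCNN()
bag = torch.randn(50, 1, 1000)                            # 50 spectra, 1000 m/z bins (synthetic)
tissue_label = torch.tensor([1.0])
loss = nn.BCEWithLogitsLoss()(model(bag), tissue_label)
loss.backward()
```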


2020 ◽  
Vol 34 (07) ◽  
pp. 12637-12644 ◽  
Author(s):  
Yibo Yang ◽  
Hongyang Li ◽  
Xia Li ◽  
Qijie Zhao ◽  
Jianlong Wu ◽  
...  

The panoptic segmentation task requires a unified result from semantic and instance segmentation outputs that may contain overlaps. However, current studies widely ignore modeling overlaps. In this study, we aim to model overlap relations among instances and resolve them for panoptic segmentation. Inspired by scene graph representation, we formulate the overlapping problem as a simplified case, named the scene overlap graph. We leverage each object's category, geometry, and appearance features to perform relational embedding, and output a relation matrix that encodes overlap relations. To overcome the lack of supervision, we introduce a differentiable module to resolve the overlap between any pair of instances. The mask logits after removing overlaps are fed into per-pixel instance id classification, which leverages the panoptic supervision to assist in the modeling of overlap relations. In addition, we generate an approximate ground truth of overlap relations as weak supervision, to quantify the accuracy of the overlap relations predicted by our method. Experiments on COCO and Cityscapes demonstrate that our method accurately predicts overlap relations and outperforms the state of the art for panoptic segmentation. Our method also won the Innovation Award in the COCO 2019 challenge.
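To make the overlap-resolution step more concrete, the sketch below suppresses each instance's mask logits wherever an instance ranked above it (according to a pairwise relation matrix) claims the pixel; this is a simplified illustration under assumed tensor shapes and an assumed penalty form, not the authors' differentiable module.

```python
# Resolve overlaps between instance masks using a pairwise relation matrix.
# relation[i, j] close to 1 means instance i should appear on top of instance j.
import torch


def resolve_overlaps(mask_logits, relation):
    # mask_logits: (N, H, W) per-instance logits; relation: (N, N) with zero diagonal
    probs = mask_logits.sigmoid()                         # (N, H, W)
    # For instance j, how strongly is each pixel claimed by instances above it?
    occlusion = torch.einsum("ij,ihw->jhw", relation, probs).clamp(max=1.0)
    return mask_logits - 10.0 * occlusion                 # push down logits under occluders


masks = torch.randn(3, 64, 64)
relation = torch.tensor([[0.0, 1.0, 0.0],
                         [0.0, 0.0, 0.0],
                         [1.0, 1.0, 0.0]])
resolved = resolve_overlaps(masks, relation)
```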


Author(s):  
Luca Baroffio ◽  
Alessandro E. C. Redondi ◽  
Marco Tagliasacchi ◽  
Stefano Tubaro

Visual features constitute compact yet effective representations of visual content, and are being exploited in a large number of heterogeneous applications, including augmented reality, image registration, content-based retrieval, and classification. Several visual content analysis applications are distributed over a network and require the transmission of visual data, either in the pixel or in the feature domain, to a central unit that performs the task at hand. Furthermore, large-scale applications need to store a database composed of up to billions of features and perform matching with low latency. In this context, several different implementations of feature extraction algorithms have been proposed over the last few years, with the aim of reducing computational complexity and memory footprint, while maintaining an adequate level of accuracy. Besides extraction, a large body of research addressed the problem of ad-hoc feature encoding methods, and a number of networking and transmission protocols enabling distributed visual content analysis have been proposed. In this survey, we present an overview of state-of-the-art methods for the extraction, encoding, and transmission of compact features for visual content analysis, thoroughly addressing each step of the pipeline and highlighting the peculiarities of the proposed methods.


2009 ◽  
Vol 1216 ◽  
Author(s):  
Cristina Romero ◽  
Ariel A. Valladares ◽  
R. M. Valladares ◽  
Alexander Valladares

Nanoporous carbon is a widely studied material due to its potential applications in hydrogen storage or in filtering undesirable products. Most of the developments have been experimental, although some simulation work has been carried out based on the use of graphene sheets and/or carbon chains and classical molecular dynamics. The slit pore model is one of the oldest models proposed to describe porous carbon. Developed by Emmet in 1948 [1], it has been used recurrently and, in its most basic form, consists of two parallel graphene layers separated by a distance that is taken as the width of the pore. Its simplicity limits its applicability, since experimental evidence suggests that the walls of the carbon pores have widths of a few graphene layers [2], but it is still appealing for computational simulations due to its low computational cost. Using a previously developed ab initio approach to generate porous semiconductors [3], we have obtained porous carbonaceous materials with walls made up of a few graphene layers (four layers), in agreement with experimental results; these walls are separated by distances comparable to those used in the slit pore model [4]. This validates the idea of a modified slit pore model obtained without the use of ad hoc suppositions. Structures will be presented, analyzed, and compared to available experimental results.
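As a concrete illustration of the basic slit pore geometry (two parallel graphene layers separated by the pore width), the sketch below generates atomic coordinates for such a model; the lattice constant is the standard textbook value, and the cell size and pore width are arbitrary illustrative choices rather than parameters of the simulations described here.

```python
# Build a basic slit pore: two parallel graphene sheets separated by the pore width.
import numpy as np

A = 2.46  # graphene lattice constant in angstroms


def graphene_sheet(nx, ny, z):
    a1 = np.array([A, 0.0])
    a2 = np.array([A / 2, A * np.sqrt(3) / 2])
    basis = [np.zeros(2), (a1 + a2) / 3.0]               # two carbon atoms per unit cell
    atoms = []
    for i in range(nx):
        for j in range(ny):
            for b in basis:
                x, y = i * a1 + j * a2 + b
                atoms.append((x, y, z))
    return atoms


def slit_pore(nx=4, ny=4, pore_width=7.0):
    # One sheet at z = 0 and one at z = pore_width define the pore walls.
    return graphene_sheet(nx, ny, 0.0) + graphene_sheet(nx, ny, pore_width)


print(len(slit_pore()), "carbon atoms")                   # 2 sheets x 4 x 4 cells x 2 atoms = 64
```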

