Single-shot Semantic Matching Network for Moment Localization in Videos

Author(s):
Xinfang Liu, Xiushan Nie, Junya Teng, Li Lian, Yilong Yin

Moment localization in videos using natural language refers to finding the most relevant segment of a video given a natural language query. Most existing methods require video segment candidates for matching against the query, which incurs extra computational cost, and they may fail to locate relevant moments of arbitrary length. To address these issues, we present a lightweight single-shot semantic matching network (SSMN) that avoids the complex computation required to match the query against segment candidates and can, in theory, locate moments of any length. In the proposed SSMN, video features are first uniformly sampled to a fixed number, while the query sentence features are generated and enhanced by GloVe, long short-term memory (LSTM), and soft-attention modules. The video and sentence features are then fed to an enhanced cross-modal attention model to mine the semantic relationships between vision and language. Finally, a score predictor and a location predictor locate the start and end indexes of the queried moment. We evaluate the proposed method on two benchmark datasets, and the experimental results demonstrate that SSMN outperforms state-of-the-art methods in both precision and efficiency.
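The cross-modal attention step described above can be illustrated with a minimal sketch in plain Python. The function name, toy dimensions, and the dot-product/softmax formulation are illustrative assumptions, not the paper's exact model: each sampled video frame attends over the query-word features and is replaced by a query-conditioned context vector.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_modal_attention(video_feats, query_feats):
    """For each video frame vector, compute scaled dot-product scores
    against every query-word vector, softmax them, and return the
    attention-weighted sum of query features (one context per frame)."""
    d = len(query_feats[0])
    attended = []
    for v in video_feats:
        scores = [sum(vi * qi for vi, qi in zip(v, q)) / math.sqrt(d)
                  for q in query_feats]
        w = softmax(scores)
        ctx = [sum(w[j] * query_feats[j][k] for j in range(len(query_feats)))
               for k in range(d)]
        attended.append(ctx)
    return attended
```

With one-hot toy features, a frame aligned with a query word receives most of that word's mass in its context vector; in the real model the inputs would be learned embeddings.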

Author(s):
Kun Zhang, Guangyi Lv, Linyuan Wang, Le Wu, Enhong Chen, ...

Sentence semantic matching requires an agent to determine the semantic relation between two sentences, and it is widely used in natural language tasks such as Natural Language Inference (NLI) and Paraphrase Identification (PI). Among matching methods, the attention mechanism plays an important role in capturing semantic relations and properly aligning the elements of two sentences. Previous methods utilized the attention mechanism to select the important parts of sentences in a single pass. However, the important parts of a sentence change dynamically as understanding deepens, so selecting them all at once may be insufficient for semantic understanding. To this end, we propose a Dynamic Re-read Network (DRr-Net) for sentence semantic matching, which pays close attention to a small region of the sentences at each step and re-reads the important words for better semantic understanding. Specifically, we first employ an Attention Stack-GRU (ASG) unit to model the original sentence repeatedly and preserve all the information from the bottom-most word-embedding input to the top-most recurrent output. Second, we utilize a Dynamic Re-read (DRr) unit that attends closely to one important word at a time, conditioned on the information learned so far, and re-reads the important words for better sentence semantic understanding. Extensive experiments on three sentence matching benchmark datasets demonstrate that DRr-Net models sentence semantics more precisely and significantly improves sentence semantic matching performance. In addition, some of the findings in our experiments are consistent with findings from psychological research.
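The re-read idea, one important word at a time conditioned on what has been learned so far, can be sketched as follows. The scoring and state update here are simplified stand-ins for the paper's GRU-based DRr unit, and the function name is an illustrative assumption:

```python
def dynamic_reread(word_vecs, steps):
    """At each step, score every word against the current reading state,
    re-read (pick) the single best-scoring word, and blend it back into
    the state, so later picks depend on earlier ones."""
    d = len(word_vecs[0])
    n = len(word_vecs)
    # initial state: mean of the word vectors
    state = [sum(w[k] for w in word_vecs) / n for k in range(d)]
    picks = []
    for _ in range(steps):
        scores = [sum(wi * si for wi, si in zip(w, state)) for w in word_vecs]
        best = max(range(n), key=lambda i: scores[i])  # one word at a time
        picks.append(best)
        # fold the re-read word into the state
        state = [(s + word_vecs[best][k]) / 2.0 for k, s in enumerate(state)]
    return picks
```

Because the state is updated after every pick, the same salient word can legitimately be re-read across steps, which is the behavior the abstract emphasizes.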


2022, Vol. 22 (3), pp. 1-21
Author(s):
Prayag Tiwari, Amit Kumar Jaiswal, Sahil Garg, Ilsun You

Self-attention mechanisms have recently been embraced for a broad range of text-matching applications. A self-attention model takes only one sentence as input, with no extra information; one can then use the final hidden state or a pooling operation. However, text-matching problems can be interpreted in either symmetrical or asymmetrical scopes. For instance, paraphrase detection is a symmetrical task, while textual entailment classification and question-answer matching are asymmetrical tasks. In this article, we leverage the attractive properties of the self-attention mechanism and propose an attention-based network that incorporates three key components for inter-sequence attention: global pointwise features, preceding attentive features, and contextual features, while updating the remaining components. We evaluate our model on two benchmark datasets covering textual entailment and question-answer matching. The proposed efficient Self-attention-driven Network for Text Matching outperforms the state of the art on the Stanford Natural Language Inference and WikiQA datasets with far fewer parameters.
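The single-sentence pooling mentioned above can be sketched as self-attentive pooling: score each token with a learned vector, softmax the scores, and take the weighted sum as the sentence representation. This is a generic sketch, not the article's exact architecture, and the scoring vector `w` stands in for learned parameters:

```python
import math

def self_attentive_pooling(token_vecs, w):
    """Collapse a variable-length sentence into a single vector:
    score each token by its dot product with w, softmax the scores,
    and return the attention-weighted sum of the token vectors."""
    scores = [sum(ti * wi for ti, wi in zip(t, w)) for t in token_vecs]
    m = max(scores)
    es = [math.exp(s - m) for s in scores]
    z = sum(es)
    alphas = [e / z for e in es]
    d = len(token_vecs[0])
    return [sum(alphas[i] * token_vecs[i][k] for i in range(len(token_vecs)))
            for k in range(d)]
```

The resulting fixed-size vector is what a downstream matcher would consume, regardless of sentence length.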


Author(s):
Jaydeep Sen, Ashish Mittal, Diptikalyan Saha, Karthik Sankaranarayanan

Query completion systems are well studied in the context of information retrieval systems that handle keyword queries. However, Natural Language Interface to Databases (NLIDB) systems, which focus on syntactically correct and semantically complete queries to obtain high-precision answers, require a fundamentally different approach to the query completion problem than IR systems do. To the best of our knowledge, we are the first to focus on the problem of query completion for NLIDB systems. In particular, we introduce a novel concept of functional partitioning of an ontology and then design algorithms that intelligently use the components obtained from functional partitioning to extend a state-of-the-art NLIDB system to produce accurate and semantically meaningful query completions in the absence of query logs. We test the proposed query completion framework on multiple benchmark datasets and demonstrate the efficacy of our technique empirically.


Sensors, 2021, Vol. 21 (3), pp. 1012
Author(s):
Jisu Hwang, Incheol Kim

Due to the development of computer vision and natural language processing technologies in recent years, there has been growing interest in multimodal intelligent tasks that require concurrently understanding various forms of input data, such as images and text. Vision-and-language navigation (VLN) requires aligning and grounding multimodal input data to enable real-time perception of the task status from panoramic images and natural language instructions. This study proposes a novel deep neural network model (JMEBS) with joint multimodal embedding and backtracking search for VLN tasks. The proposed JMEBS model uses a transformer-based joint multimodal embedding module that exploits both multimodal context and temporal context. It also employs backtracking-enabled greedy local search (BGLS), a novel algorithm with a backtracking feature designed to improve the task success rate and optimize the navigation path, based on local and global scores related to candidate actions. A novel global scoring method further improves performance by comparing the partial trajectories searched thus far with multiple natural language instructions. The performance of the proposed model on various operations was experimentally demonstrated and compared with other models using the Matterport3D Simulator and the room-to-room (R2R) benchmark dataset.
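The search strategy can be sketched generically: greedily extend the path with the best-scoring untried action, and backtrack one step when a dead end is reached. This is a simplified stand-in for BGLS; the single `score` callback here conflates the paper's separate local and global scores, and the graph representation is an assumption:

```python
def backtracking_greedy_search(neighbors, score, start, goal, max_steps=100):
    """Greedy local search with backtracking: follow the best-scoring
    untried neighbor at each node; when no untried neighbor remains,
    pop the path one step and try the next-best action from there."""
    path = [start]
    tried = {start: set()}   # actions already attempted from each node
    steps = 0
    while path and steps < max_steps:
        steps += 1
        node = path[-1]
        if node == goal:
            return path
        candidates = [n for n in neighbors.get(node, [])
                      if n not in tried[node] and n not in path]
        if not candidates:
            path.pop()       # dead end: backtrack
            continue
        best = max(candidates, key=lambda n: score(path, n))
        tried[node].add(best)
        tried.setdefault(best, set())
        path.append(best)
    return None              # no path found within the step budget
```

On a toy graph where the greedy choice leads into a dead end, the backtracking step recovers and reaches the goal via the lower-scoring branch.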


Author(s):
Siva Reddy, Mirella Lapata, Mark Steedman

In this paper we introduce a novel semantic parsing approach to query Freebase in natural language without requiring manual annotations or question-answer pairs. Our key insight is to represent natural language via semantic graphs whose topology shares many commonalities with Freebase. Given this representation, we conceptualize semantic parsing as a graph matching problem. Our model converts sentences to semantic graphs using CCG and subsequently grounds them to Freebase, guided by denotations as a form of weak supervision. Evaluation experiments on a subset of the Free917 and WebQuestions benchmark datasets show that our semantic parser improves over the state of the art.


Quantum, 2017, Vol. 1, pp. 27
Author(s):
Pavel Sekatski, Michalis Skotiniotis, Janek Kołodyński, Wolfgang Dür

We establish general limits on how precisely a parameter, e.g., frequency or the strength of a magnetic field, can be estimated with the aid of full and fast quantum control. We consider uncorrelated noisy evolutions of N qubits and show that fast control allows one to fully restore the Heisenberg scaling (~1/N^2) for all rank-one Pauli noise except dephasing. For all other types of noise, the asymptotic quantum enhancement is unavoidably limited to a constant-factor improvement over the standard quantum limit (~1/N), even when allowing for the full power of fast control. The latter holds in both the single-shot and infinitely-many-repetitions scenarios. Even in this case, however, fast quantum control helps to increase the improvement factor. Furthermore, for frequency estimation with finite resources, we show how a parallel scheme utilizing any fixed number of entangled qubits but no fast quantum control can be outperformed by a simple, easily implementable sequential scheme that requires entanglement only between one sensing qubit and one auxiliary qubit.


2021, Vol. 2021, pp. 1-8
Author(s):
Miaoyuan Shi

With the development of deep learning and its wide application in the field of natural language, question answering over knowledge graphs based on deep learning has gradually become a focus of attention. The natural language query is converted into a structured query to identify the entities and attributes in the user's query, and the identified entities and attributes are then used to retrieve answers from the knowledge graph. Leveraging the advantage of deep learning in capturing sentence information, the model incorporates an attention mechanism to obtain the semantic vectors of the relevant attributes in the query, and uses a parameter-sharing mechanism to insert candidate attributes into triples within the same model to obtain the semantic vectors of the candidates. Experiments show that on a 100,000-triple RDF dataset, a single-entity query with the MIQE model takes no more than 3 seconds and a join query no more than 5 seconds; on a one-million-triple RDF dataset, a single-entity query takes no more than 8 seconds and a join query no more than 10 seconds. The experimental data show that the deep-learning-based knowledge-graph question-answering system for intelligent construction engineering has good horizontal scalability.


Author(s):
Kaan Ant, Ugur Sogukpinar, Mehmet Fatif Amasyali

The use of databases that contain semantic relationships between words is becoming increasingly widespread as a way to make natural language processing more effective. Unlike the bag-of-words approach, semantic spaces give the distances between words, but they do not express the relation types. In this study, we show how semantic spaces can be used to find the type of relationship, and we compare this with the template method. According to results obtained at very large scale, semantic spaces are more successful for the is_a and opposite relations, whereas the template approach is more successful for the at_location, made_of, and non-relational types.
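One common way to query a semantic space for a relation type, offered here as an illustrative sketch rather than the study's exact method, is to represent each word pair by its offset vector and assign it to the relation whose prototype offset is most similar by cosine. The relation names match the abstract; the prototype construction is an assumption:

```python
import math

def cosine(u, v):
    # cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def offset(vec_a, vec_b):
    """Offset vector of a word pair (a, b) in the semantic space."""
    return [b - a for a, b in zip(vec_a, vec_b)]

def classify_relation(pair_offset, prototypes):
    """Assign the pair to the relation (e.g. is_a, opposite) whose
    prototype offset vector is closest by cosine similarity."""
    return max(prototypes, key=lambda r: cosine(pair_offset, prototypes[r]))
```

In practice the prototype for each relation type would be averaged over many known example pairs; the template method, by contrast, matches lexical patterns in text rather than vector geometry.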


Author(s):
Md. Asifuzzaman Jishan, Khan Raqib Mahmud, Abul Kalam Al Azad

We present a learning model that generates natural language descriptions of images. The model exploits the connections between natural language and visual data by producing text-line-based content from a given image. Our Hybrid Recurrent Neural Network model builds on Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Bi-directional Recurrent Neural Network (BRNN) models. We conducted experiments on three benchmark datasets: Flickr8K, Flickr30K, and MS COCO. Our hybrid model uses the LSTM to encode text lines or sentences independently of object location and the BRNN for word representation, which reduces computational complexity without compromising the accuracy of the descriptor. The model achieves better accuracy in retrieving natural-language-based descriptions on these datasets.

