Video Captioning Based on Channel Soft Attention and Semantic Reconstructor

2021 ◽  
Vol 13 (2) ◽  
pp. 55
Author(s):  
Zhou Lei ◽  
Yiyong Huang

Video captioning is a popular task that automatically generates a natural-language sentence to describe video content. Previous video captioning works mainly use the encoder–decoder framework and exploit special techniques, such as attention mechanisms, to improve the quality of the generated sentences. In addition, most attention mechanisms focus on global features and spatial features; however, global features are usually fully connected features. Recurrent convolutional networks (RCNs) receive 3-dimensional features as input at each time step, but the temporal structure of each channel, which provides temporal relation information, has so far been ignored. In this paper, a video captioning model based on channel soft attention and a semantic reconstructor is proposed, which considers the global information of each channel. In a video feature-map sequence, the same channel of every time step is generated by the same convolutional kernel. We selectively collect the features generated by each convolutional kernel and then input the weighted sum of each channel to the RCN at each time step to encode the video representation. Furthermore, a semantic reconstructor is proposed to rebuild semantic vectors to ensure the integrity of semantic information during training, taking advantage of both the forward (semantic to sentence) and backward (sentence to semantic) flows. Experimental results on the popular MSVD and MSR-VTT datasets demonstrate the effectiveness and feasibility of our model.
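A minimal PyTorch sketch of the channel soft attention idea (all module and variable names here are illustrative assumptions, not the authors' code): per channel, attention weights are computed over the time steps of the feature-map sequence, and the per-channel weighted sum is what would be fed to the RCN encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSoftAttention(nn.Module):
    """Illustrative channel-wise soft attention over a feature-map sequence."""
    def __init__(self, hidden_size):
        super().__init__()
        # Hypothetical scoring layer: hidden state + per-channel descriptor -> scalar score.
        self.score = nn.Linear(hidden_size + 1, 1)

    def forward(self, feats, hidden):
        # feats: (T, C, H, W) feature maps; hidden: (hidden_size,) encoder hidden state.
        T, C, H, W = feats.shape
        pooled = feats.mean(dim=(2, 3))                        # (T, C) spatial mean per channel
        h = hidden.unsqueeze(0).unsqueeze(0).expand(T, C, -1)  # broadcast the hidden state
        scores = self.score(torch.cat([h, pooled.unsqueeze(-1)], dim=-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=0)                       # attention over time, per channel
        # Weighted sum across time for every channel: (C, H, W), fed to the RCN step.
        return (alpha.unsqueeze(-1).unsqueeze(-1) * feats).sum(dim=0)
```

Here each channel is summarized by its spatial mean before scoring; the paper's actual scoring function and conditioning may differ.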

Symmetry ◽  
2020 ◽  
Vol 12 (6) ◽  
pp. 992
Author(s):  
Akshay Aggarwal ◽  
Aniruddha Chauhan ◽  
Deepika Kumar ◽  
Mamta Mittal ◽  
Sudipta Roy ◽  
...  

Traditionally, searching for videos on popular streaming sites like YouTube is performed by taking into consideration the keywords, titles, and descriptions already tagged along with the video. However, the video content itself is not used to answer the user’s query, because of the difficulty of encoding the events in a video and comparing them to the search query. One solution to this problem is to encode the events in a video and then compare them to the query in the same space. One way of encoding meaning into a video is video captioning: the captioned events in the video can be compared to the user’s query, giving an optimal search space for the videos. There have been many developments over the past few years in modeling video-caption generators and sentence embeddings. In this paper, we exploit an end-to-end video captioning model and various sentence embedding techniques that collectively help in building the proposed video-searching method. The YouCook2 dataset was used for the experimentation. Seven sentence embedding techniques were evaluated, of which the Universal Sentence Encoder outperformed the other six, with a median percentile score of 99.51. Thus, this method of searching, when integrated with traditional methods, can help improve the quality of search results.
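As a sketch of the retrieval step, the query and the generated captions can be embedded in the same space and ranked by cosine similarity; this assumes the TF Hub release of the Universal Sentence Encoder (the model family named above), with made-up captions and query:

```python
import numpy as np
import tensorflow_hub as hub

# Universal Sentence Encoder from TF Hub (this exact usage is an assumption
# about the experimental setup, not the authors' code).
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Hypothetical captions produced by a video-captioning model, one per video segment.
captions = [
    "add the chopped onions to the pan",
    "whisk the eggs with salt and pepper",
    "pour the sauce over the noodles",
]
query = "how to saute onions"

cap_vecs = np.asarray(embed(captions))   # (n_captions, 512)
q_vec = np.asarray(embed([query]))[0]    # (512,)

# Rank captions by cosine similarity to the query in the shared embedding space.
sims = cap_vecs @ q_vec / (np.linalg.norm(cap_vecs, axis=1) * np.linalg.norm(q_vec))
print(captions[int(np.argmax(sims))])
```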


Author(s):  
Bin Zhao ◽  
Xuelong Li ◽  
Xiaoqiang Lu

Visual features play an important role in the video captioning task. Since video content is mainly composed of the activities of salient objects, the caption quality of current approaches, which focus only on global frame features while paying less attention to the salient objects, is restricted. To tackle this problem, in this paper, we design an object-aware feature for video captioning, denoted the tube feature. First, Faster R-CNN is employed to extract object regions in frames, and a tube generation method is developed to connect regions from different frames that belong to the same object. After that, an encoder–decoder architecture is constructed for video caption generation. Specifically, the encoder is a bi-directional LSTM, which is utilized to capture the dynamic information of each tube. The decoder is a single LSTM extended with an attention model, which enables our approach to adaptively attend to the most correlated tubes when generating the caption. We evaluate our approach on two benchmark datasets, MSVD and Charades. The experimental results demonstrate the effectiveness of the tube feature in the video captioning task.
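A hedged sketch of tube generation, assuming a simple greedy IoU-linking rule (the paper's actual linking criterion is not spelled out here, so this is an illustration of the idea): detections from consecutive frames are joined into a tube when they overlap sufficiently.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def link_tubes(frame_boxes, thresh=0.5):
    """Greedily link per-frame detections into tubes by IoU overlap.

    frame_boxes: list over frames, each a list of (x1, y1, x2, y2) boxes
    (e.g. Faster R-CNN outputs). Returns tubes as lists of (frame_idx, box).
    """
    tubes = []
    for t, boxes in enumerate(frame_boxes):
        for box in boxes:
            best, best_iou = None, thresh
            for tube in tubes:
                last_t, last_box = tube[-1]
                if last_t == t - 1:  # only extend tubes ending at the previous frame
                    overlap = iou(last_box, box)
                    if overlap > best_iou:
                        best, best_iou = tube, overlap
            if best is not None:
                best.append((t, box))
            else:
                tubes.append([(t, box)])  # start a new tube for an unmatched object
    return tubes
```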


Author(s):  
Wah Chiu ◽  
Michael Sherman ◽  
Jaap Brink

In protein electron crystallography, both low-dose electron diffraction patterns and images are needed to provide accurate amplitudes and phases, respectively, for a 3-dimensional reconstruction. We have demonstrated that the Gatan 1024x1024 model 679 slow-scan CCD camera is useful for recording electron diffraction intensities of glucose-embedded crotoxin complex crystals to 3 Å resolution. The quality of the electron diffraction intensities is high, on the basis of the measured intensity equivalence of the Friedel-related reflections. Moreover, the number of patterns recorded from a single crystal can be as high as 120 under the constraints of radiation damage and electron statistics for the reflections in each pattern.

A limitation of the slow-scan CCD camera for recording electron images of protein crystals arises from the relatively large pixel size, i.e. 24 μm (provided by Gatan). The modulation transfer function of our camera with a P43 scintillator has been determined for 400 keV electrons and shows an amplitude fall-off to 0.25 at 1/60 μm⁻¹.
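As a back-of-the-envelope check (illustrative arithmetic, not from the paper), the 24 μm pixel implies a Nyquist frequency of 1/(2 × 24 μm) = 1/48 μm⁻¹, so the quoted MTF fall-off to 0.25 at 1/60 μm⁻¹ sits at 80% of Nyquist:

```python
# Sampling limit implied by the 24 um pixel (illustrative arithmetic).
pixel_size_um = 24.0
nyquist = 1.0 / (2.0 * pixel_size_um)   # ~0.0208 per um, i.e. 1/48 um^-1
mtf_freq = 1.0 / 60.0                   # frequency where the measured MTF falls to 0.25
print(f"Nyquist: {nyquist:.4f} per um; MTF=0.25 at {mtf_freq:.4f} per um "
      f"({mtf_freq / nyquist:.0%} of Nyquist)")
```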


2019 ◽  
Vol 23 (1) ◽  
pp. 147-159
Author(s):  
Shagan Sah ◽  
Thang Nguyen ◽  
Ray Ptucha

2021 ◽  
Vol 9 (2_suppl) ◽  
pp. 2325967121S0001
Author(s):  
François Sigonney ◽  
Camille Steltzlen ◽  
Pierre Alban Bouché ◽  
Nicolas Pujol

Objectives: The Internet, especially YouTube, is an important and growing source of medical information, yet the content of this information is poorly evaluated. The objective of this study was to analyze the quality of YouTube video content on meniscus repair. The hypothesis was that this source of information is not relevant for patients. Methods: A YouTube search was carried out using the keywords "meniscus repair". Videos had to have more than 10,000 views to be included. The videos were analyzed by two evaluators. Various features of the videos were recorded (number of views, date of publication, likes, dislikes, number of comments, source, type of content and origin of the video). The quality of the video content was analyzed with two validated information scores: the JAMA benchmark score (0 to 4) and the modified DISCERN (MD) score (0 to 5). A specific meniscus repair score (MRSS, scored out of 22) was developed for this study, in the same way that specific scores have been developed for other similar studies (anterior cruciate ligament, spine, etc.). Results: Forty-four videos were included in the study. The average number of views per video was 180,100 (± 222,000), for a total of 7,924,095 views. The majority of the videos were from North America (90.9%). In most cases, the source (uploader) that published the video was a doctor (59.1%); a manufacturer, an institution and a non-medical source were the other sources. The content actually contained information on meniscus repair in only 50% of cases. The mean scores for the JAMA benchmark, MD score and MRSS were 1.6/4 ± 0.75, 1.2/5 ± 1.02 and 4.5/22 ± 4.01, respectively. No correlation was found between the number of views and the quality of the videos. The quality of videos from medical sources was not superior to that of videos from other sources. Conclusion: The content of YouTube videos on meniscus repair is of very low quality. Physicians should inform patients and, more importantly, contribute to improving this content.
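The reported absence of a views–quality correlation is the kind of result a rank-correlation test yields; a minimal SciPy sketch on made-up numbers (the study's raw data are not reproduced here):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-video data (view counts, MRSS quality scores out of 22);
# illustrative only -- not the study's data.
views = np.array([180_000, 950_000, 12_500, 60_000, 410_000, 25_000])
mrss = np.array([4, 2, 9, 6, 3, 7])

# Rank correlation is robust to the heavy-tailed view-count distribution.
rho, p = spearmanr(views, mrss)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```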


2021 ◽  
Author(s):  
Calvin Chan ◽  
Viknesh Sounderajah ◽  
Elisabeth Daniels ◽  
Amish Acharya ◽  
Jonathan Clarke ◽  
...  

BACKGROUND The recent emergency authorisation and rollout of COVID-19 vaccines by regulatory bodies have generated global attention. As the most popular video-sharing platform globally, YouTube is a potent medium for the dissemination of key public health information. Understanding the nature of the available content regarding COVID-19 vaccination on this widely used platform is of substantial public health interest. OBJECTIVE To evaluate the reliability and quality of information in YouTube videos regarding COVID-19 vaccination. METHODS For this cross-sectional study, the phrases ‘coronavirus vaccine’ and ‘COVID-19 vaccine’ were searched on the UK version of YouTube on December 10, 2020. The 200 most-viewed videos of each search were extracted and screened for relevance and English language. Video content and characteristics were extracted and independently rated by two authors against the Health on the Net Foundation Code of Conduct (HONCode) and DISCERN quality criteria for consumer health information. RESULTS Forty-eight videos, with a combined total view count of 30,100,561, were included in the analysis. Topics addressed comprised vaccine science (58%), vaccine trials (58%), side effects (48%), efficacy (35%) and manufacturing (17%). Twenty-one percent of videos encouraged continued public health measures. Only 4.2% of videos made non-factual claims. Ninety-eight percent of video content was scored as having low (60%) or medium (38%) adherence to HONCode principles. Educational channels produced by both medical and non-medical professionals achieved significantly higher DISCERN scores than other categories. The highest DISCERN scores were achieved by educational videos produced by medical professionals (64.3 (58.5–66.3)) and the lowest by independent users (18 (18–20)). CONCLUSIONS The overall quality and reliability of information on YouTube regarding COVID-19 vaccines remains poor. Videos produced by educational channels, especially by medical professionals, were higher in quality and reliability than those produced by other sources, including health-related organisations. Collaboration between health-related organisations and established medical and educational YouTube content producers provides an opportunity for the dissemination of high-quality information regarding COVID-19 vaccination. Such collaboration holds potential as a rapidly implementable public health intervention, aiming to engage a wide audience and increase public awareness and knowledge.


Author(s):  
Anjali Om ◽  
Bobby Ijeoma ◽  
Sara Kebede ◽  
Albert Losken

Background: TikTok is one of the most popular and fastest-growing social media apps in the world. Previous studies have analyzed the quality of patient education information on older video platforms, but the quality of plastic and cosmetic surgery videos on TikTok has not yet been determined. Objectives: To analyze the source and quality of videos on selected cosmetic procedures on TikTok. Methods: The TikTok mobile application was queried for content related to two popular facial procedures (rhinoplasty and blepharoplasty) and two body procedures (breast augmentation and abdominoplasty). Two independent reviewers analyzed video content according to the DISCERN scale, a validated, objective instrument that assesses the quality of information on a scale of 1 to 5. Quality scores were compared between videos produced by medical and nonmedical creators and between different content categories. Results: There were 4.8 billion views and 76.2 million likes across the included videos. Videos were created by MDs (56%) and laypersons (44%). The overall average DISCERN score out of 5 corresponded to very poor video quality for rhinoplasty (1.55), blepharoplasty (1.44), breast augmentation (1.25) and abdominoplasty (1.29). DISCERN scores were significantly higher among videos produced by MDs than by laypersons for all surgeries. Comedy videos consistently had the lowest average DISCERN scores, while educational videos had the highest. Conclusions: It is increasingly important that medical professionals understand the possibility of patient misinformation in the age of social media. We encourage medical providers to be involved in creating quality information on TikTok and to educate patients about misinformation to best support health literacy.


2020 ◽  
Vol 134 (2) ◽  
pp. 135-137 ◽  
Author(s):  
B Ward ◽  
R Bavier ◽  
C Warren ◽  
J Yan ◽  
B Paskhover

Objective: This study evaluated the quality of YouTube content on common paediatric otolaryngology procedures, as this content can influence the opinions and medical decisions of patients. Methods: A total of 120 YouTube videos were compiled for review using the terms ‘adenoid removal’, ‘adenoidectomy’, ‘ear tubes’, ‘tympanostomy’, ‘tonsil removal’ and ‘tonsillectomy’. The DISCERN criteria were used to rate the quality of health information presented in each video. Results: The mean bias DISCERN score was 3.18 and the mean overall DISCERN score was 2.39. Videos featuring US board-certified physicians were rated significantly higher (p < 0.001) than videos without (bias DISCERN score = 3.00 vs 2.38; overall DISCERN score = 3.79 vs 1.55). The videos had been viewed a total of 176,769,549 times. Conclusion: Unbiased, high-quality videos on YouTube are lacking. As patients may rely on this information when making medical decisions, it is important that practitioners continually evaluate and improve this video content. Otolaryngologists should be prepared to discuss YouTube content with patients.


Electronics ◽  
2019 ◽  
Vol 8 (9) ◽  
pp. 962 ◽  
Author(s):  
Usman Ali Khan ◽  
Sang Sun Lee

Device-to-Device (D2D) communication is a major enabler of Vehicle-to-Everything communication in 3rd Generation Partnership Project (3GPP) Release 14. The user equipment can engage in direct communication with the infrastructure, use a relay node, or communicate directly with another device with or without infrastructure support. The user equipment can be either a hand-held cellular device or a moving vehicle. The coexistence of cellular user equipment with vehicular user equipment imposes different Quality of Service (QoS) requirements, due to the rapid mobility of the vehicles and interference. Resource allocation is an important task by which the user equipment is allocated the required resources based on different QoS parameters. In this paper, we introduce the case of three types of users sharing uplink resources: two types of vehicular users, and a third, nearly static user with a handheld cellular phone. Keeping in mind the differing QoS requirements of the three types of users, we calculate the optimum power and then apply a 3-dimensional graph-based matching and hypergraph-coloring-based resource block (RB) allocation.
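A greatly simplified sketch of interference-aware RB assignment via greedy conflict-graph coloring (a plain-graph stand-in for the paper's 3-dimensional matching and hypergraph-coloring scheme; the users, conflicts, and RB count are illustrative):

```python
def greedy_rb_allocation(users, conflicts, num_rbs):
    """Assign each user a resource block so no two conflicting users share one.

    users: iterable of user ids (vehicular or cellular).
    conflicts: set of frozenset({u, v}) pairs that would interfere on the same RB.
    num_rbs: number of available uplink resource blocks.
    """
    assignment = {}
    for u in users:
        # RBs already taken by users that conflict with u.
        taken = {assignment[v] for v in assignment
                 if frozenset({u, v}) in conflicts}
        free = [rb for rb in range(num_rbs) if rb not in taken]
        if not free:
            raise ValueError(f"no conflict-free RB left for user {u}")
        assignment[u] = free[0]  # greedy: lowest-indexed conflict-free RB
    return assignment

# Two vehicular users (v1, v2) and one near-static cellular user (c1):
conflicts = {frozenset({"v1", "v2"}), frozenset({"v2", "c1"})}
print(greedy_rb_allocation(["v1", "v2", "c1"], conflicts, num_rbs=3))
# -> {'v1': 0, 'v2': 1, 'c1': 0}: non-conflicting users may reuse an RB.
```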


Author(s):  
Yitian Yuan ◽  
Tao Mei ◽  
Wenwu Zhu

We have witnessed the tremendous growth of videos over the Internet, where most of these videos are typically paired with abundant sentence descriptions, such as video titles, captions and comments. It has therefore become increasingly crucial to associate specific video segments with the corresponding informative text descriptions, for a deeper understanding of video content. This motivates us to explore an overlooked problem in the research community — temporal sentence localization in video, which aims to automatically determine the start and end points of a given sentence within a paired video. In solving this problem, we face three critical challenges: (1) preserving the intrinsic temporal structure and global context of the video to locate accurate positions over the entire video sequence; (2) fully exploring the sentence semantics to give clear guidance for localization; (3) ensuring the efficiency of the localization method so that it scales to long videos. To address these issues, we propose a novel Attention Based Location Regression (ABLR) approach to localize sentence descriptions in videos in an efficient end-to-end manner. Specifically, to preserve the context information, ABLR first encodes both the video and the sentence via bi-directional LSTM networks. Then, a multi-modal co-attention mechanism is presented to generate both video and sentence attentions. The former reflects the global video structure, while the latter highlights the sentence details for temporal localization. Finally, a novel attention-based location prediction network is designed to regress the temporal coordinates of the sentence from the previous attentions. We evaluate the proposed ABLR approach on two public datasets, ActivityNet Captions and TACoS. Experimental results show that ABLR significantly outperforms existing approaches in both effectiveness and efficiency.
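A stripped-down PyTorch sketch of the attend-then-regress step in the spirit of ABLR (the full model uses Bi-LSTM encoders and multi-modal co-attention; names and dimensions here are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLocationRegressor(nn.Module):
    """Illustrative attention-based location regression (not the authors' code)."""
    def __init__(self, video_dim, sent_dim, hidden=256):
        super().__init__()
        self.v_proj = nn.Linear(video_dim, hidden)
        self.s_proj = nn.Linear(sent_dim, hidden)
        self.regress = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2), nn.Sigmoid(),  # normalized (start, end) in [0, 1]
        )

    def forward(self, video_feats, sent_feat):
        # video_feats: (T, video_dim) clip features; sent_feat: (sent_dim,) sentence encoding.
        v = self.v_proj(video_feats)    # (T, hidden)
        s = self.s_proj(sent_feat)      # (hidden,)
        attn = F.softmax(v @ s, dim=0)  # (T,) sentence-conditioned attention over clips
        pooled = attn @ v               # (hidden,) attention-weighted video summary
        return self.regress(pooled)     # (2,) start/end as fractions of video duration

# Example: 40 clip features of dim 500, a 300-d sentence encoding.
model = AttentionLocationRegressor(video_dim=500, sent_dim=300)
start_end = model(torch.randn(40, 500), torch.randn(300))
```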

