video segment
Recently Published Documents


TOTAL DOCUMENTS: 41 (FIVE YEARS: 12)
H-INDEX: 7 (FIVE YEARS: 0)

2021 ◽  
pp. 214-219
Author(s):  
Maryam Sadat Mirzaei ◽  
Kourosh Meshgi

This paper focuses on Partial and Synchronized Caption (PSC) as a tool to train L2 listening and introduces new features that address speech-related difficulties. PSC is an intelligent caption that extensively processes the audio and transcript to detect and present words or phrases that are difficult for L2 learners. With the new features, learners benefit from repetition and slowdown of audio segments that are automatically labeled as difficult. When the speech rate is high, the system slows the audio down to the standard rate of speech. For disfluencies in speech (e.g., breached boundaries), the system generates the caption and repeats that video segment. In our experiments, intermediate L2 learners of English watched videos with different captions and functionalities, provided feedback on the new PSC features, and took a series of tests. The smart repetition and slowdown components received positive learner feedback and led to significant improvement in L2 listening recognition.
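As a rough illustration of the slowdown-and-repeat logic described above, the sketch below decides how a captioned segment could be replayed for a learner; the speech-rate threshold, data layout, and function names are assumptions for illustration, not the PSC implementation.

```python
# Hypothetical sketch of PSC-style segment handling; threshold, names, and
# data layout are illustrative assumptions, not the authors' code.

STANDARD_RATE_WPS = 2.5   # assumed "standard" speech rate, in words per second


def plan_playback(segment):
    """Decide how a captioned segment should be replayed for the learner.

    `segment` is assumed to be a dict with:
      - "words":      number of words spoken in the segment
      - "duration":   segment length in seconds
      - "disfluent":  True if a disfluency (e.g. breached boundary) was detected
    """
    rate = segment["words"] / segment["duration"]
    actions = []

    if rate > STANDARD_RATE_WPS:
        # Slow fast speech down to the standard rate (speed factor < 1).
        actions.append(("slow_down", STANDARD_RATE_WPS / rate))

    if segment["disfluent"]:
        # Show the caption and replay the segment once.
        actions.append(("caption_and_repeat", 1))

    return actions


print(plan_playback({"words": 18, "duration": 5.0, "disfluent": True}))
# -> [('slow_down', 0.69...), ('caption_and_repeat', 1)]
```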


2021 ◽  
pp. 1-12
Author(s):  
Sandhya ◽  
Vinay ◽  
V. Manchaiah

Purpose: Multimodal sensory integration in audiovisual (AV) speech perception is a naturally occurring phenomenon. Modality-specific responses, such as auditory-left, auditory-right, and visual responses to dichotic incongruent AV speech stimuli, help in understanding how AV speech is processed through each input modality. The distribution of activity in the frontal motor areas involved in speech production has been shown to correlate with how subjects perceive the same syllable differently or perceive different syllables. This study investigated the distribution of modality-specific responses to dichotic incongruent AV speech stimuli by simultaneously presenting consonant–vowel (CV) syllables with different places of articulation to the participant's left and right ears and visually.

Design: A dichotic experimental design was adopted. Six stop CV syllables /pa/, /ta/, /ka/, /ba/, /da/, and /ga/ were assembled to create dichotic incongruent AV speech material. Participants included 40 native speakers of Norwegian (20 women, M age = 22.6 years, SD = 2.43 years; 20 men, M age = 23.7 years, SD = 2.08 years).

Results: Under dichotic listening conditions, velar CV syllables yielded the highest scores in the respective ears, which may be explained by the stimulus dominance of velar consonants shown in previous studies. However, when the dichotic auditory stimuli were accompanied by an incongruent video segment, a visually distinct video segment appeared to draw some participants' attention to the video, thereby reducing overall recognition of the dominant syllable. The findings also suggest that response times to incongruent AV stimuli may be shorter in women than in men.

Conclusion: The identification of the left-audio, right-audio, and visual segments in dichotic incongruent AV stimuli depends on the place of articulation, stimulus dominance, and voice onset time of the CV syllables.
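For illustration only, the sketch below enumerates candidate dichotic incongruent AV combinations from the six stop CV syllables; the rule that all three channels must differ is an assumption for this sketch, not the study's actual stimulus list.

```python
# Illustrative enumeration of dichotic incongruent AV trials; the pairing rule
# is an assumption, not taken from the study's materials.
from itertools import permutations

SYLLABLES = ["pa", "ta", "ka", "ba", "da", "ga"]  # six stop CV syllables

# Each trial pairs a left-ear syllable, a right-ear syllable, and a visual
# (video) syllable; only fully incongruent triples are kept in this sketch.
trials = [
    {"left_ear": left, "right_ear": right, "video": video}
    for left, right, video in permutations(SYLLABLES, 3)
]

print(len(trials))   # 120 candidate incongruent AV combinations
print(trials[0])     # {'left_ear': 'pa', 'right_ear': 'ta', 'video': 'ka'}
```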


Author(s):  
Xiaoming Peng ◽  
Abdesselam Bouzerdoum ◽  
Son Lam Phung

Existing methods for dynamic scene recognition mostly use global features extracted from the entire video frame or a video segment. In this paper, a trajectory-based dynamic scene recognition method is proposed. A trajectory is formed by a pixel moving across consecutive frames of a video segment, and the local regions surrounding the trajectory provide useful appearance and motion information about a portion of the segment. The proposed method works in several stages. First, dense and evenly distributed trajectories are extracted from a video segment. Then, fully-connected-layer features are extracted from each trajectory using a pre-trained Convolutional Neural Network (CNN) model, forming a feature sequence. Next, these feature sequences are fed into a Long Short-Term Memory (LSTM) network to learn their temporal behavior. Finally, by aggregating the information across trajectories, a global representation of the video segment is obtained for classification. The LSTM is trained on synthetic trajectory feature sequences rather than real ones; the synthetic sequences are generated with a series of generative adversarial networks (GANs). In addition to classification, category-specific discriminative trajectories are located in a video segment, which helps reveal which portions of the segment are more important than others. This is achieved by formulating an optimization problem that learns discriminative part detectors for all categories simultaneously. Experimental results on two benchmark dynamic scene datasets show that the proposed method is very competitive with six other methods.
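A minimal sketch of the per-trajectory feature-sequence modeling and aggregation steps, assuming a PyTorch implementation; the layer sizes, mean-pooling aggregation, class count, and all names are illustrative choices, not the paper's architecture.

```python
# Sketch of the trajectory-feature -> LSTM -> aggregation pipeline in PyTorch;
# all hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class TrajectorySceneClassifier(nn.Module):
    def __init__(self, feat_dim=4096, hidden_dim=512, num_classes=13):
        super().__init__()
        # Models the temporal behavior of one trajectory's CNN feature sequence.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, traj_feats):
        # traj_feats: (num_trajectories, seq_len, feat_dim), one CNN feature
        # vector per frame along each trajectory of the video segment.
        _, (h_n, _) = self.lstm(traj_feats)   # final hidden state per trajectory
        per_traj = h_n[-1]                    # (num_trajectories, hidden_dim)
        video_repr = per_traj.mean(dim=0)     # aggregate into a global representation
        return self.classifier(video_repr)    # class scores for the segment


# Toy usage: 50 trajectories, each spanning 15 frames, with 4096-D fc features.
model = TrajectorySceneClassifier()
scores = model(torch.randn(50, 15, 4096))
print(scores.shape)   # torch.Size([13])
```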


2021 ◽  
pp. 108027
Author(s):  
Xiao Sun ◽  
Xiang Long ◽  
Dongliang He ◽  
Shilei Wen ◽  
Zhouhui Lian
Keyword(s):  

Author(s):  
Ghulam Mujtaba ◽  
Sangsoon Lee ◽  
Jaehyoun Kim ◽  
Eun-Seok Ryu

This paper proposes a novel, lightweight method to generate animated graphical interchange format images (GIFs) using the computational resources of a client device. The method analyzes an acoustic feature from the climax section of an audio file to estimate the timestamp corresponding to the maximum pitch. It then processes only a small video segment, rather than the entire video, to generate the GIF, which makes the proposed method computationally efficient, unlike baseline approaches that use entire videos to create GIFs. Because only the audio file and the required video segment are retrieved and used, communication and storage efficiency are also improved in the GIF generation process. Experiments on a set of 16 videos show that the proposed approach is 3.76 times more computationally efficient than a baseline method on an Nvidia Jetson TX2. Additionally, in a qualitative evaluation, the GIFs generated using the proposed method received higher overall ratings than those generated by the baseline method. To the best of our knowledge, this is the first technique that uses an acoustic feature in the GIF generation process.
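A rough sketch of the pitch-peak-to-GIF idea, assuming librosa for pitch tracking and the ffmpeg CLI for segment extraction; the file names, window length, and GIF settings are placeholders, not the authors' implementation.

```python
# Sketch: find the timestamp of maximum pitch, then cut a short segment and
# render it as a GIF; all paths and parameters are illustrative assumptions.
import subprocess
import numpy as np
import librosa

AUDIO_PATH = "climax_section.wav"   # assumed pre-extracted climax audio
VIDEO_PATH = "input_video.mp4"
GIF_PATH = "highlight.gif"
SEGMENT_SECONDS = 5.0               # assumed GIF duration

# 1. Estimate the fundamental frequency over time and locate its peak.
y, sr = librosa.load(AUDIO_PATH)
f0 = librosa.yin(y, fmin=librosa.note_to_hz("C2"),
                 fmax=librosa.note_to_hz("C7"), sr=sr)
times = librosa.times_like(f0, sr=sr)
peak_time = float(times[np.nanargmax(f0)])   # timestamp of maximum pitch

# 2. Cut a short video segment around the peak and render it as a GIF.
start = max(0.0, peak_time - SEGMENT_SECONDS / 2)
subprocess.run(
    [
        "ffmpeg", "-y",
        "-ss", f"{start:.2f}", "-t", f"{SEGMENT_SECONDS:.2f}",
        "-i", VIDEO_PATH,
        "-vf", "fps=10,scale=320:-1:flags=lanczos",
        GIF_PATH,
    ],
    check=True,
)
print(f"Pitch peak at {peak_time:.2f}s; wrote {GIF_PATH}")
```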


2021 ◽  
Vol 18 (2) ◽  
pp. 215-227
Author(s):  
Zeyu Hu ◽  
Chunjing Hu ◽  
Zexu Li ◽  
Yong Li ◽  
Guiming Wei
