Coding Video Data, Audio Data, and Images

Author(s):  
Udo Kuckartz ◽  
Stefan Rädiker
Author(s):  
Andreas M. Kist ◽  
Pablo Gómez ◽  
Denis Dubrovskiy ◽  
Patrick Schlegel ◽  
Melda Kunduk ◽  
...  

Purpose: High-speed videoendoscopy (HSV) is an emerging endoscopy technique for assessing and diagnosing voice disorders, but it is barely used in the clinic because dedicated software to analyze the data has been lacking. HSV makes it possible to quantify the vocal fold oscillations by segmenting the glottal area. This challenging task has been tackled by various studies; however, the proposed approaches are mostly limited and not suitable for daily clinical routine. Method: We developed user-friendly software in C# that allows the editing, motion correction, segmentation, and quantitative analysis of HSV data. We further provide pretrained deep neural networks for fully automatic glottis segmentation. Results: We freely provide our software, Glottis Analysis Tools (GAT). GAT offers a general threshold-based region-growing platform that enables the user to analyze data from various sources, such as in vivo recordings, ex vivo recordings, and high-speed footage of artificial vocal folds. Additionally, especially for in vivo recordings, we provide three robust neural networks at different speed and quality settings that allow the fully automatic glottis segmentation needed for use by untrained personnel. GAT further evaluates video and audio data in parallel and can extract various features from the video data, among others the glottal area waveform (GAW), that is, the glottal area changing over time. In total, GAT provides 79 unique quantitative analysis parameters for video- and audio-based signals. Many of these parameters have already been shown to reflect voice disorders, highlighting the clinical importance and usefulness of the GAT software. Conclusion: GAT is a unique tool to process HSV and audio data to determine quantitative, clinically relevant parameters for research, diagnosis, and treatment of laryngeal disorders. Supplemental Material: https://doi.org/10.23641/asha.14575533
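The central video feature mentioned above, the glottal area waveform, is simply the segmented glottal area per frame; periodicity measures such as the fundamental frequency then follow from it. Below is a minimal Python sketch of that idea, assuming binary segmentation masks as input; it is an illustration, not GAT's actual C# implementation.

```python
import numpy as np

def glottal_area_waveform(masks):
    """Glottal area waveform (GAW): glottal pixel count per frame.

    masks: (T, H, W) boolean array of per-frame glottis segmentations.
    """
    return masks.reshape(masks.shape[0], -1).sum(axis=1).astype(float)

def fundamental_frequency(gaw, fps):
    """Estimate F0 (Hz) of the vocal fold oscillation from the GAW
    via the dominant peak of its magnitude spectrum."""
    gaw = gaw - gaw.mean()                      # remove the DC offset
    spectrum = np.abs(np.fft.rfft(gaw))
    freqs = np.fft.rfftfreq(len(gaw), d=1.0 / fps)
    return freqs[spectrum[1:].argmax() + 1]     # skip the 0 Hz bin

# Toy example: 4000 fps recording of a 100 Hz "glottis" oscillation
t = np.arange(4000) / 4000.0
area = (np.sin(2 * np.pi * 100 * t) > 0)[:, None, None]   # (T, 1, 1) masks
masks = np.repeat(np.repeat(area, 8, 1), 8, 2)            # (T, 8, 8)
print(fundamental_frequency(glottal_area_waveform(masks), fps=4000))  # ~100.0
```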


Author(s):  
Michael Odzer ◽  
Kristina Francke

Abstract The sound of waves breaking on shore, or against an obstruction or jetty, is an immediately recognizable sound pattern which could potentially be employed by a sensor system to identify obstructions. If frequency patterns produced by breaking waves can be reproduced and mapped in a laboratory setting, a foundational understanding of the physics behind this process could be established, which could then be employed in sensor development for navigation. This study explores whether wave-breaking frequencies correlate with the physics behind the collapsing of the wave, and whether frequencies of breaking waves recorded in a laboratory tank follow the same pattern as frequencies produced by ocean waves breaking on a beach. An artificial “beach” was engineered to replicate breaking waves inside a laboratory wave tank. Video and audio recordings of waves breaking in the tank were obtained, and audio of ocean waves breaking on the shoreline was recorded. The audio data were analysed as frequency charts, and the video data were evaluated to correlate bubble sizes with the frequencies produced by the waves. The results supported the hypothesis that frequencies produced by breaking waves in the wave tank followed the same pattern as those produced by ocean waves. Analysis utilizing a solution to the Rayleigh-Plesset equation showed that the bubble sizes produced by breaking waves were inversely related to the pattern of frequencies. This pattern can be reproduced in a controlled laboratory environment and extrapolated for the development of navigational sensors, with potential applications in marine navigation such as autonomous ocean vehicles.
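The inverse relation between bubble size and emitted frequency follows from the linearized, small-amplitude solution of the Rayleigh-Plesset equation, known as Minnaert's resonance formula. A minimal Python sketch using standard textbook constants for air bubbles in water; this is an illustration of the physics, not the study's analysis code:

```python
import numpy as np

def minnaert_frequency(radius_m, p0=101325.0, rho=1000.0, gamma=1.4):
    """Resonance frequency (Hz) of an air bubble of given radius (m) in water.

    Minnaert's formula, the small-amplitude (linearized) solution of the
    Rayleigh-Plesset equation: f0 = sqrt(3*gamma*p0 / rho) / (2*pi*R0).
    Frequency is inversely proportional to bubble radius.
    """
    return np.sqrt(3.0 * gamma * p0 / rho) / (2.0 * np.pi * radius_m)

for r_mm in (0.5, 1.0, 2.0, 5.0):
    print(f"R0 = {r_mm} mm -> f0 ~ {minnaert_frequency(r_mm * 1e-3):.0f} Hz")
# A 1 mm bubble resonates at roughly 3.3 kHz; doubling the radius
# halves the frequency, matching the inverse pattern reported above.
```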


2021 ◽  
Vol 3 ◽  
Author(s):  
Sushovan Chanda ◽  
Kedar Fitwe ◽  
Gauri Deshpande ◽  
Björn W. Schuller ◽  
Sachin Patel

Research on self-efficacy and confidence has spread across several subfields of psychology and neuroscience. One's confidence plays a crucial role in the formation of attitude and communication skills, and differentiating between levels of confidence is therefore important in this domain. With recent advances in extracting behavioral insights from signals in multiple applications, confidence detection has gained great importance. One such prominent application is detecting confidence in interview conversations. We have collected an audiovisual data set of interview conversations with 34 candidates. Every response (from each candidate) in this data set is labeled with one of three levels of confidence: high, medium, and low. Furthermore, we have developed algorithms to efficiently compute such behavioral confidence from speech and video. A deep learning architecture is proposed for detecting confidence levels (high, medium, and low) from an audiovisual clip recorded during an interview. The achieved unweighted average recall (UAR) reaches 85.9% on audio data and 73.6% on video data captured from an interview session.
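UAR, the metric reported above, is the unweighted mean of per-class recalls, which keeps majority classes from dominating the score. A minimal sketch of how it is computed (the labels are illustrative, not the study's data):

```python
from sklearn.metrics import recall_score

# Three confidence classes: 0 = low, 1 = medium, 2 = high (assumed encoding)
y_true = [2, 2, 1, 0, 1, 2, 0, 0, 1, 2]
y_pred = [2, 1, 1, 0, 1, 2, 0, 1, 1, 2]

# UAR = mean of per-class recalls; unlike plain accuracy, it is
# insensitive to class imbalance. sklearn calls this "macro" recall.
uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR: {uar:.1%}")   # (2/3 + 3/3 + 3/4) / 3 ~ 80.6%
```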


2018 ◽  
Vol 1 (1) ◽  
Author(s):  
Zhitao Li

The audio and video decoding and synchronized playback system for MPEG-2 TS streams is designed and implemented on an ARM embedded system. In this system, a hardware multi-format codec (MFC) is embedded in the ARM processor; to make full use of this resource, the MFC decodes the video data, while the audio data are decoded using the open-source MAD (libmad) library. The V4L2 (Video for Linux 2) driver interface and the ALSA (Advanced Linux Sound Architecture) library are used to implement video display and audio playback, respectively. Because the video frame playback period and the hardware processing delay are inconsistent, there is a time difference between the audio and video data operations, which causes audio and video playback to drift out of sync. Therefore, we synchronize video playback to the audio playback stream, so that audio and video play back in sync. Test results show that the designed decoding and synchronized playback system can correctly decode and synchronize audio and video data.
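The strategy described, slaving video presentation to the audio playback stream, is commonly implemented as an audio-master clock: each video frame's presentation timestamp is compared against the current audio position, and the frame is delayed or dropped accordingly. A minimal Python sketch of that logic; the frame and clock interfaces are hypothetical stand-ins, not the paper's implementation:

```python
import time

AV_SYNC_TOLERANCE = 0.040  # seconds; frames within +/-40 ms count as in sync

def display(frame):
    """Hypothetical render call standing in for the video driver."""
    print("showing", frame)

def present_video_frames(frames, audio_clock):
    """Slave video presentation to the audio clock (audio-master sync).

    frames: iterable of (pts_seconds, frame) in decode order.
    audio_clock: callable returning the current audio playback position in
                 seconds (e.g. derived from samples written to the sound card).
    """
    for pts, frame in frames:
        delta = pts - audio_clock()
        if delta > AV_SYNC_TOLERANCE:
            time.sleep(delta)        # video is ahead: wait for audio
        elif delta < -AV_SYNC_TOLERANCE:
            continue                 # video is late: drop the frame
        display(frame)

# Demo: a 25 fps stream synchronized against elapsed wall-clock "audio" time
start = time.monotonic()
present_video_frames(
    frames=[(i / 25.0, f"frame{i}") for i in range(5)],
    audio_clock=lambda: time.monotonic() - start,
)
```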


Author(s):  
Paul McIlvenny

Consumer versions of the passive 360° and stereoscopic omni-directional camera have recently come to market, generating new possibilities for qualitative video data collection. This paper discusses some of the methodological issues raised by collecting, manipulating and analysing complex video data recorded with 360° cameras and ambisonic microphones. It also reports on the development of a simple, yet powerful prototype to support focused engagement with such 360° recordings of a scene. The paper proposes that we ‘inhabit’ video through a tangible interface in virtual reality (VR) in order to explore complex spatial video and audio recordings of a single scene in which social interaction took place. The prototype is a software package called AVA360VR (‘Annotate, Visualise, Analyse 360° video in VR’). The paper is illustrated through a number of video clips, including a composite video of raw and semi-processed multi-cam recordings, a 360° video with spatial audio, a video comprising a sequence of static 360° screenshots of the AVA360VR interface, and a video comprising several screen capture clips of actual use of the tool. The paper discusses the prototype’s development and its analytical possibilities when inhabiting spatial video and audio footage as a complementary mode of re-presenting, engaging with, sharing and collaborating on interactional video data.


Author(s):  
Alexander Teixeira Kalkhoff ◽  
Dennis Dressel

This article examines collaborative utterances in interaction from a multimodal perspective. Whereas prior research has analyzed co-constructions ex post as the result of local speaker collaboration on the basis of audio data, this study shifts the focus to co-constructing as a highly coordinated, embodied practice. By examining video data of Spanish interactions, this research aims to show how speakers systematically deploy a variety of linguistic and bodily resources that serve as points of joint orientation throughout the process of co-constructing utterances.


Author(s):  
Erin M. Lesaigle ◽  
David W. Biers

Fifteen usability professionals participated in a usability test under one of three simulated real-time viewing conditions: (1) Screen data (S), where the evaluators saw only the image of the user's computer screen; (2) Screen plus Audio data (SA), where the user's verbalizations could be heard in addition to viewing the screen image; and (3) Screen plus Audio plus Video data (SAV), where the evaluators additionally saw an image of the user's face in real time. Results indicated no significant differences in the total number of problems found under the three viewing conditions, although there was some evidence that the problem space differed, particularly from the subjective questionnaire data collected directly from the users. In rating the severity of the problems encountered, agreement among the usability professionals was low and did not vary as a function of the usability professional's years of experience. More importantly, however, ratings of severity varied as a function of viewing condition, with the usability professionals in the face (SAV) condition perceiving the same problems to be more severe. When considering only the most severe problems (on which there was agreement in severity ratings), fewer severe problems were uncovered with the questionnaires than under the three real-time viewing conditions. The results are discussed in terms of real-time usability evaluation and the implications for remote usability testing.


2008 ◽  
Vol 18 (06) ◽  
pp. 481-489 ◽  
Author(s):  
COLIN FYFE ◽  
WESAM BARBAKH ◽  
WEI CHUAN OOI ◽  
HANSEOK KO

We review a new form of self-organizing map which is based on a nonlinear projection of latent points into data space, identical to that performed in the Generative Topographic Mapping (GTM) [1]. But whereas the GTM is an extension of a mixture of experts, this model is an extension of a product of experts [2]. We show visualisation and clustering results on a data set composed of video data of lips uttering 5 Korean vowels. Finally, we note that we may dispense with the probabilistic underpinnings of the product of experts and derive the same algorithm as a minimisation of mean squared error between the prototypes and the data. This leads us to suggest a new algorithm which incorporates local and global information in the clustering. Both of the new algorithms achieve better results than the standard Self-Organizing Map.
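A minimal sketch of the non-probabilistic reading suggested above: latent grid points are projected into data space through fixed RBF basis functions (as in the GTM), and the projection weights are updated to minimize the mean squared error between the prototypes and the data assigned to them. This is one plausible reconstruction for illustration, not the authors' exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Latent space: a 1-D grid of K points, mapped through M RBF basis functions
K, M, D = 20, 5, 2                       # latent points, basis functions, data dim
latent = np.linspace(-1, 1, K)
centres = np.linspace(-1, 1, M)
Phi = np.exp(-((latent[:, None] - centres[None, :]) ** 2) / 0.1)   # (K, M)

X = rng.normal(size=(500, D))            # toy data set
W = rng.normal(scale=0.1, size=(M, D))   # projection weights to be learned

for _ in range(50):
    Y = Phi @ W                                    # prototypes in data space (K, D)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    winners = d2.argmin(axis=1)                    # nearest prototype per point
    # Minimize mean squared error between prototypes and assigned data:
    # least-squares update of W given the current hard assignments.
    targets = np.array([X[winners == k].mean(0) if (winners == k).any() else Y[k]
                        for k in range(K)])
    W, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
```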


Author(s):  
Marcel Nikmon ◽  
Roman Budjač ◽  
Daniel Kuchár ◽  
Peter Schreiber ◽  
Dagmar Janáčová

Abstract Deep learning is a branch of machine learning, which is in turn a branch of artificial intelligence; machine learning covers a group of various technologies, and deep learning is one of them. Deep learning is now an integral part of data classification practice. This paper introduces the possibilities of classification using convolutional networks. Experiments focused on audio and video data show different approaches to data classification. Most experiments use the well-known pretrained AlexNet network with various types of input data pre-processing; other neural network architectures are also compared, and we show the results of training on small and larger datasets. The paper describes eight different kinds of experiments. Several training sessions were conducted in each experiment, with different aspects being monitored. The focus was on the effect of batch size on the accuracy of deep learning, among many other parameters that affect deep learning [1].
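A minimal PyTorch sketch of the setup described: a pretrained AlexNet with its classifier head replaced, trained at several batch sizes. The dataset path, class count, and hyperparameters are assumptions, and the weights-loading call is the API of recent torchvision versions:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Pretrained AlexNet with its final layer replaced for our classes
num_classes = 5                                    # assumed number of classes
model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
model.classifier[6] = nn.Linear(4096, num_classes)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                 # AlexNet's expected input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
# "data/train" is a hypothetical ImageFolder-style directory of spectrograms
# (audio) or video frames, one subfolder per class.
train_set = datasets.ImageFolder("data/train", transform=preprocess)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

for batch_size in (16, 32, 64, 128):               # the parameter under study
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    for images, labels in loader:                  # one epoch per batch size
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```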

