A Deep Learning Enhanced Novel Software Tool for Laryngeal Dynamics Analysis

Author(s):  
Andreas M. Kist ◽  
Pablo Gómez ◽  
Denis Dubrovskiy ◽  
Patrick Schlegel ◽  
Melda Kunduk ◽  
...  

Purpose High-speed videoendoscopy (HSV) is an emerging but rarely used endoscopic technique for assessing and diagnosing voice disorders in the clinic, largely because of the lack of dedicated software to analyze the data. HSV allows the vocal fold oscillations to be quantified by segmenting the glottal area. This challenging task has been tackled by various studies; however, the proposed approaches are mostly limited and not suitable for daily clinical routine. Method We developed user-friendly software in C# that allows the editing, motion correction, segmentation, and quantitative analysis of HSV data. We further provide pretrained deep neural networks for fully automatic glottis segmentation. Results We freely provide our software, Glottis Analysis Tools (GAT). GAT offers a general threshold-based region-growing platform that enables the user to analyze data from various sources, such as in vivo recordings, ex vivo recordings, and high-speed footage of artificial vocal folds. Additionally, especially for in vivo recordings, we provide three robust neural networks at various speed and quality settings to allow the fully automatic glottis segmentation needed for application by untrained personnel. GAT further evaluates video and audio data in parallel and is able to extract various features from the video data, among others the glottal area waveform, that is, the glottal area changing over time. In total, GAT provides 79 unique quantitative analysis parameters for video- and audio-based signals. Many of these parameters have already been shown to reflect voice disorders, highlighting the clinical importance and usefulness of the GAT software. Conclusion GAT is a unique tool to process HSV and audio data to determine quantitative, clinically relevant parameters for research, diagnosis, and treatment of laryngeal disorders. Supplemental Material https://doi.org/10.23641/asha.14575533
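The abstract above describes GAT's threshold-based region-growing platform for glottal area segmentation. The exact algorithm used in GAT is not specified here, so the following is only a minimal sketch of the generic technique it names: grow a region from a user-provided seed pixel, accepting neighbors whose intensity stays within a threshold of the seed value. The toy frame, seed, and threshold are all illustrative assumptions.

```python
from collections import deque
import numpy as np

def region_grow(frame, seed, threshold):
    """Grow a region from `seed` (row, col), accepting 4-connected
    neighbors whose intensity differs from the seed by <= threshold.
    Returns a boolean mask of the segmented (e.g. glottal) area."""
    h, w = frame.shape
    mask = np.zeros((h, w), dtype=bool)
    seed_val = float(frame[seed])
    queue = deque([seed])
    mask[seed] = True
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < h and 0 <= nc < w and not mask[nr, nc]
                    and abs(float(frame[nr, nc]) - seed_val) <= threshold):
                mask[nr, nc] = True
                queue.append((nr, nc))
    return mask

# Toy frame: a dark "glottis" (low intensity) surrounded by bright tissue.
frame = np.full((8, 8), 200, dtype=np.uint8)
frame[2:6, 3:5] = 20                      # dark opening, 8 pixels
mask = region_grow(frame, seed=(3, 3), threshold=30)
glottal_area = int(mask.sum())            # per-frame area -> GAW over time
```

Summing the mask per video frame yields the glottal area waveform (GAW) the abstract refers to.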

Author(s):  
Michael Odzer ◽  
Kristina Francke

Abstract The sound of waves breaking on shore, or against an obstruction or jetty, is an immediately recognizable sound pattern which could potentially be employed by a sensor system to identify obstructions. If frequency patterns produced by breaking waves can be reproduced and mapped in a laboratory setting, a foundational understanding of the physics behind this process could be established, which could then be employed in sensor development for navigation. This study explores whether wave-breaking frequencies correlate with the physics behind the collapsing of the wave, and whether frequencies of breaking waves recorded in a laboratory tank will follow the same pattern as frequencies produced by ocean waves breaking on a beach. An artificial “beach” was engineered to replicate breaking waves inside a laboratory wave tank. Video and audio recordings of waves breaking in the tank were obtained, and audio of ocean waves breaking on the shoreline was recorded. The audio data was analysed in frequency charts. The video data was evaluated to correlate bubble sizes to frequencies produced by the waves. The results supported the hypothesis that frequencies produced by breaking waves in the wave tank followed the same pattern as those produced by ocean waves. Analysis utilizing a solution to the Rayleigh-Plesset equation showed that the bubble sizes produced by breaking waves were inversely related to the pattern of frequencies. This pattern can be reproduced in a controlled laboratory environment and extrapolated for use in developing navigational sensors for potential applications in marine navigation such as for use with autonomous ocean vehicles.
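The study above reports that bubble sizes are inversely related to the frequencies produced, based on a solution to the Rayleigh-Plesset equation. The standard linearized result of that equation is the Minnaert resonance frequency, which makes the inverse radius-frequency relation explicit; the sketch below uses that textbook formula (water at atmospheric pressure), not the authors' specific solution.

```python
import math

def minnaert_frequency(radius_m, p0=101_325.0, rho=1_000.0, gamma=1.4):
    """Minnaert resonance frequency (Hz) of a spherical air bubble of
    radius `radius_m` in water: the linearized Rayleigh-Plesset result
    f0 = sqrt(3*gamma*p0/rho) / (2*pi*R0), all SI units."""
    return math.sqrt(3 * gamma * p0 / rho) / (2 * math.pi * radius_m)

# Frequency is inversely proportional to bubble radius:
f_1mm = minnaert_frequency(1e-3)   # roughly 3.3 kHz
f_2mm = minnaert_frequency(2e-3)   # exactly half the 1 mm frequency
```

Doubling the bubble radius halves the resonance frequency, consistent with the inverse pattern the study observed.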


Author(s):  
Paul McIlvenny

Consumer versions of the passive 360° and stereoscopic omni-directional camera have recently come to market, generating new possibilities for qualitative video data collection. This paper discusses some of the methodological issues raised by collecting, manipulating and analysing complex video data recorded with 360° cameras and ambisonic microphones. It also reports on the development of a simple, yet powerful prototype to support focused engagement with such 360° recordings of a scene. The paper proposes that we ‘inhabit’ video through a tangible interface in virtual reality (VR) in order to explore complex spatial video and audio recordings of a single scene in which social interaction took place. The prototype is a software package called AVA360VR (‘Annotate, Visualise, Analyse 360° video in VR’). The paper is illustrated through a number of video clips, including a composite video of raw and semi-processed multi-cam recordings, a 360° video with spatial audio, a video comprising a sequence of static 360° screenshots of the AVA360VR interface, and a video comprising several screen capture clips of actual use of the tool. The paper discusses the prototype’s development and its analytical possibilities when inhabiting spatial video and audio footage as a complementary mode of re-presenting, engaging with, sharing and collaborating on interactional video data.


Author(s):  
Zhipeng Lou ◽  
Junshi Wang ◽  
James J. Daniero ◽  
Haibo Dong ◽  
Jinxiang Xi

Abstract In this paper, a numerical approach combined with experiments is employed to characterize the airflow through the vocal folds. Rabbits are used to perform in vivo magnetic resonance imaging (MRI) experiments, and the MRI scan data are used directly for the three-dimensional (3D) reconstruction of a high-fidelity model. The vibration modes are observed via the in vivo high-speed videoendoscopy (HSVM) technique, and the time-dependent glottal height is evaluated dynamically to validate the 3D reconstructed model. Seventy-two sets of rabbit in vivo high-speed recordings are evaluated to identify the most common vibration mode. The reconstruction is based mainly on MRI data, while the HSVM recordings support and validate the 3D model. A sharp-interface immersed-boundary-method (IBM)-based compressible flow solver is employed to compute the airflow. The primary purpose of the computational effort is to characterize the influence of the vocal folds on the airflow and on airflow-induced phonation. The vocal fold kinematics and vibration modes are quantified, and the vortex structures are analyzed under the influence of the vocal folds. The results show significant effects of the vocal fold height on the vortex structure, vorticity, and velocity. The reconstructed 3D model from this work helps to bring further insight into the rabbit phonation mechanism. The results offer potential improvements for the diagnosis of human vocal fold dysfunction and phonation disorders.


Author(s):  
James S. Drechsel ◽  
Jacob B. Munger ◽  
Allyson A. Pulsipher ◽  
Scott L. Thomson

The human vocal folds are responsible for sound production during normal speech, and the study of their flow-induced vibrations can lead to improved prevention and treatment of voice disorders. However, studying the vocal folds in vivo or using excised larynges has several disadvantages. Alternatives therefore exist in the form of synthetic (physical) and/or computational vocal fold models. To be physiologically relevant, the behavior and properties of these models must reasonably match those of the human vocal folds.


2008 ◽  
Vol 18 (06) ◽  
pp. 481-489 ◽  
Author(s):  
COLIN FYFE ◽  
WESAM BARBAKH ◽  
WEI CHUAN OOI ◽  
HANSEOK KO

We review a new form of self-organizing map which is based on a nonlinear projection of latent points into data space, identical to that performed in the Generative Topographic Mapping (GTM) [1]. But whereas the GTM is an extension of a mixture of experts, this model is an extension of a product of experts [2]. We show visualisation and clustering results on a data set composed of video data of lips uttering 5 Korean vowels. Finally, we note that we may dispense with the probabilistic underpinnings of the product of experts and derive the same algorithm as a minimisation of the mean squared error between the prototypes and the data. This leads us to suggest a new algorithm which incorporates local and global information in the clustering. Both of the new algorithms achieve better results than the standard Self-Organizing Map.
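The abstract's key observation is that the algorithm can be derived as a minimisation of the mean squared error between prototypes and data, without the probabilistic machinery. The sketch below illustrates that view with a generic online prototype update plus a Gaussian neighborhood on a 1-D latent grid (the usual self-organizing-map ingredient); the data, initialization, and learning-rate choices are illustrative assumptions, not the authors' experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def som_step(prototypes, x, lr, sigma):
    """One online step minimizing squared error between prototypes and a
    sample x: the winning prototype moves toward x, and (SOM-style) its
    neighbors on a 1-D latent grid move a little as well."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    winner = int(np.argmin(dists))
    grid = np.arange(len(prototypes))
    # Gaussian neighborhood on the latent grid couples nearby prototypes.
    h = np.exp(-((grid - winner) ** 2) / (2 * sigma ** 2))
    prototypes += lr * h[:, None] * (x - prototypes)
    return winner

# Two well-separated 2-D clusters; two prototypes on a 1-D latent grid.
data = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
                  rng.normal(5.0, 0.1, (50, 2))])
protos = np.array([[1.0, 1.0], [4.0, 4.0]])
for epoch in range(20):
    for x in rng.permutation(data):
        som_step(protos, x, lr=0.1, sigma=0.3)
# After training, each prototype settles near one cluster mean.
```

With a vanishing neighborhood width this reduces to plain competitive minimisation of the prototype-data squared error, which is exactly the reinterpretation the abstract describes.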


Author(s):  
Marcel Nikmon ◽  
Roman Budjač ◽  
Daniel Kuchár ◽  
Peter Schreiber ◽  
Dagmar Janáčová

Abstract Deep learning is a kind of machine learning, and machine learning is a kind of artificial intelligence: machine learning encompasses a group of various technologies, and deep learning is one of them. The use of deep learning is an integral part of current data classification practice. This paper introduces the possibilities of classification using convolutional networks. Experiments focused on audio and video data show different approaches to data classification. Most experiments use the well-known pretrained AlexNet network with various types of input data pre-processing. However, other neural network architectures are also compared, and we show the results of training on small and larger datasets. The paper comprises descriptions of eight different kinds of experiments. Several training sessions were conducted in each experiment, with different aspects monitored. The focus was put on the effect of batch size on the accuracy of deep learning, among many other parameters that affect deep learning [1].
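The experiments above study how batch size affects accuracy. The paper uses AlexNet on audio and video data; as a deliberately tiny stand-in, the sketch below trains a plain logistic-regression "network" on toy data with mini-batch SGD, where the batch size is the only variable between runs, smaller batches simply meaning more (and noisier) gradient updates per epoch. All data and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linearly separable data standing in for a real dataset.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def train(batch_size, epochs=30, lr=0.5):
    """Plain mini-batch SGD on logistic regression; only the batch
    size varies between runs."""
    w = np.zeros(2)
    for _ in range(epochs):
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            p = 1 / (1 + np.exp(-X[b] @ w))          # sigmoid output
            w -= lr * X[b].T @ (p - y[b]) / len(b)   # averaged gradient
    return w

def accuracy(w):
    return float(((X @ w > 0) == (y > 0.5)).mean())

acc_small = accuracy(train(batch_size=8))    # many noisy updates/epoch
acc_large = accuracy(train(batch_size=200))  # one full-batch update/epoch
```

On this easy separable problem both settings converge; the monitored effects in the paper concern how this trade-off between update count and gradient noise plays out on harder data and deeper networks.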


2020 ◽  
Author(s):  
Andreas M. Kist ◽  
Julian Zilker ◽  
Pablo Gómez ◽  
Anne Schützenberger ◽  
Michael Döllinger

A healthy voice is crucial for verbal communication and hence in daily as well as professional life. The basis of a healthy voice are the sound-producing vocal folds in the larynx. A hallmark of healthy vocal fold oscillation is the symmetric motion of the left and right vocal fold. Clinically, videoendoscopy is applied to assess the symmetry of the oscillation, which is evaluated subjectively. High-speed videoendoscopy, an emerging method that allows quantification of the vocal fold oscillation, is more commonly employed in research due to the amount of data and the complex, semi-automatic analysis. In this study, we provide a comprehensive evaluation of methods that fully automatically detect the glottal midline. We used a biophysical model to simulate different vocal fold oscillations, extended the openly available BAGLS dataset with manual annotations, utilized both simulations and annotated endoscopic images to train deep neural networks at different stages of the analysis workflow, and compared these to established computer vision algorithms. We found that classical computer vision algorithms perform well at detecting the glottal midline in glottis segmentation data, but are outperformed by deep neural networks on this task. We further suggest GlottisNet, a multi-task neural architecture featuring the simultaneous prediction of both the opening between the vocal folds and the symmetry axis. By fully automating segmentation and midline detection, this is a major step toward the clinical applicability of quantitative, deep learning-assisted laryngeal endoscopy.
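The study above compares classical computer vision algorithms against deep neural networks for glottal midline detection. The abstract does not name the specific classical algorithms used, so the following is only a plausible baseline of that kind: take the principal axis (largest-eigenvalue eigenvector of the pixel covariance) of a segmented glottal area, through its centroid, as the midline. The synthetic mask is an illustrative assumption.

```python
import numpy as np

def principal_midline(mask):
    """Classical computer-vision style midline estimate: the principal
    axis of the segmented glottal area. Returns the centroid (x, y) and
    a unit direction vector along the midline."""
    ys, xs = np.nonzero(mask)
    pts = np.column_stack([xs, ys]).astype(float)
    centroid = pts.mean(axis=0)
    cov = np.cov((pts - centroid).T)            # 2x2 pixel covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    direction = eigvecs[:, np.argmax(eigvals)]  # axis of largest spread
    return centroid, direction

# Synthetic, perfectly vertical glottis: midline should be vertical.
mask = np.zeros((32, 32), dtype=bool)
mask[4:28, 15:17] = True
centroid, direction = principal_midline(mask)
```

Such a geometric estimate works when the segmentation is clean and elongated; the networks in the study predict the symmetry axis directly, which is where they reportedly pull ahead.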



2021 ◽  
Author(s):  
Andreas M Kist ◽  
Stephan Duerr ◽  
Anne Schuetzenberger ◽  
Marion Semmler

Glottis segmentation is a crucial step in quantifying endoscopic footage in laryngeal high-speed videoendoscopy. Recent advances in using deep neural networks for glottis segmentation allow a fully automatic workflow. However, the integral parts of these segmentation deep neural networks remain poorly understood. Here, we show using systematic ablations that a single latent channel as bottleneck layer is sufficient for glottal area segmentation. We further show that the latent space is an abstraction of the glottal area segmentation relying on three spatially defined pixel subtypes. We provide evidence that the latent space is highly correlated with the glottal area waveform, can be encoded with four bits, and can be decoded using lean decoders while maintaining high reconstruction accuracy. Our findings suggest that glottis segmentation is a task that can be highly optimized to yield very efficient and clinically applicable deep neural networks. In the future, we believe that online deep learning-assisted monitoring will be a game changer in laryngeal examinations.
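The claim above that the latent space can be encoded with four bits while tracking the glottal area waveform can be illustrated with a generic uniform quantizer; this is not the authors' encoding scheme, just a sketch of why 16 levels can already reconstruct a smooth waveform closely. The toy waveform is an illustrative assumption.

```python
import numpy as np

def quantize(signal, bits=4):
    """Uniformly quantize a signal to 2**bits levels and reconstruct it,
    mimicking a latent channel encoded with only a few bits. Assumes a
    non-constant signal (max > min)."""
    levels = 2 ** bits
    lo, hi = signal.min(), signal.max()
    codes = np.round((signal - lo) / (hi - lo) * (levels - 1)).astype(int)
    return lo + codes / (levels - 1) * (hi - lo)

# Toy glottal area waveform: rectified 5 Hz oscillation over one second
# (area is zero while the glottis is closed).
t = np.linspace(0, 1, 1000)
gaw = np.maximum(0.0, np.sin(2 * np.pi * 5 * t))
recon = quantize(gaw, bits=4)

# With 16 levels the reconstruction already tracks the waveform closely.
corr = np.corrcoef(gaw, recon)[0, 1]
```

The maximum error of a 4-bit uniform quantizer is half a step (about 3% of the signal range here), so the correlation with the original waveform stays very high, in line with the abstract's lean-decoder argument.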

