Rethinking glottal midline detection

A healthy voice is crucial for verbal communication and hence for daily as well as professional life. The basis for a healthy voice is the sound-producing vocal folds in the larynx. A hallmark of healthy vocal fold oscillation is the symmetric motion of the left and right vocal folds. Clinically, videoendoscopy is applied to assess the symmetry of the oscillation, which is evaluated subjectively. High-speed videoendoscopy, an emerging method that allows quantification of the vocal fold oscillation, is more commonly employed in research because of the amount of data and the complex, semi-automatic analysis. In this study, we provide a comprehensive evaluation of methods that detect the glottal midline fully automatically. We used a biophysical model to simulate different vocal fold oscillations, extended the openly available BAGLS dataset using manual annotations, utilized both simulations and annotated endoscopic images to train deep neural networks at different stages of the analysis workflow, and compared these to established computer vision algorithms. We found that classical computer vision algorithms perform well on detecting the glottal midline in glottis segmentation data, but are outperformed by deep neural networks on this task. We further suggest GlottisNet, a multi-task neural architecture that simultaneously predicts both the opening between the vocal folds and the symmetry axis. By fully automating segmentation and midline detection, it is a major step toward clinical applicability of quantitative, deep learning-assisted laryngeal endoscopy.
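As an illustration of what detecting the glottal midline in segmentation data can look like, the following is a minimal sketch of one classical approach: taking the principal axis of the segmented glottal area via PCA. It is not claimed to be one of the algorithms evaluated in the paper; the function name `estimate_midline` and the synthetic elliptical mask are assumptions made for this example.

```python
import numpy as np


def estimate_midline(mask):
    """Estimate a midline as the principal axis of a binary segmentation mask.

    Returns (centroid, direction), both as (x, y) arrays. This is a generic
    PCA sketch, not the method of any specific paper.
    """
    ys, xs = np.nonzero(mask)
    if xs.size < 2:
        raise ValueError("mask needs at least two foreground pixels")
    points = np.stack([xs, ys], axis=1).astype(float)
    centroid = points.mean(axis=0)
    # 2x2 covariance of the foreground pixel coordinates
    cov = np.cov((points - centroid).T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # The eigenvector with the largest eigenvalue is the long axis
    direction = eigvecs[:, np.argmax(eigvals)]
    # Resolve the sign ambiguity so the axis points toward increasing y
    if direction[1] < 0:
        direction = -direction
    return centroid, direction


# Hypothetical test input: an ellipse elongated along the image y-axis
yy, xx = np.mgrid[0:64, 0:64]
demo_mask = ((xx - 32) / 4.0) ** 2 + ((yy - 32) / 20.0) ** 2 <= 1.0
demo_centroid, demo_direction = estimate_midline(demo_mask)
```

For the synthetic ellipse, the estimated axis runs along the long (vertical) dimension through the center of the mask. Real glottis masks are often asymmetric or split into several regions, which is one reason a purely geometric estimate like this can fail.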

Scientific Reports ◽

10.1038/s41598-020-77216-6 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Andreas M. Kist ◽

Julian Zilker ◽

Pablo Gómez ◽

Anne Schützenberger ◽

Michael Döllinger

Keyword(s):

Neural Networks ◽

Computer Vision ◽

Vocal Fold ◽

High Speed ◽

Deep Neural Networks ◽

Comprehensive Evaluation ◽

Vocal Folds ◽

Biophysical Model ◽

Analysis Workflow ◽

Classical Computer

High-Speed Imaging to Study an Auto-Oscillating Vocal Fold Replica for Different Initial Conditions

International Journal of Applied Mechanics ◽

10.1142/s1758825117500648 ◽

2017 ◽

Vol 09 (05) ◽

pp. 1750064 ◽

Cited By ~ 2

Author(s):

A. Van Hirtum ◽

X. Pelorson

Keyword(s):

Vocal Fold ◽

High Speed ◽

Initial Conditions ◽

Vocal Folds ◽

High Speed Imaging ◽

Human Voice ◽

Manual Intervention ◽

Geometrical Features ◽

Upstream Pressure

Experiments on mechanical, deformable vocal fold replicas are important in physical studies of human voice production to understand the underlying fluid–structure interaction. To date, most experiments are performed for constant initial conditions with respect to structural as well as geometrical features. Varying those conditions requires manual intervention, which might affect reproducibility and hence the quality of experimental results. In this work, a setup is described that allows elastic and geometrical initial conditions to be set in an automated way for a deformable vocal fold replica. High-speed imaging is integrated into the setup in order to decorrelate elastic and geometrical features. In this way, reproducible, accurate and systematic measurements can be performed for prescribed initial conditions of glottal area, mean upstream pressure and vocal fold elasticity. Moreover, quantification of geometrical features during auto-oscillation is shown to contribute to the experimental characterization and understanding.

Off-the-shelf deep learning is not enough, and requires parsimony, Bayesianity, and causality

npj Computational Materials ◽

10.1038/s41524-020-00487-0 ◽

2021 ◽

Vol 7 (1) ◽

Author(s):

Rama K. Vasudevan ◽

Maxim Ziatdinov ◽

Lukas Vlcek ◽

Sergei V. Kalinin

Keyword(s):

Neural Networks ◽

Computer Vision ◽

Deep Learning ◽

Bayesian Methods ◽

Deep Neural Networks ◽

Applied Research ◽

Modern Science ◽

Generative Models ◽

Knowledge Development ◽

Physical Constraints

Deep neural networks (‘deep learning’) have emerged as a technology of choice to tackle problems in speech recognition, computer vision, finance, etc. However, adoption of deep learning in physical domains brings substantial challenges stemming from the correlative nature of deep learning methods compared to the causal, hypothesis-driven nature of modern science. We argue that the broad adoption of Bayesian methods incorporating prior knowledge, development of solutions with incorporated physical constraints and parsimonious structural descriptors and generative models, and ultimately adoption of causal models, offer a path forward for fundamental and applied research.

A Deep Learning Enhanced Novel Software Tool for Laryngeal Dynamics Analysis

Journal of Speech Language and Hearing Research ◽

10.1044/2021_jslhr-20-00498 ◽

2021 ◽

pp. 1-15

Author(s):

Andreas M. Kist ◽

Pablo Gómez ◽

Denis Dubrovskiy ◽

Patrick Schlegel ◽

Melda Kunduk ◽

...

Keyword(s):

Neural Networks ◽

Quantitative Analysis ◽

High Speed ◽

Voice Disorders ◽

Vocal Folds ◽

Video Data ◽

Audio Data ◽

Fully Automatic ◽

Video And Audio

Purpose: High-speed videoendoscopy (HSV) is an emerging, but barely used, endoscopy technique in the clinic to assess and diagnose voice disorders because of the lack of dedicated software to analyze the data. HSV allows quantification of the vocal fold oscillations by segmenting the glottal area. This challenging task has been tackled by various studies; however, the proposed approaches are mostly limited and not suitable for daily clinical routine. Method: We developed user-friendly software in C# that allows the editing, motion correction, segmentation, and quantitative analysis of HSV data. We further provide pretrained deep neural networks for fully automatic glottis segmentation. Results: We freely provide our software Glottis Analysis Tools (GAT). Using GAT, we provide a general threshold-based region growing platform that enables the user to analyze data from various sources, such as in vivo recordings, ex vivo recordings, and high-speed footage of artificial vocal folds. Additionally, especially for in vivo recordings, we provide three robust neural networks at various speed and quality settings to allow the fully automatic glottis segmentation needed for application by untrained personnel. GAT further evaluates video and audio data in parallel and is able to extract various features from the video data, among others the glottal area waveform, that is, the changing glottal area over time. In total, GAT provides 79 unique quantitative analysis parameters for video- and audio-based signals. Many of these parameters have already been shown to reflect voice disorders, highlighting the clinical importance and usefulness of the GAT software. Conclusion: GAT is a unique tool to process HSV and audio data to determine quantitative, clinically relevant parameters for research, diagnosis, and treatment of laryngeal disorders. Supplemental Material: https://doi.org/10.23641/asha.14575533
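The glottal area waveform mentioned in this abstract (the glottal area over time) can be sketched as a per-frame pixel count over a stack of binary segmentation masks. The snippet below is an illustrative Python sketch only. GAT itself is written in C#, and the function name `glottal_area_waveform` and the toy masks are assumptions, not part of that software.

```python
import numpy as np


def glottal_area_waveform(masks):
    """Return the segmented area (in pixels) for every frame.

    `masks` is assumed to be an array of shape (frames, height, width)
    where nonzero entries mark glottis pixels.
    """
    masks = np.asarray(masks).astype(bool)
    if masks.ndim != 3:
        raise ValueError("expected an array of shape (frames, height, width)")
    # Flatten each frame and count its foreground pixels
    return masks.reshape(masks.shape[0], -1).sum(axis=1)


# Hypothetical four-frame recording: closed, opening, open, closing
demo_masks = np.zeros((4, 8, 8), dtype=bool)
demo_masks[1, 3:5, 4] = True      # 2 pixels
demo_masks[2, 2:6, 3:5] = True    # 8 pixels
demo_masks[3, 3:5, 4] = True      # 2 pixels
demo_gaw = glottal_area_waveform(demo_masks)
```

Converting the pixel counts to physical units would require a spatial calibration, which is not covered in the abstract and is therefore left out here.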

Deep neural networks only in combination with traditional computer vision

ATZelektronik worldwide ◽

10.1007/s38314-017-0077-3 ◽

2017 ◽

Vol 12 (6) ◽

pp. 26-31

Author(s):

Uwe Westmeyer

Keyword(s):

Neural Networks ◽

Computer Vision ◽

Deep Neural Networks

Survey on Energy-Efficient Deep Neural Networks for Computer Vision

Low-Power Computer Vision ◽

10.1201/9781003162810-3 ◽

2022 ◽

pp. 25-52

Author(s):

Abhinav Goel ◽

Caleb Tung ◽

Xiao Hu ◽

Haobo Wang ◽

Yung-Hsiang Lu ◽

...

Keyword(s):

Neural Networks ◽

Computer Vision ◽

Energy Efficient ◽

Deep Neural Networks

Dynamic Digital Image Correlation of a Dynamic Physical Model of the Vocal Folds

Advances in Bioengineering ◽

10.1115/imece2005-81457 ◽

2005 ◽

Cited By ~ 5

Author(s):

S. Mantha ◽

L. Mongeau ◽

T. Siegmund

Keyword(s):

Digital Image Correlation ◽

Digital Image ◽

Vocal Fold ◽

High Speed ◽

Vocal Folds ◽

Image Correlation ◽

Strain Component ◽

Medial Surface ◽

Superior Surface ◽

Incomplete Closure

An experimental study of the vibratory deformation of the human vocal folds was conducted. Experiments were performed using model vocal folds [1, 2] (Fig. 1) made of silicone rubber and implemented into an air supply system (Fig. 2). The material used to cast the model is an isotropic homogeneous material [3] with a tangent modulus E = 5 kPa at ε = 0, i.e. elastic properties similar to those of the human vocal fold cover [4]. The advantages of the use of model larynx systems over the use of excised larynges include easy accessibility to fundamental studies of the vocal fold vibration without invasive testing. Acoustic analysis of voice or electroglottography provides some insight into voice production processes, but optical techniques for the study of vocal fold vibrations have drawn considerable attention. Videoendoscopy, stroboscopy, high-speed photography, and kymography have been shown to provide a visual impression of vocal fold dynamics but are limited in providing insight into the fundamental deformation processes of the vocal folds. Quantitative measurements of deformation have been conducted through micro-suture techniques, but these are invasive and allow for measurements at only a few image points. Laser triangulation is non-invasive but is limited to only one local measurement point. Here, the digital image correlation (DIC) technique with the software VIC 3D [5] is applied. For the experimental set-up see Fig. 2. The analysis consists of (1) a stereo correlation step to obtain in-plane displacements and (2) a stereo triangulation step to obtain out-of-plane deformation. For the stereo correlation, images of the object at two different stages of deformation are compared. A point in the image of the undeformed object is matched with the corresponding point in the deformed stage. “Subsets” of digital images are traced via their gray value distribution from the undeformed reference image to the deformed image. 
The uniqueness of the matching is enabled by the creation of a speckle pattern on the object’s surface. Here, a white pigment is mixed into the silicone rubber and subsequently black enamel paint is sprayed onto the superior surface of the vocal folds. The stereo triangulation requires two images of the object at each stage of deformation. These are obtained in a single CCD frame by placing a beam splitter in the optical axis between camera and object. These images provide a “left” and “right” view of the model larynx. Thus, the deformed shape of the vocal folds can be obtained. The method allows for noninvasive measurement of the full-field displacement fields. Images of the superior surface of the model larynx are obtained with a high-speed digital camera at a frame rate of 3000 frames per second, allowing for more than 30 image frames for each vibration cycle. For the 3D digital image correlation analysis, two images of the object are obtained for each time instance, as a beam splitter is placed in the optical axis between the camera and the model larynx. Phonation frequencies and onset pressure are given in Fig. 3, showing that the model larynx behavior is close to actual physiological data. Figs 4(a) and (b) provide superior views of the model larynx at maximum glottal opening and at glottal closure, respectively. As one example of measured strain fields, Figs 5(a) and (b) depict the distribution of the transverse strain component on the glottal surface as a contour plot on the deformed superior surface. The knowledge of the distribution of this strain component is relevant to the assessment of the impact of vocal fold collision on potential tissue damage. In the position of maximum opening the vocal folds are deformed by a combination of a bulging-type deformation and the opening movement. At this time instance, the transverse strains at the medial surface are found to be negative, an indication of Poisson’s deformation. 
During the closing stage, vocal folds collide and simultaneously a mode 3 vibration pattern emerges. Closure of the glottal opening is not complete and two incomplete closure areas are formed during the closure stage. These open areas are located at the anterior and posterior ends of the model larynx, see Fig. 4(b). The finding of this type of incomplete closure is in agreement with both actual glottal measurements [6] and 3D finite element simulations [7]. Transverse strains during that stage are now positive and considerably larger than during the opening stage. Finally, Fig. 6 depicts the time evolution of the out-of-plane displacements along the medial surface for the closing phase and Fig. 7 depicts the maximum values of the longitudinal strain (at the coronal section of the medial surface) as a function of the flow rate. These examples of measurements indicate that the DIC method is promising for studies of vocal fold dynamics.
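The subset matching described in this abstract, where a gray-value subset is traced from the reference image to the deformed image, can be illustrated with a small integer-pixel search using zero-normalized cross-correlation. This is only a sketch of the general idea under simplifying assumptions (rigid shift, no subpixel interpolation, no stereo triangulation). It does not reproduce the VIC 3D software, and the names `zncc` and `track_subset` are hypothetical.

```python
import numpy as np


def zncc(a, b):
    """Zero-normalized cross-correlation of two equally sized patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return 0.0 if denom == 0 else float((a * b).sum() / denom)


def track_subset(reference, deformed, top, left, size, search):
    """Find the integer shift (dy, dx) that best matches one subset.

    The subset is the size x size patch of `reference` at (top, left);
    shifts within +/- `search` pixels are tested exhaustively.
    """
    subset = reference[top:top + size, left:left + size]
    best_score, best_dy, best_dx = -2.0, 0, 0
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            r, c = top + dy, left + dx
            # Skip candidate windows that fall outside the image
            if r < 0 or c < 0:
                continue
            if r + size > deformed.shape[0] or c + size > deformed.shape[1]:
                continue
            score = zncc(subset, deformed[r:r + size, c:c + size])
            if score > best_score:
                best_score, best_dy, best_dx = score, dy, dx
    return best_dy, best_dx, best_score


# Synthetic random speckle pattern, shifted by 2 rows down and 3 columns left
rng = np.random.default_rng(0)
speckle = rng.random((40, 40))
shifted = np.roll(speckle, shift=(2, -3), axis=(0, 1))
dy, dx, score = track_subset(speckle, shifted, top=15, left=15, size=9, search=5)
```

The random speckle plays the role of the painted pattern on the replica: without it, many candidate windows would correlate equally well and the match would not be unique.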

Electroglottography and Vocal Fold Physiology

Journal of Speech Language and Hearing Research ◽

10.1044/jshr.3302.245 ◽

1990 ◽

Vol 33 (2) ◽

pp. 245-254 ◽

Cited By ~ 76

Author(s):

D. G. Childers ◽

D. M. Hicks ◽

G. P. Moore ◽

L. Eskenazi ◽

A. L. Lalwani

Keyword(s):

Vocal Fold ◽

High Speed ◽

Human Subjects ◽

Vocal Folds ◽

Supporting Evidence ◽

Direct Measurements ◽

Analysis And Synthesis ◽

Vocal Fold Motion ◽

Maximum Opening ◽

Major Hypothesis

The electroglottogram (EGG) is known to be related to vocal fold motion. A major hypothesis undergoing examination in several research centers is that the EGG is related to the area of contact of the vocal folds. This hypothesis is difficult to substantiate with direct measurements using human subjects. However, other supporting evidence can be offered. For this study we made measurements from synchronized ultra high-speed laryngeal films and from EGG waveforms collected from subjects with normal larynges and patients with vocal disorders. We compare certain features of the EGG waveform to (a) the instant of the opening of the glottis, (b) the instant of the closing of the glottis, and (c) the instant of the maximum opening of the glottis. In addition, we compare both the open quotient and the relative average perturbation measured from the glottal area to that estimated from the EGG. All of these comparisons indicate that vocal fold vibratory characteristics are reflected by features of the EGG waveform. This makes the EGG useful for speech analysis and synthesis as well as for modeling laryngeal behavior. The limitations of the EGG are discussed.
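One of the quantities compared in this abstract, the open quotient measured from the glottal area, can be sketched as the fraction of samples in which the glottis is open. The definition below is a simplified assumption for illustration (a fixed area threshold, and a waveform spanning a whole number of cycles). The abstract does not state how the study computed it, and the function name `open_quotient` is hypothetical.

```python
import numpy as np


def open_quotient(area, threshold=0.0):
    """Fraction of samples whose glottal area exceeds `threshold`.

    Assumes `area` covers an integer number of oscillation cycles, so the
    fraction of open samples approximates open time divided by cycle time.
    """
    area = np.asarray(area, dtype=float)
    if area.size == 0:
        raise ValueError("area waveform is empty")
    return float(np.mean(area > threshold))


# Hypothetical single cycle sampled at 10 frames: open in 5 of them
demo_area = [0, 0, 1, 3, 4, 3, 1, 0, 0, 0]
demo_oq = open_quotient(demo_area)
```

With noisy segmentations, a small positive threshold is usually more robust than zero, at the cost of slightly underestimating the open phase.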

Usefulness of high-speed digital imaging (HSDI) in the diagnosis of oedematous – hypertrophic changes of the larynx in people using voice occupationally

Otolaryngologia Polska ◽

10.5604/01.3001.0010.2244 ◽

2017 ◽

Vol 71 (4) ◽

pp. 19-25 ◽

Cited By ~ 1

Author(s):

Bożena Kosztyła-Hojna ◽

Diana Moskal ◽

Anna Kuryliszyn-Moskal ◽

Anna Andrzejewska ◽

Anna Łobaczuk-Sitnik ◽

...

Keyword(s):

Electron Microscopy ◽

Digital Imaging ◽

Vocal Fold ◽

High Speed ◽

Vocal Folds ◽

Intercellular Spaces ◽

Vacuolar Degeneration ◽

Transmission Electron ◽

Tem Method ◽

Hypertrophic Changes

Introduction. The aim of the study is to evaluate the usefulness of High-Speed Digital Imaging (HSDI) in the diagnosis of organic dysphonia in the form of oedematous-hypertrophic changes of vocal fold mucosa, morphologically confirmed by the Transmission Electron Microscopy (TEM) method, in patients who use their voice occupationally. Material and methods. The group consisted of 30 patients who use their voice occupationally and had oedematous-hypertrophic changes of vocal fold mucosa. Parameters of vocal fold vibrations were evaluated using the HSDI technique with a digital HS camera, HRES Endocam Richard Wolf GmbH. The image of the vocal folds was recorded at a rate of 4000 frames per second. Postoperative material of the larynx was prepared in a routine way and observed in the transmission electron microscope OPTON 900–PC. Results. The HSDI technique allows assessment of the real vibrations of the vocal folds and determination of many parameters. The results of TEM in the postoperative material showed destruction of epithelial cells with severe vacuolar degeneration, enlargement of intercellular spaces and a large number of blood vessels in the stroma, which indicates the presence of oedematous-hypertrophic changes of the larynx. Discussion. The ultrastructural assessment confirms the particular usefulness of the HSDI method in the diagnosis of organic dysphonia in the form of oedematous-hypertrophic changes. Key words: High-Speed Digital Imaging, oedematous-hypertrophic changes, vocal fold mucosa, larynx

Normal Voice Production: Computation of Driving Parameters from Endoscopic Digital High Speed Images

Methods of Information in Medicine ◽

10.1055/s-0038-1634360 ◽

2003 ◽

Vol 42 (03) ◽

pp. 271-276 ◽

Cited By ~ 19

Author(s):

T. Braunschweig ◽

J. Lohscheller ◽

U. Eysholdt ◽

U. Hoppe ◽

M. Döllinger

Keyword(s):

Vocal Fold ◽

High Speed ◽

Biomechanical Model ◽

Vocal Folds ◽

Inversion Algorithm ◽

Knowledge Based ◽

Normal Voice ◽

Inversion Procedure ◽

High Speed Glottography

Summary. Objectives: A central point for quantitative evaluation of pathological and healthy voices is the analysis of vocal fold oscillations. By means of digital High Speed Glottography (HGG), vocal fold oscillations can be recorded in real time. Recently, a numerical inversion procedure was developed that allows the extraction of physiological parameters from digital high speed videos and a classification of voice disorders. The aim of this work was to validate the inversion procedure and to investigate its applicability to normal voices. Methods: High speed recordings were performed during phonation within a group of five female and five male persons with normal voices. By using knowledge-based image processing algorithms, motion curves of the vocal folds were extracted at three different positions (dorsal, medial, ventral). These curves were used to obtain physiological voice parameters, and in particular the degree of symmetry of the vocal folds, based upon a biomechanical model of the vocal folds. Results: The highest degree of symmetry was observed for the medial motion curves. While the dorsally and ventrally extracted motion curves exhibited similar results concerning the degree of symmetry, the performance of the algorithm was less stable. Conclusions: The inversion algorithm provides reasonable results for all subjects when applied to the medial motion curves. However, for dorsal and ventral motion curves, correct performance is reduced to 85 %.
