Speech Transcription
Recently Published Documents

Total documents: 110 (last five years: 21)
H-index: 11 (last five years: 2)

2021 · Vol 11 (11) · pp. 1540
Author(s): Jennifer U. Soriano, Abby Olivieri, Katherine C. Hustad

The Intelligibility in Context Scale (ICS) is a widely used, efficient tool for describing a child’s speech intelligibility. Few studies have explored the relationship between ICS scores and transcription intelligibility scores, which are the gold standard for clinical measurement. This study examined how well ICS composite scores predicted transcription intelligibility scores among children with cerebral palsy (CP), how well individual questions from the ICS differentially predicted transcription intelligibility scores, and how well the ICS composite scores differentiated between children with and without speech motor impairment. Parents of 48 children with CP, who were approximately 13 years of age, completed the ICS. Ninety-six adult naïve listeners provided orthographic transcriptions of children’s speech. Transcription intelligibility scores were regressed on ICS composite scores and individual item scores. Dysarthria status was regressed on ICS composite scores. Results indicated that ICS composite scores were moderately strong predictors of transcription intelligibility scores. One individual ICS item differentially predicted transcription intelligibility scores, and dysarthria severity influenced how well ICS composite scores differentiated between children with and without speech motor impairment. Findings suggest that the ICS has potential clinical utility for children with CP, especially when used with other objective measures of speech intelligibility.
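As an illustration of the regression described above, here is a minimal sketch relating ICS composite scores to transcription intelligibility scores. All data values are invented, not taken from the study, and the study's exact modeling choices are not specified beyond "regressed on."

```python
# Hypothetical illustration: linear regression of transcription
# intelligibility (% words correctly transcribed by naive listeners)
# on ICS composite scores (mean of items rated 1-5). Data invented.
import numpy as np
from scipy import stats

ics_composite = np.array([2.1, 3.4, 4.8, 1.9, 3.0, 4.2, 2.7, 4.9])
intelligibility = np.array([31.0, 55.0, 88.0, 24.0, 49.0, 76.0, 40.0, 92.0])

fit = stats.linregress(ics_composite, intelligibility)
print(f"slope={fit.slope:.1f}, r={fit.rvalue:.2f}, p={fit.pvalue:.4f}")
# A moderately strong r, as the study reports, would support the ICS as a
# screening proxy used alongside direct transcription measures.
```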


2021
Author(s): Courtney Mansfield, Sara Ng, Gina-Anne Levow, Richard A. Wright, Mari Ostendorf

2021
Author(s): Mahdi Namazifar, John Malik, Li Erran Li, Gokhan Tur, Dilek Hakkani Tür

Author(s): Carly B. Fox, Megan Israelsen-Augenstein, Sharad Jones, Sandra Laing Gillam

Purpose: This study examined the accuracy and potential clinical utility of two expedited transcription methods for narrative language samples elicited from school-age children (7;5–11;10 [years;months]) with developmental language disorder. Transcription methods included real-time transcription produced by speech-language pathologists (SLPs) and trained transcribers (TTs), as well as Google Cloud Speech automatic speech recognition.

Method: The accuracy of each transcription method was evaluated against a gold-standard reference corpus. Clinical utility was examined by determining the reliability of scores calculated from the transcripts produced by each method on several language sample analysis (LSA) measures. Participants included seven certified SLPs and seven TTs. Each participant produced a set of six transcripts in real time, out of a total of 42 language samples. The same 42 samples were transcribed using Google Cloud Speech. Transcription accuracy was evaluated through word error rate. Reliability of LSA scores was determined using correlation analysis.

Results: Google Cloud Speech was significantly more accurate than real-time transcription in transcribing narrative samples and was not affected by the speech rate of the narrator. In contrast, SLP and TT transcription accuracy decreased as a function of increasing speech rate. LSA metrics generated from Google Cloud Speech transcripts were also more reliably calculated.

Conclusions: Automatic speech recognition showed greater accuracy and clinical utility as an expedited transcription method than real-time transcription. Though there is room for improvement in the accuracy of speech recognition for the purpose of clinical transcription, it produced highly reliable scores on several commonly used LSA metrics.

Supplemental Material: https://doi.org/10.23641/asha.15167355
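Since transcription accuracy here is scored by word error rate (WER) against a gold-standard reference, a minimal sketch of the standard WER computation may clarify the metric. The example strings are hypothetical; the study's actual scoring pipeline is not described beyond the metric itself.

```python
# Minimal word error rate (WER) sketch: Levenshtein distance over words,
# normalized by reference length. Example strings are hypothetical.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the dog ran to the park", "the dog ran in the park"))  # ~0.167
```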


2021 · Vol 12
Author(s): Maria Jose Alvarez-Alonso, Cristina de-la-Peña, Zaira Ortega, Ricardo Scott

The quality of language comprehension determines performance in a wide range of activities, including academic work. Word processing initially develops in the auditory modality and gradually extends to the visual modality as children learn to read. School failure is strongly related to listening and reading comprehension problems. In this study, we analyzed sex differences in the comprehension of Spanish texts (standardized reading test PROLEC-R) presented in three modalities (visual, auditory, and both simultaneously: dual modality) to 12- to 14-year-old native Spanish-speaking students. We controlled for relevant cognitive variables such as attention (d2), phonological and semantic fluency (FAS), and processing speed (WISC Coding subtest). Girls' comprehension was similar across the three presentation modalities; boys, however, benefited substantially from dual modality compared to boys exposed to visual-only or auditory-only text presentation. With respect to the relationship between text comprehension and school performance, students with low grades in Spanish showed poor auditory comprehension. Interestingly, the visual and dual modalities preserved comprehension levels in these low-skilled students. Our results suggest that visual-text support during auditory language presentation could benefit students with low school performance, especially boys. They also encourage future research to evaluate the classroom implementation of the rapidly developing technology of simultaneous speech transcription, which could additionally benefit non-native students, especially those who have recently entered school or newly arrived in a country from abroad.


2021
Author(s): Aitor Álvarez, Haritz Arzelus, Iván G. Torre, Ander González-Docasal

Electronics · 2021 · Vol 10 (3) · pp. 235
Author(s): Natalia Bogach, Elena Boitsova, Sergey Chernonog, Anton Lamtev, Maria Lesnichaya, et al.

This article contributes to the discourse on how contemporary computer and information technology can improve foreign language learning, not only by supporting better and more flexible workflows and digitizing study materials, but also by creating completely new use cases made possible by advances in signal processing algorithms. We discuss an approach and propose a holistic solution for teaching the phonological phenomena that are crucial for correct pronunciation: the phonemes; the energy and duration of syllables and pauses, which construct the phrasal rhythm; and the tone movement within an utterance, i.e., the phrasal intonation. The working prototype of the StudyIntonation Computer-Assisted Pronunciation Training (CAPT) system is a tool for mobile devices that offers a set of tasks based on a "listen and repeat" approach and gives audio-visual feedback in real time. The present work summarizes the efforts taken to enrich the current version of this CAPT tool with two new functions: phonetic transcription and rhythmic patterns of model and learner speech. Both are built on the third-party automatic speech recognition (ASR) library Kaldi, which was incorporated into the StudyIntonation signal-processing core. We also examine the scope of ASR applicability within the CAPT system workflow and evaluate the Levenshtein distance between transcriptions made by human experts and those obtained automatically by our code. We developed a rhythm-reconstruction algorithm using the ASR acoustic and language models. We also show that even learners with sufficiently correct phoneme production often fail to produce correct phrasal rhythm and intonation; joint training of sounds, rhythm, and intonation within a single learning environment is therefore beneficial. To mitigate recording imperfections, voice activity detection (VAD) is applied to all processed speech recordings. Try-outs showed that StudyIntonation can create transcriptions and process rhythmic patterns, although some specific problems with connected-speech transcription were detected. Learner feedback for pronunciation assessment was also updated: a conventional mechanism based on dynamic time warping (DTW) was combined with a cross-recurrence quantification analysis (CRQA) approach, resulting in better discriminating ability. The CRQA metrics combined with those of DTW were shown to improve the accuracy of learner performance estimation. The major implications for computer-assisted English pronunciation teaching are discussed.
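As a rough illustration of the DTW alignment mechanism the authors combine with CRQA, the sketch below computes a DTW distance between a model pitch contour and a learner's slower rendition. The contour values are invented toy data; the actual StudyIntonation implementation is not shown in the article.

```python
# Minimal dynamic time warping (DTW) sketch: aligns a learner's pitch
# contour with the model contour despite timing differences (toy data).
import numpy as np

def dtw_distance(model: np.ndarray, learner: np.ndarray) -> float:
    n, m = len(model), len(learner)
    acc = np.full((n + 1, m + 1), np.inf)  # accumulated cost matrix
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(model[i - 1] - learner[j - 1])  # per-frame pitch difference
            acc[i, j] = cost + min(acc[i - 1, j],      # insertion
                                   acc[i, j - 1],      # deletion
                                   acc[i - 1, j - 1])  # match
    return float(acc[n, m])

model_f0 = np.array([120.0, 140.0, 180.0, 160.0, 130.0])          # model contour (Hz)
learner_f0 = np.array([118.0, 119.0, 142.0, 175.0, 158.0, 131.0]) # slower learner take
print(dtw_distance(model_f0, learner_f0))
```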


2020 · Vol 4 (2) · pp. 294-305
Author(s): Stefanny Lauwren

Greta Thunberg delivered a speech entitled "You're Acting Like Spoiled, Irresponsible Children" to influential figures in Europe at the "Civil Society for renaissance" event, to which she was personally invited by the president of the organizing body, Luca Jahier. Through her speech, she managed to convince the European Union to pledge to spend billions of euros to combat the climate crisis. This study aims to discover how the interpersonal metafunction is used in the speech and what functions it reveals, using Fairclough's Critical Discourse Analysis and Hallidayan Systemic Functional Grammar. The data, consisting of seventy-one independent clauses, were taken from Thunberg's book of speech transcriptions, "No One Is Too Small to Make a Difference". The research finds that, through the use of mood, modality, and pronouns, Thunberg conveys her view of her relationship with the audience as one of victim and perpetrator, and of who holds responsibility and takes the blame.


2020 · Vol 46 (Supplement_1) · pp. S177-S177
Author(s): Can Kilciksiz, Katrina Brown, Alexandria Vail, Tadas Baltrusaitis, Luciana Pennant, et al.

Background: A major challenge for reliable and effective mental health care is the lack of objective markers of illness. Computational approaches to measuring naturalistic behavior in clinical settings could therefore provide an objective backstop for mental health assessment and disease monitoring. This study aimed to train machine-learning (ML) classifiers to estimate conventional clinical measures of severe mental illness using quantitative metrics derived from computational analysis of facial and vocal behaviors.

Methods: Individuals hospitalized for any active psychotic condition were recruited to participate in up to ten recorded study visits, each comprising three segments. Each visit was captured using two synchronized HD webcams and cardioid microphones to obtain high-quality audiovisual (AV) data from both patient and interviewer. We performed automated facial action coding, vocal analysis, and speech transcription using publicly available software and services (e.g., OpenFace, openSMILE, TranscribeMe).

Results: A total of 34 participants took part in 66 sessions between 2015 and 2018, resulting in over 40 hours of AV recordings. In our visual and vocal analysis, we found that several features derived from face, voice, and use of language (e.g., eyebrow furrowing, eye widening, smile variability, characteristics of vowels) were both robustly measured using our approach and allowed us to accurately estimate multiple symptom domains (e.g., mania, depression, psychosis) (R > 0.7, p < 0.05). In our linguistic analysis, we found that an abundance of power words (e.g., superiority, important) and a lack of contextual language (e.g., yesterday, nearby) were highly indicative of positive psychotic symptoms (R = +0.417, p = 0.002 and R = -0.302, p = 0.028, respectively).

Discussion: Automated analysis of face, voice, and speech provides a number of robust behavioral markers sensitive enough to detect changes in psychopathology within individuals over time. Naturalistic, quantitative assessments can therefore yield objective markers of mood and cognition that can be used to optimize both access to and quality of treatments for a wide range of psychiatric conditions.
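A hedged sketch of the modeling step described above: regressing a clinical symptom rating on per-session audiovisual features with cross-validation. The feature matrix, the symptom scores, and the choice of ridge regression are all illustrative assumptions; the abstract does not specify the estimator or feature set used.

```python
# Illustrative sketch only: assumes per-session feature vectors (e.g.,
# eyebrow furrowing, smile variability, vowel characteristics) have been
# extracted to a matrix X. All data below is synthetic stand-in data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(66, 12))  # 66 sessions x 12 audiovisual features (synthetic)
mania_scores = X @ rng.normal(size=12) + rng.normal(scale=0.5, size=66)  # stand-in ratings

model = Ridge(alpha=1.0)  # ridge regression is an assumed, not reported, choice
r2 = cross_val_score(model, X, mania_scores, cv=5, scoring="r2")
print(f"cross-validated R^2: {r2.mean():.2f}")  # the abstract reports R > 0.7 per domain
```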


2020 · pp. 002383092091107
Author(s): Suzanne R. Jongman, Yung Han Khoe, Florian Hintz

Previous research has shown that vocabulary size affects performance on laboratory word production tasks. Individuals who know many words show faster lexical access and retrieve more words belonging to pre-specified categories than individuals who know fewer words. The present study examined the relationship between receptive vocabulary size and speaking skills as assessed in a natural sentence production task. We asked whether measures derived from spontaneous responses to everyday questions correlate with the size of participants’ vocabulary. Moreover, we assessed the suitability of automatic speech recognition (ASR) for the analysis of participants’ responses in complex language production data. We found that vocabulary size predicted indices of spontaneous speech: individuals with a larger vocabulary produced more words and had a higher speech-silence ratio compared to individuals with a smaller vocabulary. Importantly, these relationships were reliably identified using manual and automated transcription methods. Taken together, our results suggest that spontaneous speech elicitation is a useful method to investigate natural language production and that automatic speech recognition can alleviate the burden of labor-intensive speech transcription.
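As a minimal sketch of the reliability comparison described above, one can correlate a speech measure (here, words produced per response) derived from manual transcripts with the same measure derived from ASR transcripts. The counts below are hypothetical, not data from the study.

```python
# Hypothetical manual-vs-ASR reliability check: correlate a per-participant
# measure computed from both transcription methods.
from scipy.stats import pearsonr

manual_word_counts = [112, 85, 143, 97, 160, 74, 128, 101]
asr_word_counts = [108, 88, 139, 95, 155, 70, 131, 99]

r, p = pearsonr(manual_word_counts, asr_word_counts)
print(f"r = {r:.2f}, p = {p:.4f}")  # a high r would justify ASR as a substitute
```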

