From acoustic to linguistic analysis of temporal speech structure: acousto-linguistic transformation during speech perception using speech quilts

2019
Author(s): Tobias Overath, Joon H. Paik

Abstract
Speech perception entails the mapping of the acoustic waveform to linguistic representations. For this mapping to succeed, the speech signal needs to be tracked over various temporal windows at high temporal precision in order to decode linguistic units ranging from phonemes (tens of milliseconds) to sentences (seconds). Here, we tested the hypothesis that cortical processing of speech-specific temporal structure is modulated by higher-level linguistic analysis. Using fMRI, we measured BOLD signal changes to 4-s long speech quilts with variable temporal structure (30, 120, 480, 960 ms segment lengths), as well as natural speech, created from a familiar (English) or foreign (Korean) language. We found evidence for the acoustic analysis of temporal speech properties in the superior temporal sulcus (STS): the BOLD signal increased as a function of temporal speech structure in both familiar and foreign languages. However, activity in the left inferior frontal gyrus (IFG) revealed evidence for linguistic processing of temporal speech properties: the BOLD signal increased as a function of temporal speech structure only in familiar, but not in foreign, speech. Network analyses suggested that left IFG modulates processing of speech-specific temporal structure in primary auditory cortex, which in turn sensitizes processing of speech-specific temporal structure in STS. The results thus reveal a network for acousto-linguistic transformation consisting of primary and non-primary auditory cortex, STS, and left IFG.

Significance Statement
Where and how the acoustic information contained in complex speech signals is mapped to linguistic information is still not fully explained by current speech/language models. We dissociate acoustic from linguistic analyses of speech by comparing the same acoustic manipulation (varying the extent of temporal speech structure) in two languages (native, foreign). We show that acoustic temporal speech structure is analyzed in the superior temporal sulcus (STS), while linguistic information is extracted in the left inferior frontal gyrus (IFG). Furthermore, modulation from left IFG enhances sensitivity to temporal speech structure in STS. We propose a model for acousto-linguistic transformation of speech-specific temporal structure in the human brain that can account for these results.
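The quilting manipulation at the heart of this study reorders short segments of natural speech so that the original acoustic structure is preserved only up to the chosen segment length. Below is a minimal Python sketch of that idea; the function name is hypothetical, and the published quilting algorithm additionally matches segment boundaries (e.g., at pitch periods) to avoid audible artifacts, a step omitted here.

```python
import numpy as np

def make_speech_quilt(waveform, sr, seg_ms, rng=None):
    """Simplified speech quilt: chop the signal into fixed-length segments,
    shuffle their order, and concatenate with short linear crossfades."""
    rng = np.random.default_rng() if rng is None else rng
    seg_len = int(sr * seg_ms / 1000)              # segment length in samples
    n_seg = len(waveform) // seg_len
    segments = [waveform[i * seg_len:(i + 1) * seg_len] for i in range(n_seg)]
    order = rng.permutation(n_seg)                 # random reordering of segments

    fade = int(0.005 * sr)                         # 5 ms crossfade at each junction
    ramp = np.linspace(0.0, 1.0, fade)
    quilt = segments[order[0]].astype(float)
    for idx in order[1:]:
        seg = segments[idx].astype(float)
        quilt[-fade:] = quilt[-fade:] * ramp[::-1] + seg[:fade] * ramp
        quilt = np.concatenate([quilt, seg[fade:]])
    return quilt

# Quilts at the four segment lengths used in the study (assuming 16 kHz audio):
# for seg_ms in (30, 120, 480, 960):
#     quilt = make_speech_quilt(speech, sr=16000, seg_ms=seg_ms)
```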

2005, Vol 94 (4), pp. 2970-2975
Author(s): Rajiv Narayan, Ayla Ergün, Kamal Sen

Although auditory cortex is thought to play an important role in processing complex natural sounds such as speech and animal vocalizations, the specific functional roles of cortical receptive fields (RFs) remain unclear. Here, we study the relationship between a behaviorally important function, the discrimination of natural sounds, and the structure of cortical RFs. We examine this problem in the model system of songbirds, using a computational approach. First, we constructed model neurons based on the spectrotemporal RF (STRF), a widely used description of auditory cortical RFs. We focused on delayed inhibitory STRFs, a class of STRFs experimentally observed in primary auditory cortex (ACx) and its analog in songbirds (field L), which consist of an excitatory subregion and a delayed inhibitory subregion cotuned to a characteristic frequency. We quantified the discrimination of birdsongs by model neurons, examining both the dynamics and the temporal resolution of discrimination, using a recently proposed spike distance metric (SDM). We found that single model neurons with delayed inhibitory STRFs can discriminate accurately between songs. Discrimination improves dramatically when the temporal structure of the neural response at fine timescales is considered. When we compared discrimination by model neurons with and without the inhibitory subregion, we found that the presence of the inhibitory subregion can improve discrimination. Finally, we modeled a cortical microcircuit with delayed synaptic inhibition, a candidate mechanism underlying delayed inhibitory STRFs, and showed that blocking inhibition in this model circuit degrades discrimination.
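The discrimination analysis relies on a spike distance metric whose temporal resolution can be varied. A common metric with this property is the van Rossum (2001) distance; treating it as a stand-in for the authors' SDM is an assumption, but the sketch below shows how the kernel time constant tau sets the timescale at which two spike trains are compared.

```python
import numpy as np

def van_rossum_distance(spikes_a, spikes_b, tau, dt=0.001, t_max=None):
    """Van Rossum-style spike-train distance: convolve each spike train with an
    exponential kernel of time constant tau, then take the Euclidean distance
    between the traces. Small tau emphasizes fine spike timing; large tau
    approaches a spike-count comparison."""
    if t_max is None:
        t_max = max(spikes_a[-1] if len(spikes_a) else 0.0,
                    spikes_b[-1] if len(spikes_b) else 0.0) + 5 * tau
    t = np.arange(0.0, t_max, dt)

    def filtered(spikes):
        trace = np.zeros_like(t)
        for s in spikes:
            mask = t >= s
            trace[mask] += np.exp(-(t[mask] - s) / tau)
        return trace

    diff = filtered(np.asarray(spikes_a)) - filtered(np.asarray(spikes_b))
    return np.sqrt(np.sum(diff ** 2) * dt / tau)

# Template-based song discrimination: assign a response to the song whose
# template spike train it is closest to (templates is a hypothetical dict).
# predicted = min(templates, key=lambda song: van_rossum_distance(response, templates[song], tau=0.01))
```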


1999, Vol 82 (5), pp. 2327-2345
Author(s): Jagmeet S. Kanwal, Douglas C. Fitzpatrick, Nobuo Suga

Mustached bats, Pteronotus parnellii parnellii, emit echolocation pulses that consist of four harmonics with a fundamental consisting of a constant frequency (CF1–4) component followed by a short, frequency-modulated (FM1–4) component. During flight, the pulse fundamental frequency is systematically lowered by an amount proportional to the velocity of the bat relative to the background so that the Doppler-shifted echo CF2 is maintained within a narrow band centered at ∼61 kHz. In the primary auditory cortex, there is an expanded representation of 60.6- to 63.0-kHz frequencies in the “Doppler-shifted CF processing” (DSCF) area, where neurons show sharp, level-tolerant frequency tuning. More than 80% of DSCF neurons are facilitated by specific frequency combinations of ∼25 kHz (BF_low) and ∼61 kHz (BF_high). To examine the role of these neurons in fine frequency discrimination during echolocation, we measured the basic response parameters for facilitation to synthesized echolocation signals varied in frequency, intensity, and temporal structure. Excitatory response areas were determined by presenting single CF tones; facilitative curves were obtained by presenting paired CF tones. All neurons showing facilitation exhibit at least two facilitative response areas: one of broad spectral tuning to frequencies centered at BF_low, corresponding to a frequency in the lower half of the echolocation pulse FM1 sweep, and another of sharp tuning to frequencies centered at BF_high, corresponding to the CF2 in the echo. Facilitative response areas for BF_high are broadened by ∼0.38 kHz at both the best amplitude and 50 dB above threshold response, and show lower thresholds compared with the single-tone excitatory BF_high response areas. An increase in the sensitivity of DSCF neurons would lead to target detection from farther away and/or for smaller targets than previously estimated on the basis of single-tone responses to BF_high. About 15% of DSCF neurons show oblique excitatory and facilitatory response areas at BF_high, such that the center frequency of the frequency-response function decreases with increasing stimulus amplitude. DSCF neurons also have inhibitory response areas that either skirt or overlap both the excitatory and facilitatory response areas for BF_high and sometimes for BF_low. Inhibition by a broad range of frequencies contributes to the observed sharpness of frequency tuning in these neurons. Recordings from orthogonal penetrations show that the best frequencies for facilitation as well as excitation do not change within a cortical column. There does not appear to be any systematic representation of facilitation ratios across the cortical surface of the DSCF area.
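The Doppler-shift compensation behavior that this cortical area supports reduces to a simple calculation: to keep the echo CF2 near the ∼61 kHz reference, the bat must emit a proportionally lower CF2 as its flight speed increases. The sketch below applies the standard Doppler relation for an echo from a stationary reflector; the flight speed used is a made-up illustrative value.

```python
C_AIR = 343.0       # speed of sound in air (m/s)
F_REF = 61_000.0    # echo CF2 frequency the bat holds roughly constant (Hz)

def emitted_cf2(v_bat):
    """CF2 the bat must emit so the echo from a stationary background returns
    at F_REF, given flight speed v_bat in m/s.
    Echo from a stationary reflector: f_echo = f_emit * (c + v) / (c - v)."""
    return F_REF * (C_AIR - v_bat) / (C_AIR + v_bat)

print(emitted_cf2(5.0))   # ~59.25 kHz: the pulse is lowered by roughly 1.75 kHz
```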


2018, Vol 30 (11), pp. 1704-1719
Author(s): Anna Maria Alexandrou, Timo Saarinen, Jan Kujala, Riitta Salmelin

During natural speech perception, listeners must track the global speaking rate, that is, the overall rate of incoming linguistic information, as well as transient, local speaking rate variations occurring within the global speaking rate. Here, we address the hypothesis that this tracking mechanism is achieved through coupling of cortical signals to the amplitude envelope of the perceived acoustic speech signal. Cortical signals were recorded with magnetoencephalography (MEG) while participants perceived spontaneously produced speech stimuli at three global speaking rates (slow, normal/habitual, and fast). As is inherent to spontaneously produced speech, these stimuli also featured local variations in speaking rate. The coupling between cortical and acoustic speech signals was evaluated using audio–MEG coherence. Modulations in audio–MEG coherence spatially differentiated between tracking of the global speaking rate, highlighting the temporal cortex bilaterally and the right parietal cortex, and sensitivity to local speaking rate variations, emphasizing the left parietal cortex. Cortical tuning to the temporal structure of natural connected speech thus seems to require the joint contribution of both auditory and parietal regions. These findings suggest that cortical tuning to speech rhythm operates on two functionally distinct levels: one encoding the global rhythmic structure of speech and the other associated with online, rapidly evolving temporal predictions. Thus, it may be proposed that speech perception is shaped by evolutionary tuning, a preference for certain speaking rates, and by predictive tuning, associated with cortical tracking of the constantly changing rate of linguistic information in a speech stream.
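Audio–MEG coherence, the coupling measure used here, is the magnitude-squared coherence between a cortical signal and the amplitude envelope of the speech audio. The sketch below is a generic single-channel illustration with SciPy (hypothetical function; the study's actual pipeline, e.g., source-level MEG and statistical thresholding, is not reproduced).

```python
import numpy as np
from scipy.signal import coherence, hilbert

def audio_meg_coherence(meg, audio, fs, nperseg=None):
    """Magnitude-squared coherence between one MEG channel and the amplitude
    envelope of the speech audio, both assumed resampled to the same rate fs."""
    envelope = np.abs(hilbert(audio))                        # speech amplitude envelope
    f, cxy = coherence(meg, envelope, fs=fs, nperseg=nperseg or int(2 * fs))
    return f, cxy

# Coherence peaks near the syllable rate of each stimulus set would indicate
# cortical tracking of the corresponding global speaking rate.
```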


2021
Author(s): Swapna Agarwalla, Sharba Bandyopadhyay

Syllable sequences in male mouse ultrasonic vocalizations (USVs), or songs, contain structure that can be quantified through predictability, as in birdsong and aspects of speech. The apparent innateness of USVs and their lack of learnability have discounted mouse USVs as a model for speech-like social communication and its deficits. Informative contextual natural sequences (SN) were extracted theoretically and were preferred by female mice. Primary auditory cortex (A1) supragranular neurons show differential selectivity to the same syllables in SN and in random sequences (SR). Excitatory neurons (EXNs) in females showed increases in selectivity to whole SNs over SRs depending on the extent of social exposure to a male, while syllable selectivity remained unchanged. Thus, single neurons in mouse A1 adaptively represent the entire order of acoustic units without altering the selectivity of individual units, a capacity fundamental to speech perception. Additionally, the observed plasticity was replicated by silencing somatostatin-positive neurons, which showed plastic effects opposite to those of EXNs, pointing to possible pathways involved in the perception of sound sequences.
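The sequence structure referred to here is quantified through predictability. As a simplified stand-in for the information-theoretic measures typically used for such sequences (and not the authors' exact measure), the sketch below scores a syllable string by its average first-order transition probability.

```python
import numpy as np
from collections import Counter

def transition_predictability(sequence):
    """Average probability of each syllable given the preceding one (bigram
    model). Higher values indicate more structured, less random ordering."""
    bigrams = Counter(zip(sequence[:-1], sequence[1:]))
    unigrams = Counter(sequence[:-1])
    probs = [bigrams[(a, b)] / unigrams[a]
             for (a, b) in zip(sequence[:-1], sequence[1:])]
    return float(np.mean(probs))

ordered = list("ABCABCABCABC")                      # toy, highly predictable song
shuffled = list(np.random.permutation(ordered))     # same syllables, random order
print(transition_predictability(ordered), transition_predictability(shuffled))
```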


2020
Author(s): Emmanuel Biau, Danying Wang, Hyojin Park, Ole Jensen, Simon Hanslmayr

Abstract
Audiovisual speech perception relies, among other things, on our expertise to map a speaker’s lip movements with speech sounds. This multimodal matching is facilitated by salient syllable features that align lip movements and acoustic envelope signals in the 4–8 Hz theta band. Although non-exclusive, the predominance of theta rhythms in speech processing has been firmly established by studies showing that neural oscillations track the acoustic envelope in the primary auditory cortex. Equivalently, theta oscillations in the visual cortex entrain to lip movements, and the auditory cortex is recruited during silent speech perception. These findings suggest that neuronal theta oscillations may play a functional role in organising information flow across visual and auditory sensory areas. We presented silent speech movies while participants performed a pure tone detection task to test whether entrainment to lip movements directs the auditory system and drives behavioural outcomes. We showed that auditory detection varied depending on the ongoing theta phase conveyed by lip movements in the movies. In a complementary experiment presenting the same movies while recording participants’ electroencephalogram (EEG), we found that silent lip movements entrained neural oscillations in the visual and auditory cortices, with the visual phase leading the auditory phase. These results support the idea that the visual cortex entrained by lip movements filtered the sensitivity of the auditory cortex via theta phase synchronisation.
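The phase-dependence analysis requires an estimate of the ongoing theta phase carried by the lip movements. A standard way to obtain it is band-pass filtering in the 4–8 Hz range followed by the Hilbert transform; the sketch below illustrates this generic approach (hypothetical function, not the authors' exact pipeline).

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def theta_phase(signal, fs, band=(4.0, 8.0)):
    """Instantaneous theta phase of a signal such as a lip-aperture time course:
    band-pass filter in the theta range, then take the angle of the analytic
    signal."""
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype='band')
    return np.angle(hilbert(filtfilt(b, a, signal)))

# Binning tone-detection performance by the lip-signal theta phase at tone onset
# would reveal whether auditory detection is phase-dependent.
```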


2021
Author(s): Jana Van Canneyt, Marlies Gillis, Jonas Vanthornhout, Tom Francart

The neural tracking framework enables the analysis of neural responses (EEG) to continuous natural speech, e.g., a story or a podcast. This allows for objective investigation of a range of auditory and linguistic processes in the brain during natural speech perception. This approach is more ecologically valid than traditional auditory evoked responses and has great potential for both research and clinical applications. In this article, we review the neural tracking framework and highlight three prominent examples of neural tracking analyses. This includes the neural tracking of the fundamental frequency of the voice (f0), the speech envelope and linguistic features. Each of these analyses provides a unique point of view into the hierarchical stages of speech processing in the human brain. f0-tracking assesses the encoding of fine temporal information in the early stages of the auditory pathway, i.e. from the auditory periphery up to early processing in the primary auditory cortex. This fundamental processing in (mostly) subcortical stages forms the foundation of speech perception in the cortex. Envelope tracking reflects bottom-up and top-down speech-related processes in the auditory cortex, and is likely necessary but not sufficient for speech intelligibility. To study neural processes more directly related to speech intelligibility, neural tracking of linguistic features can be used. This analysis focuses on the encoding of linguistic features (e.g. word or phoneme surprisal) in the brain. Together these analyses form a multi-faceted and time-effective objective assessment of the auditory and linguistic processing of an individual.
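Envelope tracking is commonly quantified with an encoding model that maps time-lagged copies of the speech envelope onto the EEG, for example a temporal response function (TRF) estimated with ridge regression. The sketch below is a deliberately minimal single-channel version (circular edge effects and cross-validation are ignored); dedicated toolboxes implement this properly.

```python
import numpy as np

def trf_encoding_model(envelope, eeg, fs, t_min=-0.1, t_max=0.4, alpha=1.0):
    """Minimal TRF estimate: ridge regression of one EEG channel onto
    time-lagged copies of the speech envelope. Returns the TRF weights and the
    correlation between predicted and measured EEG ("neural tracking")."""
    lags = np.arange(int(t_min * fs), int(t_max * fs) + 1)
    # design matrix of lagged envelopes (circular shifts; edge effects ignored)
    X = np.column_stack([np.roll(envelope, lag) for lag in lags])
    w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ eeg)
    prediction = X @ w
    r = np.corrcoef(prediction, eeg)[0, 1]
    return w, r
```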


2017
Author(s): Jeremy I. Skipper, Uri Hasson

Abstract
What adaptations allow humans to produce and perceive speech so effortlessly? We show that speech is supported by a largely undocumented core of structural and functional connectivity between the central sulcus (CS, or primary motor and somatosensory cortex) and the transverse temporal gyrus (TTG, or primary auditory cortex). Anatomically, we show that CS and TTG cortical thickness covary across individuals and that the two regions are connected by white matter tracts. Neuroimaging network analyses confirm the functional relevance and specificity of these structural relationships. Specifically, the CS and TTG are functionally connected at rest and during natural audiovisual speech perception, and are coactive over a large variety of linguistic stimuli and tasks. Importantly, across structural and functional analyses, the connectivity of regions immediately adjacent to the TTG is with premotor and prefrontal regions rather than with the CS. Finally, we show that this structural/functional CS-TTG relationship is mediated by a constellation of genes associated with vocal learning and disorders of efference copy. We propose that this core circuit constitutes an interface for rapidly exchanging articulatory and acoustic information and discuss implications for current models of speech.
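Two of the analyses mentioned, structural covariance of cortical thickness and resting-state functional connectivity, reduce to correlations once region-wise measures are available. The sketch below assumes such pre-computed inputs (hypothetical variables); the paper's actual analyses are considerably more elaborate.

```python
import numpy as np
from scipy.stats import pearsonr

def structural_covariance(thickness_cs, thickness_ttg):
    """Across-subject correlation of mean cortical thickness in CS and TTG
    (arrays with one value per participant)."""
    return pearsonr(thickness_cs, thickness_ttg)       # (r, p-value)

def functional_connectivity(bold_cs, bold_ttg):
    """Resting-state functional connectivity for one participant: correlation
    of the region-averaged BOLD time series of CS and TTG."""
    return np.corrcoef(bold_cs, bold_ttg)[0, 1]
```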


2004, Vol 92 (1), pp. 52-65
Author(s): Andrei V. Medvedev, Jagmeet S. Kanwal

The mustached bat, Pteronotus parnellii, uses complex communication sounds (“calls”) for social interactions. We recorded both event-related local field potentials (LFPs) and single/few-unit (SU) spike activity from the same electrode in the posterior region of the primary auditory cortex (AIp) during presentation of simple syllabic calls to awake bats. Temporal properties of the LFPs, which reflect activity within local neuronal clusters, and spike discharges from SUs were studied at 138 recording sites in six bats using seven variants each of 14 simple syllables presented at intensity levels of 40–90 dB SPL. There was no clear spatial selectivity to different call types within the AIp area. Rather, as shown previously, single units responded to multiple call types with similar values of the peak response rate in the peri-stimulus time histogram (PSTH). The LFPs and SUs, however, showed a rich temporal structure that was unique for each call type. Multidimensional scaling (MDS) of the averaged waveforms of call-evoked LFPs and PSTHs revealed that calls were better segregated in the two-dimensional space based on the LFP compared with the PSTH data. A representation within the “LFP-space” revealed that one of the dimensions correlated with the predominant and fundamental frequency of a call. The other dimension showed a high correlation with “harmonic complexity” (“fine” spectral structure of a call). We suggest that the temporal pattern of LFP and spiking activity reflects call-specific dynamics at any locus within the AIp area. This dynamic contributes to a distributed (population-based) representation of calls. Alternatively stated, the fundamental frequency and harmonic structure of calls, and not the recording location within the AIp, determines the temporal structure of the call-evoked LFP.
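Multidimensional scaling of averaged call-evoked waveforms can be sketched with scikit-learn: each call's averaged LFP (or PSTH) is treated as a point in a high-dimensional space and embedded in two dimensions, and the embedding axes can then be correlated with acoustic descriptors such as fundamental frequency or harmonic complexity. This is a generic illustration, not the authors' exact MDS procedure.

```python
import numpy as np
from sklearn.manifold import MDS

def embed_call_responses(responses, n_components=2, random_state=0):
    """2-D MDS embedding of call-evoked responses. `responses` is an
    (n_calls, n_timepoints) array of averaged LFP waveforms or PSTHs; calls
    with similar evoked responses end up close together in the embedding."""
    mds = MDS(n_components=n_components, random_state=random_state)
    return mds.fit_transform(responses)                # (n_calls, 2) coordinates
```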


Neuroreport, 2002, Vol 13 (3), pp. 311-315
Author(s): Lynne E. Bernstein, Edward T. Auer, Jean K. Moore, Curtis W. Ponton, Manuel Don, ...
