speech corpora
Recently Published Documents


TOTAL DOCUMENTS

171
(FIVE YEARS 38)

H-INDEX

10
(FIVE YEARS 1)

Author(s):  
Linda Gaile

Research on the simultaneous interpreting process and its source and target languages requires both the oral source speeches and their simultaneous interpretation into the target language. For a relatively short time now, researchers of translation and interpreting have been able to access digitized linguistic corpora, including parallel and speech corpora for different language pairs, from which they can build their own purpose-oriented corpora of source- and target-language oral texts. The resulting corpus can then be analysed qualitatively or quantitatively with different software tools and investigated for specific linguistic phenomena. The present article focuses on the benefits of retrieving data from digitized language and speech corpora, which can be an important aid in analysing the target text of an oral simultaneous interpretation. At the heart of the study is the European Parliament’s speech corpus, from which authentic speeches in the source language (German) and their simultaneous interpretation in the target language (Latvian) can be obtained to create a sub-corpus for the German-Latvian language pair. Among other things, the article explores which interpreting strategies can be used for simultaneous interpreting from German into Latvian, and presents the EXMARaLDA Partitur-Editor software, which makes it possible to create a simultaneous transcription of the source language and the simultaneously interpreted target language as well as to develop a speech corpus.


2021 ◽  
Vol 12 ◽  
Author(s):  
Qihui Xu ◽  
Magdalena Markowska ◽  
Martin Chodorow ◽  
Ping Li

The study of code-switching (CS) speech has produced a wealth of knowledge in the understanding of bilingual language processing and representation. Here, we approach this issue by using a novel network science approach to map bilingual spontaneous CS speech. In Study 1, we constructed semantic networks on CS speech corpora and conducted community detections to depict the semantic organizations of the bilingual lexicon. The results suggest that the semantic organizations of the two lexicons in CS speech are largely distinct, with a small portion of overlap such that the semantic network community dominated by each language still contains words from the other language. In Study 2, we explored the effect of clustering coefficients on language choice during CS speech, by comparing clustering coefficients of words that were code-switched with their translation equivalents (TEs) in the other language. The results indicate that words where the language is switched have lower clustering coefficients than their TEs in the other language. Taken together, we show that network science is a valuable tool for understanding the overall map of bilingual lexicons as well as the detailed interconnections and organizations between the two languages.
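As a rough illustration of the clustering coefficient used in Study 2, the sketch below computes the local clustering coefficient of a word in a small undirected network: the fraction of pairs of its neighbours that are themselves connected. The toy word list and edges are hypothetical and are not taken from the study's corpora; the authors' actual network construction and tooling are not described here.

```python
from itertools import combinations


def local_clustering(graph, node):
    """Fraction of pairs of a node's neighbours that are directly linked."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        # With fewer than two neighbours there are no pairs to connect.
        return 0.0
    linked = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
    return 2 * linked / (k * (k - 1))


# Hypothetical toy semantic network (undirected edges between words).
edges = [
    ("dog", "cat"),
    ("dog", "pet"),
    ("cat", "pet"),
    ("dog", "perro"),
    ("perro", "gato"),
]

# Build an adjacency map: word -> set of neighbouring words.
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

cc_dog = local_clustering(graph, "dog")
cc_perro = local_clustering(graph, "perro")
print(f"dog: {cc_dog:.3f}, perro: {cc_perro:.3f}")
```

In this toy network "perro" has a lower coefficient than "dog" because its two neighbours are not linked to each other, which is the kind of contrast the study compares between switched words and their translation equivalents.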


Author(s):  
Yu. I. Butenko ◽  
Yu. V. Stroganov ◽  
A. V. Kvasnikov ◽  
N. V. Slavnov ◽  
N. V. Kokurina

The article describes the user roles in a speech corpus for studying the pronunciation variability of native speakers of Russian. It states the need for speech recognition systems that can handle speakers with dialects and disabilities, and emphasizes the need to study how pronunciation varies across speakers, taking regional and individual speech characteristics into account. The subsequent creation of a speech corpus as the basis for regional and individual speech recognition is discussed. The speech corpus under development contains recordings of the same text fragments read by different speakers. The audio markup system used to research the pronunciation variability of native Russian speakers is described. Four roles are provided for working with the corpus: administrator, moderator, marker, and analyst. The rights of each user are described: the administrator has all possible rights in the system; the marker is the user whose main task is to mark up the audio recordings; the analyst is the user who can assess and process the data in the speech corpus. The moderator’s role in controlling markup quality is necessary because the audio recordings are mostly marked up by students. The information in the resulting speech corpus is expected to be useful for phonetic studies in linguistics and as a database for speech recognition.


Author(s):  
Edward Ombui ◽  
Lawrence Muchemi ◽  
Peter Wagacha

Presidential campaign periods are a major trigger event for hate speech on social media in almost every country. A systematic review of previous studies indicates a shortage of publicly available annotated datasets and hardly any evidence of theoretical underpinning for the annotation schemes used for hate speech identification. This situation stifles the development of empirically useful data for research, especially in supervised machine learning. This paper describes the methodology used to develop a multidimensional hate speech framework based on the components of the duplex theory of hate [1], which include distance, passion, commitment to hate, and hate as a story. An annotation scheme based on the framework was then used to annotate a random sample of ~51k tweets drawn from ~400k tweets collected during the August and October 2017 presidential campaign period in Kenya. This resulted in a gold-standard code-switched dataset that could be used for comparative and empirical studies in supervised machine learning. Classifiers trained on this dataset could be used to provide real-time monitoring of hate speech spikes on social media and to inform data-driven decision-making by relevant government security agencies.


2021 ◽  
Vol 7 (4) ◽  
pp. 4001
Author(s):  
Maya Heydarova

The voice corpus of a language is an essential part of its linguistic resources, and it contains the phonetic database. A phonetic database is a structured collection of software-delivered speech fragments. The phonetic database, or voice corpus, has become a new element in speech technologies, and much research has been devoted to it. Researchers' interest in voice corpora is related to the development of speech recognition systems. There is now sufficient experience in preparing phonetic databases. Together with the available knowledge on preparing and using everyday speech corpora, the current level of speech technologies and the increasing power of computers allow for the investigation of varied language material and for large-scale statistical phonetic research. The article examines these developing directions in linguistics. Speech corpora are a valuable source of information for phonological research and the study of sound patterns. The study of speech corpora is in its infancy compared to other fields of linguistics. Existing speech corpora cover only part of the world's languages and do not fully represent all dialects and speech forms from a phonological point of view. The article analyses the history, structure, and importance of developing speech corpora, a branch of corpus linguistics that has developed in recent years. It also lists the main features to be considered in the design of a speech corpus.


Author(s):  
Michael A. Johns ◽  
Laura Rodrigo ◽  
Rosa E. Guzzardo Tamargo ◽  
Aliza Winneg ◽  
Paola E. Dussias

Most studies on lexical priming have examined single words presented in isolation, even though language users rarely encounter words this way. The present study builds upon this work by examining both within-language identity priming and across-language translation priming in sentential contexts. Highly proficient Spanish–English bilinguals read sentence-question pairs, where the sentence contained the prime and the question contained the target. At earlier stages of processing, we find evidence only of within-language identity priming; at later stages of processing, however, across-language translation priming surfaces and becomes as strong as within-language identity priming. Increasing the time between the prime sentence and target question results in strengthened priming at the latest stages of processing. These results replicate previous findings at the single-word level but do so within sentential contexts, which has implications both for accounts of priming via automatic spreading activation and for accounts of persistence attested in spontaneous speech corpora.


2021 ◽  
Vol 45 (2) ◽  
pp. 9
Author(s):  
Robert Long ◽  
Hiroaki Watanabe

This study examines the grammatical errors in Japanese university students’ dialogues over an academic year. The L2 interactions of 15 Japanese speakers were taken from the JUSFC2018 corpus (April/May 2018) and the JUSFC2019 corpus (January/February 2019). The corpora were based on a self-introduction monologue and a three-question dialogue; however, this study examines only the grammatical accuracy found in the dialogues. The research questions focused on whether grammatical accuracy differed significantly between the first interview session in 2018 and the second one the following year, specifically regarding errors in clauses per 100 words, the frequency of global and local errors, and the five most frequent kinds of errors. Results showed that error-free clauses per 100 words decreased slightly from 8.78 to 7.89, while clauses with errors per 100 words increased by nearly one clause, from 3.16 to 4.05. Global errors declined markedly from 22 to 15, but local errors increased from 76 to 112. A t-test indicated no significant difference between the two speech corpora with regard to global and local errors. The five most frequent errors were (a) lexical phrasing (71), (b) article omissions (41), (c) plural errors (19), (d) preposition omissions (19), and (e) verb usage (9). These data highlight the difficulty of having students self-edit.
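The changes reported in this abstract can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the helper name `percent_change` is illustrative, and the percentages are derived here rather than reported by the authors.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


# Clauses per 100 words, 2018 session versus 2019 session.
error_free_delta = round(7.89 - 8.78, 2)    # error-free clauses
with_errors_delta = round(4.05 - 3.16, 2)   # clauses containing errors

# Raw error counts, 2018 session versus 2019 session.
global_change = round(percent_change(22, 15), 1)
local_change = round(percent_change(76, 112), 1)

print(error_free_delta, with_errors_delta, global_change, local_change)
```

The two per-100-word rates move by the same amount in opposite directions, so the combined rate of clauses per 100 words is the same in both sessions (11.94).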


2021 ◽  
Vol 5 (2) ◽  
pp. 5-27
Author(s):  
Tommaso Raso ◽  
Bruno Rocha

This paper investigates the prosodic relations between the category of illocution and that of attitude, the latter defined as the way the illocution (verbal action) is performed (Modis on Actum). We set up three experiments and related perception tests seeking to understand: (i) how different attitudes of the same illocution (Order) are perceived in different contexts; (ii) whether the illocutions of Order and Instruction are conveyed by the same prosodic form; (iii) how pragmatic/cognitive parameters work to accommodate a different prosodic form, using the illocutions of Offer and Question of Confirmation. We conclude that the methodology for studying illocutionary prosodic forms must pay close attention to the prosodic aspects of attitude, since they are always present when an illocution is performed, superposing their features over those of the illocution. We also claim that the identification of a specific illocution must consider some pragmatic and cognitive parameters, and not only prosody, since different illocutions can be prosodically performed with the same form. This becomes clear if we look for data in spontaneous speech corpora, where the pragmatic conditions can be at least partially reconstructed.


2021 ◽  
Vol 7 (s1) ◽  
Author(s):  
Nanna Haug Hilton

This paper presents the project Stimmen fan Fryslân ‘Voices of Fryslân’. The project relies on a smartphone application developed to involve local communities in the creation of speech corpora, particularly of lesser-used languages. This paper lays out the scientific and societal context of the project, showcases the smartphone application, and gives an overview of the results from the project, which attracted more than 15,000 users. Some key methodological issues are considered, and the paper discusses the role of smartphone technology for citizen science in minority language areas while also showing new maps with distributions of lexical and phonological variation in Frisian.

