A tagging algorithm for mixed language identification in a noisy domain

Author(s): Mike Rosner ◽ Paulseph-John Farrugia
2020 ◽ Vol 15 (2) ◽ pp. 330-365

Author(s): John M. Lipski

Abstract: This study focuses on bilingual speakers of Ecuadoran Quichua and the mixed language known as Media Lengua, which consists of Quichua morphosyntactic frames with all content-word roots relexified from Spanish. For all intents and purposes, only the lexicon (more specifically, lexical roots) separates Media Lengua from Quichua, and yet speakers generally manage to keep the two languages apart in production and are able to unequivocally distinguish the languages in perception tasks. Two main questions drive the research effort. The first, given the very close relationship between Quichua and Media Lengua, is whether each language has a distinct lexicon or whether a single lexical repository is shared by the two languages. A second and closely related question is the extent to which language-specific phonotactic patterns aid in language identification, possibly even to the extent of constituting the only robust language-tagging mechanism in a joint lexicon. Lexical-decision and false-memory tasks were used to probe the Quichua-Media Lengua bilingual lexical repertoire; the results are consistent with a model based on a single lexicon, partially differentiated by subtle phonotactic cues, and bolstered by contemporary participants' knowledge of Spanish as well as Quichua.
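The phonotactic-tagging hypothesis described above can be illustrated with a toy sketch: train one character-bigram frequency model per lexicon, score an incoming word against both, and tag the word with the higher-scoring model. This is a minimal illustration, not the study's actual method; the function names are mine, and any words fed to it here would be invented placeholders, not real Quichua or Media Lengua roots.

```python
import math
from collections import Counter

def bigrams(word):
    """Character bigrams of a word, with '#' marking word boundaries."""
    w = f"#{word}#"
    return [w[i:i + 2] for i in range(len(w) - 1)]

def train(words):
    """Estimate relative bigram frequencies from a word list."""
    counts = Counter(bg for w in words for bg in bigrams(w))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

def score(word, model, floor=1e-6):
    """Log-probability of a word under a bigram model (unseen bigrams floored)."""
    return sum(math.log(model.get(bg, floor)) for bg in bigrams(word))

def tag(word, model_a, model_b):
    """Tag a word with whichever lexicon's phonotactics fit it better."""
    return "A" if score(word, model_a) >= score(word, model_b) else "B"
```

Even subtle phonotactic differences between the two word lists would shift the log-probability comparison, which is the intuition behind phonotactic cues acting as a language-tagging mechanism in a shared lexicon.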


2020
Author(s): Ronny Mabokela

This paper proposes phoneme clustering methods for multilingual language identification (LID) on a mixed-language corpus. A one-pass multilingual automatic speech recognition (ASR) system converts spoken utterances into phone sequences. Hidden Markov models were employed to train multilingual acoustic models that handle multiple languages within a single utterance. Two phoneme clustering methods were explored to derive the most appropriate phoneme similarities between the target languages. Finally, a supervised machine learning technique was employed to learn language transitions from the phonotactic information, and support vector machine (SVM) models were used to classify phoneme occurrences. System performance was evaluated on a mixed-language speech corpus for two South African languages (Sepedi and English), using the phone error rate (PER) and LID classification accuracy separately. We show that the multilingual ASR output fed directly to the LID system has a direct impact on LID accuracy. Our proposed system achieved acceptable phone recognition and classification accuracy on both mixed-language and monolingual speech (i.e. either Sepedi or English). Both data-driven and knowledge-driven phoneme clustering methods improve ASR and LID for code-switched speech. The data-driven method obtained a PER of 5.1% and a LID classification accuracy of 94.5% when the acoustic models were trained with 64 Gaussian mixtures per state.
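The final stage of the pipeline above, SVM classification over phonotactic features derived from phone sequences, might be sketched as follows. This is a minimal sketch assuming scikit-learn is available; the space-separated phone strings are invented stand-ins for one-pass ASR output, not the actual Sepedi-English corpus, and the feature choice (bag of phone n-grams) is my simplification of the paper's phonotactic modelling.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy phone-sequence "utterances" (space-separated phone labels) with language tags.
# These strings are illustrative placeholders, not real corpus transcriptions.
utterances = [
    "p e d i l e b o", "m o r e n a k a", "s e p e d i k e",  # tagged "SEP"
    "dh ih s ih z", "th ah k ae t", "ih ng g l ih sh",        # tagged "ENG"
]
labels = ["SEP", "SEP", "SEP", "ENG", "ENG", "ENG"]

# Bag of phone unigrams and bigrams feeding a linear SVM classifier.
# token_pattern=r"\S+" keeps single-character phone labels as tokens.
clf = make_pipeline(
    CountVectorizer(analyzer="word", token_pattern=r"\S+", ngram_range=(1, 2)),
    SVC(kernel="linear"),
)
clf.fit(utterances, labels)
```

A per-utterance prediction (`clf.predict(["s e p e d i"])`) then yields a language tag, mirroring the paper's point that the quality of the phone sequences fed into this classifier directly bounds LID accuracy.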


2015 ◽ Vol 2 (1) ◽ pp. 23-28
Author(s): Manogna Maddali ◽ Shoba Bindu C

2015
Author(s): Shervin Malmasi ◽ Joel Tetreault ◽ Mark Dras

Author(s): Taishin Murase ◽ Ryuji Hirayama ◽ Naoto Hoshikawa ◽ Hitoraka Nakayama ◽ Tomoyoshi Shimobaba ◽ ...


Author(s): Saad Irtza ◽ Vidhyasaharan Sethu ◽ Sarith Fernando ◽ Eliathamby Ambikairajah ◽ Haizhou Li
