Spoken Language Identification of Indian Languages Using MFCC Features

Author(s): Mainak Biswas, Saif Rahaman, Satwik Kundu, Pawan Kumar Singh, Ram Sarkar
2021, Vol. 10 (5), pp. 2578-2587

Author(s): Aarti Bakshi, Sunil Kumar Kopparapu

In spoken language identification (SLID) systems, the test data may be of much shorter duration than the training data, a condition known as duration mismatch. Duration-normalized features are used to identify nine Indian languages under duration mismatch conditions. Random forest-based importance vectors of 1582 OpenSMILE features are calculated for each utterance in datasets of different durations. The feature importance vectors are normalized within each dataset and then across the datasets of different durations. The optimal number of duration-normalized features is selected to maximize SLID accuracy. Three classifiers, an artificial neural network (ANN), a support vector machine (SVM), and a random forest (RF), along with their fusion, whose weights are optimized using logistic regression, are used. The speech material comprised 30-second utterances extracted from the All India Radio dataset covering nine Indian languages. Seven new datasets of shorter utterance durations were generated by carefully splitting each utterance. Experimental results showed that the 150 most important duration-normalized features were optimal, yielding a relative accuracy increase of 18-80% under mismatch conditions. Accuracy decreased as the duration mismatch increased.
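The selection-and-fusion pipeline can be sketched with scikit-learn. The snippet below is an illustration, not the authors' implementation: it assumes the 1582-dimensional OpenSMILE features are already available as a matrix X, ranks them by random-forest importance, keeps the 150 most important, and fuses ANN, SVM, and RF posterior probabilities with a logistic-regression meta-classifier. The normalization of importance vectors across duration datasets described in the abstract is omitted for brevity.

```python
# Minimal sketch (not the authors' code) of importance-based feature selection
# and score fusion for a 9-language SLID task.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def select_top_features(X, y, k=150, seed=0):
    """Rank features by random-forest importance and return the top-k indices.
    (The paper additionally normalizes importances across duration datasets,
    which is not shown here.)"""
    rf = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X, y)
    return np.argsort(rf.feature_importances_)[::-1][:k]

# Placeholder data: X would be the (n_utterances, 1582) OpenSMILE matrix,
# y the language labels 0..8.
X, y = np.random.rand(900, 1582), np.random.randint(0, 9, 900)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

top = select_top_features(X_tr, y_tr, k=150)
X_tr, X_val = X_tr[:, top], X_val[:, top]

# Base classifiers; their class posteriors are stacked and fused with
# a logistic-regression meta-classifier.
base = [
    MLPClassifier(hidden_layer_sizes=(128,), max_iter=500, random_state=0),
    SVC(probability=True, random_state=0),
    RandomForestClassifier(n_estimators=200, random_state=0),
]
probs_val = np.hstack([clf.fit(X_tr, y_tr).predict_proba(X_val) for clf in base])
fusion = LogisticRegression(max_iter=1000).fit(probs_val, y_val)
# In a real experiment the fusion weights would be fit on a separate held-out
# split, not on the evaluation data; this only illustrates the pipeline shape.
print("fused accuracy:", fusion.score(probs_val, y_val))
```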


Language is the ability to communicate with other people. Approximately 6,500 languages are spoken in the world, and different regions of the world speak different languages. Spoken language identification is the process of identifying the language spoken in a speech sample. Most spoken language identification work has been carried out on languages other than Indian ones. Many applications depend on recognizing speech, such as spoken language translation, where the fundamental first step is to identify the speaker's language. The system described here is built to identify two Indian languages. Speech data from various news channels, available online, is used. Mel Frequency Cepstral Coefficient (MFCC) features are extracted from the speech samples because they give each class of audio a distinctive signature. Identification is performed by feeding the MFCC features to a Deep Neural Network. The objective of this work is to improve the accuracy of the classification model, which is achieved by modifying several layers of the Deep Neural Network.
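A minimal sketch of such a pipeline is shown below, assuming librosa for MFCC extraction and Keras for the network; the feature pooling (mean and standard deviation over time), layer sizes, and training settings are illustrative choices, not the authors' configuration.

```python
# Minimal sketch (assumptions: librosa for MFCCs, Keras for the DNN;
# file lists, labels, and hyperparameters are placeholders).
import numpy as np
import librosa
from tensorflow import keras

def mfcc_vector(path, sr=16000, n_mfcc=13):
    """Load an utterance and return a fixed-length MFCC feature vector
    (per-coefficient mean and standard deviation over time)."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def build_dnn(input_dim, n_classes=2):
    """Small fully connected network; the depth and width of these layers are
    the kind of changes the abstract refers to when tuning accuracy."""
    model = keras.Sequential([
        keras.layers.Input(shape=(input_dim,)),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Usage sketch, given a list of audio paths and 0/1 labels for the two languages:
# X = np.stack([mfcc_vector(p) for p in wav_paths])
# y = np.array(labels)
# model = build_dnn(input_dim=X.shape[1])
# model.fit(X, y, epochs=30, batch_size=32, validation_split=0.2)
```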


2021
Author(s): Enes Furkan Cigdem, Ali Haznedaroglu, Levent M. Arslan
