Cognitive and Psychophysical Determinants of Voice Recognition

2019 ◽  
Author(s):  
Bryan Shilowich ◽  
Irving Biederman

Voice recognition is a fundamental pathway to person individuation, although it is typically overshadowed by its visual counterpart, face recognition. There have been no large-scale, parametric studies investigating voice recognition performance as a function of cognitive variables in concert with voice parameters. Using celebrity voice clips of varying lengths (1 to 4 s) paired with similar-sounding, unfamiliar voice foils, the present study investigated three key voice parameters distinguishing targets from foils, namely fundamental frequency, f0 (pitch); subharmonic-to-harmonic ratio, SHR (creakiness); and syllabic rate, in concert with the cognitive variables of voice familiarity and judged voice distinctiveness, as they contributed to recognition accuracy at varying clip lengths. All the variables had robust effects in clips as short as 1 s. Objective measures of distinctiveness, quantified by the distance of each target voice from that target's sex-based mean for each parameter, showed that sensitivity to distinctiveness increased with familiarity. This effect was most evident on foil trials; at clip lengths of one second and above, f0 and SHR distinctiveness showed no discernible effect on match trials. Speaking-rate distinctiveness improved match accuracy, an effect seen only with high familiarity. Recognition accuracy improved with the number of parameters that differed by an amount larger than the median, both in the target-to-foil and the target-to-mean voice comparisons. A linear regression model of these three voice parameters, clip length, and subjective measures of distinctiveness and familiarity accounted for 36.7% of the variance in recognition accuracy.
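To make the distinctiveness measure and the regression model concrete, here is a minimal Python sketch assuming a small table of voices with measured f0, SHR, and syllabic rate; the column names, toy values, and regression setup are illustrative assumptions, not the authors' actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical per-trial data: voice parameters, clip length, familiarity,
# and observed recognition accuracy (column names are illustrative only).
df = pd.DataFrame({
    "sex":         ["F", "M", "F", "M", "F", "M"],
    "f0":          [210.0, 115.0, 185.0, 130.0, 240.0, 95.0],   # Hz
    "shr":         [0.12, 0.30, 0.18, 0.22, 0.08, 0.35],        # subharmonic-to-harmonic ratio
    "syll_rate":   [4.1, 5.2, 3.8, 4.9, 4.5, 5.6],              # syllables per second
    "clip_len":    [1, 2, 4, 1, 2, 4],                          # seconds
    "familiarity": [3, 5, 2, 4, 5, 1],                          # subjective rating
    "accuracy":    [0.71, 0.88, 0.65, 0.80, 0.92, 0.58],
})

# Objective distinctiveness: absolute distance of each voice parameter
# from the sex-based mean of that parameter.
for col in ["f0", "shr", "syll_rate"]:
    sex_mean = df.groupby("sex")[col].transform("mean")
    df[f"{col}_dist"] = (df[col] - sex_mean).abs()

# Linear model of accuracy from distinctiveness, clip length, and familiarity.
X = df[["f0_dist", "shr_dist", "syll_rate_dist", "clip_len", "familiarity"]]
y = df["accuracy"]
model = LinearRegression().fit(X, y)
print("R^2 on this toy sample:", round(model.score(X, y), 3))
```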

2021 ◽  
Vol 13 (10) ◽  
pp. 265
Author(s):  
Jie Chen ◽  
Bing Han ◽  
Xufeng Ma ◽  
Jian Zhang

Underwater target recognition is an important supporting technology for the development of marine resources, and it is mainly limited by the purity of feature extraction and the universality of recognition schemes. The low-frequency analysis and recording (LOFAR) spectrum is one of the key features of an underwater target and can be used for feature extraction. However, complex underwater environmental noise and the extremely low signal-to-noise ratio of the target signal lead to breakpoints in the LOFAR spectrum, which seriously hinder underwater target recognition. To overcome this issue and further improve recognition performance, we adopted a deep-learning approach and propose a novel LOFAR spectrum enhancement (LSE)-based underwater target-recognition scheme consisting of preprocessing, offline training, and online testing. In preprocessing, we design a LOFAR spectrum enhancement algorithm based on multi-step decisions to recover the breakpoints in the LOFAR spectrum. In offline training, the enhanced LOFAR spectrum is used as the input of a convolutional neural network (CNN), and a LOFAR-based CNN (LOFAR-CNN) for online recognition is developed. Taking advantage of the powerful feature-extraction capability of CNNs, the proposed LOFAR-CNN further improves recognition accuracy. Finally, extensive simulation results demonstrate that the LOFAR-CNN network achieves a recognition accuracy of 95.22%, outperforming state-of-the-art methods.
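As a rough illustration of the online-recognition stage, the sketch below shows what a small CNN over an enhanced LOFAR spectrum could look like in PyTorch; the layer sizes, input resolution, and class count are assumptions rather than the architecture reported here.

```python
import torch
import torch.nn as nn

class LofarCNN(nn.Module):
    """Toy CNN taking a single-channel LOFAR spectrum (time x frequency) as input."""
    def __init__(self, num_classes=4):  # the number of target classes is assumed
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# One enhanced LOFAR spectrum: batch of 1, 1 channel, 128 time frames x 128 frequency bins.
spectrum = torch.randn(1, 1, 128, 128)
logits = LofarCNN()(spectrum)
print(logits.shape)  # torch.Size([1, 4])
```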


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Li Liu ◽  
Yunfeng Ji ◽  
Yun Gao ◽  
Zhenyu Ping ◽  
Liang Kuang ◽  
...  

Traffic accidents are easily caused by fatigued driving. If the fatigue state of the driver can be identified in time and a corresponding early warning provided, the occurrence of traffic accidents could be avoided to a large extent. At present, research on recognizing fatigued driving states focuses mostly on recognition accuracy. Fatigue is typically recognized by combining different features, such as facial expressions, electroencephalogram (EEG) signals, yawning, and the percentage of eyelid closure over the pupil over time (PERCLOS). Combining these features increases the recognition time and lacks real-time performance. In addition, some features introduce error into the recognition result, such as frequent yawning at the onset of a cold or frequent blinking with dry eyes. To maintain recognition accuracy while improving the practical feasibility and real-time performance of fatigued-driving-state recognition, a fast support vector machine (FSVM) algorithm based on EEGs and electrooculograms (EOGs) is proposed. First, the collected EEG and EOG modal data are preprocessed. Second, multiple features are extracted from the preprocessed EEGs and EOGs. Finally, FSVM is used to classify the feature data and obtain the recognition result for the fatigue state. Based on the recognition results, this paper designs a fatigue-driving early warning system based on Internet of Things (IoT) technology. When the driver shows symptoms of fatigue, the system not only sends a warning signal to the driver but also, through IoT technology, informs other nearby vehicles using the system and notifies the operations back end.
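A minimal sketch of this recognition pipeline (signal preprocessing omitted) might look like the following, with simple EEG band-power and EOG statistics as features and an ordinary linear-kernel SVM standing in for the FSVM; the sampling rate, feature choices, and toy data are assumptions.

```python
import numpy as np
from scipy.signal import welch
from sklearn.svm import SVC

FS = 256  # assumed sampling rate in Hz

def band_power(signal, lo, hi):
    """Average power of a 1-D signal in the [lo, hi] Hz band (Welch estimate)."""
    freqs, psd = welch(signal, fs=FS, nperseg=FS)
    mask = (freqs >= lo) & (freqs <= hi)
    return psd[mask].mean()

def features(eeg, eog):
    # EEG band powers (theta/alpha/beta) plus simple EOG statistics.
    return np.array([
        band_power(eeg, 4, 8), band_power(eeg, 8, 13), band_power(eeg, 13, 30),
        eog.std(), np.abs(np.diff(eog)).mean(),
    ])

# Toy training data: 20 labelled 5-second segments (0 = alert, 1 = fatigued).
rng = np.random.default_rng(0)
X = np.stack([features(rng.standard_normal(FS * 5), rng.standard_normal(FS * 5))
              for _ in range(20)])
y = rng.integers(0, 2, size=20)

clf = SVC(kernel="linear").fit(X, y)   # linear SVM standing in for the FSVM
print(clf.predict(X[:3]))
```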


2020 ◽  
Vol 17 (3(Suppl.)) ◽  
pp. 1019
Author(s):  
Bassel Alkhatib ◽  
Mohammad Madian Waleed Kamal Eddin

Speaker identification is one of the fundamental problems in speech processing and voice modeling. Its applications include authentication in critical security systems, where the accuracy of the selection matters. Large-scale voice recognition applications are a major challenge: quick search in the speaker database requires fast, modern techniques and relies on artificial intelligence to achieve the desired results from the system. Many efforts have been made toward this through the establishment of variable-based systems and the development of new methodologies for speaker identification. Speaker identification is the process of recognizing who is speaking using characteristics extracted from the speech waveform, such as pitch, tone, and frequency. Speaker models are created and saved in the system environment and used to verify the identity of people accessing the system, which allows access to various services that are controlled by voice. Speaker identification involves two main parts: the first is feature extraction and the second is feature matching.
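A hedged sketch of these two parts, using time-averaged MFCCs as the extracted features and cosine similarity as the matching step, is shown below; the file paths and speaker names are hypothetical, and a production system would use far richer models.

```python
import numpy as np
import librosa

def voice_embedding(path, n_mfcc=13):
    """Crude speaker model: time-averaged MFCCs of one utterance."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Enrollment: one reference model per known speaker (paths and names are hypothetical).
enrolled = {name: voice_embedding(f"enroll/{name}.wav") for name in ["alice", "bob"]}

# Identification: match an unknown utterance against every enrolled model.
probe = voice_embedding("probe/unknown.wav")
scores = {name: cosine_similarity(probe, ref) for name, ref in enrolled.items()}
print(max(scores, key=scores.get), scores)
```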


Human Activity Identification (HAI) in videos is one of the trendiest research fields in computer vision. Among various HAI techniques, Joints-pooled 3D-Deep convolutional Descriptors (JDD) have achieved effective performance by learning the body joints and capturing the spatiotemporal characteristics concurrently. However, the time needed to estimate the locations of body joints on large-scale datasets and the computational cost of the skeleton-estimation algorithm were high, and the recognition accuracy of traditional approaches needs to be improved by considering body joints and trajectory points together. Therefore, the key goal of this work is to improve recognition accuracy using optical flow integrated with a two-stream bilinear model, namely Joints and Trajectory-pooled 3D-Deep convolutional Descriptors (JTDD). In this model, optical-flow/trajectory points between video frames are also extracted at the body joint positions as input to the proposed JTDD. For this purpose, two streams of a Convolutional 3D network (C3D) are combined through a bilinear product to extract features, generate joint descriptors for video sequences, and capture spatiotemporal features. The whole network is then trained end-to-end based on the two-stream bilinear C3D model to obtain the video descriptors. Further, these video descriptors are classified by a linear Support Vector Machine (SVM) to recognize human activities. Based on both body joints and trajectory points, action recognition is achieved efficiently. Finally, the recognition accuracy of the JTDD model is compared with that of the JDD model.
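The two-stream bilinear fusion can be sketched as follows in PyTorch, with tiny stand-in streams for the joint and trajectory/optical-flow branches; all channel counts, feature dimensions, and the class count are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TwoStreamBilinear(nn.Module):
    """Toy two-stream fusion: the outer (bilinear) product of two clip-level feature vectors."""
    def __init__(self, feat_dim=64, num_classes=10):  # dimensions are assumptions
        super().__init__()
        # Stand-ins for the two C3D streams (joint stream and trajectory/optical-flow stream).
        self.stream_joints = nn.Sequential(
            nn.Conv3d(3, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(8, feat_dim))
        self.stream_flow = nn.Sequential(
            nn.Conv3d(2, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(8, feat_dim))
        self.classifier = nn.Linear(feat_dim * feat_dim, num_classes)

    def forward(self, rgb_clip, flow_clip):
        a = self.stream_joints(rgb_clip)    # (B, feat_dim)
        b = self.stream_flow(flow_clip)     # (B, feat_dim)
        bilinear = torch.einsum("bi,bj->bij", a, b).flatten(1)  # outer product per sample
        return self.classifier(bilinear)

# One clip: batch 1, RGB (3 channels) and optical flow (2 channels), 8 frames of 32x32.
model = TwoStreamBilinear()
out = model(torch.randn(1, 3, 8, 32, 32), torch.randn(1, 2, 8, 32, 32))
print(out.shape)  # torch.Size([1, 10])
```

In the paper's setting, the flattened bilinear descriptor would be fed to a linear SVM rather than the small linear classifier used here.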


Author(s):  
Song Li ◽  
Mustafa Ozkan Yerebakan ◽  
Yue Luo ◽  
Ben Amaba ◽  
William Swope ◽  
...  

Voice recognition has become an integral part of our lives, commonly used in call centers and as part of virtual assistants. However, voice recognition is increasingly applied to more industrial uses. Each of these use cases has unique characteristics that may impact the effectiveness of voice recognition, which could affect industrial productivity, performance, or even safety. One of the most prominent among them is the unique background noise that dominates each industry; the presence of different machinery and different work layouts are the primary contributors. Another important characteristic is the type of communication present in these settings: daily communication often involves longer sentences uttered under relatively silent conditions, whereas communication in industrial settings is often short and conducted in loud conditions. In this study, we demonstrated the importance of taking these two elements into account by comparing the performance of two voice recognition algorithms under several background-noise conditions: a regular Convolutional Neural Network (CNN) based voice recognition algorithm and an Automatic Speech Recognition (ASR) based model with a denoising module. Our results indicate a significant performance drop between the typical background noise (white noise) and the rest of the background noises. Also, our custom ASR model with the denoising module outperformed the CNN-based model, with an overall performance increase of 14-35% across all background noises. Both results demonstrate that specialized voice recognition algorithms need to be developed for these environments before they can be reliably deployed as control mechanisms.
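One way to build such noisy test conditions is to mix recorded background noise into clean speech at a controlled signal-to-noise ratio, as in the NumPy sketch below; the signals and SNR levels are placeholders, not the study's actual data.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals `snr_db`, then mix."""
    noise = np.resize(noise, speech.shape)            # loop/trim noise to the speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

# Toy signals standing in for a short utterance and a machinery-noise recording.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)   # 1 s tone at 16 kHz
noise = rng.standard_normal(8000)

for snr in (20, 10, 0, -5):
    noisy = mix_at_snr(speech, noise, snr)
    print(f"SNR {snr:>3} dB -> mixed RMS {np.sqrt(np.mean(noisy ** 2)):.3f}")
```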


Electronics ◽  
2020 ◽  
Vol 9 (12) ◽  
pp. 2056
Author(s):  
Junjie Wu ◽  
Jianfeng Xu ◽  
Deyu Lin ◽  
Min Tu

The recognition accuracy of micro-expressions in the field of facial expressions is still understudied, as current research methods mainly focus on feature extraction and classification. Based on optical flow and decision theory, we propose a novel micro-expression recognition method that can filter out low-quality micro-expression video clips. Governed by preset thresholds, we develop two optical-flow filtering mechanisms: one based on two-branch decisions (OFF2BD) and the other based on three-way decisions (OFF3WD). OFF2BD uses classical binary logic to classify images, dividing them into a positive or a negative domain for further filtering. Unlike OFF2BD, OFF3WD adds a boundary domain that defers judgment of the motion quality of the images. In this way, video clips with a low degree of morphological change can be eliminated, directly improving the quality of micro-expression features and the recognition rate. From the experimental results, we verify recognition accuracies of 61.57% and 65.41% on the CASME II and SMIC datasets, respectively. Comparative analysis shows that the scheme can effectively improve recognition performance.
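A possible reading of the three-way filtering step is sketched below using OpenCV's Farneback optical flow: a clip's mean flow magnitude is compared against two preset thresholds to accept it, reject it, or defer the decision. The threshold values and toy frames are assumptions, not the paper's settings.

```python
import cv2
import numpy as np

def mean_flow_magnitude(frames):
    """Average dense optical-flow magnitude over consecutive grayscale frames."""
    mags = []
    for prev, nxt in zip(frames, frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=2).mean())
    return float(np.mean(mags))

def three_way_filter(frames, low=0.05, high=0.20):  # thresholds are assumed
    """OFF3WD-style decision: accept, reject, or defer a clip by its motion quality."""
    m = mean_flow_magnitude(frames)
    if m >= high:
        return "positive"   # enough facial motion, keep the clip for recognition
    if m <= low:
        return "negative"   # too little motion, discard the clip
    return "boundary"       # defer the decision (re-examine or use extra evidence)

# Toy clip: 5 random grayscale frames standing in for a micro-expression sample.
clip = [np.random.randint(0, 256, (64, 64), dtype=np.uint8) for _ in range(5)]
print(three_way_filter(clip))
```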


2020 ◽  
Vol 19 (6) ◽  
pp. 2075-2090 ◽  
Author(s):  
Hao Cheng ◽  
Furui Wang ◽  
Linsheng Huo ◽  
Gangbing Song

Preventing and removing deposits in pipelines is of great importance for ensuring pipeline operation. Selecting a suitable removal time based on the composition and mass of the deposits not only reduces cost but also improves efficiency. In this article, we develop a new non-destructive approach using the percussion method and voice recognition with a support vector machine to detect sandy deposits in a steel pipeline. In particular, as the mass of sandy deposits in the pipeline changes, the impact-induced sound signals will differ. A commonly used voice recognition feature, Mel-Frequency Cepstrum Coefficients (MFCCs), which represent the result of a cosine transform of the real logarithm of the short-term energy spectrum on a Mel-frequency scale, is adopted in this research, and MFCCs are extracted from the obtained sound signals. A support vector machine model was employed to identify sandy deposits with different mass values by classifying the energy summation and the MFCCs. In addition, the classification accuracies of the energy summation and the MFCCs are compared. The experimental results demonstrated that MFCCs perform better in pipeline-deposit detection and have great potential in acoustic recognition for structural health monitoring. The proposed MFCC-based pipeline-deposit monitoring model can estimate the deposits in the pipeline with high accuracy. Moreover, compared with current non-destructive deposit-detection approaches, the percussion method is easy to implement. With the rapid development of artificial intelligence and acoustic recognition, the proposed method can achieve higher accuracy and higher speed in the detection of pipeline deposits and has great application potential. In addition, the proposed percussion method can enable robot-based inspection for large-scale implementation.
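A minimal sketch of the feature-plus-classifier idea, assuming hypothetical recordings of taps on pipe sections with different deposit masses, could look like this; the file paths, labels, and feature choices are illustrative only.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def percussion_features(path, n_mfcc=13):
    """Energy summation plus time-averaged MFCCs of one impact-induced sound recording."""
    y, sr = librosa.load(path, sr=None)
    energy_sum = float(np.sum(y ** 2))
    mfcc_mean = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
    return np.concatenate([[energy_sum], mfcc_mean])

# Hypothetical labelled recordings: each class is a different deposit mass (0 g, 50 g, 100 g).
paths = [f"taps/mass{m}_{i}.wav" for m in (0, 50, 100) for i in range(10)]
labels = [m for m in (0, 50, 100) for _ in range(10)]

X = np.stack([percussion_features(p) for p in paths])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:3]))
```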


2011 ◽  
Vol 188 ◽  
pp. 629-635
Author(s):  
Xia Yue ◽  
Chun Liang Zhang ◽  
Jian Li ◽  
H.Y. Zhu

A hybrid support vector machine (SVM) and hidden Markov model (HMM) was introduced into the fault diagnosis of pumps. The model has two layers: the first layer uses an HMM for a preliminary classification that yields the set of possible faults; the second layer uses this information to activate the corresponding SVMs to improve recognition accuracy. The structure of this hybrid model is clear and feasible, and its good scalability gives it the potential for large-scale multiclass application in fault diagnosis. Recognition experiments on 26 statuses of the ZLH600-2 pump showed that the model's recognition capability is sound in multiclass problems. The recognition rate for one bearing eccentricity increased from the SVM's 84.42% to 89.61%, while the average recognition rate of the hybrid model reached 95.05%. Although some of the goals set during model construction were not fully realized, the model still performs well in practical applications.
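The two-layer idea can be sketched as follows, with one GaussianHMM per status for coarse screening and a multi-class SVM deciding among the surviving candidates; the data, feature dimensions, and the way the SVM is restricted to the candidates are assumptions, not the paper's exact construction.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM
from sklearn.svm import SVC

rng = np.random.default_rng(0)
N_STATUS, N_FEAT = 4, 6          # small stand-in for the 26 pump statuses

# Toy vibration-feature sequences per status (real data would be measured features).
train = {s: rng.standard_normal((200, N_FEAT)) + s for s in range(N_STATUS)}

# Layer 1: one HMM per status; its log-likelihood gives a coarse candidate screening.
hmms = {s: GaussianHMM(n_components=3, n_iter=20).fit(seq) for s, seq in train.items()}

# Layer 2: a multi-class SVM over per-sample features for the fine decision.
X = np.vstack(list(train.values()))
y = np.repeat(np.arange(N_STATUS), 200)
svm = SVC(kernel="rbf", decision_function_shape="ovr").fit(X, y)

def diagnose(seq, top_k=2):
    """Screen statuses with the HMMs, then let the SVM decide among the top-k candidates."""
    loglik = {s: m.score(seq) for s, m in hmms.items()}
    candidates = sorted(loglik, key=loglik.get, reverse=True)[:top_k]
    scores = svm.decision_function(seq).mean(axis=0)   # per-class scores averaged over the sequence
    return max(candidates, key=lambda s: scores[s])

test_seq = rng.standard_normal((50, N_FEAT)) + 2
print("diagnosed status:", diagnose(test_seq))
```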


Perception ◽  
1983 ◽  
Vol 12 (2) ◽  
pp. 223-226 ◽  
Author(s):  
Ray Bull ◽  
Harriet Rathborn ◽  
Brian R Clifford

A research programme has been carried out that concerns the accuracy with which listeners can identify a speaker heard once before. The present study examined the voice-recognition abilities of blind listeners, and it was found that they could more accurately select target voices from the test arrays than could sighted people. However, the degree of blindness, the age at onset of blindness, and the number of years of blindness all failed to relate to voice-recognition accuracy.


Author(s):  
XINHUA FENG ◽  
XIAOQING DING ◽  
YOUSHOU WU ◽  
PATRICK S. P. WANG

Classifier combination is an effective method to improve the recognition accuracy of a biometric system. It has been applied to many practical biometric systems and achieved excellent performance. However, there is little literature involving theoretical analysis on the effectiveness of classifier combination. In this paper, we investigate classifiers combined with the max and min rules. In particular, we compute the recognition performance of each combined classifier and illustrate the condition in which the combined classifier outperforms the original unimodal classifier. We focus our study on personal verification, where the input pattern is classified into one of two categories, genuine or impostor. For simplicity, we further assume that the matching score produced by the original classifier follows a normal distribution and that the outputs of different classifiers are independent and identically distributed. Randomly generated data are employed to test our conclusion. The influence of finite samples is explored at the same time. Moreover, an iris recognition system, which adopts multiple snapshots to identify a subject, is introduced as a practical application of the above discussions.
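The setting can be illustrated with a small Monte Carlo simulation: two independent, identically distributed classifiers whose genuine and impostor scores are normally distributed, combined with the max and min rules and compared with a single classifier at a fixed threshold. The score distributions and threshold below are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000

def error_rate(genuine, impostor, threshold):
    """Average of false rejections (genuine below threshold) and false acceptances."""
    return (np.mean(genuine < threshold) + np.mean(impostor >= threshold)) / 2

# Two independent, identically distributed classifiers (assumed score distributions):
# genuine scores ~ N(1, 1), impostor scores ~ N(0, 1).
gen = rng.normal(1.0, 1.0, size=(2, N))
imp = rng.normal(0.0, 1.0, size=(2, N))

threshold = 0.5
single = error_rate(gen[0], imp[0], threshold)
max_rule = error_rate(gen.max(axis=0), imp.max(axis=0), threshold)  # max combination rule
min_rule = error_rate(gen.min(axis=0), imp.min(axis=0), threshold)  # min combination rule

print(f"single: {single:.4f}  max rule: {max_rule:.4f}  min rule: {min_rule:.4f}")
```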

