Estimation and tracking of fundamental, 2nd and 3d harmonic frequencies for spectrogram normalization in speech recognition

Author(s):  
K. Fujimoto ◽  
N. Hamada ◽  
W. Kasprzak

A stable and accurate estimate of the fundamental frequency (pitch, F0) is an important requirement in speech and music signal analysis, in tasks such as automatic speech recognition and extraction of a target signal in a noisy environment. In this paper, we propose a pitch-related spectrogram normalization scheme to improve the speaker independence of standard speech features. Because the scheme requires a very accurate estimate of the fundamental frequency, we develop a non-parametric recursive estimation method for F0 and its 2nd and 3rd harmonic frequencies under noisy conditions. The proposed method differs from typical Kalman and particle filter methods in that it uses no particular sum-of-sinusoids model. In addition, F0 and its lower harmonics are estimated using a novel likelihood function. Experiments under various noise levels show the proposed method to be more accurate than conventional methods. The spectrogram normalization scheme maps the real harmonic structure to a normalized structure. Results obtained for voiced phonemes show an increase in the stability of the standard speech features: the average within-phoneme distance of the MFCC features for voiced phonemes can be decreased by several percent.
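For orientation only, the following sketch shows what estimating F0 and its 2nd and 3rd harmonics means in practice. It uses a plain autocorrelation peak search, which is a conventional baseline and not the recursive likelihood-based estimator proposed in the paper. The function name `autocorr_f0`, the search range of 80 to 400 Hz, and the synthetic 200 Hz test tone are all assumptions made for illustration.

```python
import math

def autocorr_f0(frame, fs, f_min=80.0, f_max=400.0):
    """Estimate F0 (Hz) as fs / lag for the lag with the largest autocorrelation.

    Conventional baseline, NOT the paper's method. The search range is an assumption.
    """
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation of the frame at this lag.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return fs / best_lag

# Synthetic, noise-free test tone (hypothetical data): 200 Hz sampled at 8 kHz.
fs = 8000
frame = [math.sin(2 * math.pi * 200 * n / fs) for n in range(800)]
f0 = autocorr_f0(frame, fs)
# The 2nd and 3rd harmonics sit at integer multiples of F0.
harmonics = [f0, 2 * f0, 3 * f0]
```

A baseline of this kind degrades quickly in noise, which is the situation the abstract says the proposed estimator is designed for.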

Author(s):  
Alexandru-Lucian Georgescu ◽  
Alessandro Pappalardo ◽  
Horia Cucu ◽  
Michaela Blott

The last decade brought significant advances in automatic speech recognition (ASR) thanks to the evolution of deep learning methods. ASR systems evolved from pipeline-based systems, which modeled hand-crafted speech features with probabilistic frameworks and generated phone posteriors, to end-to-end (E2E) systems, which translate the raw waveform directly into words using one deep neural network (DNN). Transcription accuracy increased greatly, leading to ASR technology being integrated into many commercial applications. However, few of the existing ASR technologies are suitable for integration in embedded applications, because of their hard constraints on computing power and memory usage. This overview paper serves as a guided tour through the recent literature on speech recognition and compares the most popular ASR implementations. The comparison emphasizes the trade-off between ASR performance and hardware requirements, to help decision makers choose the system that best fits their embedded application. To the best of our knowledge, this is the first study to provide this kind of trade-off analysis for state-of-the-art ASR systems.


2021 ◽  
Vol 13 (7) ◽  
pp. 168781402110277
Author(s):  
Yankai Hou ◽  
Zhaosheng Zhang ◽  
Peng Liu ◽  
Chunbao Song ◽  
Zhenpo Wang

Accurate estimation of the degree of battery aging is essential to ensure safe operation of electric vehicles. In this paper, using real-world vehicles and their operational data, a battery aging estimation method is proposed based on a dual-polarization equivalent circuit (DPEC) model and multiple data-driven models. The DPEC model and the forgetting factor recursive least-squares method are used to determine the battery system’s ohmic internal resistance, with outliers being filtered using boxplots. Furthermore, eight common data-driven models are used to describe the relationship between battery degradation and the factors influencing this degradation, and these models are analyzed and compared in terms of both estimation accuracy and computational requirements. The results show that the gradient descent tree regression, XGBoost regression, and LightGBM regression models are more accurate than the other methods, with root mean square errors of less than 6.9 mΩ. The AdaBoost and random forest regression models are regarded as alternative groups because of their relative instability. The linear regression, support vector machine regression, and k-nearest neighbor regression models are not recommended because of poor accuracy or excessively high computational requirements. This work can serve as a reference for subsequent battery degradation studies based on real-time operational data.
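The two preprocessing steps named in the abstract, forgetting factor recursive least squares and boxplot outlier filtering, can be sketched as follows. This is a minimal illustration under strong assumptions: the DPEC model is replaced by a scalar relation (voltage drop = resistance × current), the forgetting factor 0.98 and the initial covariance are arbitrary, and the boxplot rule is taken to be the usual 1.5 × IQR fence. The function names and sample data are hypothetical.

```python
import statistics

def rls_forgetting(xs, ys, lam=0.98, theta0=0.0, p0=1000.0):
    """Scalar recursive least squares with forgetting factor lam.

    Fits y = theta * x sample by sample. Simplified stand-in for the
    DPEC-based identification; lam, theta0 and p0 are assumed values.
    """
    theta, p = theta0, p0
    history = []
    for x, y in zip(xs, ys):
        k = p * x / (lam + x * p * x)          # gain
        theta = theta + k * (y - x * theta)    # correct by prediction error
        p = (p - k * x * p) / lam              # covariance update with forgetting
        history.append(theta)
    return history

def boxplot_filter(values, whisker=1.5):
    """Keep values inside the boxplot whiskers (quartiles +/- whisker * IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical noise-free data: currents in A, voltage drops for R = 0.05 ohm.
currents = [10.0, 20.0, 15.0, 5.0, 25.0]
drops = [0.05 * i for i in currents]
estimates = rls_forgetting(currents, drops)

# Hypothetical resistance readings with one gross outlier.
cleaned = boxplot_filter([1, 2, 3, 4, 5, 6, 7, 8, 100])
```

A forgetting factor below 1 discounts older samples, which lets the estimate follow a resistance that drifts slowly as the battery ages.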


Sensors ◽  
2018 ◽  
Vol 18 (9) ◽  
pp. 3062 ◽  
Author(s):  
Jinwoo Choi ◽  
Jeonghong Park ◽  
Yoongeon Lee ◽  
Jongdae Jung ◽  
Hyun-Taek Choi

Acoustic source localization is used in many underwater applications. Acquiring an accurate directional angle for an acoustic source is crucial for source localization. To achieve this purpose, this paper presents a method for directional angle estimation of underwater acoustic sources using a marine vehicle. It is assumed that the vehicle is equipped with two hydrophones and that the acoustic source transmits a specific signal repeatedly. The proposed method provides a probabilistic model for time delay estimation. The probability is recursively updated by prediction and update steps. The prediction step performs a probability transition using the angular displacement of the marine vehicle. The predicted probability is updated using a generalized cross correlation function with a verification process using entropy measurement. The proposed method can provide a reliable and accurate estimation of the directional angles of underwater acoustic sources. Experimental results demonstrate good performance of the proposed probabilistic directional angle estimation method in both an inland water environment and a harbor environment.
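The link between a two-hydrophone time delay and a directional angle can be sketched as below. This is not the paper's probabilistic estimator: it uses a plain, unweighted cross-correlation peak instead of a generalized cross correlation with entropy verification, and it omits the recursive prediction and update steps. The far-field relation sin(angle) = c × delay / spacing, the nominal sound speed of 1500 m/s, the sign convention, and all names and signal values are assumptions for illustration.

```python
import math

def estimate_delay(sig_a, sig_b, max_lag):
    """Return the lag (in samples) of sig_b relative to sig_a that
    maximizes their cross-correlation. Positive means sig_b arrives later."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(len(sig_a)):
            j = i + lag
            if 0 <= j < len(sig_b):
                score += sig_a[i] * sig_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def delay_to_angle(lag, fs, spacing, c=1500.0):
    """Convert a delay in samples to a far-field arrival angle in degrees.

    c is an assumed nominal speed of sound in water (m/s).
    """
    s = c * (lag / fs) / spacing
    s = max(-1.0, min(1.0, s))  # guard against |s| slightly above 1
    return math.degrees(math.asin(s))

# Hypothetical pulse received 3 samples later on the second hydrophone.
sig_a = [0.0] * 32
sig_b = [0.0] * 32
for offset, amp in enumerate([1.0, 2.0, 1.0]):
    sig_a[10 + offset] = amp
    sig_b[13 + offset] = amp

lag = estimate_delay(sig_a, sig_b, max_lag=8)
angle = delay_to_angle(lag, fs=48000, spacing=0.1875)
```

A single correlation peak like this is easily corrupted by noise and multipath, which is the motivation the abstract gives for updating a probability over delays recursively rather than trusting one measurement.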


Author(s):  
Feng Bao ◽  
Waleed H. Abdulla

In computational auditory scene analysis, accurate estimation of the binary mask or ratio mask plays a key role in noise masking. An inaccurate estimate often leads to artifacts and temporal discontinuity in the synthesized speech. To overcome this problem, we propose a new ratio mask estimation method based on Wiener filtering in each Gammatone channel. In reconstructing the Wiener filter, we utilize the relationship between the speech and noise power spectra in each Gammatone channel to build the objective function for the convex optimization of speech power. To improve the accuracy of estimation, the estimated ratio mask is further modified based on its adjacent time–frequency units, and then smoothed by interpolating with the estimated binary masks. Objective tests, including signal-to-noise ratio improvement, spectral distortion and intelligibility, together with a subjective listening test, demonstrate the superiority of the proposed method over the reference methods.
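As a rough illustration of the quantities involved, a Wiener-type ratio mask and a binary mask for a set of time–frequency units can be computed as below. The sketch assumes the speech and noise powers are already known, whereas the paper obtains the speech power by convex optimization; the linear interpolation weight `alpha` and the 0 dB binary threshold are likewise assumptions, and the function names are hypothetical.

```python
def wiener_ratio_mask(speech_power, noise_power, eps=1e-12):
    """Wiener-type gain per time-frequency unit: S / (S + N)."""
    return [s / (s + n + eps) for s, n in zip(speech_power, noise_power)]

def binary_mask(speech_power, noise_power):
    """1 where speech power exceeds noise power (assumed 0 dB threshold), else 0."""
    return [1.0 if s > n else 0.0 for s, n in zip(speech_power, noise_power)]

def interpolate_masks(ratio, binary, alpha=0.5):
    """Blend the ratio mask with the binary mask; alpha is an assumed weight."""
    return [alpha * r + (1.0 - alpha) * b for r, b in zip(ratio, binary)]

# Hypothetical powers for three time-frequency units.
speech = [9.0, 1.0, 0.0]
noise = [1.0, 1.0, 4.0]

ratio = wiener_ratio_mask(speech, noise)
binary = binary_mask(speech, noise)
smoothed = interpolate_masks(ratio, binary)
```

The ratio mask attenuates each unit gradually, while the binary mask keeps or discards it outright; blending the two is one simple way to read the smoothing step described in the abstract.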


Energies ◽  
2021 ◽  
Vol 14 (22) ◽  
pp. 7559
Author(s):  
Lisha Li ◽  
Shuming Yuan ◽  
Yue Teng ◽  
Jing Shao

Although the development of China’s civil aviation and improvements in control capability have effectively strengthened operational safety and support capability, airlines are under operating-cost pressure because of rising aircraft fuel prices. With the development of optimized control methods in flight management systems, it has become increasingly challenging to cut flight fuel consumption by controlling the flight status of the aircraft. Therefore, airlines both at home and abroad mainly rely on accurate estimation of aircraft fuel to reduce fuel consumption and, in turn, carbon emissions. Airlines have to take various potential factors into consideration and load additional fuel to cope with possible adverse situations during the flight. This fuel for emergency use is called PBCF (Performance-Based Contingency Fuel). The existing PBCF forecasting method used by China Airlines is not accurate and fails to take into account various influencing factors. This paper aims to find a method that can predict PBCF more accurately than the existing methods for China Airlines. This paper takes China Eastern Airlines as an example. Experimental flight fuel data from China Eastern Airlines Co., Ltd. were collected to identify the relevant parameters affecting fuel consumption, and an LSTM neural network was then established from these parameters and the collected data. Finally, the established neural network model outputs the additional PBCF required by the airline under different influencing factors. The results show that all four models are suitable for accurate prediction of fuel consumption. The amount of data for the A319 is much larger than that for the A320 and A330, which leads to higher accuracy of the model trained on A319 data.
The study contributes to the calculation methods used in fuel-saving projects and helps practitioners learn about a particular fuel calculation method. It offers insights for practitioners seeking to achieve low carbon emissions and contributes to their progress towards a circular economy.


2018 ◽  
Vol 49 (6) ◽  
pp. 388-397
Author(s):  
François Prévost ◽  
Alexandre Lehmann

Cochlear implants restore hearing in deaf individuals, but speech perception remains challenging. Poor discrimination of spectral components is thought to account for limitations of speech recognition in cochlear implant users. We investigated how combined variations of spectral components along two orthogonal dimensions can maximize neural discrimination between two vowels, as measured by mismatch negativity. Adult cochlear implant users and matched normal-hearing listeners underwent electroencephalographic event-related potential recordings in an optimum-1 oddball paradigm. A standard /a/ vowel was delivered in an acoustic free field along with stimuli having a deviant fundamental frequency (+3 and +6 semitones), a deviant first formant making it an /i/ vowel, or a combined deviant fundamental frequency and first formant (+3 and +6 semitones /i/ vowels). Speech recognition was assessed with a word repetition task. An analysis of variance was performed on both the amplitude and the latency of the mismatch negativity elicited by each deviant vowel. The strength of correlations between these parameters of mismatch negativity and speech recognition as well as participants’ age was assessed. Amplitude of mismatch negativity was weaker in cochlear implant users but was maximized by variations of vowels’ first formant. Latency of mismatch negativity was later in cochlear implant users and was particularly extended by variations of the fundamental frequency. Speech recognition correlated with parameters of mismatch negativity elicited by the specific variation of the first formant. This nonlinear effect of acoustic parameters on neural discrimination of vowels has implications for implant processor programming and aural rehabilitation.


Author(s):  
Eunho Kang ◽  
Hyomoon Lee ◽  
Dongsu Kim ◽  
Jongho Yoon

Practical thermal bridge performance indicators (ITBs) of existing buildings may differ from theoretically calculated thermal bridge performance because of actual construction conditions, such as the effects of irregular shapes and aging. To fill this gap in a practical manner, a more realistic quantitative on-site evaluation of thermal bridges is needed to identify thermal behavior throughout exterior walls and thus improve the overall insulation performance of buildings. In this paper, a model of a thermal bridge performance indicator is developed based on an in-situ infrared thermography method, and a case study is then carried out to evaluate the thermal performance of an existing exterior wall using the developed model. The estimation method in this study uses a likelihood function with the Bayesian method so that the measured data are continuously reflected in the estimate. The coefficient of variation is then applied to analyze the time required to reach the assumed convergence. Results from three days of measurement show that the measured thermal bridge loses more heat than the non-thermal-bridge region, by a factor of 1.14. In addition, the results show that it takes about 40 hours for the coefficient of variation to reach 1%. Comparing the ITB estimated at a coefficient of variation of 1% (the 40-hour point) with the ITB estimated at the end of the experiment (the 72-hour point) gives a relative error of 0.9%.
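The idea of updating an estimate with each new measurement and stopping once the coefficient of variation falls below 1% can be sketched as follows. The abstract does not give the form of the likelihood or the exact definition of the coefficient of variation, so this sketch assumes a Gaussian likelihood with known noise variance, a Gaussian prior, and a coefficient of variation defined as posterior standard deviation divided by posterior mean. All names and numeric values are hypothetical.

```python
import math

def bayes_update(prior_mean, prior_var, measurement, noise_var):
    """Conjugate Gaussian update of a mean with known measurement noise variance."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
    post_mean = post_var * (prior_mean / prior_var + measurement / noise_var)
    return post_mean, post_var

def steps_to_converge(measurements, prior_mean, prior_var, noise_var, cv_target=0.01):
    """Return (step, estimate) at the first step where the posterior
    coefficient of variation (std / |mean|) is at or below cv_target.
    Returns (None, estimate) if the target is never reached."""
    mean, var = prior_mean, prior_var
    for step, z in enumerate(measurements, start=1):
        mean, var = bayes_update(mean, var, z, noise_var)
        if math.sqrt(var) / abs(mean) <= cv_target:
            return step, mean
    return None, mean

# Hypothetical constant readings of the indicator; prior and noise are assumed.
readings = [1.14] * 100
step, estimate = steps_to_converge(readings, prior_mean=1.0, prior_var=1.0, noise_var=0.01)
```

With a stopping rule of this kind, the measurement campaign can end as soon as further data no longer change the estimate appreciably, which is how the abstract motivates comparing the 40-hour and 72-hour estimates.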

