The intelligibility of speech depends on factors related to the auditory processes involved in sound perception as well as on the acoustic properties of the sound entering the ear. However, a clear understanding of speech perception in complex acoustic conditions and, in particular, a quantitative description of the auditory processes involved remains a major challenge in speech and hearing research. This thesis presents a computational model that attempts to predict the speech intelligibility obtained by normal-hearing listeners in various adverse conditions. The model combines the concept of modulation-frequency selectivity in the auditory processing of sound with a decision metric for intelligibility that is based on the signal-to-noise envelope power ratio (SNRenv). The proposed speech-based envelope power spectrum model (sEPSM) is demonstrated to account for the effects of stationary background noise, reverberation and noise reduction processing on speech intelligibility, indicating that the model is more general than traditional modeling approaches. Moreover, the model accounts for phase distortions when it includes a mechanism that evaluates the variation of envelope power across (audio) frequency. However, because the SNRenv is based on the long-term average envelope power, the model cannot account for the greater intelligibility typically observed in fluctuating noise compared to stationary noise. To overcome this limitation, a multi-resolution version of the sEPSM is presented in which the SNRenv is estimated in temporal segments with a modulation-filter-dependent duration. This multi-resolution approach effectively extends the applicability of the sEPSM to conditions with fluctuating interferers, while retaining its predictive power in conditions with noisy speech distorted by reverberation or spectral subtraction.
The relationship between the SNRenv-based decision metric and psychoacoustic speech intelligibility is further evaluated by generating stimuli with different SNRenv but the same overall power SNR. The corresponding psychoacoustic data generally support this relationship. However, the model is limited in conditions with manipulated clean speech, since it does not account for the accompanying effects of speech distortions on intelligibility. The value of the sEPSM is further considered in conditions with noisy speech transmitted through three commercially available mobile phones. The model successfully accounts for the performance across the phones in conditions with a stationary speech-shaped background noise, whereas deviations are observed in conditions with “Traffic” and “Pub” noise. Overall, the results of this thesis support the hypothesis that the SNRenv is a powerful objective metric for speech intelligibility prediction. Moreover, the findings suggest that the concept of modulation-frequency selective processing in the auditory system is crucial for human speech perception.
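As a rough illustration of the SNRenv idea referred to above, the following minimal sketch computes an envelope-domain signal-to-noise ratio for a single, already band-limited envelope. It assumes that the speech envelope power is estimated as the difference between the envelope power of the noisy mixture and that of the noise alone, and that envelope power is normalized by the squared mean of the envelope; the abstract does not spell out either detail. The function names, the `floor` value and the synthetic sinusoidal envelopes are hypothetical and chosen only for illustration. The sketch omits the audio and modulation filterbanks, the combination across channels and the conversion to percent correct, so it should not be read as the thesis implementation.

```python
import math


def normalized_envelope_power(envelope):
    """AC power of an envelope divided by its squared mean (DC) value."""
    mean = sum(envelope) / len(envelope)
    ac_power = sum((x - mean) ** 2 for x in envelope) / len(envelope)
    return ac_power / mean ** 2


def snr_env(mix_envelope, noise_envelope, floor=1e-3):
    """Envelope-domain SNR for one band (linear scale).

    Assumption: the speech envelope power is taken as mixture minus noise.
    The floor (hypothetical value) keeps the ratio finite and positive.
    """
    p_mix = normalized_envelope_power(mix_envelope)
    p_noise = max(normalized_envelope_power(noise_envelope), floor)
    p_speech = max(p_mix - p_noise, floor)
    return p_speech / p_noise


def sinusoidal_envelope(depth, rate_hz=4.0, fs=1000, duration_s=1.0):
    """Synthetic envelope 1 + depth * sin(2*pi*rate*t), used as toy input."""
    n = int(fs * duration_s)
    return [1.0 + depth * math.sin(2 * math.pi * rate_hz * k / fs)
            for k in range(n)]


# Toy input: strongly modulated mixture, weakly modulated noise.
mix = sinusoidal_envelope(depth=0.8)
noise = sinusoidal_envelope(depth=0.2)

snr = snr_env(mix, noise)
snr_db = 10 * math.log10(snr)
print(f"SNRenv = {snr:.2f} ({snr_db:.2f} dB)")
```

Because both toy envelopes have the same mean, they can be thought of as carrying comparable overall level while differing in modulation depth, which is the kind of contrast (different SNRenv at a fixed power SNR) that the stimuli described above are designed to probe.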
Main Research Area:
Contributions To Hearing Research
Technical University of Denmark, Department of Electrical Engineering, 2014