1 Department of Electronic Systems, The Faculty of Engineering and Science, Aalborg University, VBN
2 CTIF - Section, The Faculty of Engineering and Science, Aalborg University, VBN
3 Information Technology, 0.8km Markopoulo Av., Peania 19002
An audio-visual voice activity detector that uses sensors positioned at a distance from the speaker is presented. Its constituent unimodal detectors model the temporal variation of audio and visual features using Hidden Markov Models; their outcomes are fused using a post-decision scheme. The Mel-Frequency Cepstral Coefficients and the vertical mouth opening are the chosen audio and visual features respectively, both augmented with their first-order derivatives. The proposed system is assessed using far-field recordings from four different speakers under various levels of additive white Gaussian noise, and achieves performance superior to that of either unimodal component alone. © 2009 IEEE.
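As a rough illustration of two steps named in the abstract, the sketch below appends first-order derivatives to a sequence of feature vectors and fuses per-frame speech/non-speech decisions from the two modalities. The abstract does not give the derivative approximation or the fusion rule, so the central difference and the AND/OR rules used here are assumptions, and the function names `add_deltas` and `fuse_decisions` are hypothetical. The Hidden Markov Model stage that would produce the unimodal decisions is not shown.

```python
from typing import List, Sequence


def add_deltas(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Append a first-order derivative estimate to every feature vector.

    Assumption: the derivative is approximated by a central difference,
    (x[t+1] - x[t-1]) / 2, with the edge frames repeated at the boundaries.
    """
    n = len(frames)
    augmented = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        delta = [(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)]
        augmented.append(list(frames[t]) + delta)
    return augmented


def fuse_decisions(audio: Sequence[bool], video: Sequence[bool],
                   rule: str = "and") -> List[bool]:
    """Combine per-frame decisions of the two unimodal detectors.

    Assumption: fusion is a simple logical rule applied after each detector
    has already decided speech (True) or non-speech (False) for a frame.
    """
    if len(audio) != len(video):
        raise ValueError("decision sequences must have the same length")
    if rule == "and":
        return [a and v for a, v in zip(audio, video)]
    if rule == "or":
        return [a or v for a, v in zip(audio, video)]
    raise ValueError("rule must be 'and' or 'or'")
```

With the AND rule a frame is labelled speech only when both detectors agree, which tends to reduce false alarms; the OR rule tends to reduce missed speech instead. Which trade-off suits far-field, noisy recordings is not something this sketch settles.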
DSP 2009: 16th International Conference on Digital Signal Processing, Proceedings, 2009, p. 1-5
DSP 2009: 16th International Conference on Digital Signal Processing