Friday, October 12, 2012: 8:00 PM
6C/6E (WSCC)
Audio and video information surrounds us in every aspect of our lives through laptops, cameras, and cell phones, making the pursuit of new ways to enhance human interaction with these devices inevitable. These two sources of information can be studied jointly to create new technology for many areas, including speech recognition. One novel way to improve speech recognition is to introduce the video signal of the speaker and study how this information can be utilized in a speech recognition system. Several challenges arise, such as recognizing and tracking specific regions of the face and developing a method to recognize speech from images. In this work, a mouth-region tracking system and a visual speech recognition system are developed using the CUAVE database to recognize a series of digits from 36 speakers. The tracking system is based on an eye-detection procedure using the YCbCr color transformation and two mapping schemes. Detection is refined with a Kalman filter with outlier removal and by tracking each eye independently. For the visual speech recognition system, DCT and DFT features are extracted from the mouth images and used to train Gaussian mixture models. Visual speech recognition achieved an average word accuracy of 54.93%.
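The DCT feature extraction and Gaussian-mixture modeling steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the 6x6 low-frequency coefficient block, the diagonal-covariance per-digit GMMs, and the max-log-likelihood decision rule are all assumptions chosen for a simple, runnable example.

```python
import numpy as np
from scipy.fftpack import dct
from sklearn.mixture import GaussianMixture

def dct_features(mouth_roi, k=6):
    """2-D DCT of a grayscale mouth image; keep the k x k block of
    low-frequency coefficients as the feature vector (k is illustrative)."""
    c = dct(dct(mouth_roi, axis=0, norm='ortho'), axis=1, norm='ortho')
    return c[:k, :k].ravel()

def train_gmms(features_by_digit, n_components=2):
    """Fit one GMM per digit class on its stacked frame features."""
    models = {}
    for digit, frames in features_by_digit.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='diag',
                              random_state=0)
        gmm.fit(np.vstack(frames))
        models[digit] = gmm
    return models

def classify(models, frames):
    """Label a test utterance with the digit whose GMM gives the
    highest average log-likelihood over the utterance's frames."""
    X = np.vstack(frames)
    return max(models, key=lambda d: models[d].score(X))
```

A DFT-based feature could be substituted for `dct_features` in the same pipeline; the per-class likelihood comparison is unchanged.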