Abstract

We are all familiar with audio speech recognition technology for interfacing with smartphones and in-car computers. However, technology that can interpret our speech without audio poses a far greater challenge for scientists. Audio speech recognition (ASR) works only where there is little or no background noise and where speech is clearly enunciated. Other technologies that lip-read from visual signals, or that combine lip-reading with degraded audio input, are under development. For situations where a person cannot speak, or where the face is not fully visible, silent speech recognition, which decodes speech from muscle movements or brain signals, is also under development.

Associate Professor Takeshi Saitoh's laboratory at the Kyushu Institute of Technology is at the forefront of visual speech recognition (VSR) and is collaborating with researchers worldwide to develop a range of silent speech recognition technologies. Saitoh, whose small team of researchers and students is supported by the Japan Society for the Promotion of Science (JSPS), says: 'The aim of our work is to achieve smooth and free communication in real time, without the need for audible speech.' The laboratory's VSR prototype is already performing at a high level.

There are many reasons why scientists are working on speech technology that does not rely on audio. Saitoh points out: 'With an ageing population, more people will suffer from speech or hearing disabilities and would benefit from a means to communicate freely. This would vastly improve their quality of life and create employment opportunities.' Intelligent machines controlled by human-machine interfaces are also expected to become increasingly common in our lives, and non-audio speech recognition technology will be useful for interacting with smartphones, driverless cars, surveillance systems and smart appliances.

VSR uses a modified camera, combined with image processing and pattern recognition, to convert the moving shapes made by the mouth into meaningful language. Earlier VSR technologies matched the shape of a still mouth to vowel sounds, and others correlated mouth shapes with a key input. However, these do not provide audio output in real time, so they cannot support a smooth conversation. It is also vital that VSR be easy to use and applicable to a range of situations, such as people bedridden in a supine position, a degree of camera movement, or a face viewed in profile rather than full-frontal. A reliable system should also be user-independent, working for any skin colour and any shape of face, and in spite of head movement.
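The capture-extract-classify pipeline described above can be made concrete with a short sketch. The code below is a hypothetical illustration, not Saitoh's method: it assumes OpenCV's stock Haar face detector, approximates the mouth region as the lower third of the detected face box, and uses a toy nearest-template matcher in place of a real sequence model.

# Minimal, hypothetical VSR-style pipeline: capture frames, isolate the
# mouth region, extract a simple per-frame feature vector, and classify
# the sequence. All names and parameters here are illustrative.
import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def mouth_patch(frame):
    """Return a normalized grayscale mouth patch, or None if no face found."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    # Crude assumption: the mouth sits in the lower third of the face box.
    mouth = gray[y + 2 * h // 3 : y + h, x : x + w]
    return cv2.resize(mouth, (32, 16)).astype(np.float32).ravel() / 255.0

def record_sequence(num_frames=30):
    """Collect mouth-shape features from the default camera."""
    cap = cv2.VideoCapture(0)
    features = []
    while len(features) < num_frames:
        ok, frame = cap.read()
        if not ok:
            break
        patch = mouth_patch(frame)
        if patch is not None:
            features.append(patch)
    cap.release()
    return np.array(features)

def classify(sequence, templates):
    """Toy nearest-template matcher standing in for a real sequence model."""
    avg = sequence.mean(axis=0)  # crude summary of the mouth-shape sequence
    return min(templates, key=lambda word: np.linalg.norm(avg - templates[word]))

A real system would replace the averaging and template matching with a model of mouth-shape dynamics over time; the sketch only shows why real-time operation, robust mouth detection under head movement, and user-independent features are the hard parts the abstract identifies.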
