Abstract

This paper investigated the effects of varying noise levels and varying lighting levels on speech and gesture control command interfaces for aerobots. The aim was to determine the practical suitability of the multimodal combination of speech and visual gesture in human aerobotic interaction, by investigating the limits and feasibility of use of the individual components. To determine this, a custom multimodal speech and visual gesture interface was developed using the CMU (Carnegie Mellon University) Sphinx and OpenCV (Open Source Computer Vision) libraries, respectively. An experimental study was designed to measure the individual effects of each of the two main components, speech and gesture, and 37 participants were recruited for the experiment. The ambient noise level was varied from 55 dB to 85 dB. The ambient lighting level was varied from 10 lux to 1400 lux, under different lighting colour temperature mixtures of yellow (3500 K) and white (5500 K), and different backgrounds for capturing the finger gestures. The results of the experiment, which consisted of around 3108 speech utterance observations and 999 gesture quality observations, were presented and discussed. It was observed that speech recognition accuracy/success rate falls as noise levels rise, with 75 dB being the aerobot's practical application limit; beyond this, speech control interaction becomes very unreliable due to poor recognition. It was concluded that multi-word speech commands were more reliable and effective than single-word speech commands. In addition, some speech command words (e.g., land) were more noise-resistant than others (e.g., hover) at higher noise levels, due to their articulation. From the results of the gesture-lighting experiment, the effects of both lighting conditions and the environment background on the quality of gesture recognition were almost insignificant (less than 0.5%). The implication is that other factors, such as the gesture capture system design and technology (camera and computer hardware), the type of gesture being captured (upper body, whole body, hand, finger, or facial gestures), and the image processing technique (gesture classification algorithms), are more important in developing a successful gesture recognition system. Further work was suggested based on these findings, including using alternative ASR (Automatic Speech Recognition) models and developing more robust gesture recognition algorithms.
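As an illustration of the gesture half of such an interface, the sketch below shows a minimal OpenCV finger-gesture capture loop. The paper does not specify its classification algorithm, so the HSV skin-colour thresholds and convexity-defect finger counting used here are a common illustrative approach (assuming OpenCV 4 and a default webcam), not the authors' implementation.

```python
# Minimal finger-gesture capture sketch, assuming OpenCV 4 and a webcam.
# The HSV skin-colour range and defect-depth threshold are illustrative
# placeholders; real values depend on lighting and camera.
import cv2
import numpy as np

cap = cv2.VideoCapture(0)  # default camera
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # Segment skin-coloured pixels; this step is where ambient lighting
    # and background would be expected to matter most.
    mask = cv2.inRange(hsv, (0, 30, 60), (20, 150, 255))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        hand = max(contours, key=cv2.contourArea)  # assume largest blob is the hand
        hull = cv2.convexHull(hand, returnPoints=False)
        defects = cv2.convexityDefects(hand, hull)
        # Deep convexity defects approximate the valleys between extended fingers
        fingers = 0
        if defects is not None:
            fingers = sum(1 for d in defects[:, 0] if d[3] / 256.0 > 20) + 1
        cv2.putText(frame, f'fingers: {fingers}', (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    cv2.imshow('gesture', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
```

Notably, the hard-coded colour thresholds in such a pipeline are exactly where lighting level, colour temperature, and background would be expected to intrude, which is the sensitivity the experiment set out to measure.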

Highlights

  • This paper investigates the effects of varying noise levels and varying lighting conditions on practical speech and gesture control communication/interaction, within the context of an aerial robot application

  • The results of the experiment show that speech recognition accuracy/success rate falls as noise levels rise

Introduction

This paper investigates the effects of varying noise levels and varying lighting conditions on practical speech and gesture control communication/interaction, within the context of an aerial robot (aerobot) application. A regression characteristic model was developed for the custom CMU Sphinx automatic speech recogniser, showing speech control command performance across the ambient noise level range of 55 dB to 85 dB. The upper limit of the custom-developed CMU Sphinx based speech interface was determined to be 75 dB, with a 65% recognition rate; beyond this threshold, speech control was considered impractical due to the significantly high rate of control failure. This limit was well within the practical operating limit of the HoverCam UAV's propulsion noise level, but not that of the DJI Phantom 4 Pro. Some ways of improving this limit were suggested, such as investigating other speech recognisers built on a different model than the hidden Markov model (HMM) used by CMU Sphinx.
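For illustration, a minimal keyword-spotting command loop of the kind such an interface might use is sketched below, assuming the pocketsphinx-python LiveSpeech API and its bundled US English acoustic model; the keyphrase 'take off' and the detection threshold are hypothetical, not the authors' actual configuration.

```python
# Minimal sketch of a speech command loop with CMU Sphinx, assuming
# the pocketsphinx-python package (LiveSpeech) and a working microphone.
from pocketsphinx import LiveSpeech

# Keyword-spotting mode: disable the language model and listen for one
# multi-word command; the paper found multi-word commands more
# noise-resistant than single words. Keyphrase and threshold are
# illustrative placeholders.
speech = LiveSpeech(
    lm=False,
    keyphrase='take off',
    kws_threshold=1e-20,  # tune to trade missed detections against false alarms
)

for phrase in speech:  # blocks on the microphone and yields each detection
    print('Recognised command:', phrase)
    # a real interface would dispatch to the aerobot's flight controller here
```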
