This paper deals with a new and improved approach of Back-propagation learning neural network based likelihood ratio score fusion technique for audio-visual speaker Identification in various noisy environments. Different signal preprocessing and noise removing techniques have been used to process the speech utterance and LPC, LPCC, RCC, MFCC, ΔMFCC and ΔΔMFCC methods have been applied to extract the features from the audio signal. Active Shape Model has been used to extract the appearance and shape based facial features. To enhance the performance of the proposed system, appearance and shape based facial features are concatenated and Principal Component Analysis method has been used to reduce the dimension of the facial feature vector. The audio and visual feature vectors are then fed to Hidden Markov Model separately to find out the log-likelihood of each modality. The reliability of each modality has been calculated using reliability measurement method. Finally, these integrated likelihood ratios are fed to Back-propagation learning neural network algorithm to discover the final speaker identification result. For measuring the performance of the proposed system, three different databases, that is, NOIZEUS speech database, ORL face database and VALID audio-visual multimodal database have been used for audio-only, visual-only, and audio-visual speaker identification. To identify the accuracy of the proposed system with existing techniques under various noisy environment, different types of artificial noise have been added at various rates with audio and visual signal and performance being compared with different variations of audio and visual features.
Read full abstract