Abstract
Background: Multimodal speech recognition has proved to be one of the most promising solutions for robust speech recognition, especially when the audio signal is corrupted by noise. Because the visual speech signal is not affected by acoustic noise, it can supply additional information that improves recognition accuracy in noisy conditions. A critical stage in designing a robust speech recognition system is choosing a reliable classification method from the large variety of available techniques. Deep learning is well known for its ability to classify nonlinear problems while taking into account the sequential nature of the speech signal. Owing to its achievements in both speech and image recognition, numerous studies have applied deep learning to Audio-Visual Speech Recognition (AVSR) problems. Although these studies have reported encouraging results, improving accuracy in noisy conditions and selecting the best classification technique remain active research topics.
Objective: This paper aims to build an AVSR system that combines acoustic and visual speech information and uses a deep-learning-based classification technique to improve recognition performance in both clean and noisy environments.
Methods: Mel Frequency Cepstral Coefficients (MFCC) and the Discrete Cosine Transform (DCT) are used to extract features from the audio and visual speech signals, respectively. Because the audio feature rate is higher than the visual feature rate, linear interpolation is needed to equalize the feature vector sizes; the two streams are then combined by early integration into a single feature vector. Bidirectional Long Short-Term Memory (BiLSTM), a deep learning technique, is used for classification, and its results are compared with those of other classifiers, namely the Convolutional Neural Network (CNN) and the traditional Hidden Markov Model (HMM). The effectiveness of the proposed model is demonstrated on two multi-speaker AVSR datasets, AVletters and GRID.
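The interpolation and early-integration step can be illustrated with a minimal sketch. It assumes that the lower-rate visual features are upsampled to the audio frame count (the abstract does not state the direction) and that fusion is plain per-frame concatenation. The function names, the frame counts, and the visual dimension of 30 are hypothetical choices for illustration, not values taken from the paper; only the 13-coefficient audio vector is mentioned in the text.

```python
import numpy as np


def upsample_linear(features, target_len):
    """Linearly interpolate a (frames, dims) sequence to target_len frames."""
    src_len, dims = features.shape
    # Place both sequences on a common normalized time axis [0, 1].
    src_t = np.linspace(0.0, 1.0, src_len)
    dst_t = np.linspace(0.0, 1.0, target_len)
    # Interpolate each feature dimension independently.
    columns = [np.interp(dst_t, src_t, features[:, d]) for d in range(dims)]
    return np.stack(columns, axis=1)


def early_fusion(audio, visual):
    """Upsample visual features to the audio frame count, then concatenate."""
    visual_up = upsample_linear(visual, audio.shape[0])
    return np.concatenate([audio, visual_up], axis=1)


# Hypothetical shapes: 100 audio frames x 13 MFCCs, 25 video frames x 30 DCT coefficients.
audio = np.random.default_rng(0).normal(size=(100, 13))
visual = np.random.default_rng(1).normal(size=(25, 30))
fused = early_fusion(audio, visual)
print(fused.shape)
```

Each row of `fused` is one combined audio-visual frame that a sequence classifier such as a BiLSTM could consume.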
Results: The proposed model gives promising results. On GRID, the integrated audio-visual features achieved the highest recognition accuracies of 99.07% and 98.47% for clean and noisy data, respectively, an enhancement of up to 9.28% and 12.05% over audio-only features. On AVletters, the highest recognition accuracy is 93.33%, an enhancement of up to 8.33% over audio-only features.
Conclusion: Based on the obtained results, we conclude that increasing the size of the audio feature vector from 13 to 39 does not effectively improve recognition accuracy in a clean environment, but it gives better performance in a noisy environment. BiLSTM is considered the optimal classifier for a robust speech recognition system when compared with CNN and the traditional HMM, because it takes into account the sequential nature of the speech signal (audio and visual). Compared with audio-only features, the proposed model greatly improves recognition accuracy and decreases the loss value in both clean and noisy environments. Compared with previously reported results on the same datasets, our model gives higher recognition accuracy, which confirms its robustness.