Introduction: An Automatic Speech Recognition (ASR) system recognizes speech utterances and can
thus be used to convert speech into text for various purposes. These systems are deployed in
environments ranging from clean to noisy and are used by people of all ages and types. This
variability is also one of the major difficulties in developing an ASR system. An ASR system
therefore needs to be efficient, accurate, and robust. Our main goal is to minimize the error
rate in both the training and testing phases of an ASR system. The performance of an ASR system
depends on the combination of feature extraction technique and back-end technique used. This
paper presents a performance comparison of different combinations of feature extraction
techniques and back-end techniques, using a continuous speech recognition system.
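The error rate referred to above can be illustrated with a minimal sketch. This section does not say which error metric is used, so the sketch assumes a token-level error rate (word or phone) computed from the edit distance between a reference transcript and a recognized hypothesis. The function names `edit_distance` and `error_rate` are hypothetical and are not taken from the paper or from Kaldi.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j tokens of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis):
    """Error rate in percent, assuming whitespace-separated tokens."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

Applied to phone-level transcripts, the same computation gives a phone error rate. Because insertions are counted, the value can exceed 100 when the hypothesis is much longer than the reference.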
Methods: Hidden Markov Models (HMMs), Subspace Gaussian Mixture Models (SGMMs), and Deep Neural
Networks (DNNs) are used at the back end of the implemented system. The DNN-based models comprise
the DNN-HMM architecture in Karel’s and Dan’s implementations and a hybrid DNN-SGMM architecture.
Mel Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP), and Gammatone
Frequency Cepstral Coefficients (GFCC) are used as feature extraction techniques at the front end
of the proposed system. The Kaldi toolkit is used to implement the proposed work. The system is
trained on the Texas Instruments-Massachusetts Institute of Technology (TIMIT) speech corpus for
the English language.
Results: The experimental results show that MFCC outperforms GFCC and PLP in noiseless conditions,
while PLP tends to outperform MFCC and GFCC in noisy conditions. Furthermore, the hybrid of Dan’s
DNN implementation and SGMM performs best for back-end acoustic modeling. The proposed
architecture, with PLP feature extraction at the front end and the hybrid of Dan’s DNN
implementation and SGMM at the back end, outperforms the other combinations in a noisy
environment.
Conclusion: Automatic speech recognition has numerous applications in everyday life, such as home
automation, personal assistants, and robotics. It is highly desirable to build an ASR system with
good performance. The performance of an ASR system is affected by various factors, including
vocabulary size, whether the system is speaker dependent or independent, whether speech is
isolated, discontinuous, or continuous, and adverse conditions such as noise. The paper presented
an ensemble architecture that uses PLP for feature extraction at the front end and a hybrid of
SGMM + Dan’s DNN at the back end to build a noise-robust ASR system.
Discussion: This paper compares the performance of continuous ASR systems developed using
different combinations of front-end feature extraction (MFCC, PLP, and GFCC) and back-end
acoustic modeling (mono-phone, tri-phone, SGMM, DNN, and hybrid DNN-SGMM) techniques. Each
front-end technique is tested in combination with each back-end technique. Finally, the results
of the resulting combinations are compared to find the best-performing combination in noisy and
in clean conditions.
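The comparison described above can be sketched as a small grid search over front-end and back-end pairs. The technique names are taken from the text. The function names and the layout of the results dictionary are assumptions made for illustration. No error rates are included, because this section reports none; measured values would have to be supplied by the reader.

```python
from itertools import product

# Technique names as listed in the text.
FRONT_ENDS = ("MFCC", "PLP", "GFCC")
BACK_ENDS = ("mono-phone", "tri-phone", "SGMM", "DNN", "hybrid DNN-SGMM")


def all_combinations(front_ends=FRONT_ENDS, back_ends=BACK_ENDS):
    """Pair every front-end technique with every back-end technique."""
    return list(product(front_ends, back_ends))


def best_combination(error_rates):
    """Return the (front_end, back_end) pair with the lowest error rate.

    error_rates is a hypothetical mapping from a (front_end, back_end)
    tuple to a measured error rate in percent; lower is better.
    """
    if not error_rates:
        raise ValueError("no results to compare")
    return min(error_rates, key=error_rates.get)


def best_per_condition(results):
    """Pick the best pair separately for each condition.

    results is a hypothetical mapping such as
    {"clean": {...}, "noisy": {...}}, where each inner dict has the
    layout expected by best_combination.
    """
    return {condition: best_combination(rates)
            for condition, rates in results.items()}
```

With three front ends and five back ends, `all_combinations()` yields 15 pairs. Keeping the clean and noisy results in separate dictionaries lets a different pair be selected for each condition.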