Automatic Speech Recognition – Introduction

Automatic Speech Recognition (ASR) is the ability of a machine to identify words and phrases in spoken language and convert them into text. The speech recognition process is traditionally split into two parts: an acoustic model and a language model. The acoustic model converts the audio into phonemes (some papers use characters instead), while the language model converts these phonemes or characters into a sequence of words. A classic ASR pipeline is shown in Figure 1. Many statistical ASR approaches have been explored over the past few decades, most notably Gaussian Mixture Models (GMMs) combined with Hidden Markov Models (HMMs) for the acoustic model, paired with n-gram models for the language model. However, the rise of deep neural networks, especially recurrent neural networks (RNNs), has taken ASR to a higher level, and end-to-end RNN-based models now play a leading role in modern speech recognition. In this article, we focus on some popular datasets and neural network models for ASR.

Figure 1. Pipeline of Automatic Speech Recognition. (Image taken from ASR courses from the University of Edinburgh.)
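
The split between acoustic model and language model described above is commonly formalized in statistical ASR as choosing the word sequence W that maximizes P(X | W) · P(W), where X is the audio, P(X | W) comes from the acoustic model, and P(W) comes from the language model. The sketch below illustrates only that scoring rule. The two candidate transcripts, all score values, and the names `decode` and `lm_weight` are hypothetical and chosen for illustration; a real system searches over a very large hypothesis space rather than a two-entry table.

```python
# Toy illustration of combining acoustic and language model scores.
# All numbers are invented for this example; they are not from any real model.

# Hypothetical acoustic scores: log P(audio | words).
# The two phrases sound alike, so the scores are close.
acoustic_log_likelihood = {
    "recognize speech": -12.0,
    "wreck a nice beach": -11.5,
}

# Hypothetical language model scores: log P(words).
# The first phrase is assumed to be far more plausible as text.
language_log_prob = {
    "recognize speech": -4.0,
    "wreck a nice beach": -9.0,
}


def decode(acoustic, language, lm_weight=1.0):
    """Return the hypothesis maximizing log P(X|W) + lm_weight * log P(W)."""
    return max(acoustic, key=lambda w: acoustic[w] + lm_weight * language[w])


# With the language model, the more plausible sentence wins.
best = decode(acoustic_log_likelihood, language_log_prob)
print(best)

# With the language model switched off, the acoustically closer phrase wins.
acoustic_only = decode(acoustic_log_likelihood, language_log_prob, lm_weight=0.0)
print(acoustic_only)
```

Working in log space turns the product of probabilities into a sum, and the weight on the language model term is a common tuning knob; setting it to zero here shows how the acoustic score alone can prefer an implausible transcript.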