Speech Recognition using Artificial Neural Network (ANN)

Speech Recognition

Speech is the primary way people communicate. Speech recognition is a software technology that converts spoken language into a machine-readable format. Today it is widely used for interaction between humans and machines, including mobile devices, which makes it an important technology.

Speech recognition can be divided into two parts. The first part is the speech recognition process, and the second part is the speech pattern recognition technique. The speech recognition process contains four steps:

  • Speech data collection
  • Speech pre-processing
  • Speech feature extraction
  • Speech classification

The speech pattern recognition technique is based on the Artificial Neural Network (ANN).

Speech Recognition Process

The speech recognition process consists of several steps that together recognize what was spoken. The steps of speech recognition are shown in the figure.


Speech is a form of human interaction that includes articulation (how we make sounds using the mouth, lips, and tongue), voice (how we use our breath to make a sound), and fluency (the rhythm of our speech).

Steps in Speech Recognition

Step 1: Speech Data Collection

We need high-quality speech data so that the system can understand human speech in a variety of environments and contexts. Training an Automatic Speech Recognition (ASR) system requires a large volume of data, so Natural Language Utterance (NLU) collections are used to gather the data needed to train and test the recognizer.

Tools for collecting speech data

  • Telephony
  • Single speaker
  • Multi-speaker
  • Speech modality
  • Text corpora
  • Prompt variation
  • Embedded device and other resources

Step 2: Speech Pre-processing

The speaker's speech is received as a waveform. Many software tools are available to record human speech. The recorded signal may contain background noise along with the speech, and speech pre-processing helps to reduce such problems.

Pre-processing plays an important role in eliminating irrelevant background noise, and it is mainly used to improve the accuracy of speech recognition. It mainly involves noise filtering, smoothing, windowing, framing, and echo removal.
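
Two of the pre-processing operations named above, smoothing and framing, can be sketched in a few lines of plain Python. This is only an illustration: the function names, the moving-average width, and the frame and hop lengths are assumptions chosen for the example, not values taken from any particular recognizer.

```python
def moving_average(signal, width=3):
    """Smooth a signal with a centered moving average (edges use a shorter window)."""
    half = width // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        window = signal[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def split_into_frames(signal, frame_length, hop_length):
    """Cut a signal into overlapping frames; a trailing partial frame is dropped."""
    frames = []
    start = 0
    while start + frame_length <= len(signal):
        frames.append(signal[start:start + frame_length])
        start += hop_length
    return frames


# Hypothetical noisy samples standing in for a recorded waveform.
samples = [0.0, 0.9, 0.1, 1.0, 0.0, 0.8, 0.2, 1.0, 0.1, 0.9]
smoothed = moving_average(samples, width=3)
frames = split_into_frames(smoothed, frame_length=4, hop_length=2)
print(len(frames), "frames of", len(frames[0]), "samples")
```

Overlapping frames (a hop shorter than the frame) are used so that a sound falling on a frame boundary still appears whole in a neighboring frame.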

Step 3: Speech Feature Extraction

Feature extraction is the main part of speech recognition. Speech varies from person to person, and because of this variation in speech signals, feature extraction is needed to reduce the variability. The features extracted from the input signal are what help the system identify the speaker. For this reason, feature extraction is often called the heart of the system.

Techniques for speech feature extraction

Nowadays the following techniques are used for feature extraction. These techniques are also useful in other areas of speech processing.

Mel-Frequency Cepstral Coefficients (MFCC)

Mel-Frequency Cepstral Coefficients (MFCC) are the leading approach to speech feature extraction. MFCC takes into account how sensitive human perception is to different frequencies, which makes it well suited to speech and speaker recognition.

Calculation of MFCC

1. Pre-Emphasis


2. Window

      For a Hamming window of N samples:

$$w[n] = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N - 1}\right), \quad 0 \le n \le N - 1$$


3. Calculate Fourier Transform and Power Spectrum


4. Filter Banks
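
The four steps listed above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the pre-emphasis coefficient of 0.97, the mel-scale formula `2595 * log10(1 + f / 700)`, the direct (slow) DFT, and all function names are choices made for this example. A production system would use an FFT library instead.

```python
import math


def pre_emphasis(signal, alpha=0.97):
    """Step 1: y[n] = x[n] - alpha * x[n-1], which boosts high frequencies."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]


def hamming_window(length):
    """Step 2: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def power_spectrum(frame):
    """Step 3: |DFT|^2 / N for bins 0..N/2, computed with a direct DFT."""
    n_samples = len(frame)
    spectrum = []
    for k in range(n_samples // 2 + 1):
        real = sum(x * math.cos(2 * math.pi * k * n / n_samples)
                   for n, x in enumerate(frame))
        imag = -sum(x * math.sin(2 * math.pi * k * n / n_samples)
                    for n, x in enumerate(frame))
        spectrum.append((real * real + imag * imag) / n_samples)
    return spectrum


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Step 4: triangular filters whose centers are evenly spaced on the mel scale."""
    top_mel = hz_to_mel(sample_rate / 2)
    mel_points = [i * top_mel / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mel_points]
    bank = []
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):      # rising edge of the triangle
            row[k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge, peak of 1.0 at center
            row[k] = (right - k) / (right - center)
        bank.append(row)
    return bank


# Hypothetical input: one 64-sample frame of a 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(64)]
emphasized = pre_emphasis(frame)
windowed = [x * w for x, w in zip(emphasized, hamming_window(len(frame)))]
spectrum = power_spectrum(windowed)
bank = mel_filter_bank(n_filters=6, n_fft=len(frame), sample_rate=sample_rate)
energies = [sum(f * p for f, p in zip(row, spectrum)) for row in bank]
log_energies = [math.log(e + 1e-10) for e in energies]  # small offset avoids log(0)
```

The filters are narrow at low frequencies and wide at high frequencies, which is how the mel scale reflects the frequency sensitivity of human hearing.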




Linear Predictive Coding (LPC)

Linear Predictive Coding is a tool widely used in low-bit-rate speech coders. It provides accurate estimates of speech parameters and is relatively efficient to compute. The key element of LPC is the linear predictive filter, which estimates the value of the next sample as a linear combination of previous samples.
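
As an illustration of predicting the next sample from a linear combination of previous samples, the sketch below estimates predictor coefficients with the autocorrelation method and the Levinson-Durbin recursion. The choice of that method, the function names, and the toy decaying signal are assumptions made for this example.

```python
def autocorrelation(signal, max_lag):
    """r[lag] = sum over n of x[n] * x[n - lag], for lag = 0..max_lag."""
    return [sum(signal[n] * signal[n - lag] for n in range(lag, len(signal)))
            for lag in range(max_lag + 1)]


def lpc_coefficients(signal, order):
    """Return a[1..order] such that x[n] is approximated by sum_k a[k] * x[n-k]."""
    r = autocorrelation(signal, order)
    coeffs = [0.0] * (order + 1)  # coeffs[0] is unused
    error = r[0]                  # prediction error energy
    for i in range(1, order + 1):
        if error == 0.0:
            break  # silent or perfectly predicted signal; leave the rest at zero
        acc = r[i] - sum(coeffs[j] * r[i - j] for j in range(1, i))
        k = acc / error           # reflection coefficient for this order
        updated = coeffs[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = coeffs[j] - k * coeffs[i - j]
        coeffs = updated
        error *= (1.0 - k * k)
    return coeffs[1:]


# Hypothetical signal: each sample is 0.9 times the previous one.
signal = [0.9 ** n for n in range(200)]
a = lpc_coefficients(signal, order=2)

# Predict the next sample from the last two samples.
predicted = sum(a[k] * signal[-1 - k] for k in range(len(a)))
```

For this toy signal the first coefficient comes out close to 0.9 and the second close to zero, because one previous sample is enough to predict the next.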

Step 4: Speech Classification

Speech classification is an important part of speech signal processing. The most common techniques used for speech classification are discussed below. These techniques involve complex mathematical functions and extract hidden information from the processed input signal.

Hidden Markov Model (HMM)

A Hidden Markov Model is a statistical Markov model in which the system being modeled is assumed to be a Markov process with hidden (unobserved) states. An HMM can be represented as the simplest dynamic Bayesian network. Modern general-purpose speech recognition systems are based on Hidden Markov Models. These are statistical models that output a sequence of symbols or quantities.
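
One way an HMM can be used for classification is to score an observed symbol sequence against one model per word and pick the most likely model. The sketch below uses the forward algorithm for the scoring. The two word models, their probabilities, and the discrete observation symbols are all hypothetical values made up for this example.

```python
def forward_probability(observations, start_prob, trans_prob, emit_prob):
    """Probability that the HMM emits `observations`, summed over all hidden paths."""
    n_states = len(start_prob)
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = [start_prob[s] * emit_prob[s][observations[0]]
             for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            emit_prob[s][obs]
            * sum(alpha[p] * trans_prob[p][s] for p in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state models over two observation symbols (0 and 1).
# Each model is (start probabilities, transition matrix, emission matrix).
models = {
    "yes": ([0.6, 0.4],
            [[0.7, 0.3], [0.4, 0.6]],
            [[0.9, 0.1], [0.2, 0.8]]),
    "no": ([0.5, 0.5],
           [[0.5, 0.5], [0.5, 0.5]],
           [[0.5, 0.5], [0.5, 0.5]]),
}

observed = [0, 0]
scores = {word: forward_probability(observed, *params)
          for word, params in models.items()}
best_word = max(scores, key=scores.get)
```

In a real recognizer the observation symbols would come from the feature extraction step, for example quantized feature vectors.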

Vector Quantization (VQ)

Vector quantization maps each input vector to the closest code-vector in a finite codebook. The amount of compression is described in terms of the rate, measured in bits per sample. If the codebook has size K and the input vector has dimension L, then $\lceil \log_2 K \rceil$ bits are needed to specify which code-vector was selected. The rate of an L-dimensional vector quantizer with a codebook of size K is therefore $\lceil \log_2 K \rceil / L$ bits per sample.
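
The rate formula and the code-vector selection can be checked with a small sketch. The codebook values and function names below are hypothetical, and squared Euclidean distance is assumed as the measure of closeness.

```python
def quantize(vector, codebook):
    """Return the index of the code-vector closest to `vector`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def rate_bits_per_sample(codebook_size, dimension):
    """Rate = ceil(log2(K)) / L for a codebook of size K and dimension L."""
    # (K - 1).bit_length() equals ceil(log2(K)) for K >= 1, with no rounding error.
    bits_per_vector = (codebook_size - 1).bit_length()
    return bits_per_vector / dimension


# Hypothetical codebook: K = 4 code-vectors of dimension L = 2.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [0.0, 5.0]]
index = quantize([0.9, 1.2], codebook)
rate = rate_bits_per_sample(len(codebook), len(codebook[0]))
```

Here each two-sample vector is replaced by a 2-bit index, so the rate is 1 bit per sample.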

What is an Artificial Neural Network (ANN)?

An Artificial Neural Network is a computational framework based on the structure and functions of biological neural networks. ANNs are considered nonlinear statistical data modeling tools, in which complex relationships between inputs and outputs are modeled or patterns are found. An ANN is also known simply as a neural network.

Human Neural Network Vs Artificial Neural Network

Types of Artificial Neural Network

There are many types of ANN, including:

  • Feedforward network
  • Modular neural network
  • Convolutional neural network
  • Radial basis function neural network
  • Recurrent neural network
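
To make the first type in the list concrete, here is a minimal sketch of a forward pass through a feedforward network with one hidden layer. The layer sizes, the sigmoid and softmax activations, and every weight value are assumptions picked for illustration. A trained network would learn its weights from data.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """One fully connected layer: weighted sum of the inputs plus a bias per unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def feedforward(features, hidden_w, hidden_b, out_w, out_b):
    """Input -> hidden layer (sigmoid) -> output layer (softmax); no feedback loops."""
    hidden = [sigmoid(v) for v in dense(features, hidden_w, hidden_b)]
    return softmax(dense(hidden, out_w, out_b))


# Hypothetical weights: 2 input features, 2 hidden units, 3 output classes.
hidden_w = [[2.0, -1.0], [-1.0, 2.0]]
hidden_b = [0.0, 0.0]
out_w = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out_b = [0.0, 0.0, 0.0]

probabilities = feedforward([1.0, 0.0], hidden_w, hidden_b, out_w, out_b)
predicted_class = probabilities.index(max(probabilities))
```

In a speech recognizer the input features could be the values produced by feature extraction, and each output class could stand for a phoneme category.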

Speech recognition using ANN properties

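Trainability: A network can be taught to form a relationship between input and output patterns. This can be used, for example, to teach the network to classify speech patterns into phoneme categories.

Generalization: A network does not simply memorize the training data; it learns the underlying patterns, so it can generalize to new examples. This is essential in speech recognition because acoustical patterns are never exactly the same.

Nonlinearity: A network can compute nonlinear functions of its input. This is useful since speech is a highly nonlinear process.

Robustness: A network is tolerant of noisy data. This is a valuable feature because speech patterns are noisy.

Uniformity: A network offers a uniform computational model that can combine different types of input. This makes it easy to use both basic and differential speech inputs.

Parallelism: A network is highly parallel in nature, so it suits implementation on parallel hardware. This will ultimately permit very fast processing of speech or other data.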
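
Trainability and generalization can be illustrated with the simplest trainable unit, a single perceptron. The sketch below is a toy under stated assumptions: the two-dimensional feature vectors, the two class labels standing in for phoneme categories, the learning rate, and the number of epochs are all made up for this example, and a real system would use a multi-layer network.

```python
def predict(weights, bias, features):
    """Output 1 if the weighted sum is positive, otherwise 0."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(examples, epochs=20, learning_rate=0.1):
    """Adjust weights whenever a training example is misclassified."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, bias, features)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


# Hypothetical feature vectors for two phoneme categories (labels 0 and 1).
training_data = [
    ([0.1, 0.9], 0), ([0.2, 0.8], 0), ([0.15, 0.7], 0),
    ([0.9, 0.1], 1), ([0.8, 0.3], 1), ([0.7, 0.2], 1),
]
weights, bias = train_perceptron(training_data)

# Inputs the network never saw during training, close to each category.
unseen_first = predict(weights, bias, [0.25, 0.75])
unseen_second = predict(weights, bias, [0.85, 0.25])
```

The unseen inputs are not identical to any training example, yet they fall on the correct side of the learned boundary, which is the kind of generalization the list above describes.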


In this article, I have described the two parts of speech recognition. The first part is the speech recognition process, which contains four steps. The first step is speech data collection, including the various sources from which speech data can be gathered. The second step is speech pre-processing. The third step is speech feature extraction, with techniques such as MFCC and LPC. The final step is speech classification, with techniques such as HMM and VQ. The second part gives a brief overview of the Artificial Neural Network (ANN) and its types. Finally, speech recognition is done using Artificial Neural Network properties such as robustness and nonlinearity.

Opinions expressed by AI Time Journal contributors are their own.

About Monisha M

Editorial Staff Intern Pandian Saraswathi Yadav Engineering College, interested in Data Analytics and Machine Learning.
