
INTRODUCTION

 

Automatic speech recognition (ASR) is the technology that allows people to interact with a computer interface using their voices, and it is rapidly becoming part of everyday life. It is built into our phones, our game consoles, and our smartwatches. It is even automating our homes.


 

Fig.1:

 

Though ASR technology is improving steadily, it remains a challenging task because speech signals vary widely in accent, voice modulation, and background noise.

 

Project Question:

 

The current model aims to transcribe spoken English words to text accurately by preprocessing and modeling audio files.

 

DESCRIPTION OF DATA SET

 

The data for this
model was obtained from the Google
Speech Commands Dataset. This is a set of one-second .wav audio files, each
containing a single spoken English word. These words are from a small set of
commands, collected using crowdsourcing and are spoken by a variety of
different speakers.  Twenty core command
words were recorded, with most speakers saying each of them five times. The
core words are “Yes”, “No”, “Up”,
“Down”, “Left”, “Right”, “On”,
“Off”, “Stop”, “Go”, “Zero”,
“One”, “Two”, “Three”, “Four”, “Five”,
“Six”, “Seven”, “Eight”, and “Nine”. To
help distinguish unrecognized words, there are also ten auxiliary words, which
most speakers only said once. These include “Bed”, “Bird”,
“Cat”, “Dog”, “Happy”, “House”,
“Marvin”, “Sheila”, “Tree”, and “Wow”. The audio files are organized into folders named for the word they contain. No personal details were recorded for the participants; instead, each was assigned a random id, which appears in the file name as the part before the first underscore. If a participant contributed multiple utterances of the same word, these are distinguished by the number at the end of the file name.
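The folder-and-filename convention above can be parsed with a few lines of Python. This is a minimal sketch; the exact file name (including the middle segment) is illustrative, assuming only the layout described here: the word as the folder name, the speaker id before the first underscore, and the utterance number at the end.

```python
import os

def parse_speech_commands_path(path):
    """Split a Speech Commands file path into (word, speaker_id, utterance_no).

    Assumes the layout described above: <word>/<speaker_id>_..._<n>.wav.
    The file name used in the example below is illustrative.
    """
    word = os.path.basename(os.path.dirname(path))          # folder = spoken word
    stem = os.path.splitext(os.path.basename(path))[0]       # drop the .wav extension
    parts = stem.split("_")
    speaker_id = parts[0]           # random id assigned to the participant
    utterance_no = int(parts[-1])   # distinguishes repeats of the same word
    return word, speaker_id, utterance_no

print(parse_speech_commands_path("yes/0a7c2a8d_nohash_0.wav"))
# → ('yes', '0a7c2a8d', 0)
```

Grouping files by `speaker_id` rather than at random is useful when splitting the data, so that no speaker appears in both the training and test sets.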

 

DATA PREPROCESSING

 

Sound travels as waves, which are one-dimensional: at every moment in time, a wave has a single value, the height (amplitude) of the wave. To turn a sound wave into meaningful numbers that can be used for data modeling, we need to preprocess the audio files. I explored three different preprocessing techniques, which are described below:

 

1.     
SAMPLING AND FOURIER TRANSFORMATION:

 

In signal processing, sampling is the reduction of a continuous-time signal to a discrete-time signal. To transform an audio wave into a sequence of samples, we record the height of the wave at equally spaced points in time, in this case every 1/16,000th of a second (a 16 kHz sample rate).
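The two steps named in this section can be sketched as follows. This is a minimal example that stands in a synthetic one-second 440 Hz tone for a dataset file, assuming only the 16 kHz sample rate stated above: sampling produces 16,000 equally spaced wave heights, and a Fourier transform of those samples recovers the dominant frequency.

```python
import numpy as np

SAMPLE_RATE = 16_000  # samples per second, matching the dataset's one-second clips

# Sampling: record the wave height at equally spaced points, every 1/16,000th s.
# A synthetic 440 Hz sine stands in for a spoken-word recording here.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE   # 16,000 time points covering one second
samples = np.sin(2 * np.pi * 440 * t)      # wave height at each sample point

print(len(samples))  # → 16000

# Fourier transformation: decompose the samples into their frequency content.
spectrum = np.abs(np.fft.rfft(samples))
freqs = np.fft.rfftfreq(SAMPLE_RATE, d=1 / SAMPLE_RATE)
peak = freqs[np.argmax(spectrum)]          # dominant frequency in the signal

print(peak)  # → 440.0
```

On real speech clips the spectrum is far richer than a single peak, which is why the later preprocessing steps work with the full frequency content rather than one dominant frequency.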

