
McMaster University

DigitalCommons@McMaster
EE 4BI6 Electrical Engineering Biomedical Capstones
Department of Electrical and Computer Engineering

4-23-2010

Design of a Limited Speech Recognition System for use in a Braille Teaching Device
Brett Lindsay
McMaster University

Recommended Citation
Lindsay, Brett, "Design of a Limited Speech Recognition System for use in a Braille Teaching Device" (2010). EE 4BI6 Electrical
Engineering Biomedical Capstones. Paper 34.
http://digitalcommons.mcmaster.ca/ee4bi6/34

This Capstone is brought to you for free and open access by the Department of Electrical and Computer Engineering at DigitalCommons@McMaster.
It has been accepted for inclusion in EE 4BI6 Electrical Engineering Biomedical Capstones by an authorized administrator of
DigitalCommons@McMaster. For more information, please contact scom@mcmaster.ca.
Design of a Limited
Speech Recognition System
for use in a
Braille Teaching Device
by

Brett Lindsay

Electrical and Biomedical Engineering


Faculty Advisor: Dr. Thomas E. Doyle

Electrical and Biomedical Engineering Project Report


submitted in partial fulfillment of the degree of
Bachelor of Engineering

McMaster University
Hamilton, Ontario, Canada
April 23, 2010
Copyright © April 2010 by Brett Lindsay

Abstract

This report defines the scope and content of the Electrical and Biomedical Engineering
Capstone Project as submitted by Brett Lindsay. The project involved the creation of a limited Speech
Recognition system for use in a Braille Teaching Device. The greater project (that of the Braille
Teaching Device) was completed in tandem with Messrs. Chris Agam and Jonathon Hernandez. It was
felt that the Speech Recognition component would be a valuable addition to the project, given the
nature of a teaching device for use by the visually impaired (who would otherwise need an assistant to
use said device). The Speech Recognition system was created by breaking the problem into four
subsections: the collection of data upon call by the teaching program, the manipulation of data, the
recognition algorithms to categorize said data, and the passing of results back to the teaching program.
For the recognition block, the relatively simple method of Dynamic Time Warping was chosen over
more complex options such as Hidden Markov Models or Neural Networks. This method presented
some problems as documented, specifically a tendency to favour letters with larger file sizes (such as
'w'). The Speech Recognition system created during the course of this project failed to deliver on the
wanted efficiency of 60% with as few false positives as possible. While the Speech Recognition system
presented is viable, its effectiveness is below that available on the market for a comparable price.

Acknowledgements
Chris Agam was a student at McMaster University in the Electrical and Biomedical Engineering
program and was a member of the group creating a Braille Teaching Device. His project was the
physical device itself. He provided the idea for the project.

Jon Hernandez was a student at McMaster University in the Electrical and Biomedical Engineering
program and was a member of the group creating a Braille Teaching Device. His project was the
programming of the microcontroller as well as software for use with the device. He took part in the
creation of the communication between his software, Mr. Agam's device, and Mr. Lindsay's speech
recognition system.

Billy Taj was a student at McMaster University in the Mechatronics Engineering program and provided
additional (basic) feedback in the testing of the device.

Dr. Thomas Doyle was a professor at McMaster University and functioned as the faculty adviser for the
duration of the project.

Contents

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

TABLE OF CONTENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

NOMENCLATURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 General Approach to the Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Scope of the Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Literature Review 3
2.1 Speech Recognition Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Common Methods of Implementing Speech Recognition . . . . . . . . . . . . . . . . . . . . . . 3
2.3 Spectrograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Comparable Project Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Problem and Methodology of Solutions 9
3.1 Statement of Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Methodology of Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.3 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.4 Data Manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.4.1 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.4.2 Windowing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.4.3 Cepstral Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.5 Recognition Algorithm: Dynamic Time Warping . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.5.1 Spectrogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.5.2 Match Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.5.3 DTW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.5.4 Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.6 Returning Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4 Design Procedures 17
4.1 Speech Recognition Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.3 Data Manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3.1 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3.2 Windowing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3.3 Cepstral Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.4 Recognition Algorithm: Dynamic Time Warping . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.4.1 Spectrogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.4.2 Match Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.4.3 DTW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.4.4 Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.5 Returning Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5 Testing Results and Discussion 25
5.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.2 Data Manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.2.1 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.2.2 Windowing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.2.3 Cepstral Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.3 Recognition Algorithm: Dynamic Time Warping . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.3.1 Spectrogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.3.2 Match Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.3.3 DTW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.3.4 Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.4 Returning Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6 Conclusions and Recommendations 38
6.1 Conclusions on Project Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.2 Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Appendix A: Computer Software Design Tools 39
Appendix B: Additional Testing Notes 40
Appendix C: Code of Software Elements 71
References 107
Vitae 108

List of Tables
2.1 Results from [8]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8
5.1 Time Difference DTW/DTWTHREE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .35
5.2 Average Time and Size of wav/txt. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .36
5.3 Results of Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .37

List of Figures
1.1 Braille Representations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2.1 Neural Network and HMM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 DTW Simplified . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 DTW Simplified Part Two . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Mathematical Equations of Spectrogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5 Spectrogram of an Audio Signal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.1 Speech Recognition Flow Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Data Acquisition Toolbox Tutorial Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.3 Hamming Window in MatLab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.4 Visualization of Cepstral Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.5 Visualization of Match Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14
3.6 Visualization of Distortion Matrix Creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.1 Speech Recognition Program. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17
4.2 Select Code From Recorder.m. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.3 Select Code From normalizer.m. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.4 Select Code From hamWindow.m & usefullSig.m . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.5 Select Code From cepAnal.m. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.6 Select Code From Comparison Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.7 Select Code From specCreate.m. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.8 Select Code From matchMat.m . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.9 Select Code From DTW.m . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.10 Visualization of Faster Trace. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.11 Select Code From DTWTHREE.m . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.12 Select Code From libCreat.m . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.13 Select Code From speechRec.m . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.1 Phases of Data Manipulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.2 Timing Measurements of Data Manipulations . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.3 Spectrograms. . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.4 Timing Measurements of Pattern Recognition .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.5 Results and c Values DTW Original . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.6 C Values for DTW Original . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.7 C Values Plotted Against File Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.8 Workings of matchMat and DTW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .32
5.9 C Values for DTW w Trace. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.10 Results and c Values DTW w Trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.11 Visualization of DTW vs DTW w Breaking Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.12 Speed of Pattern Recognition for Library Sample Size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .36
5.13 Time and Size of wav/txt. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .37

Nomenclature
Delimiter: a character used to separate independent pieces of data in a text file or data stream.
DTW: Dynamic Time Warping
HMM: Hidden Markov Models
m-file: the format of MatLab files.
NN: Neural Networks
Phoneme: The smallest unit of sound, e.g. the sounds 'ahh' or 'eee'.
Quefrency: a pseudo time domain resulting from Cepstral analysis.
Spectrogram: A representation of the frequencies native to a small portion of time in a signal.
SR: Speech Recognition

1 Introduction
1.1 Background
The greater group project, a collaboration between Chris Agam, Jon Hernandez, and myself, is a Braille
teaching device to be used by the visually impaired. In the USA and across the world Braille literacy
numbers are staggeringly low; as of 2009 only ten percent of blind children in America are Braille
literate [12].

Braille itself is a form of writing for the blind: each character is a "cell" of six dot positions
arranged two by three. By raising dots at various positions in various combinations, one creates
letters. For example, Figure 1.1 below shows the letters 'a' and 'p', where the black dots represent
the raised bumps.

Figure 1.1: The Braille representations of the letters 'a' and 'p'.

The scarcity of those fluent in Braille makes it a prime candidate for an electronic teaching
device, allowing people to simply plug in and learn. This interactive nature vastly improves upon the
teaching capabilities of a book and an assistant, as the assistant will most likely not be fluent in
Braille themselves. A teaching program therefore gives the assistant a great deal more ability to
help the learner.

Further, an important aspect of teaching methods and devices is the testing of the pupil on the
subjects being learned. The fact that the pupil will be blind poses a challenge. While it is possible
for the teaching program to let an assistant act as a supervisor of testing, a better solution is
direct interaction between user and program.

To this end, this project focuses on the development of efficient and lean speech recognition
software that will allow users to test themselves as they learn. By independently creating our own
software we cut down on cost as well as on the superfluous abilities of commercially available
software (which focuses on continuously reconstructing large, complex sentences independent of the
speaker).

1.2 Objectives
The objective of the project is the creation of speech recognition software for the Braille Teaching
Device. The method will roughly follow the steps outlined by Jawed et al. [8] in their creation of a
similar system. Their reported efficiency of 68% gave confidence that a minimum efficiency of 60% is
deliverable.

It is also necessary for the program to run as fast as possible in order to be useful. It should take, on
average, no more than two seconds (plus recording time) to run.

1.3 General Approach to the Problem
The problem was to create a limited speech recognition program. This was to be achieved using
MathWorks' MATLAB programming environment. After researching the various methods of speech
recognition in use today, the simpler Dynamic Time Warping method was chosen. This choice gave
confidence that the project could be completed by a single person, whereas the other methods could
have proven too difficult to implement.

1.4 Scope of the Project
The scope of the project was necessarily limited. As the project was being undertaken individually,
there was some worry about complexity: speech recognition projects are difficult even for
professional entities to create. It was therefore decided that the software would recognize only
thirty entries: the letters a-z of the alphabet as well as the commands 'enter', 'yes', 'no', and
'back'.

2 Literature Review
2.1 Speech Recognition Basics
Six diverse articles have been noted that cover the breadth of speech recognition theory and
implementation. Articles like [7] provided general information about the basics, while those like [2]
and [10] provided background on areas of speech recognition that will not necessarily be used in the
project but help build a full understanding of the options available. References [1] and [11]
provided background on the implementation of DTW with respect to speech recognition. Reference [8] is
the most useful piece, as it outlines the general steps used to create a speech recognition software
package similar to this project's.

Three books were also consulted. [5] focused on the human production and recognition of speech, and
while providing background it was less useful from a practical standpoint. [4] was helpful in
understanding the concept of cepstral analysis as well as the need for a non-rectangular window in
the data-manipulation phase. [6] provided information on a broad range of topics in a more practical
sense than the other books (though still largely theoretical).

The field of speech recognition can be broken down into discrete or continuous recognition, as well
as speaker independent or dependent. Discrete systems require the user to pause between sounds, while
continuous systems operate without breaks [7]. Speaker-dependent systems require the user to have
done some training with the system so it can recognize the user, whereas independent systems work
regardless of the user's speech patterns, tones, et cetera (e.g. automated phone services) [7].
Discrete, dependent systems are the easiest to create.

2.2 Common Methods of Implementing Speech Recognition
For the actual speech recognition component of the system, there are three main methods found in the
literature. The first is the Hidden Markov Model (HMM), a statistical model in which the likelihood
of the next state depends only on the current state, and the states themselves are unobserved [2]. It
is complex and very good at identifying speech that is slurred or accented (as in reality, where the
computer will be unable to identify most of the information passed in and must construct sentences
out of what little it did understand). It also far exceeds the complexity of the project's goals, and
so will not be used.

Neural Networks (NN) are also common in speech recognition systems [2]. Features in a context window
are run through a system of weighted nodes, the output of which is a classification of each input
frame, measured in terms of the probabilities of phoneme-based categories.

Figure 2.1: Diagrams of Neural Networks (left) and Hidden Markov Models (right).

Dynamic Time Warping (DTW), the final method found, will be used instead of the two previously
mentioned. It involves modifying the input data's temporal characteristics to fit within the realms
of a standard template, followed by (relatively) simple matching techniques [11]. In practice, this
is achieved by taking the entered and manipulated data and creating a spectrogram. One then creates a
spectrogram of the data one wishes to compare the input against. Next, a local match matrix is
created, defined by the cosine distance between the points in the two spectrogram matrices. From this
local match matrix, one can trace the "path of least resistance" to find the cheapest path through.
It is then a simple matter to use this value in a comparison structure to match an input signal
against a variety of template signals and find a "best fit".

The concepts behind all three methods can be difficult to wrap one's head around. HMMs and NNs were
not used, so the project no longer concerns itself with their in-depth workings. DTW is a simple
enough concept once one takes the time to simplify the example. Look at Figure 2.2: here, for the
purpose of demonstration, letters are matched rather than audio signals. On the left is an example of
"CHRIS" matched against "CHRIS", while on the right "CHRIS" is matched against "JON".

Figure 2.2: Example of function of DTW using names Chris and Jon.
The local match here is also simplified: the "distance" away from a letter is equal to 1. So in the
left matrix, one can see that the bottom left starts at 0, as both axes hold the same value (blank).
As one goes up the column, the value of the match gets further from the wanted value (blank) and so
continuously increases. At the points where both axes hold the same value (along the centre diagonal)
the match matrix continues to hold zeros, as these are matches, while values away from the diagonal
continually increase due to growing mismatch.

On the right portion of Figure 2.2, where "CHRIS" is matched against "JON", one can see by comparison
what happens when there are no matches between the letters of the words being tested: the further
into the comparison, the more mismatch there is.

In Figure 2.3 below, two signals are compared which are closer to being similar than "CHRIS" and
"JON": both "CHRIS" and "KRIIS" share the feature of ending in "IS". Note how, while the values of
the match matrix increase along the first three mismatched letters, the final two letters are matched
and so carry the current value along.

Figure 2.3: Example of function of DTW using names Chris and a misspelling of Chris, "Kriis".
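The letter-matching demonstration above can be reproduced in code. This is an illustrative Python sketch, not part of the project's MatLab code; note that because warping lets one letter align with several, the repeated 'I' in "KRIIS" can absorb a mismatch, so the exact score may differ slightly from the simplified figure while the ranking of matches stays the same:

```python
def letter_dtw(a, b):
    """DTW over two words with the simplified local 'distance' of
    Figure 2.2: 0 where the letters match, 1 where they differ."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0  # both axes start at the same (blank) value
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = local + min(D[i - 1][j - 1],  # diagonal
                                  D[i - 1][j],      # up
                                  D[i][j - 1])      # across
    return D[n][m]

print(letter_dtw("CHRIS", "CHRIS"))  # perfect match: 0
print(letter_dtw("CHRIS", "JON"))    # no matching letters: 5
```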

A more in-depth discussion of the mathematics of DTW is given in section 3.5.

After the local match matrix is created, there are two methods for comparing signals. The best is to
use the "final" value in the local match matrix as the definition of the best match. (In Figure 2.3,
this would be the value in the top right corner; in the actual code created, this final value is
found in the bottom right corner - see sections 4 and 5.) In Figure 2.3, the two comparisons work out
to a best of '0' for "CHRIS" vs "CHRIS" and a best of '3' for "CHRIS" vs "KRIIS". One would therefore
deem the left comparison the best match, and predict that that was the word said.

Another method is to trace through the local match matrix and use the length of this trace as the
definition of the best match. This has some practical advantages over the other method, as
demonstrated in the results section of this report (section 5). However, this method has some large
negatives. As one can see in the example presented in Figure 2.3, the traces of least resistance for
matching "CHRIS" against both "CHRIS" and "KRIIS" are the same, despite the fact that one is a much
better match than the other.

These values need to be normalized so that matrices of various sizes are comparable, as larger
signals take more steps and therefore produce larger final values.

2.3 Spectrograms
Understanding spectrograms is necessary to understanding the project. In the explanation of DTW the
input signals ("CHRIS" etc.) were somewhat glossed over. In actual practice, the match matrix is
created by comparing the spectrograms of two audio signals. A spectrogram is a representation of the
power spectral density inherent to an audio signal over time; that is, the magnitude of the
frequencies native to each point in time of a signal. The exact mathematical formulas involved in its
creation are seen in Figure 2.4.

Figure 2.4: Mathematical equations for creation of a Spectrogram

STFT stands for Short Time Fourier Transform. It works by taking the Fourier Transform of the signal
x(t) over only one short region at a time. This region is determined by the windowing function
w(t − τ), which slides along the signal so as to zero everything in the signal except the very small
part at which one wishes to find the frequency components. By repeatedly sliding the window and
taking the Fourier Transform, one builds up a series of frequency values for specific small slices of
time.
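The slide-window-then-transform procedure can be sketched directly. The following is an illustrative Python/NumPy version (the project used MatLab, and the window length and hop size here are arbitrary example values, not the project's settings):

```python
import numpy as np

def spectrogram(x, win_len=256, hop=128):
    """Short Time Fourier Transform: slide a window along the signal,
    zeroing everything outside it, and take the FFT of each slice."""
    w = np.hamming(win_len)  # tapered window (see section 3.4.2)
    frames = []
    for start in range(0, len(x) - win_len + 1, hop):
        seg = x[start:start + win_len] * w       # one short area of x(t)
        frames.append(np.abs(np.fft.rfft(seg)))  # magnitudes of its frequencies
    # rows: frequency bins, columns: points in time
    return np.array(frames).T
```

For a pure 1 kHz tone sampled at 8 kHz, every column of the result peaks at the frequency bin corresponding to 1 kHz, matching the intuition of Figure 2.5.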

This can hopefully be understood via Figure 2.5, which shows an audio signal (y-magnitude vs x-time)
and its resulting spectrogram (magnitude of y-frequency at each x-time). For the first pixel in
x-time, a time with very little signal, there are only lower y-frequencies (less than 1 kHz). For a
pixel closer to 0.2 s, the corresponding frequencies reach much higher (up to 4 kHz).

Figure 2.5: Spectrogram of an audio signal.

2.4 Comparable Project Results
[8] includes testing that allowed the achievable efficiency results to be gauged beforehand. Their
results are reproduced in Table 2.1 below.

Table 2.1: Results for project [8].

3 Problem and Methodology of Solutions
3.1 Statement of Problem
The basic problem is the identification of speech. The goal of the project as stated in the Proposal
was for the software to recognize a combination of discrete speaker-dependent commands (e.g. 'Enter')
and discrete speaker-independent characters (e.g. 'a') for testing.

The speech recognition system was created by breaking the problem into four subsections: the
collection of data when signaled by the teaching program, the manipulation of data, the recognition
algorithms to categorize said data, and the passing of results back to the teaching program.

3.2 Methodology of Solutions
There are four basic steps in the speech recognition system. The initial phase is data acquisition:
the entering of data into the computer system from the user. Following this, the data must be
manipulated into a usable form. After this, robust recognition algorithms must be used to match the
input data with data saved in the library to correctly identify the sound. Finally, the identified
sound must be passed out to the teaching program.

Figure 3.1: Flow diagram of the Speech Recognition blocks.
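The four blocks of the flow diagram can be chained as a single driver routine. A hypothetical Python sketch (the block names are placeholders for illustration, not the project's m-files):

```python
def recognize(record, manipulate, compare, library):
    """Chain the four blocks: collect data, manipulate it, score it
    against every library template, and return the best match."""
    signal = record()                    # data collection
    features = manipulate(signal)        # data manipulation
    scores = {name: compare(features, template)  # recognition block
              for name, template in library.items()}
    return min(scores, key=scores.get)   # result returned: lowest cost wins
```

The same structure holds whatever the individual blocks contain, which is what allowed each block to be developed and tested independently.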

3.3 Data Collection
Data collection was achieved through the MatLab Data Acquisition Toolbox. This toolbox allows one to
interact with Microsoft winsound and take in audio signals directly from a microphone installed on
the computer in use. The toolbox automatically brings this data in and stores it as a workable matrix
in the MatLab environment.

While it is possible to do continuous recognition with the Data Acquisition Toolbox, it had already
been decided to build a discrete system, which involves the use of triggers and set sample counts. A
recording length of 3 seconds was chosen, as it allows the user enough time to say the letter even if
they are somewhat unprepared.

The beginning and end of the data collection were deemed to necessitate a sound, to inform the user
that recording had begun/stopped.

The Data Acquisition Toolbox came with a tutorial in its use. The sample code provided was a good
place to start in learning how to use said toolbox. Below is said code.

ai = analoginput('winsound');        % create a Windows sound card analog input
addchannel(ai, [1 2]);               % use hardware channels 1 and 2 (stereo)
set(ai, 'SampleRate', 8000);         % sample at 8 kHz
set(ai, 'SamplesPerTrigger', 3000);  % record 3000 samples (3/8 s at 8 kHz)
set(ai, 'TriggerType', 'immediate'); % begin recording as soon as started
start(ai);
[data,time] = getdata(ai);           % retrieve the samples into a matrix

Figure 3.2: Data Acquisition Toolbox tutorial code.

This code is short and makes recording audio simple. The first line sets the type of analog input
being used - in this case winsound. One could modify this to be viable with a number of comparable
software packages for use on other systems (such as Python on Mac).

The SampleRate property sets the sampling frequency (in Hz) and SamplesPerTrigger sets the length of
the signal to be recorded (in this case, 3000/8000 = 3/8 s). Once all of the wanted parameters are
set, one starts the analog input and the recording is done automatically. The getdata function then
puts the recording into a matrix that one can easily manipulate.

3.4 Data Manipulation
3.4.1 Normalization
Normalization is the process by which the signal is brought into a range consistent with expected
values. Original research into the creation of a normalization algorithm led to the thought that it would
require analysis of the signal to find the peak value, followed by a reduction of the signal's
amplitude. That is, one would have to run through the entire signal, record the maximum value, and then
go over the entire signal again and reduce every point based on this maximum.

This posed the problem of being quite computationally wasteful, and as such initial thought was put
into a means by which this could be done at the same time as the program was checking the signal
values for the necessary windowing (see 3.4.2).

An alternative arose when looking for a way to maximize the potential provided by using the
MatLab program instead of another programming environment. MatLab has the advantage of being
built around the quick manipulation of whole matrices, and as such it was realised that one could
normalize a signal by merely dividing by the result of the built-in max function.
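This whole-matrix approach can be sketched as follows (in Python/NumPy here rather than the project's MatLab; the 0.5 peak target matches the value used later in section 4.3.1, and the silent-signal guard is an illustrative addition):

```python
import numpy as np

def normalize(x, target=0.5):
    """Scale a signal so its peak magnitude equals `target`.

    Mirrors the whole-matrix MatLab idiom x = 0.5*x/max(abs(x))
    described in the text; the zero-peak guard is an addition.
    """
    peak = np.max(np.abs(x))
    if peak == 0:            # avoid dividing a silent signal by zero
        return x
    return target * x / peak

# Example: a signal peaking at 2.0 is scaled to peak at 0.5.
sig = np.array([0.0, 1.0, -2.0, 0.5])
norm = normalize(sig)
```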

3.4.2 Windowing
Windowing is the process of dividing the signal into small sections to be looked at independently of
one another, and is simple to achieve (multiply the signal by zero except at point of interest). For this
program, it is assumed that the only region of interest is the letter being spoken. As such, it is not
necessary to window the signal multiple times - one need only determine where the useful portion of
the signal is and cut away everything else.

A function will be created to handle the extraction of the useful signal from the total three seconds of
input data. A rough form of zero crossing will be used to determine when a useful signal has begun and
ended. This involves the checking for a certain level - the "zero" - to be crossed.
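The level-crossing idea can be sketched as follows (a Python/NumPy illustration; the threshold and offset values are assumptions for the example, not taken from the report):

```python
import numpy as np

def useful_region(x, thresh=0.05, offset=100):
    """Find [start, end) indices of the useful portion of a signal by
    looking from both ends for the first sample whose magnitude crosses
    `thresh` (the "zero" level), backing off by `offset` samples.
    """
    above = np.nonzero(np.abs(x) > thresh)[0]
    if above.size == 0:
        return 0, len(x)                 # nothing crossed the level
    start = max(above[0] - offset, 0)
    end = min(above[-1] + offset, len(x))
    return start, end

# A burst in the middle of silence is isolated, plus the offset margin.
x = np.concatenate([np.zeros(500), 0.3 * np.ones(200), np.zeros(500)])
a, b = useful_region(x)
```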

Once the useful signal has been extracted, the harsh cutoff at the edges poses a problem: a sharp
discontinuity introduces spurious high-frequency components. As the later pattern recognition stages
depend on creating spectrograms, this could be a problem (see 2.3 for a description of spectrograms) [4]. As such,

one is required to use a window which is capable of removing the high frequencies at the edges while
not destroying the frequency information present in the useful signal. Techniques for this include the use
of a Hamming window, which can be described by the equation:
w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1)),  0 <= n <= N-1
w[n] = 0,                              otherwise

MatLab has a Hamming Window function built in, and so this will be used for ease (rather than using
the above formula). Figure 3.3 is a visualization in MatLab of a Hamming Window both in the time
domain and the frequency domain.

Figure 3.3: Hamming Window in time and frequency domain as in MatLab
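The formula above can be checked numerically; NumPy's built-in hamming uses the same 0.54/0.46 coefficients as the equation (as does MatLab's hamming function):

```python
import numpy as np

# The Hamming window computed from the formula in the text, compared
# against NumPy's built-in implementation.
N = 64
n = np.arange(N)
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

# Applying the window to a signal is a pointwise multiplication.
windowed = w * np.ones(N)
```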

3.4.3 Cepstral Filtering


Cepstral Analysis involves the use of the Inverse DFT to separate the person's characteristic vocal tract
sounds from the actual speech. The process for Cepstral Analysis has been well detailed via
information from [4], and should not be difficult to implement with basic MatLab coding
techniques.

Cepstral filtering is very useful in the creation of a speech recognition system, as one is required to
match speech, as opposed to voices. As such, the removal of sound distinctive to the user's vocal tract
will improve the pattern recognition.

Cepstral filtering works as follows:


The audio signal of one's voice has two components - the vocal excitation source (s) and the
vocal tract source (v). These two sources form the signal via a convolution such that:
f(t) = v(t)*s(t)
In order to remove the unwanted v(t), we take the Fourier Transform to get the frequency
domain representation, where a convolution in time becomes a multiplication:
|F(f)| = |V(f)|x|S(f)|
We can then make the two components separable by using the properties of the logarithm:
ln(|F(f)|) = ln(|V(f)|) + ln(|S(f)|)
If one now takes the inverse Fourier Transform, one ends up with a representation of the
original signal in what is termed the "quefrency" domain, where the two signals remain separated
(the addition in the log-spectrum carries through the linear inverse transform). The quefrency is in
units of time, but it is no longer an accurate representation of time, hence the new name. The
movement into the quefrency domain is visualized in Figure 3.4.

Figure 3.4: Visualization of the steps involved in Cepstral Filtering.


The wanted s components of human speech are known to reside in the lower quefrencies. In Figure 3.4,
the spike at a quefrency of ~8.5 ms is the v component, and can be filtered out.

Native to the MatLab coding environment are the functions cceps and icceps. These perform the
forward and inverse cepstral transformations of a signal into and out of the quefrency domain.
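The separation steps above can be sketched with the real cepstrum (a simplification: the project used MatLab's complex cepstrum via cceps/icceps, which is invertible back to a time signal; the keep fraction of 1/6 here mirrors the mask used later in cepAnal.m):

```python
import numpy as np

def lifter_low_quefrency(x, keep_frac=1 / 6):
    """Illustrative real-cepstrum liftering: move to the quefrency
    domain, keep only the low quefrencies (and their mirrored band),
    and return the smoothed log-magnitude spectrum."""
    spectrum = np.fft.fft(x)
    # The log of the magnitude turns the convolution v(t)*s(t) into a sum.
    log_mag = np.log(np.abs(spectrum) + 1e-12)
    cep = np.fft.ifft(log_mag).real          # quefrency domain
    n = len(cep)
    cut = int(n * keep_frac)
    mask = np.zeros(n)
    mask[:cut] = 1.0
    mask[n - cut:] = 1.0                     # keep the mirrored band too
    cep *= mask                              # remove mid/high quefrencies
    return np.fft.fft(cep).real              # back to a log-spectrum

smoothed = lifter_low_quefrency(np.sin(2 * np.pi * np.arange(256) / 16))
```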

3.5 Recognition Algorithm: Dynamic Time Warping
The methodology of the recognition algorithm will be as follows:
1. The audio signal has been input and manipulated into a form better suited for pattern
recognition. Its spectrogram is now created.
2. A match matrix is created using the spectrograms of the input signal and of the various
reference signals stored in the library.
3. The DTW process is run on the match matrix to get a value of relationship between the two signals.

3.5.1 Spectrogram
The concept of the spectrogram was detailed in section 2.3. MatLab allows one to easily create a
spectrogram of a signal with the built in specgram function.

3.5.2 Match Matrix


The match matrix is the overlay of two signals' spectrograms. It is built by finding, for each point in
the matrix, the cosine of the angle between the corresponding pair of spectrogram column vectors [3];
this is an example of a form of Euclidean distance [8]. In Figure 3.5, for example, pixel (1,1) was found
using the vectors A(:,1) and B(:,1), where A and B are the matrices of the two spectrograms being
compared. Pixel (1,2) was created from A(:,1) and B(:,2), and so on. It is also important to normalize
the values in this matrix back down to a reasonable level [3].

[Plot: match matrix of two audio signals, "a" vs "garbage".]

Figure 3.5: Visualization of Match Matrix

3.5.3 DTW
Data recognition utilizing DTW has been detailed through various sources, mainly [3], [6], & [9]. The
method in [3] involves the modification of the input and reference signals into their respective
spectrograms before DTW. The formula as given by [6] to solve the cumulative distortion measure:
D(i,j) = d(i,j) + min over p(i,j) of {D[p(i,j)] + T[(i,j),p(i,j)]}
where d is a local measure between frame i of the input and frame j of the reference, p gives the
coordinates of possible predecessors, and T is the associated cost of the transition. This matches well
with the formula given by [3]:
D(i+1,j+1) = M(i+1,j+1) + min{M(i,j), M(i+1,j), M(i,j+1)}

In essence, this formula describes the creation of the D (distortion) matrix from the M (match)
matrix. When one creates the distortion matrix D(i,j), one begins at position (1,1) and sets it to a
null value. The D(i+1,j+1) value is then created using the match matrix value M(i+1,j+1) as the basis,
adding the value of the lowest-cost "jump" from a preceding pixel.

Simplified, if one is creating the distortion point D(4,2) one begins with the match point M(4,2). One
then looks at the values of M(3,1), M(4,1) and M(3,2) and adds the lowest - the quickest way to get
there. This can be seen in figure 3.6. One can then trace through the distortion matrix to find the
quickest path, and use this value as a means of comparison.

Figure 3.6: Visualization of Distortion Matrix Creation
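The recurrence can be sketched as follows (a Python/NumPy illustration of the standard cumulative form, which accumulates over the D values of the predecessors as in the m-file of section 4.4.3; not the report's exact code):

```python
import numpy as np

def distortion_matrix(M):
    """Build the cumulative distortion matrix D from a local match
    matrix M: each cell adds the cheapest of its three predecessors
    (diagonal, above, left)."""
    m, n = M.shape
    D = np.full((m, n), np.inf)
    D[0, 0] = M[0, 0]
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            preds = []
            if i > 0 and j > 0:
                preds.append(D[i - 1, j - 1])   # diagonal step
            if i > 0:
                preds.append(D[i - 1, j])       # step down
            if j > 0:
                preds.append(D[i, j - 1])       # step right
            D[i, j] = M[i, j] + min(preds)
    return D

# Matching frames (cost 0) chain along the diagonal for a total of 0.
D = distortion_matrix(np.array([[0.0, 2.0], [2.0, 0.0]]))
```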


Thought was also put into creating a faster way to trace through the distortion matrix, by breaking
away once the trace leaves certain bounds. Its effectiveness would need to be tested to see whether the
time saved by breaking early outweighs the time incurred by the added code.

3.5.4 Library
The library will be stored in a folder with the m-files, so that the speech recognition program can easily
access it. A simple piece of code will also be needed to enter new files into the
library.

There were two main choices for the way in which to store the data:
1. Save the signals as Microsoft wav files using the MatLab function wavwrite, and access them
using the MatLab function wavread. Convert each accessed vector into its spectrogram every
time it is accessed.
2. Save the spectrograms of the signals as delimited text files using the MatLab function
dlmwrite, and access them using dlmread. Convert each vector into its spectrogram matrix
only once, before it is saved.

The thought process is that saving as a delimited text file should logically take the program less time to
access the spectrogram - as it won't have to convert it every time - compared to saving as a wav. This
will come at the expense of the library being larger, as the spectrogram matrix is much larger in
size than the signal's wave vector. Testing will be needed to determine the better method.

3.6 Returning Results


Upon program activation, the teaching program passes the value of the entry to be
recognized. Data output involves returning a signal of "correct character recognized", "failure to
recognize", or the (incorrect) recognized character (1-26 = a-z; 27-30 = Enter, Yes, No, Back) to the
teaching program. Outputs were returned in the form of:
• 1-30 - incorrect character, outputs the results of pattern recognition (1-30).
• 50 - no satisfactory match.
• 100 - correct character.
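The mapping above can be sketched as a small function (a Python illustration of the output codes; the MatLab version appears in section 4.5):

```python
def result_code(r, expected):
    """Map the recognition result to the report's output codes:
    50 = no satisfactory match, 100 = correct character recognized,
    otherwise the library entry number of the (incorrect) match."""
    if r == 0:
        return 50          # no entry beat the minimum threshold
    if r == expected:
        return 100
    return r

# Usage: no match, a correct match, and an incorrect match.
codes = (result_code(0, 1), result_code(5, 5), result_code(23, 1))
```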

Original thought on the interaction between the MatLab speech recognition and the C# teaching program
was to use the MatLab Builder for .NET. This would have created a wrapper to allow the MatLab code
to be run in C#. Mr. Hernandez found another way to achieve this, using a C# program which only
required the m-files to be in the same directory. This was used instead.

4 Design Procedures
4.1 Speech Recognition Program
Figure 4.1 is the program speechRec.m. In this section, the design of its components will be outlined
by taking selected code from the relevant m-files. For the full code, see Appendix C.

function out = speechRec(in)


[audioIn fs]= recorder(); %get audio signal
audioIn = normalizer(audioIn);
audioIn = usefullSig(audioIn);
audioIn = cepAnal(audioIn);
audioIn = hamWindow(audioIn);

audioIn=specCreate(audioIn,fs);

%Comparison loop.
numLibEnt=30; %number of library entries, 1-30
%1-26 being alphabet, 27-30 being enter, yes, no, back.
numLibSam=2; %number of library samples (ie. 0 to 5 entries of 'a').
cmin=500; %minimum comparison accepted.
c=cmin;
ctemp=0; %#ok<NASGU>
cp=0.7071; %From experimental data, if the DTW block produces a value of
%0.7071 then this is a perfect match. This value is normalised
%for any size difference, et cetera.
r=0; %r is the variable for which is the current lowest match c.
%if r stays as 0, we therefore never achieved a c lower than
%min and don't have a match.
for m=1:numLibEnt
    for n=0:numLibSam
        [x fs]=wavread(['Library/lib' int2str(m) int2str(n) '.wav']);
        Y=specCreate(x,fs);
        M=matchMat(audioIn,Y);
        ctemp=abs(DTW(M)-cp);
        if (ctemp<c)
            c=ctemp;
            r=m;
        end
    end
end

%returning block.
if r==0
    out=50;
elseif r==in
    out=100;
else
    out=r;
end

return

Figure 4.1: m-file code used for Speech Recognition Program

4.2 Data Collection
Data collection was designed through the function recorder.m. This function took in no arguments, and
output the recorded signal (as a 1xn vector) as well as the sampling frequency used to record the audio.

In Figure 4.2 one can see some of the code created to do this. The sampling frequency was set
permanently to 8.192 kHz - a standard value for inputting sound using winsound devices. The
time to record was also permanently set to 3 seconds. If one wished to change these, one would have to
go into the code to modify it. Allowing them to be changed via an input argument was considered, but
ultimately discounted as pointless.

The analogue input was set to be of the type winsound, and only recorded on one channel. Sample Rate
was set to fs and Samples Per Trigger to t*fs (to record for the wanted time). Trigger Type was set to
manual - meaning that recording would begin when told to, as opposed to other options such as triggering
on a rising edge. Trigger Repeat was set to 0, so that there was no repeat.

Out was the data recorded from the analog input.

fs = 8192; %in Hz, default sampling frequency for sound(), etc.


t=3; %in s, number of seconds to record for.
ai_length = t*fs;

% Set up MatLab Oscilloscope / Winsound Analoginput


ai = analoginput('winsound');
addchannel(ai, 1);
set(ai, 'SampleRate', fs);
set(ai, 'TriggerType', 'manual');
set(ai, 'TriggerRepeat', 0);
set(ai, 'SamplesPerTrigger', ai_length);

% Get data from the microphone


beep on;
beep;
start(ai);
trigger(ai);
data = getdata(ai);

beep;
delete(ai);

out = data; %return the audio input.

Figure 4.2: m-file code used to input audio signals.

4.3 Data Manipulation
4.3.1 Normalization
Normalization was done via the function normalizer.m. This function took in an assumed (1xn) vector,
normalized it to a maximum value of 0.5 via the code seen in Figure 4.3, then returned the vector.

x = 0.5*x/max(abs(x));

Figure 4.3: m-file code used to normalize.

4.3.2 Windowing
Windowing was done via two functions: hamWindow.m and usefullSig.m. In Figure 4.4 one can see the
key aspects of both. The Hamming window was created using the window function native to MatLab.
The useful signal extraction was done by running through the signal from both ends, as can be seen in
the sampled code. When the magnitude of the signal is above a threshold, that index is recorded as the
point at which to clip, minus an offset. These end values are stored in a and b, and the ends past
these values are chopped off. Note that as and bs will be set to 0 once a value is found, ensuring that no
second value will be recorded (as the if statement will always be false).

w=window(@hamming,length(x)); x=x.*w;
____________________________________
for i=1:l
    if (as && abs(x(i,1))>thresh)
        a=i-os;
        as=0;
    end
    if (bs && abs(x(l-i,1))>thresh)
        b=l-i+os;
        bs=0;
    end
end

Figure 4.4: m-file code used to window.


4.3.3 Cepstral Filtering
Cepstral filtering was achieved via the creation of the function cepAnal.m. Figure 4.5 shows the
important code: the pushing of the audio signal into the quefrency domain, the creation of a mask to
remove unwanted quefrencies, then the return to the time domain.
c=cceps(x);
pass=int16(length(c)/6);
mask=ones(length(c),1);
mask(pass:length(c)-pass,1)=mask(pass:length(c)-pass,1)-1;
c=c.*mask;
x=icceps(c);

Figure 4.5: m-file code used to perform Cepstral Filtering.

4.4 Recognition Algorithm: Dynamic Time Warping
Figure 4.6 is the main comparison work of the program speechRec.m. The first step in this code is to
define the size of the library. There are two main components to this: the number of library entries, and
the number of samples. For the purpose of testing this program, a library with 30 entries had been
created, with entries 1-26 corresponding to a-z, as well as the four commands "Enter", "Yes", "No", and
"Back" (27-30, respectively). There were three samples of each entry (0-2).

It was decided that the variable to store the best match would be called "c". A cmin was then established,
being the minimum acceptable value which would be recorded. If no c returned in later stages
could beat the cmin, then there was no recognizable input. A ctemp was also made to
hold returned c values temporarily. Finally, cp (p for perfect) and r were initialized. The cp value was
found through testing to be 0.7071 - that is to say that if the DTW finds a perfect match, it will return a
value of 0.7071 (see 4.4.3). The r is a variable which will hold the entry number under which the current
best c falls, and will be used in the return phase.

%Comparison loop.
numLibEnt=30; %number of library entries, 1-30
%1-26 being alphabet, 27-30 being enter, yes, no, back.
numLibSam=2; %number of library samples (ie. 0 to 5 entries of 'a').
cmin=0.05; %minimum comparison accepted.
c=cmin;
ctemp=0; %#ok<NASGU>
cp=0.7071; %From experimental data, if the DTW block produces a value of
%0.7071 then this is a perfect match. This value is normalised
%for any size difference, et cetera.
r=0; %r is the variable for which is the current lowest match c.
%if r stays as 0, we therefore never achieved a c lower than
%min and don't have a match.
for m=1:numLibEnt
    for n=0:numLibSam
        [x fs]=wavread(['Library/lib' int2str(m) int2str(n) '.wav']);
        Y=specCreate(x,fs);
        M=matchMat(audioIn,Y);
        ctemp=abs(DTW(M)-cp);
        if (ctemp<c)
            c=ctemp;
            r=m;
        end
    end
end

Figure 4.6: m-file code for comparison loops.

The algorithm itself is very simple. There are two nested for loops which will run through every sample
of every entry in the library, as defined by the number of library entries (m) and the number of library
samples (n). For example, m begins at 1, corresponding to the library entry for 'a'. n begins at 0 and will
progress through 1 and 2. This has the effect of testing the first sample of 'a', then the second, then the
third. If time and will permitted, a library with a much larger range of samples could be created, vastly
improving the probability of finding a match.

Inside the for loops are the actual actions, beginning with the opening and reading of the library's data
stored as wav files (for the reason why wav files were used instead of txt files, refer to 5.3.4). Then, the
created function specCreate.m is used to create the spectrogram of the library entry. matchMat.m creates
the match matrix of the library entry and the recorded audio, which is passed to DTW.m. DTW.m returns a
value which one can use to judge the closeness of the match. A perfect match returns 0.7071, and so the
result of DTW.m has this cp subtracted from it. The absolute value is then taken, such that the
best match of any library entry sample is the closest to 0 (approaching from the positive). This value is
stored as ctemp.

If ctemp is less than the current c, then ctemp becomes the new c value via an if statement. Also
included in this if statement is the changing of r to equal m (ie. the current entry being tested is so far
the best recognition). c is initialized to cmin, such that if no entry's DTW result is a good enough
match, r will remain 0 (ie. the result will be that no entry was recognized).

4.4.1 Spectrogram
The file specCreate.m was made to use the native specgram function in MatLab, but with pre-set
arguments as seen in Figure 4.7. The actual function is S=specgram(a,nfft,fs,window,numoverlap) where a
is the vector one wants to turn into a spectrogram, nfft is the length of the Fourier Transform to use,
window is the width of the window to use, and numoverlap is the length of the overlap of successive
windows. Refer to 2.3 for a better understanding of spectrograms. The values seen in Figure 4.7 were
deemed good values by [3].

X = specgram(x,512,fs,512,384);

Figure 4.7: m-file code used to create spectrogram.
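These parameters correspond to 512-sample Hamming-windowed frames advancing by 128 samples (512 - 384). A minimal NumPy sketch of the same computation (an illustration, not MatLab's specgram):

```python
import numpy as np

def spectrogram(x, nfft=512, noverlap=384):
    """Minimal magnitude spectrogram: Hamming-windowed nfft-sample
    frames with `noverlap` samples of overlap, matching the specgram
    parameters quoted in the text."""
    hop = nfft - noverlap                      # 128-sample frame advance
    win = np.hamming(nfft)
    frames = [x[i:i + nfft] * win
              for i in range(0, len(x) - nfft + 1, hop)]
    # Keep the non-negative frequency bins, as specgram does for real input.
    return np.abs(np.fft.rfft(np.array(frames), n=nfft)).T

fs = 8192
t = np.arange(fs) / fs                         # one second of signal
S = spectrogram(np.sin(2 * np.pi * 440 * t))   # a 440 Hz test tone
```

With fs = 8192 Hz the frequency resolution is 8192/512 = 16 Hz per bin, so the tone's energy lands around bin 27-28.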

4.4.2 Match Matrix
The match matrix is constructed using matchMat.m as seen in Figure 4.8. The two input spectrograms
A and B are manipulated in order to gain a normalized match matrix via the method described in 3.5.2.

sA= sqrt(sum(A.^2));
sB = sqrt(sum(B.^2));
M = (A'*B)./(sA'*sB);

Figure 4.8: m-file code used to create match matrix.
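The MatLab lines above translate almost directly to NumPy (assuming magnitude spectrograms with one column per frame):

```python
import numpy as np

def match_matrix(A, B):
    """Cosine-similarity match matrix between two magnitude
    spectrograms - a transcription of M = (A'*B) ./ (sA'*sB)."""
    sA = np.sqrt(np.sum(A ** 2, axis=0))      # per-column norms of A
    sB = np.sqrt(np.sum(B ** 2, axis=0))      # per-column norms of B
    return (A.T @ B) / np.outer(sA, sB)

# Identical frames score 1; orthogonal frames score 0.
A = np.array([[1.0, 0.0], [0.0, 1.0]])
M = match_matrix(A, A)
```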

4.4.3 DTW
DTW.m accepts the match matrix as an argument and returns a value of match goodness. It begins by
creating the distortion matrix as described in 3.5.3, as can be seen in Figure 4.9. As discussed in 2.2, the
value in the bottom right corner of the distortion matrix can be used for classification purposes.

Unfortunately, as seen in section 5.3, there were difficulties with this method relating to the size of the
library entries and an inability to effectively normalize them. As such, the second method for
classifying match goodness discussed in 2.2 had to be used. This involved tracing through the
distortion matrix from its top left corner using the phi values (which stored whether the path of least
resistance was achieved by going right, down, or right and down in one step).

This value can then be easily normalized for varying sizes of match matrices by dividing the trace
length by the diagonal (as a perfect trace should be a line straight down the diagonal). The tracing
code used (see Appendix C) resulted in a perfect match returning a value, after normalization, of
0.7071 - that is, 1/sqrt(2), since a perfect trace runs straight down the diagonal.

%create matrix and variables.

for i=1:m
    for j=1:n
        [dmin,tb]=min([D(i,j),D(i,j+1),D(i+1,j)]);
        D(i+1,j+1)=D(i+1,j+1)+dmin;
        phi(i,j)=tb; %store which predecessor was cheapest, for the trace.
    end
end

% Tracing Code: see Appendix C.

out=out/sqrt(m^2+n^2); %divide by diagonal so that all answers are equally weighted.

Figure 4.9: m-file code used to do DTW.
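Why a perfect match yields approximately 0.7071 can be verified directly: the traced path runs straight down the diagonal, and dividing its length by the diagonal of the matrix gives 1/sqrt(2) for any size (a numerical check, under the assumption that each diagonal move contributes one unit of trace length):

```python
import numpy as np

# For identical inputs the trace of an n-by-n distortion matrix is the
# main diagonal: n unit steps. Normalizing by the matrix diagonal
# sqrt(n^2 + n^2) gives 1/sqrt(2), independent of n.
n = 14                                  # any size gives the same ratio
trace_length = n                        # one diagonal move per frame
normalized = trace_length / np.sqrt(n ** 2 + n ** 2)
```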

Also created was an attempt to speed up DTW.m by stopping the trace if it went "out of bounds".
As one knows that a good match will run roughly down the centre diagonal, one can postulate
that if the trace is running outside a certain area, it can immediately be discounted as a poor match.
Figure 4.10 visualizes this. On the left is an ideal good region in the middle [6], while on the right is a
somewhat simpler means of implementing the concept.

Figure 4.10: Visualizing the idea behind a faster approach.

In the breaking code used in the function DTWTHREE.m, seen in Figure 4.11, the p value is the
vertical position of the trace, and q is the horizontal. To implement what is seen in Figure 4.10, one
takes the vertical size of the distortion matrix (ie. 14). One then assumes that a sixth of this value (ie. ~2)
is how close one wants the vertical trace to remain to the centre. If the vertical value (p) is more than
this distance above or below the ideal line (the diagonal), then it is out of bounds and the trace is ended
early. The ideal vertical point p for horizontal point q is found by taking the angle of the ideal diagonal
(tan^-1(m/n)) and computing p = q*tan(angle), which simplifies to p = q*m/n.

ideal=q(1,1)*tan(atan(m/n)); %tan(atan(m/n)) simplifies to m/n.
ideal1=ideal+m/6;
ideal2=ideal-m/6;

if (p(1,1)>ideal1)
    i=-1; %easy way to stop the while loop.
    p=m*n; %Some high value.
end
if (p(1,1)<ideal2)
    i=-1; %easy way to stop the while loop.
    p=m*n; %Some high value.
else
    %normal trace step follows; see Appendix C.

Figure 4.11: m-file code used to do modified trace.
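The bounds check can be sketched as a predicate (a Python illustration of the band test above, using the simplified ideal line p = q*m/n):

```python
def in_band(p, q, m, n, frac=1 / 6):
    """Early-exit bound: the trace point (p, q) must stay within
    m*frac rows of the ideal diagonal p = q*(m/n)."""
    ideal = q * m / n
    return (ideal - m * frac) <= p <= (ideal + m * frac)

# On the diagonal of a 14x14 matrix the trace is in bounds; far off
# the diagonal the trace would be abandoned early.
on_diag = in_band(7, 7, 14, 14)
off_diag = in_band(13, 2, 14, 14)
```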

4.4.4 Library
The library was created manually with the function libCreate.m as seen in Figure 4.12.

fs=8192;
audioIn = recorder(); %get audio signal
audioIn = normalizer(audioIn);
audioIn = usefullSig(audioIn);
audioIn = hamWindow(audioIn);
audioIn = cepAnal(audioIn);

libNum=input('Please input number to be associated with file (ie 10-999).','s'); %String.

wavwrite(audioIn,fs,['Library/lib' libNum '.wav']);

Figure 4.12: m-file code used to create library entries.


4.5 Returning Results
Results were returned at the end of speechRec via the code seen in Figure 4.13. The r value (as noted in
section 4.4) is the library entry which the input audio best matches. If r remained 0 through
the program, this means that no library entry was close enough to warrant a match and so the output
will be 50 (the designated code number for "no recognition" as understood by Mr. Hernandez's teaching
software). If the r value matches what the teaching software said upon starting was the correct answer,
it will output 100 (the designated code for "recognition of correct value"). If neither of these is the
case, it will return the value of the library entry which it thought it recognised.

if r==0
    out=50;
elseif r==in
    out=100;
else
    out=r;
end

Figure 4.13: m-file code used to return results.

5 Results and Discussion
5.1 Data Collection
As seen in Figure 5.2, the recorder.m function had a response time of a little over three seconds. This is
expected, as the input was set to record for three seconds. It functioned as designed (section 4).

5.2 Data Manipulation


Figure 5.1 allows one to view the signal's appearance as it runs through the stages of data manipulation.
It functioned as designed (section 4).

[Plot: four panels - "Normalized sound", "Useful sound", "Post Cepstral Filtering", and "Window'd (hamming) sound" - each plotting signal magnitude against the time axis.]

Figure 5.1: The phases of data manipulation.

[Plot: six timing panels - "Times of recorder.m for Four trials" and the times of normalizer.m, usefullSig.m, cepAnal.m, hamWindow.m, and specCreate.m versus input size - each plotting t (in s) against size in kB.]

Figure 5.2: Timing measurements for input files of differing sizes.


Cutting away everything but the useful signal creates variable response times. In Figure 5.2, one can
see the times taken by the various functions (y-axis) based on the size of the input wav in kB (x-axis).

5.2.1 Normalization
normalizer.m times increased linearly with increased wav size. It functioned as designed (section 4).

5.2.2 Windowing
usefullSig.m times appeared strange due to the tested files already having been run through
usefullSig.m at their point of creation. The time of usefullSig.m is in reality always constant, being
based on the length of the audio signal (ie. 3 seconds * 8192 samples per second). It functioned as
designed (section 4).

hamWindow.m times increased linearly with increased wav size. It functioned as designed (section 4).

5.2.3 Cepstral Filtering


cepAnal.m times increased linearly with increased wav size. It functioned as designed (section 4).

5.3 Recognition Algorithm: Dynamic Time Warping


5.3.1 Spectrogram
specCreate.m times increased linearly with increased wav size. It functioned as designed (section 4).
Figure 5.3 shows the spectrograms of three sounds: 'c', 'b', and 'w'. Note how 'c' and 'b' are
rather similar, while 'w' appears quite different. This becomes a minor problem, as when one is
testing for recognition of 'b', one will often see close matches with 'c', 'd', 'e' and other letters which
share similar sounds.

[Plot: three spectrograms - "Spectrogram of Input Sound 'c'", "Spectrogram of Close Library Sound 'b'", and "Spectrogram of Far Library Sound 'w'" - each plotting frequency (0-4000 Hz) against time.]

Figure 5.3: Examples of spectrograms created by specCreate.m for the letters 'c', 'b', and 'w'.

5.3.2 Match Matrix
matchMat.m times increased linearly with increased wav size as seen in Figure 5.4. It functioned as
designed (section 4).

[Plot: two timing panels - "Times of matchMat.m vs Size" and "Times of DTW.m vs Size" - each plotting t (in s) against size in kB.]

Figure 5.4: Timing measurements for input files of differing sizes.

5.3.3 DTW
DTW.m times increased linearly with increased wav size as seen in Figure 5.4.

There were numerous challenges in the creation of DTW. Following is a truncated discussion of the
results of the early versions of DTW, as well as the changes which were made as a result. For the full
thought process (in rough formed notes) refer to Appendix B.

Originally, as documented in 4.4.3 and 2.2, DTW.m used the bottom right value of the distortion
matrix as the value of match goodness. Initially, an attempt was made to use the diagonal length of the
match matrix as a normalizing factor. When testing began to determine the efficiency of the program,
results were extremely poor. A constant feature noticed was the tendency of the speech recognition
program to recognize the input speech as 'w' 90% of the time [14]. It was theorized that the
normalization via division by the diagonal was not working as wanted.

Figure 5.5 from Test File 5 (see Appendix C) involved the running of all library entries through the
pattern recognition code. This guaranteed a perfect match, and if the code was working the way it was
designed to, each perfect match would return a consistent c value.

[Plot: two panels - "Results (100=match)" and "Value of C variable for match" (on a 10^-16 scale) - each plotted against library entry sample.]

Figure 5.5: Results of testing audio files against themselves to see results, as well as c values returned.
x-axis is the corresponding library entrance (1-3=a, 4-6=b, etc.)

As can be seen, the c values vary wildly. The same experiment was attempted repeatedly, with different
values in place of the diagonal as an attempt to find a means of normalizing the c values. These
included:
• No normalization.
• Multiplication by diagonal.
• Division/Multiplication by area.

The results of these further tests were much like Figure 5.5.

Figure 5.6 from testFileSeven (see Appendix C) involved taking a known letter (w=23, a=1) and
running it through the pattern recognition code. This test recorded the c values returned by every
library entry sample.

Figure 5.6: Testing for known w (23) and a(1) to see what sort of c values are returned for all
library entries. x-axis is the corresponding library entrance (1-3=a, 4-6=b, etc.)

One can see that for the testing of w (the left of Figure 5.6) the c values vary wildly, though there is
some pattern: because different people created the samples, a given sample number tends to have more
in common across entries than the three samples of a single entry do with each other (ie. sample 1 of
entries 1 and 2 are more similar than samples 1 and 2 of entry 1).

The minimum value is the perfect match - zero, as expected by the theory from the literature (section
2.2). The x-axis of Figure 5.6 represents the library entry samples, and 67-69 refer to the w samples.

In the right portion of Figure 5.6, the results for the testing of a are seen. While the perfect match is
found at a (as expected), the next best match is at w with a c value of 0.0027, despite a and w sounding
nothing alike. It should be noted that w @ 69 is also the largest file in the library.

While a perfect match will return the correct recognized character, even the slightest variation will be
beaten by w. It can therefore be surmised that the failure to find an effective way to normalize the
bottom right value of the distortion matrix is responsible for the extremely poor efficiency and
extremely high number of false positives (almost all of which are w).

Figure 5.7 is from testFileEight (see Appendix C). It allows the visualization of the c values given
when 'a' is run through the pattern recognition, plotted alongside the relative sizes of the library
entries.

C values are in red, sizes in blue. One can see that - other than for the perfect match - the best c values all
correspond to the largest data files. [Note that Figure 5.7 was created using area division in the DTW.]

[Plot: "Value of C (red) v Size (blue) for 1", plotted against library entry sample.]

Figure 5.7: Value of c returned for entry of a (1) in red plotted against the proportional size of the input
file. x-axis is the corresponding library entrance (1-3=a, 4-6=b, etc.)

This allows one to conclusively say that the reason for the "w problem" is that the larger file sizes need
to be normalized in some way. A method to do this was not found, resulting in the necessity of
changing to the second method of determining match goodness as described in 2.2 and 4.4.3: the trace
back through the distortion matrix as seen in Figure 5.8.

[Plot: three rows of paired panels - "Perfectly Matching Input Specs", "Somewhat Matching Input Specs", and "Poorly Matching Input Specs" - each row showing the match matrix (left) and the quickest path (right).]

Figure 5.8: Demonstrating the workings of matchMat.m and DTW.m. The left column shows the local
match matrix for perfectly, somewhat, and poorly matching input spectrograms ('b'&'b', 'b'&'c',
and 'b'&'w' respectively). To the right of these are representations of the path of least resistance
taken by the DTW block in order to find a best match value.
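The traceback idea can be sketched outside MatLab as follows (an illustrative Python version, not the project's DTW.m from Appendix C; the identity match matrix is an idealized perfect match):

```python
import numpy as np

def dtw_trace(M):
    """Given a local match matrix M (higher = better match), build the
    cumulative distortion matrix with backpointers, then trace the
    least-cost path back from the bottom-right corner."""
    C = 1.0 - M                         # convert similarity to cost
    m, n = C.shape
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    phi = np.zeros((m, n), dtype=int)   # backpointer for each cell
    for i in range(m):
        for j in range(n):
            steps = [D[i, j], D[i, j + 1], D[i + 1, j]]  # diag, up, left
            k = int(np.argmin(steps))
            D[i + 1, j + 1] = C[i, j] + steps[k]
            phi[i, j] = k
    i, j = m - 1, n - 1                 # trace back from the corner
    path = [(i, j)]
    while i > 0 and j > 0:
        k = phi[i, j]
        if k == 0:
            i, j = i - 1, j - 1
        elif k == 1:
            i -= 1
        else:
            j -= 1
        path.append((i, j))
    return path[::-1]

# A perfect match (identity-like match matrix) traces straight down
# the diagonal, so the path length equals the matrix dimension.
path = dtw_trace(np.eye(6))
```

The length of the returned path (rather than the accumulated cost) is what the trace method uses as its measure of match goodness.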

Figure 5.9 below allows for the visualization of the returned c values using the trace method. It was
created using a (1) as the input signal to match against, and was run through the pattern recognition of
only one sample range (hence the 1-30 x-axis, as opposed to the 1-90 x-axis seen in previous figures).

One can see that with this setup, a perfect match returns the value 0.7071. One can see in Figure 5.9 that
library entries similar to 'a' are also found around this value, while library entries which are far off
(like 'w') sit well away from it. This greatly reduced the number of false positives being found
and cured the "w problem", coinciding with a noticeable improvement in efficiency.

[Figure 5.9 plot: c values for library entries; y-axis "c value" 0-1, x-axis "library entry" 0-30.]

Figure 5.9: The returned c values for the traceback method using a=1 as the input.

This method did, however, have some unfortunate problems, as can be seen in Figure 5.10. The problem
is simple and very hard to correct: the portion of the pattern recognition detailed in section 4.4
includes the line of code ctemp=abs(DTW(M)-cp). This is necessary to produce a value close to 0
as the best match goodness measure, so that the value can be compared with the previous best match
goodness.

As can be seen in Figure 5.10, where a test similar to that of Figure 5.5 is run using the new
method, testing a file that is already in the library does not always produce a match. The reason for
this is that DTW(M)-cp has a minimum value of 6.781186547510920e-06; it will not calculate 0 even
when given a perfect match. This results in situations where more than one library entry returns this
value, meaning that the recognized character is the last entry for which this occurred, instead of the
exact match.
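This tie-breaking failure is a general floating-point pitfall rather than something specific to MatLab: if a later entry merely equals the current best residual, a non-strict update rule lets it overwrite an earlier exact match. A hedged Python sketch of one common guard (the tolerance value is an assumption, not taken from the project code):

```python
import math

def best_match(scores, target=0.0, tol=1e-9):
    """Return the index of the score closest to `target`.
    A candidate only replaces the current best if it is strictly
    better by more than `tol`, so ties keep the EARLIEST entry and a
    perfect match found first is never overwritten by a later entry
    that bottoms out at the same floating-point floor."""
    best_i, best_d = None, math.inf
    for i, s in enumerate(scores):
        d = abs(s - target)
        if d < best_d - tol:
            best_i, best_d = i, d
    return best_i

# Entries 1 and 3 both hit the same floor value; entry 1 is kept.
idx = best_match([0.4, 6.78e-06, 0.2, 6.78e-06])
```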

While it would have been nice to fully solve this problem, the solution to this point was deemed good
enough for the purposes of this project.

[Figure 5.10 plots: Results (100=match), left; Value of C variable for match (x10^-6), right; x-axes 0-100.]

Figure 5.10: Results of testing audio files against themselves, as well as the c values returned.
The x-axis is the corresponding library entry (1-3=a, 4-6=b, etc.)

In 4.4.3 a method for possibly speeding up the tracing was outlined. It involved breaking out of the
trace while loop if the trace ran outside certain bounds. In testFileTwenty (see Appendix C), a
large number of DTW and DTWTHREE (DTW plus the breaking code) runs are timed. The results of
three runs of testFileTwenty can be seen in Table 5.1.

Table 5.1: Results of testing for the time difference between DTW (t0) and DTWTHREE (t1, with
breaking code) over a large number of averaged runs.

Over an extreme sample, the addition of code to catch out-of-bounds traces appears less efficient
than (or negligibly different from) not having the code.
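The same trade-off can be demonstrated generically: an early-exit guard costs one extra comparison on every iteration, so unless it fires early and often it gains nothing. An illustrative Python timing sketch (the loop bodies are stand-ins for the trace loop, not the project's code):

```python
import timeit

def trace_plain(n):
    """Baseline loop with no early-exit guard."""
    total = 0
    for i in range(n):
        total += i
    return total

def trace_with_check(n, bound=10**12):
    """Same loop plus an out-of-bounds guard that (here) never fires."""
    total = 0
    for i in range(n):
        if total > bound:   # extra comparison paid on every iteration
            break
        total += i
    return total

t0 = timeit.timeit(lambda: trace_plain(10_000), number=100)
t1 = timeit.timeit(lambda: trace_with_check(10_000), number=100)
# When the guard rarely fires, t1 is typically no faster than t0,
# mirroring the DTW vs. DTWTHREE result in Table 5.1.
```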

Figure 5.11 was created from testFileTwentyOne to help visualize this. It tests the time to run DTW
and DTWTHREE when first doing a known perfect match (x=1), then a known poor match which
would trigger the breaking code (x=2).
[Figure 5.11 plot: times (x10^-3 s) for DTW (blue) and DTWTHREE (red); x-axis 1-2.]
Figure 5.11: Visualisation of the time taken by DTW (blue) and DTWTHREE (red), done for a known
perfect match (x=1), then a known poor match which would trigger the breaking code (x=2).

Conclusion: the increased chance of having done something wrong is not worth the negligible benefit.

5.3.4 Library
testFileNine was a simple test of the time taken to run the comparison algorithms versus the number of
samples in the library. Figure 5.12 shows this to be a linear relationship.
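A linear relationship is exactly what a library scan predicts: each additional entry adds one more comparison of roughly constant cost. A small illustrative Python sketch (the `recognize` function and the random entries are hypothetical stand-ins for the real pattern recognition):

```python
import timeit
import numpy as np

def recognize(query, library):
    """Linear scan: compare the query against every library entry and
    return the index of the closest one."""
    return min(range(len(library)),
               key=lambda k: float(np.abs(library[k] - query).sum()))

rng = np.random.default_rng(2)
query = rng.standard_normal(256)
times = []
for n in (5, 10, 20):
    lib = [rng.standard_normal(256) for _ in range(n)]
    times.append(timeit.timeit(lambda: recognize(query, lib), number=50))
# Each timing covers n comparisons, so the cost grows roughly in
# proportion to the number of samples in the library.
```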

[Figure 5.12 plot: time to run through comparison vs. number of samples in library; y-axis "t (in s)" 0-0.12, x-axis "Number of samples" 0-20.]

Figure 5.12: Time to run pattern recognition vs. size of library (defined as number of samples).

Testing was also done on the two options discussed in 3.5.4, repeated here:

There were two main choices for the way in which to store the data:
1. Save the signals as Microsoft wav files using the MatLab function wavwrite, and access them
using the MatLab function wavread. Convert each accessed vector into its spectrogram every
time it is accessed.
2. Save the spectrograms of the signals as delimited text files using the MatLab function
dlmwrite, and access them using dlmread. Convert each vector into its spectrogram matrix
only once, before it is saved.
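The comparison between the two storage options can be sketched generically (illustrative Python; a raw binary file stands in for the .wav and a tab-delimited text file for the dlmwrite spectrogram, since text parsing, like dlmread, is typically the slower step):

```python
import os
import time
import tempfile
import numpy as np

# A hypothetical 129x60 spectrogram matrix stands in for a library entry.
spec = np.random.default_rng(1).standard_normal((129, 60))

with tempfile.TemporaryDirectory() as d:
    bin_path = os.path.join(d, "entry.bin")   # stand-in for the .wav
    txt_path = os.path.join(d, "entry.txt")   # stand-in for the .txt
    spec.tofile(bin_path)
    np.savetxt(txt_path, spec, delimiter="\t")

    t = time.perf_counter()
    for _ in range(20):
        a = np.fromfile(bin_path).reshape(129, 60)
    t_bin = time.perf_counter() - t

    t = time.perf_counter()
    for _ in range(20):
        b = np.loadtxt(txt_path, delimiter="\t")
    t_txt = time.perf_counter() - t

# Parsing delimited text typically dominates the binary read, which
# mirrors the dlmread-vs-wavread result in Table 5.2.
```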

Testing concluded in the results seen in Table 5.2 and Figure 5.13.

Table 5.2: Average Time (T) of opening and Size (S) of wav and txt files.

[Figure 5.13 plots: ".wav in red, .txt in blue"; opening time "t (in s)" (top) and "size (in kB)" (bottom) vs. trial number.]

Figure 5.13: Results of testing comparing the size (bottom) and speed (top) of producing spectrograms
from stored .wav (red) vs. .txt (blue) files.

Unexpectedly, it was found that option 1 not only saved on file size (as expected) but also operated
much faster. This means that the function dlmread is actually slower than wavread and specCreate
combined. An interesting result.

5.4 Returning Results


Due to constraints in finding an efficient approach to speech recognition, results were rather poor, with
efficiency around 25% as seen in Table 5.3. Interestingly, once the tester was able to get into a groove
of saying the letter in a certain way (evidently matching a library entry), efficiency could spike to
100%, as seen in the second w test. Efficiency would therefore be improved with a larger library.

Table 5.3: Results of Speech Recognition. Top row is trial number, values are those returned.

6 Conclusions and Recommendations
6.1 Conclusions on Project Objectives
The efficiency of ~25% was far below the target efficiency of 60%. However, it was determined that
this was due to the small library size (only three samples per entry). With a larger number of samples in
the library, it can be stated with confidence that efficiency would improve.

The use of Dynamic Time Warping for speech recognition was proven to be a viable method. That said,
its returns are poorer than one would hope, due in large part to the fact that it is very difficult to
normalize the match goodness values.

Two interesting and unexpected discoveries were made during the course of the project. The first
pertains to the creation of the library for use in DTW pattern recognition. While it was known that
saving files in the Microsoft wav format would save physical space compared to saving as a delimited
text file, it was assumed that not having to convert into a spectrogram after reading would give the
delimited text file an edge in computational speed. However, testing showed that the function dlmread
(when reading in the spectrogram) was in fact slower than the combination of wavread (reading in the
recorded audio) plus specCreate.m (which converts the audio into its spectrogram).

The other discovery came from the attempt to increase the speed of the DTW by putting a check in the
trace code. It was thought that by finishing once an out-of-bounds situation was reached, it might be
possible to reduce the computational time. However, it was found that any speed gained from
breaking early was negligible. Whether this was due to the added time of checking the if statements,
or because the process was already fast enough that any difference was statistical noise, was not explored.

6.2 Recommendations
While the Speech Recognition presented is viable, its effectiveness is below that of comparable
commercially available systems at a similar price.

Appendix A: Computer Software Design Tools
C#
C# is a multi-paradigm programming language encompassing object-oriented (class-based)
programming disciplines. It is a Microsoft product within the .NET initiative.

Microsoft Visual C# 2008 Express Edition was used during the creation of this project (mainly for the
component created by Mr. Hernandez). It is a free program with registration.

MathWorks' MATLAB
MATLAB stands for "Matrix Laboratory" and is a numerical computing environment developed by
MathWorks. MATLAB allows matrix manipulations, plotting of functions and data, implementation of
algorithms, and interfacing with programs written in other languages. It excels in the manipulation of
matrices.

A 2007 Student Edition was used during the creation of this project. It is available from Mathworks at a
cost of $99USD. [I already had a copy.]

Data Acquisition Toolbox


From Mathworks: "Data Acquisition Toolbox™ software provides a complete set of tools for analog
input, analog output, and digital I/O from a variety of PC-compatible data acquisition hardware. The
toolbox lets you configure your external hardware devices, read data into MATLAB® and Simulink®
environments for immediate analysis, and send out data."

It is available from Mathworks for $29USD.

Appendix B: Additional Testing Notes
Note: These notes appear in the online copy only; they have been removed from the physical copy. If
one wishes to view this material, they may contact the writer for more information.
TESTING A
From Test File 5
[Plots: Results (100=match), left; Value of C variable for match (x10^-16), right; x-axes 0-100.]

If the normalization was working as I wanted it to, this should be a consistent value.

Try removing the divided by in DTW. Results:


[Plots: Results (100=match), left; Value of C variable for match (x10^-15), right; x-axes 0-100.]

The division is not what is causing the problem.

Want to check consistency, will return division and see if it is as before, then work on creating better
system.
[Plots: Results (100=match), left; Value of C variable for match (x10^-16), right; x-axes 0-100.]

So, yes. Good, at least there's no funny bug. Consistent results.

Second replacement, using area instead of diagonal value.


[Plots: Results (100=match), left; Value of C variable for match (x10^-17), right; x-axes 0-100.]

Area and diagonal produce basically the same results.

TESTING B
From testFileSeven
Testing to see what kind of c values a single letter gets.

[Plot: Value of C variable for 23; y-axis 0-0.018, x-axis 0-90.]

Okay, we see 69 matching (i.e. a good match for w).

Look at a:

[Plot: Value of C variable for 1; y-axis 0-0.03, x-axis 0-90.]

3 matches at 4.9343e-019, which is expected. But next closest is 69 at 0.0027. So here's the w problem.

TESTING C
From testFileEight.
Want to look at correspondence between c value and file size.
[Plot: Value of C (red) vs. Size (blue) for 1; y-axis 0-50, x-axis 0-90.]

Can actually see an anti-correlation. Note that this is using an area division in the DTW.

Going to remove this and try again.
[Plot: Value of C (red) vs. Size (blue) for 1; y-axis 0-16, x-axis 0-90.]

Again, anti-correlation between file size and the c value being produced.

That is to say, bigger files are producing smaller c values. Hmmmm.


Also note that 'w's are some of the biggest files, perhaps accounting for the tendency to run to 0.

What if I multiplied instead? Here are the results if DTW multiplied by area:
[Plot: Value of C (red) vs. Size (blue) for 1, DTW multiplied by area; y-axis 0-14000, x-axis 0-90.]

We see that now we have correlation, which we don't want either.

Testing D
From testFileNine.
Simple test on time it takes to do comparison algorithms vs how many samples are in the library.
[Plot: Time to run through comparison vs. number of samples in library; y-axis "t (in s)" 0-0.12, x-axis "Number of samples" 0-20.]

Testing E
From testFileTwo
Here is the comparison between opening up the files as .wavs and converting to spectrograms and
saving as .txt files.

[Plots: ".wav in red, .txt in blue"; opening time "t (in s)" (top) and "size (in kB)" (bottom) vs. trial number.]

Can see that the .wav's are, strangely, both processed faster and stored in smaller files.

Twav =0.0046s Ttxt =0.0155s


Swav =8.0350kB Stxt =67.1000kB

Interesting to note that the txt is proportional, but wav is not.

Testing F
From testFileTen.
Very similar to test file two. Going to get timings for various bits in relation to the size of the wav.

[Plots: times of recorder.m for four trials; times of normalizer.m vs. size (x10^-4 s); times of usefullSig.m vs. size (x10^-4 s); times of capAnal.m vs. size; times of hamWindow.m vs. size (x10^-3 s); times of specCreat.m vs. size (x10^-3 s). x-axes in kB.]

Testing G
From testFileEleven

Looking at the matchMat and DTW timing for different sized wavs.

[Plots: times of matchMat.m vs. size (x10^-3 s); times of DTW.m vs. size (0-0.01 s); x-axes in kB.]

Testing H
From testFileTwelve

Timing speechRec.m for various sizes of arrays.

[Plots: time of speechRec.m in blue for trials (in s); size of audio files in red (x10^4, in array size); ratio between time and size (x10^-3 s/ArraySize); x-axes 0-20.]

Note: sample 6 of rSpec was removed as it was an extremely large size and made other results hard to
read.

tSpec
4.72751166698561 5.50082608264458 5.53807886197827 5.02077966613075
5.79221702758311 3.87255475207044 5.82172762180668 5.92860489252126
4.43773356669633 5.57504354603728 5.82292959021328 4.65210285106068
5.06504150667194 4.07166958370407 5.85625526428638 6.10156730813553
5.79194227199267 5.74865583474995 5.91190165230497 3.85032357464426

sSpec
10454 17240 18265 12890 20603
320 20223 20626 8183 18018
20620 10064 13402 5017 20596
20603 19634 19305 18471 3005

rSpec
0.000452220362252306 0.000319073438668479 0.000303207164630620
0.000389509671538460 0.000281134641925114 0.0121017336002201
0.000287876557474494 0.000287433573767151 0.000542311324293820
0.000309415226220295 0.000282392317663108 0.000462251873118112
0.000377931764413665 0.000811574563225847 0.000284339447673645
0.000296149459211548 0.000294995531832162 0.000297780670020717
0.000320063973380162 0.00128130568207796

Testing I
From testFileThirteen

This is comparing the c values of the three a's in the library currently.

[Plot: c value returned for letter "1" (x10^-13); y-axis "c value returned for perfect match", x-axis "size of array" 1500-4500.]

This is for DTW multiplying by area. Ideally, when we have a perfect match, the results should be the
same no matter the size of the array.

Here is the result for dividing by diagonal, and dividing by area respectively.
[Plot: c value returned for letter "1", dividing by diagonal (x10^-16); x-axis "size of array" 1500-4500.]

[Plot: c value returned for letter "1", dividing by area (x10^-17); x-axis "size of array" 1500-4500.]

And here is with no factoring due to size:
[Plot: c value returned for letter "1", no factoring (x10^-15); x-axis "size of array" 1500-4500.]

Testing J
From testFileFourteen

Getting c's across a certain numLibSam.

This is with DTW having no factoring:


[Plot: c value returned for sample "2", no factoring (x10^-15); x-axis "size of array" 1000-7000.]

MAJOR PROBLEM:
A PERFECT MATCH SHOULD BE A PERFECT MATCH SHOULD BE A PERFECT MATCH.

Here's divided by area, multiplied by area, divided by diagonal:
[Plots: c value returned for sample "2": divided by area (x10^-17), multiplied by area (x10^-11), divided by diagonal (x10^-16); x-axes "size of array" 0-7000.]

Testing K
From testFileFifteen

Want to see if the size of the spectrograms is affecting the c. Not sure if there's a difference between
array size and spectrogram size.

This is for no factoring.


[Plots: c value returned for sample "2", no factoring (x10^-15), vs. area of spec (0-14000) and vs. area of match matrix (0-2500).]

WONDERING IF SIZE OF SPECTROGRAM HAS TO DO WITH IT.

IF I CAN STANDARDIZE THESE, WILL IT IMPROVE?

Changed specCreate.m to have X=specgram(x);


Result:
[Plots: c value returned for sample "2" after the change to specCreate.m (x10^-14), vs. area of spec (0-9000) and vs. area of match matrix (0-70).]

Note: This will cause an error in matchMat.m as dimensions will no longer agree.

Here, all the match matrices are either 7*7 or 8*8.

Testing L
From testFileSixteen

With division:
[Plot: c values for library entries, with division (x10^-17); y-axis "c value", x-axis "library entry" 0-30.]

No division:
[Plot: c values for library entries, no division (x10^-15); x-axis "library entry" 0-30.]

In respect to size, no division:
[Plot: c values vs. library entry size, no division (x10^-14); x-axis 5-50.]

Putting division back in:


[Plot: c values for library entries, division restored (x10^-17); x-axis "library entry" 5-50.]

Crazy thought!
Do p q trace through. Take these and divide by diagonal.

With these changes, results for c by size are:


[Plot: c values for library entries, trace method; y-axis "c value" 0-0.8, x-axis "library entry" 5-50.]

Here I'm going to go through the tests of the previous testing data with the new DTW.

Testing A

Whoops. Not a success after all. Why does it seem to be working in testFileSixteen but not in other test
files?

Creating testFileSeventeen to mimic testFileSixteen, except that it will use DTW.m.

Here's with all perfect matches (x axis is library entry size):


[Plot: c values for library entries, all perfect matches; y-axis "c value" 0-0.8, x-axis library entry size 5-35.]

Now to test all against 'a':

[Plot: c values for library entries, tested against 'a'; y-axis "c value" 0-1, x-axis "library entry" 0-30.]

Perfect match is about 0.7071.

Going to modify speechRec so that the match is the entry closest to 0.7071 rather than the lowest value.

Results of testFileFive post modification.
[Plots: Results (100=match), left; Value of C variable for match (x10^-6), right; x-axes 0-100.]

Doesn't make sense. A constant c value? And all of them are 6.781186547621942e-06?

Something's not working the way I think it's working.

Why do I have outs=0? That means r is never changing, ie we never get abs(DTW(M)-cp)<cmin.

But that should be impossible as I've determined that a perfect match produces a value of 0.7071, and
I'm guaranteed to get at least one perfect match, producing a c~0.

Code must not work like I'm thinking it does.

Mistake in the code caused some problems: had it as 1:num instead of 0:num!

Running through code, problem becomes apparent. 6.781186547510920e-06 is the smallest number
MatLab can produce! As a result, multiple returns are all giving this values back to me.

[Plots: Results (100=match), left; Value of C variable for match (x10^-6, constant at 6.7812), right; x-axes 0-100.]

Okay, getting poor results. Not much to do about that now.

New day, new idea.

By taking the length of q or p, am I not getting the number of steps? I believe so. This makes results
closer than they actually are.

Nope, this is good.

Testing B
From testFileTwenty.

Testing if the code I've written to pop out of the back trace early actually has an effect on speed.

With ideal1=ideal+m/6;

Results: t0 =37.1087 t1 = 37.2153


t0 =37.0801 t1 = 38.6007
t0 =37.1170 t1 = 37.4828

Result: As suspected, the addition of code to catch out of bounds appears, over an extreme sample, less
efficient than not having the code.

Testing C
From testFileTwentyOne

Testing the results when first doing a perfect match (x=1), then a known poor match (x=2).
Graphical results of five trials below. Appears to be statistical noise.

[Plot: times (x10^-3 s) for DTW (blue) and DTWTHREE (red); x-axis 1-2.]
Conclusion: the increased chance of having done something wrong is not worth the negligible benefit.

Testing D
From testFileTwelve

Showing c values of 'b' matches.

[Plots: c values for library entries (0-10000); size of match matrix for library entries (0-40); c values vs. size of match matrix.]

Testing Successful Recognition

Letter   1    2    3    4    5    6    7    8    9    10   %
a        100  28   8    15   100  8    2    5    28   6    20
b        15   1    1    9    1    100  3    1    6    6    10
c        2    100  1    100  100  1    5    1    5    1    30
w        14   14   5    6    12   22   6    100  100  100  30
w        100  100  100  100  100  100  100  100  100  100  100

Appendix C: Code of Software Elements
Note: This code appears in the online copy only; it has been removed from the physical copy. If one
wishes to view this code, they may contact the writer for more information.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% To be used in the
% Electrical and Computer Engineering Project
% Submitted in partial fulfillment of the requirements for the degree
% of Bachelor of Engineering at McMaster University
%
% To be used in conjunction with projects of
% Chris Agam & Jon Hernandez
%
% File: cepAnal.m
% Author: Brett A. Lindsay 0648981
% Required Files:
%
% Function: cepAnal.m will perform cepstral analysis on the input
% signal in order to remove the effects of the speaker's
% vocal tract.
%
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function out = cepAnal(x)

c=cceps(x);

pass=int16(length(c)/6);
mask=ones(length(c),1);
mask(pass:length(c)-pass,1)=mask(pass:length(c)-pass,1)-1;

c=c.*mask;

x=icceps(c);

out = x;
return

71
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% To be used in the
% Electrical and Computer Engineering Project
% Submitted in partial fulfillment of the requirements for the degree
% of Bachelor of Engineering at McMaster University
%
% To be used in conjunction with projects of
% Chris Agam & Jon Hernandez
%
% File: DTW.m
% Author: Brett A. Lindsay 0648981
% Required Files:
%
% Function: DTW.m returns the value of the quickest path through the
% local match matrix of two audio signals (M), normalised.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function out = DTW(M)

M=1-M; %need to find lowest path.


[m,n]=size(M);
D=zeros(m+1,n+1); %create matrix to trace through.
D(1,:) = NaN;
D(:,1) = NaN;
D(1,1)=0;
D(2:m+1,2:n+1)=M;

phi=zeros(m,n);

for i=1:m
for j=1:n
[dmax,tb]=min([D(i,j),D(i,j+1),D(i+1,j)]);
D(i+1,j+1)=D(i+1,j+1)+dmax;
phi(i,j)=tb;
end
end

% figure,imagesc(D),colormap(gray);
i=m;j=n;p=m;q=n;

% out=0;
while i>1 && j>1
tb=phi(i,j);
if tb==1
i=i-1;
j=j-1;
elseif tb==2
i=i-1;
elseif tb==3
j=j-1;
else
break;
end
p=[i,p];
q=[j,q];

% out=out+1;
end

%portion for returning trace value.


out=0;
if (p(1,1)>1)
out=p(1,1);
else
out=q(1,1);
end
out=(out+length(p)-1)*10000;

% D=D(2:m+1,2:n+1);
% out=D(size(D,1),size(D,2));

out=out/sqrt(m^2+n^2); %divide by diagonal so that all answers are equally


weighted.

end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% To be used in the
% Electrical and Computer Engineering Project
% Submitted in partial fulfillment of the requirements for the degree
% of Bachelor of Engineering at McMaster University
%
% To be used in conjunction with projects of
% Chris Agam & Jon Hernandez
%
% File: DTWORIGINAL.m
% Author: Brett A. Lindsay 0648981
% Required Files:
%
% Function: DTWORIGINAL.m returns the normalized value of the
% bottom-right corner of the cumulative distortion matrix.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function out = DTWORIGINAL(M)

M=1-M; %need to find lowest path.


[m,n]=size(M);

D=zeros(m+1,n+1); %create matrix to trace through.
D(1,1)=0;
D(2:m+1,2:n+1)=M;

%phi=zeros(m,n);

%Currently this goes through the whole matrix.


%try bounding these i and j.
for i=1:m
for j=1:((2*n/3+i)*(i<(n/3))+n*(i>=(n/3))) %scalar multiply, bounds j until i reaches n/3
[dmax,tb]=min([D(i,j),D(i,j+1),D(i+1,j)]);
D(i+1,j+1)=D(i+1,j+1)+dmax;
%phi(m,n)=tb;
end
end

%figure,imagesc(D),colormap(gray);
% i=m;j=n;p=m;q=n;
%
% while i>1 && j>1
% tb=phi(i,j);
% if tb==1
% i=i-1;
% j=j-1;
% elseif tb==2
% i=i-1;
% elseif tb==3
% j=j-1;
% else
% break;
% end
% p=[i,p];
% q=[j,q];
% end

D=D(2:m+1,2:n+1);

diag=sqrt(m^2+n^2); %For normalization of output.


area=m*n; %For normalization of output.

out = abs(D(size(D,1),size(D,2))/diag); %normalised value to be returned.


% out = abs(D(size(D,1),size(D,2))*area);
% out = abs(D(size(D,1),size(D,2)));

end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% Attempting to create a faster working DTW by capping the trace through.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function out = DTWTHREE(M)

M=1-M; %need to find lowest path.


[m,n]=size(M);
D=zeros(m+1,n+1); %create matrix to trace through.
D(1,:) = NaN;
D(:,1) = NaN;
D(1,1)=0;
D(2:m+1,2:n+1)=M;

phi=zeros(m,n);

for i=1:m
for j=1:n
[dmax,tb]=min([D(i,j),D(i,j+1),D(i+1,j)]);
D(i+1,j+1)=D(i+1,j+1)+dmax;
phi(i,j)=tb;
end
end

% figure,imagesc(D),colormap(gray);
i=m;j=n;p=m;q=n;

while i>1 && j>1


tb=phi(i,j);
if tb==1
i=i-1;
j=j-1;
elseif tb==2
i=i-1;
elseif tb==3
j=j-1;
else
break;
end
p=[i,p];
q=[j,q];
%Breaking code.
%p is vertical, q is horizontal of trace.
%Idea:
%Take the vertical size of the matrix, ie 14
%third it =~ 4
%if the vertical value is greater or less than this distance from the
% ideal p, then it's too far out.
% ideal p=q*tan(atan(m/n));
%ie if at q=10, if p<3 or p>17, it's a poor match.
ideal=q(1,1)*tan(atan(m/n));

ideal1=ideal+m/6;
ideal2=ideal-m/6;

if (p(1,1)>ideal1)
i=-1; %easy way to stop the while loop.
p=m*n; %Some high value.
end
if (p(1,1)<ideal2)
i=-1; %easy way to stop the while loop.
p=m*n; %Some high value.
else
end
end

%portion for returning trace value.


out=0;
if (p(1,1)>1)
out=p(1,1)-1;
else
out=q(1,1)-1;
end
out=(out+length(p))*10000;

D=D(2:m+1,2:n+1);

out=out/sqrt(size(D,1)^2+size(D,2)^2); %divide by diagonal so that all answers are


equally weighted.

end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% To be used in the
% Electrical and Computer Engineering Project
% Submitted in partial fulfillment of the requirements for the degree
% of Bachelor of Engineering at McMaster University
%
% To be used in conjunction with projects of
% Chris Agam & Jon Hernandez
%
% File: hamWindow.m

% Author: Brett A. Lindsay 0648981
% Required Files: speechRec.m
%
% Function: hamWindow.m will apply a hamming window to the signal,
% before returning the data.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function out = hamWindow(x)

w=window(@hamming,length(x));
x=x.*w;

out = x;
return

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% To be used in the
% Electrical and Computer Engineering Project
% Submitted in partial fulfillment of the requirements for the degree
% of Bachelor of Engineering at McMaster University
%
% To be used in conjunction with projects of
% Chris Agam & Jon Hernandez
%
% File: libCreate.m
% Author: Brett A. Lindsay 0648981
% Required Files: recorder.m
% preEmphasis.m
% normalizer.m
% hamWindow.m
% usefullSig.m
% specCreate.m
% capAnal.m
%
% Function: libCreate.m will be used to input audio signals into a
% reference sound library for use with the speech
% recognition block of the project.
% Assumes fs=8192 Hz.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function libCreate()

fs=8192;

audioIn = recorder(); %get audio signal

%!!!!!!
%Need to look into effectiveness of preEmphasis network.
% Seems pointless for discrete SR.
%audioIn = preEmphasis(audioIn); %pass through pre-emphasis network

audioIn = normalizer(audioIn);

audioIn = usefullSig(audioIn);
%sound(audioIn,fs);

audioIn = hamWindow(audioIn);

%This may need more work.


audioIn = cepAnal(audioIn);

%Testing showed it was better to save these as .wav's and convert them to
% spectrograms when needed.
%audioIn=specCreate(audioIn);
%dlmwrite(['Library/test' libNum '.txt'], audioIn, 'delimiter',
% ...'\t','precision', 4);

libNum=input('Please input number to be associated with file (ie 10-999).','s');


%String.
%wavwrite(audioIn,fs,['Library/test' libNum '.wav']);
wavwrite(audioIn,fs,['Library/lib' libNum '.wav']);
%wavwrite(audioIn,fs,['Library/setUp' libNum '.wav'])
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% To be used in the
% Electrical and Computer Engineering Project
% Submitted in partial fulfillment of the requirements for the degree
% of Bachelor of Engineering at McMaster University
%
% To be used in conjunction with projects of
% Chris Agam & Jon Hernandez
%
% File: matchMat.m
% Author: Brett A. Lindsay 0648981
% Required Files:
%
% Function: matchMat.m creates the "local match matrix".
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function out = matchMat(A,B)

A=abs(A); B=abs(B); %Need absolute values

%Calculates the cos of the angle between two vectors of each point in the
% matrix
%Find the average (RMS) value of the matrix, so that later when the A and
% B matrices are multiplied, we can somewhat normalise them back to
% reasonable levels.
sA= sqrt(sum(A.^2));
sB = sqrt(sum(B.^2));

Mat = (A'*B)./(sA'*sB);
out = Mat;
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% To be used in the
% Electrical and Computer Engineering Project
% Submitted in partial fulfillment of the requirements for the degree
% of Bachelor of Engineering at McMaster University
%
% To be used in conjunction with projects of
% Chris Agam & Jon Hernandez
%
% File: normalizer.m
% Author: Brett A. Lindsay 0648981
% Required Files: speechRec.m
%
% Function: normalizer.m will normalize the data to 0.5, then pass it
% back.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function out = normalizer(x)

x = 0.5*x/max(abs(x)); %assumes a non-silent signal (max(abs(x)) > 0).

out = x;
return
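The peak normalization in normalizer.m is a one-liner; an equivalent NumPy sketch (names are mine, not the project's) for reference:

```python
import numpy as np

def normalize(x, target=0.5):
    """Scale x so its peak magnitude equals `target` (0.5 here,
    matching normalizer.m). Assumes x has at least one nonzero sample."""
    return target * x / np.max(np.abs(x))
```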

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% To be used in the
% Electrical and Computer Engineering Project
% Submitted in partial fulfillment of the requirements for the degree
% of Bachelor of Engineering at McMaster Universtiy
%
% To be used in conjunction with projects of
% Chris Agam & Jon Hernandez
%
% File: recorder.m
% Author: Brett A. Lindsay 0648981
% Required Files: speechRec.m
%
%
% Function: recorder.m will return a 3 second audio signal.
%
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function [out fs]= recorder()

fs = 8192; %in Hz, default sampling frequency for sound(), etc.


t=3; %in s, number of seconds to record for.
ai_length = t*fs;

% Set up MatLab Oscilloscope / Winsound Analoginput


ai = analoginput('winsound');
addchannel(ai, 1);
set(ai, 'SampleRate', fs);
set(ai, 'TriggerType', 'manual');
set(ai, 'TriggerRepeat', 0);
set(ai, 'SamplesPerTrigger', ai_length);

%Look into changing this from a manual trigger to a rising edge:


%set(ai, 'TriggerType', 'software');
%set(ai, 'TriggerCondition', 'Rising');
%set(ai, 'TriggerConditionValue', 0.01);
%set(ai, 'TriggerChannel', ai.Channel(1));
%set(ai, 'TriggerDelay', -0.1);
%set(ai, 'TriggerDelayUnits', 'seconds');
%set(ai, 'TimeOut', 10);

% Get data from the microphone


beep on;
beep;
start(ai);
trigger(ai);
data = getdata(ai);

beep;
delete(ai);

out = data; %return the audio input.

return

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% To be used in the
% Electrical and Computer Engineering Project
% Submitted in partial fulfillment of the requirements for the degree
% of Bachelor of Engineering at McMaster University
%
% To be used in conjunction with projects of
% Chris Agam & Jon Hernandez
%
% File: specCreate.m
% Author: Brett A. Lindsay 0648981
% Required Files: speechRec.m
%
% Function: specCreate.m will create a spectrogram out of the input
% audio signal x.
% Assumes x is a (length,1) vector and fs=8192 Hz.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function out = specCreate(x,fs)

%var=min(256,length(x));

%S = specgram(a,nfft,fs,window,numoverlap)
%x is the signal;
%window is the window WIDTH. ->use 512.
%noverlap = length of the window/2
%nfft=min(256,length(a)) is the default, seems good.
%fs is assumed to be 8192 Hz.
%X = specgram(x,var,fs,var,var/2);

%Simpler form has max of 8 time windowing periods, not very accurate for
%DTW (?):
% X=spectrogram(x);

X = specgram(x,512,fs,512,384);

out = X;
return
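The specgram(x,512,fs,512,384) call above frames the signal into 512-sample windows with 384 samples of overlap (a 128-sample hop). As a rough stand-in for that framing, not the toolbox routine itself, the magnitude STFT can be sketched in NumPy (hypothetical names; a Hann window is assumed, as specgram uses by default):

```python
import numpy as np

def spec_create(x, nfft=512, noverlap=384):
    """Magnitude STFT spectrogram: nfft-sample Hann windows with
    noverlap samples of overlap (hop = nfft - noverlap), echoing
    specgram(x,512,fs,512,384). Returns (nfft//2 + 1) x n_frames."""
    hop = nfft - noverlap
    win = np.hanning(nfft)
    frames = [np.fft.rfft(x[s:s + nfft] * win)
              for s in range(0, len(x) - nfft + 1, hop)]
    return np.abs(np.array(frames)).T  # freq bins down, time across
```

At fs = 8192 Hz this gives a 16 Hz bin spacing (fs/nfft), fine enough to separate the letter sounds being matched.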

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% To be used in the
% Electrical and Computer Engineering Project
% Submitted in partial fulfillment of the requirements for the degree
% of Bachelor of Engineering at McMaster University
%
% To be used in conjunction with projects of
% Chris Agam & Jon Hernandez
%
% File: speechRec.m
% Author: Brett A. Lindsay 0648981
% Required Files: library
% recorder.m
% preEmphasis.m
% normalizer.m
% hamWindow.m
% usefullSig.m
% specCreate.m
% cepAnal.m
% matchMat.m
% DTW.m
%
% Function: speechRec.m will be called by Jon Hernandez's c# program,
% which will pass in a character to be checked.
% speechRec.m will signal the user for audio input (ie.
% their answer/command), record this, process this, and test
% against a library using the method of Dynamic Time Warping.
% speechRec.m will then return information about the
% character being tested or if a command was entered.
%
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function out = speechRec(in)

[audioIn fs]= recorder(); %get audio signal

%!!!!!!
%Need to look into effectiveness of preEmphasis network.
% Seems pointless for discrete SR.
%audioIn = preEmphasis(audioIn); %pass through pre-emphasis network
audioIn = normalizer(audioIn);
audioIn = usefullSig(audioIn);
audioIn = cepAnal(audioIn);
audioIn = hamWindow(audioIn);

%[audioIn,fs] = wavread(['Library/lib' int2str(in) int2str(1) '.wav']);

audioIn=specCreate(audioIn,fs);

%Comparison loop.
numLibEnt=30; %number of library entries, 1-30
%1-26 being alphabet, 27-30 being enter, yes, no, back.
numLibSam=2; %number of library samples (ie. 0 to 2 entries of 'a').
cmin=500; %minimum comparison accepted.
c=cmin;
ctemp=0; %#ok<NASGU>
cp=0.7071*10000; %From experimental data, if the DTW block produces a value of
%0.7071 then this is a perfect match. This value is normalised
%for any size difference, et cetera.
r=0; %r is the variable for which is the current lowest match c.
%if r stays as 0, we therefore never achieved a c lower than
%min and don't have a match.
for m=1:numLibEnt
for n=0:numLibSam
[x fs]=wavread(['Library/lib' int2str(m) int2str(n) '.wav']);
Y=specCreate(x,fs);
M=matchMat(audioIn,Y);
ctemp=abs(DTW(M)-cp);
if (ctemp<c)
c=ctemp;
r=m;
end
end
end

%returning block.
% 1-30 - incorrect character (1-30).
% 50 - no satisfactory match.
% 100 - correct character.
if r==0
out=50;
elseif r==in;
out=100;
else
out=r;
end

return
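The comparison loop in speechRec.m is a nearest-neighbour search: each library spectrogram is scored by DTW over its local match matrix against the input, and the lowest-cost entry wins. A self-contained sketch of that idea follows, using a simple cumulative-cost DTW — this is not the project's DTW.m, and all names here are hypothetical:

```python
import numpy as np

def cosine_match(A, B):
    """Local match matrix of column-wise cosines (cf. matchMat.m)."""
    A, B = np.abs(A), np.abs(B)
    sA = np.sqrt((A ** 2).sum(axis=0))
    sB = np.sqrt((B ** 2).sum(axis=0))
    return (A.T @ B) / np.outer(sA, sB)

def dtw_cost(S):
    """Cheapest monotone path through the dissimilarity 1 - S,
    normalised by m + n so different matrix sizes stay comparable."""
    C = 1.0 - S
    m, n = C.shape
    D = np.full((m, n), np.inf)
    D[0, 0] = C[0, 0]
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            prev = min(D[i - 1, j] if i else np.inf,
                       D[i, j - 1] if j else np.inf,
                       D[i - 1, j - 1] if i and j else np.inf)
            D[i, j] = C[i, j] + prev
    return D[-1, -1] / (m + n)

def best_match(query, library):
    """Index of the library entry with the lowest DTW cost against the
    query (the role of speechRec.m's m/n comparison loop)."""
    return int(np.argmin([dtw_cost(cosine_match(query, Y))
                          for Y in library]))
```

A perfect match accumulates zero cost along the diagonal, which is the analogue of the experimentally observed "perfect match" constant cp used above.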

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% To be used in the
% Electrical and Computer Engineering Project
% Submitted in partial fulfillment of the requirements for the degree
% of Bachelor of Engineering at McMaster University
%
% To be used in conjunction with projects of
% Chris Agam & Jon Hernandez
%
% File: usefullSig.m
% Author: Brett A. Lindsay 0648981
% Required Files: speechRec.m
%
% Function: usefullSig.m will extract the part of the signal which
% is considered useful (ie. it will remove the beginning
% and end, before and after the user has spoken).
% Assumes x is a (length,1) vector.
% Sensitivity (z) should be - if soft spoken, + if loud.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function out = usefullSig(x)

%sensitivity.
z=-1;
if z<0
s=0.4;
elseif z>0
s=0.2;
else
s=0.3;
end

%Note: Changing Threshold will dramatically change matching ability


% Would like to have more adaptive thresh.
thresh=s*max(abs(x)); %fraction s of the maximum magnitude.
l=length(x);
%f's for os of 10,20,50,100 respectively, with length of ~24k
%f=0.0004069;
%f=0.0008138;
%f=0.002;
f=0.0041;

os=floor((1+s)*f*l); %offset.
a=0; %The lower bound.
as=1;
b=0;
bs=1;

for i=1:l
if (as && abs(x(i,1))>thresh)
a=i-os;
as=0;
end

if (bs && abs(x(l-i+1,1))>thresh) %+1 keeps the index within 1..l
b=l-i+1+os;
bs=0;
end
end

%Without these, there is the potential to go outside the vector bounds.


if a<1
a=1;
end
if b>l
b=l;
end

% Trying to solve w problem:


% out(length(x),1)=0;
% out(a:b,1) = x(a:b,1);

out = x(a:b,1);
return
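The trimming above scans inward from both ends for the first samples whose magnitude clears a threshold tied to the signal peak. The same endpointing can be sketched compactly in NumPy (hypothetical names; fixed sensitivity and an optional pad instead of the computed offset):

```python
import numpy as np

def useful_sig(x, s=0.4, pad=0):
    """Keep the span between the first and last samples whose magnitude
    exceeds s * max|x| (cf. usefullSig.m), padded by `pad` samples on
    each side. Assumes the signal is not all zeros."""
    thresh = s * np.max(np.abs(x))
    idx = np.nonzero(np.abs(x) > thresh)[0]
    a = max(idx[0] - pad, 0)            # clamp so indices stay in bounds
    b = min(idx[-1] + pad, len(x) - 1)
    return x[a:b + 1]
```

As the comment in usefullSig.m warns, the choice of threshold fraction dominates matching quality: too high clips onsets, too low keeps room noise.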

TEST FILE 1
clc;clear;close all;

fs=8192;
t=0:1/fs:3-1/fs;

audioIn=recorder();
figure,plot(t,audioIn);
sound(audioIn,fs)

%audioIn = preEmphasis(audioIn);
%figure,plot(audioIn);

audioIn = normalizer(audioIn);
figure,plot(t,audioIn);
sound(audioIn,fs)

audioIn = usefullSig(audioIn);
figure,plot(audioIn);
sound(audioIn,fs)

audioIn=cepAnal(audioIn);
figure,plot(audioIn)
sound(audioIn,fs);

audioIn = hamWindow(audioIn);
figure,plot(audioIn);
sound(audioIn,fs)

pause(1);

var=min(256,length(audioIn));
figure, specgram(audioIn,512,fs,512,384);
%figure, specgram(B,var,fs,var,var/2);
%figure, specgram(C,512,fs,512,384);

pause(1);close all;

TEST FILE 2
% This code is used to test the time it takes to open and convert a .wav file
% vs storing the data as spectrograms in a .txt file and reading them directly.

clc;clear;close all;

t=1:4;
t1=zeros(1,4);
t2=zeros(1,4);
T1=zeros(1,4);
T2=zeros(1,4);
stxt=[51.3 111 89.8 16.3];
swav=[6.15 13.4 10.2 2.39];
Stxt=zeros(1,4);
Swav=zeros(1,4);

for i=1:4
tic;
C=dlmread(['Library/test00' int2str(i) '.txt']);
t2(1,i)=toc;
end

for i=1:4
tic;
[audioIn fs]=wavread(['Library/test00' int2str(i) '.wav']);
C=specCreate(audioIn,fs);
t1(1,i)=toc;
end

T1(1,:)=sum(t1)/length(t1) %#ok<NOPTS>
T2(1,:)=sum(t2)/length(t2) %#ok<NOPTS>
Swav(1,:)=sum(swav)/length(swav) %#ok<NOPTS>
Stxt(1,:)=sum(stxt)/length(stxt) %#ok<NOPTS>

figure(1), subplot(2,1,1),plot(t,t1,'r',t,t2,'b',t,T1,'--r',t,T2,'--b'),ylabel('t (in s)'),xlabel('Trial number'),title('.wav in red, .txt in blue');
subplot(2,1,2),plot(t,stxt,'b',t,swav,'r',t,Stxt,'--b',t,Swav,'--r'),ylabel('size (in kB)'),xlabel('Trial number');

TEST FILE 3
clc;clear;close all;
fs=8192;

A=wavread('Library/lib271.wav');
var=min(256,length(A));
A=specgram(A,var,fs,var,var/2);

B=wavread('Library/lib272.wav');
var=min(256,length(B));
B=specgram(B,var,fs,var,var/2);

M=matchMat(A,B);

min=DTW(M);

TEST FILE 4
clc;clear;close all;

a=1;
tic;
out=speechRec(a);
toc;

TEST FILE 5

% The purpose of this test is to see if the algorithm can recognise files
% already in the system.
%
%If it can't, then the algorithm is fundamentally broken.
%
%The results will be stored in an array.
%
%Hopefully, it is all 100s.

%NOTE: Update, successful test.

numLibEnt=30; %number of library entries, 1-30


%1-26 being alphabet, 27-30 being enter, yes, no, back.
numLibSam=2; %number of library samples (ie. 0 to 2 entries of 'a').

out=0;in=0; %predefining.

outArray=zeros(1,numLibEnt*(1+numLibSam));
outArrayCount=1;

cArray=zeros(1,numLibEnt*(1+numLibSam));
cArrayCount=1;

for a=1:numLibEnt
for b=0:numLibSam
%Read in each file.
[x fs]=wavread(['Library/lib' int2str(a) int2str(b) '.wav']);
X=specCreate(x,fs);

%comparison loops.
cmin=1; %minimum comparison accepted.
c=cmin;
ctemp=0; %#ok<NASGU>
cp=0.7071; %From experimental data, if the DTW block produces a value of
%0.7071 then this is a perfect match. This value is normalised
%for any size difference, et cetera.
r=0; %r is the variable for which is the current lowest match c.
%if r stays as 0, we therefore never achieved a c lower than
%min and don't have a match.
for m=1:numLibEnt
for n=0:numLibSam
[x fs]=wavread(['Library/lib' int2str(m) int2str(n) '.wav']);
Y=specCreate(x,fs);
M=matchMat(X,Y);
ctemp=abs(DTW(M)-cp);
if (ctemp<c)
c=ctemp;
r=m;
end
end
end

cArray(1,cArrayCount)=c;
cArrayCount=cArrayCount+1;

%returning block.
% 1-30 - incorrect character (1-30).
% 50 - no satisfactory match.
% 100 - correct character.
in=a;%want to see if the found r is a.
if r==0
out=50;
elseif r==in;
out=100;
else
out=r;
end

outArray(1,outArrayCount)=out;
outArrayCount=outArrayCount+1;

end
end

figure(1),
subplot(1,2,1),plot(outArray),title('Results (100=match)');
subplot(1,2,2),plot(cArray),title('Value of C variable for match');

TEST FILE 6
% This file is created to run through all the files in the library and listen
% to them, for personal understanding.

numLibEnt=30; %number of library entries, 1-30


%1-26 being alphabet, 27-30 being enter, yes, no, back.
numLibSam=2; %number of library samples (ie. 0 to 2 entries of 'a').

out=0;in=0; %predefining.

outArray=zeros(1,numLibEnt*(1+numLibSam));
outArrayCount=1;

for a=1:numLibEnt
for b=0:numLibSam
%Read in each file.
[x fs]=wavread(['Library/lib' int2str(a) int2str(b) '.wav']);
sound(x,fs);

end
end

TEST FILE 7
% This test is to take a look at what c values a letter will generate over
% the whole testing algorithm.

numLibEnt=30; %number of library entries, 1-30

%1-26 being alphabet, 27-30 being enter, yes, no, back.
numLibSam=2; %number of library samples (ie. 0 to 2 entries of 'a').

cArray=zeros(1,numLibEnt*(1+numLibSam));
cArrayCount=1;

%File to test (w=23)


in=1;

[x fs]=wavread(['Library/lib' int2str(in) '2.wav']);


X=specCreate(x,fs);

%comparison loops.
cmin=1; %minimum comparison accepted.
c=cmin;
ctemp=0; %#ok<NASGU>
r=0; %r is the variable for which is the current lowest match c.
%if r stays as 0, we therefore never achieved a c lower than
%min and don't have a match.
for m=1:numLibEnt
for n=0:numLibSam
[y fs]=wavread(['Library/lib' int2str(m) int2str(n) '.wav']);
Y=specCreate(y,fs);
M=matchMat(X,Y);
ctemp=DTW(M);
%Added code here:
cArray(1,cArrayCount)=ctemp;
cArrayCount=cArrayCount+1;
if ctemp<c
c=ctemp;
r=m;
end
end
end

%returning block.
% 1-30 - incorrect character (1-30).
% 50 - no satisfactory match.
% 100 - correct character.
if r==0
out=50;
elseif r==in;
out=100;
else
out=r;
end

figure(1),plot(cArray),title(['Value of C variable for ' int2str(in)]);

TEST FILE 8
% Here I want to see the correspondence between the size of the audio file
% and the c values returned.

numLibEnt=30; %number of library entries, 1-30


%1-26 being alphabet, 27-30 being enter, yes, no, back.
numLibSam=2; %number of library samples (ie. 0 to 2 entries of 'a').

cArray=zeros(1,numLibEnt*(1+numLibSam));
cArrayCount=1;
sArray=zeros(1,numLibEnt*(1+numLibSam));
sArrayCount=1;

%File to test
in=1;

[x fs]=wavread(['Library/lib' int2str(in) '2.wav']);


X=specCreate(x,fs);

%comparison loops.
cmin=1; %minimum comparison accepted.
c=cmin;
ctemp=0; %#ok<NASGU>
r=0; %r is the variable for which is the current lowest match c.
%if r stays as 0, we therefore never achieved a c lower than
%min and don't have a match.
for m=1:numLibEnt
for n=0:numLibSam
[y fs]=wavread(['Library/lib' int2str(m) int2str(n) '.wav']);
Y=specCreate(y,fs);
M=matchMat(X,Y);
ctemp=DTW(M);
%Added code here:
cArray(1,cArrayCount)=ctemp;
cArrayCount=cArrayCount+1;
sArray(1,sArrayCount)=size(y,1);
sArrayCount=sArrayCount+1;
if ctemp<c
c=ctemp;
r=m;
end
end
end

%returning block.
% 1-30 - incorrect character (1-30).
% 50 - no satisfactory match.
% 100 - correct character.
if r==0
out=50;
elseif r==in;
out=100;
else
out=r;
end

% figure(1),subplot(1,2,1),plot(cArray,'r'),title(['Value of C (red) v Size(b) for ' int2str(in)]);
% subplot(1,2,2),plot(sArray,'b');
figure(1),plot(cArray,'r'),title(['Value of C (red) v Size(b) for ' int2str(in)]);
hold on;plot(sArray,'b');hold off;

TEST FILE 9
% Here, this is going to be a test for how much time a certain number of
% iterations of the library will take (ie if the library has x samples)
% in regards to the comparison block.

numLibEnt=30; %number of library entries, 1-30


%1-26 being alphabet, 27-30 being enter, yes, no, back.
numLibSam=2; %number of library samples (ie. 0 to 2 entries of 'a').

mod=20; %number of times to modify the run through.

timeArray=zeros(1,mod);

%File to test
in=1;
[x fs]=wavread(['Library/lib' int2str(in) '2.wav']);
X=specCreate(x,fs);

cmin=1; %minimum comparison accepted.


c=cmin;
ctemp=0; %#ok<NASGU>
r=0; %r is the variable for which is the current lowest match c.
%if r stays as 0, we therefore never achieved a c lower than
%min and don't have a match.
for m=1:numLibEnt
for l=1:mod
tic;

for n=0:numLibSam
[y fs]=wavread(['Library/lib' int2str(m) int2str(n) '.wav']);
Y=specCreate(y,fs);
M=matchMat(X,Y);
ctemp=DTW(M);
if ctemp<c
c=ctemp;
r=m;
end
end

t=toc;
t=t/3; %going to take an average for better results.
if l==1
timeArray(1,l)=t;
else
timeArray(1,l)=t+timeArray(1,l-1);
end

end
end

figure(1),plot(timeArray),title('Time to run through comparison vs number of samples in library'),xlabel('Number of samples'),ylabel('t (in s)');

TEST FILE 10
% Doing timings for varying sizes of wav files.

clc;clear;close all;

swav=[6.15 13.4 10.2 2.39]; %sizes of test wav files.


numSam=4;
numRun=20;

tNorm(1,numSam)=0;
tUsef(1,numSam)=0;
tCepA(1,numSam)=0;
tHamW(1,numSam)=0;
tSpec(1,numSam)=0;
tReco(1,numSam)=0;

for i=1:numSam
[x fs]=wavread(['Library/test00' int2str(i) '.wav']);

t=0;
for m=1:numRun
tic;
xp=normalizer(x);
t=t+toc;
end
t=t/numRun;
tNorm(1,i)=t;
figure(1),subplot(3,2,2),stem(swav,tNorm),title('Times of normalizer.m vs Size'),xlabel('Size in kB'),ylabel('t (in s)');

t=0;
for m=1:numRun
tic;
xp=usefullSig(x);
t=t+toc;
end
t=t/numRun;
tUsef(1,i)=t;
figure(1),subplot(3,2,3),stem(swav,tUsef),title('Times of usefullSig.m vs Size'),xlabel('Size in kB'),ylabel('t (in s)');

t=0;
for m=1:numRun
tic;
xp=cepAnal(x);
t=t+toc;
end
t=t/numRun;
tCepA(1,i)=t;
figure(1),subplot(3,2,4),stem(swav,tCepA),title('Times of cepAnal.m vs Size'),xlabel('Size in kB'),ylabel('t (in s)');

t=0;
for m=1:numRun
tic;
xp=hamWindow(x);
t=t+toc;
end
t=t/numRun;
tHamW(1,i)=t;
figure(1),subplot(3,2,5),stem(swav,tHamW),title('Times of hamWindow.m vs Size'),xlabel('Size in kB'),ylabel('t (in s)');

t=0;
for m=1:numRun
tic;
xp=specCreate(x,fs);
t=t+toc;
end
t=t/numRun;
tSpec(1,i)=t;
figure(1),subplot(3,2,6),stem(swav,tSpec),title('Times of specCreate.m vs Size'),xlabel('Size in kB'),ylabel('t (in s)');

t=0;
for m=1:numRun
tic;
xp=recorder();
t=t+toc;
end
t=t/numRun;
tReco(1,i)=t;
figure(1),subplot(3,2,1),stem([1 2 3 4],tReco),title('Times of recorder.m for Four trials'),xlabel('Trial'),ylabel('t (in s)');

end

TEST FILE 11
% Doing timing for matchMat and DTW for various sizes of wavs.

clc;clear;close all;

swav=[6.15 13.4 10.2 2.39]; %sizes of test wav files.


numSam=4;
numRun=20;

tMatc(1,numSam)=0;
tDTW(1,numSam)=0;

for i=1:numSam
[x fs]=wavread(['Library/test00' int2str(i) '.wav']);
X=specCreate(x,fs);

t=0;
for m=1:numRun
tic;
XP=matchMat(X,X);
t=t+toc;
end
t=t/numRun;
tMatc(1,i)=t;
figure(1),subplot(2,1,1),stem(swav,tMatc),title('Times of matchMat.m vs Size'),xlabel('Size in kB'),ylabel('t (in s)');

t=0;
for m=1:numRun
tic;
a=DTW(XP);
t=t+toc;
end
t=t/numRun;
tDTW(1,i)=t;
figure(1),subplot(2,1,2),stem(swav,tDTW),title('Times of DTW.m vs Size'),xlabel('Size in kB'),ylabel('t (in s)');

end

TEST FILE 12
% Doing timing for speechRec.m
function testFileTwelve()

numTri=20; %number of trials.


t=0;
tSpec(1,numTri)=0;
sSpec(1,numTri)=0;

for m=1:numTri
t=0;
tic;
[out fileSize]=speechRecTest(5);
t=toc;
tSpec(1,m)=t;
sSpec(1,m)=fileSize;
end

rSpec=tSpec./sSpec;

figure(1);
subplot(3,1,1),stem(tSpec,'r'),title('Time of speechRec.m in red for trials (in s)');
subplot(3,1,2),stem(sSpec,'b'),title('Size of audio in files in blue (in Array Size)');
subplot(3,1,3),stem(rSpec,'g'),title('Ratio between time and size (in s/ArraySize)');

t=0;%So I can stop debugger.

end

function [out fileSize]=speechRecTest(in)

[audioIn fs]= recorder(); %get audio signal

%!!!!!!
%Need to look into effectiveness of preEmphasis network.
% Seems pointless for discrete SR.
%audioIn = preEmphasis(audioIn); %pass through pre-emphasis network
audioIn = normalizer(audioIn);
audioIn = usefullSig(audioIn);

%ADDED CODE HERE TO GET SIZE


fileSize=size(audioIn,1);

audioIn = hamWindow(audioIn);
audioIn = cepAnal(audioIn);
audioIn=specCreate(audioIn,fs);

%Comparison loop.
numLibEnt=30; %number of library entries, 1-30
%1-26 being alphabet, 27-30 being enter, yes, no, back.
numLibSam=2; %number of library samples (ie. 0 to 2 entries of 'a').
cmin=500; %minimum comparison accepted.
c=cmin;
ctemp=0; %#ok<NASGU>
r=0; %r is the variable for which is the current lowest match c.
%if r stays as 0, we therefore never achieved a c lower than
%min and don't have a match.
for m=1:numLibEnt
for n=0:numLibSam
[x fs]=wavread(['Library/lib' int2str(m) int2str(n) '.wav']);
Y=specCreate(x,fs);
M=matchMat(audioIn,Y);
ctemp=DTW(M);
if ctemp<c
c=ctemp;
r=m;
end
end
end

%returning block.
% 1-30 - incorrect character (1-30).
% 50 - no satisfactory match.
% 100 - correct character.
if r==0
out=50;
elseif r==in;
out=100;
else
out=r;
end
end

TEST FILE 13
%This is comparing the c values of the three a's in the library currently.

letter=1; %which letter to compare.


numLibSam=2; %number of library samples (ie. 0 to 2 entries of 'a').

s(1,(1+numLibSam))=0;
c(1,(1+numLibSam))=0;

for m=0:numLibSam
[x fs]=wavread(['Library/lib' int2str(letter) int2str(m) '.wav']);
X=specCreate(x,fs);
s(1,m+1)=size(x,1);

M=matchMat(X,X);
c(1,m+1)=DTW(M);
end

figure(1), stem(s,c),ylabel('c value returned for perfect match'),xlabel('size of array'),
title(['c value returned for letter "' int2str(letter) '"']);

TEST FILE 14
% Testing values of c's across libSamples.

numLibEnt=30; %number of library entries, 1-30


%1-26 being alphabet, 27-30 being enter, yes, no, back.
numLibSam=2; %number of library samples (ie. 0 to 2 entries of 'a').

s(1,(numLibEnt))=0;
c(1,(numLibEnt))=0;

for m=1:numLibEnt
[x fs]=wavread(['Library/lib' int2str(m) int2str(numLibSam) '.wav']);
X=specCreate(x,fs);
s(1,m)=size(x,1);

M=matchMat(X,X);

c(1,m)=DTW(M);
end

figure(1), stem(s,c),ylabel('c value returned for perfect match'),xlabel('size of array'),
title(['c value returned for sample "' int2str(numLibSam) '"']);

TEST FILE 15
% Testing values of c's across libSamples, now comparing with size of
% spectrograms. C's will be for perfect match.

numLibEnt=30; %number of library entries, 1-30


%1-26 being alphabet, 27-30 being enter, yes, no, back.
numLibSam=0; %number of library samples (ie. 0 to 2 entries of 'a').

aX(1,(numLibEnt))=0;
aM(1,(numLibEnt))=0;
c(1,(numLibEnt))=0;

for m=1:numLibEnt
[x fs]=wavread(['Library/lib' int2str(m) int2str(numLibSam) '.wav']);
X=specCreate(x,fs);
[mX,nX]=size(X);
aX(1,m)=mX*nX;

M=matchMat(X,X);
[mM,nM]=size(M);
aM(1,m)=mM*nM;

c(1,m)=DTW(M);
end

figure(1), subplot(2,1,1),stem(aX,c),ylabel('c value returned for perfect match'),xlabel('area of spec'),
title(['c value returned for sample "' int2str(numLibSam) '"']);
subplot(2,1,2),stem(aM,c),ylabel('c value returned for perfect match'),xlabel('area of match matrix'),

TEST FILE 16
%Testing bits of Dr. Ellis' code.

clc;clear;close all;

numLibEnt=30; %number of library entries, 1-30


%1-26 being alphabet, 27-30 being enter, yes, no, back.
numLibSam=0; %number of library samples (ie. 0 to 2 entries of 'a').

s(1,(numLibEnt))=0;
c(1,(numLibEnt))=0;

for m=1:numLibEnt

[d1,sr] = wavread(['Library/lib' int2str(1) int2str(numLibSam) '.wav']);


[d2,sr] = wavread(['Library/lib' int2str(m) int2str(numLibSam) '.wav']);

% Listen to them together:


ml = min(length(d1),length(d2));
soundsc(d1(1:ml)+d2(1:ml),sr)
% or, in stereo
soundsc([d1(1:ml),d2(1:ml)],sr);

D1 = specgram(d1,512,sr,512,384);
D2 = specgram(d2,512,sr,512,384);

SM = matchMat(D1,D2);

figure(1)
subplot(121)
imagesc(SM)
colormap(1-gray)

[p q C cp]=DTWTWO(1-SM);
hold on; plot(q,p,'r'); hold off

subplot(122)
imagesc(C)
hold on; plot(q,p,'r'); hold off

% c(1,m)=C(size(C,1),size(C,2))/(size(C,1)*size(C,2));
c(1,m)=cp;
s(1,m)=size(C,1);

end

figure(2),stem(s,c),xlabel('library entry'),ylabel('c value'),title('c values for library entries');

TEST FILE 17
%Testing my code in style of testFileSixteen

clc;clear;close all;

numLibEnt=30; %number of library entries, 1-30


%1-26 being alphabet, 27-30 being enter, yes, no, back.
numLibSam=0; %number of library samples (ie. 0 to 2 entries of 'a').

s(1,(numLibEnt))=0;
c(1,(numLibEnt))=0;

for m=1:numLibEnt

[d1,sr] = wavread(['Library/lib' int2str(2) int2str(numLibSam) '.wav']);


[d2,sr] = wavread(['Library/lib' int2str(m) int2str(numLibSam) '.wav']);

% % Listen to them together:


% ml = min(length(d1),length(d2));
% soundsc(d1,sr)
% % or, in stereo
% soundsc(d2,sr);

D1 = specgram(d1,512,sr,512,384);
D2 = specgram(d2,512,sr,512,384);

M = matchMat(D1,D2);

cp=DTW(M);

c(1,m)=cp;
s(1,m)=size(M,2);

end

e(1,30)=0;
for m=1:30
e(1,m)=7071;
end

figure(1),subplot(3,1,1),stem(e,'r'),
hold on, stem(c,'b'),xlabel('library entry'),ylabel('c value'),title('c values for library entries');hold off;
subplot(3,1,2),stem(s,'b'),xlabel('library entry'),ylabel('Size of Match Matrix'),title('size of match matrix for library entries');
subplot(3,1,3), stem(e,'r'),
hold on,stem(s,c,'b'),xlabel('Size of Match Matrix'),ylabel('c value'),title('c values for Size of Match Matrix');hold off;

TEST FILE 18
%For getting pictures.
clc;clear;close all;

[audioIn fs]= recorder();


% sound(audioIn,fs)
figure(1),plot(audioIn),xlabel('Time Axis'),ylabel('Signal Magnitude'),title('Input sound');

audioIn = normalizer(audioIn);
figure(2),subplot(2,2,1),plot(audioIn),xlabel('Time Axis'),ylabel('Signal Magnitude'),title('Normalized sound');
audioIn = usefullSig(audioIn);
figure(2),subplot(2,2,2),plot(audioIn),xlabel('Time Axis'),ylabel('Signal Magnitude'),title('Useful sound');
audioIn = cepAnal(audioIn);
figure(2),subplot(2,2,3),plot(audioIn),xlabel('Time Axis'),ylabel('Signal Magnitude'),title('Post Cepstral Filtering');
audioIn = hamWindow(audioIn);
figure(2),subplot(2,2,4),plot(audioIn),xlabel('Time Axis'),ylabel('Signal Magnitude'),title('Windowed (Hamming) sound');

sound(audioIn,fs)
close all;

[y fs]=wavread(['Library/lib31.wav']);
figure(3),subplot(3,1,1),specgram(y,512,fs,512,384),title('Spectrogram of Input Sound `c`');
Y=specCreate(y,fs);

[x fs]=wavread(['Library/lib21.wav']);
subplot(3,1,2),specgram(x,512,fs,512,384),title('Spectrogram of Close Library Sound `b`');
X=specCreate(x,fs);

[z fs]=wavread(['Library/lib231.wav']);
subplot(3,1,3),specgram(z,512,fs,512,384),title('Spectrogram of Far Library Sound `w`');
Z=specCreate(z,fs);

MP=matchMat(X,X);
figure(4),subplot(3,2,1),imagesc(MP),colormap(1-gray),title('Perfectly Matching Input Specs');
[p q CP cp]=DTWTWO(1-MP);
subplot(3,2,2),imagesc(CP);hold on; plot(q,p,'r'); hold off;title('Match Matrix (left), Quickest Path (right)');

M=matchMat(X,Y);
subplot(3,2,3),imagesc(M),colormap(1-gray),title('Somewhat Matching Input Specs');
[p q C c]=DTWTWO(1-M);
subplot(3,2,4),imagesc(C);hold on; plot(q,p,'r'); hold off

MO=matchMat(X,Z);
subplot(3,2,5),imagesc(MO),colormap(1-gray),title('Poorly Matching Input Specs');
[p q CO co]=DTWTWO(1-MO);
subplot(3,2,6),imagesc(CO);hold on; plot(q,p,'r'); hold off

TEST FILE 19
%To test the newer versions of DTW.
clc;clear;

[y fs]=wavread(['Library/lib31.wav']);
Y=specCreate(y,fs);

[x fs]=wavread(['Library/lib21.wav']);
X=specCreate(x,fs);

[z fs]=wavread(['Library/lib231.wav']);
Z=specCreate(z,fs);

M=matchMat(X,Z);
imagesc(M),colormap(1-gray),title('Poorly Matching Input Specs');
c=DTWTHREE(M);

TEST FILE 20
%Will test if there's a time difference between DTW.m and DTWTHREE.m
%DTWTHREE has the breaking code.

clc;clear;

numLibEnt=30; %number of library entries, 1-30

%1-26 being alphabet, 27-30 being enter, yes, no, back.
numLibSam=2; %number of library samples (ie. 0 to 2 entries of 'a').
cmin=500; %minimum comparison accepted.
c=cmin;
ctemp=0; %#ok<NASGU>
cp=0.7071*10000; %From experimental data, if the DTW block produces a value of
%0.7071 then this is a perfect match. This value is normalised
%for any size difference, et cetera.
r=0;

tic
for a=1:numLibEnt
for b=1:numLibSam
[audioIn,fs] = wavread(['Library/lib' int2str(a) int2str(b) '.wav']);
audioIn=specCreate(audioIn,fs);

for m=1:numLibEnt
for n=0:numLibSam
[x fs]=wavread(['Library/lib' int2str(m) int2str(n) '.wav']);
Y=specCreate(x,fs);
M=matchMat(audioIn,Y);
ctemp=abs(DTW(M)-cp);
if (ctemp<c)
c=ctemp;
r=m;
end
end
end

end
end
t0=toc %#ok<NOPTS>

tic
for a=1:numLibEnt
for b=1:numLibSam
[audioIn,fs] = wavread(['Library/lib' int2str(a) int2str(b) '.wav']);
audioIn=specCreate(audioIn,fs);

for m=1:numLibEnt
for n=0:numLibSam
[x fs]=wavread(['Library/lib' int2str(m) int2str(n) '.wav']);
Y=specCreate(x,fs);
M=matchMat(audioIn,Y);
ctemp=abs(DTWTHREE(M)-cp);
if (ctemp<c)
c=ctemp;
r=m;
end
end
end

end
end
t1=toc %#ok<NOPTS>

TEST FILE 21
%This tests more specific differences between DTW.m and DTWTHREE.m
clc;clear;
a=rand %#ok<NOPTS>

[x fs]=wavread(['Library/lib21.wav']);
X=specCreate(x,fs);

[z fs]=wavread(['Library/lib231.wav']);
Z=specCreate(z,fs);

L=30000000;

t0(1,2)=0;
t1(1,2)=0;

for m=L;
M=matchMat(X,X);
tic
c=DTW(M);
t0(1,1)=toc+t0(1,1);
end

for m=L;
M=matchMat(X,X);
tic
c=DTWTHREE(M);
t1(1,1)=toc+t1(1,1);
end

for m=L;
M=matchMat(X,Z);
tic
c=DTW(M);
t0(1,2)=toc+t0(1,2);
end

for m=L;
M=matchMat(X,Z);
tic
c=DTWTHREE(M);
t1(1,2)=toc+t1(1,2);
end

t0%#ok<NOPTS>
t1%#ok<NOPTS>

stem([1,2],t0,'b'),hold on, stem([1,2],t1,'r'),title('DTW in blue, DTWTHREE in red')

References
[1] Chiba, S. and Sakoe, H., "Dynamic programming algorithm optimization for spoken word
recognition," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 26, pp. 43-49, 1978.

[2] Dumitru, C.O. and Gavat, I., “Vowel, digit and continuous speech recognition based on statistical,
neural and hybrid modelling by using ASRS_RL,” EUROCON 2007 - The International Conference on
Computer as a Tool, pp. 856-863, September 2007.

[3] Ellis, Dan. "Dynamic Time Warp (DTW) in Matlab." Dan Ellis's Home Page (Columbia University
Electrical Engineering). Web. http://www.ee.columbia.edu/~dpwe/resources/matlab/dtw/

[4] Flanagan, JL. Speech Analysis: Synthesis & Perception. New York: Academic In., 1965. Print.

[5] Fry, DB. The Physics of Speech. Cambridge: Cambridge UP, 1979. Print.

[6] Gold, B., and N. Morgan. Speech and Audio Signal Processing. John Wiley & Sons Inc., 2000.
Print.

[7] Hart, P. "Voice recognition: what all the talk is about," Telecommunications, vol. 29, no. 7, July 1995.

[8] Jawed, F., Muzaffar, F. et al. “DSP implementation of voice recognition using dynamic time
warping algorithm,” 2005 Student Conference on Engineering Sciences and Technology, SCONEST.
Karachi, Pakistan, 2005.

[9] Kale, Kaustubh R. "Dynamic Time Warping." Computational NeuroEngineering Lab at the
University of Florida. Web. http://www.cnel.ufl.edu/~kkale/dtw.html

[10] Mrvaljevic, N. and Ying, S. “Comparison between speaker dependent mode and speaker
independent mode for voice recognition,” Bioengineering, Proceedings of the Northeast Conference,
Boston, United States of America, April. 2009.

[11] Nelson, B. and Runger, G., “Predicting processes when embedded events occur: Dynamic time
warping,” Journal of Quality Technology, vol 35, no 2, pp. 213-226, April 2003.

[12] National Federation of the Blind, “Braille readers are leaders,” [Online] 2009 Available:
http://www.nfb.org/nfb/Braille_coin.asp [Accessed: Oct. 7 2009]

[13] The MathWorks Store, [Online] 2009. Available: http://www.mathworks.com/store/ [Accessed: Oct. 4 2009]

[14] Lindsay, B. "4BI6 Group 13 Logbook," 2009-2010.

VITA
NAME: Brett Lindsay
PLACE OF BIRTH: Burlington, Ontario, Canada
YEAR OF BIRTH: 1988
SECONDARY EDUCATION: Lord Elgin High School (2002-2004)
Robert Bateman High School (2004-2006)
UNDERGRAD EDUCATION: McMaster University (2006-2010)
HONOURS and AWARDS: Queen Elizabeth II Aiming for the Top Scholarship 2006
McMaster Entrance Scholarship
Smurfit-Stone Scholarship 2006, 2007, 2008, 2009
Dean’s Honour List 2007, 2009

