

(Submitted in partial fulfillment for the award of Bachelor of Electronics Engineering Degree by the University of Mumbai) By Pratik Chopra Harshad Dange

Under the guidance of

Mr. Shirish S. Halbe (Asst. Professor & Hobby Centre Co-ordinator )


Department of Electronics Engineering, K. J. Somaiya College of Engineering, Vidyavihar, Mumbai - 400077. 2006 - 2007.




This is to certify that Mr. Pratik Chopra of Electronics Department, bearing the seat number 8139, has completed the B.E. project on Controlled Robot, and that it is accepted and examined for the partial fulfillment of the Bachelor of Electronics Engineering Degree by the University of Mumbai.

Prof. Shirish S. Halbe
GUIDE

Prof. Milind Marathe
H. O. D.

Dr. P.P Parikh
Director / Principal

Date of Examination



We take this opportunity to express our deepest gratitude towards Mr. S.S. Halbe, our project guide, who has been the driving force behind this project and whose guidance and co-operation have been a source of inspiration for us. We would also like to thank Prof. Samir Mhatre for his valuable support whenever needed. We are very thankful to our professors, our colleagues and the authors of the various publications to which we have referred. We express our sincere appreciation and thanks to all those who have guided us directly or indirectly in our project. Much needed moral support and encouragement was also provided on numerous occasions by our whole division. Finally, we thank our parents for their immense support.



1. Introduction

2. The Task

3. Speech Recognition Types/Styles

4. Approaches to Statistical Speech Recognition

5. Nature of Problem

6. Solution to Problems

7. Design Approach

a. Speech Recognition Module

b. Microcontroller and Decoder circuit

c. RF module

d. Driver Circuit

e. Buffer

f. Batteries

8. Training and Recognition

9. Applications

10. Components Used

11. Datasheet-HM2007

12. Project Progress Report Summary

13. Bibliography



When we say voice control, the first term to consider is speech recognition, i.e. making the system understand the human voice. Speech recognition is a technology where the system recognizes the words (not their meaning) given through speech.

Speech is an ideal method for robotic control and communication. The speech-recognition circuit we will outline functions independently of the robot’s main intelligence [central processing unit (CPU)]. This is a good thing because it doesn’t take any of the robot’s main CPU processing power for word recognition. The CPU must merely poll the speech circuit’s recognition lines occasionally to check if a command has been issued to the robot. We can even improve upon this by connecting the recognition line to one of the robot’s CPU interrupt lines. By doing this, a recognized word would cause an interrupt, letting the CPU know a recognized word had been spoken. The advantage of using an interrupt is that polling the circuit’s recognition line occasionally would no longer be necessary, further reducing any CPU overhead.
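The difference between polling and interrupt-driven operation described above can be sketched in a few lines of Python. This is only a software simulation under stated assumptions: the class `SpeechCircuit` and its methods are hypothetical stand-ins for the recognition circuit, not part of any real hardware interface.

```python
class SpeechCircuit:
    """Hypothetical stand-in for the stand-alone speech-recognition circuit."""

    def __init__(self):
        self._pending = None        # last recognized word not yet read
        self._on_recognized = None  # optional "interrupt" handler

    def attach_interrupt(self, handler):
        """Register a function to be called as soon as a word is recognized."""
        self._on_recognized = handler

    def recognize(self, word):
        """Simulate the circuit recognizing a spoken word."""
        if self._on_recognized is not None:
            self._on_recognized(word)   # interrupt style: CPU is notified
        else:
            self._pending = word        # polling style: wait to be read

    def poll(self):
        """Return the pending word (or None) and clear the recognition line."""
        word, self._pending = self._pending, None
        return word


# Polling: the CPU has to keep asking, and most checks find nothing.
polled = SpeechCircuit()
poll_results = [polled.poll()]        # nothing spoken yet
polled.recognize("forward")
poll_results.append(polled.poll())    # picks up the word
poll_results.append(polled.poll())    # line already cleared

# Interrupt: the handler runs only when a word actually arrives.
interrupted = SpeechCircuit()
received = []
interrupted.attach_interrupt(received.append)
interrupted.recognize("stop")
```

In the polled case two of the three checks return nothing, which is the overhead the interrupt line removes.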

Another advantage to this stand-alone speech-recognition circuit (SRC) is its programmability. You can program and train the SRC to recognize the unique words you want recognized. The SRC can be easily interfaced to the robot’s CPU.

Controlling and commanding an appliance (computer, VCR, TV, security system, etc.) by speaking to it makes the appliance easier to use, while increasing the efficiency and effectiveness of working with that device. At its most basic level, speech recognition allows the user to perform parallel tasks (i.e. while hands and eyes are busy elsewhere) while continuing to work with the computer or appliance.

Robotics is an evolving technology. There are many approaches to building robots, and no one can be sure which method or technology will be used 100 years from now. Like biological systems, robotics is evolving following the Darwinian model of survival of the fittest.

Suppose you want to control a menu driven system. What is the most striking property that you can think of?

Well the first thought that came to our mind is that the range of inputs in a menu driven system is limited. In fact, by using a menu all we are doing is limiting the input domain space. Now, this is one characteristic which can be very useful in implementing the menu in stand alone systems. For example think of the pine menu or a washing machine menu. How many distinct commands do they require?

Why build robots?

Robots are indispensable in many manufacturing industries. The reason is that the cost per hour to operate a robot is a fraction of the cost of the human labor needed to perform the same function. More than this, once programmed, robots repeatedly perform functions with a high accuracy that surpasses that of the most experienced human operator. Human operators are, however, far more versatile. Humans can switch job tasks easily. Robots are built and programmed to be job specific. You wouldn’t be able to program a welding robot to start counting parts in a bin. Today’s most advanced industrial robots will soon become “dinosaurs.” Robots are in the infancy stage of their evolution. As robots evolve, they will become more versatile, emulating the human capacity and ability to switch job tasks easily. While the personal computer has made an indelible mark on society, the personal robot hasn’t made an appearance. Obviously there’s more to a personal robot than a personal computer. Robots require a combination of elements to be effective: sophistication of intelligence, movement, mobility, navigation, and purpose.

Without risking human life or limb, robots can replace humans in some hazardous duty service. Robots can work in all types of polluted environments, chemical as well as nuclear. They can work in environments so hazardous that an unprotected human would quickly die.


Chapter 2. THE TASK

The purpose of this project is to build a robotic car which could be controlled using voice commands. Generally these kinds of systems are known as Speech Controlled Automation Systems (SCAS). Our system will be a prototype of the same.

We are not aiming to build a robot which can recognize a lot of words. Our basic idea is to develop some sort of menu driven control for our robot, where the menu is going to be voice driven.

What we are aiming at is to control the robot using the following voice commands, so that it can do these basic tasks:

1. move forward

2. move back

3. turn right

4. turn left

5. load

6. release

7. stop ( stops doing the current job )



INPUT (Speaker speaks)

OUTPUT (Robot does)


moves forward


moves back


turns right


turns left


Lifts the load


Releases the load


stops doing current task

(Words are chosen in such a way that they sound least similar to one another)



Voice-enabled devices basically use the principle of speech recognition. It is the process of electronically converting a speech waveform (as the realization of a linguistic expression) into words (as a best-decoded sequence of linguistic units).

Converting a speech waveform into a sequence of words involves several essential steps:

1. A microphone picks up the signal of the speech to be recognized and converts it into an electrical signal. A modern speech recognition system also requires that the electrical signal be represented digitally by means of an analog-to-digital (A/D) conversion process, so that it can be processed with a digital computer or a microprocessor.

2. This speech signal is then analyzed (in the analysis block) to produce a representation consisting of salient features of the speech. The most prevalent feature of speech is derived from its short-time spectrum, measured successively over short-time windows of length 20–30 milliseconds overlapping at intervals of 10–20 ms. Each short-time spectrum is transformed into a feature vector, and the temporal sequence of such feature vectors thus forms a speech pattern.

3. The speech pattern is then compared to a store of phoneme patterns or models through a dynamic programming process in order to generate a hypothesis (or a number of hypotheses) of the phonemic unit sequence. (A phoneme is a basic unit of speech and a phoneme model is a succinct representation of the signal that corresponds to a phoneme, usually embedded in an utterance.) A speech signal inherently has substantial variations along many dimensions.
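The analysis step (step 2 above) can be illustrated with a minimal sketch. It assumes a 25 ms window and a 10 ms hop, which fall inside the ranges quoted above, and uses a plain rectangular window with a naive DFT so that no external library is needed. The function names and the test tone are illustrative assumptions, not the project's actual front end.

```python
import cmath
import math


def frame_signal(samples, sample_rate, window_ms=25, hop_ms=10):
    """Split samples into overlapping short-time windows."""
    window = int(sample_rate * window_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    frames = []
    for start in range(0, len(samples) - window + 1, hop):
        frames.append(samples[start:start + window])
    return frames


def magnitude_spectrum(frame):
    """Naive DFT magnitude for bins 0..N/2 (slow, but dependency-free)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc))
    return spectrum


# Hypothetical input: 100 ms of a 1 kHz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(800)]

frames = frame_signal(tone, SAMPLE_RATE)
# One feature vector per frame; the sequence of vectors is the speech pattern.
features = [magnitude_spectrum(f) for f in frames]

# The strongest bin of the first frame should sit at the tone frequency.
peak_bin = max(range(len(features[0])), key=lambda k: features[0][k])
peak_hz = peak_bin * SAMPLE_RATE / len(frames[0])
```

A practical front end would normally apply a tapered window and an FFT, and would reduce each spectrum to a smaller feature vector before matching.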

Before we understand the design of the project let us first understand speech recognition types and styles. Speech recognition is classified into two categories, speaker dependent and speaker independent.

Speaker dependent systems are trained by the individual who will be using the system. These systems are capable of achieving a high command count and better than 95% accuracy for word recognition. The drawback to this approach is that the system responds accurately only to the individual who trained it. This is the most common approach employed in software for personal computers.

A speaker independent system is trained to respond to a word regardless of who speaks it. Therefore the system must respond to a large variety of speech patterns, inflections and enunciations of the target word. The command word count is usually lower than for speaker dependent systems; however, high accuracy can still be maintained within processing limits. Industrial requirements more often call for speaker independent voice systems, such as the AT&T system used in the telephone systems.

A more general form of voice recognition is available through feature analysis, and this technique usually leads to "speaker-independent" voice recognition. Instead of trying to find an exact or near-exact match between the actual voice input and a previously stored voice template, this method first processes the voice input using "Fourier transforms" or "linear predictive coding (LPC)", then attempts to find characteristic similarities between the expected inputs and the actual digitized voice input. These similarities will be present for a wide range of speakers, and so the system need not be trained by each new user. The types of speech differences that the speaker-independent method can deal with, but which pattern matching would fail to handle, include accents and varying speed of delivery, pitch, volume, and inflection. Speaker-independent speech recognition has proven to be very difficult, with some of the greatest hurdles being the variety of accents and inflections used by speakers of different nationalities. Recognition accuracy for speaker-independent systems is somewhat lower than for speaker-dependent systems, usually between 90 and 95 percent. The advantage of speaker-independent systems is that the user does not have to train them, but they perform with lower accuracy. These systems find applications in telephony, such as dictating a number or a word, where many different people use the same system. However, speaker-independent systems need a large, well-prepared training database.

Recognition Style

Speech recognition systems have another constraint concerning the style of speech they can recognize. There are three styles of speech: isolated, connected and continuous.

Isolated speech recognition systems can handle only words that are spoken separately. These are the most common speech recognition systems available today. The user must pause between each word or command spoken. The speech recognition circuit is set up to identify isolated words of 0.96 second length.

Connected speech is a halfway point between isolated word and continuous speech recognition. It allows users to speak multiple words. The HM2007 can be set up to identify words or phrases 1.92 seconds in length. This reduces the word recognition vocabulary to 20.


Continuous is the natural conversational speech we are used to in everyday life. It is extremely difficult for a recognizer to sift through the speech as the words tend to merge together. For instance, "Hi, how are you doing?" sounds like "Hi, howyadoin". Continuous speech recognition systems are on the market and are under continual development.


4. Approaches of Statistical Speech Recognition

a. Hidden Markov model (HMM)-based speech recognition

Modern general-purpose speech recognition systems are generally based on hidden Markov models (HMMs). This is a statistical model which outputs a sequence of symbols or quantities.

One possible reason why HMMs are used in speech recognition is that a speech signal can be viewed as a piecewise stationary or short-time stationary signal. That is, over a short time in the range of 10 milliseconds, speech can be approximated as a stationary process. Speech can thus be thought of as a Markov model for many stochastic processes (known as states).

Another reason why HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use. In speech recognition, to give the very simplest setup possible, the hidden Markov model would output a sequence of n-dimensional real-valued vectors with n around, say, 13, outputting one of these every 10 milliseconds. The vectors, again in the very simplest case, would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short-time window of speech and de-correlating the spectrum using a cosine transform, then taking the first (most significant) coefficients. The hidden Markov model will tend to have, in each state, a statistical distribution called a mixture of diagonal covariance Gaussians which will give a likelihood for each observed vector. Each word, or (for more general speech recognition systems) each phoneme, will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individual trained hidden Markov models for the separate words and phonemes.

The above is a very brief introduction to some of the more central aspects of speech recognition. Modern speech recognition systems use a host of standard techniques which it would be too time consuming to properly explain, but just to give a flavor, a typical large-vocabulary continuous system would probably have the following parts. It would need context dependency for the phones (so phones with different left and right context have different realizations); to handle unseen contexts it would need tree clustering of the contexts; it would of course use cepstral normalization to normalize for different recording conditions, and depending on the length of time that the system had to adapt to different speakers and conditions it might use cepstral mean and variance normalization for channel differences, vocal tract length normalization (VTLN) for male-female normalization and maximum likelihood linear regression (MLLR) for more general speaker adaptation. The features would have delta and delta-delta coefficients to capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis (HLDA); or might skip the delta and delta-delta coefficients and use LDA followed perhaps by heteroscedastic linear discriminant analysis or a global semi tied covariance transform (also known as maximum likelihood linear transform (MLLT)). A serious company with a large amount of training data would probably want to consider discriminative training techniques like maximum mutual information (MMI), MPE, or (for short utterances) MCE, and if a large amount of speaker-specific enrollment data was available a more wholesale speaker adaptation could be done using MAP or, at least, tree-based maximum likelihood linear regression.

Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path, but there is a choice between dynamically creating combination hidden Markov models which include both the acoustic and language model information, or combining them statically beforehand (the AT&T approach, for which their FSM toolkit might be useful). Those who value their sanity might consider the AT&T approach, but be warned that it is memory hungry.
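The Viterbi decoding step mentioned above can be shown on a toy model. The sketch below is not a speech recognizer: the two states, the "low"/"high" energy observations and all probabilities are made-up values chosen only to show how the best path is found.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            # Pick the predecessor that gives the most probable path into s.
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1])
                 for prev in states),
                key=lambda item: item[0],
            )
            step[s] = (prob * emit_p[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])


# Hypothetical toy model: is each frame silence or speech?
states = ["silence", "speech"]
start_p = {"silence": 0.8, "speech": 0.2}
trans_p = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.2, "speech": 0.8},
}
emit_p = {
    "silence": {"low": 0.9, "high": 0.1},
    "speech": {"low": 0.2, "high": 0.8},
}

prob, path = viterbi(["low", "high", "high", "low"],
                     states, start_p, trans_p, emit_p)
# path is ['silence', 'speech', 'speech', 'silence']
```

Real decoders work with log probabilities to avoid numerical underflow on long utterances; plain products are used here only to keep the example short.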

b. Neural network-based speech recognition

Another approach in acoustic modeling is the use of neural networks. They are capable of solving much more complicated recognition tasks, but do not scale as well as HMMs when it comes to large vocabularies. Rather than being used in general-purpose speech recognition applications they can handle low quality, noisy data and speaker independence. Such systems can achieve greater accuracy than HMM based systems, as long as there is training data and the vocabulary is limited. A more general approach using neural networks is phoneme recognition. This is an active field of research, but generally the results are better than for HMMs. There are also NN-HMM hybrid systems that use the neural network part for phoneme recognition and the hidden Markov model part for language modeling.

c. Dynamic time warping (DTW)-based speech recognition

Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed. For instance, similarities in walking patterns would be detected, even if in one video the person was walking slowly and if in another they were walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics -- indeed, any data which can be turned into a linear representation can be analyzed with DTW.

A well known application has been automatic speech recognition, to cope with different speaking speeds. In general, it is a method that allows a computer to find an optimal match between two given sequences (e.g. time series) with certain restrictions, i.e. the sequences are "warped" non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models.
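A minimal sketch of dynamic time warping follows, assuming one-dimensional sequences and absolute difference as the local cost. In a recognizer each element would be a feature vector and the cost a vector distance; the sequences here are invented for illustration.

```python
def dtw_distance(a, b):
    """Cost of the best non-linear alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


template = [0, 1, 2, 3, 2, 1, 0]                      # stored pattern
slow = [0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 1, 1, 0, 0]     # same shape, half speed
other = [3, 2, 1, 0, 1, 2, 3]                         # different shape

same_word = dtw_distance(template, slow)    # 0.0: speed difference is absorbed
different = dtw_distance(template, other)   # clearly larger than zero
```

The slow utterance matches the template at zero cost even though it is twice as long, which is exactly the tolerance to speaking speed that the text describes.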



Speech recognition is the process of finding an interpretation of a spoken utterance; typically, this means finding the sequence of words that were spoken.

This involves preprocessing the acoustic signal to parameterize it in a more usable and useful form. The input signal must be matched against a stored pattern, and a decision is then made to accept or reject the match. No two utterances of the same word or sentence are likely to give rise to the same digital signal. This obvious point not only underlies the difficulty in speech recognition but also means that we should be able to extract more than just a sequence of words from the signal.

The different types of problems we are going to face in our project have been enumerated below: -


The voice of a man differs from the voice of a woman that again differs from the voice of a baby. Different speakers have different vocal tracts and source physiology.

Electrically speaking, the difference is in frequency. Women and babies tend to speak at higher frequencies than men.


No two persons speak with the same loudness. One person will constantly go on speaking in a loud manner while another person will speak in a light tone. Even if the same person speaks the same word on two different instants, there is no guarantee that he will speak the word with the same loudness at the different instants. The problem of loudness also depends on the distance the microphone is held from the user's mouth.

Electrically speaking, the problem of difference is reflected in the amplitude of the generated digital signal.


Even if the same person speaks the same word at two different instants of time, there is no guarantee that he will speak exactly similarly on both the occasions.

Electrically speaking there is a problem of difference in time i.e. indirectly frequency.



Physically the speech signal (actually all sound) is a series of pressure changes in the medium between the sound source and the listener. The most common representation of the speech signal is the oscillogram, often called the waveform. In this the time axis is the horizontal axis from left to right and the curve shows how the pressure increases and decreases in the signal. The utterance we have used for demonstration is "phonetician". The signal has also been segmented, such that each phoneme in the transcription has been aligned with its corresponding sound event.

SPECTROGRAM

Figure: Spectrogram of the utterance "phonetician".



In the spectrogram the time axis is the horizontal axis, and frequency is the vertical axis. The third dimension, amplitude, is represented by shades of darkness. Consider the spectrogram to be a number of spectrums in a row, looked upon "from above", and where the highs in the spectra are represented with dark spots in the spectrogram.

From the picture it is obvious how different the speech sounds are from a spectral point of view.


Now, let's look at the spectrograms of the vowel /i:/ in "three" and "tea".

Figure: Example of vowel /i:/ in different phonetic contexts.


A machine will have to face many problems, when trying to imitate the ability of humans. The audio range of frequencies varies from 20 Hz to 20 kHz. Some external noises have frequencies that may be within this audio range. These noises pose a problem since they cannot be filtered out.


There may be problems due to differences in the electrical properties of different microphones and transmission channels.


Pitch and other source features such as breathiness and amplitude can be varied independently.


We have to make sure that robot does not go out of reach of our voice.

Output of microphone is very small.

Output of Voice recognition chip is not compatible with input required at motors.



After analyzing the problems, we came up with the solutions listed below.

1. Amplitude Variation:-

Amplitude variation of the electrical signal output of microphone may occur mainly due to:

a) Variation of distance between sound source and the transducer.

b) Variation of strength of sound generated by source.

To recognize a spoken word, it does not matter whether it has been spoken loudly or softly. This is because the characteristic features of a spoken word lie in its frequency content and not in its loudness (amplitude). Thus, at a certain stage this amplitude information is suitably normalized.
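One simple way to normalize amplitude is peak normalization, sketched below. This is an illustrative assumption about how such a stage could work, not a description of what the recognition chip does internally; the sample values are made up.

```python
def normalize_peak(samples, target_peak=1.0):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # pure silence: nothing to scale
    scale = target_peak / peak
    return [s * scale for s in samples]


# Hypothetical recordings of the same word, spoken loudly and softly.
loud = [0.0, 0.8, -0.4, 0.2]
quiet = [0.0, 0.2, -0.1, 0.05]

loud_norm = normalize_peak(loud)
quiet_norm = normalize_peak(quiet)
# After normalization both recordings have the same shape and the same peak.
```

Once both recordings are scaled to the same peak, only the shape of the signal (and therefore its frequency content) is left to distinguish one word from another.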

2. Recognition of a word: -

If the same word is spoken twice at different instants, the two utterances sound similar to us; the question arises, what is the similarity between them? It does not matter whether one utterance was louder than the other. The difference lies in frequency. Hence, any large frequency variation would cause the system not to recognize the word. In a speaker-independent system, some logic can be implemented to take care of frequency variation. A small frequency variation, i.e. feature variation within tolerable limits, is considered acceptable.

3. Noise:-

Along with the sound source of the speech the other stray sounds also are picked up by the microphone, thus degrading the information contained in the signal.

4. Microphone response: -

Two different microphones may not have the same response. Hence, if the microphone is changed, or the system is installed on a new PC, the success rate of recognition may drop because of the different response.

5. So that our voice is recognized by the robot at a distance, we will use a wireless mic. In case the robot does not recognize any word, we will make an arrangement such that the robot automatically stops after some time.

6. We will use a microphone pre-amplifier circuit. It is built into the HM2007.

7. We use decoding logic and motor driving circuits so that the chip and motors are made compatible, thereby solving the compatibility problem.


8. One of the important problems which needed to be solved was to provide sufficient current and voltage to the entire assembly when interfaced together. Since the current drawn from the supply was so large that a 9V battery could not last for long, we used a current buffer IC. In our application we have used the 74LS245.



The most challenging part of the entire system is designing and interfacing the various stages together. Our approach was to digitize the analog voice signal and store the frequency and pitch of the words in memory. These stored words are used for matching against the words spoken. When a match is found, the system outputs the address of the stored word. Hence we have to decode the address, and according to the address sensed, the car performs the required task. Since we wanted the car to be wireless, we used an RF module. The address was decoded using a microcontroller and then applied to the RF module. This, together with the driver circuit at the receiver's end, made up the complete intelligent system.

It must be noted that we did not use a wireless mic; instead we used an analog RF module which transmitted 5 different frequencies, one each for right, left, forward, backward and crane movement.
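The address decoding step can be sketched as follows. Several things here are assumptions rather than facts taken from this report: that the recognition chip presents the matched word number as two BCD digits on an 8-bit bus, that the values 55, 66 and 77 are reserved as error codes (word too long, word too short, no match), and that words 1 to 7 were trained in the order of the command list in Chapter 2. The actual firmware ran on a microcontroller; Python is used only to show the logic.

```python
# Assumed error codes reported instead of a word number (see lead-in).
ERROR_CODES = {55: "word too long", 66: "word too short", 77: "no match"}

# Hypothetical training order: word number -> robot command.
COMMANDS = {
    1: "move forward",
    2: "move back",
    3: "turn right",
    4: "turn left",
    5: "load",
    6: "release",
    7: "stop",
}


def bcd_to_int(byte):
    """Convert one packed-BCD byte (two decimal digits) to an integer."""
    tens, ones = (byte >> 4) & 0x0F, byte & 0x0F
    if tens > 9 or ones > 9:
        raise ValueError("not a valid BCD byte: 0x%02X" % byte)
    return tens * 10 + ones


def decode(byte):
    """Map the chip's output byte to a command, failing safe to 'stop'."""
    word = bcd_to_int(byte)
    if word in ERROR_CODES:
        return "stop"  # nothing was recognized: do not keep moving
    return COMMANDS.get(word, "stop")  # untrained word numbers also stop


assert decode(0x03) == "turn right"
assert decode(0x77) == "stop"
```

Falling back to "stop" for every unrecognized or untrained value is a design choice made for this sketch, in the spirit of the automatic-stop arrangement described in the solutions chapter.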


a. Voice Recognition Module

b. Microcontroller and Decoder

c. RF module

d. Motor Driver Circuit

e. Buffer


Block Diagram:


Voice Recognition Module

The speech recognition module basically consists of:

Voice Recognition Chip: It is the heart of the entire system. The HM2007 is a voice recognition chip with on-chip analog front end, voice analysis, recognition process and system control functions. The input voice command is analyzed, processed and recognized, and the result is obtained at one of its output ports; it is then decoded, amplified and given to the motors of the robot car.


We initially used an Indian-manufactured voice recognition chip, the AP7003. It is a monolithic speaker-dependent speech recognition IC designed for toy applications. The AP7003 consists of a microphone amplifier, A/D converter, speech processor and I/O controller. After pre-recording, the AP7003 can recognize up to 12 different sentences, each up to 1.5 sec long, with highly programmable I/O. However, it was not very accurate or reliable, so we started looking for an alternative. We found the HM2007 to be the right choice.

The chip provides the option of recognizing either forty 0.96 second words or twenty 1.92 second words; the circuit allows the user to choose between the two. For memory the circuit uses an 8K X 8 static RAM.
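The trade-off between word length and vocabulary size can be checked with simple arithmetic. The sketch below assumes, purely for illustration, that the whole 8K X 8 RAM is shared equally among the word templates with no overhead; how the chip really lays out its memory is not stated here.

```python
RAM_BYTES = 8 * 1024  # 8K x 8 static RAM = 8192 bytes


def vocabulary_budget(words, seconds_per_word):
    """Return (total seconds of speech, bytes available per word)."""
    total_seconds = words * seconds_per_word
    bytes_per_word = RAM_BYTES / words  # assumes an even split of the RAM
    return total_seconds, bytes_per_word


short_mode = vocabulary_budget(40, 0.96)  # forty short words
long_mode = vocabulary_budget(20, 1.92)   # twenty long words
```

Both modes cover the same 38.4 seconds of speech in total, so doubling the word length halves the vocabulary; under the even-split assumption each word gets about 205 bytes in the 40-word mode and about 410 bytes in the 20-word mode.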

The chip has two operational modes: manual mode and CPU mode. The CPU mode is designed to allow the chip to work under a host computer. This is an attractive approach to speech recognition for computers because the speech recognition chip operates as a co-processor to the main CPU. The jobs of listening and recognition do not occupy any of the computer's CPU time. When the HM2007 recognizes a command it can signal an interrupt to the host CPU and then relay the command code. The HM2007 chip can be cascaded to provide a larger word recognition library.

The circuit we are building operates in the manual mode. The manual mode allows one to build a stand alone speech recognition board that doesn't require a host computer and may be integrated into other devices to utilize speech control.

The major components of this design are: a speech recognition chip, memory, keypad, and LED 7-segment display. The chip is designed for speaker-dependent (one-user) applications, but can be manipulated to perform speaker-independent (multiple-user) applications. The keypad and LED 7-segment display will be used to program and test the voice recognition circuit.


More about the HM2007 chip

The HM2007 is a single-chip complementary metal-oxide semiconductor (CMOS) voice-recognition large-scale integration (LSI) circuit. The chip contains an analog front end, voice analysis, recognition, and system control functions. The chip may be used stand-alone or connected to a CPU.


Single-chip voice-recognition CMOS LSI


External RAM support

Maximum of 40-word recognition

Maximum word length of 1.92 s

Microphone support

Manual and CPU modes available

Response time less than 300 milliseconds (ms)

5 volt (5V) power supply

The system we are building is trained as speaker dependent (single user). Thus the user will be its real master.

Microphone: It picks up the voice commands and sends them to the voice recognition chip (HM2007) in the form of an electrical signal.

The human ear has an auditory range of roughly 20 Hz to 20 kHz. Sound can be picked up easily using a microphone and amplifier. Microphones typically have an auditory range that surpasses that of human hearing.

Microphones are transducers which detect sound signals and produce an electrical image of the sound, i.e., they produce a voltage or a current which is proportional to the sound signal. The most common microphones for musical use are dynamic, ribbon, or condenser microphones. Besides the variety of basic mechanisms, microphones can be designed with different directional patterns and different impedances.


Dynamic Microphones


Principle: sound moves the cone and the attached coil of wire moves in the field of a magnet. The generator effect produces a voltage which "images" the sound pressure variation - characterized as a pressure microphone.


Relatively cheap and rugged.

Can be easily miniaturized.



The uniformity of response to different frequencies does not match that of the ribbon or condenser microphones.




Ribbon Microphones


Principle: the air movement associated with the sound moves the metallic ribbon in the magnetic field, generating an imaging voltage between the ends of the ribbon which is proportional to the velocity of the ribbon. It is therefore characterized as a "velocity" microphone.

Advantages:

Adds "warmth" to the tone by accenting lows when close-miked.

Can be used to discriminate against distant low frequency noise in its most common gradient form.

Disadvantages:

Accenting lows sometimes produces "boomy" bass.

Very susceptible to wind noise. Not suitable for outside use unless very well shielded.


Condenser Microphones

Principle: sound pressure changes the spacing between a thin metallic membrane and the stationary back plate. The plates are charged to a total charge Q = CV, where C is the capacitance and V the voltage of the biasing battery. A change in plate spacing will cause a change in charge Q and force a current through resistance R. This current "images" the sound pressure, making this a "pressure" microphone.

Advantages:

Best overall frequency response makes this the microphone of choice for many recording applications.

Disadvantages:

May pop and crack when close-miked.

Requires a battery or external power supply to bias the plates.
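The condenser principle described above (a charge set by the capacitance C and the bias voltage V, with a current flowing when the plate spacing changes) can be written out as a short worked relation. This is only a sketch under idealized assumptions that are not in the original text: a parallel-plate capacitor of plate area A and spacing x, with permittivity of free space \varepsilon_0, and a bias voltage V that stays constant.

```latex
\begin{align}
Q &= C V, \qquad C = \frac{\varepsilon_0 A}{x} \\
i &= \frac{dQ}{dt} = V \frac{dC}{dt}
   = -\frac{\varepsilon_0 A V}{x^{2}} \, \frac{dx}{dt}
\end{align}
```

Under these assumptions the current through R follows the rate of change of the plate spacing, which is how the sound pressure variation ends up "imaged" in the electrical signal.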

Pop filters in front of mics.

Some microphones are very sensitive to minor gusts of wind, so sensitive in fact that they will produce a loud pop if you breathe on them. To protect these mics (some of which can actually be damaged by blowing into them), engineers will often mount a nylon screen between the mic and the artist. This is not the most common reason for using pop filters though:

Vocalists like to move around when they sing; in particular, they will lean into microphones. If the singer is very close to the mic, any motion will produce drastic changes in level and sound quality. (You have seen this with inexpert entertainers using hand held mics.) Many engineers use pop filters to keep the artist at the proper distance. The performer may move slightly in relation to the screen, but that is a small proportion of the distance to the microphone.


Keypad: It is used for training/programming the chip. It also allocates definite memory locations to voice commands. The keypad is made up of 12 switches.


Figure 2

When the circuit is turned on, the HM2007 checks the static RAM. If everything checks out, the board displays "00" on the digital display and lights the red LED (READY). It is now in the "Ready" state, waiting for a command.


7-segment Display: It is used to test the voice recognition circuit.

The 7-segment display is used as a numerical indicator on many types of test equipment. It is an assembly of light emitting diodes which can be powered individually. They most commonly emit red light. Powering all the segments will display the number 8. Powering a, b, c, d and g will display the number 3. Numbers 0 to 9 can be displayed. The d.p. represents a decimal point.

The one shown is a common anode display since all anodes are joined together and go to the positive supply. The cathodes are connected individually to zero volts. Resistors must be placed in series with each diode to limit the current through each diode to a safe value.

Common cathode displays where all the cathodes are joined are also available.
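The segment descriptions above can be captured in a small lookup table. The following is a minimal sketch in C, not the circuit used in this project (which relies on a 7448 decoder chip); the bit assignment (bit 0 = segment a through bit 6 = segment g) and the function names are assumptions made for illustration.

```c
#include <stdint.h>

/* Assumed bit convention: bit 0 = segment a ... bit 6 = segment g. */
enum { SEG_A = 1 << 0, SEG_B = 1 << 1, SEG_C = 1 << 2, SEG_D = 1 << 3,
       SEG_E = 1 << 4, SEG_F = 1 << 5, SEG_G = 1 << 6 };

/* Segments that must be lit for each digit 0..9. */
static const uint8_t DIGIT_SEGMENTS[10] = {
    SEG_A | SEG_B | SEG_C | SEG_D | SEG_E | SEG_F,         /* 0 */
    SEG_B | SEG_C,                                         /* 1 */
    SEG_A | SEG_B | SEG_D | SEG_E | SEG_G,                 /* 2 */
    SEG_A | SEG_B | SEG_C | SEG_D | SEG_G,                 /* 3 */
    SEG_B | SEG_C | SEG_F | SEG_G,                         /* 4 */
    SEG_A | SEG_C | SEG_D | SEG_F | SEG_G,                 /* 5 */
    SEG_A | SEG_C | SEG_D | SEG_E | SEG_F | SEG_G,         /* 6 */
    SEG_A | SEG_B | SEG_C,                                 /* 7 */
    SEG_A | SEG_B | SEG_C | SEG_D | SEG_E | SEG_F | SEG_G, /* 8 */
    SEG_A | SEG_B | SEG_C | SEG_D | SEG_F | SEG_G          /* 9 */
};

/* Returns the lit-segment mask for a digit, or 0 (blank) if out of range. */
uint8_t segments_for_digit(unsigned digit) {
    return digit < 10 ? DIGIT_SEGMENTS[digit] : 0;
}

/* Common anode: a segment lights when its cathode pin is LOW, so the
   pin levels are the inverse of the lit-segment mask. */
uint8_t common_anode_pins(unsigned digit) {
    return (uint8_t)(~segments_for_digit(digit) & 0x7F);
}
```

For a common cathode display the lit-segment mask can be used directly as the pin levels, since a segment lights when its anode pin is HIGH.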


Applications and Drivers

A numeral to be displayed on a seven-segment display is usually encoded in BCD form, and a logic circuit drives the proper segments of the display ON or OFF. This logic is also called a decoder. Various decoders are available to drive common anode and common cathode displays. Two easily available ones are the 7447 and 7448 TTL decoders. The 7447 has active-low, open-collector outputs designed to pull down the segments of a common anode display, while the 7448 has active-high outputs for a common cathode display; the segments are driven through external current-limiting resistors.

We used the 7448 decoder chip to drive a common cathode seven-segment display.

Circuit Diagram of voice recognition module:

8K x 8 RAM: It stores the voice commands processed by the chip at the assigned locations.


Output of Voice recognition module

The 8-bit output is taken from the output of the 74LS373 octal data latch. The output is not a standard 8-bit byte; it is broken into two 4-bit binary coded decimal (BCD) nibbles. BCD code is related to standard binary numbers as the table below illustrates.

As you can see, the binary and BCD numbers remain the same until reaching decimal 10. At decimal 10, BCD jumps to the upper nibble and the lower nibble resets to zero. The binary numbers continue to decimal 15, and then jump to the upper nibble at 16, where the lower nibble resets. If a computer is expecting to read an 8-bit binary number and BCD is provided, this will cause errors. Furthermore, the module outputs the numbers 55, 66 and 77 as default values for errors, and since we do not want these outputs to be acted upon, we use a microcontroller.
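The difference between the two encodings can be shown with a short conversion routine. This is a minimal sketch in C; the function name `bcd_to_binary` is an assumption for illustration and is not taken from the project's firmware.

```c
#include <stdint.h>

/* Convert a packed-BCD byte (tens in the upper nibble, units in the lower
   nibble) to its ordinary binary value. Returns -1 if either nibble is not
   a valid decimal digit. */
int bcd_to_binary(uint8_t bcd) {
    uint8_t tens = (uint8_t)(bcd >> 4);
    uint8_t units = (uint8_t)(bcd & 0x0F);
    if (tens > 9 || units > 9)
        return -1;
    return tens * 10 + units;
}
```

For example, the byte 0x10 means decimal 10 when read as BCD, but would be taken as 16 by a reader expecting plain binary.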


Microcontroller and driver circuit

Decoder: It is the second most important part of the project. The output from the chip is given to the decoder (microcontroller), which acts as a DMC, i.e. a Digital Motor Controller. The DMC senses the output ports of the HM2007 chip and produces the proper output for each of the commands forward, backward, left, right, load, release and stop. The proper functioning of the system depends on correct decoding logic.

We use port 0 as the input port and port 1 as the output port.

P0.0 to P0.6 receive inputs from 7 output pins of the voice recognition module, while P0.7 is kept grounded.
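The decoding logic described above can be sketched in portable C as follows. This is not the project's actual program: the assignment of word numbers 1 to 7 to the seven commands, the names `command_t`, `decode_port0` and `port1_pattern`, and the port 1 bit patterns are all hypothetical placeholders. On the real 89S51 the input byte would be read from port 0 and the result written to port 1. Note that seven input lines are enough here, since even the largest code the module produces (77, i.e. 0111 0111) fits in 7 bits.

```c
#include <stdint.h>

typedef enum {
    CMD_NONE = 0, CMD_FORWARD, CMD_BACKWARD, CMD_LEFT, CMD_RIGHT,
    CMD_LOAD, CMD_RELEASE, CMD_STOP
} command_t;

/* Decode the byte seen on port 0 into a command.
   Assumption: words 1..7 were trained in the order listed in command_t. */
command_t decode_port0(uint8_t p0) {
    uint8_t code = (uint8_t)(p0 & 0x7F);   /* P0.7 is grounded, ignore it */
    uint8_t tens = (uint8_t)(code >> 4);
    uint8_t units = (uint8_t)(code & 0x0F);
    uint8_t word;
    if (units > 9)
        return CMD_NONE;                   /* not a valid BCD digit */
    word = (uint8_t)(tens * 10 + units);
    if (word >= 1 && word <= 7)
        return (command_t)word;
    return CMD_NONE;                       /* 00, unused words, 55/66/77 */
}

/* Hypothetical port 1 patterns, one per command (placeholders only). */
uint8_t port1_pattern(command_t cmd) {
    static const uint8_t PATTERN[8] = {
        0x00, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x00
    };
    return PATTERN[cmd];
}
```

Because anything outside the trained range maps to `CMD_NONE`, the error values 55, 66 and 77 never reach the motors in this sketch.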


Microcontroller circuit:



The table shows the output codes generated for the different commands after programming the microcontroller.

(For the wireless car, this output is the input to the RF module and then goes to the motors through the driver circuit.)

(For the wired car, this output is the input directly to the driver circuit.)


Keil 2 µVision

It is software which allows us to write the program in C language or in a basic language, as per the user's convenience. The program can then be converted into hex code. This makes programming simpler, since there is no need to look up the opcodes for commands.


Aec_isp_v3 µC Programmer

It is used to program the 89S51, 89S52 and 89S53.

It reads hex files and programs them into the microcontroller.

Running the Software: Your code needs to be in Intel Hex format. AEC_ISP will open the file you specify and load it into a buffer. You can specify a default file on the command line; e.g., to specify TEST.HEX as the default file, start by typing 'AEC_ISP TEST.HEX'.
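Since the programmer expects Intel Hex input, it can be useful to check a record by hand before loading a file. The sketch below is not part of AEC_ISP; it is a stand-alone illustration in C with an assumed function name, `intel_hex_record_ok`. It relies only on the general layout of an Intel Hex record: a colon followed by hex byte pairs for the length, two address bytes, the record type, the data bytes and a checksum, where all bytes together sum to zero modulo 256.

```c
#include <stddef.h>
#include <string.h>

/* Value of one hex digit, or -1 if the character is not a hex digit. */
static int hex_digit(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Returns 1 if `record` (one line, without the line ending) is a
   well-formed Intel Hex record with a correct checksum, 0 otherwise. */
int intel_hex_record_ok(const char *record) {
    size_t len = strlen(record);
    size_t nbytes, i;
    unsigned sum = 0, data_len = 0;

    if (len < 11 || record[0] != ':' || (len - 1) % 2 != 0)
        return 0;
    nbytes = (len - 1) / 2;
    for (i = 0; i < nbytes; ++i) {
        int hi = hex_digit(record[1 + 2 * i]);
        int lo = hex_digit(record[2 + 2 * i]);
        unsigned byte;
        if (hi < 0 || lo < 0)
            return 0;
        byte = (unsigned)(hi * 16 + lo);
        if (i == 0)
            data_len = byte;        /* first byte is the data length */
        sum += byte;
    }
    /* length + address (2) + type + data + checksum */
    if (nbytes != data_len + 5)
        return 0;
    return (sum & 0xFF) == 0;
}
```

For instance, the end-of-file record `:00000001FF` passes this check, while the same record with the last byte changed to `FE` does not.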


RF module:

Let's take a closer look at the RC truck we saw in the 1st chapter. We will assume that the exact frequency used is 27.9 MHz. Here's the sequence of events that takes place when you use the RC transmitter:

You press a trigger to make the truck go forward.

The trigger causes a pair of electrical contacts to touch, completing a circuit connected to a specific pin of an integrated circuit (IC).

The completed circuit causes the transmitter to transmit a set sequence of electrical pulses.

Each sequence contains a short group of synchronization pulses, followed by the pulse sequence. For our truck, the synchronization segment, which alerts the receiver to incoming information, is four pulses that are 2.1 milliseconds (thousandths of a second) long, with 700-microsecond (millionths of a second) intervals. The pulse segment, which tells the antenna what the new information is, uses 700-microsecond pulses with 700-microsecond intervals.


Figure: A typical RC signal transmission

Here are the pulse sequences used in the pulse segment:

1. Forward: 16 pulses

2. Backward: 40 pulses

3. Forward/Left: 28 pulses

4. Forward/Right: 34 pulses

5. U-turn: 52 pulses

6. Crane movement: 46 pulses

The transmitter sends bursts of radio waves that oscillate with a frequency of 27,900,000 cycles per second (27.9 MHz).

The truck is constantly monitoring the assigned frequency (27.9 MHz) for a signal. When the receiver receives the radio bursts from the transmitter, it sends the signal to a filter that blocks out any signals picked up by the antenna other than 27.9 MHz. The remaining signal is converted back into an electrical pulse sequence.

The pulse sequence is sent to the IC in the truck, which decodes the sequence and starts the appropriate motor. For our example, the pulse sequence is 16 pulses (forward), which means that the IC sends positive current to the motor running the wheels. If the next pulse sequence were 40 pulses (reverse), the IC would invert the current to the same motor to make it spin in the opposite direction.

The motor's shaft actually has a gear on the end of it, instead of connecting directly to the axle. This decreases the motor's speed but increases the torque, giving the truck adequate power through the use of a small electric motor!

The truck moves forward.
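The pulse counts listed above lend themselves to a simple decoder. The following C sketch is only an illustration of what the receiver IC does conceptually; the names `rc_command_t`, `rc_command_from_pulses` and `rc_frame_duration_us` are assumptions, and the duration estimate additionally assumes that every pulse is followed by exactly one 700-microsecond interval.

```c
typedef enum {
    RC_UNKNOWN = 0, RC_FORWARD, RC_BACKWARD, RC_FORWARD_LEFT,
    RC_FORWARD_RIGHT, RC_U_TURN, RC_CRANE
} rc_command_t;

/* Map the number of pulses in the pulse segment to a command. */
rc_command_t rc_command_from_pulses(int pulses) {
    switch (pulses) {
    case 16: return RC_FORWARD;
    case 40: return RC_BACKWARD;
    case 28: return RC_FORWARD_LEFT;
    case 34: return RC_FORWARD_RIGHT;
    case 52: return RC_U_TURN;
    case 46: return RC_CRANE;
    default: return RC_UNKNOWN;
    }
}

/* Approximate frame duration in microseconds: four 2.1 ms sync pulses
   plus `pulses` 700 us pulses, each followed by one 700 us interval
   (an assumption about how the intervals are counted). */
long rc_frame_duration_us(int pulses) {
    const long sync = 4L * (2100 + 700);
    return sync + (long)pulses * (700 + 700);
}
```

Under these assumptions a "forward" frame of 16 pulses lasts about 33.6 ms, so the transmitter can repeat a command many times per second.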


Buffer: We used the IC 74LS245 as a buffer IC. It solved the current supply problem. It is a 3-state octal bus transceiver, designed for asynchronous two-way communication between data buses. The device allows data transmission from the A bus to the B bus or vice versa, depending upon the logic level at the direction control (DIR) input. The enable input (G, active low) can be used to disable the device so that the buses are effectively isolated.
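The behaviour of the transceiver described above can be modelled in a few lines of C. This is a behavioural sketch only, with assumed names (`buses_t`, `transceiver_245`); it follows the usual 74LS245 function table, in which DIR high passes A data to the B bus, DIR low passes B data to the A bus, and a high level on the active-low enable isolates the buses.

```c
#include <stdint.h>

typedef struct {
    uint8_t a;   /* value present on the A bus */
    uint8_t b;   /* value present on the B bus */
} buses_t;

/* Behavioural model of one octal bus transceiver.
   enable_n: active-low enable (0 = enabled, 1 = buses isolated)
   dir:      1 = A drives B, 0 = B drives A */
buses_t transceiver_245(buses_t in, int enable_n, int dir) {
    buses_t out = in;
    if (enable_n)
        return out;      /* isolated: neither bus is driven */
    if (dir)
        out.b = in.a;    /* A data to B bus */
    else
        out.a = in.b;    /* B data to A bus */
    return out;
}
```

In a one-way buffering role such as the one between the microcontroller and the RF module, DIR would simply be tied to a fixed level so that data always flows in the same direction.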


Batteries are by far the most commonly used electric power supply for robotics. Batteries are so commonplace that it’s easy to take them for granted. An understanding of batteries will help you choose batteries that will optimize your robot’s design.

Primary batteries

Primary batteries are one-time-use batteries. The batteries we will look at here deliver 1.5 V per cell. They are designed to deliver their rated electrical capacity and then be discarded. When building robotic systems, discarding depleted primary batteries can become expensive. However, one advantage of using primary batteries is that they typically have a greater electrical capacity than rechargeables. If one is engaged in a function (e.g., a robotic war) that requires the highest power density available for one-shot use, primary batteries may be the way to go.

Secondary batteries

Secondary batteries are rechargeable. The most common rechargeable batteries are NiCds and lead-acid. Secondary batteries, while initially more expensive, are cheaper in the long run. Typically secondary batteries can be recharged 200 to 1000 times.



To record or train a command, the chip stores the analog signal pattern and amplitude and saves it in the 8K x 8 SRAM. In recognition mode, the chip compares the user's analog signal from the microphone with those stored in the SRAM, and if it recognizes a command, the command identifier is sent to the microcontroller through the D0 to D7 ports of the chip. The keypad and 7-segment display are used for training, for testing (whether a word is recognized properly) and for clearing the memory.

To Train:

To train the circuit, begin by pressing the number of the word you want to train on the keypad. Use any number between 1 and 40. For example, press the number "1" to train word number 1. When you press the number(s) on the keypad, the red LED will turn off and the number is displayed on the digital display. Next press the "#" key for train. When the "#" key is pressed, it signals the chip to listen for a training word and the red LED turns back on. Now speak the word you want the circuit to recognize clearly into the microphone. The LED should blink off momentarily; this is a signal that the word has been accepted.

Continue training new words in the circuit using the procedure outlined above. Press the "2" key then the "#" key to train the second word, and so on. The circuit will accept up to forty words. You do not have to enter 40 words into memory to use the circuit; you can use as many word spaces as you want.
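The key sequence described above (one or two digits, then "#") can be modelled as a tiny state machine. The sketch below is a software illustration of the procedure, not the HM2007's internal logic; the names `keypad_t` and `keypad_press` are hypothetical, and the handling of extra digits or other keys (both ignored here) is an assumption.

```c
typedef struct {
    int entered;   /* word number typed so far */
    int digits;    /* how many digits have been typed */
} keypad_t;

/* Feed one key press. Returns the word number (1..40) to train when '#'
   completes a valid entry, and 0 in every other case. */
int keypad_press(keypad_t *kp, char key) {
    if (key >= '0' && key <= '9') {
        if (kp->digits < 2) {            /* assumed: at most two digits */
            kp->entered = kp->entered * 10 + (key - '0');
            kp->digits++;
        }
        return 0;
    }
    if (key == '#') {
        int word = kp->entered;
        kp->entered = 0;                 /* ready for the next entry */
        kp->digits = 0;
        return (word >= 1 && word <= 40) ? word : 0;
    }
    return 0;                            /* other keys ignored in this sketch */
}
```

Starting from a zero-initialized `keypad_t`, pressing "2", "5" and "#" yields word number 25, while "4", "1", "#" is rejected because it exceeds the forty-word limit.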


The circuit is continually listening. Repeat a trained word into the microphone. The number of the word should be displayed on the digital display. For instance, if the word "directory" was trained as word number 25, saying the word "directory" into the microphone will cause the number 25 to be displayed.

Error Codes:

The chip provides the following error codes:

55 = word too long

66 = word too short

77 = word no match



We believe such a system would find a wide variety of applications. Menu-driven systems such as e-mail readers, household appliances like washing machines and microwave ovens, and pagers and mobiles will become voice controlled in the future.

The robot is useful in places that humans find difficult to reach but that the human voice can reach, e.g. in a small pipeline, in fire situations, or in highly toxic areas.

The robot can be used as a toy.

It can be used to bring and place small objects.

It is one of the important stages in the development of humanoid robots.

Command and control of appliances and equipment

Telephone assistance systems

Data entry

Speech and voice recognition security systems



Parts list for speech-recognition circuit

(1) IC1 HM2007 IC

(1) IC2 SRAM 8K X 8

(1) IC3 74LS373

(2) IC4 and IC5 7448

(1) XTAL 3.57 MHz

(1) Speech-recognition PCB

(1) 12-contact keypad

(2) 7-segment displays

(2) 16-pin, 220-ohm, 1/4W resistor packs

(1) 22K-ohm, 1/4-W resistor

(1) 5.6K-ohm, 1/4-W resistor

(1) 0.0047-uF cap

(1) C2 100-uF, 16V cap

(1) C5 0.1-uF cap

(1) 7805 voltage regulator

(1) Microphone

(1) 9V battery clip

Parts list for interface circuit

(1) Microcontroller 89S51

(1) 74LS373 Octal D flip-flop tri-state

(4) 220 ohm 7-pin Resistor Bank

(10) Miniature LEDs

(1) RF module

(1) 40 MHz crystal

(3) DC motors

(4) Male-Female 7-pin connectors

Parts available from: Images Company

39 Seneca Loop

Staten Island, NY 10314




Chapter 11. DATASHEET

Single-chip voice-recognition CMOS LSI


External RAM support

Maximum of 40-word recognition

Maximum word length of 1.92 s

Microphone support

Manual and CPU modes available

Response time less than 300 milliseconds (ms)

5 volt (5V) power supply













Chapter 12. Project Progress Report Summary

Calendar year 2006:

June - Work started.

July - Gathered useful information on voice processing techniques and microphone properties. (Chapter 3, 4)

August - We tried another chip, the AP7003-02, manufactured by the Indian company A-plus India. (Page 20)

September - We built a voice recognition module using the AP7003-02.

October - Our attempts with the AP7003-02 did not succeed.

November - Tried to find a better alternative, but finally decided to go with the HM2007 and to get it imported from the US. (Page 19, 20; Chapter 11)

December - Project work was on hold.

Calendar year 2007:

January - In the 2nd week of January we worked on the voice recognition part and the circuit was soldered. In the last week we got the desired output from the voice recognition module. (Page 26)

February - We worked on the microcontroller part. After solving a lot of minor problems, we managed to complete it. At the end of February we had some success with our wired model using a proper driver circuit. (Page 28, 29,


March - We made use of our discarded toy car, decoded its remote control logic and matched it with our microcontroller output. Finally, with buffers added between the microcontroller and the RF module, we were able to bring the entire circuit together. At this point we also won a certificate in project paper presentation. (Page 33, 34, 35)

April - Project complete.



Web: for selecting motors and other robotic concepts; for microphone types and properties; for understanding microphone concepts, RF radio working and other related concepts.



The 8051 Microcontroller, Kenneth Ayala, 3rd reprint, 2005; Thomson Asia Ltd., Singapore; Chapters 3, 6, 7 and 8. For programming the 89S51.

Modern Digital Electronics, R. P. Jain, 3rd edition; Tata McGraw-Hill; Chapters 6 and 10. For A/D converter and 7-segment display connections.


Keil2 software

Used for simulating the microcontroller program


Used for burning/programming the microcontroller