
HAND GESTURE RECOGNITION SYSTEM USING HAAR WAVELET

A PROJECT REPORT

Submitted by

J.JENKIN WINSTON (Reg. No: 96207106036)
M.MARIA GNANAM (Reg. No: 96207106056)
R.RAMASAMY (Reg. No: 96207106306)

in partial fulfillment for the award of the degree of

BACHELOR OF ENGINEERING
in ELECTRONICS AND COMMUNICATION ENGINEERING

NATIONAL ENGINEERING COLLEGE, KOVILPATTI
ANNA UNIVERSITY OF TECHNOLOGY, TIRUNELVELI - 627 007.
March 2011

ANNA UNIVERSITY OF TECHNOLOGY, TIRUNELVELI

BONAFIDE CERTIFICATE
Certified that this project report titled HAND GESTURE RECOGNITION SYSTEM USING HAAR WAVELET is the bonafide work of J.JENKIN WINSTON (96207106036), M.MARIA GNANAM (96207106056), R.RAMASAMY (96207106306) who carried out the project work under my supervision.

Signature
Head of the Department
Dr.V.Vijayarangan, B.E., M.Sc.(Engg), Ph.D.,
Department of ECE,
National Engineering College,
Kovilpatti - 628 503

Signature
Supervisor
Mr.M.Sundaram, M.E.,
Sr.Lecturer/ECE,
National Engineering College,
Kovilpatti - 628 503

Submitted for Viva-Voce Examination held at NATIONAL ENGINEERING COLLEGE, Kovilpatti on ____________

Internal Examiner

External Examiner

ACKNOWLEDGEMENT

First and foremost we express our wholehearted gratitude to the Almighty for having given us the wisdom and courage to take up this project.

We wish to express our sincere thanks to the Director, Dr.Kn.K.S.K.Chockalingam, B.E., M.Sc(Engg)., Ph.D., who helped us in carrying out our project successfully. We would also like to express our sincere thanks to our former Principal, Dr.N.S.Marimuthu, B.E., M.Sc(Engg)., Ph.D., for providing us the opportunity to do this project.

Our heartfelt acknowledgement goes to the Professor and Head of the Department of Electronics and Communication Engineering, Dr.V.Vijayarangan, B.E., M.Sc(Engg)., Ph.D., for his valuable and consistent encouragement in carrying out the project. Our gratitude is no less for our project coordinators, Mr.N.Arumugam, M.E., Assistant Professor, and Mrs.S.D.Jayavathi, M.E., Senior Lecturer, in the Department of Electronics and Communication, for their encouragement.

We express our deepest gratitude to our guide, Mr.M.Sundaram, M.E., Senior Lecturer, in the Department of Electronics and Communication Engineering, for rendering excellent guidance, for being extremely kind and approachable, and for being a great source of support and encouragement throughout the course of the project work.

We hereby acknowledge the efforts of all staff members and technicians of the Electronics and Communication Engineering Department, whose help was instrumental in the completion of our project.

Also we would like to express our hearty thanks to our beloved parents and dear friends for their valuable suggestions and cooperation for the project.

ABSTRACT

With the increasing growth of technology and the entrance into the digital age, we can address the difficulties faced by handicapped people in a new way. The sign language they use for communication is not understandable by everyone, and this isolates them from the speaking community. We have therefore aimed at providing an effective means of communication for dumb people by programming a gesture recognition system based on image processing. Our algorithm is developed in Matlab to recognize static hand gestures, namely a subset of American Sign Language (ASL). It is fairly robust to background clutter and uses skin color for hand gesture tracking and recognition. In this project we have reduced the database size by normalizing the orientation of hands using the idea of a principal axis. We have used a correlation factor to improve the degree of recognition. Every human has a hand geometry different from every other; to accommodate this we use a transform that converts an image into a feature vector, which is then compared with the feature vectors of a training set of gestures. Improving on all these features decreases the computation time. This method is very compact and handy compared to other hand gesture recognition systems.

TABLE OF CONTENTS

CHAPTER NO    TITLE

              ABSTRACT
              LIST OF FIGURES
              LIST OF ABBREVIATIONS

1             INTRODUCTION
              1.1 Prelude
              1.2 Need for sign language
              1.3 American sign language
              1.4 Gesture recognition
                  1.4.1 Gesture recognition and pen computing
                  1.4.2 Gesture types
                  1.4.3 Uses

2             LITERATURE SURVEY

3             BACKGROUND
              3.1 Existing system
              3.2 Problem statement

4             METHODOLOGY
              4.1 Image capturing devices
                  4.1.1 Challenges
              4.2 Significance of grayscale images
                  4.2.1 Grayscale as single channel of multichannel color images
              4.3 Hand segmentation
                  4.3.1 Threshold selection
                  4.3.2 Adaptive thresholding
                  4.3.3 Multiband thresholding
              4.4 Morphological operation
                  4.4.1 Structuring element
                  4.4.2 Image closing
                  4.4.3 Effect of image closing
              4.5 Image registration
                  4.5.1 Algorithm classifications
                      4.5.1.1 Intensity based vs feature based
                      4.5.1.2 Spatial vs frequency domain methods
                      4.5.1.3 Single vs multi-modality methods
                      4.5.1.4 Automatic vs interactive methods
                  4.5.2 Uncertainty
                  4.5.3 Transformation methods
                  4.5.4 Radon transform
              4.6 Feature extraction
                  4.6.1 Wavelets
                  4.6.2 Wavelet transform
                  4.6.3 The discrete wavelet transform
                  4.6.4 2D-Discrete wavelet transform

5             OVERVIEW OF THE PROJECT
              5.1 An overlay of our algorithm
              5.2 Proposed work

6             SOFTWARE DESCRIPTION
              6.1 Introduction
              6.2 Features of Matlab
                  6.2.1 Command window
                  6.2.2 Graphics window
                  6.2.3 Edit window
                  6.2.4 Input output
                  6.2.5 Data type
                  6.2.6 Dimensioning
                  6.2.7 Case sensitivity
              6.3 Images in Matlab
              6.4 File types
                  6.4.1 M-files
                  6.4.2 Script files
                  6.4.3 Function files
                  6.4.4 MAT-files

7             SIMULATION RESULT

              CONCLUSION

              REFERENCES

LIST OF FIGURES

FIGURE NO.    TITLE

1.1           ASL examples
4.1           A model gray image
4.2           Three channels of a RGB image
4.3           Original image
4.4           Example of a threshold effect used on an image
4.5           Structuring element
4.6           Effect of closing using a 3x3 square structuring element
4.7           Multi-resolution expansion using Haar wavelet
5.1           Overlay of our algorithm
7.1           Resized image
7.2           Gray scale image
7.3           Segmented hand
7.4           Morphologically operated image
7.5           Normalized image
7.6           Horizontal vector of DWT

LIST OF ABBREVIATIONS

ASL       American Sign Language
CAD       Computer Aided Design
PUI       Perceptual User Interface
GUI       Graphical User Interface
HMI       Human Machine Interface
MRI       Magnetic Resonance Imaging
CT        Computed Tomography
PET       Positron Emission Tomography
DWT       Discrete Wavelet Transform
STFT      Short Time Fourier Transform
MATLAB    Matrix Laboratory

CHAPTER 1 INTRODUCTION

1.1. PRELUDE

Computers have become so tightly integrated with everyday life that new applications and hardware are constantly being introduced. The means of communicating with computers, however, are still limited to keyboards, mice, light pens, trackballs, keypads and the like. These devices have grown familiar but inherently limit the speed and naturalness with which we interact with the computer.

Recently, there has been a surge of interest in recognizing human hand gestures. Hand gesture recognition has various applications such as computer games, machinery control (e.g. of a crane), and thorough mouse replacement. One of the most structured sets of gestures belongs to sign language. In sign language, each gesture has an assigned meaning (or meanings). Computer recognition of hand gestures may provide a more natural computer interface, allowing people, for example, to point, or to rotate a CAD model by rotating their hands.

Hand gestures can be classified into two categories: static and dynamic. A static gesture is a particular hand configuration and pose, represented by a single image. A dynamic gesture is represented by a sequence of images. We focus on the recognition of static images.

The reliance on sign language among dumb people results in linguistic isolation from the general community, as the overwhelming majority of hearing people do not understand sign language. Many approaches for effective man-machine communication have been proposed, such as voice, face, iris, retinal

scans and gesture recognition systems. Gesture recognition, along with facial recognition, voice recognition, eye tracking and lip movement recognition, is a component of what developers refer to as a perceptual user interface (PUI). The goal of PUI is to enhance the efficiency and ease of use of the underlying logical design of a stored program, a design discipline known as usability. In personal computing, gestures are most often used for input commands.

Compared with face and voice features, hands require less complexity in terms of imaging conditions. Consequently hand-based recognition is friendlier, less prone to disturbances and more robust to environmental conditions. Our goal is to offer a sign recognition system as another choice for augmenting communication between dumb people and the speaking community. This wearable system would capture and recognize the dumb user's signing. The user could then cue the system to generate text or speech. Recognizing gestures as input allows computers to be more accessible to the physically impaired and makes interaction more natural in a 3D virtual world environment. Hand and body gestures can be amplified by a controller that contains accelerometers and gyroscopes to sense tilting, rotation and acceleration of movement, or the computing device can be outfitted with a camera so that software in the device can recognize and interpret specific gestures.

Conventional methods used in hand gesture recognition systems are glove-based techniques, with embedded accelerometers and multiple sensors, and computer vision based techniques. The use of accelerometers demands extra hardware components and a power supply. One hand gesture recognition system using a sensing glove with 6 embedded accelerometers recognizes 28 static hand gestures at a rate of about 1 character per second; however, this algorithm is not efficient enough to be applied in real time. Another recognition system using colored gloves and a neural network algorithm was introduced, but its success rate ranges from 70% to

93%. Although these systems can recognize hand gestures, wearing a sensing glove is not convenient for daily use.

For computer vision based techniques, one camera or a set of cameras is used to capture hand images for recognition. One such approach is based on computer vision techniques without restricting the background or using any markers. This method first separates the region of the hand gesture from complex background images by measuring the entropy between adjacent frames. A hand gesture is then recognized by the approach of the improved centroidal profile. However, mis-recognitions can be caused by hand gestures with similar spatial features, so the number of hand gestures that can be recognized by such an algorithm is limited. An effective vision system should be glove-free, fast and accurate, and should require only a small database. Moreover, the use of computer vision based techniques increases the complexity of image recognition.

In addition to the technical challenges of implementing gesture recognition, there are also social challenges. Gestures must be simple, intuitive and universally acceptable. The study of gestures and other nonverbal types of communication is known as kinesics.

The key problem in gesture interaction is how to make hand gestures understood by computers. The approaches available can be mainly divided into data-glove based and vision based approaches. The data-glove based methods use sensor devices for digitizing hand and finger motions into multi-parametric data. The extra sensors make it easy to collect hand configuration and movement. However, the devices are quite expensive and cumbersome for the users. In contrast, the vision based methods require only a camera, thus realizing a natural interaction between humans and computers without the use of any extra devices. These systems tend to

complement biological vision by describing artificial vision systems that are implemented in software and hardware. This poses a challenging problem as these systems need to be background invariant, lighting insensitive, person and camera independent to achieve real time performance. Moreover, such systems must be optimized to meet the requirements, including accuracy and robustness.

In this project, a new approach to real-time hand gesture recognition is developed. It is a recognition algorithm based on a Haar wavelet representation. Hands are extracted by a skin color approach rather than by user input. The problem of hand orientation in the image is solved by utilizing the idea of an axis of elongation, which helps keep the database small by standardizing the hand gestures in fixed orientations. We then introduce a new approach to disperse hand features in the image, which improves the success rate.

1.2. NEED FOR SIGN LANGUAGE

Creating a proper sign language dictionary (in this case for ASL, American Sign Language) is not the desired result at this point. This would require advanced understanding of grammar and syntax structure by the system, which is outside the scope of this project. American Sign Language will be used as the database since it is a tightly structured set. From that point, further applications can be built. The distant (or near) future of computer interfaces could retain the usual input devices and, in conjunction with gesture recognition, some of the user's feelings could be perceived as well. Taking ASL recognition further, a full real-time dictionary could be created with the use of video. As mentioned before, this would require some artificial intelligence for grammar and syntax purposes.

Another application is the annotation of huge databases. When properly executed by a computer, this is far more efficient than annotation by a human.

1.3. AMERICAN SIGN LANGUAGE

American Sign Language is the language of choice for most deaf people in the United States. It is part of the deaf culture and includes its own system of puns, inside jokes, etc. However, ASL is only one of the many sign languages of the world. Just as an English speaker would have trouble understanding someone speaking Japanese, a speaker of ASL would have trouble understanding the sign language of Sweden. ASL also has its own grammar that is different from English. ASL consists of approximately 6000 gestures of common words, with finger spelling used to communicate obscure words or proper nouns. Finger spelling uses one hand and 26 gestures to communicate the 26 letters of the alphabet. Some of the signs can be seen in Fig 1.1.

Fig 1.1 ASL examples

Another interesting characteristic, which will be ignored by this project, is the ability that ASL offers to describe a person, place or thing and then point to a place in space to store it temporarily for later reference.


ASL uses facial expressions to distinguish between statements, questions and directives. The eyebrows are raised for a question, held normal for a statement, and furrowed for a directive. Although there has been considerable work and research in facial feature recognition, facial features will not be used to aid recognition in the task addressed here. They would be feasible in a full real-time ASL dictionary.

1.4. GESTURE RECOGNITION Gesture recognition is a language technology with the goal of interpreting human gestures via mathematical algorithms. Gestures can originate from any bodily motion or state but commonly originate from the face or hand. Current focuses in the field include emotion recognition from the face and hand gesture recognition. Many approaches have been made using cameras and computer vision algorithms to interpret sign language. However, the identification and recognition of posture, gait, proxemics, and human behaviors is also the subject of gesture recognition techniques. Gesture recognition can be seen as a way for computers to begin to understand human body language, thus building a richer bridge between machines and humans than primitive text user interfaces or even GUIs (graphical user interfaces), which still limit the majority of input to keyboard and mouse. Gesture recognition enables humans to interface with the machine (HMI) and interact naturally without any mechanical devices. Using the concept of gesture recognition, it is possible to point a finger at the computer screen so that the cursor will move accordingly. This could potentially make conventional input devices such as mouse, keyboards and even touch-screens redundant.


Gesture recognition can be conducted with techniques from computer vision and image processing. The literature includes ongoing work in the computer vision field on capturing gestures, or more general human pose and movements, by cameras connected to a computer.

1.4.1. GESTURE RECOGNITION AND PEN COMPUTING

In some literature, the term gesture recognition has been used to refer more narrowly to non-text-input handwriting symbols, such as inking on a graphics tablet, multi-touch gestures, and mouse gesture recognition. This is computer interaction through the drawing of symbols with a pointing device cursor.

1.4.2. GESTURE TYPES

In computer interfaces, two types of gestures are distinguished:

Offline gestures: gestures that are processed after the user's interaction with the object. An example is a gesture used to activate a menu.

Online gestures: direct manipulation gestures. They are used to scale or rotate a tangible object.


1.4.3. USES

Gesture recognition is useful for processing information from humans which is not conveyed through speech or typing. There are various types of gestures which can be identified by computers.

Sign language recognition: just as speech recognition can transcribe speech to text, certain types of gesture recognition software can transcribe the symbols represented through sign language into text.

Socially assistive robotics: by using proper sensors (accelerometers and gyros) worn on the body of a patient and by reading the values from those sensors, robots can assist in patient rehabilitation. The best example is stroke rehabilitation.

Directional indication through pointing: pointing has a very specific purpose in our society, to reference an object or location based on its position relative to ourselves. The use of gesture recognition to determine where a person is pointing is useful for identifying the context of statements or instructions. This application is of particular interest in the field of robotics.


Control through facial gestures: controlling a computer through facial gestures is a useful application of gesture recognition for users who may not physically be able to use a mouse or keyboard. Eye tracking in particular may be of use for controlling cursor motion or focusing on elements of a display.

Alternative computer interfaces: foregoing the traditional keyboard and mouse setup to interact with a computer, strong gesture recognition could allow users to accomplish frequent or common tasks using hand or face gestures to a camera.

Immersive game technology: gestures can be used to control interactions within video games to try and make the game player's experience more interactive or immersive.

Virtual controllers: for systems where the act of finding or acquiring a physical controller could require too much time, gestures can be used as an alternative control mechanism. Controlling secondary devices in a car, or controlling a television set, are examples of such usage.

Affective computing: in affective computing, gesture recognition is used in the process of identifying emotional expression through computer systems.


Remote control: through the use of gesture recognition, "remote control with the wave of a hand" of various devices is possible. The signal must not only indicate the desired response, but also which device is to be controlled.


CHAPTER 2

LITERATURE SURVEY

A hand gesture analysis system based on a three-dimensional hand skeleton model with 27 degrees of freedom was developed by Lee and Kunii. They incorporated five major constraints based on the human hand kinematics to reduce the model parameter space search. To simplify the model matching, specially marked gloves were used.

Full ASL recognition systems (words, phrases) incorporate data gloves. Takashi and Kishino discuss a data-glove-based system that could recognize 34 of the 46 Japanese gestures (user dependent) using a joint angle and hand orientation coding technique. From their paper, it seems the test user made each of the 46 gestures 10 times to provide data for principal component and cluster analysis. A separate test set was created from five iterations of the alphabet by the user, with each gesture well separated in time. While these systems are technically interesting, they suffer from a lack of training.

Excellent work has been done in support of machine sign language recognition by Sperling and Parish, who have done careful studies on the bandwidth necessary for a sign conversation using spatially and temporally subsampled images. Point light experiments (where lights are attached to significant locations on the body and just these points are used for recognition) have been carried out by Poizner.


CHAPTER 3

BACKGROUND

3.1. EXISTING SYSTEM

The key problem in gesture interaction is how to make hand gestures understood by computers. The approaches available can be mainly divided into data-glove based, vision based and analysis-of-drawing-gestures approaches. Research on hand gestures can be classified into three categories. The first category, glove based analysis, employs sensors (mechanical or optical) attached to a glove that transduces finger flexions into electrical signals for determining the hand posture. The relative position of the hand is determined by an additional sensor, normally a magnetic or an acoustic sensor attached to the glove. These methods use sensor devices for digitizing hand and finger motions into multi-parametric data. The extra sensors make it easy to collect hand configuration and movement. However, the devices are quite expensive and cumbersome for the users. For some data glove applications, look-up table software toolkits are provided with the glove to be used for hand posture recognition.

The second category, vision based analysis, is based on the way human beings perceive information about their surroundings, yet it is probably the most difficult to implement in a satisfactory way. Several different approaches have been tested so far. One is to build a three-dimensional model of the human hand. The model is matched to images of the hand by one or more cameras, and

parameters corresponding to palm orientation and joint angles are estimated. These parameters are then used to perform gesture classification. Another Vision Based method requires only a camera, thus realizing a natural interaction between humans and computers without the use of any extra devices. These systems tend to complement biological vision by describing artificial vision systems that are implemented in software and hardware. This poses a challenging problem as these systems need to be background invariant, lighting insensitive, person and camera independent to achieve real time performance. Moreover, such systems must be optimized to meet the requirements, including accuracy and robustness.

The third category, analysis of drawing gestures, usually involves the use of a stylus as an input device. Analysis of drawing gestures can also lead to recognition of written text. The vast majority of hand gesture recognition work has used mechanical sensing, most often for direct manipulation of a virtual environment and occasionally for symbolic communication. Sensing the hand posture mechanically has a range of problems, however, including reliability, accuracy and electromagnetic noise. Visual sensing has the potential to make gestural interaction more practical, but potentially embodies some of the most difficult problems in machine vision. The hand is a non-rigid object and, even worse, self-occlusion is very common.

3.2. PROBLEM STATEMENT

The existing systems make use of sensors, accelerometers and sensing gloves. All these methods require a large number of hardware components. The sweat produced by the hand reduces the efficiency of the tactile sensors, and the efficiency of the sensors also diminishes due to wear caused by aging. Moreover, we cannot expect people to move around wearing a sensing glove.

CHAPTER 4

METHODOLOGY

4.1. IMAGE CAPTURING DEVICES

The ability to track a person's movements and determine what gestures they may be performing can be achieved through various tools. Although a large amount of research has been done in image/video based gesture recognition, there is some variation in the tools and environments used between implementations.

Depth-aware cameras: using specialized cameras such as time-of-flight cameras, one can generate a depth map of what is being seen through the camera at a short range, and use this data to approximate a 3-D representation of what is being seen. These can be effective for detection of hand gestures due to their short-range capabilities.

Stereo cameras: using two cameras whose relations to one another are known, a 3-D representation can be approximated from the output of the cameras. To get the cameras' relations, one can use a positioning reference such as a lexian-stripe or infrared emitters. In combination with direct motion measurement (6D-Vision), gestures can be detected directly.


Controller-based gestures: these controllers act as an extension of the body, so that when gestures are performed some of their motion can be conveniently captured by software. Mouse gestures are one such example, where the motion of the mouse is correlated to a symbol being drawn by a person's hand; another is the Wii Remote, which can study changes in acceleration over time to represent gestures.

Single camera: a normal camera can be used for gesture recognition where the resources/environment would not be convenient for other forms of image-based recognition. Although not necessarily as effective as stereo or depth-aware cameras, using a single camera allows a greater possibility of accessibility to a wider audience.

4.1.1. CHALLENGES

There are many challenges associated with the accuracy and usefulness of gesture recognition software. For image-based gesture recognition there are limitations on the equipment used and on image noise. Images or video may not be captured under consistent lighting, or in the same location. Items in the background or distinct features of the users may make recognition more difficult.

The variety of implementations for image-based gesture recognition may also cause issues for the viability of the technology for general usage. For example, an algorithm calibrated for one camera may not work for a different camera. The

amount of background noise also causes tracking and recognition difficulties, especially when occlusions (partial and full) occur. Furthermore, the distance from the camera, and the camera's resolution and quality, also cause variations in recognition accuracy. In order to capture human gestures with visual sensors, robust computer vision methods are also required, for example for hand tracking and hand posture recognition, or for capturing movements of the head, facial expressions or gaze direction.

The recognition problem is approached through a matching process in which the segmented hand is compared with all the postures in the system's memory using the Hausdorff distance. The system's visual memory stores all the recognizable postures, their distance transform, their edge map and morphological information. A faster and more robust comparison is performed thanks to this data, properly classifying postures, even those which are similar, saving valuable time needed for real-time processing. The postures included in the visual memory may be initialized by the human user, learned or trained from previous tracking of hand motion, or generated during the recognition process.

4.2. SIGNIFICANCE OF GRAYSCALE IMAGES

The image captured by the camera is in RGB form. In order to reduce complexity in hand segmentation we convert the RGB image to a grayscale image. A grayscale (or gray level) image is simply one in which the only colors are shades of gray. The reason for differentiating such images from any other sort of color image is that less information needs to be provided for each pixel. In fact

a `gray' color is one in which the red, green and blue components all have equal intensity in RGB space, and so it is only necessary to specify a single intensity value for each pixel, as opposed to the three intensities needed to specify each pixel in a full color image.

Fig 4.1 A model gray image

Often, the grayscale intensity is stored as an 8-bit integer, giving 256 possible different shades of gray from black to white. If the levels are evenly spaced, the difference between successive gray levels is significantly finer than the gray level resolving power of the human eye.

4.2.1. GRAYSCALE AS SINGLE CHANNEL OF MULTICHANNEL COLOUR IMAGES

Color images are often built of several stacked color channels, each of them representing the value levels of the given channel. For example, RGB images are composed of three independent channels for the red, green and blue primary color components. Below is an example of color channel splitting of a full RGB color image. The column at left shows the isolated color channels in natural colors, while at right are their grayscale equivalents:

Fig 4.2 Three channels of a RGB image

The reverse is also possible: to build a full color image from its separate grayscale channels. By mangling channels, using offsets, rotating and other manipulations, artistic effects can be achieved instead of accurately reproducing the original image.

4.3. HAND SEGMENTATION

Thresholding is the simplest method of image segmentation. From a grayscale image, thresholding can be used to create binary images. The key parameter in the thresholding process is the choice of the threshold value. During the thresholding process, individual pixels in an image are marked as object pixels if their value is greater than some threshold value (assuming an object to be brighter than the background) and as background pixels otherwise. This convention is known as threshold above. Variants include threshold below, which is the opposite of threshold above; threshold inside, where a pixel is labeled "object"

if its value is between two thresholds; and threshold outside, which is the opposite of threshold inside. Typically, an object pixel is given a value of 1 while a background pixel is given a value of 0. Finally, a binary image is created by coloring each pixel white or black, depending on its label.
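As a concrete illustration of the grayscale conversion and thresholding steps described above, the following MATLAB sketch builds a binary hand map from an RGB frame. The file name and the fixed threshold value are illustrative assumptions; the report does not state the actual values used.

% Minimal sketch, assuming grayscale conversion followed by a fixed
% global threshold. 'hand.jpg' and the value 0.5 are placeholders.
rgbImage  = imread('hand.jpg');              % captured RGB frame
grayImage = im2double(rgb2gray(rgbImage));   % grayscale image scaled to [0, 1]

threshold = 0.5;                             % assumed global threshold
handMap   = grayImage < threshold;           % 1 = skin (object), 0 = background

imshow(handMap);                             % display the binary hand map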

Fig 4.3 Original Image

Fig 4.4 Example of a threshold effect used on an image


4.3.1. THRESHOLD SELECTION

The key parameter in the thresholding process is the choice of the threshold value (or values, as mentioned earlier). Several different methods for choosing a threshold exist; users can manually choose a threshold value, or a thresholding algorithm can compute a value automatically, which is known as automatic thresholding. A simple method would be to choose the mean or median value, the rationale being that if the object pixels are brighter than the background, they should also be brighter than the average. In a noiseless image with uniform background and object values, the mean or median will work well as a threshold; however, this will generally not be the case. A more sophisticated approach might be to create a histogram of the image pixel intensities and use the valley point as the threshold. The histogram approach assumes that there is some average value for the background and the object pixels, but that the actual pixel values have some variation around these average values. However, this may be computationally expensive, and image histograms may not have clearly defined valley points, often making the selection of an accurate threshold difficult.

4.3.2. ADAPTIVE THRESHOLDING

Thresholding is called adaptive thresholding when a different threshold is used for different regions in the image. This may also be known as local or dynamic thresholding.

4.3.3. MULTIBAND THRESHOLDING

Color images can also be thresholded. One approach is to designate a separate threshold for each of the RGB components of the image and then combine them with an AND operation. This reflects the way the camera


works and how the data is stored in the computer, but it does not correspond to the way people recognize color. It is therefore easier to design a threshold value for a grayscale image than for a color image.

4.4. MORPHOLOGICAL OPERATION

While point and neighborhood operations are generally designed to alter the look or appearance of an image for visual considerations, morphological operations are used to understand the structure or form of an image. This usually means identifying objects or boundaries within an image. Morphological operations play a key role in applications such as machine vision and automatic object detection.

4.4.1. STRUCTURING ELEMENT

In mathematical morphology, a structuring element is a shape used to probe or interact with a given image, with the purpose of drawing conclusions on how this shape fits or misses the shapes in the image. It is typically used in morphological operations such as dilation, erosion, opening, and closing, as well as the hit-or-miss transform. According to Georges Matheron, knowledge about an object depends on the manner in which we probe (observe) it. In particular, the choice of a certain structuring element for a particular morphological operation influences the information one can obtain. There are two main characteristics that are directly related to structuring elements.


Shape: for example, the structuring element can be a "ball" or a line, convex or a ring, etc. By choosing a particular structuring element, one sets a way of differentiating some objects from others, according to their shape or spatial orientation.

Size: for example, the same shape of structuring element can be used at a small or a much larger size. Setting the size of the structuring element is similar to setting the observation scale, and sets the criterion for differentiating image objects or features according to size.

In MATLAB, SE = strel('disk', R, N) creates a flat, disk-shaped structuring element, where R specifies the radius. R must be a non-negative integer. N must be 0, 4, 6, or 8. When N is greater than 0, the disk-shaped structuring element is approximated by a sequence of N periodic-line structuring elements. When N equals 0, no approximation is used, and the structuring element members consist of all pixels whose centers are no greater than R away from the origin. If N is not specified, the default value is 4.


Fig 4.5 Structuring Element

4.4.2. IMAGE CLOSING Closing is an important operator from the field of mathematical morphology. Like its dual operator opening, it can be derived from the fundamental operations of erosion and dilation. Like those operators it is normally applied to binary images, although there are gray level versions. Closing is similar in some ways to dilation in that it tends to enlarge the boundaries of foreground (bright) regions in an image (and shrink background color holes in such regions), but it is less destructive of the original boundary shape. As with other morphological operators, the exact operation is determined by a structuring element. The effect of the operator is to preserve background regions that have a similar shape to this structuring element, or that can completely contain the structuring element, while eliminating all other regions of background pixels. Closing is opening performed in reverse. It is defined simply as dilation followed by erosion using the same structuring element for both operations. See the sections on erosion and dilation for details of the individual steps. The closing

operator therefore requires two inputs: an image to be closed and a structuring element. Gray level closing consists straightforwardly of a gray level dilation followed by a gray level erosion. Closing is the dual of opening, i.e. closing the foreground pixels with a particular structuring element is equivalent to opening the background pixels with the same element.

4.4.3. EFFECT OF IMAGE CLOSING

One of the uses of dilation is to fill in small background color holes in images, e.g. `pepper noise'. One of the problems with doing this, however, is that the dilation will also distort all regions of pixels indiscriminately. By performing an erosion on the image after the dilation, i.e. a closing, we reduce some of this effect.

The effect of closing can be quite easily visualized. Imagine taking the structuring element and sliding it around outside each foreground region, without changing its orientation. For any background boundary point, if the structuring element can be made to touch that point without any part of the element being inside a foreground region, then that point remains background. If this is not possible, then the pixel is set to foreground. After the closing has been carried out, the background region will be such that the structuring element can be made to cover any point in the background without any part of it also covering a foreground point, and so further closings will have no effect. This property is known as idempotence. The effect of a closing on a binary image using a 3x3 square structuring element is illustrated in Fig 4.6.


Fig 4.6 Effect of closing using a 3x3 square structuring element

As with erosion and dilation, this particular 3x3 structuring element is the most commonly used, and in fact many implementations have it hardwired into their code, in which case it is obviously not necessary to specify a separate structuring element. To achieve the effect of a closing with a larger structuring element, it is possible to perform multiple dilations followed by the same number of erosions. Closing can sometimes be used to selectively fill in particular background regions of an image. Whether or not this can be done depends upon whether a suitable structuring element can be found that fits well inside regions that are to be preserved, but does not fit inside regions that are to be removed.
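The closing operation described above maps directly onto the strel and imclose functions of MATLAB's Image Processing Toolbox. The following sketch fills small holes in a binary hand map; the input file name and the disk radius of 5 pixels are assumed values, since the report only states that an element of suitable size is designed.

% Sketch of morphological closing to fill small holes in a binary hand map.
handMap   = imread('hand_map.png') > 0;   % illustrative binary input image
se        = strel('disk', 5);             % flat, disk-shaped structuring element (assumed size)
closedMap = imclose(handMap, se);         % dilation followed by erosion (closing)

figure, imshow(handMap),   title('Before closing');
figure, imshow(closedMap), title('After closing');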


4.5. IMAGE REGISTRATION

Image registration is the process of overlaying two or more images of the same scene taken at different times, from different viewpoints, and/or by different sensors. It geometrically aligns two images: the reference image and the sensed image. The differences between the images are introduced by the different imaging conditions. Image registration is a crucial step in all image analysis tasks in which the final information is gained from the combination of various data sources, as in image fusion, change detection, and multichannel image restoration. Typically, registration is required in remote sensing (multispectral classification, environmental monitoring, change detection, image mosaicing, weather forecasting, creating super-resolution images, integrating information into geographic information systems (GIS)), in medicine (combining computed tomography (CT) and NMR data to obtain more complete information about the patient, monitoring tumor growth, treatment verification, comparison of the patient's data with anatomical atlases), in cartography (map updating), and in computer vision (target localization, automatic quality control), to name a few.

4.5.1. ALGORITHM CLASSIFICATIONS

4.5.1.1. INTENSITY BASED VS FEATURE BASED

Image registration or image alignment algorithms can be classified into intensity-based and feature-based. One of the images is referred to as the reference or source and the second image is referred to as the target or sensed. Image registration involves spatially transforming the target image to align with the reference image. Intensity-based methods compare intensity patterns in

images via correlation metrics, while feature-based methods find correspondence between image features such as points, lines, and contours. Intensity-based methods register entire images or sub-images. If sub-images are registered, the centers of corresponding sub-images are treated as corresponding feature points. Feature-based methods establish correspondence between a number of points in the images. Knowing the correspondence between a number of points in the images, a transformation is then determined to map the target image to the reference image, thereby establishing point-by-point correspondence between the reference and target images.

4.5.1.2. SPATIAL VS FREQUENCY DOMAIN METHODS

Spatial methods operate in the image domain, matching intensity patterns or features in images. Some of the feature matching algorithms are outgrowths of traditional techniques for performing manual image registration, in which an operator chooses corresponding control points (CPs) in the images. When the number of control points exceeds the minimum required to define the appropriate transformation model, iterative algorithms like RANSAC can be used to robustly estimate the parameters of a particular transformation type (e.g. affine) for registration of the images.

Frequency-domain methods find the transformation parameters for registration of the images while working in the transform domain. Such methods work for simple transformations, such as translation, rotation, and scaling. Applying the phase correlation method to a pair of images produces a third image which contains a single peak. The location of this peak corresponds to the relative translation between the images. Unlike many spatial-domain algorithms, the phase correlation method is resilient to noise, occlusions, and other defects typical of medical or satellite images. Additionally, phase correlation uses the

fast Fourier transform to compute the cross-correlation between the two images, generally resulting in large performance gains. The method can be extended to determine rotation and scaling differences between two images by first converting the images to log-polar coordinates. Due to properties of the Fourier transform, the rotation and scaling parameters can be determined in a manner invariant to translation.

4.5.1.3. SINGLE VS MULTI-MODALITY METHODS

Another classification can be made between single-modality and multi-modality methods. Single-modality methods tend to register images of the same modality acquired by the same scanner/sensor type, while multi-modality registration methods tend to register images acquired by different scanner/sensor types. Multi-modality registration methods are often used in medical imaging, as images of a subject are frequently obtained from different scanners. Examples include registration of brain CT/MRI images or whole body PET/CT images for tumor localization, registration of contrast-enhanced CT images against non-contrast-enhanced CT images for segmentation of specific parts of the anatomy, and registration of ultrasound and CT images for prostate localization in radiotherapy.

4.5.1.4. AUTOMATIC VS INTERACTIVE METHODS

Registration methods may be classified based on the level of automation they provide. Manual, interactive, semi-automatic, and automatic methods have been developed. Manual methods provide tools to align the images manually. Interactive methods reduce user bias by performing certain key

operations automatically while still relying on the user to guide the registration. Semi-automatic methods perform more of the registration steps automatically but depend on the user to verify the correctness of a registration. Automatic methods do not allow any user interaction and perform all registration steps automatically.

4.5.2. UNCERTAINTY

There is a level of uncertainty associated with registering images that have any spatio-temporal differences. A confident registration with a measure of uncertainty is critical for many change detection applications such as medical diagnostics. In remote sensing applications, where a digital image pixel may represent several kilometers of spatial distance (such as NASA's LANDSAT imagery), an uncertain image registration can mean that a solution could be several kilometers from ground truth. Several notable papers have attempted to quantify uncertainty in image registration in order to compare results. However, many approaches to quantifying uncertainty or estimating deformations are computationally intensive or are only applicable to limited sets of spatial transformations.

4.5.3. TRANSFORMATION METHODS

Image registration algorithms can also be classified according to the transformation models they use to relate the target image space to the reference image space. The first broad category of transformation models includes linear transformations, which include translation, rotation, scaling, and other affine


transforms. Linear transformations are global in nature; thus, they cannot model local geometric differences between images. The second category allows 'elastic' or 'nonrigid' transformations. These transformations are capable of locally warping the target image to align with the reference image. Nonrigid transformations include radial basis functions (thin-plate or surface splines, multiquadrics, and compactly-supported transformations), physical continuum models (viscous fluids), and large deformation models (diffeomorphisms).

4.5.4. RADON TRANSFORM

The Radon transform of a 2-D function f(x, y) is defined as

R(r, θ) = ∫∫ f(x, y) δ(r − x cos θ − y sin θ) dx dy

with both integrals taken over (−∞, ∞), where r is the perpendicular distance of a line from the origin and θ is the angle between the line and the y-axis. According to the Fourier slice theorem, this transformation is invertible, and the 1-D Fourier transforms of the Radon transform along r are the 1-D radial samples of the 2-D Fourier transform of f(x, y) at the corresponding angles. The transform we have used is the Radon transform. The RADON function in Matlab computes the Radon transform, which is the projection of the image intensity along a radial line oriented at a specific angle.

R = RADON(I,THETA) returns the Radon transform of the intensity image I for the angle THETA degrees. If THETA is a scalar, the result R is a


column vector containing the Radon transform for THETA degrees. If THETA is a vector, then R is a matrix in which each column is the Radon transform for one of the angles in THETA. If you omit THETA, it defaults to 0:179.

[R,Xp] = RADON(...) returns a vector Xp containing the radial coordinates corresponding to each row of R.
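The following MATLAB sketch shows one way the Radon transform can be used for the orientation normalization step of this project. Picking the angle of maximum projection variance as the axis of elongation is an assumption of this sketch; the report does not give the exact criterion, and closedMap stands for the binary hand image after morphological closing.

% Sketch: estimate the dominant orientation of the segmented hand with the
% Radon transform and rotate it to a fixed reference axis.
theta        = 0:179;                    % projection angles in degrees
R            = radon(closedMap, theta);  % Radon transform of the binary hand map
[~, idx]     = max(var(R));              % angle whose projection has the largest spread (assumed criterion)
principalAng = theta(idx);               % estimated axis of elongation
normalized   = imrotate(closedMap, -principalAng, 'nearest', 'crop');  % rotate to a common axis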

4.6. FEATURE EXTRACTION

In pattern recognition and in image processing, feature extraction is a special form of dimensionality reduction. When the input data to an algorithm is too large to be processed and is suspected to be notoriously redundant (much data, but not much information), the input data is transformed into a reduced representation set of features (also called a feature vector). Transforming the input data into the set of features is called feature extraction. If the features extracted are carefully chosen, it is expected that the feature set will extract the relevant information from the input data, so that the desired task can be performed using this reduced representation instead of the full size input.

Feature extraction involves simplifying the amount of resources required to describe a large set of data accurately. When performing analysis of complex data, one of the major problems stems from the number of variables involved. Analysis with a large number of variables generally requires a large amount of memory and computation power, or a classification algorithm which overfits the training sample and generalizes poorly to new samples. Feature extraction is a


general term for methods of constructing combinations of the variables to get around these problems while still describing the data with sufficient accuracy.

The discrete wavelet transform (DWT) decomposes an input signal into low and high frequency components using a filter bank. The Haar wavelet, which characterizes the filter bank, has the important properties of orthogonality, linearity, and completeness. We can repeat the DWT multiple times to obtain multiple-level resolution at different octaves. For each level, the wavelets can be separated into different basis functions for image compression and recognition.

Fig 4.7 Multi-resolution expansion using Haar wavelet

The wavelet transform can be used to represent a two-dimensional (2D) signal by the 2D resolution decomposition procedure, where an image is repeatedly decomposed into an approximation and several detail components at each level. In order to construct the wavelet pyramid, we decide the number of Haar coefficients and approximation levels. We would like to extract salient points

from any part of the image where something significant happens, at any resolution. A high wavelet coefficient (in absolute value) at a coarse resolution corresponds to a region with high global variations. A properly chosen length of the Haar wavelet and number of approximation levels provide the optimum local key points or features.

4.6.1. WAVELETS

Wavelet analysis is performed using a prototype function called a wavelet, which has the effect of a band pass filter. Wavelets are functions defined over a finite interval and having an average value of zero. The basic idea of the wavelet transform is to represent any arbitrary function f(t) as a superposition of a set of such wavelets or basis functions. These basis functions are derived from a single prototype mother wavelet. The term wavelet means a small wave. The smallness refers to the condition that this window function is of finite length (compactly supported). The wave refers to the condition that this function is oscillatory. The term mother implies that the functions with different regions of support that are used in the transformation process are all derived from one main function, the mother wavelet, through dilation (scaling) and translation (shifts).


4.6.2. WAVELET TRANSFORM

The wavelet transform is a mathematical tool that decomposes a signal into a representation that shows signal details and trends as a function of time. It is used to characterize transients, reduce noise, compress data, and perform many other operations. Wavelet analysis is a windowing technique, similar to the STFT, but with variable-sized windows. Wavelet analysis is capable of revealing aspects of data that other signal analysis techniques miss, including trends, breakdown points, discontinuities, and self-similarity. It is also often used to compress or denoise a signal without any appreciable degradation.

4.6.3. THE DISCRETE WAVELET TRANSFORM

The discrete wavelet transform transforms a discrete signal from the time domain into the time-frequency domain. The product of the transformation is a set of coefficients organized in a way that enables not only spectral analysis of the signal, but also analysis of the spectral behavior of the signal over time. This is achieved by decomposing the signal into two components, each carrying information about the source signal. The filters from the filter bank used for decomposition come in pairs: low pass and high pass. The filtering is followed by downsampling (the filtered result is re-sampled so that only every second coefficient is kept). The low pass filtered signal contains information about the slowly changing component of the signal, looking very similar to the original signal, only two times shorter in terms of samples. The high pass filtered signal contains information about the fast changing component of the signal. In most cases the high pass component is not so rich in data, which offers a good property for compression. In some cases, such as audio or video signals, it is possible to discard some of the samples of the high pass component without noticing any significant change in the signal. The filters from the filter bank are called wavelets.
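In MATLAB's Wavelet Toolbox, this single-level decomposition is performed by the dwt function, as sketched below; the input values are illustrative only.

% Sketch: one level of the 1-D Haar DWT.
% cA holds the low pass (approximation) coefficients, cD the high pass
% (detail) coefficients; both are half the length of the input signal.
x        = [9 7 3 5 6 10 2 6];     % illustrative input signal
[cA, cD] = dwt(x, 'haar');         % single-level Haar decomposition
xRec     = idwt(cA, cD, 'haar');   % perfect reconstruction of x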

4.6.4. 2D-DISCRETE WAVELET TRANSFORM

The two-dimensional DWT can be implemented using digital filters and down-samplers. With separable two-dimensional scaling and wavelet functions, we obtain one set of approximation coefficients and three sets of detail coefficients: horizontal, vertical and diagonal. The concepts of the one-dimensional DWT and its implementation through sub-band coding can be easily extended to two-dimensional signals such as digital images. In sub-band analysis of images, we require the extraction of the approximate form in both the horizontal and vertical directions, details in the horizontal direction alone (detection of horizontal edges), details in the vertical direction alone (detection of vertical edges) and details in both the horizontal and vertical directions (detection of diagonal edges). This analysis of 2-D signals requires the following two-dimensional filter functions, formed through the multiplication of separable scaling and wavelet functions in the x (horizontal) and y (vertical) directions, as defined below:

φ(x, y) = φ(x) φ(y)
ψ^H(x, y) = ψ(x) φ(y)
ψ^V(x, y) = φ(x) ψ(y)
ψ^D(x, y) = ψ(x) ψ(y)

where φ(x, y), ψ^H(x, y), ψ^V(x, y) and ψ^D(x, y) yield the approximated signal, the signal with horizontal details, the signal with vertical details and the signal with diagonal details respectively.
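The single-level 2-D Haar DWT used for feature extraction in this project can be computed with dwt2 from MATLAB's Wavelet Toolbox. Using the horizontal detail sub-band as the feature follows the description in Chapter 5; treating the normalized binary hand image as the input is an assumption of this sketch.

% Sketch: single-level 2-D Haar DWT of the normalized hand image.
% cA - approximation coefficients
% cH - horizontal detail coefficients (used as the feature in this project)
% cV - vertical detail coefficients
% cD - diagonal detail coefficients
[cA, cH, cV, cD] = dwt2(double(normalized), 'haar');
featureVector    = cH(:);    % flatten the horizontal sub-band into a feature vector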


CHAPTER 5

OVERVIEW OF THE PROJECT

5.1. AN OVERLAY OF OUR ALGORITHM

[Block diagram] Captured hand gesture → Image resizing → Gray scale conversion → Hand segmentation → Morphological operation → Image normalization → Haar wavelet transform → Feature extraction → Recognized speech signal
Fig 5.1 Overlay of our algorithm

5.2. PROPOSED WORK

Here we have approached gesture recognition through image processing. With the constraints of a constant background and a constant zoom level, we have tracked the hand gesture. The captured image is resized to a common size so that the machine has a constant frame size to process. An RGB image would require a threshold value for each color component, so to reduce such complexity we convert it into a grayscale image. The hand is then extracted using the skin color approach, and we assume that the forearm of the user is covered by clothing. A pixel is defined as a skin pixel if it satisfies the following condition:

Gray scale < Threshold

where the gray scale denotes the intensity value of the input hand image.

By the skin color approach, a hand map image can then be defined. In the hand map image, a white pixel (pixel value = 1) and a black pixel (pixel value = 0) indicate skin and non-skin pixels respectively. The segmented hand image is now binary and undergoes several preprocessing steps. The binary image obtained has noise and does not capture the exact hand geometry. Noise in the hand image results in holes, which are minimized by utilizing morphological operations. A structuring element of suitable size is designed and image closing is performed to obtain the exact geometry of the hand. In morphology, dilation expands an image and erosion shrinks it. Closing tends to smooth contours; it generally fuses narrow breaks and long thin gulfs, eliminates small holes, and fills gaps in the contour.


The closing of set A by structuring element B, denoted A ∙ B, is defined as

A ∙ B = (A ⊕ B) ⊖ B

which, in words, says that the closing of A by B is simply the dilation of A by B, followed by the erosion of the result by B.

The gesture captured at different time intervals may be at different angular positions. Our next step is therefore to normalize the segmented hand to a common axis. Image registration is performed to rotate the image to a constant image axis, thereby reducing the number of training images in the database. This normalization is done with the help of the Radon transform.

There are several choices for the selection of features in order to discriminate between hands in a hand gesture recognition system. In numerical analysis and functional analysis, a discrete wavelet transform (DWT) is any wavelet transform for which the wavelets are discretely sampled. As with other wavelet transforms, a key advantage it has over Fourier transforms is temporal resolution: it captures both frequency and location information. For an input represented by a list of 2^n numbers, the Haar wavelet transform may be considered to simply pair up input values, storing the difference and passing the sum. This process is repeated recursively, pairing up the sums to provide the next scale, finally resulting in 2^n − 1 differences and one final sum. The feature used here is the horizontal component of the discrete wavelet transformed image. This vector component, unlike other statistical measures, gives a better recognition rate.
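To make the pairing step above concrete, here is a small worked illustration (the numbers are ours, not taken from the report). Consider the sequence 9, 7, 3, 5, so n = 2. The first pass stores the differences 9 − 7 = 2 and 3 − 5 = −2 and passes the sums 16 and 8; the second pass stores the difference 16 − 8 = 8 and passes the final sum 24. The result is the single sum 24 together with 2^2 − 1 = 3 differences (8, 2, −2), from which the original sequence can be reconstructed exactly.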


This vector component is correlated with the training image. Correlation is a measure of how well the predicted values from a forecast model fit the real-life data. The correlation coefficient used here is a number between 0 and 1. If there is no relationship between the predicted values and the actual values, the correlation coefficient is 0 or very low (the predicted values are no better than random numbers). As the strength of the relationship between the predicted and actual values increases, so does the correlation coefficient; a perfect fit gives a coefficient of 1.0. Thus, the higher the correlation coefficient, the better the recognition.
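A minimal matching sketch, assuming cH is the horizontal detail matrix of the test gesture from the previous step and that trainFeatures, a hypothetical cell array of equally sized horizontal detail matrices for the training gestures, has been prepared beforehand and stored in a MAT-file:

load('train_features.mat', 'trainFeatures');   % hypothetical file holding the training feature matrices
best  = -Inf;
label = 0;
for k = 1:numel(trainFeatures)
    c = corr2(cH, trainFeatures{k});           % 2-D correlation coefficient with the k-th training gesture
    if c > best
        best  = c;                             % keep the highest correlation seen so far
        label = k;                             % and remember which training gesture produced it
    end
end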

Then comes our speech signal unit, which plays the corresponding wave file. The recognized image has to be confirmed by a sound file; this allows the ordinary speaking community to easily understand what the dumb person means to say. In this part we have saved a .wav file for each alphabet. The .wav file for the recognized gesture is read and played using the wavread and wavplay commands. Since this project only computes a wavelet transform and a correlation for each gesture, which is computationally much simpler than other gesture recognition systems, it has an added advantage. Moreover, the hardware requirement is reduced.
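A sketch of the playback stage, assuming one .wav file per alphabet named a.wav, b.wav and so on (the folder name is a placeholder) and the label obtained from the matching step above:

wavefile = ['sounds/' char('a' + label - 1) '.wav'];   % e.g. label = 1 -> sounds/a.wav
[y, Fs]  = wavread(wavefile);                          % read the stored speech sample and its sampling rate
wavplay(y, Fs);                                        % play it (newer MATLAB releases use audioread and sound)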


CHAPTER 6

SOFTWARE DESCRIPTION

6.1. INTRODUCTION
The simulation tool used for the development of the software is MATLAB. MATLAB stands for Matrix Laboratory. It is a technical computing environment for high performance numeric computation and visualization. It integrates numerical analysis, matrix computation, signal processing and graphics into an easy-to-use environment, where problems and solutions are expressed just as they are written mathematically, without traditional programming. MATLAB allows us to express the entire algorithm in a few dozen lines, to compute the solution with great accuracy in a few minutes on a computer, and to readily manipulate a three-dimensional display of the result in color. The basic building block of MATLAB is the matrix. The fundamental data type is the array; vectors, scalars, real matrices and complex matrices are all handled automatically as special cases of this fundamental data type. MATLAB also features a family of application-specific solutions called toolboxes. Areas in which toolboxes are available include signal processing, image processing, control system design, dynamic system simulation, system identification, neural networks, wavelets, communications and others. These toolboxes are collections of functions written for special applications such as symbolic computing, image processing and neural networks.


6.2. FEATURES OF MATLAB
Some of the special features of MATLAB are described below.
6.2.1. COMMAND WINDOW
This is the main window. It is characterized by the MATLAB command prompt >>; when the application is launched the user is taken to this prompt. All commands, including those for running user-written programs, are typed at the MATLAB prompt.
6.2.2. GRAPHICS WINDOW
The output of all graphics commands is flushed to the graphics (or figure) window, a separate gray window with a (default) white background. The user can create as many figure windows as the memory will allow.
6.2.3. EDIT WINDOW
This is where we write, edit, create and save our own programs in M-files. Any text editor can be used to carry out these tasks. On most systems, such as PCs and Macs, MATLAB provides its own built-in editor. On other systems a standard file editing program is invoked by typing a command at the MATLAB prompt.
6.2.4. INPUT OUTPUT
MATLAB supports interactive computation, taking input from the screen and flushing the output to the screen. In addition, it reads input files and writes output files. The following features hold for all forms of input-output.
6.2.5. DATA TYPE
The fundamental data type in MATLAB is the array. It encompasses several distinct data objects: integers, doubles, matrices, character strings, structures and cells. In most cases, however, data type or data object declarations are not needed.


6.2.6. DIMENSIONING
Dimensioning is automatic in MATLAB. No dimension statements are required for vectors or arrays. The commands size and length yield the dimensions of an existing matrix or vector.

6.2.7. CASE SENSITIVITY
MATLAB is case sensitive, i.e. it differentiates between lowercase and uppercase letters. Thus a and A are different variables. Most MATLAB commands and built-in function calls are typed in lowercase letters.
6.3. IMAGES IN MATLAB
The basic data structure in MATLAB is the array, an ordered set of real or complex elements. This object is naturally suited to the representation of images, which are real-valued, ordered sets of color or intensity data. MATLAB stores most images as two-dimensional arrays, in which each element of the matrix corresponds to a single pixel in the displayed image. For example, an image composed of 200 rows and 300 columns of differently colored dots is stored in MATLAB as a 200-by-300 matrix. By default, MATLAB stores most data in arrays of class double. The data in these arrays is stored as double precision (64-bit) floating-point numbers. All of MATLAB's functions and capabilities work with these arrays. The number of pixels in an image may be large; for example, a 1000-by-1000 image has a million pixels. Since each pixel is represented by at least one array element, this image would require about 8 megabytes of memory. In order to reduce memory requirements, MATLAB supports storing image data in arrays of class uint8. The data in these arrays requires one eighth as much memory as data in double arrays. Because the types of values that can be


stored in uint8 arrays and double arrays differ, the Image Processing Toolbox uses different conventions for interpreting the values in these arrays.
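As a small illustration of this memory difference (a minimal sketch; the variable names are only placeholders):

I_double = zeros(1000, 1000);        % stored as class double: roughly 8 MB
I_uint8  = uint8(I_double);          % the same pixels stored as class uint8: roughly 1 MB
whos I_double I_uint8                % the Bytes column shows the eight-fold difference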

6.4. FILE TYPES
MATLAB has four types of files for storing information:
M-files
Script files
Function files
MAT-files
6.4.1. M-FILES
M-files are standard ASCII text files with a .m extension to the file name. There are two types of these files, namely script files and function files. Most programs written in MATLAB are saved as M-files. All built-in functions are provided with source code in readable form so that they can be copied and modified.
6.4.2. SCRIPT FILES
A script file is an M-file with a valid set of MATLAB commands in it. A script file is executed by typing the name of the file (without the .m extension) at the MATLAB prompt; MATLAB then executes the commands stored in the script file one by one. Naturally, script files work on global variables, i.e. variables currently present in the workspace. A script file may contain any number of commands, including those that call built-in functions or functions written by the user. Script files are useful when a certain set of commands has to be repeated several times.


6.4.3. FUNCTION FILES
A function file is also an M-file, like a script file, except that the variables in a function file are all local. Function files are like programs or subroutines in FORTRAN, procedures in PASCAL and functions in C. A function file begins with a function definition line, which has a well-defined list of inputs and outputs. Without this line the file becomes a script file. The syntax of the function definition line is as follows:

function [output variables] = function_name(input variables)

where function_name should be the same as the name of the file in which the function is written.
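For instance, a hypothetical function file segment_hand.m (the name, threshold and purpose are only illustrative) might look like this:

function handMap = segment_hand(grayImg, T)
% SEGMENT_HAND  Return a binary skin map of a gray scale gesture image.
% Every pixel whose intensity is below the threshold T is marked as a
% skin pixel (value 1); all other pixels are marked as background (value 0).
handMap = grayImg < T;

It would then be called from the command window or a script as handMap = segment_hand(gray, 70);, where 70 is only an example threshold.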

6.4.4. MAT-FILES

MAT-files are binary data files with a .mat extension. MAT-files are created by MATLAB when data is saved with the save command. The data is written in a special format which only MATLAB can decode. MAT-files can be loaded into MATLAB using the load command.
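A minimal sketch of this round trip (the file name settings.mat and the variable are only placeholders):

T = 70;                          % example variable to preserve between sessions
save('settings.mat', 'T');       % write settings.mat in MATLAB's binary format
clear T                          % remove the variable from the workspace
load('settings.mat');            % T is restored into the workspace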


CHAPTER 7

SIMULATION RESULTS

STEP: 1 Input images from different cameras may have different dimensions (M x N). To standardize the size of the input image to be processed, we resize it.

Fig 7.1 Resized image


STEP: 2 The RGB image obtained is converted to gray scale for an easier thresholding process. This gray scale image is a weighted combination of the three color planes red, green and blue.

Fig 7.2 Gray scale image


STEP: 3 Here we segment the hand region from the background by choosing an appropriate threshold value. This process gives an outline of the hand region that needs to be processed.

Fig 7.3 Segmented hand


STEP: 4 In this step we create a structuring element and perform the image closing operation. This process removes the noise components and gives the exact geometry of the hand gesture.

Fig 7.4 Morphologically operated image


STEP: 5 Here we rotate the image to a common axis. This greatly reduces the number of images in the database.

Fig 7.5 Normalized image


STEP: 6 This step involves feature extraction. The horizontal component of the wavelet transformed test image is used for recognition. This step too reduces the database size by a great measure.

Fig 7.6 Horizontal vector of DWT


CHAPTER 8

CONCLUSION

The inspiration behind this project came from the thought of helping to alleviate the language barrier which stands between the dumb and hearing communities. Attempting to translate finger spelling into the spoken English alphabet was just a small step towards achieving this ultimate goal. The resulting gesture recognition approach achieved this desired step. The discrete wavelet concept and the normalization of the image axis help to reduce the database size. Performing the wavelet transform is time efficient as it is easy to compute. The normalization also provides a uniform pattern to correlate with the training image and give out the corresponding speech signal that it matches. Our method seems to be more promising as there is a substantial reduction in error rate and processing time.


CHAPTER 9
REFERENCES
[1] Wing Kwong Chung, Xinyu Wu, and Yangsheng Xu, A Realtime Hand Gesture Recognition based on Haar Wavelet Representation, Proceedings of the 2008 IEEE International Conference on Robotics and Biomimetics, Bangkok, Thailand, February 2009.
[2] J. Allen, P. Asselin, and R. Foulds, American Sign Language Finger Spelling Recognition System, Proceedings of the Bioengineering Conference, March 2003.
[3] H. Brashear, T. Starner, P. Lukowicz, and H. Junker, Using multiple sensors for mobile sign language recognition, Proceedings of the IEEE International Symposium on Wearable Computers, pp. 45-52, October 2003.
[4] J. H. Shin, J. S. Lee, S. K. Kil, D. F. Shen, J. G. Ryu, E. H. H. K. Min, and S. H. Hong, Hand Region Extraction and Gesture Recognition using entropy analysis, International Journal of Computer Science and Network Security, Vol. 6, No. 2, pp. 216-222, February 2006.
[5] C. L. Huang and W. Y. Huang, Sign language recognition using model based tracking and a 3D Hopfield neural network, Machine Vision and Applications, Vol. 10, pp. 292-301, 1998.
[6] G. Gomez, M. Sanchez, and L. E. Sucar, On selecting an appropriate colour space for skin detection, Proceedings of the Mexican International Conference on Artificial Intelligence, Yucatan, pp. 69-78, 2002.

