
IMAGE-TO-CLASS DYNAMIC TIME WARPING FOR 3D HAND GESTURE RECOGNITION

Hong Cheng, Zhongjun Dai, Zicheng Liu


University of Electronic Science and Technology of China; Microsoft Research, Redmond
hcheng@uestc.edu.cn, dzjun.uestc@gmail.com, zliu@microsoft.com
ABSTRACT
3D Human Computer Interaction (HCI) is becoming increasingly popular thanks to the emergence of commercial depth cameras. Moreover, hand gestures provide a natural and attractive alternative to cumbersome interface devices for HCI. In this paper, we present an Image-to-Class Dynamic Time Warping (I2C-DTW) approach for 3D hand gesture recognition. The main idea is to divide the time-series curve of a 3D hand gesture into various finger combinations, called fingerlets, which can either be learned or set manually to represent each gesture and to capture inter-class variations. The I2C-DTW approach searches for the minimal path to warp two fingerlets, one from a test image and one from a specific class. Gesture recognition then uses an ensemble of multiple image-to-class DTW distances over fingerlets to obtain better performance. The proposed approach is evaluated on two 3D hand gesture datasets, and the experimental results show that the proposed I2C-DTW approach significantly improves recognition performance.
Index Terms: Image-to-Class Distance, Fingerlets, Dynamic Time Warping, 3D Hand Gesture Recognition, Human Computer Interaction
1. INTRODUCTION
Hand gestures have been used since the beginning of time and are much older than speech. Moreover, hand gestures are a natural, ubiquitous, and meaningful part of spoken language, and researchers have claimed that gesture and sound form a tightly integrated system in human cognition [1]. Inspired by human interaction, which relies mainly on vision and sound, the use of hand gestures is one of the most powerful and efficient ways to realize Human Computer Interaction (HCI) [2, 3, 4].
There are three basic types of sensors capable of sensing hand gestures: mount based sensors (Microsoft's Digits hand-gesture sensor bracelet, IR gesture-sensing systems), touch based sensors (the iPhone), and vision based sensors [2, 5, 6]. Vision based sensors have several advantages. First of all, they can be less cumbersome and uncomfortable than the other two types since they require no physical contact with the user. Moreover, users can interact with a computer at a distance. Finally, vision based sensors can be robust for gesture recognition even in noisy environments where sound would fail. Though the computational complexity is indeed much higher than that of the other sensors, vision based gesture recognition becomes more and more attractive thanks to the fast development of vision sensors and powerful processors. In particular, the emergence of the Kinect devices has drawn more attention to 3D vision based gesture recognition [7, 3].

(Thanks to the NSFC (No. 61075045, 61273256) for funding and to the Program for New Century Excellent Talents in University (NCET-10-0292).)
In this paper, we propose an Image-to-Class Dynamic Time Warping (I2C-DTW) approach for 3D hand gesture recognition. Traditionally, Dynamic Time Warping (DTW) approaches are used to find an optimal match between two given signals (e.g., time series) under certain restrictions [8, 9]. Instead of finding an optimal image-to-image match, the proposed approach first searches for the minimal path to warp two fingerlets, one from a test image and one from a specific class. Afterwards, gesture recognition uses an ensemble of multiple image-to-class DTW distances over fingerlets to obtain better recognition performance. The contributions are twofold. First, we divide the time-series curve of a hand gesture into different finger combinations, called fingerlets. The important fingerlets can either be learned using data mining techniques or be set manually, and the class-dependent fingerlet weights are learned from data as well. Secondly, in contrast to Image-to-Image DTW (I2I-DTW), the proposed I2C-DTW approach can leverage both local and global gesture information, especially when distinguishing two gestures with the same number of stretched fingers. The proposed approach is evaluated on two 3D hand gesture datasets, and the experimental results show that it significantly improves recognition performance.
2. RELATED WORK
Vision based hand gesture recognition approaches usually
consist of three modules, detection, tracking and recognition
[6]. The detection modules are to define and extract visual
features which can encode hand gestures in the field of view

of the camera. The tracking modules are the processes of locating a hand over time using a camera. Especially, model
based tracking approaches provide ways to estimate model
parameters which are not directly observable at current moment. Finally, the recognition modules are to cluster the spatiotemporal features from the previous two modules and label
these groups with different gesture classes. Diverse hand gesture approaches have been proposed during the past ten years
by using different types of visual features and classifiers. In
this paper, we focus on extracting visual features and classifying approaches for 3D hand gesture recognition. Hence, we
present a review on the work of 3D gesture recognition approaches. Review on 2D gesture recognition approaches can
be found in[5, 6, 10].
Keskin et al. developed real time hand tracking and 3D dynamic gesture recognition using Hidden Markov Models (HMM) [11], where the user wore a colored glove and the hand coordinates were obtained via 3D reconstruction from stereo. Alon et al. proposed an approach for simultaneous localization and recognition of dynamic hand gestures whose core was a Dynamic Space Time Warping (DSTW) algorithm [12], in which a pair of query and model gestures were aligned in both space and time. Bonansea proposed a complete solution to the one-hand 3D gesture recognition problem, focusing on both 3D gesture recognition and understanding the scene being presented [13]. Oikonomidis et al. provided an optimization solution for estimating the 3D position, orientation, and full articulation of a hand observed by a Kinect sensor [14]; the problem is solved by minimizing the discrepancy between the appearance and 3D structure of hypothesized instances of a hand model and the actual hand observation, using a variant of Particle Swarm Optimization (PSO). Kurakin et al. proposed a real-time system for dynamic hand gesture recognition, where an action graph is used to share similar robust properties with standard HMMs [15]; this approach requires fewer training samples by allowing states to be shared among different gestures. Ren et al. proposed a novel distance metric for hand dissimilarity measure, called Finger-Earth Mover's Distance (FEMD), to handle the noisy hand shapes obtained from the Kinect sensor [4]. Guo proposed a 3D hand hierarchy model as the hand hypothesis in a tracking algorithm [2], where the hand gesture is first extracted as a static posture and particle swarm optimization is used to find the hand's 3D articulation by comparison with the hand hypothesis.
Hand gestures are spatio-temporal patterns. Most researchers focus on two important issues: extracting hand gesture features and classifying hand gestures. It is popular to simply use time-series curves to represent gestures; in this paper, we generate fingerlets from those time-series curves using different finger combinations. In addition, many gesture recognition approaches use image-to-image classification to recognize gestures. To better handle inter-class variations, we propose an image-to-class DTW approach.

3. THE OVERVIEW OF THE PROPOSED APPROACH
In this paper, we propose an image-to-class dynamic time warping approach using a fingerlet ensemble of 3D hand gestures. Instead of directly using the traditional time-series curves of 3D gestures, we represent each gesture as a fingerlet ensemble. The important fingerlets can be learned from training samples; by doing so, we can better model inter-class variations between gestures. Furthermore, the classifier is built on the combination of multiple image-to-class DTW distances to improve recognition performance.

The proposed image-to-class dynamic time warping approach for 3D gesture recognition consists of segmenting the hand, extracting the fingerlet ensemble, generating the fingerlet ensemble matrix of each gesture class, and the I2C-DTW classifier, as shown in Fig. 1. There are two basic steps, a training step and a testing step. In the training step, given a number of training samples with class labels, we first perform hand segmentation and then extract the contour of the hand. From these features, we generate the training descriptor matrix of each class. In the testing step, we obtain the testing descriptor matrix in almost the same way. Using the training descriptor matrices of all classes from the training step, the I2C-DTW classifier makes the classification decision for each test sample.
4. FINGERLET ENSEMBLE
4.1. Hand Segmentation
Hand segmentation is fundamental for gesture recognition. We relax two constraints of [4], the static threshold and the user's black belt, by using the following dynamic thresholding and principal orientation techniques.
We implement hand segmentation using a dynamic threshold technique as follows. First, we detect a face in the RGB image within the first $t$ frames and compute the average depth value of the face, $f_d$. In an interaction situation, we assume that a hand is around the face, and thus we define a rectangular Region of Interest (ROI) in which we find the $m$ points with minimum depth values. Finally, we determine the dynamic threshold as
$$t_d = \alpha \, (f_d - r_d), \qquad (1)$$
where $\alpha$ is a constant factor and $r_d$ is the average depth value of the ROI. Using this dynamic threshold, we obtain a rough hand segmentation. In our experiments, $m = 100$ and $\alpha = 0.4$. To obtain a more accurate hand shape, we use both the hand shape scale and skin color as priors to segment the hand. Finally, similar to [15], we calculate the principal orientation of the user's arm to compute an in-plane rotation matrix, so that the principal orientation of the user's forearm points upward after rotation.
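As an illustration of this dynamic thresholding step, the following Python sketch (our own, with illustrative names; how $t_d$ is applied to the depth values is our assumption, since the paper only defines the threshold itself) computes $t_d$ from a depth map:

```python
import numpy as np

def dynamic_hand_threshold(depth, face_depth_fd, roi, m=100, alpha=0.4):
    """Rough hand segmentation by dynamic depth thresholding (Eqn. (1)).

    depth: 2D array of depth values; face_depth_fd: average face depth f_d;
    roi:   (top, bottom, left, right) rectangle assumed to contain the hand.
    """
    top, bottom, left, right = roi
    roi_depth = depth[top:bottom, left:right].astype(np.float64)

    # r_d: average of the m smallest depth values in the ROI (the closest
    # points, assumed to belong to the hand).
    valid = roi_depth[roi_depth > 0]          # ignore missing depth readings
    r_d = np.sort(valid)[:m].mean()

    # Eqn. (1): t_d = alpha * (f_d - r_d). We assume pixels closer than
    # r_d + t_d are kept as the rough hand region.
    t_d = alpha * (face_depth_fd - r_d)
    mask = (roi_depth > 0) & (roi_depth < r_d + t_d)
    return mask, t_d
```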

Fig. 1. The framework of the proposed I2C-DTW approach.


Fig. 2. The procedure for generating the time-series curve of a hand gesture: (a) the rough hand segmentation; (b) the accurate hand segmentation; (c) the accurate hand segmentation after removing the palm; (d) the hand contour after edge detection; (e) the time-series curve of the segmented hand in (b); (f) the time-series curve $f_w$ from (c).

Given an accurately segmented hand shape, we now discuss how to extract the time-series curve of each gesture. The most discriminative features of gestures are the finger shapes, while the palm does not play an important role in gesture recognition. Fig. 2(b) shows the segmented hand, and its original time-series curve is shown in Fig. 2(e). Hence, we remove the palm using a circle whose radius varies with the gesture, as shown in Fig. 2(c); the corresponding time-series curve without the effect of the palm is shown in Fig. 2(f). We detect the hand contour using a Canny edge kernel, and then define the center point $P_0$ and the initial point $P_1$, as shown in Fig. 2(d). Assume there are $H$ points $P_i$ $(i = 2, \ldots, H)$ on the contour. The time-series curve $f_w(x) = y$ (Fig. 2(f)) represents the distance ($Y$ axis) between the contour and the center point with respect to the normalized angle ($X$ axis):
$$x = \angle(\overrightarrow{P_0 P_1}, \overrightarrow{P_0 P_i}) / 2\pi, \qquad y = |\overrightarrow{P_0 P_i}| / \max\{|\overrightarrow{P_0 P_1}|, |\overrightarrow{P_0 P_2}|, \ldots, |\overrightarrow{P_0 P_H}|\}, \qquad (2)$$
where the normalized angles are calculated between each contour vertex $P_i$ and $P_1$ relative to $P_0$.
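A minimal sketch of the curve extraction in Eqn. (2), assuming the contour points are already available from edge detection (function and variable names are illustrative, not from the original implementation):

```python
import numpy as np

def time_series_curve(contour, p0, p1):
    """Build the curve f_w of Eqn. (2) from a hand contour.

    contour: (H, 2) array of contour points; p0: center point;
    p1: initial contour point.
    """
    v1 = np.asarray(p1, float) - np.asarray(p0, float)   # P0 -> P1
    vi = contour.astype(float) - np.asarray(p0, float)   # P0 -> Pi, all i

    # Angle of each P0->Pi relative to P0->P1, normalized to [0, 1).
    ang = np.arctan2(vi[:, 1], vi[:, 0]) - np.arctan2(v1[1], v1[0])
    x = np.mod(ang, 2.0 * np.pi) / (2.0 * np.pi)

    # Distances normalized by the maximum over all |P0 Pi| (including P1).
    dist = np.linalg.norm(vi, axis=1)
    norm_max = max(dist.max(), np.linalg.norm(v1))
    y = dist / norm_max

    order = np.argsort(x)                                # sort by angle
    return x[order], y[order]
```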
4.2. Generating Fingerlets
In this paper, we do not use the time-series curve $f_w \in \mathbb{R}^H$ directly. Instead, we divide this curve into many fingerlets using different finger dividing and combining strategies. In general, we extract feature segments from $f_w$ to form the fingerlet ensemble of a hand gesture as
$$f = [f_1, f_2, \ldots, f_s, \ldots, f_S] \in \mathbb{R}^{S \times H}, \qquad (3)$$
where $f_s$ is a fingerlet consisting of a time-series curve segment and its orientation; $s$ is the index of the finger combination strategy; and $S \le \sum_{l=1}^{k} \binom{K}{l}$, where $k \in \{1, 2, \ldots, K\}$ is the number of stretched fingers of a gesture. Usually $K = 5$ and, with the adjacency restriction of Eqn. (4) below, $S = 5 + 4 + 3 + 2 + 1 = 15$. In general, each fingerlet in Eqn. (3) is segmented from $f_w$. Moreover, the class-dependent fingerlet weights can be learned from training samples.
In this paper, we separate fingerlets as follows. We use convexity detection to separate the fingerlets, and then implement different finger combination strategies:
$$s \in \{1, 2, 3, 4, 5, 12, 23, 34, 45, 123, 234, 345, 1234, 2345, 12345\}. \qquad (4)$$
Here, we assume that the number of fingers of one hand is five (in most situations), and label the thumb as 1, the index finger as 2, the middle finger as 3, the ring finger as 4, and the little finger as 5.

We would like to point out that: (1) The fingerlets are class-dependent. We can exploit this to generate a class-dependent fingerlet ensemble matrix from training samples, which reduces computational complexity. For example, the fingerlet ensemble of gesture 1 can be represented as $f = [f_1, f_2, f_3, f_4, f_5]$ using only single-finger segments, whereas we use the whole fingerlet ensemble of Eqn. (3) to represent gesture 5. (2) We discard fingerlets that contain any bent finger, so gesture 1 has only one fingerlet left, with index subset $s \in \{2\}$ and fingerlet ensemble $f = [f_2]$. Similarly, the fingerlet index subset of gesture 3 is $s \in \{3, 4, 5, 34, 45, 345\}$, as shown in Fig. 4. (3) The time-series curve of each fingerlet $f_s$ is extracted from $f_w$ by keeping the y-coordinates according to index $s$, while the remaining y-coordinates are set to 0. (4) To reduce the fingerlet space, we restrict the fingerlet ensemble to sets of fingers that are next to each other (123, 234, etc.).
5. THE PROPOSED IMAGE-TO-CLASS DYNAMIC TIME WARPING APPROACH

In this section, for a better understanding of the proposed image-to-class DTW approach, we first briefly introduce the classical DTW and then introduce the proposed I2C-DTW approach.

5.1. The Classical Dynamic Time Warping

Dynamic time warping algorithms are well known and widely used in automatic control [8], speech recognition [9], and computer vision [16, 12]; for a survey of dynamic time warping, see [17]. The DTW distance measure is an extremely efficient technique that allows a non-linear mapping of one feature to another by minimizing the distance between the two features. In other words, it minimizes the effects of shifting and distortion by allowing a non-linear transformation of time-series curves, so that the similarity of patterns can be measured even when they are out of phase. More specifically, the DTW algorithm, which derives from dynamic programming, builds a cost matrix and finds the optimal path through it, subject to certain constraints, between two time-series curves.

Mathematically, assume there are two features $A = \{a_1, a_2, \ldots, a_m\}$ and $B = \{b_1, b_2, \ldots, b_n\}$. We calculate an $m \times n$ cost matrix $C = [c(i,j)]$ as
$$c(i, j) = \|b_j - a_i\|_p,$$
where the Euclidean distance is used when $p = 2$.

An $(n, m)$-warping path $p = (p_1, \ldots, p_L)$ defines an alignment between $A$ and $B$ by assigning the elements $a_{i_l}$ of $A$ to the elements $b_{j_l}$ of $B$. The total cost $c_p(A, B)$ of a warping path $p$ between $A$ and $B$ with respect to the local cost measure $c(\cdot)$ is defined as
$$c_p(A, B) = \sum_{l=1}^{L} c(a_{i_l}, b_{j_l}). \qquad (5)$$
An optimal warping path between $A$ and $B$ is a warping path $p^*$ with minimal total cost among all possible warping paths. We thus obtain the DTW distance measure
$$\mathrm{DTW}(A, B) = c_{p^*}(A, B) = \min_p \{c_p(A, B)\}. \qquad (6)$$
Furthermore, an optimal warping path $p^*$ can be found using the following recursive formula [17]:
$$c_{p^*}(A, B) = \gamma(m, n), \qquad (7)$$
where
$$\gamma(i, j) = c(i, j) + \min\{\gamma(i-1, j),\ \gamma(i, j-1),\ \gamma(i-1, j-1)\}. \qquad (8)$$
The optimal warping path is typically subject to three constraints [18]: boundary, continuity, and monotonicity.
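The recursion of Eqns. (7)-(8) can be implemented directly with dynamic programming. Below is a minimal Python sketch (our illustration, not the authors' code) for two scalar time series with $p = 2$:

```python
import numpy as np

def dtw_distance(a, b):
    """Classical DTW of Eqns. (5)-(8) with Euclidean local cost (p = 2),
    under the usual boundary, continuity, and monotonicity constraints."""
    m, n = len(a), len(b)
    gamma = np.full((m + 1, n + 1), np.inf)
    gamma[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(a[i - 1] - b[j - 1])          # c(i, j) = |a_i - b_j|
            # Eqn. (8): accumulate the cheapest of the three predecessors.
            gamma[i, j] = cost + min(gamma[i - 1, j],      # vertical step
                                     gamma[i, j - 1],      # horizontal step
                                     gamma[i - 1, j - 1])  # diagonal step
    return gamma[m, n]                               # Eqn. (7): c_{p*}(A, B)
```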

5.2. The Image-to-Class DTW Approach

Fig. 3. Undesired dynamic warping of the traditional feature vector $f_w$ for two gestures with the same number of stretched fingers: (a) before DTW; (b) after DTW.
Most gesture recognition approaches represent each gesture as one feature vector and then calculate the similarity between a test sample and each training sample; the classifier assigns the test sample the gesture label of the most similar training sample. These are image-to-image hand gesture recognition approaches, such as DSTW [12] and FEMD [4]. They provide good gesture recognition only when the test image is very similar to one of the database images, but do not generalize much beyond the labelled gesture images. This limitation is especially severe for gesture classes with large diversity. More specifically, the classical DTW performs well on gestures that have different numbers of stretched fingers, but not as well on gestures that have the same number of stretched fingers.
Fig. 4. The training descriptor matrix $G_c$ ($c = 3$) from the 10-Gesture dataset [4]. Each row denotes a different training sample: column 1 shows samples of gesture 3 from different subjects, column 2 shows their corresponding time-series curves, and each remaining column shows a different fingerlet. Black curves denote time-series curves; red bars denote the principal orientation of each finger. Note that we omit the null combinations to save space.
Fig. 3 shows a case of undesired dynamic warping of two different gestures: the two gestures shown in Fig. 3(a) can be warped into the same time-series curve (Fig. 3(b)), degrading recognition performance. The non-linear alignment of I2I-DTW can warp the curves of two different gestures with the same number of stretched fingers into the same curve. Inspired by the image-to-class nearest neighbor classifier [19, 20], we propose an image-to-class dynamic time warping approach to address this performance degradation in 3D gesture recognition.
Let's define the fingerlet ensemble matrix of class $c$ as
$$G_c = [f^{c,1}, f^{c,2}, \ldots, f^{c,N_c}]^T, \qquad (9)$$
where $G_c$ is an $N_c \times M_c$ matrix; $f^{c,n}$ is the fingerlet ensemble of training sample $n$ of class $c$ according to Eqn. (3); $N_c$ is the number of training samples of class $c$; and $M_c$ is the number of fingerlets per sample, which varies with $c$. Similarly, we obtain the feature vector of a testing sample, $f^{\text{test}} = [f_1, f_2, \ldots, f_S]^T$.
Moreover, we define the minimal warping path between $f^{\text{test}}$ and $G_c$ as
$$\text{I2C-DTW}(f^{\text{test}}, G_c) = \sum_{s=1}^{S} w_s^c \min_{n \in \{1, 2, \ldots, N_c\}} \text{DTW}(f_s, f^{c,n}_s), \qquad (10)$$
where $\text{DTW}(f_s, f^{c,n}_s)$ denotes the minimal warping path between $f_s$ and $f^{c,n}_s$; $w_s^c$ is the class weight of the $s$th fingerlet; and $f^{c,n}_s$ is one fingerlet of the ensemble $f^{c,n}$. Finally, we choose the gesture class with the minimal $\text{I2C-DTW}(f^{\text{test}}, G_c)$ as the class label of $f^{\text{test}}$. In Eqn. (10), the weights $\{w_s^c\}$ leverage the importance of different fingerlets and can be learned from training samples. In this paper, $w_s^c = 1$ for $s = 1, 2, \ldots, S$.
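To illustrate Eqn. (10), the following Python sketch (our illustration; the data layout for $G_c$ and the helper `dtw_distance` from the sketch in Section 5.1 are assumptions) computes the image-to-class distance and the final class decision:

```python
import numpy as np

def i2c_dtw(f_test, G_c, weights=None):
    """Image-to-class DTW distance of Eqn. (10).

    f_test:  list of S test fingerlets (1D arrays);
    G_c:     list of training fingerlet ensembles for class c, each a list
             of S fingerlets (one row of the matrix G_c);
    weights: optional per-fingerlet class weights w_s^c (default: all 1)."""
    S = len(f_test)
    w = weights if weights is not None else np.ones(S)
    total = 0.0
    for s in range(S):
        # Minimum DTW distance from the s-th test fingerlet to the s-th
        # fingerlet of ANY training sample of class c (image-to-class).
        d_min = min(dtw_distance(f_test[s], ensemble[s]) for ensemble in G_c)
        total += w[s] * d_min
    return total

def classify(f_test, class_matrices):
    """Assign the label of the class with minimal I2C-DTW distance."""
    return min(class_matrices, key=lambda c: i2c_dtw(f_test, class_matrices[c]))
```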

6. EXPERIMENTAL RESULTS AND ANALYSIS

Datasets: We evaluated the proposed gesture recognition approach on two datasets. The first is the 10-Gesture dataset [4], which contains both RGB and depth images. To further evaluate the approach, we collected a digit dataset of American Sign Language (ASL) using a Kinect device, called the UESTC-ASL dataset. Since there is no publicly released ASL dataset for 3D hand gesture recognition, we collected the ASL digit gestures from 1 to 10. Each gesture is performed 11 times by each of 10 participants at different orientations, depths, and scales, so the dataset has 1,100 samples in total. Our dataset is more challenging than the 10-Gesture dataset since the ASL digits are more difficult to recognize.

Experimental Setup: We follow the setup in [4]. For each gesture, we randomly select one case from each subject as a training sample, giving 100 training samples in total, and use the remaining cases as testing samples. We set $H = 400$. In the following two experiments, we use confusion matrices to evaluate the performance of the proposed I2C-DTW, averaging the results over 10 different trials.
6.1. The 10-Gesture Dataset
This experiment compared the proposed I2C-DTW approach with the FEMD approach on the 10-Gesture dataset. Fig. 5(a), (c), and (e) show the confusion matrices of FEMD, I2I-DTW, and I2C-DTW, respectively; the recognition rate of the I2C-DTW approach is 99.5%. The proposed approach significantly outperforms the FEMD approach (93.9%) and the traditional DTW approach (94.5%), thanks to both the fingerlet representation and the image-to-class classifier.
6.2. The UESTC-ASL Dataset
This experiment compared the proposed I2C-DTW approach with the FEMD approach on the UESTC-ASL dataset, shown in Fig. 5(b), (d), and (f), respectively. From this figure, we can see that I2C-DTW achieves an average rate of 90.5%, while the FEMD approach achieves only 82.8% and the traditional I2I-DTW achieves 81.8%. This dataset is really challenging due to the high similarity among the ASL digits, which results in smaller inter-class variations; nevertheless, the I2C-DTW is more robust in handling most of these variations.

Fig. 5. Confusion matrices on the 10-Gesture dataset (left column) and the UESTC-ASL dataset (right column): (a)(b) FEMD; (c)(d) I2I-DTW; (e)(f) I2C-DTW.
7. CONCLUSIONS
In this paper, we have proposed a novel image-to-class dynamic time warping approach for 3D hand gesture recognition. We have developed the fingerlet ensemble to represent the time-series curves of hand gestures, yielding better gesture features, and we have demonstrated the excellent performance of our approach on two different hand gesture datasets, the 10-Gesture dataset and the UESTC-ASL dataset.
8. REFERENCES
[1] S. Kelly, S. Manning and S. Rodak, Gesture gives a hand to language and learning: perspectives from cognitive neuroscience, developmental psychology and education, Language and Linguistics Compass, 2008.
[2] J. Guo and S. Li, Hand gesture recognition and interaction with 3D stereo camera, Project Report, Australian National University, 2011.
[3] H. Du and T. To, Hand gesture recognition using Kinect, Technical Report, Boston University, 2011.
[4] Z. Ren, J. Yuan and Z. Zhang, Robust hand gesture recognition based on finger-earth mover's distance with a commodity depth camera, in ACM MM, 2011.
[5] R. Lockton, Hand gesture recognition using computer vision, Project Report, Oxford University, 2002.

[6] X. Zabulis, H. Baltzakis and A. Argyros, Vision-based hand gesture recognition for human computer interaction, LEA, Series on Human Factors and Ergonomics, 2009.
[7] J. Shotton, A. Fitzgibbon and M. Cook, Real-time human pose recognition in parts from single depth images, in IEEE CVPR, 2011.
[8] R. Bellman and R. Kalaba, On adaptive control processes, IEEE TAC, 1959.
[9] C. Myers, L. Rabiner and A. Rosenberg, Performance tradeoffs in dynamic time warping algorithms for isolated word recognition, IEEE TASSP, 1980.
[10] V. Pavlovic, R. Sharma and T. Huang, Visual interpretation of hand gestures for human-computer interaction: a review, IEEE TPAMI, 1997.
[11] C. Keskin, A. Erkan and L. Akarun, Real time hand tracking and 3D gesture recognition for interactive interfaces using HMM, in ICANN/ICONIP, 2003.
[12] J. Alon, V. Athitsos, Q. Yuan and S. Sclaroff, Simultaneous localization and recognition of dynamic hand gestures, in IEEE Workshop on MVC, 2005.
[13] L. Bonansea, 3D hand gesture recognition using a ZCam and an SVM-SMO classifier, Graduate Theses and Dissertations, 2009.
[14] I. Oikonomidis, N. Kyriazis and A. Argyros, Efficient model-based 3D tracking of hand articulations using Kinect, in BMVC, 2011.
[15] A. Kurakin, Z. Zhang and Z. Liu, A real time system for dynamic hand gesture recognition with a depth sensor, in IEEE EUSIPCO, 2012.
[16] A. Corradini, Dynamic time warping for off-line recognition of a small gesture vocabulary, in IEEE ICCVW, 2001.
[17] P. Senin, Dynamic time warping algorithm review, Technical Report, University of Hawaii at Manoa, 2008.
[18] H. Sakoe and S. Chiba, Dynamic programming algorithm optimization for spoken word recognition, IEEE TASSP, 1978.
[19] O. Boiman, E. Shechtman and M. Irani, In defense of nearest-neighbor based image classification, in IEEE CVPR, 2008.
[20] H. Cheng, R. Yu, Z. Liu and Y. Liu, A pyramid nearest neighbor search kernel for object categorization, in IEEE ICPR, 2012.
