sensors. First of all, vision-based sensors can be less cumbersome and uncomfortable than the other two types because they require no physical contact with users. Moreover, users can interact with a computer at a distance. Finally, vision-based sensors can remain robust for gesture recognition even in noisy environments where sound-based sensing would struggle. Although the computational complexity is much higher than that of the other sensors, vision-based gesture recognition has become more and more attractive thanks to the fast development of vision sensors and powerful processors. In particular, the emergence of Kinect devices has drawn more attention to 3D vision-based gesture recognition [7, 3].
In this paper, we propose an Image-to-Class Dynamic Time Warping (I2C-DTW) approach for 3D hand gesture recognition. Traditionally, Dynamic Time Warping (DTW) approaches are used to find an optimal match between two given signals (e.g. time series) under certain restrictions [8, 9]. Instead of finding an optimal image-to-image match, the proposed approach first searches for the minimal path to warp two fingerlets, one from a test image and one from a specific class. Gesture recognition then uses the ensemble of multiple image-to-class DTW distances over fingerlets to obtain better recognition performance. The contributions are twofold. First, we divide the time-series curve of a hand gesture into different finger combinations, called fingerlets. The important fingerlets can either be learned using data mining techniques or be set manually; furthermore, the class-dependent fingerlet weights are learned from data as well. Second, in contrast to Image-to-Image DTW (I2I-DTW), the proposed I2C-DTW approach can leverage both local and global gesture information, especially when distinguishing two gestures with the same number of stretched fingers. The proposed approach is evaluated on two 3D hand gesture datasets, and the experimental results show that I2C-DTW significantly improves recognition performance.
2. RELATED WORK
Vision-based hand gesture recognition approaches usually consist of three modules: detection, tracking and recognition [6]. The detection module defines and extracts visual features that encode hand gestures in the field of view of the camera. The tracking module locates the hand over time using the camera; in particular, model-based tracking approaches provide ways to estimate model parameters that are not directly observable at the current moment. Finally, the recognition module clusters the spatio-temporal features produced by the previous two modules and labels these groups with different gesture classes. Diverse hand gesture approaches using different types of visual features and classifiers have been proposed during the past ten years. In this paper, we focus on visual feature extraction and classification for 3D hand gesture recognition, and hence review 3D gesture recognition approaches. Reviews of 2D gesture recognition approaches can be found in [5, 6, 10].
Keskin et al. developed real-time hand tracking and 3D dynamic gesture recognition using Hidden Markov Models (HMM) [11], where the user wore a colored glove and the hand coordinates were obtained via 3D reconstruction from stereo. Alon et al. proposed an approach for simultaneous localization and recognition of dynamic hand gestures whose core was a Dynamic Space-Time Warping (DSTW) algorithm [12], in which a pair of query and model gestures were aligned in both space and time. Bonansea proposed a complete solution for the one-hand 3D gesture recognition problem, focusing on both 3D gesture recognition and understanding the scene being presented [13]. Oikonomidis et al. provided an optimization solution for estimating the 3D position, orientation and full articulation of a hand observed by a Kinect sensor [14]; the problem is solved by minimizing the discrepancy between the appearance and 3D structure of hypothesized instances of a hand model and the actual hand observation, using a variant of Particle Swarm Optimization (PSO). Kurakin et al. proposed a real-time system for dynamic hand gesture recognition in which an action graph is used to share the robust properties of standard HMMs [15]; this approach requires fewer training samples by allowing states to be shared among different gestures. Ren et al. proposed a novel distance metric for hand dissimilarity measure, called Finger-Earth Mover's Distance (FEMD), to handle the noisy hand shapes obtained from the Kinect sensor [4]. Guo proposed a 3D hand hierarchy model as the hand hypothesis in a tracking algorithm [2], where the hand gesture is first extracted as static postures and particle swarm optimization is used to find the hand's 3D articulation by comparison with the hand hypothesis.
Hand gestures are spatio-temporal patterns. Most researchers focus on two important issues: extracting hand gesture features and classifying hand gestures. It is popular to simply use time-series curves to represent gestures; in this paper, we generate fingerlets from those time-series curves by using different finger combinations. In addition, many gesture recognition approaches use image-to-image classification to recognize gestures. To better handle inter-class variations, we propose an image-to-class DTW approach.
    \theta_i = \frac{\angle(\overrightarrow{P_0 P_i},\, \overrightarrow{P_0 P_1})}{2\pi}    (2)

Note that the normalized angles are calculated between each contour vertex P_i and P_1 relative to P_0.
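As a concrete illustration of the normalized-angle feature described above, the sketch below computes, for every contour vertex, the angle of P_0→P_i relative to P_0→P_1, wrapped and normalized into [0, 1). The function name and the exact normalization are assumptions for this sketch, since they are not fully recoverable from the text.

```python
import math

def normalized_angles(contour, p0, p1):
    """For each contour vertex P_i, compute the angle between the
    vectors P0->P_i and P0->P_1, normalized into [0, 1).

    `contour` is a list of (x, y) vertices; `p0` is the reference point
    (e.g. the palm center) and `p1` the starting contour vertex.
    """
    ref = math.atan2(p1[1] - p0[1], p1[0] - p0[0])
    angles = []
    for p in contour:
        a = math.atan2(p[1] - p0[1], p[0] - p0[0]) - ref
        a = (a + 2 * math.pi) % (2 * math.pi)   # wrap into [0, 2*pi)
        angles.append(a / (2 * math.pi))        # normalize into [0, 1)
    return angles
```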
4.2. Generating Fingerlets
In this paper, we do not use the time-series curve f_w^{RH} directly. Instead, we divide this curve into many fingerlets by using different finger dividing and combining strategies. In general, we can extract the fingerlet segmentation from f_w as

    f_w \rightarrow \{ f_s \mid s = 1, 2, \ldots, S \},    (3)

where f_s is a fingerlet consisting of a time-series curve segment and its orientation, s is the index of a finger combination strategy, and S = \sum_{l=1}^{K} K_l is the total number of strategies, with K_l the number of strategies for gestures with l stretched fingers and k \in \{1, 2, \ldots, K\} the number of stretched fingers of different gestures; usually K = 5 and S = 15. In general, each fingerlet in Eqn. (3) is segmented from f_w. Moreover, the class-dependent fingerlet weights can be learned from training samples.
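A minimal sketch of fingerlet generation under stated assumptions: `finger_spans` (the index range each finger occupies on the curve, obtained from fingertip detection, which is outside this sketch) and the explicit list of combination `strategies` are hypothetical inputs; in the paper the important combinations are learned by data mining or set manually.

```python
def generate_fingerlets(curve, finger_spans, strategies):
    """Extract fingerlets from a time-series curve f_w.

    curve        : the 1-D hand-contour feature sequence f_w
    finger_spans : dict mapping a finger name to its (start, end)
                   index range on the curve
    strategies   : the S finger-combination strategies, each a tuple
                   of finger names
    Returns one concatenated curve segment per strategy.
    """
    fingerlets = []
    for combo in strategies:
        segment = []
        for finger in combo:
            start, end = finger_spans[finger]
            segment.extend(curve[start:end])
        fingerlets.append(segment)
    return fingerlets
```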
Given two sequences a = (a_1, \ldots, a_I) and b = (b_1, \ldots, b_J), a warping path

    p = ((i_1, j_1), \ldots, (i_L, j_L))    (4)

aligns the two sequences under the boundary, monotonicity and continuity constraints. The cost of a path is

    c_p(a, b) = \sum_{l=1}^{L} c(a_{i_l}, b_{j_l}),    (5)

and the DTW distance is the minimal cost over all admissible warping paths:

    \mathrm{DTW}(a, b) = \min_{p} c_p(a, b).    (6)
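The minimal-cost warping can be computed with the standard dynamic-programming recurrence, sketched below. The absolute-difference local cost is an illustrative default, not necessarily the paper's choice.

```python
def dtw(a, b, c=lambda x, y: abs(x - y)):
    """Dynamic time warping distance between sequences a and b.

    D[i][j] holds the minimal cumulative cost of aligning a[:i] with
    b[:j]; each step extends the warping path by an insertion, a
    deletion, or a match.
    """
    inf = float("inf")
    I, J = len(a), len(b)
    D = [[inf] * (J + 1) for _ in range(I + 1)]
    D[0][0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            cost = c(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[I][J]
```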
Fig. 3. Undesired dynamic warping of the traditional feature vector f_w for two gestures with the same number of stretched fingers: (a) before DTW; (b) after DTW.
Most gesture recognition approaches represent each gesture as one feature vector, calculate the similarity between a test sample and each training sample, and assign the test sample the gesture label of the most similar training sample. These are image-to-image hand gesture recognition approaches, such as DSTW [12] and FEMD [4]. They provide good gesture recognition only when the test image is very similar to one of the database images, but do not generalize much beyond the labelled gesture images. This limitation is especially severe for gesture classes with large diversity. More specifically, classical DTW performs well on gestures that have different numbers of stretched fingers, but not on gestures with the same number of stretched fingers, as illustrated in Fig. 3.
The I2C-DTW distance between a test image and class c is

    D_{I2C}(c) = \sum_{s=1}^{S} w_s^c \min_{n \in \{1, 2, \ldots, N_c\}} \mathrm{DTW}(f_s, f_{c,n,s}),    (10)

where w_s^c is the class-dependent weight of the s-th fingerlet, N_c is the number of training samples of class c, and f_{c,n,s} is the s-th fingerlet of the n-th training sample of class c. The test image is assigned the label of the class with the minimal I2C-DTW distance.
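Under the assumption that each sample is stored as its list of fingerlets, the image-to-class distance (a weighted sum, over fingerlets, of the minimum DTW distance to any training sample of the class) can be sketched as follows; the function names and data layout are ours, not the paper's.

```python
def i2c_dtw_distance(test_fingerlets, class_fingerlets, weights, dtw):
    """Image-to-class DTW distance from one test image to one class.

    test_fingerlets  : list of S fingerlets of the test image
    class_fingerlets : per training sample of the class, its S fingerlets
    weights          : the class-dependent fingerlet weights w_s^c
    dtw              : a DTW distance function on two fingerlets
    For each fingerlet s, the closest training sample of the class is
    found, and the weighted minima are summed.
    """
    total = 0.0
    for s, f in enumerate(test_fingerlets):
        total += weights[s] * min(dtw(f, sample[s]) for sample in class_fingerlets)
    return total

def classify(test_fingerlets, classes, weights, dtw):
    """Assign the label of the class with the minimal I2C-DTW distance.

    `classes` maps a label to the fingerlets of its training samples,
    and `weights` maps a label to its fingerlet weights.
    """
    return min(classes, key=lambda c: i2c_dtw_distance(
        test_fingerlets, classes[c], weights[c], dtw))
```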
6. EXPERIMENTS

Datasets: We evaluated the proposed gesture recognition approach on two datasets. One is the 10-Gesture dataset [4], which contains both RGB and depth images. To evaluate the approach further, we collected a digit dataset of American Sign Language (ASL) using a Kinect device, called the UESTC-ASL dataset. Since there is no publicly released ASL dataset directly usable for 3D hand gesture recognition, we collected a dataset of the ASL digit gestures from 1 to 10. Each gesture is performed 11 times by each of the 10 participants in different orientations, depths, and scales; thus our dataset has 1100 samples in total. Our dataset is more challenging than the 10-Gesture dataset since the digits in ASL are more difficult to recognize.

Experimental Setup: We follow the setup in [4]. For each gesture, we randomly select one case from each subject as a training sample, giving 100 training samples in total, and use the remaining cases as testing samples. We set H = 400. In the following two experiments, we employ confusion matrices to evaluate the performance of the proposed I2C-DTW, averaging the results over 10 different trials.
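The evaluation protocol above can be sketched as follows; the helper names are ours, and the average recognition rate is taken as the mean of the per-class rates on the diagonal of the row-normalized confusion matrix.

```python
def confusion_matrix(true_labels, pred_labels, num_classes):
    """M[t][p] counts test samples of true class t predicted as class p."""
    M = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, pred_labels):
        M[t][p] += 1
    return M

def average_recognition_rate(M):
    """Mean per-class recognition rate from a confusion matrix."""
    rates = [row[i] / sum(row) for i, row in enumerate(M) if sum(row) > 0]
    return sum(rates) / len(rates)
```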
6.1. The 10-Gesture Dataset
This experiment compared the proposed I2C-DTW approach with the FEMD approach on the 10-Gesture dataset. Fig. 5 (a), (c) and (e) shows the confusion matrices of FEMD, I2I-DTW and I2C-DTW, respectively; the recognition rate of the I2C-DTW approach is 99.5%. The proposed approach significantly outperforms the FEMD approach (93.9%) and the traditional DTW approach (94.5%), thanks to both the fingerlet representation and the image-to-class classifier.
6.2. The UESTC-ASL Dataset
This experiment compared the proposed I2C-DTW approach with the FEMD approach on the UESTC-ASL dataset; the confusion matrices are shown in Fig. 5 (b), (d) and (f), respectively. From this figure, we can see that I2C-DTW achieves an average recognition rate of 90.5%, while the FEMD approach achieves only 82.8% and the traditional I2I-DTW 81.8%. This dataset is particularly challenging because the higher similarity among the ASL digits results in smaller inter-class variations; nevertheless, I2C-DTW is robust to most of these variations.
7. CONCLUSIONS
In this paper, we have proposed a novel image-to-class dynamic time warping approach for 3D hand gesture recognition. Experimental results on two 3D hand gesture datasets show that the proposed I2C-DTW approach significantly outperforms the FEMD and traditional I2I-DTW approaches.
Fig. 5. Confusion matrices of FEMD, I2I-DTW and I2C-DTW on the 10-Gesture dataset ((a), (c), (e)) and the UESTC-ASL dataset ((b), (d), (f)).
[8] R. Bellman and R. Kalaba, On adaptive control processes, IEEE TAC, 1959.
[9] C. Myers, L. Rabiner and A. Rosenberg, Performance
tradeoffs in dynamic time warping algorithms for isolated word recognition, IEEE TASSP, 1980.
[10] V. Pavlovic, R. Sharma and T. Huang, Visual interpretation of hand gestures for human-computer interaction:
a review, IEEE TPAMI, 1997.
[11] C. Keskin, A. Erkan and L. Akarun, Real time hand
tracking and 3D gesture recognition for interactive interfaces using HMM, in ICANN/ICONIPP, 2003.
[12] J. Alon, V. Athitsos, Q. Yuan and S. Sclaroff, Simultaneous localization and recognition of dynamic hand gestures, in IEEE Workshop on MVC, 2005.
[13] L. Bonansea, 3D Hand gesture recognition using a
ZCam and an SVM-SMO classifier, in Graduate Theses and Dissertations, 2009.
[14] I. Oikonomidis, N. Kyriazis and A. Argyros, Efficient model-based 3D tracking of hand articulations using Kinect, in BMVC, 2011.
[15] A. Kurakin, Z. Zhang, and Z. Liu, A real time system for dynamic hand gesture recognition with a depth
sensor, in IEEE EUSIPCO, 2012.
[16] A. Corradini, Dynamic time warping for off-line recognition of a small gesture vocabulary, in IEEE ICCVW,
2001.
[17] P. Senin, Dynamic time warping algorithm review, TR
of University of Hawaii at Manoa, 2008.
[18] H. Sakoe and S. Chiba, Dynamic programming algorithm optimization for spoken word recognition, IEEE TASSP, 1978.
[19] O. Boiman, E. Shechtman and M. Irani, In defense of
nearest-neighbor based image classification, in IEEE
CVPR, 2008.
[20] H. Cheng, R. Yu, Z. Liu and Y. Liu, A pyramid nearest neighbor search kernel for object categorization, in
IEEE ICPR, 2012.