
Proceedings of the 1996 IEEE
International Conference on Robotics and Automation
Minneapolis, Minnesota - April 1996

Online, Interactive Learning of Gestures for Human/Robot Interfaces

Christopher Lee and Yangsheng Xu
The Robotics Institute
Carnegie Mellon University
Pittsburgh, Pennsylvania 15213, USA

Abstract

We have developed a gesture recognition system, based on Hidden Markov Models, which can interactively recognize gestures and perform online learning of new gestures. In addition, it is able to update its model of a gesture iteratively with each example it recognizes. This system has demonstrated reliable recognition of 14 different gestures after only one or two examples of each. The system is currently interfaced to a Cyberglove for use in recognition of gestures from the sign language alphabet. The system is being implemented as part of an interactive interface for robot teleoperation and programming by example.

1 Introduction

If we are to fully harness the potential of robotic technology, we will have to move beyond simple keyboard/mouse/teach-pendant style robot programming and create comprehensive frameworks for productive realtime interaction between robots and humans. The motivations behind this kind of interaction include increasing the effectiveness of teleoperation, enabling people to interactively teach robots new tasks or refine their skills, and allowing people to more effectively control systems such as semi-autonomous airplanes or systems for automated monitoring of industrial plants. As people interact with machines which are autonomous and highly complex, they must be allowed to focus their attention on the content of their interaction rather than the mechanisms and protocol through which the interaction occurs. This is best accomplished by making the style of interaction more closely resemble that to which they are most accustomed: interaction with other people.

One of the first efforts towards realizing this new style of human/robot interaction is current research on programming by example. In this methodology, a robot learns to perform a task by observing a human teacher. Much effort is being directed toward the research of systems for gesture/observation-based programming of robot systems. Tung and Kak [1] demonstrate automatic learning of robot tasks through a DataGlove interface. Kang and Ikeuchi [2] developed a system for simple task learning by human demonstration. Voyles [3] developed a system for gesture-based programming via a multi-agent model.

One capability which is currently lacking in systems such as these is a mechanism for online teaching of gestures with symbolic meanings. Most gesture recognition systems either require some explicit programming, or in the case of neural-nets, require offline training of model parameters. Such systems are thus unsuitable for interactive applications where gestures must be taught and then recognized without waiting for an offline training or programming phase. A teach-by-demonstration system should also be able to learn a new gesture or skill in an online manner and with a very small number of examples. This simplifies the teaching process because it more closely resembles the manner of instruction which we commonly use to train people.

In this paper, we present a gesture recognition system which can interactively recognize gestures and learn new gestures online with as few as one or two examples. It is also able to update its model of a gesture iteratively with each example it recognizes.

2 Approach

Our goal is to make a system which can not only interact with a user by accurately recognizing gestures, but which can learn new gestures and update its understanding of gestures it already knows in an online, interactive manner. Our approach is automated generation and iterative training of a set of Hidden Markov models which represent human gestures. Using this approach, we have built and tested a system which recognizes letters from the sign language alphabet using a Virtual Technologies 'Cyberglove'.

The most basic assumption we make about the nature of human gestures is that they are "doubly stochastic" processes. These are Markov processes whose internal state is not directly observable. For each state transition in such a process, the system generates an observable output signal whose value depends on a probability distribution which is fixed for each internal state. A model of such a system is called a Hidden Markov Model (HMM). Modelling human gestures as HMMs allows us to deal with the highly stochastic nature of human gesture performance.

For computational simplicity, we assume that the HMMs are 'discrete' HMMs. These are HMMs whose observable outputs are members of a finite set of symbols. Before we can use discrete HMMs to model gestures, we must therefore preprocess the raw data from the gesture input device into a sequence of discrete symbols.

Gesture recognition is, to a certain extent, a process of data compression, where we input a large amount of raw data and output a single value that represents the classification of the gesture. This compression process is composed of two stages. The greatest data-reduction occurs when the preprocessor converts the large, multichannel stream of data from the gesture input device to the sequence of discrete observable symbols. We reduce about 0.5-1.0 Kbytes of data from the Cyberglove to a sequence of 5-10 observable symbols from a set of between 32 and 256 symbols, which is a reduction of about 100:1. Since the online portion of our gesture learning process is limited to the modification of the parameters of HMMs, this data-reduction greatly increases the speed and simplicity of the online learning process by focusing the HMM on modelling specific features of the signal.

3 Interactive training

Our project is to build a system which allows for interactive, online training. In our system, each kind of gesture is represented by an HMM, a list of example observation sequences, and an optional action to be performed upon recognition of the gesture (a record sketched in the code after this procedure). Our concept of interactive training is currently based on the following general procedure:

1. The user makes a series of gestures.

2. The system segments the stream of data from the input device into separate gestures, and in realtime, tries to classify each gesture.

   - If the system is certain about its classification of a gesture, it immediately performs an action associated with that gesture (if one has been specified). Such an action could be passing the result of the classification to a higher-level HMM, or sending a command to a robot.

   - If the system is in any way unsure about its classification of a gesture, it queries the user for confirmation of its classification. The user either:
     - confirms the system's classification, or
     - corrects the classification, or
     - adds a new gesture class to the system's bank of gesture models.

3. The system adds the symbols of the encoded gesture to the list of example sequences of the proper gesture model, then updates the parameters of that model by retraining the HMM on the accumulated example sequences.
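
The bookkeeping behind this procedure can be pictured as a small record kept per gesture class. The following C sketch is purely illustrative, not our implementation; the type names, function-pointer action, and fixed capacity are hypothetical, and the HMM fields anticipate the (A, B, pi) parameterization described in Section 4:

    #include <stddef.h>

    #define MAX_EXAMPLES 64          /* illustrative fixed capacity */

    /* Parameters of one discrete HMM (see Section 4). */
    typedef struct {
        int     n_states;            /* number of hidden states         */
        int     n_symbols;           /* size of the observable alphabet */
        double *A;                   /* n_states x n_states transitions */
        double *B;                   /* n_states x n_symbols emissions  */
        double *pi;                  /* initial-state distribution      */
    } Hmm;

    /* Optional action to perform when this gesture is recognized. */
    typedef void (*ActionFn)(void *user_data);

    /* One gesture class: an HMM, the accumulated example
       observation sequences, and an optional action. */
    typedef struct {
        Hmm      model;
        int     *examples[MAX_EXAMPLES];     /* symbol sequences */
        size_t   example_len[MAX_EXAMPLES];
        int      n_examples;
        ActionFn action;                     /* may be NULL      */
        void    *action_data;
    } GestureModel;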

We have found that in our implementation, recognition of the gesture and automatic update of the HMM through the Baum-Welch algorithm is fast enough not to be noticeable during normal use of the system. This provides the system with a truly interactive character. For example, a user controlling a robot through the gesture system could perform a gesture which the system has not seen before, and the system would immediately respond by asking what kind of gesture it is. The user could respond that it is a "halt" gesture, and that the robot should stop its current motion when that gesture is made. The next time the user performs that gesture, the system should recognize it and immediately halt the motion of the robot.

4 Recognition and Learning with HMMs

Hidden Markov Models [4] are commonly used for speech recognition, but have also been used for characterizing task information and human skills for transfer to robots in telerobotic applications [5, 6]. A Hidden Markov Model is a representation of a Markov process which cannot be directly observed (a "doubly stochastic" system).
The discrete form of the HMM is represented by three matrices, λ = (A, B, π). The matrix A = {a_ij} specifies the probability that the internal state will change from i to j. B = {b_jk} represents the probability that the system will generate the observable output symbol k on transition to state j. The third component of a Hidden Markov Model is a vector π indicating the distribution of probability that any given state is the initial state of the hidden Markov process.

There are three problems commonly associated with Hidden Markov models [4]: (1) determining the probability with which a given sequence of observable symbols would be generated by an HMM, (2) determining the most likely sequence of internal states in a given HMM which would have given rise to a given sequence of observable symbols, and (3) generating an HMM that best 'explains' a sequence or set of sequences of observables. We are directly concerned with the first problem for gesture recognition, and the third problem for generating the HMMs used in gesture recognition.

Gesture recognition. The problem of recognizing a gesture from a given set of input data is an example of problem 1. First, raw input data from the input device is preprocessed into a sequence of discrete observation symbols O = O_1 O_2 O_3 .... Then it is determined which of a set of HMMs, each modeling a different gesture, is most likely to have generated that sequence: L = argmax_l [P(O | λ_l)]. In addition, the system determines if there is an ambiguity between two or more gestures (the probabilities of the most likely gestures are too close to one another) or if no known gesture is similar to the observed data (the probability of the most likely gesture is too small).
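
The quantity P(O | λ) is typically computed with the standard forward algorithm [4]. Below is a minimal, unscaled C sketch of this classification step; the function names and fixed sizes are illustrative, and a practical version would work with scaled or log-domain probabilities to avoid underflow on longer sequences:

    #define N 5    /* states  */
    #define M 32   /* symbols */

    /* Forward algorithm: P(O | lambda) for a discrete HMM.
       A[i][j]: transition i->j, B[j][k]: emit symbol k in state j,
       pi[i]: initial distribution. Unscaled, so only suitable for
       short observation sequences such as our 5-10 symbols. */
    double forward_prob(int T, const int *O,
                        const double A[N][N], const double B[N][M],
                        const double pi[N])
    {
        double alpha[N], next[N];

        for (int i = 0; i < N; i++)              /* initialization */
            alpha[i] = pi[i] * B[i][O[0]];

        for (int t = 1; t < T; t++) {            /* induction      */
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int i = 0; i < N; i++)
                    s += alpha[i] * A[i][j];
                next[j] = s * B[j][O[t]];
            }
            for (int j = 0; j < N; j++)
                alpha[j] = next[j];
        }

        double p = 0.0;                          /* termination    */
        for (int i = 0; i < N; i++)
            p += alpha[i];
        return p;
    }

    /* Classification: index of the model most likely to have
       generated O, i.e. L = argmax_l P(O | lambda_l). */
    int classify(int n_models, int T, const int *O,
                 const double A[][N][N], const double B[][N][M],
                 const double pi[][N])
    {
        int best = 0;
        double best_p = -1.0;
        for (int l = 0; l < n_models; l++) {
            double p = forward_prob(T, O, A[l], B[l], pi[l]);
            if (p > best_p) { best_p = p; best = l; }
        }
        return best;
    }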

Learning gesture models. Developing the HMM which will be associated with a gesture is an example of problem 3. The algorithm which is commonly used for this purpose is the Baum-Welch (BW) algorithm. Baum-Welch uses an iterative expectation/maximization process to find an HMM which is a local maximum in its likelihood to have generated a set of 'training' observation sequences. Although the space of possible HMMs for a given HMM structure is generally full of many of these local maxima, it has been observed that most work similarly well for practical use in modeling doubly stochastic systems.

While training of HMMs is normally a batch process, where many examples of a given gesture are used simultaneously as inputs to the Baum-Welch algorithm, we use a different approach. We begin with one or some small number of examples, run BW until it converges, then iteratively add more examples, updating the model with BW after each one. This allows for an online, interactive style of gesture training. Section 7 compares the results of this kind of incremental training to batch-style processing.
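
For reference, a single unscaled Baum-Welch re-estimation pass over one short sequence may be sketched as below. This is an illustrative reduction of the algorithm in [4], not our implementation: we actually re-estimate over the accumulated set of example sequences and iterate until convergence, and a practical version needs scaling to avoid underflow.

    #define N 5    /* states  */
    #define M 32   /* symbols */

    /* One Baum-Welch re-estimation pass for a single observation
       sequence O[0..T-1]. Updates A, B and pi in place. Unscaled:
       usable only for short sequences. */
    void baum_welch_step(int T, const int *O,
                         double A[N][N], double B[N][M], double pi[N])
    {
        double alpha[T][N], beta[T][N];

        /* forward pass */
        for (int i = 0; i < N; i++)
            alpha[0][i] = pi[i] * B[i][O[0]];
        for (int t = 1; t < T; t++)
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int i = 0; i < N; i++) s += alpha[t-1][i] * A[i][j];
                alpha[t][j] = s * B[j][O[t]];
            }

        /* backward pass */
        for (int i = 0; i < N; i++) beta[T-1][i] = 1.0;
        for (int t = T - 2; t >= 0; t--)
            for (int i = 0; i < N; i++) {
                double s = 0.0;
                for (int j = 0; j < N; j++)
                    s += A[i][j] * B[j][O[t+1]] * beta[t+1][j];
                beta[t][i] = s;
            }

        double P = 0.0;                      /* P(O | lambda) */
        for (int i = 0; i < N; i++) P += alpha[T-1][i];

        /* expected counts -> re-estimated parameters */
        for (int i = 0; i < N; i++) {
            double gamma_sum = 0.0;          /* sum_t gamma_t(i), t < T-1 */
            double xi_sum[N] = {0};          /* sum_t xi_t(i,j)           */
            for (int t = 0; t < T - 1; t++) {
                double g = alpha[t][i] * beta[t][i] / P;
                gamma_sum += g;
                for (int j = 0; j < N; j++)
                    xi_sum[j] += alpha[t][i] * A[i][j]
                               * B[j][O[t+1]] * beta[t+1][j] / P;
            }
            for (int j = 0; j < N; j++)
                A[i][j] = gamma_sum > 0.0 ? xi_sum[j] / gamma_sum : A[i][j];

            pi[i] = alpha[0][i] * beta[0][i] / P;

            double g_all = 0.0, g_sym[M] = {0};
            for (int t = 0; t < T; t++) {
                double g = alpha[t][i] * beta[t][i] / P;
                g_all += g;
                g_sym[O[t]] += g;
            }
            for (int k = 0; k < M; k++)
                B[i][k] = g_all > 0.0 ? g_sym[k] / g_all : B[i][k];
        }
    }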

5 Signal preprocessing

Since we are using discrete HMMs, we need to represent gestures as sequences of discrete symbols. We must therefore preprocess the raw gesture data, which in our case are values of 20 joint-angles in the hand, estimated from 18 sensors in the Cyberglove at about 10 Hz.

The first choice we must make in performing this preprocessing is whether we want to generate a one-dimensional sequence of symbols or a multi-dimensional sequence. The multi-dimensional sequence may be used as input to a multi-dimensional HMM. If we assume that the dimensions of the observable sequence are dependent only on the internal state and not on one another, then the multi-dimensional HMM will have a single A matrix and multiple B matrices, one for each dimension of the output symbols. Although it makes some intuitive sense to train a multi-dimensional HMM which takes as inputs, say, 5 dimensions of symbols (one for each finger), we chose for simplicity to look at the hand as a single entity, and generated a single-dimensional sequence of symbols which represent features with respect to the entire hand. The specific preprocessing algorithm we chose for this purpose is inspired by work done with HMMs in the speech community: it is vector-quantization (VQ) of a series of short-time fast Fourier transforms (STFFTs) of the signal from the input device.

Figure 1 shows the flow of data through the preprocessor. 20 channels of joint-angle data from the hand are read from the Cyberglove [a]. This data is segmented into separate gestures [b], and resampled by cubic interpolation [c]. Each channel is then broken into a series of overlapping time-windows of 4-8 readings, and each time-window is smoothed with a Hamming function before being passed to the FFT routine [d]. The power spectra of the 20 channels are then concatenated [e] to form a large vector [f]. This vector represents the position and dynamic characteristics of the gesture during the time-window. Each of these large vectors is turned into an observable symbol by a vector-quantizer [g].
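
The smoothing in step [d] is a standard Hamming window applied to each time-window before the FFT. A minimal sketch in C (the function name is illustrative):

    #include <math.h>

    /* Multiply one time-window of readings by a Hamming window,
       w[n] = 0.54 - 0.46*cos(2*pi*n/(L-1)), to reduce spectral
       leakage before the FFT. */
    void hamming_window(double *x, int len)
    {
        const double pi = 3.14159265358979323846;
        for (int n = 0; n < len; n++)
            x[n] *= 0.54 - 0.46 * cos(2.0 * pi * n / (len - 1));
    }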

[Figure: data flow through the preprocessor — (a) 20-channel continuous stream; (b) stream segmented into gestures; (c) resampling; (d) windowed FFT; (e) concatenation of the 20 channels of spectra for each window; (f) combined vector; (g) vector quantization.]

Figure 1: Data flow in the preprocessor

The vector quantizer encodes a vector a by returning the index K of the vector c_K in a set of vectors called the 'codebook' (c_K ∈ C) which is closest to a in the L2-norm sense (i.e. K = argmin_k ||c_k - a||). The codebook is a set of vectors which is believed to be representative of the domain of vectors to be encoded. We generate this codebook using the LBG algorithm [7] on a representative sample of gesture data. Codebook generation is an offline process.
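
This nearest-neighbor encoding step might be written as follows; the flat row-major codebook layout and names are illustrative:

    #include <float.h>

    /* Return the index K of the codebook vector closest to `a`
       in the squared-Euclidean (L2) sense. The codebook holds
       `n_codes` vectors of dimension `dim`, stored row-major. */
    int vq_encode(const double *a, const double *codebook,
                  int n_codes, int dim)
    {
        int    best   = 0;
        double best_d = DBL_MAX;

        for (int k = 0; k < n_codes; k++) {
            double d = 0.0;
            for (int i = 0; i < dim; i++) {
                double diff = codebook[k * dim + i] - a[i];
                d += diff * diff;
            }
            if (d < best_d) { best_d = d; best = k; }
        }
        return best;
    }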

Because the preprocessor is coded as a data filter, a symbol is sent to the gesture-recognition system as soon as enough data has been read from the glove to generate it. This allows the system to eventually be reconfigured to use HMMs for online segmentation of continuous gestures rather than performing segmentation as a separate step.

This preprocessor is not task specific. It assumes only that all dimensions of the data are related, that the gesture itself is a process which evolves fairly smoothly over time, and that the interesting features of the gesture are amenable to clustering in the spectral domain and are consistent in nature over time. It may thus be used without modification for recognizing gestures such as handwriting, facial expressions, or dance motions as well as hand gestures. Note that specially developed, task-specific signal-preprocessors could be expected to perform better for recognition of specific kinds of gestures by leveraging expert a priori knowledge of the structure of the gestures, and would be necessary for signals which do not satisfy the assumptions stated above.

6 Implementation

Our gesture segmentation procedure is very simple, relying on the hand of the operator being still for a short time between gestures. Another possible tool for segmentation is an acceleration threshold. This is useful for segmentation when the hand does not stop between gestures, and can be combined with the velocity-based segmentation.

Figure 2 shows the organization of our system. A Cyberglove is interfaced to a Sun Sparcstation via an RS-232 serial connection, and sends readings of joint angles in the user's hand to a data-collection program at about 10 Hz. Each reading that is part of a gesture is marked with a timestamp, and sent to the gesture-recognition program via either a TCP-based UNIX socket or a temporary file. The socket connection is used for normal operation, and the temporary file is used for creating databases of gesture data for offline generation of vector-codebooks, and for automated testing and tuning of system parameters.

After the data for a gesture is preprocessed by the algorithm specified in Section 5, it is evaluated by all HMMs for the recognition process, and then used to update the parameters of the proper HMM. For recognition of hand gestures, we used 5-state Bakis HMMs. A Bakis HMM is one where the system is restricted to only move from a given state to the same state or one of the next 2 states. A 5-state Bakis HMM may move from state 1 to states 1, 2, or 3, and from state 4 to state 4 or 5 (there is no state 6). This restriction encodes the assumption that the gestures we are classifying are in general a simple sequence of motions, and non-cyclical in nature. Using this HMM structure, we performed tests to determine the ability of the system to accurately recognize gestures and to optimize the parameters of the preprocessor.
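
The Bakis restriction amounts to a banded transition matrix: a_ij = 0 unless j is i, i+1, or i+2. A sketch of one such initialization (the uniform starting values are an illustrative choice; zero entries remain zero under Baum-Welch re-estimation):

    #define N_STATES 5

    /* Initialize a 5-state Bakis transition matrix: from state i
       the model may stay put or advance by one or two states.
       Zero entries stay zero under Baum-Welch re-estimation. */
    void bakis_init(double A[N_STATES][N_STATES])
    {
        for (int i = 0; i < N_STATES; i++) {
            int reachable = 0;
            for (int j = 0; j < N_STATES; j++) {
                A[i][j] = (j >= i && j <= i + 2) ? 1.0 : 0.0;
                if (A[i][j] > 0.0) reachable++;
            }
            for (int j = 0; j < N_STATES; j++)   /* normalize row */
                A[i][j] /= reachable;
        }
    }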

The data-collection program for the Cyberglove is written in C, and the interactive HMM recognition/training/GUI program is written in a combination of C, Tk, and Guile, a language closely related to Scheme. Although matrix operations are coded in C for speed, the outer loops of Baum-Welch, gesture-recognition, and all other code are currently executed in interpreted Scheme. Even with the majority of the code written in Scheme, gesture-recognition and update of the proper HMM occur in a fraction of a second on a Sparcstation.

[Figure: two Unix processes — one for data collection (CyberGlove via RS-232, data-collection, preprocessing/segmentation) and one for the GUI and HMM analysis (preprocessing/segmentation, recognition/learning, gesture models (A, B, π, O) with optional actions) — connected by a socket for normal operation or a temporary file.]
Figure 2: Data flow for learning system

7 Confidence measure

We defined a simple evaluation function to examine the classification power of our algorithm. The function indicates misclassifications and their severity, as well as the system's confidence in its correct classifications.

Our evaluation function is

    V = log( max_i P_{E_i} / P_C )

where P_C is the probability that the observation sequence would be created by the correct gesture model, and P_{E_i} is the probability that the gesture symbols would be created by the i-th incorrect model. Therefore, if V < 0, we have correct classification, and if V > 0, the gesture is incorrectly classified. If -1 < V < 1, the classifier should indicate that its classification is suspect. If V < -2, the system has made the correct classification and is very confident of the classification.
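
A sketch of this measure in C, assuming base-10 log-probabilities (e.g. from a scaled forward pass), under which the thresholds above read as order-of-magnitude ratios; the names are illustrative:

    #include <math.h>

    /* Evaluation measure V = log10(max_i P_Ei / P_C), computed
       from log-probabilities to avoid underflow. V < 0 means the
       correct model won; V < -2 indicates a very confident
       classification. */
    double v_measure(double log10_p_correct,
                     const double *log10_p_wrong, int n_wrong)
    {
        double max_wrong = -HUGE_VAL;
        for (int i = 0; i < n_wrong; i++)
            if (log10_p_wrong[i] > max_wrong)
                max_wrong = log10_p_wrong[i];
        return max_wrong - log10_p_correct;
    }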

We plotted the performance of the system using the following procedure:

1. Train each gesture model with one example.

2. Test the classification of 20 test examples of each kind of gesture (the models are never trained on these examples).

3. Plot the average value of V versus the number of examples of gestures each model has seen.

This procedure allows us to judge the performance of the algorithm with respect to learning rate and overall confidence. It also allows us to test the effect of varying parameters such as the number of vectors in the codebook (the number of observable symbols), the size of the STFFT windows, and the resampling timestep for the signal interpolator.

For our test, we picked 14 letters from the sign language alphabet which were amenable to VQ clustering and unambiguous with respect to general hand orientation, since we did not use the Polhemus 6D position/orientation sensor for the hand. The letters we used are A, B, C, D, E, F, G, I, K, L, M, U, W, and Y. The final positions for two of these gestures are shown in Figure 4.

Two plots from our tests are shown in Figure 3. They demonstrate that the system is very reliable for both these sets of preprocessing parameters. The average value of V in all cases is less than zero, indicating reliable classification. Measuring percentage of classification errors, the trial of plot (b) had 1% error after 2 examples and less than 0.1% error after 4 examples. The trial of plot (a) had 2.4% error after two training examples, and made no classification errors after seeing 6 examples.

The iterative training of the HMMs usually results in models which are close to the quality of batch-trained HMMs when we compare the likelihood that the training data would be generated by the models. In a few cases, however, batch training did much better. This is probably the result of early training examples biasing an HMM toward a poor local maximum. Fortunately, batch training from a random HMM is a rapid process, and can easily be used to try to improve HMM models during a couple of seconds while the system is not doing realtime recognition. Our results show that this is not generally necessary, however, as iteratively trained HMMs work well enough in classifying gestures.

[Figure: plots of the average value of V versus number of training examples for two preprocessor configurations: (a) 128 symbols, 10 Hz resampling, 4 point data windows; (b) 64 symbols, 20 Hz resampling, 8 point data windows.]

Figure 3: Evaluation of classification process

[Figure: photographs of final hand positions.]

Figure 4: Final positions of two gestures: A and C

8 Conclusion and Discussion

We have demonstrated an efficient and reliable system for online learning and recognition of gestures. Online learning of gestures in an interactive system can be used to make cooperation between robots and humans easier in applications such as teleoperation and programming by demonstration.

The number of gestures we can classify this accurately is currently limited by the number of observable symbols the preprocessor generates. Thus at this time, systems which perform offline training can recognize a larger gesture vocabulary. Fels and Hinton [8], for example, use offline-trained neural networks to recognize 66 root words, each with up to 6 endings. We are working on increasing the potential size of the gesture vocabulary of our online system by having the preprocessor generate 5 dimensions of symbols (one for each finger) as input to a multidimensional HMM. We are also working on a mode where we can turn off learning of new gestures and the gesture-segmenter, and have the HMMs automatically segment continuous gestures. Finally, we are investigating the use of this system for human to robot skill transfer and teleoperation.

References

[1] C. P. Tung and A. C. Kak, "Automatic learning of assembly tasks using a dataglove system," in Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems, pp. 1-8, 1995.

[2] S. B. Kang and K. Ikeuchi, "Robot task programming by human demonstration," in Proceedings of the Image Understanding Workshop, 1994.

[3] R. Voyles, "Tactile gestures for human/robot interaction," in Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems, 1995.

[4] L. R. Rabiner and B. H. Juang, "An introduction to hidden Markov models," IEEE ASSP Magazine, pp. 4-16, January 1986.

[5] J. Yang, Y. Xu, and C. Chen, "Hidden Markov model approach to skill learning and its application in telerobotics," IEEE Trans. on Robotics and Automation, vol. 10, no. 5, pp. 621-631, 1994.

[6] Y. Xu and J. Yang, "Towards human-robot coordination: skill modeling and transferring via hidden Markov model," in Proceedings of the IEEE International Conference on Robotics and Automation, vol. 2, pp. 1906-1911, 1995.

[7] A. Gersho, "On the structure of vector quantizers," IEEE Transactions on Information Theory, vol. IT-28, no. 2, pp. 157-166, 1982.

[8] S. S. Fels and G. E. Hinton, "Glove-talk: A neural network interface between a data-glove and a speech synthesizer," IEEE Transactions on Neural Networks, vol. 4, pp. 2-8, January 1993.
