

A Deformable 3D Facial Expression Model for Dynamic Human Emotional State Recognition
Yun Tie, Member, IEEE, and Ling Guan, Fellow, IEEE

Abstract: Automatic emotion recognition from facial expressions is one of the most intensively researched topics in affective computing and human-computer interaction (HCI). However, it is well known that, due to the lack of 3D features and dynamic analysis, the functional aspect of affective computing is insufficient for natural interaction. In this paper we present an automatic emotion recognition approach for video sequences based on a fiducial-point-controlled 3D facial model. The facial region is first detected with local normalization in the input frames. Twenty-six fiducial points are then located on the facial region and tracked through the video sequences by multiple particle filters. Depending on their displacements, the fiducial points are used as landmarked control points to synthesize the input emotional expressions on a generic mesh model. As a physics-based transformation, the Elastic Body Spline (EBS) technique is introduced to the facial mesh to generate a smooth warp that reflects the control point correspondences and to extract the deformation feature of the realistic emotional expressions. Discriminative Isomap (D-Isomap) based classification is used to embed the deformation feature into a low dimensional manifold that spans an expression space with one neutral and six emotion class centers. The final decision is made by computing the Nearest Class Center (NCC) in the feature space.

Index Terms: Video analysis, Elastic Body Spline, Differential Evolution Markov Chain, Discriminative Isomap, Nearest Class Center.

I. INTRODUCTION

WITH the rapid development of Human-Machine Interaction (HMI), affective computing is currently gaining popularity in research and flourishing in industry. It aims to equip computing devices with effortless and natural communication. The ability to recognize human affective states will empower intelligent computers to interpret, understand, and respond to human emotions, moods, and possibly intentions, in a way similar to how humans rely on their senses to assess each other's affective state [1]. Many potential applications, such as intelligent automobile systems, the game and entertainment industries, interactive video, and the indexing and retrieval of image or video databases, can benefit from this ability. Emotion recognition is the first and one of the most important issues in the affective computing field; it gives computers the ability to interact with humans more naturally and in a friendlier manner. Affective interaction can have maximal impact when emotion recognition and expression is
Y. Tie and L. Guan are with the Department of Electrical and Computer Engineering, Ryerson University, Toronto, ON, Canada. e-mail: ytie@ryerson.ca, lguan@ee.ryerson.ca. Copyright (c) 2012 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.

available to all parties, human and computer [2]. Most existing systems attempt to recognize the prototypic human emotions. It is widely accepted in psychological theory that human emotions can be classified into six archetypal emotions: surprise, fear, disgust, anger, happiness, and sadness, the so-called six basic emotions pioneered by Ekman and Friesen [3]. According to Ekman, the six basic emotions are not culturally determined but universal to human culture, and thus biological in origin. Several other emotions and many combinations of emotions have also been studied, but they remain unconfirmed as universally distinguishable. Facial expression regulates face-to-face interaction, indicates reciprocity and interpersonal attraction or repulsion, and enables intersubjectivity between members of different cultures [4]. Recent research in psychology and neurology has shown that facial expression is the most natural and primary cue for communicating the quality and nature of emotions, and that it correlates well with the body and voice [5]. Each of the six basic emotions corresponds to a unique facial expression. For the purposes of an emotion recognition system, facial expression analysis is therefore considered the major indicator of a human affective state. In the past 20 years there has been much research on recognizing emotion through facial expressions; however, challenges still remain. Traditionally, the majority of approaches attempt to perform the task on two-dimensional data, either 2D images or 2D video sequences. Unfortunately, such approaches have difficulty handling pose variations, illumination changes and subtle facial behavior. The performance of 2D-based algorithms remains unsatisfactory and often proves unreliable under adverse conditions. Using 3D visual features to recognize and understand facial expressions has been demonstrated to be a more robust approach for human emotion recognition [6]. However, general 3D emotion recognition approaches are mainly based on static analysis. A growing body of psychological research supports that the timing of expressions is a critical parameter in recognizing emotions and that the detailed spatial deformation of an expression is important for its recognition. Therefore, dynamic analysis of the state transitions of 3D faces could be a crucial clue to the investigation of human emotional states. Another weakness of existing 3D-based approaches is their complexity and the intensive computational cost required to meet the challenge of accuracy. The temporal and detailed spatial information in the 3D visual cues, both at local and global scales,


may cause more difficulties and complexity in describing human facial movement. Moreover, automatic detection and segmentation based on the facial components for emotion recognition has not been reported so far; most of the existing works require manual initialization.

Fig. 1. Overall system diagram.

In light of these problems, this paper presents an automatic emotion recognition method for video sequences based on a deformable 3D facial expression model. We use an elastic body spline (EBS) based approach for human emotion classification, with active deformation features extracted from a 3D generic model. This model is driven by the key fiducial points and thus makes it possible to generate the intrinsic geometries of the emotional space. The block diagram of the method is shown in Fig. 1. The rest of the paper is organized as follows. Section II gives an overview of the state of the art in human emotion recognition. We then present the proposed 3D facial modeling and feature extraction from video sequences using EBS techniques in Section III. Discriminative Isomap (D-Isomap) based classification is discussed in Section IV. The experimental results are presented in Section V, and Section VI gives our conclusions.

II. RELATED WORKS

The most commonly used vision-based coding system is the Facial Action Coding System (FACS), proposed by Ekman and Friesen [7] for the manual labeling of facial behavior. To recognize emotions from facial cues, FACS enables facial expression analysis through standardized coding of changes in facial motion in terms of atomic facial actions called Action Units (AUs). FACS decomposes facial muscular activity into 44 basic actions and describes facial expressions as combinations of AUs. Many researchers have been inspired by this work and have tried to analyze facial expressions in image and video processing. Most methods use the distribution of facial features as the input to a classification system whose output is one of the facial expression classes. Lyons et al. [8] used a set of multi-scale, multi-orientation Gabor filters to transform the images first. The Gabor coefficients sampled on a grid were combined into a single vector, and 75% expression classification accuracy was achieved using Linear Discriminant Analysis (LDA). Silva and Hui [9] determined the eye and lip positions using low-pass filtering and edge detection, and achieved an average emotion recognition rate of 60% with a neural network (NN). Cohen et al. [10] introduced the

temporal information from video sequences for recognizing human facial expressions. They proposed a multi-level hidden Markov model (HMM) classifier for dynamic classification, in which the temporal information was taken into account. Guo and Dyer [11] introduced a linear programming based method for facial expression recognition with a small number of training images per expression; a pairwise framework for feature selection was presented and three methods were compared experimentally. Pantic and Patras [12] presented a method to handle a large range of human facial behavior by recognizing the facial muscle actions that produce expressions. The algorithm performed both automatic segmentation into facial expressions and recognition of temporal segments of 27 AUs. Anderson and McOwan [13] presented an automated multistage system for real-time recognition of facial expression. The system used facial motion to characterize monochrome frontal views of facial expressions and was able to operate effectively in cluttered and dynamic scenes, recognizing the six emotions universally associated with unique facial expressions. Gunes and Piccardi [14] proposed an automatic method for temporal segment detection and affect recognition from facial and body displays. Wang and Guan [15] constructed a bimodal system for emotion recognition, using a facial detection scheme based on a Hue Saturation Value (HSV) color model to detect the face from the background and Gabor wavelet features to represent the facial expressions. Presently, state-of-the-art 3D facial modeling with physically based paradigms has been recognized as a key research area of emotion recognition for next-generation human-computer interaction (HCI) [16]. Song et al. [17] presented a generic facial expression analogy technique to transfer facial expressions between arbitrary 3D facial models and between 2D facial images. A geometry encoding for triangle meshes, vertex-tent coordinates, was proposed to formulate expression transfer in the 2D and 3D cases as the solution of a simple system of linear equations. In [18], a 3D-feature-based method for human emotion recognition was proposed. 3D geometric information plus color/density information of the facial expressions was extracted by a 3D Gabor library to construct visual feature vectors. The improved kernel canonical correlation analysis (IKCCA) algorithm was applied for the final decision, and the overall recognition rate was about 85%. A static 3D facial expression recognition method was proposed in [19]. The primitive 3D facial expression features were extracted from 3D models based on principal curvature calculation on 3D mesh models. Classification into one of the six basic emotions was done by statistical analysis of these features, and the best performance was obtained using LDA. Although several methods can achieve a very high recognition rate, most of the existing 3D facial expression recognition works are based on static data. Soyel and Demirel [20], [21] used six distance measures from 3D distributions of facial feature points to form the feature vectors. A probabilistic NN architecture was applied to classify the facial expressions, and they obtained an average recognition rate of 87.8%. Unfortunately, the authors did not specify how to identify this set of feature points. Tang and Huang [22], [23] used similar distance features based on the change of face shape between


the emotional expressions. Normalized Euclidean distances between the facial feature points were used for emotion classification. An automatic feature selection method was also proposed based on maximizing the average relative entropy of marginalized class-conditional feature distributions. Using a regularized multi-class AdaBoost classification algorithm, they achieved a 95.1% average recognition rate. However, the facial feature points were predefined on the cropped 3D face mesh model and were not generated automatically; such an approach is therefore difficult to use in real-world applications. Thus far, few efforts have been reported that exploit 3D facial expression recognition with dynamic or deformable feature analysis. Sun and Yin [24] extracted sophisticated features of geometric labeling and used 2D HMMs to model the spatial and temporal relationships between the features for recognizing expressions from 3D facial model sequences. However, this method requires manual detection and annotation of certain facial landmarks.

III. METHODOLOGY

In this section we present a fully automatic method for emotion recognition that exploits the EBS features between neutral and expressional faces based on a 3D deformable mesh (DM) model. The system consists of several steps. The facial region is first detected automatically in the input frames using the local normalization based method [25]. We then locate 26 fiducial points over the facial region using scale-space extrema and scale-invariant feature examination. The fiducial points are tracked continuously by multiple particle filters throughout the video sequences. EBS is used to extract the deformation features, and the D-Isomap algorithm is then applied for the final decision.

A. Preprocessing

Automatic face detection is the first essential requirement for our emotion recognition system. Since faces are non-rigid and have a high degree of variability in location, color and pose, several characteristics uncommon to other pattern detection problems make face detection complex. Occlusion, lighting distortions and illumination conditions can also change the overall appearance of a face. We detect facial regions in the input video sequence by feature selection and classification based on a local normalization technique [25]. Compared to Viola and Jones' algorithm [26], this method adapts to the normalized input image and is designed to complete the segmentation in a single iteration. With the local normalization based method, the proposed emotion recognition system is more robust under different illumination conditions. Fiducial points are a set of salient facial points, usually located on the corners, tips or mid points of facial components. Automatically detecting fiducial points allows the prominent characteristics of facial expressions to be captured through the distances between points and the relative sizes of the facial components, forming a feature vector. Additionally, the chosen feature points should represent the most important characteristics of

the face and be easy to extract. The Active Appearance Model (AAM) and Active Shape Model (ASM) are two popular feature localization methods that use statistical face models to prevent locating inappropriate feature points. The AAM [27], [28] fits a generative model to the region of interest; the best match of the model simultaneously yields the feature point locations. The ASM algorithm learns a statistical model of shape from manually labeled images together with PCA models of patches around individual feature points; the best local match of each feature is found under constraints on the relative configuration of feature points. Both are commonly used to track faces in video. In general, their point-to-point accuracy is around 85% if the bias of the automatic labeling result from the manual labeling result is less than 20% of the true inter-ocular distance [29]. However, this is not sufficient for facial expression analysis. We choose 26 fiducial points [30] on the facial region according to anthropometric measurement of the maximum movement of the facial components during expressions. To follow subtle changes in facial feature appearance, we define a SUCCESS case as one in which the bias of a detected point from the true facial point is less than 10% of the inter-ocular distance in the test image. The proposed method constructs a set of fiducial point detectors with scale-invariant features. Candidate points are selected over the facial region by local scale-space extrema detection, and the scale-invariant feature of each candidate point is extracted to form the feature vectors for detection. We use multiple Differential Evolution Markov Chain (DEMC) particle filters [31] to track the fiducial points depending on the locations of the current appearance of the spatially sampled features. Kernel correlation based on HSV color histograms is used to estimate the observation likelihood and measure the correctness of particles; we define the observation likelihood of the color measurement distribution using the correlation coefficient. Starting with a mode-seeking procedure, the posterior modes are subsequently detected through kernel correlation analysis. This provides a consistent way to resolve the ambiguities that arise in associating multiple objects with measurements of the similarity criterion between the target points and the candidate points. The proposed method achieves an overall accuracy of 91% for the 26 fiducial points [31].

B. 3D EBS Facial Modeling

The EBS is an image morphing technique derived from the Navier equation that describes the deformation of homogeneous elastic tissues. Davis et al. [32], [33] designed the EBS for matching 3D magnetic resonance images (MRIs) of the breast used in the evaluation of breast cancer; the coordinate transformations were evaluated with different types of deformations and different numbers of corresponding coordinate locations. The EBS is based on a mechanical model of an elastic body, which can approximate the properties of body tissues. The spline maps can be expressed as the linear combination of an affine transformation and a Navier interpolation spline. This allows each landmark to be mapped to the corresponding landmark in


the other image and provides interpolation of this mapping at intermediate locations. Hassanien and Nakajima [34] used the Navier EBS to generate warp functions for facial animation by interpolating scattered data points. Kuo et al. [35] proposed an iterative EBS algorithm to obtain the elastic properties of a facial model for facial expression generation. However, most of the feature points in these works were manually localized, and only 2D examples were considered for facial image analysis.

Fig. 2. Proposed 3D mesh model with 26 fiducial points (black) and 28 characteristic points (red).

The proposed EBS method automatically generates facial expressions using a 3D physically based DM model, driven by the control points from a deformable feature perspective, within an acceptable time for emotion recognition. The generic wireframe facial mesh model consists of characteristic feature points and deformable polygons with an EBS structure, and can be deformed to best fit a human face with any expression. The 3D affine transformation realizes the facial expressions by imitating the facial muscular actions; it formulates the deforming rules according to the FACS coding system, using the 26 fiducial points as control points. Fig. 2 shows the proposed model based on this standardized coding system. In practical applications, not all feature points in the model can be easily detected from the input sequences, so we use 54 characteristic feature points for facial expression parameterization. The characteristic feature points include: a) the 26 control points based on the fiducial points, and b) 28 dependent points which are determined by the control points. We also assume that the physical property of the EBS structure is the same within the facial region. The EBS deformation analysis is presented in the following section. The merits of this approach are: a) a physically based DM model of the human face with fiducial points drives the facial deformation according to a muscle movement parameterization. The face is modeled as an elastic body that is deformed under a tension force field, and muscles are interpreted as forces deforming the polygonal mesh of the face. The factors affecting the deformation are the tension of the muscle, the elasticity of the skin and the zone of influence. Higher-level parameterizations are easier to use for emotional expressions and can be defined in terms of low-level parameters. b) We extend the DM facial model by a set of well-designed polygons with an EBS structure which can be efficiently modified to establish the

facial expression model. A 3D face is decomposed into area or volume elements, each endowed with physical parameters embedded in an EBS model according to the surface curvature; the deformable element relationships are computed by integrating the piecewise components over the entire face. c) The control points are predefined by the landmarked fiducial points. The number of control points is small and they can be identified robustly and automatically. Once the control points are adjusted, the emotional facial model can be established using the EBS transform function and extended to obtain expression parameters for the final recognition. Using EBS transforms we can interpolate the positions of the characteristic feature points such that the 3D facial model of a non-neutral expression can be generated from the input video frame. Based on the arrangement of facial muscle fibers, our EBS model calculates elastic characteristics for each emotional face by modeling the facial muscle fiber as an elastic body. The affine elastic body coordinate transformation is fitted to the displacements of the facial expression under the continuity condition. The spline obtained by this method is mathematically identical to the one whose coefficients are computed directly from the original displacements of the control points, and the resulting spline is added to the initial mesh of the elastic body transformation to give the overall coordinate transformation. Simulation results show that the facial model generated by our method performs well given the control point positions.

C. EBS parameterizations

EBS is applied to generate different facial expressions from a neutral face with a generic facial model. By varying the positions of the control points, EBS mathematically describes the equilibrium displacement of the facial expressions subjected to muscular forces using a Navier partial differential equation (PDE). The deformable facial model equations can be expressed in 3D vector form with the interpolation spline relating the set of corresponding control points. The PDE of an elastic body is based on the notions of stress and strain: when a body is subject to an external force, internal forces are induced within the body which cause it to deform, and the integral of the surface forces and body forces must be zero [36]. Let x denote a set of feature points in the 3D facial model of the neutral face and y_i the corresponding control points with expressions. We then have the Navier equilibrium PDE

$$\mu \nabla^2 l(x) + (\lambda + \mu)\nabla[\nabla \cdot l(x)] + f(x) = 0 \qquad (1)$$

where l(x) is the displacement of each characteristic feature point of the facial model from its original position (neutral face), λ and μ are the Lamé coefficients which describe the physical properties of the face (μ is also referred to as the shear modulus), ∇² and ∇ denote the Laplacian and gradient operators, respectively, ∇·l(x) is the divergence of l(x), and f(x) is the muscular force field applied to the face. To find an appropriate physical property for an expressional model, the muscular forces are assumed to be distributed over the homogeneous isotropic elastic body of the facial model to obtain


smooth deformation. A polynomial, radially symmetric force is therefore considered:

$$f(x) = w\, d(x) \qquad (2)$$

where w = [w_1 w_2 w_3]^T is the strength of the force field and d(x) = (x_1^2 + x_2^2 + x_3^2)^{1/2}. The solution of the PDE (1) can then be computed as

$$l(x) = E(x)\, w \qquad (3)$$

with

$$E(x) = [\alpha\, d(x)^2 I_3 - 3\, x x^T]\, d(x) \qquad (4)$$

where α = (11μ + 5λ)/(λ + μ) is a constant determined by the Poisson's ratio ν = λ/(2(λ + μ)), I_3 is a 3 × 3 identity matrix, and x x^T is an outer product. The solution is obtained using the Galerkin vector method [36] to decouple the three coupled PDEs into three independent ones, and it can be verified by substituting (3) into (1). The EBS displacement L_EBS(x) is a linear combination of the PDE solutions in (3):

$$L_{EBS}(x) = \sum_{i=1}^{N} E(x - y_i)\, w_i + A x + B \qquad (5)$$

where A x + B is the affine portion of the EBS and A = [a_1 a_2 a_3]^T is a 3 × 3 matrix. The coefficients of the spline are determined from the control points y_i and the displacements of the feature points; the spline relaxes to an affine transformation as the distance from the control points approaches infinity. The summation in (5) can be expressed in matrix-vector form as

$$E_{EBS}(x) = H\, L_{EBS} \qquad (6)$$

where H is a (3N + 12) × (3N + 12) transfer function as described by Kuo [35], and E_EBS is a (3N + 12) × 1 vector containing all the EBS coefficients:

$$E_{EBS} = [\, w_1^T\; w_2^T\; \ldots\; w_N^T\; a_1^T\; a_2^T\; a_3^T\; b^T \,]^T \qquad (7)$$

In our system, the 26 control points and the displacements of the control point sets are obtained from the fiducial detection and tracking steps. We solve (6) under the requirement that the spline displacements equal the control point displacements, with a constant ν over the whole facial region. The flatness constraints, which are expressed in terms of second or higher order derivatives (e.g. ∂²/∂x_i², ∂²/∂x_i∂x_j), are set to zero, enforcing the conservation of linear and angular momenta for an equilibrium solution. These constraints cause the force field to be balanced so that the EBS facial model is stationary. The values of the spline at the 28 dependent points are computed from (5) with the spline coefficients E_EBS, the spline basis function H and the control point locations. The muscular force field f(x), given by (2), can be calculated from the EBS solution according to the displacements of the control points:

$$f(x) = [\, f_1\; f_2\; f_3 \,]^T = \sum_{i=1}^{N} f(x - y_i)\, w_i \qquad (8)$$

With different values of ν we obtain different force fields f(x). By the principle of superposition for an elastic body, the external forces must be minimized according to the roughness measurement constraints [35]. This ensures that the forces are optimally smooth yet sufficient to deform the elastic material so that the EBS reproduces the given displacements at the control point locations. By varying the value of ν in (4), we can calculate each corresponding muscular force field. Finding the minimum muscular force field |f(x)|_min yields the appropriate physical property ν and the associated EBS coefficients E_EBS. We then construct the deformable visual feature v for classification from ν and E_EBS. The algorithm for deformation feature extraction is summarized as follows (a code sketch of the parameter sweep is given after the list).

1) Initialize the feature point positions x in the 3D facial model of the neutral face according to the detection results for the 26 fiducial points.
2) Set ν = 0.01 for the facial region.
3) Update the corresponding control point positions y_i in the expressional facial model according to the tracking results.
4) Calculate the displacements l of the control point sets in the facial region.
5) Solve the EBS in (6) to obtain the associated spline coefficients E_EBS.
6) Compute the positions of the remaining (dependent) points in the facial region from the EBS solution of the previous step.
7) Calculate the muscular force field f(x) in (2) from the solution of the EBS.
8) Sweep ν over 0.02, 0.03, ..., 0.5 and repeat steps 5), 6) and 7) to obtain the new muscular force fields.
9) Find the minimum muscular force field |f(x)|_min, the corresponding ν and the EBS coefficients E_EBS.
10) Construct the deformable visual feature v for classification from E_EBS and ν.
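To make the parameter sweep in steps 2)-10) concrete, the Python sketch below assembles the EBS interpolation system, solves for the coefficients, and selects the Poisson's ratio that minimizes a simple force-field proxy. It is an illustrative simplification rather than the authors' implementation: the exact transfer matrix H and the roughness constraints of Kuo et al. [35] are reduced to a standard interpolation-plus-side-condition linear system, the force magnitude is approximated by the norm of the weights w_i, and the relation α = (11μ + 5λ)/(λ + μ) = 11 − 12ν is used to parameterize the basis directly by ν. Note that with N = 54 characteristic points the stacked coefficients plus ν give 3N + 12 + 1 = 175 entries, which matches the feature dimensionality quoted in Section IV.

```python
import numpy as np

def ebs_basis(r, alpha):
    """EBS basis of eq. (4): E(r) = [alpha*d^2*I - 3*r r^T] * d for a 3-vector r."""
    d = np.linalg.norm(r)
    return (alpha * d ** 2 * np.eye(3) - 3.0 * np.outer(r, r)) * d

def fit_ebs(neutral_pts, displaced_pts, nu):
    """Fit EBS coefficients (w_1..w_N, A, B) so the spline of eq. (5) maps the
    neutral control points onto the displaced ones (least-squares assembly)."""
    alpha = 11.0 - 12.0 * nu                 # assumed: alpha = (11*mu+5*lam)/(lam+mu)
    n = len(neutral_pts)
    disp = (displaced_pts - neutral_pts).reshape(-1)        # 3N target displacements
    G = np.zeros((3 * n, 3 * n + 12))        # rows: control points, cols: [w, A, B]
    for i, x in enumerate(neutral_pts):
        for j, y in enumerate(neutral_pts):
            G[3*i:3*i+3, 3*j:3*j+3] = ebs_basis(x - y, alpha)
        G[3*i:3*i+3, 3*n:3*n+9] = np.kron(np.eye(3), x)     # affine part A*x
        G[3*i:3*i+3, 3*n+9:] = np.eye(3)                    # translation B
    # Side conditions: sum_i w_i = 0 and sum_i y_i w_i^T = 0 (flatness constraints)
    C = np.zeros((12, 3 * n + 12))
    for j, y in enumerate(neutral_pts):
        C[0:3, 3*j:3*j+3] = np.eye(3)
        C[3:12, 3*j:3*j+3] = np.kron(y.reshape(3, 1), np.eye(3))
    A_sys = np.vstack([G, C])
    b_sys = np.concatenate([disp, np.zeros(12)])
    coeffs, *_ = np.linalg.lstsq(A_sys, b_sys, rcond=None)
    return coeffs

def force_magnitude(coeffs, n):
    """Proxy for |f(x)| in eq. (8): the norm of the per-point force weights w_i."""
    return np.linalg.norm(coeffs[:3 * n])

def extract_feature(neutral_pts, displaced_pts, nus=None):
    """Sweep nu (steps 2 and 8), keep the fit with the smallest force field
    (step 9), and stack the coefficients with nu to form the feature v (step 10)."""
    if nus is None:
        nus = np.arange(0.01, 0.51, 0.01)
    best = None
    for nu in nus:
        coeffs = fit_ebs(neutral_pts, displaced_pts, nu)
        f = force_magnitude(coeffs, len(neutral_pts))
        if best is None or f < best[0]:
            best = (f, nu, coeffs)
    _, nu_star, coeffs = best
    return np.concatenate([coeffs, [nu_star]])
```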

IV. D-ISOMAP BASED CLASSIFIER

Once the deformable facial features have been obtained with the EBS, we use an Isomap based method for emotion classification. Isomap was first proposed by Tenenbaum [37] and is one of the most popular manifold learning techniques for nonlinear dimensionality reduction. It attempts to learn complex embedding manifolds using local geometric metrics within a single global coordinate system. The Isomap algorithm uses geodesic distances between points instead of simple Euclidean distances, thus encoding the manifold structure of the input space into the distances. The geodesic distances are computed by constructing a sparse graph in which each node is connected only to its closest neighbors; the geodesic distance between each pair of nodes is taken to be the length of the shortest path in the graph that connects them. These approximated geodesic distances are then used as inputs to classical multidimensional scaling (MDS). Yang proposed a face recognition method based on Extended Isomap (EI) [38], in which the EI method was combined with a Fisher Linear Discriminant (FLD) algorithm. The main difference between EI and the original Isomap is that after a geodesic


distance is obtained, the EI algorithm uses FLD to achieve the low dimensional embedding, whereas the original Isomap algorithm uses MDS. Geng [39] proposed an improved version of Isomap to guide the procedure of nonlinear dimensionality reduction, in which the neighborhood graph of the input data is constructed according to a dissimilarity measure between data points specially designed to integrate the class information. The Isomap algorithm generally has three steps: construct a neighborhood graph, compute shortest paths, and construct the d-dimensional embedding. Classical MDS is applied to the matrix of graph distances to obtain a low-dimensional embedding of the data. However, since the original Isomap does not discriminate between data acquired from different classes, several isolated sub-graphs may result in undesirable embeddings when dealing with multi-class data. On the other hand, EI [38] uses the Euclidean distance to approximate the distance between the two nearest points of two classes. When the number of classes becomes large, the classes may construct their own spatially intrinsic structures, and neither EI nor the improved version can recover the intrinsic class structures of the high-dimensional data. To cope with such problems, in this paper we propose a D-Isomap based method for emotion classification. The discriminative information of the facial features [40] is considered so that they can properly represent the discriminative structure of the emotional space on the manifold. The proposed D-Isomap provides a simple way to obtain the low dimensional embedding and discovers the discriminative structure on the manifold; it has the capability of discovering nonlinear degrees of freedom and finding globally optimal solutions guaranteed to converge for each manifold [41]. There are two general approaches to build the final classifier using dynamic information from video sequences. One is to determine the dependencies based on the joint probability distribution among the score-level decisions; the other is based on the distribution of dynamic features, in which case the features can be discrete or continuous. Le et al. [42] proposed a 3D dynamic expression recognition method using spatiotemporal shape features, adopting HMMs for the final classification. Sandbach et al. [43] also proposed to recognize 3D facial expressions using HMMs with motion-based features. In this work, the final classifier is constructed based on dynamic feature level fusion. We update the facial expression model following the trajectory of the 54 characteristic feature points frame by frame, which explicitly describes the relationship between the motions of the facial feature points and the expression changes. The EBS model sequence v(t) is thus represented by a sequence of observations from the input video, where t is the time variable. Before the raw data samples in the datasets can be used for the training/testing of classification, it is necessary to normalize the sequences into the format required by the system. The frame rate is reduced to 10 fps and each sequence lasts 3 seconds in total, from a neutral face to the apex of one expression. Since the original displacement of v(t) in each frame depends on the individual, we use the length (distance between the Top point of the head

and the Tip point of the chin) and the width (distance between the Left point of the head and the Right point of the head) of the neutral face for scale normalization. We then normalize the feature matrix with the L2 norm to regulate the variances of the EBS coefficients and the constant ν. The EBS model sequence takes into account the temporal dynamics of the feature vectors, and labeled graph matching is then applied to determine the category of the sample video. The EBS feature v for each emotional facial model can be seen as one point in a high dimensional space; as we have 54 characteristic feature points in the 3D facial model, each EBS feature v has 175 dimensions. Given the variations of facial configurations during emotional expressions, these points can be embedded into a lower dimensional space. We define the facial EBS feature set V as the input data

$$V = \{v_t\} \in \mathbb{R}^{T \times M} \qquad (9)$$

where t = 1, ..., T indexes the input samples and M = 175 is the dimensionality of the original data. Let U denote the embedding of V into a low dimensional manifold with m dimensions, such that

$$U = \{u_t\} \in \mathbb{R}^{T \times m} \qquad (10)$$

which preserves the manifold's estimated intrinsic geometry. The D-Isomap provides a simple way to obtain the low dimensional embedding and also discovers the discriminative structure on the manifold [40], [41]. For the training data, we compute a discriminatively weighted Euclidean distance D between every pair of points (v_t, v_{t'}) from the input space V using a weight factor γ:

$$D(v_t, v_{t'}) = \begin{cases} \gamma\, \| v_t - v_{t'} \| & \text{if } Z(v_t) = Z(v_{t'}) \\ \| v_t - v_{t'} \| & \text{if } Z(v_t) \neq Z(v_{t'}) \end{cases} \qquad (11)$$

where Z(v_t) denotes the class label to which the input datum v_t belongs. For pairwise points with the same class label, the Euclidean distance is shortened by the weight factor γ; the compacting and expanding parameters are determined empirically for the discriminative matrix. This alleviates the problems reported in [44] when the dimensions of the scatter become very high in real data sets. A neighborhood graph G is then constructed according to the discriminative matrix: a point is defined as a neighbor of another point if it is one of its closest points or lies within a fixed radius of it. Neighboring points are connected by edges weighted by the distance between them, and the distance along a path is obtained by adding up the distances of its edges. We then calculate the geodesic distance matrix between all pairs of points by computing shortest paths in the neighborhood graph, iterating the update

$$D_G(v_t, v_{t'}) = \min\left( D_G(v_t, v_{t'}),\; D_G(v_t, v_k) + D_G(v_k, v_{t'}) \right) \qquad (12)$$

over all intermediate points v_k.
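As a concrete sketch of this construction, the snippet below builds the discriminatively weighted distance matrix of (11) and the geodesic distances of (12). Two details are assumptions of the sketch rather than statements from the paper: the weight factor γ is applied only by shrinking same-class distances, and the neighborhood graph is a k-nearest-neighbor graph. The names discriminative_geodesics, gamma and n_neighbors are illustrative.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform

def discriminative_geodesics(V, labels, gamma=0.25, n_neighbors=12):
    """Discriminatively weighted geodesic distances (eqs. (11)-(12), sketched).

    V      : (T, M) array of EBS features v_t.
    labels : (T,) array of class labels Z(v_t).
    gamma  : compacting factor applied to same-class pairs (assumed form).
    """
    D = squareform(pdist(V))                      # pairwise Euclidean distances
    same = labels[:, None] == labels[None, :]
    D = np.where(same, gamma * D, D)              # eq. (11): shrink same-class pairs
    T = len(V)
    W = np.full((T, T), np.inf)                   # non-edges marked as infinity
    for i in range(T):                            # keep the n_neighbors closest points
        idx = np.argsort(D[i])[1:n_neighbors + 1]
        W[i, idx] = D[i, idx]
    W = np.minimum(W, W.T)                        # symmetrize the neighborhood graph
    return shortest_path(W, method="D", directed=False)   # eq. (12): graph geodesics
```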

The embedding matrix D_m in the low dimensional space can be calculated by converting the distance matrix to inner products with a translation mapping [45]. Computing the largest eigenvalues and the top m eigenvectors of D_G, we obtain the


eigenvector matrix E ∈ R^{n×m} and the eigenvalue matrix M ∈ R^{m×m}. The embedding matrix in the low dimensional space is then calculated as

$$D_m = M^{1/2} E^T \qquad (13)$$
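Equation (13) is essentially the classical MDS step applied to the geodesic distances. A minimal sketch, assuming the translation mapping of [45] is the usual double-centering of the squared distances, is given below (m = 20 is the embedding dimension used in the experiments of Section V).

```python
import numpy as np

def embed(D_G, m=20):
    """Low dimensional embedding D_m = M^(1/2) E^T of eq. (13) via classical MDS.

    D_G : (T, T) geodesic distance matrix from eq. (12).
    m   : target dimensionality of the embedding.
    """
    T = D_G.shape[0]
    H = np.eye(T) - np.ones((T, T)) / T        # centering matrix H = I - (1/T) e e^T
    B = -0.5 * H @ (D_G ** 2) @ H              # distances converted to inner products
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:m]        # keep the top m eigenpairs
    M = np.diag(np.clip(eigvals[idx], 0.0, None))
    E = eigvecs[:, idx]
    return np.sqrt(M) @ E.T                    # (m, T) embedding matrix D_m
```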

We then use a Nearest Class Center (NCC) algorithm [46] to determine the emotion classes. The NCC algorithm is a center based method for data classification in a high dimensional space. Several classification methods could be considered for the final decision, such as nearest neighbors, k-means or EM algorithms. In nearest neighbor based classification, the representational capacity and the error rate depend on how the dataset is chosen to account for possible variations and on how many samples are available. The k-means method adjusts the center of a class based on the distances of its neighboring patterns. The EM algorithm is a special kind of quasi-Newton algorithm with a search direction having a positive projection on the gradient of the log likelihood; in each EM iteration, the estimation step evaluates a likelihood function which is then refined by a maximization step, and when the iteration converges it ideally yields the maximum likelihood estimate of the data distribution. A commonality among these methods is that they define a distance between the dataset and an individual class, and the classification is then determined by isolated points in the feature space. However, since the emotional features in our work are complex and not directly interpretable, a formal center for each emotion class may be difficult to determine or may be misplaced. In many cases, multiple clusters are present within one video sequence; this property can be utilized to improve the final decision but has been ignored by other methods. For this reason, we need a more efficient way to generalize the representational capacity, with a sufficiently large number of stored feature points to account for as many variations as possible. Unlike the alternatives, NCC considers the centers of the k clusters with known labels from the training data and generalizes the class center for each emotion group. The derived cluster centers exhibit more variation than the original input features and thus expand the capacity of the available data sets. The classification of the test data is based on the nearest distance to each class center. The NCC algorithm is applied for the classification of the input video based on the number of clusters k and the embedding matrix D_m. We assume that the clusters can be classified a priori through any viable means and are available within each video sequence, so the distance matrix makes use of the class information contained in the clusters of each class. A subspace is constructed from the entire feature space based on this prior knowledge and the within-class clusters are generalized to represent the variants of that emotion class, which increases the generalization ability of the classifier. Let c_k be a set of k cluster centers for the feature points belonging to a class. The k clusters determine the output class label of the input data. Each cluster approximates the sample population in the feature space for the samples that belong to it, and the statistics of every cluster are used to estimate the probability for the dataset. The probability distribution can be

calculated from the training data at this level. The centers of these clusters provide essential information for the discriminative subspace, since the clusters are formed according to the class labels of the emotions. We can simply enforce the mapping to be orthogonal, i.e., we can impose the condition

$$U U^T = I \qquad (14)$$

for the feature points on the projected set. In our case, a total of k cluster centers gives (k − 1) discriminative features which span a (k − 1) dimensional discriminative space. The cluster centers for a test sample can be calculated using the objective function

$$E(c_k) = \frac{1}{T}\, c_k c_k^T \qquad (15)$$

A dense matrix h = e e^T, with e = [1, ..., 1]^T, is imposed on the distance matrix D_G to calculate the cluster centers from the training data. Since D_G is symmetric, we put the uniform weight 1/N on every pair of the full graph. Let p denote the number of samples in one cluster, l = 7 the size of the emotional label space, and U_t the t-th element of the embedded manifold matrix of a test sample from (10); the objective function then becomes

$$\{C_k\}_l = \frac{1}{p} \sum_{t=1}^{p} \frac{1}{2} \left( D_m U_t - H D_G H \right) \qquad (16)$$

where H is the centering matrix H = I − (1/N) e e^T. The labeled class centers {C_k}_l for the emotional space of a test video can be calculated from (16). Each data sample, together with its k clusters, lies on a local manifold. Since D-Isomap seeks to preserve the intrinsic geometric properties of the local neighborhoods, the input data is reconstructed by a linear combination of its nearest centers with labeled graph matching. For each category of facial expression, we calculate the average class center coordinates C̄_l from the training samples. Computing the class centers c_l for the test data using (16), we obtain the class label C* from the Euclidean distance to the nearest class center coordinates C̄_l:

$$C^{*} = \arg\min_{c_l} \| c_l - \bar{C}_l \| \qquad (17)$$

where c_l depends on D_m, U_t and γ through (16).
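A minimal sketch of the NCC decision rule (17) is shown below. For simplicity the labeled class centers are taken as per-class means of the embedded training samples rather than the cluster-based computation of (16); the function names are illustrative, and the embedded samples are assumed to be the columns of D_m (one row per sample after transposition).

```python
import numpy as np

def ncc_fit(U_train, labels):
    """Average class center coordinates per emotion (cf. eq. (16), simplified here
    to the mean of the embedded training samples of each class).
    U_train : (T, m) embedded training samples (rows of D_m transposed)."""
    classes = np.unique(labels)
    centers = np.stack([U_train[labels == c].mean(axis=0) for c in classes])
    return classes, centers

def ncc_predict(U_test, classes, centers):
    """Nearest Class Center decision of eq. (17): assign each embedded test sample
    to the class whose center is closest in Euclidean distance."""
    d = np.linalg.norm(U_test[:, None, :] - centers[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]
```

With the seven-class emotional space (neutral plus the six basic emotions), classes has l = 7 entries and each test sequence is labeled by its nearest center.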

V. EXPERIMENT AND RESULTS

To evaluate the performance of the proposed method, two facial expression video datasets are used in the experiments: the RML Emotion database and the Mind Reading DVD database. The RML Emotion database [15] was originally recorded for language and context independent emotion recognition with the six fundamental emotional states: happiness, sadness, anger, disgust, fear and surprise. It includes eight subjects in a nearly frontal view (2 Italian, 2 Chinese, 2 Pakistani, 1 Persian, and 1 Canadian) and 520 video sequences in total. Each video pictures a single emotional expression and ends at the apex of that expression, while the first frame of every video sequence shows a neutral face. Video sequences from neutral to the target display are digitized into 320 × 340 pixel


arrays with 24-bit color values. The Mind Reading DVD [47] is an interactive computer-based resource for facial emotional expressions, developed by Cohen and his team of psychologists. It consists of 2472 faces, 2472 voices and 2472 stories. Each video pictures the frontal face of a single actor (30 actors in total, of varying age ranges and ethnic origins) showing a single facial expression. All the videos are recorded at 30 frames per second, last between 5 and 8 seconds, and have a resolution of 320 × 240.

A. Facial region detection

The facial region is detected in the input video sequence using the face detection method with local normalization [25]. The normalized results of the original sequences show that the histograms of all input images are widely spread to cover the entire gray scale by local normalization, and the distribution of pixels is not far from uniform. As a result, dark images, bright images, and low contrast images are greatly enhanced to have an appearance of high contrast. The overall performance of the system is considerably improved by incorporating local normalization.

B. Fiducial point detection and tracking

The fiducial points are then detected [30] and tracked [31] automatically in the facial region. As the location of each fiducial point is at the center of a 16 × 16 pixel neighborhood window, and the feature vector for the point detectors is extracted from this region, we consider detected points displaced within five pixels from the corresponding ground truth facial points as successful detections. 180 videos of 6 subjects from the RML Emotion database and 240 videos of 20 subjects from the Mind Reading DVD database are selected for this experiment, constituting a total of 420 sequences of 26 subjects. We randomly divide the 420 sequences into training and testing subsets containing 210 sequences each. An overall recall of 92.45% and precision of 90.93% are achieved simultaneously in terms of false alarm rates. We also implement the AAM method mentioned in [27] for the detection and tracking of the 26 fiducial points, as shown in Fig. 3. The proposed method performs better in both efficiency and accuracy.

C. EBS based emotional facial modeling

In this section, we verify the performance of the EBS based method for emotional facial modeling on the aforementioned databases. The positions of the 26 fiducial points are obtained from the detection and tracking step and then used to calculate the positions of the 28 dependent points. These positions are 2D data in the video sequences and cannot be used directly for the 3D EBS analysis; all the fiducial points first need to be aligned to our 3D model. We use a flexible generic facial modeling (FGFM) algorithm [48] to fit each face image to the 3D mesh model. The geometric values used in the FGFM are obtained from the BU-3DFE database [49], which contains 2500 3D facial expression models of 100 subjects. We use the 3D facial expression model with the associated
Fig. 3. Detection and tracking results.

Fig. 4. Emotional EBS model construction: (a) anger faces, (b) disgust faces, (c) fear faces.

frontal-view texture image as ground truth data to train the 3D model. Initially, we define a face-centered coordinate system for the FGFM. All the 3D coordinates, the curvature parameters for every vertex generation function, the weights in the interrelationship functions and the statistical model ratios are recorded in an FGFM. A clustering process is used to construct accurate generic facial models from the training 3D data, and all the selected typical training examples are used to acquire the geometric values for each FGFM. The optimal geometric values of the FGFM result in full coincidence between


the superimpositions of the transformed FGFM and the facial contours of the training images. The geometric values of the FGFM are established using a profile matching technique between the silhouettes of the training images and the FGFM under known view directions. The reconstruction procedure can be regarded as a block function of the FGFM whose input parameters are the 3D face-centered coordinates of the control points. When the control points are accurately modified, the desired 3D facial model is determined based on the topological and geometric descriptions of the FGFM. To remove individual differences in the facial expressions, each face shape from the video sequences is normalized to the same scale. The 26 control points on the 3D facial model are initially estimated from the fiducial points using a back projection technique with a set of predefined unified depth values (see the sketch following this paragraph), and the original dependent points are also predefined in the model. Classified FGFM ratio features are selected with a minimal Euclidean distance between the estimated ratios and a codebook-like ratio database. The depth values of the control points and the curvature parameters for reconstructing the EBS facial model are obtained from the selected ratio feature classifier. Fig. 4 shows some representative sample results of emotional model construction. Our objective here is to find the positions of the dependent points after emotional facial deformation, given the fiducial point positions. The six basic emotions are analyzed in this experiment. The best-fit mesh model of a given face is estimated from the first input frame with neutral expression. Based on the known tracking information, the positions of all characteristic feature points are calculated and the EBS model is reconstructed for any particular expression. The experimental results show that our method provides good reconstruction following the variations of the control points.
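As referenced in the paragraph above, the following is a minimal sketch of the 2D-to-3D control point initialization. The normalization by face length and width reuses the landmarks named in Section IV, and the fixed per-point depth template stands in for the predefined unified depth values of the generic model; the function and parameter names are illustrative assumptions.

```python
import numpy as np

def init_control_points_3d(pts_2d, depth_template, head_top, chin, head_left, head_right):
    """Back-project tracked 2D fiducial points onto the generic 3D model:
    normalize the 2D coordinates by the face length/width, then attach a
    predefined (unified) depth value to each control point."""
    length = np.linalg.norm(head_top - chin)            # Top of head to Tip of chin
    width = np.linalg.norm(head_left - head_right)      # Left to Right of head
    center = 0.5 * (head_top + chin)
    xy = (pts_2d - center) / np.array([width, length])  # scale normalization
    z = np.asarray(depth_template)[:, None]             # predefined unified depths
    return np.hstack([xy, z])                           # (26, 3) control point coordinates
```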

We provide further experimental results in Fig. 5 to verify the consistency of the proposed method. Fig. 5 presents the emotional facial models constructed for different people. The Poisson's ratio ν is assumed to be constant over the whole facial region and is determined under the condition of minimum muscular force field generation; Fig. 5(a-d) shows the results when ν is obtained experimentally. Subjectively, the proposed method provides a good facial model across different people and expressions.

Fig. 5. EBS facial model constructions with different Poisson's ratios: (a) a male anger face, (b) a female sadness face, (c) a female anger face, (d) a male happiness face.

D. D-Isomap for final decision

In this section, 280 video sequences of eight subjects from the RML Emotion database and 420 video sequences of 12 subjects from the Mind Reading DVD database are selected for the evaluation of the D-Isomap based classifier, constituting a total of 700 sequences of 20 subjects with the six emotions and neutral faces. The facial EBS features are extracted to construct a 175 dimensional vector sequence, which is too large to manipulate directly. We use the D-Isomap algorithm for dimensionality reduction, as discussed in Section IV. Since each feature vector can be seen as one point in a 175 dimensional space, D-Isomap is utilized to find the embedding manifold in a low-dimensional space that represents the original data. These representations should cover most of the variance of the observations arising from the continuous variations of facial configurations. The low-dimensional structures are extracted to capture the manifold's estimated intrinsic geometry, exploiting D-Isomap's capability of nonlinear analysis and its convergence to globally optimal solutions. The geodesic distance graph from (12) is used for the D-Isomap based embedding. Fig. 6 shows examples of distance matrices with different discriminative weight factors for the seven emotional expressions of randomly selected subjects. The distance graph reflects the intrinsic similarity of the original expression data and is consequently considered for determining the true embedding. From Fig. 6 we can see that, by applying the weight factor, points from the same cluster are projected closer together in the low dimensional space, i.e. the distance is compacted; on the other hand, the distance between different clusters can be expanded by increasing γ.

Fig. 6. Distance matrix graphs with different weight factors: (a) γ = 0.1, (b) γ = 0.25, (c) γ = 0.5, (d) γ = 0.75; higher values are shown in red, lower values in blue.

Increasing the dimension of the embedding space, we can calculate the residual variance for the original data. The true


dimension of the data can be found by considering the decreasing trend of the residual value. The embedding results using Isomap and the proposed D-Isomap with different k are presented in Fig. 7, which shows the results when k is set to 7, 12, and 20, respectively. From the results we see that our proposed method achieves an average improvement of 10% compared with the original Isomap. The best performance is obtained when k is 12 and the dimension of the embedded space is reduced to 20, which covers more than 95% of the variance of the observations from the input data. Therefore, these 20 dimensional components are used here to represent the facial expressions in the input videos.

Fig. 7. Dimensionality reduction using Isomap and D-Isomap: (a), (c), (e) show the results using Isomap with cluster numbers k = 7, 12, and 20, respectively; (b), (d), (f) show the corresponding results using D-Isomap.

We also provide expressional configurations to show the apparent emotional variation in Fig. 8. For each video sequence from the database, we construct 10 sub-clips of samples with different frames from neutral to the apex, which improves the representational capacity by storing a sufficiently large number of feature points to account for as many variations in the original data as possible. To show the apparent emotional variations, we provide the expressional configurations based on different numbers of samples: in Fig. 8, (a) shows the result using 700 samples with one sample per video, and (b) uses 10 samples per video, 7000 samples in total. From the results we can see that the EBS model sequences are embedded into a discriminative structure on the low dimensional feature space. By applying the NCC algorithm to the embedding results from the D-Isomap using (17), we can determine the emotion class for a test video. We label the emotion class centers on the embedded feature space, as shown in Fig. 8.

Fig. 8. Labeled class centers in a 2D space based on the embedding results: (a) results using 700 samples, (b) results using 7000 samples.

To evaluate the performance of the proposed method, we divide the 700 sequences into five subsets of 140 sequences each. Each time, one of the five subsets is used as the testing set and the other four subsets are used as the training set; the evaluation procedure is repeated until every subset has been used for testing. A test video sequence is treated as a unit and labeled with a single expression category. The recognition accuracy is calculated as the ratio of the number of correctly classified videos to the total number of videos in the data set. Using the proposed classifier, we achieve an overall accuracy of 88.2%. The confusion matrix for emotion recognition, with numbers representing the percentage correct, is listed in Table I, and a sketch of the five-fold evaluation protocol is given below. From the results we can see that features representing different expressions exhibit great diversity, since the distances between different emotions are relatively high; on the other hand, the same expressions collected from different subjects are very similar, due to the short distances within the same class.
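The five-fold protocol described above can be sketched as follows; classify stands in for the full D-Isomap plus NCC pipeline (e.g. composed from the earlier sketches), and the helper names are illustrative rather than taken from the paper.

```python
import numpy as np

def five_fold_accuracy(labels, classify, n_folds=5, seed=0):
    """Five-fold evaluation of Section V-D: split the sequences into five subsets,
    test on each subset in turn, and report the overall accuracy and a per-class
    confusion matrix (rows = desired class, columns = detected class).

    labels   : (N,) array of ground-truth emotion labels, one per video.
    classify : callable(train_idx, test_idx) -> predicted labels for test_idx.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(labels)), n_folds)
    classes = np.unique(labels)
    confusion = np.zeros((len(classes), len(classes)))
    correct = 0
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        pred = np.asarray(classify(train_idx, test_idx))
        correct += int(np.sum(pred == labels[test_idx]))
        for t, p in zip(labels[test_idx], pred):
            confusion[np.searchsorted(classes, t), np.searchsorted(classes, p)] += 1
    confusion /= confusion.sum(axis=1, keepdims=True)  # row-normalize (x100 for percent)
    return correct / len(labels), confusion
```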


TABLE I
CONFUSION MATRIX OF EMOTION RECOGNITION USING THE PROPOSED METHOD (%).

Desired \ Detected   Neu.  Ang.  Dis.  Fear  Hap.  Sad.  Sur.
Neutral               89     0     2     1     1     4     3
Angry                  0    92     2     3     0     1     2
Disgust                3     2    83     5     2     2     3
Fear                   2     3     3    87     0     3     2
Happiness              0     0     2     1    94     2     1
Sadness                2     1     2     3     1    89     2
Surprise               3     3     4     1     2     3    84

Fig. 9. Recognition results of different classifiers.

TABLE II
RECOGNITION RESULTS FOR VIDEOS WITH COMPLEX CONDITIONS.

                Ang.  Dis.  Fear  Hap.  Sad.  Sur.  Overall
No. of Videos    13    11     7    14    10     6      61
Proposed         12     8     5    12     8     4    80.3%
[13]              9     6     3     8     6     3    57.3%
[15]             10     5     5     9     7     3    63.9%

In order to investigate the impact of the tracking accuracy on the final classification, we evaluated the system performance using a set of ground truth data from our previous work [31]. The fiducial points are located manually in the facial region of the input video sequences, and the results are used to generate the 3D EBS features for the final decision. Using the proposed D-Isomap classification algorithm, we achieve a 94.7% average recognition rate for the seven emotional facial expressions with the manually selected fiducial points. Compared with manual selection, the automatic method yields lower recognition accuracy but requires much less computing cost. The decrease in accuracy may be due to the variation allowed by the SUCCESS criterion in the detection and tracking step, since we accept a detected point whose bias from the true point is within 10% of the inter-ocular distance. To show the effectiveness of the proposed method over 2D-based approaches, we select 61 video sequences with pose variations and complex illumination conditions from commercial films and videos available on the Web, and label the selected clips to create the ground truth data. Table II shows a summary of our results along with those of 2D-based approaches on the same test sets; the results of the other methods are obtained with Anderson's [13] and Wang's [15] approaches. As is evident from these results, our method achieves the best overall performance with an 80.3% recognition rate, in contrast to the 57.3% and 63.9% overall recognition rates of the other 2D-based approaches. In a natural situation, the decrease in recognition rate can be attributed to the inherent difficulties of out-of-plane head motion, widely varied lighting conditions, and changes of scale. However, unlike 2D-based approaches, the proposed 3D-based method copes with the above-mentioned variabilities with a smaller performance decrease. We conduct extensive experiments using different

We also conduct extensive experiments with different classification schemes, i.e., PCA, the Gaussian Mixture Model (GMM), a neural network (NN), and Fisher's Linear Discriminant Analysis (FLDA). In PCA, the dimensionality is reduced by discarding the less significant features associated with smaller eigenvalues in the transformed domain, and K-nearest neighbors is then used for classification. Compared with PCA, D-Isomap generates a smaller number of clusters in the embedded space, so the intrinsic structures of the original data are recovered at a lower computing cost. The GMM classifier is implemented in a modular architecture: a separate GMM is trained for each class, and the parameters of each component, including the weights, means, and standard deviations, are estimated with the Expectation-Maximization (EM) algorithm. In our experiments, we try a range of values of k, so that the data distribution is modeled as the sum of k Gaussian components. For the NN classifier, a three-layer feed-forward neural network is investigated. The number of input-layer neurons equals the dimension of the input feature set, while the output neurons correspond to the six emotion classes. The back-propagation algorithm is used to train the network, and a new input is labeled with the class that produces the maximum output value. The FLDA classifier likewise has six outputs corresponding to the six emotions, and an input is labeled with the class that gives the maximum output. The experimental results of the performance comparison on the same data set are plotted in Fig. 9, which shows that our proposed method achieves the best results for the final emotion recognition.
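The modular GMM baseline described above can be sketched as follows: one mixture per emotion class is fitted with EM, and a test sample is assigned to the class whose mixture gives the highest likelihood. scikit-learn is used purely for illustration; the number of components and the covariance form are assumptions, not the values used in the experiments.

```python
# Illustrative sketch of a modular GMM classifier: one Gaussian mixture per
# emotion class, fitted with EM; a sample is labeled with the class whose
# mixture yields the highest log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_modular_gmm(features, labels, n_components=3, seed=0):
    models = {}
    for c in np.unique(labels):
        gmm = GaussianMixture(n_components=n_components, covariance_type='diag',
                              random_state=seed)
        models[c] = gmm.fit(features[labels == c])   # EM on samples of one class only
    return models

def predict_modular_gmm(models, features):
    classes = sorted(models)
    scores = np.stack([models[c].score_samples(features) for c in classes], axis=1)
    return np.asarray(classes)[np.argmax(scores, axis=1)]
```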
TABLE III COMPARISON BETWEEN DIFFERENT ISOMAP METHODS.

Method      Par.            Dimensions   S. D.   R. R.
OI          k = 12              40       16.5%   67.2%
EI          kw = 5              35       10.2%   78.4%
SI          k = 10              25        8.3%   85.1%
CKNNRCA     k = 20, c = 3       30        9.2%   81.6%
D-Isomap    k = 12              20        7.6%   88.2%

We also compare the proposed D-Isomap classifier with other Isomap classifiers, i.e., the original Isomap (OI) [44], EI [38], supervised Isomap (SI) [39], and CKNNRCA [40]. The parameters of each method are determined empirically to achieve its lowest error rate: in OI the value of k is 12, and in EI the number of nearest neighbors kw used in the within-class matrix is 5. The test results are summarized in Table III, which reports the parameter setting (Par.), the reduced dimension of the embedding space (Dimensions), the standard deviation (S. D.), and the average recognition rate (R. R.) for each method.
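For reference, a plain Isomap embedding followed by a nearest-class-center decision can be assembled as below, using the k = 12, 20-dimension setting from the table. Note that this corresponds to the original, unsupervised Isomap, not the proposed D-Isomap, which additionally uses label information to rescale pairwise distances.

```python
# Baseline pipeline behind the Table III comparison: an (unsupervised) Isomap
# embedding followed by a nearest-class-center decision. This is standard
# scikit-learn Isomap, not the discriminative D-Isomap proposed in the paper.
from sklearn.manifold import Isomap
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import make_pipeline

def isomap_ncc(n_neighbors=12, n_components=20):
    return make_pipeline(Isomap(n_neighbors=n_neighbors, n_components=n_components),
                         NearestCentroid())

# Usage sketch:
#   clf = isomap_ncc()
#   clf.fit(train_features, train_labels)
#   accuracy = clf.score(test_features, test_labels)
```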


Table III indicates that the proposed algorithm achieves better performance than OI and EI. D-Isomap outperforms the other methods because it compacts data points of the same cluster on the high-dimensional manifold so that they lie closer together in the low-dimensional space, while pushing data points of different clusters farther apart. This ability is beneficial for preserving the homogeneous characteristics needed for emotion classification. To demonstrate the discriminative embedding performance of the proposed D-Isomap, we also conducted experiments with state-of-the-art manifold learning methods, i.e., localized LDA (LLDA), the discriminative version of LLE (DLLE), and the Laplacian Eigenmap (LE). LLDA [50] is based on local estimates of the model parameters that approximate the non-linearity of the manifold structure with piece-wise linear discriminating subspaces; a local neighborhood size k = 30 and a subspace dimensionality d = 32 are selected to compute the local metrics. DLLE [51] preserves the local geometric properties within each class according to the LLE criterion, while the separability between different classes is enforced by maximizing the margins between point pairs from different classes; a balance term h = 1, k1 = 1 nearest neighbors, and k2 = 100 smallest distances are used for classification with the closest centroid. LE [52] uses an incremental Laplacian Eigenmap to reduce the dimension and extract features from the data points. Drawing on the correspondence between the graph Laplacian, the Laplace-Beltrami operator on the manifold, and the heat equation, this geometrically motivated algorithm represents the high-dimensional data with locality-preserving properties and a natural connection to clustering. These experiments are conducted with a compression dimension of 50.
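As a rough stand-in for the LE baseline, a graph-Laplacian embedding of the pooled feature set can be obtained with scikit-learn's SpectralEmbedding. The neighborhood size is an assumption, and since this embedding offers no out-of-sample mapping, train and test samples would have to be embedded together before the nearest-neighbor step.

```python
# Rough stand-in for the Laplacian Eigenmap (LE) baseline: a graph-Laplacian
# based embedding of the pooled feature set. The dimension of 50 mirrors the
# compression dimension quoted in the text; n_neighbors is an assumption.
from sklearn.manifold import SpectralEmbedding

def laplacian_embedding(features, n_components=50, n_neighbors=10):
    le = SpectralEmbedding(n_components=n_components,
                           affinity='nearest_neighbors',
                           n_neighbors=n_neighbors)
    return le.fit_transform(features)
```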
TABLE IV RECOGNITION RATE OF DIFFERENT MANIFOLD LEARNING METHODS.

Method      Dimensions   Recognition Rate
LLDA            32            80.5%
DLLE            40            85.3%
LE              25            84.7%
D-Isomap        20            88.2%

In all these experiments, the final classification after dimensionality reduction is determined by the nearest neighbor criterion. Table IV shows the experimental results of the different algorithms and demonstrates the greater effectiveness of D-Isomap for both feature reduction and final recognition, since it takes both the label information and the local manifold structure into account. When there are multiple classes and the data distribution is complex, the proposed D-Isomap exploits the weight factor to push data with different labels farther apart and to cluster data with the same label closer together, and thus achieves a better recognition rate.

VI. CONCLUSIONS

In this paper we present an automatic emotion recognition method from video sequences using the 3D active deformable facial model, which exploits both temporal and detailed spatial deformation information.

From the experimental results we find that the significant features for distinguishing one individual emotion from the others differ from emotion to emotion. Some of the features selected in a global scenario are redundant, while other features contribute mainly to the classification of a specific emotion. Another observation is that no single feature is significant for all the classes. This reflects the nature of human emotion: there are no sharp boundaries between emotional states, and any single emotion may share similar patterns with some, but not all, other emotions. Human perception of emotion is based on the integration of different patterns.
In the emotion recognition field, current techniques for the detection and tracking of facial expressions are sensitive to head pose, clutter, and variations in lighting conditions, and few approaches to automatic facial expression analysis are based on deformable or dynamic 3D facial models. The proposed system addresses these problems by using a generic 3D mesh model with D-Isomap classification. The facial region and fiducial points are detected and tracked automatically in the input video frames, the generic facial mesh model is then used for EBS feature extraction, and D-Isomap based classification is applied for the final decision. The merits of this work are summarized as follows.
- Facial expressions are detected and tracked automatically in the video sequences, which alleviates a common problem of conventional detection and tracking methods, namely inconsistent performance caused by sensitivity to illumination variations such as local shadowing, noise, and occlusion.
- We model the face as an elastic body that exhibits different elastic characteristics for different facial expressions. Based on the continuity condition, the elastic property of each facial expression is found, and a complete wireframe facial model can be generated from a limited number of available feature point positions.
- An adaptive partition of polygons is embedded in the EBS according to the surface curvature through the characteristic feature points, so the subtle structural information can be expressed without requiring complicated facial features. The generic 3D facial model is established so that suitable EBS parameters, e.g., the appropriate physical characteristics for face deformations and the control points, can be used for emotion recognition.
- We propose the use of D-Isomap for emotion recognition. It compacts the data points of the same emotion class on the high-dimensional manifold so that they lie closer in the low-dimensional space, and pushes the data points of different clusters farther apart, which yields a higher recognition rate than the other Isomap methods.
Experimental results and comparisons with several other algorithms demonstrate the effectiveness of the proposed method.

REFERENCES
[1] R.A. Calvo and S. D'Mello, Affect Detection: An Interdisciplinary Review of Models, Methods, and Their Applications, IEEE Transactions on Affective Computing, Vol.1 (1), pp. 18-34, June 2010.


[2] N. Sebe, H. Aghajan, T. Huang, N.M. Thalmann, and C. Shan, Special Issue on Multimodal Affective Interaction, IEEE Transactions on Multimedia, Vol.12 (6), pp. 477-480, 2010.
[3] P. Ekman, T. Dalgleish, and M.E. Power, Basic emotions, Handbook of Cognition and Emotion, Wiley, Chichester, U.K., 1999.
[4] C. Darwin, The Expression of Emotions in Man and Animals, John Murray, 1872, reprinted by University of Chicago Press, 1965.
[5] J.F. Cohn, Advances in Behavioral Science Using Automated Facial Image Analysis and Synthesis, IEEE Signal Processing Magazine, Vol.27 (6), pp. 128-133, 2010.
[6] K.I. Chang, K.W. Bowyer, and P.J. Flynn, Multiple Nose Region Matching for 3D Face Recognition under Varying Facial Expression, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.28 (10), pp. 1695-1700, October 2006.
[7] P. Ekman, W.V. Friesen, and J.C. Hager, The Facial Action Coding System: A Technique for the Measurement of Facial Movement, San Francisco, Consulting Psychologist, 2002.
[8] M.J. Lyons, J. Budynek, A. Plante, and S. Akamatsu, Classifying facial attributes using a 2-D Gabor wavelet representation and discriminant analysis, Proceedings of the 4th International Conference on Automatic Face and Gesture Recognition, pp. 202-207, March 2000.
[9] L.D. Silva and S.C. Hui, Real-time facial feature extraction and emotion recognition, Proceedings of the 4th International Conference on Information, Communications and Signal Processing, Vol.3, pp. 1310-1314, Singapore, December 2003.
[10] I. Cohen, N. Sebe, Y. Sun, M.S. Lew, and T.S. Huang, Evaluation of expression recognition techniques, Proceedings of the International Conference on Image and Video Retrieval, pp. 184-195, IL, USA, July 2003.
[11] G. Guo and C.R. Dyer, Learning from examples in the small sample case: face expression recognition, IEEE Transactions on Systems, Man, and Cybernetics, Part B, Vol.35 (3), pp. 477-488, June 2005.
[12] M. Pantic and I. Patras, Dynamics of facial expression: recognition of facial actions and their temporal segments from face profile image sequences, IEEE Transactions on Systems, Man, and Cybernetics, Part B, Vol.36 (2), pp. 433-449, April 2006.
[13] K. Anderson and P.W. McOwan, A real-time automated system for the recognition of human facial expressions, IEEE Transactions on Systems, Man, and Cybernetics, Part B, Vol.36 (1), pp. 96-105, February 2006.
[14] H. Gunes and M. Piccardi, Automatic Temporal Segment Detection and Affect Recognition from Face and Body Display, IEEE Transactions on Systems, Man, and Cybernetics, Part B, Vol.39 (1), pp. 64-84, February 2009.
[15] Y. Wang and L. Guan, Recognizing Human Emotional State from Audiovisual Signals, IEEE Transactions on Multimedia, Vol.10 (5), pp. 659-668, August 2008.
[16] Z. Zeng, M. Pantic, G.I. Roisman, and T.S. Huang, A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.31 (1), pp. 39-58, January 2009.
[17] M. Song, Z. Dong, C. Theobalt, H.Q. Wang, Z.C. Liu, and H.P. Seidel, A General Framework for Efficient 2D and 3D Facial Expression Analogy, IEEE Transactions on Multimedia, Vol.9 (7), pp. 1384-1395, November 2007.
[18] T. Yun and L. Guan, Human Emotion Recognition Using Real 3D Visual Features from Gabor Library, IEEE International Workshop on Multimedia Signal Processing, pp. 481-486, Saint Malo, October 2010.
[19] J. Wang, L. Yin, X. Wei, and Y. Sun, 3D facial expression recognition based on primitive surface feature distribution, IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1399-1406, New York, June 2006.
[20] H. Soyel and H. Demirel, Facial expression recognition using 3D facial feature distances, International Conference on Image Analysis and Recognition, Vol.4633, pp. 831-838, Montreal, August 2007.
[21] H. Soyel and H. Demirel, Optimal feature selection for 3D facial expression recognition using coarse-to-fine classification, Turkish Journal of Electrical Engineering and Computer Sciences, Vol.18 (6), pp. 1031-1040, 2010.
[22] H. Tang and T.S. Huang, 3D facial expression recognition based on automatically selected features, IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 1-8, Anchorage, June 2008.
[23] H. Tang and T.S. Huang, 3D facial expression recognition based on properties of line segments connecting facial feature points, IEEE International Conference on Automatic Face and Gesture Recognition, pp. 1-6, Amsterdam, The Netherlands, 2008.
[24] Y. Sun and L. Yin, Facial expression recognition based on 3D dynamic range model sequences, Computer Vision - ECCV 2008, pp. 58-71, 2008.

[25] T. Yun and L. Guan, Automatic face detection in video sequences using local normalization and optimal adaptive correlation techniques, Pattern Recognition, Vol.42 (9), pp. 1859-1868, September 2009.
[26] P. Viola and M. Jones, Robust Real Time Object Detection, Proceedings of the 2nd International Workshop on Statistical and Computational Theories of Vision, Vancouver, Canada, July 2001.
[27] J. Xiao, S. Baker, I. Matthews, and T. Kanade, Real-time combined 2D+3D active appearance models, Computer Vision and Pattern Recognition Conference, Vol.2, pp. 535-542, July 2004.
[28] R. Gross, I. Matthews, and S. Baker, Constructing and Fitting Active Appearance Models With Occlusion, IEEE Workshop on Face Processing in Video, pp. 72, 2004.
[29] D. Vukadinovic and M. Pantic, Fully Automatic Facial Feature Point Detection Using Gabor Feature Based Boosted Classifiers, IEEE International Conference on Systems, Man and Cybernetics, Waikoloa, Vol.2, pp. 1692-1698, October 2005.
[30] T. Yun and L. Guan, Automatic fiducial points detection for facial expressions using scale invariant feature, IEEE International Workshop on Multimedia Signal Processing, pp. 1-6, Rio de Janeiro, Brazil, October 2009.
[31] T. Yun and L. Guan, Fiducial Point Tracking for Facial Expression Using Multiple Particle Filters with Kernel Correlation Analysis, IEEE International Conference on Image Processing, pp. 373-376, Hong Kong, September 2010.
[32] M.H. Davis, A. Khotanzad, D.P. Flamig, and S.E. Harms, A physics-based coordinate transformation for 3-D image matching, IEEE Transactions on Medical Imaging, Vol.16 (3), pp. 317-328, June 1997.
[33] M.H. Davis, A. Khotanzad, D.P. Flamig, and S.E. Harms, Elastic body splines: a physics based approach to coordinate transformation in medical image matching, IEEE Symposium on Computer-Based Medical Systems, pp. 81-88, 1995.
[34] A. Hassanien and M. Nakajima, Image morphing of facial images transformation based on Navier elastic body splines, Computer Animation, pp. 119-125, 1998.
[35] C.J. Kuo, J. Hung, M. Tsai, and P. Shih, Elastic Body Spline Technique for Feature Point Generation and Face Modeling, IEEE Transactions on Image Processing, Vol.14 (12), pp. 2159-2166, December 2005.
[36] P.C. Chou and N.J. Pagano, Elasticity: Tensor, Dyadic, and Engineering Approaches, Dover, New York, 1992.
[37] J.B. Tenenbaum, V. Silva, and J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science, Vol.290, pp. 2319-2323, December 2000.
[38] M.H. Yang, Face recognition using extended Isomap, International Conference on Image Processing, Vol.2, pp. 117-120, New York, September 2002.
[39] X. Geng, D.C. Zhan, and Z.H. Zhou, Supervised Nonlinear Dimensionality Reduction for Visualization and Classification, IEEE Transactions on Systems, Man and Cybernetics, Part B, Vol.35 (6), pp. 1098-1107, December 2005.
[40] Y. Wu, K.L. Chan, and L. Wang, Face recognition based on discriminative manifold learning, International Conference on Pattern Recognition, Vol.4, pp. 171-174, Cambridge, UK, September 2004.
[41] D. Zhao and L. Yang, Incremental Isometric Embedding of High-Dimensional Data Using Connected Neighborhood Graphs, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.31, pp. 86-98, January 2009.
[42] V. Le, H. Tang, and T.S. Huang, Expression recognition from 3D dynamic faces using robust spatio-temporal shape features, Automatic Face Gesture Recognition and Workshops, pp. 414-421, 2011.
[43] G. Sandbach, S. Zafeiriou, M. Pantic, and D. Rueckert, A dynamic approach to the recognition of 3D facial expressions and their temporal models, Automatic Face Gesture Recognition and Workshops, pp. 406-413, 2011.
[44] T. Friedrich, Nonlinear Dimensionality Reduction with Locally Linear Embedding and Isomap, University of Sheffield, 2002.
[45] E. Kokiopoulou and Y. Saad, Orthogonal Neighborhood Preserving Projections: A Projection-Based Dimensionality Reduction Technique, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.29, pp. 2143-2156, December 2007.
[46] J. Handl and J. Knowles, An Evolutionary Approach to Multiobjective Clustering, IEEE Transactions on Evolutionary Computation, Vol.11 (1), pp. 56-76, February 2007.
[47] S.B. Cohen, Mind Reading: The Interactive Guide to Emotions, London, Jessica Kingsley, 2004.
[48] S.Y. Ho and H.L. Huang, Facial Modeling from an Uncalibrated Face Image Using Flexible Generic Parameterized Facial Models, IEEE Transactions on Systems, Man and Cybernetics, Part B, Vol.31 (8), October 2005.


[49] L. Yin, X. Wei, Y. Sun, J. Wang, and M.J. Rosato, A 3D Facial Expression Database for Facial Behavior Research, Automatic Face and Gesture Recognition, pp. 211-216, 2006.
[50] L. Zhu, F. Yun, Y. Junsong, T.S. Huang, and W. Ying, Query Driven Localized Linear Discriminant Models for Head Pose Estimation, IEEE International Conference on Multimedia and Expo, pp. 1810-1813, July 2007.
[51] X. Li, S. Lin, S. Yan, and D. Xu, Discriminant Locally Linear Embedding With High-Order Tensor Data, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol.38 (2), pp. 342-352, April 2008.
[52] W. Luo, Face recognition based on Laplacian Eigenmaps, International Conference on Computer Science and Service System, pp. 416-419, June 2011.

Yun Tie (S'07) received his B.Sc. degree from Nanjing University of Science and Technology, China, M.A.Sc. degree in Computer Science from Kwangju Institute of Science and Technology (KJIST), Korea, and Ph.D. degree from Ryerson University, Canada. He is currently a Post-Doctoral Fellow in the Ryerson Multimedia Lab at Ryerson University, Toronto, Canada. His research interests include image/video processing, pattern recognition, 3D data modeling, and intelligent classification and their applications.

Ling Guan (S'88-M'90-SM'96-F'08) received his B.Sc. degree in Electronic Engineering from Tianjin University, China, M.A.Sc. degree in Systems Design Engineering from the University of Waterloo, Canada, and Ph.D. degree in Electrical Engineering from the University of British Columbia, Canada. He is currently a professor and a Tier I Canada Research Chair in the Department of Electrical and Computer Engineering at Ryerson University, Toronto, Canada. He has also held visiting positions at British Telecom (1994), Tokyo Institute of Technology (1999), Princeton University (2000), Hong Kong Polytechnic University (2008) and Microsoft Research Asia (2002, 2009). He has published extensively in multimedia processing and communications, human-centered computing, pattern analysis and machine intelligence, and adaptive image and signal processing. He is a recipient of the 2005 IEEE Transactions on Circuits and Systems for Video Technology Best Paper Award and an IEEE Circuits and Systems Society Distinguished Lecturer.

