
Human Ear Surface Reconstruction Through Morphable Model Deformation
Salah Eddine KABBOUR1 Pierre-Yves RICHARD2

*CentraleSupelec
1 salah7ddine@gmail.com
2 pierre-yves.richard@centralesupelec.fr

Abstract: In this paper, a novel, fully automated method is developed to acquire an accurate 3D surface reconstruction of the human ear by using multi-view stereo vision and a morphable model without texture. As the results show, our method outperforms state-of-the-art approaches.
Our method uses a template to estimate the pose and orientation of the camera without relying on correspondences. After the dense reconstruction is done, the ear morphable model is fitted to the resulting point cloud by minimizing the distance between them. The shape of the model can be transformed as desired through its coefficients, and the fit uses only shape, without relying on texture, to make the coefficients converge.

I. INTRODUCTION

The 3D reconstruction of the human ear has been the subject of various studies for some time, but interest has peaked in recent years owing to the ever-growing number of applications using this technology. First of all, current human face recognition algorithms are not fail-proof and are very limited when the face is occluded or partially invisible in the image, so coupling them with ear recognition is a preferred approach. As Pflug et al. [1] showed in their survey, 3D-based methods outperform the others in human identification via ear shape. Even in the absence of the human face, ear biometrics alone are capable of accurately identifying people. This was first demonstrated by the results of Alfred Iannarelli [2], who conducted his experiments on a large ear database; his success led to his work being used in criminal investigation.

Another emerging field of study that uses the human 3D ear model is three-dimensional sound. This is possible because of the way sound waves are deformed by the ear shape. These changes are described by the Head-Related Transfer Function (HRTF), which is unique to each person. To find the HRTF of a specific person, we can therefore simulate the deformation of the sound waves once we have the 3D shape of the ear; this has been shown to be possible in [3] [4].

Numerous studies focus on the problem of human ear reconstruction. The approach of Cadavid et al. [5] consists of taking several photos (or video frames) of the person's ear and then applying different shape-from-shading (SFS) techniques to each photo. This method is prone to considerable inaccuracy in determining the 3D form, which is why they combine all the results by selecting the one with the greatest similarity to the rest of the 3D models. The method requires fixed illumination, and the slightest variation in brightness threatens its success.

To address the low texture of the human ear, Liu et al. [6] use an ear sampling device consisting of a mini-room in the shape of a half cylinder with fixed illumination, where people put their ear in a hole and several photos are then taken. This device allows precise ear segmentation and feature extraction. They also use Harris corner detection paired with RANSAC for filtering outliers, and a semi-automatic approach for further correspondence detection.

Zeng et al. [7] also use a device for photo acquisition; in this case, binocular stereo cameras are used to obtain 3D ear points. From only two photos, SIFT [8] matching combined with knowledge of the epipolar constraints is used to produce sparse correspondences. Then, a match propagation algorithm based on ZNCC is used to produce a semi-dense result.

In our work, we focus on an approach that can be used by the general public without the need to purchase an expensive device. We also go beyond dense points: our results take the form of accurate meshes representing the ear, which are much more useful in simulations.

II. MARKER ASSISTED MULTIVIEW STRUCTURE

Reconstructing the 3D form of an object is not cutting-edge science, as numerous methods exist for this purpose. However, the uniform texture of the ear makes it hard for traditional systems such as Bundler with SIFT [9] to give an accurate result; this is one of the reasons why state-of-the-art methods use specialized devices.
Our proposed method uses a template that can be printed by anyone and a smartphone camera. The template enables accurate photo acquisition, which allows precise estimation of the camera positions and orientations relative to the ear position.

A. The Template

The template consists of 36 AprilTags [10], which have been proven to give robust corner detection, sometimes even in tough illumination conditions. Each tag has a unique identifier and measures 1 centimetre in length and width. The template is designed so that the ear is placed at its center, meaning that even if some tags are not visible in an image, the camera still captures the whole ear (the template is shown in figure 1).
An adaptive threshold must be used for detection in order to mitigate the problems of illumination variation. Each tag detected in an image provides 4 corners, which adds up to at most 144 points per image.
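
Because every tag carries a unique identifier, a detected corner can be tied to its known position on the printed template by a simple table lookup. The following is a minimal sketch of such a table, not the layout used in the paper: the 6 x 6 uniform grid, the 2 cm pitch, the corner ordering and the names `tag_corners`, `build_corner_table` and `detected_ids` are all assumptions made for illustration (the real template leaves room for the ear at its center).

```python
import numpy as np

TAG_SIZE_CM = 1.0  # from the text: each tag measures 1 cm x 1 cm


def tag_corners(top_left_xy, size=TAG_SIZE_CM):
    """Four corners of one tag in the template plane (z = 0), in centimetres."""
    x, y = top_left_xy
    return np.array([[x, y, 0.0],
                     [x + size, y, 0.0],
                     [x + size, y + size, 0.0],
                     [x, y + size, 0.0]])


def build_corner_table(tag_origins):
    """Map each tag identifier to the (4, 3) array of its corner coordinates."""
    return {tag_id: tag_corners(xy) for tag_id, xy in enumerate(tag_origins)}


# Hypothetical layout: 6 x 6 tags on a uniform 2 cm pitch (1 cm tag + 1 cm gap).
PITCH_CM = 2.0
origins = [(col * PITCH_CM, row * PITCH_CM)
           for row in range(6) for col in range(6)]
corner_table = build_corner_table(origins)

# Tags seen in one image (hypothetical identifiers returned by a detector).
detected_ids = [0, 1, 7, 35]
object_points = np.vstack([corner_table[i] for i in detected_ids])

total_corners = sum(len(c) for c in corner_table.values())
print(total_corners)        # 144
print(object_points.shape)  # (16, 3)
```

With such a table, the 3D coordinates of every visible corner are available as soon as the tag identifiers are read, whichever tags happen to be hidden in a given photo.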

978-1-5386-6602-9/18/$31.00 ©2018 European Union


Fig. 1: (a) The template used to acquire ear images. Each tag measures 1 cm in width and height, and the spacing between the tags is equal to 1 cm and 0.5 cm. (b) Different samples of human ear photos taken while using the template.

B. 3D structure retrieval

Although we start with few corners (144 at most), these points are accurately detected, so there is no need for outlier-filtering techniques such as RANSAC. Also, using the tags makes it possible to find the position of each tag corner, as well as the camera rotation and translation, without searching for correspondences.
Each image is handled on its own. In more detail: we set the reference point of the 3D space at the first corner of the top-left tag in the template. The relation that describes the projection of each corner onto the camera is as follows (under the assumption of a pinhole camera, which holds for the majority of cell phone cameras):

\[ x_{ij} = K_j [R_j \mid t_j] X_i \tag{1} \]

where $x_{ij}$ are the homogeneous coordinates of the corner detected in the images ($i \in [1, N]$, where $N$ is the number of corners, $N = 144$ in our case, and $j \in [1, M]$, where $M$ is the number of images), $X_i$ are the homogeneous coordinates of the physical 3D point, $K_j$ is the intrinsic matrix, which can be estimated from the EXIF data or by calibrating the camera, and $R_j$ and $t_j$ are respectively the rotation and translation of the camera, which are left to be determined.
In equation 1 we know all the variables except $R_j$ and $t_j$. Estimating these two variables is called the PnP problem, and several methods exist to tackle it [11] [12] [13].
After solving the PnP problem for each image, we end up with $M$ camera positions and rotations. It is worth noting that $R_j$ and $t_j$ are expressed in the global coordinate system, because all of the 3D points $X_i$ are already fixed and the global scale does not change each time we solve the PnP problem. This approach allows quick estimation of the structure and camera parameters and can be done in real time.
For more precision, a common approach for this kind of problem is bundle adjustment, which consists of reducing the following error:

\[ \min_{f, r, t} \sum_{i=1}^{N} \sum_{j=1}^{M} \| \tilde{x}_{ij} - x_{ij} \|_2^2 \tag{2} \]

where $x_{ij}$ is the corner detected in each image and $\tilde{x}_{ij}$ is the reprojection of each 3D point back to the image:

\[ \tilde{x}_{ij} = K_j [R_j \mid t_j] X_i \tag{3} \]

This minimization can be achieved by nonlinear least-squares algorithms. The number of parameters to optimize can be set to fit the level of precision desired by the user; for example, the distortion parameters and the intrinsic properties of each camera can be optimized for more precision. We chose a simpler approach: our optimization works under the assumption that all views have the same intrinsic matrix, so we only optimize the focal length, rotation parameters and translation, which makes the algorithm simple to implement and able to run in real time.
Moreover, we transform the rotation matrix into its Rodrigues representation (the rotation is represented by an angle and an axis vector; it can be encoded by only three parameters by choosing a vector whose norm equals the angle of rotation). Given a rotation matrix $R$, its angle can be found by the relation $\theta = \cos^{-1}((\mathrm{tr}(R) - 1)/2)$ and its axis by solving $Ru = u$. Conversely, from an angle $\theta$ and a unit axis vector $r = [r_x, r_y, r_z]^T$, the rotation matrix can be found by:

\[ R = \cos(\theta) I + (1 - \cos(\theta)) r r^T + \sin(\theta) [r]_\times \tag{4} \]

where $[r]_\times$ is the cross-product (skew-symmetric) matrix of $r$. Each camera provides 7 parameters for optimization: 1 focal length, 3 rotation parameters and 3 translation parameters, for a total of $7 \times M$, which makes the optimization extremely fast and able to run in real time.

C. Dense Reconstruction

The next step after accurately estimating the camera poses and directions is the dense representation of the human ear, for which several methods exist. Patch-based multi-view stereo (PMVS) [14] is a well-established semi-dense reconstruction algorithm that scores high in the Middlebury multi-view stereo benchmark. PMVS is also well suited to our problem, since it is based on very simple detectors (Harris and DoG) that give a reasonably large number of matches, yet it achieves high accuracy because of its filtering procedure.

III. EAR MODEL FITTING

In order to find the surface that best represents the human ear captured in the photos, common surface estimation techniques such as Poisson reconstruction are not enough for accurate results. We therefore developed a novel method that uses morphable models to fit an ear surface to the already reconstructed dense 3D points. Although we are interested in ear reconstruction, nothing prevents the same method from being used for other reconstructions, such as the human face.

A. Database

The database used to construct the morphable model contains 95 adult ears that were scanned using an eFit scanner. Although the scanner gives an accurate reconstruction, some pre-processing was needed to remove parts of the skull or erroneous surfaces before using the scans for the morphable model.
Fig. 2: (a) A sample image used in the 3D reconstruction after detecting the AprilTags; as the photo shows, the tags can be detected even in the presence of a shadow behind the ear. (b) The final result of the dense reconstruction. Points that are not in the ear are automatically filtered, since we know exactly the coordinates of the hole where the ear is placed (cameras are colored from red to blue, and pink marks the template AprilTags that were detected in the whole scene).

B. Reconstructing The Morphable Model

An ear morphable model consists of a shape-shifting 3D ear that changes the position of its vertices in order to give a new form, while keeping the same faces that connect these vertices. These changes are governed by coefficients, which are left to be determined based on the result that needs to be achieved; the new ear is a linear combination of the ears in the database.
In order to construct a morphable model from a set of 3D ears, each one must be in full correspondence with the rest of the ears in the database; in other words, each vertex in each ear has a corresponding vertex in the other ears. This correspondence can be achieved by using a modified version of optical flow for three-dimensional matching [15] [16].
We assume that any human ear can be formed as a linear combination of the ears already acquired in the dataset (this hypothesis is reasonably true for large databases), by the following relation:

\[ S = \sum_{i=1}^{m} a_i S_i, \qquad \sum_{i=1}^{m} a_i = 1 \tag{5} \]

Under this representation the morphable model has $m$ parameters, equal to the number of ears in the database, and it lacks a way to judge the plausibility of the result. Thus, we reformulate equation 5 to include the mean position of each vertex, which enables us to quantify the likelihood of a new ear by measuring how far it is from the mean ear:

\[ S = \bar{S} + \sum_{i=1}^{m} a'_i S'_i \tag{6} \]

One common method to reduce the dimensionality of this problem is principal component analysis (PCA), which maps the previous ear vertices $S'_i$ into a new orthogonal coordinate system represented by the eigenvectors $s_i$:

\[ S = \bar{S} + \sum_{i=1}^{m} \alpha_i s_i \tag{7} \]

This representation also has $m$ parameters, but the difference is that the $s_i$ are uncorrelated and ordered according to their eigenvalues $\sigma_i^2$. In other words, the first vectors carry most of the information about the ear shape and the rest is noise. We can thus focus on the first $l$ parameters and leave the rest equal to zero. Another advantage of this representation is the possibility of knowing the probability of an ear shape from its coefficients $\alpha_i$:

\[ p(\vec{\alpha}) \simeq \exp\left( -\frac{1}{2} \sum_{i=1}^{m-1} \frac{\alpha_i}{\sigma_i} \right) \tag{8} \]

C. Fitting the model to the dense point cloud

Fitting the morphable model is the process of finding the values of the coefficients $\alpha_i$ that best fit a desired result; this is done through optimization. In our case, the aim is to find a model that resembles the dense point cloud reconstruction of the ear. Our proposed method consists of taking each vertex of the model, searching for its nearest neighbour in the dense cloud, and then minimizing this distance over all the vertices of the model.
Before performing the optimization, the model needs to be aligned with the dense cloud. Since we know that the 3D reconstruction of the ear is expressed in centimetres, we apply an initial scale $s_{ini}$, then align the vertices of the mean ear model to the dense cloud using the point-to-plane Iterative Closest Point (ICP) algorithm. After a successful alignment, we store the resulting ICP rotation and translation, $R_{icp}$ and $t_{icp}$, for later use.
Now that the mean ear is in place, the fitting optimization can be performed by minimizing the following error:

\[ \min_{\alpha_i, \lambda_i} (E_{dist} + E_{coef} + E_{params}) \tag{9} \]

The global error contains three terms. The first is the distance error, which characterizes the overall distance between each point and its nearest neighbour in the dense cloud:

\[ E_{dist} = \frac{1}{\sigma_d^2} \sum_{i=1}^{n} \| X_i - cp(X_i) \|_2^2 \tag{10} \]

where $X_i$ is a vertex of the model and $cp(X_i)$ is its nearest neighbour (closest point) in the dense cloud. The second term constrains the variation of the coefficients $\alpha_i$ during the optimization; coefficients with bigger eigenvalues are naturally allowed to vary more, since they carry more information. This term ensures that the result found at the end of the optimization has a relatively high plausibility as calculated by equation 8 ($\sigma_i^2$ are the eigenvalues):

\[ E_{coef} = \sum_{i=1}^{m-1} \frac{\alpha_i}{\sigma_i} \tag{11} \]
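Finally, the last term keeps the model's overall rotations $\lambda_{1,2,3}$, translations $\lambda_{4,5,6}$ and scale $\lambda_7$ from diverging too much from what was found during ICP before the optimization:

\[ E_{params} = \sum_{i=1}^{7} \frac{\lambda_i - \bar{\lambda}_i}{\sigma_{\lambda,i}} \tag{12} \]

where $\bar{\lambda}_i$ denotes the value of the parameter before the optimization.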
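
The PCA model of equation (7) and the three error terms of equations (9) to (12) can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the authors' implementation: the "database" is random, the ICP pre-alignment is replaced by an identity pose, the rotation parameters are taken to be a Rodrigues vector as in equation (4), SciPy's stock SLSQP stands in for the variation the authors used, and the ratios in the coefficient and parameter terms are squared so that each term is non-negative (the equations as extracted show no exponent). All function names and the values chosen for sigma_d and sigma_lambda are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial import cKDTree


def build_pca_model(shapes):
    """PCA of registered shapes; `shapes` is (m, 3n), one flattened ear per row."""
    mean = shapes.mean(axis=0)
    _, sing, vt = np.linalg.svd(shapes - mean, full_matrices=False)
    keep = shapes.shape[0] - 1            # centering leaves at most m - 1 modes
    sigma = sing[:keep] / np.sqrt(keep)   # standard deviations sigma_i
    return mean, vt[:keep], sigma


def rodrigues(rotvec):
    """Rotation matrix of equation (4); the norm of `rotvec` is the angle."""
    theta = np.linalg.norm(rotvec)
    if theta < 1e-12:
        return np.eye(3)
    r = rotvec / theta
    cross = np.array([[0, -r[2], r[1]], [r[2], 0, -r[0]], [-r[1], r[0], 0]])
    return (np.cos(theta) * np.eye(3)
            + (1 - np.cos(theta)) * np.outer(r, r)
            + np.sin(theta) * cross)


def model_vertices(mean, basis, alpha, lam):
    """Equation (7), then rotation lam[0:3], translation lam[3:6], scale lam[6]."""
    shape = (mean + alpha @ basis).reshape(-1, 3)
    return lam[6] * shape @ rodrigues(lam[:3]).T + lam[3:6]


def fitting_error(params, mean, basis, sigma, tree, lam0, sigma_d, sigma_lam):
    """E_dist + E_coef + E_params, with squared ratios (an assumption)."""
    alpha, lam = params[:len(sigma)], params[len(sigma):]
    dist, _ = tree.query(model_vertices(mean, basis, alpha, lam))
    e_dist = np.sum(dist ** 2) / sigma_d ** 2
    e_coef = np.sum((alpha / sigma) ** 2)
    e_params = np.sum(((lam - lam0) / sigma_lam) ** 2)
    return e_dist + e_coef + e_params


# Synthetic stand-in for the ear database: 8 "ears" of 40 vertices each.
rng = np.random.default_rng(0)
base = rng.uniform(-2.5, 2.5, size=120)
shapes = base + 0.1 * rng.standard_normal((8, 120))
mean, basis, sigma = build_pca_model(shapes)

# Target "dense cloud": a known member of the model, slightly scaled and shifted.
alpha_true = np.zeros(len(sigma))
alpha_true[:2] = [1.5 * sigma[0], -1.0 * sigma[1]]
lam_true = np.array([0.0, 0.0, 0.0, 0.05, -0.05, 0.0, 1.02])
cloud = model_vertices(mean, basis, alpha_true, lam_true)
tree = cKDTree(cloud)

# Start from the mean ear, with the pose assumed to come from ICP (identity here).
lam0 = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0])
x0 = np.concatenate([np.zeros(len(sigma)), lam0])
args = (mean, basis, sigma, tree, lam0, 0.05, 0.1)
initial_error = fitting_error(x0, *args)
result = minimize(fitting_error, x0, args=args, method="SLSQP")
print(f"error before: {initial_error:.2f}, after: {result.fun:.2f}")
```

The nearest-neighbour term is only piecewise smooth, because the closest point of a vertex can change during the optimization; this is one reason a good initial alignment matters before the coefficients are optimized.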

Fig. 3: The morphable model is deformed in order to fit the human ear. As (b) shows, it can have a different scale compared with the starting point (a), which is the average ear before optimization.

IV. RESULTS

The entire process is automated and does not need any human intervention or any special device, except for the template, which can be printed and used.
The images used in all experiments have a size of 1836x3264 but were reduced to 459x816 in order to ensure the proper functioning of PMVS; these images were taken with a cell phone. The algorithm was developed entirely in Python, except for PMVS, and the whole process took around 4 minutes on average on a machine with a v3 3.30 GHz CPU and 8 GB of RAM. The run time can be drastically reduced through algorithm optimisation and implementation in faster languages such as C++, so that the whole algorithm could run close to real time.
The number of images taken is between 10 and 20, with no pre-processing; images that are too close or too far are filtered automatically (due to the absence of tags).
A variation of Sequential Least SQuares Programming [17] was used to minimize the error function. The results are shown in figures 3 and 4. The result is quantified by comparing the model at the end of the optimization with the eFit scans, which were considered as ground truth; this is done by aligning the model with the ear scan using ICP and then calculating the distance to the nearest neighbour.

Fig. 4: The heatmap is calculated by searching for the nearest neighbor of each point of the morphable model in the set of points of the ground truth. (a) is the eFit scan, (b) shows the model heatmap distance error (average error 0.0685 cm), and (c) shows the mean ear heatmap error before fitting (average error 0.0808 cm).

Our results show an overall successful fitting of the ear, and this is especially true when compared with the mean ear. The method also works well for ears bigger or smaller than the ones in the database, since the optimization scales the morphable model to fit the size of the subject's ear. There is still room for improvement, however: for example, the last part of the ear (the lobule) has more error than the rest, because this part is not rigid and is thus vulnerable to the slightest movement caused by placing the template. After talking with experts in biometrics and 3D sound generation, it turns out that this does not affect the result in any meaningful way.
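
The per-vertex error behind the heatmaps of Fig. 4 is a nearest-neighbour distance, which can be sketched as below. This is a minimal illustration on toy data, assuming the model has already been aligned to the scan with ICP; the function name `nearest_neighbour_errors` and the four-point example are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree


def nearest_neighbour_errors(model_points, ground_truth_points):
    """Distance from every model vertex to its nearest ground-truth point."""
    tree = cKDTree(ground_truth_points)
    distances, _ = tree.query(model_points)
    return distances


# Toy data in centimetres: the "model" is the scan lifted by 0.05 cm along z.
ground_truth = np.array([[0.0, 0.0, 0.0],
                         [1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [1.0, 1.0, 0.0]])
model = ground_truth + np.array([0.0, 0.0, 0.05])

errors = nearest_neighbour_errors(model, ground_truth)
mean_error = errors.mean()
print(round(mean_error, 4))  # 0.05
```

The individual values in `errors` are what a heatmap would color per vertex, and their mean corresponds to the kind of average error quoted in the caption of Fig. 4.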
V. CONCLUSION

We have proposed a novel method to reconstruct the surface of the ear that best fits the dense point cloud reconstruction. We also presented the entire pipeline for automatically acquiring the surface reconstruction of the human ear from cell phone images alone, using a printed template, whereas most other methods are not fully automatic. Future work could remove the use of the template entirely while maintaining the same level of accuracy and fast running time. This could be done by looking into novel matching algorithms that aim to find accurate and rich matches for texture-less objects such as human ears.
REFERENCES
[1] A. Pflug and C. Busch, “Ear biometrics: a survey of detection, feature
extraction and recognition methods,” IET biometrics, vol. 1, no. 2, pp.
114–129, 2012.
[2] A. V. Iannarelli, Ear identification. Paramont Publishing Company,
1989.
[3] S. Ghorbal, T. Auclair, C. Soladié, and R. Séguier, “Pinna
morphological parameters influencing hrtf sets.”
[4] S. Ghorbal, R. Séguier, and X. Bonjour, “Process of hrtf
individualization by 3d statistical ear model,” in Audio Engineering
Society Convention 141. Audio Engineering Society, 2016.
[5] S. Cadavid and M. Abdel-Mottaleb, “3-D ear modeling and recognition
from video sequences using shape from shading,” IEEE Transactions
on Information Forensics and Security, vol. 3, no. 4, pp. 709–718,
2008.
[6] H. Liu and J. Yan, “Multi-view Ear Shape Feature Extraction and
Reconstruction.” IEEE, Dec. 2007, pp. 652–658.
[7] H. Zeng, Z.-C. Mu, K. Wang, and C. Sun, “Automatic 3d ear
reconstruction based on binocular stereo vision,” in Systems, Man
and Cybernetics, 2009. SMC 2009. IEEE International Conference
on. IEEE, 2009, pp. 5205–5208.
[8] D. G. Lowe, “Object recognition from local scale-invariant features,”
in Computer vision, 1999. The proceedings of the seventh IEEE
international conference on, vol. 2. Ieee, 1999, pp. 1150–1157.
[9] N. Snavely, S. Seitz, and R. Szeliski, “Photo tourism: Exploring
image collections in 3d (2006),” URL http://www.cs.cornell.edu/~snavely/bundler.
[10] E. Olson, “AprilTag: A robust and flexible visual fiducial system,” in
Robotics and Automation (ICRA), 2011 IEEE International Conference
on. IEEE, 2011, pp. 3400–3407.
[11] B. M. Haralick, C.-N. Lee, K. Ottenberg, and M. Nölle, “Review and
analysis of solutions of the three point perspective pose estimation
problem,” International journal of computer vision, vol. 13, no. 3, pp.
331–356, 1994.
[12] F. Moreno-Noguer, V. Lepetit, and P. Fua, “Accurate non-iterative
O(n) solution to the PnP problem,” in Computer vision, 2007. ICCV
2007. IEEE 11th international conference on. IEEE, 2007, pp. 1–8.
[13] S. Li, C. Xu, and M. Xie, “A robust O(n) solution to the
perspective-n-point problem,” IEEE transactions on pattern analysis and
machine intelligence, vol. 34, no. 7, pp. 1444–1450, 2012.
[14] Y. Furukawa and J. Ponce, “Accurate, dense, and robust multiview
stereopsis,” IEEE transactions on pattern analysis and machine
intelligence, vol. 32, no. 8, pp. 1362–1376, 2010.
[15] V. Blanz and T. Vetter, “Face recognition based on fitting a 3d
morphable model,” IEEE Transactions on pattern analysis and
machine intelligence, vol. 25, no. 9, pp. 1063–1074, 2003.
[16] T. Vetter and V. Blanz, “Estimating coloured 3d face models from
single images: An example based approach,” in European Conference
on Computer Vision. Springer, 1998, pp. 499–513.
[17] D. Kraft, “A software package for sequential quadratic programming,”
Forschungsbericht, Deutsche Forschungs- und Versuchsanstalt für
Luft- und Raumfahrt, 1988.
