Abstract

In this paper, we introduce a novel approach to bypass modern face authentication systems. More specifically, by leveraging a handful of pictures of the target user taken from social media, we show how to create realistic, textured, 3D facial models that undermine the security of widely used face authentication solutions. Our framework makes use of virtual reality (VR) systems, incorporating along the way the ability to perform animations (e.g., raising an eyebrow or smiling) of the facial model, in order to trick liveness detectors into believing that the 3D model is a real human face. The synthetic face of the user is displayed on the screen of the VR device, and as the device rotates and translates in the real world, the 3D face moves accordingly. To an observing face authentication system, the depth and motion cues of the display match what would be expected for a human face.

We argue that such VR-based spoofing attacks constitute a fundamentally new class of attacks that point to serious weaknesses in camera-based authentication systems: unless they incorporate other sources of verifiable data, systems relying on color image data and camera motion are prone to attacks via virtual realism. To demonstrate the practical nature of this threat, we conduct thorough experiments using an end-to-end implementation of our approach and show how it undermines the security of several face authentication solutions that include both motion-based and liveness detectors.

1 Introduction

Over the past few years, face authentication systems have become increasingly popular as an enhanced security feature in both mobile devices and desktop computers. As the underlying computer vision algorithms have matured, many application designers and nascent specialist vendors have jumped in and started to offer solutions for mobile devices with varying degrees of security and usability. Other more well-known players, like Apple and Google, are poised to enter the market with their own solutions, having already acquired several facial recognition software companies.¹ While the market is segmented based on the type of technology offered (e.g., 2D facial recognition, 3D recognition, and facial analytics/face biometric authentication), Gartner research estimates that the overall market will grow to over $6.5 billion in 2018 (compared to roughly $2 billion today) [13].

With this push to market, improving the accuracy of face recognition technologies remains an active area of research in academia and industry. Google's FaceNet system, which achieved near-perfect accuracy on the Labeled Faces in the Wild dataset [47], exemplifies one such effort. Additionally, recent advances with deep learning algorithms [38, 53] show much promise in strengthening the robustness of the face identification and authentication techniques used today. Indeed, state-of-the-art face identification systems can now outperform their human counterparts [36], and this high accuracy is one of the driving factors behind the increased use of face recognition systems.

However, even given the high accuracy of modern face recognition technologies, their application in face authentication systems has left much to be desired. For instance, at the Black Hat security conference in 2009, Duc and Minh [10] demonstrated the weaknesses of popular face authentication systems from commodity vendors like Lenovo, Asus, and Toshiba. Amusingly, Duc and Minh [10] were able to reliably bypass face-locked computers simply by presenting the software with photographs and fake pictures of faces. Essentially, the security of these systems rested solely on the problem of face detection, rather than face authentication. This widely publicized event led to the subsequent integration of more robust face authentication protocols. One prominent example is Android OS, which augmented its face authen-

¹ See, for example, "Apple Acquires Face Recognition, Expression Analysis Firm Emotient," TechTimes, Jan. 2016; "Google Acquires Facial Recognition Software Company PittPatt," WSJ, 2011.
(e.g., when the photo is taken from the side). The texture of these occluded parts must be estimated in a manner that does not introduce too many artifacts (stage ). Once the texture is filled, we have a realistic 3D model of the user's face based on a single image.

However, despite its realism, the output of stage  is still not able to fool modern face authentication systems. The primary reason for this is that modern face authentication systems use the subject's gaze direction as a strong feature, requiring the user to look at the camera in order to pass the system. Therefore, we must also automatically correct the direction of the user's gaze on the textured mesh (stage ). The adjusted model can then be deformed to produce animation for different facial expressions, such as smiling, blinking, and raising the eyebrows (stage ). These expressions are often used as liveness clues in face authentication systems, and as such, we need to be able to automatically reproduce them on our 3D model. Finally, we output the textured 3D model into a virtual reality system (stage ).

Using this framework, an adversary can bypass both the face recognition and liveness detection components of modern face authentication systems. In what follows, we discuss the approach we take to solve each of the various challenges that arise in our six-stage process.

3.1 Facial Landmark Extraction

Starting from multiple input photos of the user, our first task is to perform facial landmark extraction. Following the approach of Zhu et al. [63], we extract 68 2D facial landmarks in each image using the supervised descent method (SDM) [59]. SDM successfully identifies facial landmarks under relatively large pose differences (±45 deg yaw, ±90 deg roll, ±30 deg pitch). We chose the technique of Zhu et al. [63] because it achieves a median alignment error of 2.7 pixels on well-known datasets [1] and outperforms other commonly used techniques (e.g., [5]) for landmark extraction.

For our needs, SDM works well on most online images, even those where the face is captured at a low resolution (e.g., 40 × 50 pixels). It does, however, fail on a handful of the online photos we collected (less than 5%) where the pose is beyond the tolerance level of the algorithm. If this occurs, we simply discard the image. In our experiments, the landmark extraction results are manually checked for correctness, although an automatic scoring system could potentially be devised for this task. Example landmark extractions are shown in Figure 2.

Figure 2: Examples of facial landmark extraction.

3.2 3D Model Reconstruction

The 68 extracted 2D point landmarks from each of the N input images provide us with a set of coordinates s_{i,j} ∈ R², with 1 ≤ i ≤ 68, 1 ≤ j ≤ N. The projection of the 3D points S_{i,j} ∈ R³ on the face onto the image coordinates s_{i,j} follows what is called the "weak perspective projection" (WPP) model [16], computed as follows:

    s_{i,j} = f_j P R_j (S_{i,j} + t_j) ,        (1)

where f_j is a uniform scaling factor; P = [1 0 0; 0 1 0] is the 2 × 3 projection matrix; R_j is a 3 × 3 rotation matrix defined by
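As a concrete illustration of Eq. (1), the WPP model can be sketched in a few lines of NumPy. This is only an illustrative sketch: the scale, rotation, and translation below are arbitrary example values, whereas in practice f_j, R_j, and t_j must be estimated for each image.

```python
import numpy as np

def weak_perspective_projection(S, f, R, t):
    """Project 3D landmarks S (n x 3) to 2D image coordinates
    via Eq. (1): s = f * P * R * (S + t)."""
    # P keeps the x and y components and drops the depth coordinate
    P = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
    return f * (P @ (R @ (S + t).T)).T  # n x 2 array

# Example: with an identity rotation and zero translation, the
# projection simply scales the x, y coordinates of each point.
S = np.array([[0.0, 0.0, 5.0],
              [1.0, 2.0, 5.0]])  # two 3D landmarks
s = weak_perspective_projection(S, f=2.0, R=np.eye(3), t=np.zeros(3))
print(s)  # [[0. 0.], [2. 4.]]
```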
When combining Eqs. (1) and (2), we inevitably run into the so-called "correspondence problem." That is,

Figure 4: 3D facial model (right) built from facial landmarks extracted from 4 images (left).
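The role of correspondences in fitting Eq. (1) can be seen in a toy example: once each 2D landmark is matched to a 3D model point, the per-image WPP parameters can be recovered by least squares. The sketch below is an illustrative simplification, not the actual solver; it fixes R_j to the identity and recovers only the scale f_j and the in-plane translation.

```python
import numpy as np

# Toy setup: known 3D model points and their matched 2D projections,
# generated with Eq. (1) using R = I (illustrative values only).
S = np.array([[0.0, 0.0, 5.0], [1.0, 2.0, 5.0], [3.0, 1.0, 5.0]])
true_f, true_t = 2.0, np.array([0.5, -0.5, 0.0])
P = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
s = true_f * (P @ (S + true_t).T).T  # observed 2D landmarks

# With correct correspondences, s = f * (S_xy + t_xy) is linear in the
# unknowns x = [f, f*t_x, f*t_y], so a least-squares solve recovers them.
A = np.zeros((2 * len(S), 3))
A[0::2, 0] = S[:, 0]; A[0::2, 1] = 1.0   # rows for x coordinates
A[1::2, 0] = S[:, 1]; A[1::2, 2] = 1.0   # rows for y coordinates
b = s.reshape(-1)                         # [x0, y0, x1, y1, x2, y2]
x, *_ = np.linalg.lstsq(A, b, rcond=None)
f, t_xy = x[0], x[1:] / x[0]
print(f, t_xy)  # recovers f = 2.0 and t_xy = [0.5, -0.5]
```

With wrong correspondences, the same solve would converge to a pose that does not match the true one, which is why the matching step matters.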
4.2 Seeing Your Face Is Enough

Our approach not only defeats existing commercial systems that employ liveness detection; it fundamentally undermines the process of liveness detection based on color images entirely. To illustrate this, we use our method to attack the recently proposed authentication approach of Li et al. [34], which obtains a high rate of success in guarding against video-based spoofing attacks. This system adds another layer to motion-based liveness detection by requiring that the movement of the face in the captured video be consistent with the data obtained from the motion sensor of the device. Fortunately, as discussed in §3, the data consistency requirement is automatically satisfied by our virtual reality spoofing system because the 3D model rotates in tandem with the camera motion.

Central to the approach of Li et al. [34] is a classifier that evaluates the consistency of captured video and motion sensor data. In turn, the learned classifier is used to distinguish real faces from spoofed ones. Since their code and training samples have not been made public, we implemented our own version of their liveness detection system and trained a classifier with our own training data. We refer the reader to [34] for a full overview of the method.

Following the methodology of [34], we capture video samples (and inertial sensor data) of ∼4 seconds from the front-facing camera of a mobile phone. In each sample, the phone is held at a distance of 40 cm from the subject and moved back and forth 20 cm to the left and right. We capture 40 samples of real subjects moving the phone in front of their face, 40 samples where a pre-recorded video of a user is presented to the camera, and 30 samples where the camera is presented with a 3D reconstruction of a user in our VR environment. For training, we use a binary logistic regression classifier trained on 20 samples from each class, with the other samples used for testing. Due to the relatively small size of our training sets, we repeat our classification experiments 4 times, with random train/test splits in each trial, and we report the average performance over all four trials.

Training Data      Real         Video       VR
Real+Video         19.50 / 20   0.25 / 20   9.75 / 10
Real+Video+VR      14.00 / 20   0.00 / 20   5.00 / 10
Real+VR            14.75 / 20   —           5.00 / 10

Table 3: Number of testing samples classified as real users. Values in the first column represent true positive rates, and the second and third columns represent false positives. Each row shows the classification results after training on the classes in the first column. The results were averaged over four trials.

The results of our experiments are shown in Table 3. For each class (real user data, video spoof data, and VR data), we report the average number (over 4 trials) of test samples classified as real user data. We experiment with three different training configurations, which are listed in the first column of the table. The first row shows the results when using real user data as positive samples and video spoof data as negative samples. In this case, it can easily be seen that the real-versus-video identification is almost perfect, matching the results of [34]. However, our VR-based attack is able to spoof this training configuration nearly 100% of the time. The second and third rows of Table 3 show the classification performance when VR spoof data is included in the training data. In both cases, our approach defeats the liveness detector in 50% of trials, and the real user data is correctly identified as such less than 75% of the time.

All three training configurations clearly point to the fact that our VR system presents motion features that are close to real user data. Even if the liveness detector of [34] is specifically trained to look for our VR-based attack, 1 out of every 2 attacks will still succeed, with the false rejection rate also increasing. Any system using
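The training protocol described above can be sketched as follows. This is a hedged illustration using a synthetic one-dimensional consistency score (the actual features of Li et al. [34] are derived from correlating motion in the captured video with the device's inertial sensor data); it shows only the train/test split and classifier choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic "consistency" feature: real users exhibit high video/sensor
# agreement, replayed videos exhibit low agreement (illustrative values).
real  = rng.normal(0.9, 0.05, size=(40, 1))   # 40 real-user samples
video = rng.normal(0.2, 0.10, size=(40, 1))   # 40 video-spoof samples

# Binary logistic regression trained on 20 samples from each class,
# with the remaining samples held out for testing.
X_train = np.vstack([real[:20], video[:20]])
y_train = np.array([1] * 20 + [0] * 20)
clf = LogisticRegression().fit(X_train, y_train)

X_test = np.vstack([real[20:], video[20:]])
y_test = np.array([1] * 20 + [0] * 20)
print(clf.score(X_test, y_test))  # near-perfect real-vs-video separation
```

On such well-separated synthetic data the real-versus-video split is essentially perfect, mirroring the first row of Table 3; the attack succeeds precisely because VR spoof samples land on the "real" side of this feature, not because the classifier is weak.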
[9] B. Chu, S. Romdhani, and L. Chen. 3D-aided face recognition robust to expression and pose variations. In Computer Vision and Pattern Recognition (CVPR), Conference on, pages 1907–1914, 2014.

[10] N. Duc and B. Minh. Your face is not your password. In Black Hat Conference, volume 1, 2009.

[17] P. Ilia, I. Polakis, E. Athanasopoulos, F. Maggi, and S. Ioannidis. Face/Off: Preventing privacy leakage from photos in social networks. In Proceedings of the 22nd ACM Conference on Computer and Communications Security, pages 781–792, 2015.

[18] Intel Security. True Key™ by Intel Security: Security white paper 1.0, 2015. URL https://b.tkassets.com/shared/TrueKey-SecurityWhitePaper-v1.0-EN.pdf.

[19] H.-K. Jee, S.-U. Jung, and J.-H. Yoo. Liveness detection for embedded face recognition system. International Journal of Biological and Medical Sciences, 1(4):235–238, 2006.

[20] L. A. Jeni, J. F. Cohn, and T. Kanade. Dense 3D face alignment from 2D videos in real-time. In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, volume 1, pages 1–8. IEEE, 2015.

[21] O. Jesorsky, K. J. Kirchberg, and R. W. Frischholz. Robust face detection using the Hausdorff distance. In Audio- and Video-Based Biometric Person Authentication, pages 90–95. Springer, 2001.

[22] I. Jolliffe. Principal Component Analysis. Wiley Online Library, 2002.

[23] I. Kemelmacher-Shlizerman. Internet based morphable model. In Proceedings of the IEEE International Conference on Computer Vision, pages 3256–3263, 2013.

[24] I. Kemelmacher-Shlizerman and R. Basri. 3D face reconstruction from a single image using a single reference face shape. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(2):394–405, 2011.

[26] H.-N. Kim, A. El Saddik, and J.-G. Jung. Leveraging personal photos to inferring friendships in social network services. Expert Systems with Applications, 39(8):6955–6966, 2012.

[27] S. Kim, S. Yu, K. Kim, Y. Ban, and S. Lee. Face liveness detection using variable focusing. In Biometrics (ICB), 2013 International Conference on, pages 1–6, 2013.

[33] Y. Li, K. Xu, Q. Yan, Y. Li, and R. H. Deng. Understanding OSN-based facial disclosure against face authentication systems. In Proceedings of the ACM Symposium on Information, Computer and Communications Security (ASIACCS), pages 413–424. ACM, 2014.

[34] Y. Li, Y. Li, Q. Yan, H. Kong, and R. H. Deng. Seeing your face is not enough: An inertial sensor-based liveness detection for face authentication. In Proceedings of the 22nd ACM Conference on Computer and Communications Security, pages 1558–1569, 2015.

[35] Y. Liu, K. P. Gummadi, B. Krishnamurthy, and A. Mislove. Analyzing Facebook privacy settings: User expectations vs. reality. In Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement, pages 61–70. ACM, 2011.

[36] C. Lu and X. Tang. Surpassing human-level face verification performance on LFW with GaussianFace. arXiv preprint arXiv:1404.3840, 2014.

[37] J. Määttä, A. Hadid, and M. Pietikäinen. Face spoofing detection from single images using micro-texture analysis. In Biometrics (IJCB), International Joint Conference on, pages 1–7, 2011.

[38] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In Proceedings of the British Machine Vision Conference (BMVC), 2015.

[39] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter. A 3D face model for pose and illumination invariant face recognition. In Proceedings of the 6th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) for Security, Safety and Monitoring in Smart Environments, 2009.