
A 3D NIR Camera for Gesture Control of Video

Game Consoles
Dan Ionescu, Viorel Suse, Cristian Gadea, Bogdan
Solomon
School of Electrical Engineering and Computer Science
University of Ottawa, Ottawa, Canada
{dan, viorel, cgadea, bsolomon}@ncct.uottawa.ca
Bogdan Ionescu, Shahidul Islam, Marius Cordea
Mgestyk Technologies, Inc.
Ottawa, Canada
{bogdan, shahid, marius}@mgestyk.com



Abstract—Gesture-based human-computer interaction is
presently an important area of research that aims to make
reliable touch-free user interfaces a reality. More recent gesture
detection technologies use cameras that rely on near-infrared
(NIR) illumination to obtain 3D depth information for objects
within the camera's field-of-view. These cameras use either
structured light, time-of-flight (ToF), or stereoscopy. Depth
images allow a person's body and hands to be separated from the
background, thereby permitting modern image processing
algorithms to be used for greatly improved gesture detection.
This paper presents a new depth generation principle that uses a
monotonic increasing and decreasing function to control NIR
illumination pulses. Reflected light pulses are captured as a series
of images and the depth map of the visible objects is calculated in
real-time using reconfigurable hardware. Measurements and
results are given to explain how the depth map is built and how
the camera allows gestures to be used to control a video game
console.

Keywords—human-computer interfaces; real-time 3D camera
technology; gesture control; video game consoles
I. INTRODUCTION
Human-computer interaction through gesture control is a
growing research area with significant implications for future
user interfaces. Although gesture control research began with terminals attached to computers [1], large-scale implementation and use of gesture control remains impractical today. This paper focuses on gesture control derived from the analysis of camera-based images, rather than on gesture control based on touch or on sensors specialized in measuring linear or rotational acceleration, such as those found in the Nintendo Wii controllers [2]. Similarly, the paper does not elaborate on gesture techniques that use sensors placed on various hand muscles [3], or on sensors positioned around the display area using technologies such as ultrasound or proximity sensors [4]. The focus remains on image-sensor-based cameras that operate using active illumination in the near-infrared (NIR) range.
It was demonstrated in 2008 that, through the use of a 3D
IR camera, gestures based on hand and finger movements can
be robustly understood by computers [5], thereby allowing
users to play games and interact with computer applications in
natural and immersive ways. Microsoft introduced the Kinect
camera in 2010, which was based on PrimeSense technology
that worked by projecting a structured light pattern [7].
Similar techniques have been shown in [6] for 3D object
reconstruction. By combining the resulting depth map with
the color 2D images obtained from the Kinect's additional camera, the computational unit of the Xbox 360 console was capable of providing a fairly robust gesture recognition
solution. Numerous applications were produced around
Kinect by various groups of image processing researchers
[8].
Cameras providing depth data rely on a limited number of operating principles [9][10]. Gesture recognition falls into two main categories: i) hand- and finger-based gestures, and ii) full-body or body-part movement gestures. For the first category, the computer has to concentrate on detecting hands, fists, palms, and fingers. The second category has, at its core, the detection of the median axis of the parts of the body, a method known as "skeleton" tracking. A research group at Microsoft is leading the way in this field [11].
The field of 3D object reconstruction and the methods used to realize it can be categorized into three basic techniques: i) stereo or triangulation-based techniques, also known as stereo-vision; ii) structured light techniques, where depth images
are obtained by projecting a known pattern of light and
recording the deformations of the pattern relative to a
reference image which contains the pattern projected on a
planar surface; iii) LIDAR and time-of-flight (ToF) imagers,
which obtain a depth map by measuring the time or the phase
taken or shifted, respectively, for a series of generated pulses
of light that return to the camera. The time-sensitive gating
techniques required by LIDAR and ToF systems are
typically implemented using image intensifiers or other
exotic components such as photovoltaic cells or shutters with
nanosecond timings [12][13][14][15][18].
In addition, there are depth-mapping cameras which
combine a distance measuring device with a scanning
method and system. This category includes 3D laser micro-
vision for micro objects [12][13] and laser scanners for
macro objects [14][15][16]. There have also been attempts to
capture 3D information from 2D images by using the flow of
the movements in consecutive frames, as well as by using
human cues in perceiving the depth from a single view [17].
All of the above methods, however, require intensive calculations and filtering, and some are specific to SLAM (Simultaneous Localization and Mapping), where the real-time requirements may not be as tight as those imposed by gesture recognition. Any processing that takes longer than 33.33 ms (30 fps) may prevent the user from staying in sync with the actions triggered by the gesture. In addition, it has been
found that in order for gesture control to be viable, images
provided for gesture interpretation have to be independent of
lighting conditions, provide a third dimension (depth map) of
the user gesture in at least a VGA (640x480) resolution, and be
complemented by a solid library of gestures and algorithms. To
detect, track, and recognize gestures and their movement,
algorithms have to be well optimized such that the entire
detection, tracking, and recognition process takes place in real-
time (i.e. 33.33ms, the time needed for the acquisition,
processing and display of the result).
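To make the timing constraint concrete, the following minimal sketch (an illustrative back-of-the-envelope check with placeholder stage durations, not measured values from the system) verifies that a candidate acquisition/processing/display pipeline fits within the 30 fps frame period.

```python
# Illustrative frame-budget check for 30 fps gesture control.
# The per-stage durations below are hypothetical placeholders.
FRAME_BUDGET_MS = 1000.0 / 30.0  # ~33.33 ms per frame

stages_ms = {
    "acquisition": 12.0,
    "depth_processing": 14.0,
    "gesture_recognition": 5.0,
    "display": 2.0,
}

total_ms = sum(stages_ms.values())
verdict = "within" if total_ms <= FRAME_BUDGET_MS else "exceeds"
print(f"pipeline total: {total_ms:.2f} ms ({verdict} the {FRAME_BUDGET_MS:.2f} ms budget)")
```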
By taking advantage of the strengths offered by depth
cameras, gesture-based control can be used to provide a natural
and intuitive interface for a variety of computer applications
and electronic devices such as video game consoles, notebooks,
tablets and Smart TVs. The use of touch-free gestures to
control console-based video games has been popularized by the
Kinect sensor, which connects to the Xbox 360 console to
allow a variety of modern games to be played.
Kinect's full-body interactions have proven to be popular
for fitness and dancing games played within the open space of
the living room. However, these games were designed and
developed with Kinect support in mind. That is, owners of the
game console cannot play any game they wish by using the 3D
camera. This paper will introduce an intelligent and real-time
3D camera that can be connected to existing game consoles and
allow for motion control of any game typically played with a
console's physical controller.
A new system for real-time measurements of depth is
introduced in this paper. The technique, known as "space slicing," is based on a variable infrared beam and inverse gating. The novel approach opens new directions that eliminate the need for costly and/or exotic devices such as image intensifiers, while also supporting the high image resolutions available from off-the-shelf image sensors. Unlike stereo-based approaches, this approach requires only one image sensor. The space slicing principle is based on
modulated near infrared (NIR) light whose intensity is
controlled throughout the process of building basic images. The
images go through an image processing algorithm for
calculating depth information. The method also uses a
relatively high frame rate CMOS sensor which performs the
gating, thereby allowing the desired depth precision to be
obtained. This depth information is then used to control a video
game console.
The remainder of this paper is organized as follows: Section
II discusses the architecture of the 3D camera and how it
differs from the current state-of-the-art, including its ability for
game console and Smart TV control. Section III then offers a
closer look at the performance of the 3D camera and how
effectively it can be used to enable natural gesture control of a
gaming console system. Finally, Section IV reflects on the
contributions of this paper and provides topics for future
research.
II. ARCHITECTURE
Depth measurement in the new camera is accomplished by generating tightly controlled and synchronized light pulses, produced by a module called the illuminator of the camera. The illuminator is driven at a high but variable frequency, as well as a variable duty cycle. A cone of diffused near-infrared (NIR) light is therefore projected onto the scene by a number of lasers. For
the camera described in this paper, the infrared light was
chosen to be in the 850nm range, while the total pulsed
power was set to 150 mW to obtain the desired maximum
depth distance of 5m (the light is dispersed upon leaving the
illuminator). Gesture interactions aimed at the living room,
such as for game console control, typically require people to
be located at a variable distance from one to five meters such
that either fingers or the whole body can be acquired
properly by the combination of optics and the image sensor.
The light source is generated as a square laser pulse of short
and variable duration. The infrared light produced is an
expanding spherical surface of finite width determined by the square pulse of the laser module (a "light wall"). The wall is also controlled in intensity: a special intensity function is generated based on a monotonically varying frequency and duty-cycle (PWM) pulse, which is precisely controlled and synchronized by a function implemented by the camera controller program. The infrared light is reflected back to the
camera by the real-world 3D scene, and provides information
slice by slice of the object, which is translated into a depth
image by a depth image video processor.
The duration of the illumination, which is proportional to
the distance of the object to the camera, is set by the video
processor according to the desired distances and is controlled by a
reconfigurable hardware architecture unit. The illuminator
therefore sends controlled pulses to the object, thereby
exploring it in a spatial manner. By repeating this slicing
process a number of times and by controlling the camera
parameters in real-time, a depth image of the object will be
produced. The real-time nature of the camera requires that a
new depth image be obtained every 33.33ms, which
minimizes the delay between the real-world object
movement and the frame received by the media processor
which processes the images frame by frame in real-time.
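The firmware of the illumination controller is not published; the sketch below is only a schematic illustration, in Python, of the acquisition sequence described above, in which the pulse duration sets the distance probed, the sensor is gated in synchrony with each pulse, and a set of sliced images is combined into one depth frame. The `camera.capture_gated` call, the slice boundaries, and the thresholded combine step are all assumptions.

```python
import numpy as np

C_MM_PER_NS = 299.8  # light travels roughly 299.8 mm per nanosecond

def pulse_ns_for_range(max_range_m: float) -> float:
    """Round-trip pulse duration (ns) needed to probe out to max_range_m."""
    return 2.0 * max_range_m * 1000.0 / C_MM_PER_NS

def acquire_depth_frame(camera, slice_ranges_m):
    """Capture one gated image per distance slice and combine them.

    `camera.capture_gated(pulse_ns)` is a hypothetical driver call returning
    a 2D NIR intensity image for a pulse of the given duration.
    """
    slices = [camera.capture_gated(pulse_ns_for_range(r)) for r in slice_ranges_m]
    stack = np.stack(slices)                       # shape: (num_slices, H, W)
    # Assumed combine step: per pixel, take the first slice whose returned
    # intensity exceeds a simple global threshold as a coarse depth index.
    depth_index = np.argmax(stack > stack.mean(), axis=0)
    return depth_index
```

Repeating such an acquisition every 33.33 ms would yield the 30 fps depth stream described above.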
One of the most important features of the depth camera is
the ability to change the parameters of the depth
measurements on an as-needed basis. It can therefore be
adjusted to process nearby hand gestures, distant full-body
movements, or both, as needed by the application. This is an
important feature as there is presently no other camera which
can be used for close-up finger detection as well as far-back
full-body detection. The camera can therefore produce depth
images which will include certain objects, while disregarding
others before or beyond the limits of the defined "wall". This is made possible by the novel "space slicing" principle,
which is used to build the depth map of the objects present in
the field of view (FOV).
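A minimal sketch of the resulting masking behaviour is given below; the array layout, millimetre units, and sentinel value are assumptions used only to illustrate how pixels outside the configured wall could be discarded.

```python
import numpy as np

def mask_outside_wall(depth_mm, near_mm, far_mm, invalid=0):
    """Keep only pixels whose depth lies inside the configured near/far wall;
    everything else is replaced with a sentinel value so that downstream
    hand/body detection ignores it."""
    inside = (depth_mm >= near_mm) & (depth_mm <= far_mm)
    return np.where(inside, depth_mm, invalid)

# Example: keep only the 0.8 m - 1.5 m band used for close-range hand gestures.
depth = np.random.randint(300, 5000, size=(480, 640))  # synthetic depth map (mm)
hands_only = mask_outside_wall(depth, near_mm=800, far_mm=1500)
```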
Fig. 1 summarizes the electrical design of the camera's
depth engine and how the various components interact to
produce a depth image. The Depth Engine, which is
implemented using a reconfigurable hardware architecture unit,
is composed of the Depth Engine Video Processor, a
DDR2/DDR3 Memory Controller, a USB Controller, a SPI
Controller, a Command Decode and Execute Module, an
Illumination Controller, and a DSP Microcontroller Interface.
In Fig. 1, "D" is used to signify the exchange of data, while "C" represents control and "S" represents status.
The Depth Image Video Processor receives the space sliced
images from the image sensor in groups of eight or four
depending on the camera resolution used. All depth
calculations take place in parallel in the hardware engine. The
hardware engine controls the illuminator and keeps the
synchronization between the image sensor and the illuminator.
The control takes into account the synchronization necessary
for the production of the images such that the depth image can
be calculated. After the distance from the camera to the object
is calculated, the resulting depth images are mapped into a
pseudo-colored image. The resulting precision of the depth is
mapped on 10 bits. The same engine also hosts the algorithm for face detection, which is the starting point of the skeleton detection algorithm.
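The exact pseudo-coloring used by the Depth Engine is not specified in the paper; the following sketch shows one plausible mapping from a 10-bit depth map to an RGB image and is an assumption made purely for illustration.

```python
import numpy as np

def depth10_to_pseudocolor(depth10):
    """Map a 10-bit depth map (values 0..1023) to an RGB image using a simple
    red-to-blue ramp (near = red, far = blue); illustrative only."""
    d = np.clip(depth10, 0, 1023).astype(np.float32) / 1023.0
    rgb = np.zeros(depth10.shape + (3,), dtype=np.uint8)
    rgb[..., 0] = ((1.0 - d) * 255).astype(np.uint8)  # red channel: near objects
    rgb[..., 2] = (d * 255).astype(np.uint8)          # blue channel: far objects
    return rgb
```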
The resulting image is then processed by the finger, hand
and full-body detection/recognition/tracking software
algorithms. The algorithms are implemented by the DSP such
that the 3D camera exports only the controls which correspond
to the specific gestures to which the application is responding.
III. EXPERIMENTS & RESULTS
This section will present the results obtained when the 3D
camera and video console control systems were implemented.
The camera was implemented as described and a variety of
experiments were performed on the resulting data.
A. 3D Camera Experiments & Results
When characterizing the sensitivity of the camera pixels,
the idea is to obtain depth values that are as close as possible
to each other for objects of different reflectivity (albedo
effect). Two objects, one white with high reflectivity and
one gray with moderate reflectivity, were positioned at
distances ranging between 1m and 4m from the camera and
the resulting pixel intensity was recorded for each distance.
In each case, the intensity was obtained as a 10-bit value,
where I(i, j) represents the intensity of pixel location (i, j).
The intensity of the middle pixel of an m by n window is
therefore represented as follows:

$I\!\left(\tfrac{m}{2}, \tfrac{n}{2}\right)$  (1)
The average of the pixel intensities within the window is
therefore represented as per Equation 2:
$A_{K} = \dfrac{1}{m\,n} \sum_{i=1}^{m} \sum_{j=1}^{n} I(i, j)$  (2)
In this case, K=2 (one high reflectivity and one moderate
reflectivity) and m=n=20 so that a good portion of the object
is covered.
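As a concrete restatement of Equations (1) and (2), the sketch below computes the average 10-bit intensity over a 20x20 window centred on each object; the image and window centres are synthetic placeholders.

```python
import numpy as np

def window_average(intensity, center, m=20, n=20):
    """Average the 10-bit intensity I(i, j) over an m-by-n window centred at
    `center`, as in Equation (2)."""
    ci, cj = center
    window = intensity[ci - m // 2: ci + m // 2, cj - n // 2: cj + n // 2]
    return float(window.mean())

# Synthetic 10-bit intensity image and two illustrative object centres (K = 2).
img = np.random.randint(0, 1024, size=(480, 640))
a_white = window_average(img, center=(240, 200))  # high-reflectivity object
a_gray = window_average(img, center=(240, 440))   # moderate-reflectivity object
```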


The results are plotted in Fig. 2. A highly nonlinear response was observed, to which a 7th-order curve-fitting algorithm was then applied. By placing the same "good" reflector in front of the camera, the absolute distance for that object could now be determined. Based on the obtained distance value, Fig. 3 shows the amount of error that was observed when the object was kept still for a few seconds at three different distances (0.4 cm at 3 m, 0.75 cm at 4 m, 2.0 cm at 5 m).

Fig. 1. Block diagram of 3D camera depth engine.
A software application was designed to detect objects with different reflectivity within the camera's field of view. Their average pixel center values are calculated using Equation 2. The absolute distance for the detected objects was then obtained by using the data in Fig. 2 and the 7th-order polynomial derived from it. This technique is used to compensate for the albedo effect. The compensated and uncompensated images are shown in Fig. 4.
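The calibration data and exact fitting procedure are given only graphically in Fig. 2; the sketch below therefore uses placeholder calibration points to show how a 7th-order polynomial relating window-averaged intensity to absolute distance could be fitted and applied with NumPy. The direction of the fit (distance as a function of intensity) and the synthetic response curve are assumptions.

```python
import numpy as np

# Placeholder calibration data: window-averaged 10-bit intensity recorded for a
# reference reflector at known distances (illustrative values, not measured).
distances_m = np.linspace(1.0, 4.0, 16)
intensities = 900.0 / distances_m ** 1.8 + 60.0     # synthetic nonlinear response

x = intensities / 1023.0                            # normalize for a stable fit
coeffs = np.polyfit(x, distances_m, deg=7)          # 7th-order polynomial fit
intensity_to_distance = np.poly1d(coeffs)

def estimated_distance(avg_intensity):
    """Convert a window-averaged intensity (Equation 2) into an absolute
    distance using the fitted calibration curve."""
    return float(intensity_to_distance(avg_intensity / 1023.0))

print(estimated_distance(200.0))  # distance estimate for an example reading
```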
B. Video Game Console Experiments & Results
Electronic gaming is a key domain in which gesture-based control can be applied. The gesture control camera presented in this paper can be used with any game, since the camera sends controller commands to which the game console responds. Fig. 5 shows how the USB connection is made
between the 3D camera and the Sony PlayStation 3 video game
console. Once connected, the camera is detected by the console
as a controller and gestures can be used to move around in
menus and play games.
Fig. 6 shows the user controlling the video game WipEout
HD on the PlayStation 3 video game console. Whenever the
camera detects two fists, the signal for the "X" button is
transmitted via USB, causing the anti-gravity craft to
accelerate. The player can collect power-ups and fire them by
showing one thumb. When a player brings up two thumbs, the
brakes are applied, as is shown in Fig. 7 with the racing game
MotorStorm. In the case of MotorStorm, the acceleration
button had to be remapped to the right trigger button. Different
control profiles were therefore required to be loaded onto the
camera depending on the game that was to be played.
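Since the camera appears to the console as an ordinary controller, gesture handling reduces to a per-game table mapping detected gestures to controller buttons. A minimal sketch of such a profile is shown below; the gesture labels follow the examples in the text, while the data structure and any button names not stated in the text are assumptions.

```python
from typing import Optional

# Hypothetical per-game gesture-to-button profiles; the camera's internal
# representation is not published, and unmarked button names are assumed.
PROFILES = {
    "WipEout HD": {
        "two_fists": "X",       # accelerate (from the text)
        "one_thumb": "SQUARE",  # fire power-up (button name assumed)
        "two_thumbs": "L2",     # brake (button name assumed)
    },
    "MotorStorm": {
        "two_fists": "R2",      # acceleration remapped to the right trigger
        "two_thumbs": "L2",     # brake (button name assumed)
    },
}

def gesture_to_button(game: str, gesture: str) -> Optional[str]:
    """Look up which controller button to emit over USB for a detected gesture."""
    return PROFILES.get(game, {}).get(gesture)

print(gesture_to_button("MotorStorm", "two_fists"))  # -> "R2"
```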
Since the camera can operate for close distances as well as
far, Fig. 8 shows WipEout HD being controlled from a more
typical living room distance of 3m, where fingers can still be
used reliably to activate in-game commands. Gaming
experiences can therefore be improved beyond what is possible
with the Kinect camera, as the Kinect is only able to detect the
player's skeleton at such distances.
Ten test users who were asked to try out the system liked
that they needed only to step in front of the camera and hold out their hands to see their craft accelerate. While the thumb-based commands had to be explained more carefully, players quickly got used to applying brakes around corners and using
power-ups to pass the other vehicles. One player mentioned
how he had played a racing game on the Kinect which would
be greatly improved if similar gestures were possible on that
system.
Using a high-speed video recording, the delay between the
moment the gesture was performed to the moment the on-
screen result appeared was observed to be less than 120ms.
Such low latency is especially important for high-speed console
gaming. Using the same method, the latency of the Kinect was
measured to be 200ms.
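The latency figures above follow directly from counting frames in the high-speed recording; a small sketch of that arithmetic, with placeholder frame numbers and recording rate, is shown below.

```python
def latency_ms(gesture_frame, response_frame, recording_fps):
    """Latency implied by two frame indices in a high-speed recording."""
    return (response_frame - gesture_frame) / recording_fps * 1000.0

# Placeholder example: a 240 fps recording in which the on-screen response
# appears 27 frames after the gesture, i.e. about 112.5 ms (< 120 ms).
print(latency_ms(gesture_frame=100, response_frame=127, recording_fps=240.0))
```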
IV. CONCLUSION
There is no doubt that the consumer industry has created a
demand for 3D depth cameras to enable gesture-based control
of electronic devices such as video game consoles. Existing
NIR depth cameras require exotic components or produce
depth images of relatively low resolutions. This paper has
presented a new 3D camera and its reliable method for
acquiring 3D depth data based on a new "space slicing" technique. The real-time acquisition of the depth image, as
well as its interpretation and understanding, have been shown
to allow for reliable and intuitive control of off-the-shelf
PlayStation 3 games. The present solution can produce depth images with a wide dynamic range and minimize the variations in depth due to albedo effects. Future research will continue
to improve the compensation techniques and explore areas
such as Smart TV control using gestures, as well as control
of the new generation of video game consoles.

Fig. 2. 10-bit Camera Output for two Reflectors vs. Absolute Distance


Fig. 3. Obtaining Absolute Distance and Error Observations


Fig. 4. Reflectivity Compensation Algorithm
REFERENCES
[1] I. E. Sutherland, Sketchpad: A man-machine graphical communication
system, in AFIPS Spring Joint Computer Conference 23, 1964, pp.
329-346.
[2] M. Hoffman, P. Varcholik, and J. LaViola, Breaking the Status Quo:
Improving 3D Gesture Recognition with Spatially Convenient Input
Devices, in Virtual Reality Conference. IEEE Computer Society, March
2010, pp. 59-66.
[3] (2013) Introducing Myo. Thalmic Labs. [Accessed: November 2013].
[Online] Available: https://www.thalmic.com/en/myo/
[4] (2013) Ultrasonic Gesture Technology. Elliptic Labs. [Accessed:
November 2013]. [Online] Available:
http://www.ellipticlabs.com/?page_id=2211
[5] (2008) Mgestyk Videos. Mgestyk Technologies Inc. [Accessed: October
2013]. [Online] Available: http://mgestyk.com/videos.html
[6] P. Lavoie, D. Ionescu, and E. Petriu, 3-D Object Model Recovery From
2-D Images Using Structured Light, in Proc. IMTC/96, IEEE Instrum.
Meas. Technol. Conf., Brussels, Belgium, 1996, pp.377-382.
[7] (2010) Xbox.com Kinect. Microsoft Corp. [Accessed: November 2013].
[Online]. Available: http://www.xbox.com/en-US/kinect
[8] K. Lai, J. Konrad, and P. Ishwar. A gesture-driven computer interface
using Kinect. in Proc. of IEEE Southwest Symp. on Image Analysis and
Interpretation 2012, April 2012, pp. 185-188.
[9] C. Dal Mutto, P. Zanuttigh and G. M. Cortelazzo, Time-of-Flight
Cameras and Kinect, in SpringerBriefs in Electrical and Computer
Engineering, Springer, 2012.
[10] M. Hansard, S. Lee, O. Choi, and R. P. Horaud, Time-of-Flight
Cameras: Principles, Methods and Applications, in SpringerBriefs in
Computer Science, Springer, 2013.
[11] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore,
A. Kipman, and A. Blake, Real-Time Human Pose Recognition in Parts
from a Single Depth Image, in Proc. of IEEE Computer Vision and
Pattern Recognition (CVPR), 2011, pp. 1297-1304.
[12] H. Shimotahira, K. Iizuka, S.-C. Chu, C. Wah, F. Costen, and Y.
Yoshikuni, Three-dimensional laser microvision, in Appl. Opt. 40,
2001, pp.1784-1794.
[13] H. Shimotahira, K. Iizuka, F. Taga, and S. Fujii, 3D laser microvision,
in Optical Methods in Biomedical and Environmental Science, Elsevier,
New York, 1994, pp.113-116.
[14] T. Kanamaru, K. Yamada, T. Ichikawa, T. Naemura, K. Aizawa, and T.
Saito, Acquisition of 3D image representation in multimedia ambiance
communication using 3D laser scanner and digital camera" in Three-
Dimensional Image Capture and Applications III, Proc. SPIE 3958,
2000, pp.80-89.
[15] D. A. Green, F. Blais, J.-A. Beraldin, and L. Cournoyer, MDSP: a
modular DSP architecture for a real-time 3D laser range sensor, in
Three-Dimensional Image Capture and Applications V, Proc. SPIE
4661, 2002, pp.9-19.
[16] V. H. Chan and M. Samaan, Spherical/cylindrical laser scanner for
geometric reverse engineering, in Three-Dimensional Image Capture
and Applications VI, Proc. SPIE 5302, 2004, pp.33-40.
[17] A. Saxena, S. H. Chung, and A. Y. Ng, 3-D Depth Reconstruction from
a Single Still Image, in International Journal of Computer Vision,
ACM, 2008, vol. 76, pp.53-69.
[18] R. A. Jarvis, A perspective on range finding techniques for computer
vision, in IEEE Trans. Pattern Anal. Machine Intell., vol. 5, March
1983, pp. 122-139.

Fig. 5. USB connection of the 3D camera to the PlayStation 3 console.

Fig. 6. Two fists to accelerate in the console racing game WipEout HD.

Fig. 7. Two thumbs up to brake in the console racing game MotorStorm.

Fig. 8. Finger and full-body control from 3m (9ft).
