Real-Time Gesture Recognition System for Game Playing

From Psych 221 Image Systems Engineering

Abhilash Sunder Raj and Ameya Joshi

Introduction

Motivation

Virtual Reality (VR) is the field concerned with creating virtual environments that the user can interact with. In recent years, VR has been gaining popularity in a wide variety of applications, one of the most popular being video games, where it offers the user a fully immersive gaming experience. We were inspired by the idea of building a real-time system that can be used to play any existing video game in such a fully immersive gaming environment. We believe that this project is a first step towards that goal.

Fig. 1 - Popular Car Racing Game - Need For Speed

Problem Definition

Most computer games today have keyboard controls. In order to enjoy a fully immersive gaming experience, we must first eliminate the need for a keyboard. A natural way to do so would be to use hand gestures to play the game instead. The main aim of this project is to build a real-time system which can achieve this. The goals of such a system are three-fold:

  • Recognize the user's hand gestures.
  • Map the gestures to the keyboard controls of the game.
  • Override the keyboard controls so that the user can play the game without the help of a keyboard.

In addition, the following constraints also have to be met:

  • The system must not have very high computational complexity. It should run on any ordinary laptop/desktop.
  • The user should not have to wear special sensors/markers/gloves to facilitate gesture recognition.

For this project, we have chosen to implement a gesture recognition system for playing car racing games. Two major aspects common to all car racing games are steering (turning the car left/right) and acceleration. Currently, we have included these functions in our gesture recognition system.


Hardware

This project was implemented on a MacBook Pro laptop using the Intel® RealSense™ Camera (F200).

Fig.2 - Intel® RealSense™ Camera (F200)

The Intel® RealSense™ F200 is a front-facing depth camera. It uses structured-light techniques to create a depth image of the scene. It produces three image streams:

  • Depth stream: 640x480 resolution at 60fps
  • IR stream: 640x480 resolution at 60fps
  • RGB stream: 1080p at 30fps


Implementation

The crux of this project is designing an efficient gesture recognition system. The user's gestures are identified solely from the image stream captured by the camera.

In this project, we implement the steering and acceleration functions used for playing car racing games. We have tried to make the system as intuitive as possible. In order to play the game, the user visualizes a steering wheel in his/her hands and turns the car left or right by turning this imaginary wheel. Whenever the user wants to accelerate, he/she raises a thumb. The system should detect these gestures and translate them into equivalent commands in the game.

The user's gestures can be detected using image processing techniques.

  • Steering: The direction the user wants to turn can be found by measuring the angle subtended by the hands with the horizontal. When the user turns left, his/her hands subtend an acute angle with the horizontal. On the other hand, when turning right, an obtuse angle is subtended with the horizontal.
  • Acceleration: The user's intent to accelerate can be found by detecting the user's thumbs.

One of the major challenges in this project is the time constraint. Sophisticated image processing techniques are computationally expensive and will drastically drop the frame rate if implemented on an ordinary laptop. This will result in a very noticeable lag between the user's actions and the response on the screen. Using the depth image feed from the Intel® RealSense™ Camera, we have managed to implement a fast and efficient gesture recognition system.


Pipeline for Gesture Recognition

Fig.3 - Pipeline for the Gesture Recognition System

The gesture recognition pipeline was implemented using the OpenCV package in C++. Each step of the pipeline is briefly explained below:

Input

The input to the system is the depth stream from the camera. This is a grayscale image.

Fig.4 - Depth stream from the Intel® RealSense™ Camera


Depth-Based Hand Segmentation

In the depth image, regions at different depths have different pixel intensities. In order to segment out the hands (whose pixel intensities are different from the rest of the scene), we filter out all pixels whose value is above an upper threshold or below a lower threshold. This is done using the threshold function available in OpenCV. These thresholds determine the range of distances for which the hand segmentation works. Hence, they must be calibrated in order to maximize the range.

Fig.5 - Thresholding to segment out the hands

After removing the unwanted portions, the image is inverted so that the background is white and the hands are black. This will be useful in the later steps of the pipeline.

Fig.6 - Inversion of image

This thresholding is a computationally inexpensive process and can be done very fast. This is the main advantage of using a depth camera. If we were using an ordinary camera, we would have to implement sophisticated and computationally expensive image processing algorithms to do the same task.


Detection of steering direction

This section describes the algorithm implemented to determine the steering direction (that is, whether the user wants to turn left/right or drive straight). It corresponds to the right half of the flow diagram. The output of the hand segmentation stage acts as the input to this stage.

It involves the following steps:

Blurring/Smoothing

Blurring/smoothing of an image is an operation in which we apply a filter to the image. The most common filters are linear: an output pixel’s value is a weighted sum of input pixel values. In our project, we apply the simplest method of smoothing, namely homogeneous smoothing, using the blur function in OpenCV. This method simply takes the average of a pixel's neighborhood (kernel) and assigns that value to the pixel. The size of this kernel (neighborhood) has to be optimized.

Fig.7 - Image after blurring

The image is blurred in order to facilitate the next step, namely, blob detection.
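As an illustration of what the blur function computes, here is a plain box filter over a grayscale buffer (a sketch only; OpenCV's implementation is far more optimized):

```cpp
#include <cstdint>
#include <vector>

// Homogeneous (box) blur: each output pixel is the plain average of the
// k x k neighbourhood around it (k odd). Neighbours falling outside the
// image are simply skipped, so borders average over fewer pixels.
std::vector<uint8_t> boxBlur(const std::vector<uint8_t>& img,
                             int width, int height, int k) {
    int r = k / 2;
    std::vector<uint8_t> out(img.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int sum = 0, count = 0;
            for (int dy = -r; dy <= r; ++dy) {
                for (int dx = -r; dx <= r; ++dx) {
                    int yy = y + dy, xx = x + dx;
                    if (yy < 0 || yy >= height || xx < 0 || xx >= width) continue;
                    sum += img[yy * width + xx];
                    ++count;
                }
            }
            out[y * width + x] = static_cast<uint8_t>(sum / count);
        }
    }
    return out;
}
```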

Blob detection

A blob is a group of connected pixels in an image that share some common property (e.g., grayscale value). In our implementation, the blurred images of the hands form the two blobs. We have used the built-in SimpleBlobDetector available in OpenCV, which can be configured to filter for and detect the kind of blobs that we want.

We would like to detect the positions of the two hands. In order to do so, we filter blobs by area. This means that we set a minimum and a maximum threshold on the area of the blobs. Only those blobs which lie within these thresholds are detected. These thresholds have to be optimized for correct detection. The blurring step ensures that each hand is detected as one big blob rather than multiple smaller blobs. Therefore, blurring is an integral step in the process.
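SimpleBlobDetector does considerably more than this, but its core idea here (group connected dark pixels, keep only groups whose area lies between the two thresholds, and return their centroids) can be sketched with a simple flood-fill labelling, which is only a stand-in for the real detector:

```cpp
#include <queue>
#include <utility>
#include <vector>

struct Blob { double cx, cy; int area; };

// Toy stand-in for area-filtered blob detection: label 4-connected dark (0)
// regions in a binary image, keep those with minArea <= area <= maxArea,
// and return their centroids.
std::vector<Blob> detectBlobs(const std::vector<int>& img,
                              int width, int height,
                              int minArea, int maxArea) {
    std::vector<char> seen(img.size(), 0);
    std::vector<Blob> blobs;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int idx = y * width + x;
            if (img[idx] != 0 || seen[idx]) continue;
            // BFS flood fill over this dark region, accumulating the centroid.
            long sx = 0, sy = 0;
            int area = 0;
            std::queue<std::pair<int,int>> q;
            q.push(std::make_pair(x, y));
            seen[idx] = 1;
            while (!q.empty()) {
                int px = q.front().first, py = q.front().second;
                q.pop();
                sx += px; sy += py; ++area;
                const int dx[] = {1, -1, 0, 0}, dy[] = {0, 0, 1, -1};
                for (int d = 0; d < 4; ++d) {
                    int nx = px + dx[d], ny = py + dy[d];
                    if (nx < 0 || nx >= width || ny < 0 || ny >= height) continue;
                    int nidx = ny * width + nx;
                    if (img[nidx] == 0 && !seen[nidx]) {
                        seen[nidx] = 1;
                        q.push(std::make_pair(nx, ny));
                    }
                }
            }
            // Area filter: small noise specks and oversized regions are dropped.
            if (area >= minArea && area <= maxArea)
                blobs.push_back({double(sx) / area, double(sy) / area, area});
        }
    }
    return blobs;
}
```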

Fig. 8 - Blob detector : To find positions of the hands as well as the angle subtended with the horizontal

Once the blob detector finds the positions of the hands, basic trigonometry gives us the angle between the line joining the centres of the blobs and the horizontal. This is the angle subtended by the hands with the horizontal.

Once we obtain the angle, the steering direction is determined using the following decision rule:

  • If Angle > 10 degrees, then the player wants to turn left.
  • If Angle < -10 degrees, then the player wants to turn right.
  • Otherwise, the player wants to drive straight.
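The angle computation and the decision rule together can be sketched as follows. This is a simplified illustration: the two centroids would come from the blob detector, and since the image y-axis points downward, dy is negated to recover the usual counter-clockwise angle convention:

```cpp
#include <cmath>
#include <string>

// Steering decision from the two hand centroids (left hand at (lx, ly),
// right hand at (rx, ry), in image coordinates where y grows downward).
std::string steeringDirection(double lx, double ly, double rx, double ry) {
    const double kPi = 3.14159265358979323846;
    // Negate dy so that raising the right hand gives a positive angle.
    double angle = std::atan2(-(ry - ly), rx - lx) * 180.0 / kPi;
    if (angle > 10.0)  return "left";
    if (angle < -10.0) return "right";
    return "straight";
}
```

The ±10 degree dead zone keeps small, unintentional hand tilts from producing spurious turns.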


Detection of acceleration

This section describes the algorithm implemented to determine the acceleration (that is, whether the user wants to accelerate or not). It corresponds to the left half of the flow diagram. The output of the hand segmentation stage acts as the input to this stage.

It involves the following steps:

Erosion

The segmented hand is not a smooth, connected entity. Instead, it contains small patches of white noise, as can be seen in the figure below.

Fig. 9 - Image before erosion

To facilitate the next step, contour detection, we eliminate these patches of white noise using erosion (the erode function in OpenCV). The basic idea of erosion is much like soil erosion: it erodes away the boundaries of the foreground object. In erosion, a kernel slides through the image (as in 2D convolution). A pixel in the original binary image (either 1 or 0) is kept as 1 only if all the pixels under the kernel are 1; otherwise it is eroded (set to zero). The size of this kernel has to be optimized.

Fig. 10 - Image after erosion

As seen above, erosion fills in the white spaces without blurring the boundaries of the hands. This is required for the next step.
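The erosion step can be sketched on a raw binary buffer as follows (an illustration of the operation, not the project code; in our inverted images the background is white, so eroding the white regions removes the small white specks inside the black hands):

```cpp
#include <cstdint>
#include <vector>

// Erosion with a k x k kernel (k odd) on a binary image where 0 = hand
// and 255 = background: an output pixel stays white (255) only if every
// in-bounds pixel under the kernel is white; otherwise it becomes black.
// Small white noise patches inside the black hand regions vanish this way.
std::vector<uint8_t> erode(const std::vector<uint8_t>& img,
                           int width, int height, int k) {
    int r = k / 2;
    std::vector<uint8_t> out(img.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            bool allWhite = true;
            for (int dy = -r; dy <= r && allWhite; ++dy)
                for (int dx = -r; dx <= r && allWhite; ++dx) {
                    int yy = y + dy, xx = x + dx;
                    if (yy < 0 || yy >= height || xx < 0 || xx >= width) continue;
                    if (img[yy * width + xx] != 255) allWhite = false;
                }
            out[y * width + x] = allWhite ? 255 : 0;
        }
    }
    return out;
}
```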

Finding Contours

The next step in the pipeline is finding the contours of the hands. This is done by using the findContours function in OpenCV. This function returns a set of points that correspond to the contours of the hands.

Then, we use the fitEllipse function in OpenCV to fit ellipses to these contours.

Fig. 11 - Finding the contours of the hands

Once the ellipses are fitted, we find the perimeters of the elliptical contours using the arcLength function. If the user has raised his/her thumb, the perimeter of the ellipse is larger than the perimeter in the case of a closed fist. Therefore, if the perimeter exceeds a particular threshold, the user has raised his/her thumb and wants to accelerate the car; otherwise, the user does not want to accelerate. This perimeter threshold has to be calibrated.

Fig.12 - Contour of a closed fist : The perimeter is smaller than the threshold
Fig.13 - Contour of a fist with the thumb raised : The perimeter is larger than the threshold
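OpenCV's arcLength (with the closed flag set) is simply the summed length of the segments between consecutive contour points, including the segment that closes the contour; the acceleration decision then reduces to one comparison. A sketch, where the threshold value used in the test is a placeholder rather than the tuned one:

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Closed-contour perimeter: sum of distances between consecutive points,
// with the last point wrapping back to the first.
double contourPerimeter(const std::vector<std::pair<double,double>>& pts) {
    double per = 0.0;
    size_t n = pts.size();
    for (size_t i = 0; i < n; ++i) {
        const auto& a = pts[i];
        const auto& b = pts[(i + 1) % n];  // wraps around to close the contour
        per += std::hypot(b.first - a.first, b.second - a.second);
    }
    return per;
}

// Thumb-raised test: accelerate when the perimeter exceeds the calibrated
// threshold (a raised thumb enlarges the fitted ellipse).
bool wantsAcceleration(const std::vector<std::pair<double,double>>& contour,
                       double threshold) {
    return contourPerimeter(contour) > threshold;
}
```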

In a real game, the user would want to accelerate as well as steer the car simultaneously. In order to ensure this, the steering detection and acceleration detection are done in parallel.


Emulation of Keyboard Functions

The final step of the gesture recognition system is emulation of keyboard functions. We map the steering and acceleration functions to the keyboard controls (W-A-S-D). We have used AppleScript to override the keyboard and press these keys in the game whenever the user desires.

One major challenge of this step was that calling the AppleScript from the same thread as the gesture recognition was too time consuming: doing so drastically dropped the frame rate to around 15 fps. We got around this problem by using multi-threading to reduce latency, executing the gesture detection and keyboard emulation in parallel on different threads. Doing so restored the frame rate.
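The two-thread layout can be sketched with a worker thread draining a queue of key commands, so the camera loop never blocks on the slow key press. This is a hypothetical illustration of the design, not the project code; the real worker would invoke AppleScript where the comment indicates:

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class KeyEmulator {
public:
    KeyEmulator() : worker_(&KeyEmulator::run, this) {}
    ~KeyEmulator() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();  // worker drains any queued keys before exiting
    }
    // Called from the gesture-recognition thread; returns immediately.
    void press(const std::string& key) {
        { std::lock_guard<std::mutex> lk(m_); pending_.push(key); }
        cv_.notify_one();
    }
    int pressed() const { return count_.load(); }  // keys emitted so far

private:
    void run() {
        std::unique_lock<std::mutex> lk(m_);
        while (true) {
            cv_.wait(lk, [this] { return done_ || !pending_.empty(); });
            while (!pending_.empty()) {
                std::string key = pending_.front();
                pending_.pop();
                lk.unlock();
                // Real system: invoke AppleScript here to press `key`.
                ++count_;
                lk.lock();
            }
            if (done_) break;
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> pending_;
    bool done_ = false;
    std::atomic<int> count_{0};
    std::thread worker_;
};
```

With this design the vision thread only pays the cost of a queue push per frame, which is why the frame rate recovers from 15 fps back to 55 fps.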


Results

The above procedure was used to build an efficient, real-time gesture recognition system. Currently, the range of the system is 1.5 feet: it recognizes the user's hand gestures when the hands are between 0.5 feet and 2 feet from the depth camera.

Variation of Frame Rates in the Pipeline

The change in frame rates of the camera with each step of the process is shown below:

Fig.14 - Change in frame rate of the camera with the inclusion of each step in the pipeline

As can be seen:

  • The original frame rate of the camera is 60 fps.
  • After incorporating the gesture recognition system, the minimum frame rate turned out to be 55 fps.
  • When keyboard emulation is included in the same thread, the frame rate drops to 15 fps.
  • When gesture recognition and keyboard emulation are done in parallel, a frame rate of 55 fps is restored.

Variation of Frame Rates with Movement

The minimum frame rate of the system is 55 fps. The actual frame rate depends on the movements of the user's hands. Currently, our system can detect six states:

  • Go Straight.
  • Turn Right.
  • Turn Left.
  • Go Straight and Accelerate.
  • Turn Right and Accelerate.
  • Turn Left and Accelerate.

When the user remains in one of the above states, we say that he/she is in steady state. When the user is switching from one state to another, we say that he/she is in transition.

Fig.15 - Change in frame rate of the camera with movement

The above graph shows that:

  • When the user is in steady state, that is, when the user is currently in one of the six states, the frame rate is 58 fps.
  • When the user is in transition, that is, when the user switches from one state to another, the frame rate drops to 55 fps.

This means that when the user's hands move a lot, the frame rate drops to a minimum of 55 fps. However, this is still a very high value, so there is no noticeable lag between the user's movements and the computer's response.


Conclusion

In this project, we designed and implemented a real-time system that recognizes the user's hand gestures and translates them into equivalent keyboard commands. Both the gesture-recognition step and the keyboard-emulation step have been optimized to work as efficiently as possible. The frame rate of the entire system is 55 fps, which means that there is no noticeable lag between the user's movements and the response on the computer. Currently, the system includes gesture recognition for steering and acceleration and can be used to play simple car racing games. It has been tested on a MacBook Pro laptop. The only additional hardware used is the Intel® RealSense™ F200 depth camera.

The following links are YouTube videos which demonstrate our system in action:

Future work

This project gives a framework for implementing efficient, real-time image processing algorithms. The following avenues can be explored in the future:

  • A calibration system can be set up to automatically calibrate the system to the size of each user's hands, the user's range preferences and so on.
  • The system can be extended to recognize a larger set of gestures. Once that is done, it can be used to play any computer game or even implement any general application (like zooming in/out) without using a keyboard or a mouse.
  • Finally, the system can be integrated with a head mounted display to give the user a fully immersive experience.


References

[1] http://www.intel.com/content/www/us/en/architecture-and-technology/realsense-overview.html
[2] https://software.intel.com/en-us/RealSense/F200Camera
[3] http://opencv.org/
[4] http://docs.opencv.org/2.4/modules/imgproc/doc/miscellaneous_transformations.html?highlight=threshold#threshold
[5] http://docs.opencv.org/2.4/modules/imgproc/doc/filtering.html?highlight=blur#blur
[6] http://opencv.itseez.com/2.4/modules/features2d/doc/common_interfaces_of_feature_detectors.html?highlight=blob#simpleblobdetector
[7] http://docs.opencv.org/2.4/modules/imgproc/doc/filtering.html?highlight=erode#erode
[8] http://docs.opencv.org/2.4/modules/imgproc/doc/structural_analysis_and_shape_descriptors.html#findcontours
[9] http://docs.opencv.org/2.4/modules/imgproc/doc/structural_analysis_and_shape_descriptors.html?highlight=fitellipse#fitellipse
[10] http://docs.opencv.org/2.4/modules/imgproc/doc/structural_analysis_and_shape_descriptors.html#arclength
[11] https://en.wikipedia.org/wiki/Virtual_reality


Acknowledgements

We are extremely grateful to Dimitri Diakopoulos for mentoring us in this project. We would also like to thank Prof. Wandell, Prof. Farrell and Prof. Bhowmik for their guidance and support through the course of the project. We would like to extend special thanks to Prof. Wandell and Prof. Bhowmik for helping us get access to the internal Intel software development library.


Appendix I

The C++ code used for the project can be found here: Code

Please note that many of the libraries included in the code are internal Intel software development libraries, which cannot be shared publicly. You will need access to those libraries in order to run the code.

Appendix II

Most of the work was done jointly. The distribution of work is as follows:

  • Abhilash : Depth-based Hand Segmentation, Detection of Steering Direction, Emulation of Keyboard Functions, Testing the System, Slide Presentation and Report.
  • Ameya : Depth-based Hand Segmentation, Detection of Steering Direction, Detection of Acceleration, Emulation of Keyboard Functions, Multi-Threading and Slide Presentation.