Real-Time Gesture Recognition System for Game Playing

From Psych 221 Image Systems Engineering
Revision as of 10:59, 11 December 2015 by Projects221

Abhilash Sunder Raj and Ameya Joshi

Introduction

Motivation

Virtual Reality (VR) deals with creating a virtual environment that the user can interact with. In recent years, VR has been gaining popularity in a wide variety of applications, one of the most popular being video games. With Virtual Reality, the user can enjoy a fully immersive gaming experience. We were inspired by the idea of building a real-time system that can be used to play any existing video game in such a fully immersive gaming environment, and we believe that this project is a first step towards that goal.

Popular Car Racing Game - Need For Speed

Problem Definition

Most computer games today have keyboard controls. In order to enjoy a fully immersive gaming experience, we must first eliminate the need for a keyboard. A natural way to do so would be to use hand gestures to play the game instead. The main aim of this project is to build a real-time system which can achieve this. The goals of such a system are three-fold:

  • Recognize the user's hand gestures.
  • Map the gestures to the keyboard controls of the game.
  • Override the keyboard controls so that the user can play the game without the help of a keyboard.

In addition, the following constraints also have to be met:

  • The system must be computationally inexpensive enough to run in real time on any ordinary laptop/desktop.
  • The user should not have to wear special sensors/markers/gloves to facilitate gesture recognition.

For this project, we have chosen to implement a gesture recognition system for playing car racing games. Two major aspects common to all car racing games are steering (turning the car left/right) and acceleration. Currently, we have included these functions in our gesture recognition system.


Hardware

This project was implemented on a Macbook Pro laptop using the Intel® RealSense™ Camera (F200).

Intel® RealSense™ Camera (F200)

The Intel® RealSense™ F200 is a front-facing depth camera. It uses structured-light techniques to create a depth image of the scene and produces three image streams:

  • Depth stream: 640x480 resolution at 60fps
  • IR stream: 640x480 resolution at 60fps
  • RGB stream: 1080p at 30fps


Implementation

The crux of this project is designing an efficient gesture recognition system. The user's gestures are identified solely from the image stream captured by the camera.

In this project, we implement the steering and acceleration functions used in car racing games. We have tried to make the system as intuitive as possible. To play the game, the user visualizes a steering wheel in his/her hands and turns the car left or right by turning this imaginary wheel. Whenever the user wants to accelerate, he/she raises a thumb. The system detects these gestures and translates them into the equivalent commands in the game.

The user's gestures can be detected using image processing techniques.

  • Steering: The direction the user wants to turn can be found by measuring the angle subtended by the hands with the horizontal. When the user turns left, his/her hands subtend an acute angle with the horizontal. On the other hand, when turning right, an obtuse angle is subtended with the horizontal.
  • Acceleration: The user's intent to accelerate can be found by detecting the user's thumbs.
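The steering rule above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the two hand centroids have already been extracted (in the real pipeline they would come from blob detection on the segmented image), uses image coordinates with the y-axis pointing down, and the `deadZoneDeg` tolerance is a hypothetical parameter we introduce to represent "driving straight".

```cpp
#include <cmath>
#include <string>

// A hand centroid in image coordinates (origin at top-left, y grows
// downward, as is conventional for image buffers).
struct HandPoint { float x, y; };

// Classify the steering direction from the two hand centroids.
// deadZoneDeg is a hypothetical tolerance around the horizontal within
// which the imaginary wheel is treated as not turned.
std::string steeringDirection(HandPoint left, HandPoint right,
                              float deadZoneDeg = 10.0f) {
    // Angle subtended by the line joining the hands with the horizontal.
    // With y pointing down, a positive dy means the right hand is lower
    // than the left hand, i.e. the wheel is turned to the right.
    float dy = right.y - left.y;
    float dx = right.x - left.x;
    float angleDeg = std::atan2(dy, dx) * 180.0f / 3.14159265f;
    if (angleDeg > deadZoneDeg)  return "right";
    if (angleDeg < -deadZoneDeg) return "left";
    return "straight";
}
```

The dead zone prevents small, unintentional hand tilts from being read as steering commands; its width would be tuned alongside the rest of the system.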

One of the major challenges in this project is the time constraint. Sophisticated image processing techniques are computationally expensive and will drastically drop the frame rate if implemented on an ordinary laptop. This will result in a very noticeable lag between the user's actions and the response on the screen. Using the depth image feed from the Intel® RealSense™ Camera, we have managed to implement a fast and efficient gesture recognition system.

Pipeline for Gesture Recognition

Pipeline for the Gesture Recognition System

The gesture recognition pipeline was implemented using the OpenCV package in C++. Each step of the pipeline is briefly explained below:

Input

The input to the system is the depth stream from the camera, which is a grayscale image.

Depth stream from the Intel® RealSense™ Camera

Depth-Based Hand Segmentation

In the depth image, regions at different depths have different pixel intensities. In order to segment out the hands (whose pixel intensities are different from the rest of the scene), we filter out all pixels whose value is above an upper threshold or below a lower threshold. These thresholds determine the range of distances for which the hand segmentation works. Hence, they must be calibrated in order to maximize the range.

Thresholding to segment out the hands

After removing the unwanted portions, the image is inverted so that the background is white and the hands are black. This will be useful in the later steps of the pipeline.

Inversion of image

This thresholding is computationally inexpensive and can be done very fast, which is the main advantage of using a depth camera. With an ordinary camera, we would have to implement sophisticated and computationally expensive image processing algorithms to perform the same task.
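The two steps above (band thresholding followed by inversion) can be sketched on a plain 8-bit grayscale buffer. This is an illustrative sketch rather than the project's OpenCV code, and the threshold values in the usage below are made up; in practice they are calibrated to the range of distances at which the user holds his/her hands.

```cpp
#include <cstdint>
#include <vector>

// Segment the hands from an 8-bit grayscale depth frame by keeping only
// pixels whose value lies within [lower, upper], then invert the result
// so that the hands come out black (0) on a white (255) background.
std::vector<uint8_t> segmentHands(const std::vector<uint8_t>& depth,
                                  uint8_t lower, uint8_t upper) {
    std::vector<uint8_t> out(depth.size());
    for (std::size_t i = 0; i < depth.size(); ++i) {
        bool isHand = depth[i] >= lower && depth[i] <= upper;
        // Inverted output: hand pixels black, everything else white.
        out[i] = isHand ? 0 : 255;
    }
    return out;
}
```

Because this is a single pass of per-pixel comparisons, it runs comfortably within a 60 fps frame budget even on modest hardware.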

Detection of steering direction

This section describes the algorithm implemented to determine the steering direction (that is, whether the user wants to turn left/right or drive straight). It corresponds to the right half of the flow diagram. The output of the hand segmentation stage acts as the input to this stage.

It involves the following steps:

Blurring/Smoothing the image

Blurring/smoothing an image is an operation in which we apply a filter to the image. The most common filters are linear, meaning an output pixel's value is determined as a weighted sum of input pixel values. In our project, we apply the simplest method of smoothing, namely homogeneous smoothing, which simply replaces each pixel with the average of its neighborhood (kernel). The size of this kernel has to be optimized.

Image after blurring

The image is blurred in order to facilitate the next step, namely, blob detection.
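Homogeneous smoothing can be sketched as below. This is an illustrative, dependency-free version operating on a flat byte buffer (the project itself would use OpenCV's normalized box filter for this step); the border handling by clamping is our assumption.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Homogeneous (box) smoothing: each output pixel is the plain average of
// the k x k neighborhood around the corresponding input pixel. k must be
// odd; neighborhoods that fall outside the image reuse the nearest border
// pixel (clamping).
std::vector<uint8_t> boxBlur(const std::vector<uint8_t>& img,
                             int width, int height, int k) {
    int r = k / 2;
    std::vector<uint8_t> out(img.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int sum = 0;
            for (int dy = -r; dy <= r; ++dy) {
                for (int dx = -r; dx <= r; ++dx) {
                    // Clamp coordinates so border pixels reuse edge values.
                    int yy = std::min(std::max(y + dy, 0), height - 1);
                    int xx = std::min(std::max(x + dx, 0), width - 1);
                    sum += img[yy * width + xx];
                }
            }
            out[y * width + x] = static_cast<uint8_t>(sum / (k * k));
        }
    }
    return out;
}
```

Averaging over the kernel suppresses the small isolated specks left over from thresholding, so that the subsequent blob detection finds one large blob per hand rather than many fragments.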

Results

Conclusions

Appendix