Automated Attendance System Using Deep Learning

From Psych 221 Image Systems Engineering

Introduction

Edge devices such as drones, IoT devices, and cameras are becoming ubiquitous and play an important role in our daily lives. For example, these devices are used more and more frequently for sensing tasks such as target tracking or object detection. These tasks require low latency and tend to be computationally expensive and power hungry. Due to the real-time nature of the tasks and the limited power and compute capabilities of edge devices, the devices offload the large amounts of collected raw sensory data to the cloud or to a centralized data center, where it is processed and analyzed.

In this project we are interested in building an automated attendance prototype system. The system is constructed from two main hardware components, the camera (edge device) and the cloud. The camera takes photos of all the students entering the class and sends them to the cloud. In the cloud, these images are processed using state-of-the-art Convolutional Neural Networks (CNN). This design is optimal when the data transfer rate is high, and the raw data reaches the cloud in a timely manner. However, in real scenarios, the data is sent over bandwidth-constrained and fluctuating networks. Therefore, we are interested in analyzing the relationship between image compression (such as JPEG) and the accuracy of inference models.

Background

Lossy compression, sometimes called perceptually lossless compression, refers to algorithms that take advantage of the limitations of the human eye. Compression and decompression algorithms such as JPEG leverage the fact that slight modifications and loss of information often do not affect the perceived quality of the image. The degree of compression can be adjusted, allowing a selectable tradeoff between image quality and storage size.

The JPEG encoding algorithm consists of five successive stages, described below:

RGB color space to YCBCR color space conversion

The Y is the luma component and CB and CR are the blue-difference and red-difference chroma components respectively.
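The conversion can be illustrated with a short sketch. This is not the project's code; it applies the standard full-range YCbCr formulas used by JFIF to a single pixel:

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one RGB pixel (values 0-255) to full-range YCbCr
    using the JFIF conversion coefficients."""
    y  =       0.299    * r + 0.587    * g + 0.114    * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5      * b
    cr = 128 + 0.5      * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr
```

For a neutral gray pixel (equal R, G, and B), both chroma components come out at the midpoint value 128, which is why the Cb and Cr channels of a grayscale image carry no information.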

Preprocessing for the Discrete Cosine Transformation (DCT)

The spatial resolution of the data is reduced to compress images more efficiently. Humans can perceive smaller differences in brightness than hue and color, so the CB and CR components can be downsampled.
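A minimal sketch of 4:2:0 chroma subsampling, assuming the channel dimensions are even (real encoders pad odd-sized images); each 2x2 block of chroma samples is replaced by its average:

```python
def subsample_420(channel):
    """Downsample a chroma channel by 2 in both directions (4:2:0),
    averaging each 2x2 block. `channel` is a list of equal-length rows."""
    h, w = len(channel), len(channel[0])
    out = []
    for i in range(0, h, 2):
        row = []
        for j in range(0, w, 2):
            block = [channel[i][j], channel[i][j + 1],
                     channel[i + 1][j], channel[i + 1][j + 1]]
            row.append(sum(block) / 4.0)  # one sample per 2x2 block
        out.append(row)
    return out
```

This alone cuts each chroma channel to a quarter of its original sample count while leaving the luma channel, where the eye is most sensitive, untouched.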

DCT

The DCT is used to convert the image containing spatial information into numeric data of its frequency or spectral information. The image now exists in a quantitative form and can be manipulated for compression.
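As an illustration (not the project's code), the type-II 2-D DCT that JPEG applies to each 8x8 block can be written directly from its definition; the quadruple loop is for clarity, not speed:

```python
import math

def dct_2d(block):
    """Naive type-II 2-D DCT of an 8x8 block, as used in JPEG."""
    n = 8
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            cu = 1 / math.sqrt(2) if u == 0 else 1.0
            cv = 1 / math.sqrt(2) if v == 0 else 1.0
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * cu * cv * s
    return out
```

A uniform block produces a single nonzero DC coefficient and zeros everywhere else, which is what makes smooth image regions so compressible after this stage.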

Coefficient Quantization

The coefficient quantization step reduces the number of bits needed to store the image by reducing the precision of the coefficients: each component in the frequency domain is divided by a constant and then rounded to the nearest integer. Values near 0, especially high-frequency components, are converted to 0, reducing storage space. The divisors are derived from the user-defined quality factor, specified on a scale from 0 to 100, where 100 gives the best image quality with the least quantization. This step is what makes the algorithm lossy; it is the most lossy operation in the process.
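A sketch of how the quality factor drives quantization, using the first row of the standard JPEG luminance table (ITU-T T.81, Annex K) and the scaling rule popularized by the IJG libjpeg implementation:

```python
# First row of the Annex K luminance quantization table.
BASE_ROW = [16, 11, 10, 16, 24, 40, 51, 61]

def scale_quant_table(base, quality):
    """Scale base quantization values by a quality factor (IJG rule)."""
    quality = max(1, min(100, quality))
    scale = 5000 // quality if quality < 50 else 200 - 2 * quality
    return [max(1, min(255, (v * scale + 50) // 100)) for v in base]

def quantize(coeffs, qtable):
    """Divide DCT coefficients by the table and round to integers."""
    return [round(c / q) for c, q in zip(coeffs, qtable)]
```

At quality 50 the base table is used as-is; at quality 10 every divisor grows fivefold, so small (mostly high-frequency) coefficients collapse to 0, which is exactly the information loss examined in this project.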

Lossless Encoding

Finally, a lossless compression procedure (run-length encoding of the quantized coefficients followed by Huffman coding) is used to significantly reduce the size of the image without losing any further detail.
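The core idea of the run-length stage can be sketched in a few lines. This simplified version omits the zigzag coefficient ordering and the (run, size) Huffman symbols of a real JPEG encoder; it just collapses the zero runs that quantization produces:

```python
def run_length(values):
    """Collapse runs of zeros into (zero_run, value) pairs, the idea
    behind JPEG's entropy-coding stage (before Huffman coding)."""
    out, run = [], 0
    for v in values:
        if v == 0:
            run += 1          # count zeros preceding the next nonzero value
        else:
            out.append((run, v))
            run = 0
    if run:
        out.append((0, 0))    # end-of-block marker for trailing zeros
    return out
```

Because quantization leaves long runs of zeros in the high-frequency positions, this stage shrinks the data dramatically even though it is itself perfectly reversible.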

Methods

Data set and Inference model

For this project we used the Labeled Faces in the Wild (LFW) data set, which contains more than 13,000 images of 5,749 different people [1]. 1,680 people have two images or more, whereas the other 4,069 people have only one image. Training an accurate CNN requires multiple images of the same person, so we chose the 10 people in the LFW data set with the most images, over 50 images each. To create more training images, we used augmentation techniques such as shifting and rotation.
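Shifting, one of the augmentations mentioned above, can be sketched without any imaging library (the project itself used OpenCV, and rotation is typically done with a warp transform, omitted here):

```python
def shift(img, dx, dy, fill=0):
    """Shift a 2-D image (list of rows) right by dx and down by dy,
    filling the vacated pixels with `fill`. A simple augmentation that
    yields a new, slightly different training image."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = img[y][x]
    return out
```

Each shifted copy shows the same face at a slightly different position, which helps the network become invariant to where the face lands in the frame.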

We used a pre-trained Inception V3 CNN model. Applying transfer learning, we replaced the last few layers to adapt the network to our recognition task and trained them on the data set described above. Our code is based on this project, with slight modifications to achieve higher accuracy.

Image manipulation

To modify the images we used the OpenCV library in Python. Using this library, we took each original image and created JPEG versions compressed at different quality factors.

Results

We curated a new collection of 50 JPEG images downloaded from the internet that the algorithm predicts correctly with 100% accuracy. These images are already compressed and have an average size of 52.48 KB, which we used as the baseline for size and accuracy comparisons. We then re-compressed the JPEG images in this prediction data set with quality factors ranging from 10 to 80, in increments of 10. An example of an original image is shown below:

Original image (29KB).

After resizing and compressing the image with a quality factor of 10:

Compressed image with a quality factor of 10 and a size reduction of 86.2% (4KB).

We can see the effects that compression has on each individual pixel of the image. Below is a comparison between a zoomed-in 24x24 pixel area of the original image and the same area of the image compressed with a quality factor of 10. This loss of information could have a significant impact on the ability of the inference model to make accurate predictions, since the information about relevant facial features and landmarks is compromised.

  • Zooming in on a 24x24 pixel area of the original image.
  • Zooming in on the same area of the image compressed with a quality factor of 10.

After compression, we ran the images through the inference network to measure accuracy at each level of compression and to relate accuracy, file size reduction, and quality factor.

We can see a correlation between the percentage of image size reduction and the accuracy of the recognition algorithm. With a quality factor of 10, the image size is reduced by ~90% while keeping an 80.62% accuracy. On the other hand, when the image size is reduced by ~70% with a quality factor of 80, the achieved accuracy is almost 90%. We believe these results are promising, because they show that even with very high compression factors, the inference accuracy remains high.

Conclusions

Data compression is paramount to sharing data between devices in an efficient and timely manner, and is therefore used everywhere over the internet. In this project, we were interested in understanding how JPEG compression quality affects the accuracy of inference models. We were surprised to find that with a 90% reduction in image size, one can still achieve over 80% inference accuracy. We believe that with a more comprehensive data set and an improved inference model, we can achieve higher accuracy levels for significantly compressed images. Such compression will enable the automated attendance system to scale up efficiently to numerous cameras taking images of students simultaneously, without degrading the performance of any edge device or being too computationally expensive.

Appendix

In this project, we also analyzed how different blur techniques affect the inference accuracy. As with JPEG compression, we used the OpenCV library to blur the images. We used three types of blurring techniques: average, median, and Gaussian. Average blurring is done by convolving the image with a kernel of size k×k, replacing the center element with the average of the pixel values under the kernel area. Median blurring is similar, but the central value is replaced with the median of the values under the kernel area. Gaussian blurring convolves the image with a Gaussian kernel. We convolved the images with different kernel sizes and compared the blurring techniques.

Accuracy According to Blur Type (%)
Kernel Size Average Median Gaussian
5 85.7142 77.5510 89.7959
7 67.3469 63.2653 79.5918
9 61.2245 57.1429 75.5102
11 53.0612 44.8979 69.3877
15 38.7755 34.6939 67.3469
21 30.6122 32.6530 51.0204
27 20.4081 30.6122 38.7755

The blur kernel size ranged from 5 to 27 in increments of 2. With the smallest kernel size of 5, the blurring effect of each particular filter is almost indiscernible to the human eye, whereas with a kernel size of 27 we can clearly see the effects of the average, median, and Gaussian filters. The inference model achieved the highest overall accuracy with the Gaussian filter, and the lowest with the median. The inference model recognizes landmarks and features of the face to compute the prediction, such as the relative position, size, and/or shape of the eyes, nose, cheekbones, and jaw. The median filter performs well in algorithms where edge detection is needed, because it keeps edges sharp, but as can be seen from the images, all facial landmarks are blurred at a high enough kernel size. On the other hand, the Gaussian filter smooths out the image and removes high-frequency components, effectively blurring edges and reducing contrast, but it keeps the facial features in the same location, which works to the advantage of the facial recognition algorithm.
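The edge-preserving behavior of the median filter described above can be demonstrated with a small stdlib-only sketch (the project itself used OpenCV's blur functions; this just contrasts the averaging and median reductions on a step edge):

```python
import statistics

def filter_3x3(img, reduce_fn):
    """Apply a 3x3 neighborhood filter to interior pixels of a 2-D
    image (list of rows), combining each neighborhood with reduce_fn,
    e.g. statistics.mean for average blur, statistics.median for
    median blur. Border pixels are left unchanged for simplicity."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            nbhd = [img[y + i][x + j]
                    for i in (-1, 0, 1) for j in (-1, 0, 1)]
            out[y][x] = reduce_fn(nbhd)
    return out
```

On an image with a hard 0-to-100 vertical edge, the median filter returns 0 on one side and 100 on the other (edge kept sharp), while the mean filter produces intermediate values that smear the edge, matching the behavior discussed above.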

(From left to right) Average, Median and Gaussian blur, (from top to bottom) blur level 5 and 27.

Code and Dataset

To download the repository:

git clone https://username@bitbucket.org/jpergament/psych221-project.git

This repository contains a README file which describes the code and the used images.

References

  • Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. University of Massachusetts, Amherst, Technical Report 07-49, October 2007. [2]
  • Carl Salvaggio. What is Inside a JPEG. Rochester Institute of Technology, 2015. [3]
  • Danoja Dias. JPEG Compression Algorithm. May 2017. [4]
  • Ellen Chang, Udara Fernando, and Jane Hu. Data Compression. [5]
  • Facial Recognition Using Google's Convolutional Neural Network. [6]
  • Labeled Faces in the Wild. [8]