# Detecting Super Smash Bros characters using Tensorflow detection API

## Apr 20, 2019 11:32 · 1542 words · 8 minute read tutorial deep learning computer vision

For this article, we will train a model to detect Super Smash Bros Ultimate (one of the best games in the world) characters. However, there are too many of them, the detector would become out of date whenever a new character comes out, and I am also lazy 🤷‍. So we will limit the dataset to my favorite character, Rob.

But why detect characters in SSBU? Well, first of all, it is fun. But detection also gives us the position of every character over time, which we can then use for esports analytics. This has already been tackled for League of Legends via the project DeepLeague. For SSBU, it can be difficult because of the moving camera. However, this example is limited to one character, so it is up to you to improve it or use it for another purpose.

The code that goes with this article can be found in my dedicated github repository.

## Introduction to Deep learning for object detection

Detection is a computer vision task whose aim is to locate and identify objects in an image. It sits between the classification task and instance segmentation: the latter classifies each pixel of the detected objects, while detection only needs to draw a bounding box.

Since this is the first article on this blog talking about deep learning, here is a brief introduction to it. Deep learning is part of the machine learning (ML) field. An ML algorithm uses data to infer new pieces of information: it learns from the data. The more data it has, the more powerful and accurate it can be. \
A common technique in machine learning is to use a neural network (also called a “model”), which is composed of hidden layers. As computation power grows, we can stack more and more layers; the adjective “deep” in deep learning refers to the depth of the neural network. Each layer contains variables called “weights” that are initialized randomly. A neural network is, in fact, a big mathematical function $f(x, weights) = y$. In our case, $x$ is an image and $y$ the detections.

To make a neural network work, we need to find adequate weights so that its output becomes meaningful. For that, we feed the neural network a collection of labeled data, called a dataset. A labeled example $x$ comes with its annotation $y$. The training phase tunes the weights according to the dataset via an algorithm called “backpropagation”. The following section gives details on how to create the dataset.
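To make the idea concrete, here is a toy illustration (plain Python, not Tensorflow) of tuning a single weight by gradient descent, the mechanism at the heart of backpropagation. The data and learning rate are made up for the example:

```python
# Toy "neural network" with a single weight: f(x, w) = w * x.
# The labeled data follows y = 2 * x, so the ideal weight is 2.
dataset = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # pairs (x, y)

w = 0.5             # arbitrary initial weight
learning_rate = 0.05

for epoch in range(200):
    for x, y in dataset:
        prediction = w * x
        error = prediction - y
        # Gradient of the squared error (error ** 2) w.r.t. w is 2 * error * x:
        # nudge w in the direction that reduces the error.
        w -= learning_rate * 2 * error * x

print(round(w, 3))  # converges close to 2.0
```

A real network has millions of weights and non-linear layers, but the loop is the same: predict, measure the error, and push each weight against its gradient.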

This description is still a bit vague about how it concretely works but, despite that, we do not have to understand the nuts and bolts to use deep learning. I recommend checking other resources (like a MOOC on Coursera) if you would like more details.

Deep learning revolutionized computer vision, achieving state-of-the-art results with specific architectures. Thankfully, nowadays it is very easy to get a pre-trained model for detection: Tensorflow has a module for this purpose, called the TF object detection API.

## Creating the dataset

First of all, in any machine learning task, we need to create our dataset.

You can clone my repository to get all the scripts necessary to run this tutorial.

git clone https://github.com/Coac/tf-detection-example.git
cd tf-detection-example/


### Setting up the folder structure

We begin by creating a folder named dataset with the following structure:

dataset/
├── train/
│   ├── images/
│       ├── image1.jpg
│       ├── image2.jpg
│   ├── annotations/
│       ├── image1.xml
│       ├── image2.xml
├── label_map.pbtxt


The label_map.pbtxt file lists the classes (categories of objects to detect) present in our dataset. For instance:

item {
  id: 1
  name: 'rob'
}
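If the dataset later grows to more characters, the file simply lists one item block per class, each with a unique id (the second class name here is a hypothetical example):

```
item {
  id: 1
  name: 'rob'
}
item {
  id: 2
  name: 'kirby'
}
```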


### Annotating the images

Now we need to fill the images/ and annotations/ folders. To do so, we first gather images containing the objects we would like to detect and put them into images/. For our Rob detection task, we can simply record some gameplay videos and take screenshots. Then we need to label them, one by one, detection by detection.

We can use the LabelImg tool, which writes the annotation files in the PASCAL VOC format used by Tensorflow. We open the program and do the following:

• Click on Open Dir and set it to dataset/images
• Click on Change Save Dir and set it to dataset/annotations

Finally, we can start the boring annotation task.

Pro tips:

• Use hotkeys for faster annotation (A for the previous image, D for next image, W to draw a box)
• You can set a default label to avoid retyping the name after drawing each box
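Each annotation LabelImg saves is a small PASCAL VOC XML file. Here is a minimal sketch of reading one with Python's standard library; the XML content is a hypothetical example of what the tool produces:

```python
import xml.etree.ElementTree as ET

# Hypothetical PASCAL VOC annotation, as LabelImg would save it
voc_xml = """
<annotation>
  <filename>image1.jpg</filename>
  <object>
    <name>rob</name>
    <bndbox>
      <xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax>
    </bndbox>
  </object>
</annotation>
"""

root = ET.fromstring(voc_xml)
for obj in root.iter("object"):           # one <object> per drawn box
    label = obj.find("name").text
    box = obj.find("bndbox")
    coords = [int(box.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")]
    print(label, coords)  # rob [48, 240, 195, 371]
```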

### Splitting into train set and test set

Once we have our labeled data, we need to split it in two: one set for training and another for testing. This allows us to correctly evaluate the performance of our model by calculating the metrics only on the test set.

We use the split_train_test.py script to move some images and annotations to the test/ folder.

python split_train_test.py
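The repository script does the file moving for you, but the core idea is a plain random split. A minimal sketch, assuming an 80/20 ratio (check split_train_test.py for the actual value it uses):

```python
import random

def split_train_test(filenames, test_ratio=0.2, seed=42):
    """Shuffle the filenames and carve off a held-out test set."""
    files = sorted(filenames)           # deterministic starting order
    random.Random(seed).shuffle(files)  # seeded, so the split is reproducible
    n_test = int(len(files) * test_ratio)
    return files[n_test:], files[:n_test]  # (train, test)

images = ["image%d.jpg" % i for i in range(10)]
train, test = split_train_test(images)
print(len(train), len(test))  # 8 2
```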


In the end, we should have this folder structure for the dataset:

dataset/
├── train/
│   ├── images/
│   ├── annotations/
├── test/
│   ├── images/
│   ├── annotations/
├── label_map.pbtxt


### Using an existing dataset

For the laziest, you can also use an existing dataset. If yours is in the YOLO format, use the script convert_dataset_format.py to convert it to a PASCAL VOC style dataset. You can also download my ready-to-use dataset for detecting Rob here; it contains ~200 images, 100 with Rob and 100 without.
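For reference, the two formats store boxes differently: YOLO uses normalized center coordinates and sizes, while PASCAL VOC uses absolute pixel corners. A minimal sketch of the box conversion (the actual file handling is done by convert_dataset_format.py):

```python
def yolo_to_voc(x_center, y_center, width, height, img_w, img_h):
    """Convert a YOLO box (normalized center/size) to VOC pixel corners."""
    xmin = int((x_center - width / 2) * img_w)
    ymin = int((y_center - height / 2) * img_h)
    xmax = int((x_center + width / 2) * img_w)
    ymax = int((y_center + height / 2) * img_h)
    return xmin, ymin, xmax, ymax

# A box centered in a 640x480 image, spanning half of each dimension
print(yolo_to_voc(0.5, 0.5, 0.5, 0.5, 640, 480))  # (160, 120, 480, 360)
```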

### Creating TFRecords

A TFRecord is a serialized dataset for Tensorflow. It allows the whole training pipeline to load the data efficiently. Once our dataset is ready, we use the script create_pascal_tf_record.py to convert our images and annotations folders to TFRecords.

python create_pascal_tf_record.py --output_path=./train.record --data_dir=dataset/train
python create_pascal_tf_record.py --output_path=./test.record --data_dir=dataset/test


## Installing TF detection

First, you need to clone the repository. It contains the necessary scripts to train, export and use detection models.

git clone https://github.com/tensorflow/models.git


Then, I will let you follow the well-written installation instructions in the README. On my side, I installed all the dependencies in a Conda environment, and I recommend you do the same.

Do not forget to add it to the PYTHONPATH. Then set the TF_MODELS_PATH env var to the Tensorflow models folder. For example, if you cloned the repository into your home directory, do:

export TF_MODELS_PATH=~/models


or cd into the directory and do:

export TF_MODELS_PATH=$(pwd)


## Choosing and downloading a pre-trained model

Take a look at the model zoo available in the Tensorflow object detection API: there are many trained models available, one for everyone’s need. Since we would like to do real-time detection, we will choose one with a fast inference time and a not-so-bad mAP (mean Average Precision): the ssdlite_mobilenet_v2_coco_2018_05_09 model.

Once we have the model name, we need to download the pre-trained weights. For that, we use a simple script.

python download_model.py


We also need a pipeline configuration .config file. The repository already contains ssdlite_mobilenet_v2_coco_modified.config, but if you changed the model, download the adequate file from here and edit it by replacing all the PATH_TO_BE_CONFIGURED occurrences. It defines the whole training pipeline, including data augmentation and evaluation.

Data augmentation is used to generate on-the-fly modified images from the dataset. This virtually increases the number of training examples and makes the neural network generalize better. I kept the default data augmentation techniques random_horizontal_flip and ssd_random_crop.

The evaluation phase calculates the metrics of the model on the test set. They will be used to assess the performance of the model.

## Fine-tuning

This part is called fine-tuning because we start from an already pre-trained model that works out of the box on general classes; we only need to retrain it on our specific dataset. For that, we run the following command:

python ${TF_MODELS_PATH}/research/object_detection/model_main.py \
--pipeline_config_path=ssdlite_mobilenet_v2_coco_modified.config \
--model_dir=./finetuned_model \
--alsologtostderr \
--num_train_steps=30000 \
--num_eval_steps=100
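For reference, the data augmentation techniques mentioned above are declared in the .config file as proto-style blocks; in the stock SSD sample configs they look like this:

```
data_augmentation_options {
  random_horizontal_flip {
  }
}
data_augmentation_options {
  ssd_random_crop {
  }
}
```

Adding or removing such blocks is how you change the augmentation pipeline without touching any code.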


### Monitoring via Tensorboard

We can follow the training using Tensorboard. It is a very useful tool to monitor any scalar, and even images, during training. By default, the Tensorflow detection API already logs lots of useful summaries, like the mAP and recall metrics.

To run Tensorboard, we use the following command:

tensorboard --logdir=finetuned_model


Then open your browser and go to localhost:6006.

After 16 hours of training (34k steps) using a GTX 1080, the model gets ~0.49 mAP and ~0.48 Recall on the test set.
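As a reminder, mAP counts a detection as correct when its overlap with a ground-truth box, measured by Intersection over Union (IoU), exceeds a threshold. A minimal sketch of that computation:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (xmin, ymin, xmax, ymax)."""
    # Corners of the intersection rectangle
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp to 0 when the boxes do not overlap at all
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 1/3: half of each box overlaps
```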

## Exporting the model

After the training completion, export the graph for inference:

INPUT_TYPE=image_tensor
PIPELINE_CONFIG_PATH=ssdlite_mobilenet_v2_coco_modified.config
TRAINED_CKPT_PREFIX=finetuned_model/model.ckpt-30000
EXPORT_DIR=finetuned_model/export_frozen
python ${TF_MODELS_PATH}/research/object_detection/export_inference_graph.py \
--input_type=${INPUT_TYPE} \
--pipeline_config_path=${PIPELINE_CONFIG_PATH} \
--trained_checkpoint_prefix=${TRAINED_CKPT_PREFIX} \
--output_directory=${EXPORT_DIR}


It will create the file finetuned_model/export_frozen/frozen_inference_graph.pb.

## Testing the model

Using the exported graph, we run an inference on a single image:

python inference.py


The model works quite well and is fast enough for real-time detection, achieving ~10 FPS on my CPU-only laptop. The main limitation is that it does not detect Rob well when he does a 360 rotation using the neutral aerial. To fix this, we can simply add more of these examples to our dataset. We could also use the exported model in the browser via tfjs, or even on a mobile device.