Semantic Segmentation using Fully Convolutional Networks

Semantic segmentation is the task of classifying each pixel in an image into a category. It is an extremely important task in computer vision, with many real-world applications such as autonomous driving, medical imaging, and satellite imaging. In this project, my group and I implemented a Fully Convolutional Network (FCN) to perform semantic segmentation on the VOC2007 dataset.

Dataset

The PASCAL VOC 2007 dataset is a benchmark dataset for object detection, segmentation, and classification. It consists of 20 classes (21 including the background class) and comes pre-split into two sets, a training set and a validation set. The dataset can be found here.
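
For reference, the two splits can be loaded directly through torchvision. The following is a minimal sketch, assuming PyTorch; the `root` path and resize dimensions are placeholder choices, not necessarily what we used:

```python
import torchvision
from torchvision import transforms

# Resize images to a fixed size and convert them to tensors.
# (Masks would need their own nearest-neighbour resize before training.)
image_tf = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])

train_set = torchvision.datasets.VOCSegmentation(
    root="data/", year="2007", image_set="train", download=True, transform=image_tf,
)
val_set = torchvision.datasets.VOCSegmentation(
    root="data/", year="2007", image_set="val", download=True, transform=image_tf,
)

image, mask = train_set[0]  # mask: PIL image of class indices 0-20 (255 = void)
```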

Example image from the dataset

Fully Convolutional Networks (FCNs)

To understand FCNs, one should first be introduced to the Convolutional Neural Network (CNN). A CNN is a type of neural network commonly used for image classification and object detection. It is made up of convolutional layers, which apply learned filters to extract the relevant features from the image, and pooling layers, which reduce the spatial size of those feature maps. In a standard CNN, the output is finally passed through fully connected layers to classify the whole image. An FCN replaces these fully connected layers with convolutional layers, so the network outputs a spatial map of class scores instead of a single label, which is what allows it to classify every pixel.
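
The difference between the two heads can be made concrete with a small sketch (illustrative layer sizes, not our actual model):

```python
import torch
import torch.nn as nn

# A shared convolutional trunk that extracts feature maps.
trunk = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
)

# CNN head: flatten and classify -> one label for the whole image.
cnn_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 21))

# FCN head: a 1x1 convolution -> a 21-channel map of per-pixel class scores.
fcn_head = nn.Conv2d(128, 21, kernel_size=1)

features = trunk(torch.randn(1, 3, 64, 64))
print(cnn_head(features).shape)  # torch.Size([1, 21])
print(fcn_head(features).shape)  # torch.Size([1, 21, 32, 32])
```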

Visual representation of a CNN

Implementation

During the implementation of the FCN, we trained four different architectures to see which would perform best on the dataset. The architectures were:

  1. A baseline FCN
  2. A custom FCN (a simplified U-Net)
  3. A ResNet Transfer Learning FCN
  4. A full U-Net

Baseline

For the baseline we used a simple architecture that first upsamples the input to a size of 512x512 with convolutions to make the features more prominent, then reduces the channel dimension down to the number of classes and classifies each pixel individually.
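
A minimal sketch of that idea (the layer widths here are illustrative assumptions; the exact configuration is in the report):

```python
import torch.nn as nn

# Hypothetical baseline: a transposed convolution doubles the spatial size
# (e.g. 256x256 -> 512x512), then a 1x1 convolution reduces the channels to
# 21 class scores so each pixel is classified individually.
baseline = nn.Sequential(
    nn.ConvTranspose2d(3, 32, kernel_size=2, stride=2),
    nn.ReLU(),
    nn.Conv2d(32, 21, kernel_size=1),
)
```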

We made several improvements to the baseline. We used a cosine-annealing learning-rate schedule and added batch normalization after each convolution. We also performed data augmentation on the dataset, effectively giving us more training data without needing more images. Finally, as the example above shows, the background class is by far the majority class, which can easily cause the model to predict everything as background, so we added class weights to the loss function to help the model learn the other classes better.
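
In PyTorch, the class weighting and the cosine-annealed learning rate might look like the following sketch (the background weight of 0.05 and `T_max=500` are placeholder values):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 21, kernel_size=1)  # stand-in for any of the models below

# Down-weight the dominant background class (index 0) so the model is not
# rewarded for predicting background everywhere.
weights = torch.ones(21)
weights[0] = 0.05
criterion = nn.CrossEntropyLoss(weight=weights, ignore_index=255)  # 255 = void pixels

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=500)

# Each epoch: loss = criterion(logits, masks); loss.backward();
# optimizer.step(); scheduler.step()
```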

More detailed information on this can be found in the report.

Custom FCN

The custom FCN was our implementation of a simplified U-Net, which offers a balance between computational efficiency and segmentation accuracy. It uses an encoder to extract features and a decoder to upsample back to the original image size. The architecture is as follows:

| Layer | In Channels | Out Channels | Kernel | Stride | Padding | Activation |
|---|---|---|---|---|---|---|
| enc conv1 | 3 | 64 | 3 | 1 | 1 | ReLU |
| enc conv2 | 64 | 128 | 3 | 1 | 1 | ReLU |
| enc conv3 | 128 | 256 | 3 | 1 | 1 | ReLU |
| bottleneck conv | 256 | 512 | 3 | 1 | 1 | ReLU |
| dec upconv1 | 512 | 256 | 2 | 2 | 0 | ReLU |
| dec conv1 | 256 | 256 | 3 | 1 | 1 | ReLU |
| dec upconv2 | 256 | 128 | 2 | 2 | 0 | ReLU |
| dec conv2 | 128 | 128 | 3 | 1 | 1 | ReLU |
| dec upconv3 | 128 | 64 | 2 | 2 | 0 | ReLU |
| dec conv3 | 64 | 64 | 3 | 1 | 1 | ReLU |
| final conv | 64 | 21 | 1 | 1 | 0 | - |

Simplified U-Net Architecture
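
Read off the table, the model could be sketched in PyTorch as follows. Note that the table lists only the convolutions; the stride-2 max pools between encoder blocks are an assumption on our part, so that the decoder's three stride-2 upsamplings restore the input resolution:

```python
import torch.nn as nn

class SimplifiedUNet(nn.Module):
    """Sketch of the simplified U-Net read off the table above."""

    def __init__(self, num_classes=21):
        super().__init__()

        def block(c_in, c_out):
            # 3x3 conv, stride 1, padding 1, followed by ReLU (as in the table)
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, 1, 1), nn.ReLU())

        # The max pools are assumed; the table lists only the convolutions.
        self.encoder = nn.Sequential(
            block(3, 64), nn.MaxPool2d(2),
            block(64, 128), nn.MaxPool2d(2),
            block(128, 256), nn.MaxPool2d(2),
        )
        self.bottleneck = block(256, 512)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(512, 256, 2, 2), nn.ReLU(), block(256, 256),
            nn.ConvTranspose2d(256, 128, 2, 2), nn.ReLU(), block(128, 128),
            nn.ConvTranspose2d(128, 64, 2, 2), nn.ReLU(), block(64, 64),
        )
        self.final = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x):
        return self.final(self.decoder(self.bottleneck(self.encoder(x))))
```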

ResNet Transfer Learning FCN

For this architecture we used a ResNet-50 pre-trained on ImageNet as the backbone of a semantic segmentation model. The backbone serves as a feature extractor, to which we append a decoder that upsamples the features and then classifies each pixel. The architecture is as follows:

| Layer | In Channels | Out Channels | Kernel | Stride | Padding | Activation |
|---|---|---|---|---|---|---|
| backbone | - | - | - | - | - | - |
| conv1 | 2048 | 1024 | 1 | - | - | ReLU |
| conv2 | 1024 | 512 | 1 | - | - | ReLU |
| deconv1 | 512 | 256 | 3 | 2 | 1 | ReLU |
| deconv2 | 256 | 128 | 3 | 2 | 1 | ReLU |
| deconv3 | 128 | 64 | 3 | 2 | 1 | ReLU |
| bn1 | - | - | - | - | - | BatchNorm2d |
| deconv4 | 64 | 64 | 3 | 2 | 1 | ReLU |
| bn2 | - | - | - | - | - | BatchNorm2d |
| classifier | 64 | 21 | 1 | - | - | - |

ResNet Transfer Learning FCN Architecture
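
A sketch of how this could be assembled with torchvision (the choice of pretrained weights and the `output_padding=1` needed to exactly double the spatial size at each deconvolution are our assumptions; full training details are in the report):

```python
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

# ImageNet-pretrained ResNet-50 trunk: keep everything up to the final
# 2048-channel feature map (drop the average pool and fc classifier).
backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
feature_extractor = nn.Sequential(*list(backbone.children())[:-2])

# Decoder read off the table above.
decoder = nn.Sequential(
    nn.Conv2d(2048, 1024, kernel_size=1), nn.ReLU(),
    nn.Conv2d(1024, 512, kernel_size=1), nn.ReLU(),
    nn.ConvTranspose2d(512, 256, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
    nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
    nn.BatchNorm2d(64),
    nn.ConvTranspose2d(64, 64, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
    nn.BatchNorm2d(64),
    nn.Conv2d(64, 21, kernel_size=1),  # classifier: 21 class scores per pixel
)
# ResNet downsamples 32x while this decoder upsamples 16x, so a final
# interpolation back to the input size would still be needed.
```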

Full U-Net

The U-Net is a popular architecture for semantic segmentation. It is made up of an encoder and a decoder: the encoder extracts features from the image, and the decoder upsamples them back to the input resolution to classify each pixel. It also copies feature maps from the encoder over to the decoder (the "copy and crop" skip connections), which preserves spatial detail that would otherwise be lost during downsampling. The architecture is as follows:

| Layer | In Channels | Out Channels | Kernel | Stride | Padding | Activation |
|---|---|---|---|---|---|---|
| Conv11 | 3 | 64 | 3 | 1 | 1 | ReLU |
| Conv12 | 64 | 64 | 3 | 1 | 1 | ReLU |
| MaxPool1 | - | - | 2 | 2 | 0 | - |
| Conv21 | 64 | 128 | 3 | 1 | 1 | ReLU |
| Conv22 | 128 | 128 | 3 | 1 | 1 | ReLU |
| MaxPool2 | - | - | 2 | 2 | 0 | - |
| Conv31 | 128 | 256 | 3 | 1 | 1 | ReLU |
| Conv32 | 256 | 256 | 3 | 1 | 1 | ReLU |
| MaxPool3 | - | - | 2 | 2 | 0 | - |
| Conv41 | 256 | 512 | 3 | 1 | 1 | ReLU |
| Conv42 | 512 | 512 | 3 | 1 | 1 | ReLU |
| MaxPool4 | - | - | 2 | 2 | 0 | - |
| Bottleneck1 | 512 | 1024 | 3 | 1 | 1 | ReLU |
| Bottleneck2 | 1024 | 1024 | 3 | 1 | 1 | ReLU |
| ConvTransposed1 | 1024 | 512 | 2 | 2 | 0 | ReLU |
| upConv11 | 1024 | 512 | 3 | 1 | 1 | ReLU |
| upConv12 | 512 | 512 | 3 | 1 | 1 | ReLU |
| ConvTransposed2 | 512 | 256 | 2 | 2 | 0 | ReLU |
| upConv21 | 512 | 256 | 3 | 1 | 1 | ReLU |
| upConv22 | 256 | 256 | 3 | 1 | 1 | ReLU |
| ConvTransposed3 | 256 | 128 | 2 | 2 | 0 | ReLU |
| upConv31 | 256 | 128 | 3 | 1 | 1 | ReLU |
| upConv32 | 128 | 128 | 3 | 1 | 1 | ReLU |
| ConvTransposed4 | 128 | 64 | 2 | 2 | 0 | ReLU |
| upConv41 | 128 | 64 | 3 | 1 | 1 | ReLU |
| upConv42 | 64 | 64 | 3 | 1 | 1 | ReLU |
| softmax | 64 | 21 | 1 | 1 | 0 | - |

Here is a visual representation of the U-Net architecture:
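
The defining detail is the skip connection: each decoder stage concatenates the upsampled features with the feature maps saved from the matching encoder stage, which is why upConv11 in the table takes 1024 input channels (512 upsampled + 512 copied from Conv42). One such stage, sketched with illustrative shapes:

```python
import torch
import torch.nn as nn

up = nn.ConvTranspose2d(1024, 512, kernel_size=2, stride=2)  # ConvTransposed1
up_conv = nn.Sequential(
    nn.Conv2d(1024, 512, 3, padding=1), nn.ReLU(),           # upConv11
    nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),            # upConv12
)

bottleneck_out = torch.randn(1, 1024, 32, 32)  # output of Bottleneck2
encoder_skip = torch.randn(1, 512, 64, 64)     # saved output of Conv42

x = up(bottleneck_out)                   # -> (1, 512, 64, 64)
x = torch.cat([encoder_skip, x], dim=1)  # copy: -> (1, 1024, 64, 64)
x = up_conv(x)                           # -> (1, 512, 64, 64)
# With padded 3x3 convolutions the encoder and decoder maps already match
# in size, so no cropping is needed (unlike the original U-Net's valid convs).
```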

Results

Among the four architectures, the ResNet Transfer Learning FCN performed the best: we were able to train it to segment the images correctly, while the outputs of the other models still looked essentially random after approximately 500 epochs.

Example of the ResNet Transfer Learning FCN’s output:

Discussion

The ResNet Transfer Learning FCN likely performed best because its backbone was pre-trained on ImageNet, a much larger dataset, so it started with useful image features already learned. The other architectures were trained locally from scratch, which suggests they simply did not receive enough training to learn properly. For example, the U-Net was able to recover the shapes of the objects but assigned them the wrong classes (colors in the segmentation map), which implies that more training is required.

U-Net’s output: an image of a bird segmented by the U-Net

Deliverables

For a more in-depth look at the project, you can download the project below, and you can also view the code by clicking the image below.