Semantic Segmentation using Fully Convolutional Networks
Semantic segmentation is the task of classifying each pixel in an image into a category. It is an important task in computer vision with many real-life applications, such as autonomous driving, medical imaging, and satellite imaging. In this project, my group and I implemented a Fully Convolutional Network (FCN) to perform semantic segmentation on the VOC2007 dataset.
Dataset
The PASCAL VOC 2007 dataset is a benchmark dataset for object detection, segmentation, and classification. It consists of 20 classes (21 including the background class) and comes pre-split into two sets, a training set and a validation set. The dataset can be found here.
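For reference, torchvision ships a loader for this dataset; here is a minimal sketch of loading both splits (the root path and transforms are placeholders, not necessarily what we used):

```python
from torchvision import datasets, transforms

train_set = datasets.VOCSegmentation(
    root="data",                               # download location (placeholder)
    year="2007",
    image_set="train",                         # the pre-split training set
    download=True,
    transform=transforms.ToTensor(),           # image -> float tensor in [0, 1]
    target_transform=transforms.PILToTensor(), # mask -> integer class indices
)
val_set = datasets.VOCSegmentation(
    root="data", year="2007", image_set="val", download=True,
    transform=transforms.ToTensor(),
    target_transform=transforms.PILToTensor(),
)

image, mask = train_set[0]  # one (image, segmentation mask) pair
```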
Example Image from the Dataset
Fully Convolutional Networks (FCNs)
To understand FCNs, one should first be introduced to the Convolutional Neural Network (CNN). A CNN is a type of neural network used for image classification and object detection. It is made up of convolutional layers, which extract the relevant information from an image by sliding filters over it, and pooling layers, which reduce the spatial size of the resulting feature maps. In a standard CNN, the output is then flattened and passed through fully connected layers to produce a single label for the whole image. An FCN replaces these fully connected layers with convolutional layers, so the network outputs a spatial map of class scores instead of a single label, which is what allows it to classify every pixel.
CNN Visual Representation
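To make the distinction concrete, here is a small illustrative sketch (not from the project) contrasting a classification head with a fully convolutional head:

```python
import torch
import torch.nn as nn

features = torch.randn(1, 256, 32, 32)  # feature map from some conv backbone

# CNN classifier head: global pooling + linear layer -> one label per image
classifier_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, 21)
)
print(classifier_head(features).shape)  # torch.Size([1, 21])

# FCN head: a 1x1 convolution keeps the spatial grid -> one score map per class
fcn_head = nn.Conv2d(256, 21, kernel_size=1)
print(fcn_head(features).shape)         # torch.Size([1, 21, 32, 32])
```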
Implementation
During the implementation of the FCN, we trained four different architectures to see which would perform best on the dataset. The architectures were:
- A baseline FCN
- A custom FCN (a simplified U-Net)
- A ResNet transfer-learning FCN
- A fully connected U-Net
Baseline
For the baseline we used a simple architecture that first upsamples the image to 512x512 with convolutions to make the features more prominent, then reduces the channel count down to the number of classes and classifies each pixel individually.
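As a rough illustration of this idea, here is a minimal PyTorch sketch, assuming 256x256 inputs; the channel sizes are placeholders rather than the exact ones we used:

```python
import torch.nn as nn

class BaselineFCN(nn.Module):
    """Sketch of the baseline idea: a transposed convolution upsamples the
    input to 512x512, then a 1x1 convolution scores each pixel."""
    def __init__(self, n_classes=21):
        super().__init__()
        self.up = nn.ConvTranspose2d(3, 64, kernel_size=2, stride=2)  # 256 -> 512
        self.conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.classifier = nn.Conv2d(64, n_classes, kernel_size=1)

    def forward(self, x):
        x = self.relu(self.conv(self.relu(self.up(x))))
        return self.classifier(x)  # (N, 21, 512, 512): per-pixel class scores
```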
We made several improvements to the baseline: a cosine-annealing learning-rate schedule, batch normalization after each convolution, and data augmentation, which effectively gives us more training data without needing more images. Finally, as the example above shows, the background class is by far the majority class, which can easily cause the model to predict everything as background, so we added class weights to the loss function to help the model learn the other classes.
More detailed information on this can be found in the report.
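As a rough sketch of how these tweaks fit together in PyTorch (`model` and `train_loader` are assumed to exist, and the weight values, learning rate, and epoch count are placeholders, not the values from our report):

```python
import torch
import torch.nn as nn

n_classes = 21
class_weights = torch.ones(n_classes)
class_weights[0] = 0.1  # down-weight the dominant background class (placeholder value)

criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=500)

for epoch in range(500):
    for images, masks in train_loader:  # masks: (N, H, W) long tensor of class indices
        optimizer.zero_grad()
        logits = model(images)          # (N, 21, H, W)
        loss = criterion(logits, masks)
        loss.backward()
        optimizer.step()
    scheduler.step()                    # cosine-annealed learning rate
```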
Custom FCN
The custom FCN was our implementation of a simplified U-Net, which offers a balance between computational efficiency and segmentation accuracy. It uses an encoder to extract features and a decoder to upsample the feature maps back to the original image size. The architecture is as follows:
Layer | In Channels | Out Channels | Kernel | Stride | Padding | Activation |
---|---|---|---|---|---|---|
enc conv1 | 3 | 64 | 3 | 1 | 1 | ReLU |
enc conv2 | 64 | 128 | 3 | 1 | 1 | ReLU |
enc conv3 | 128 | 256 | 3 | 1 | 1 | ReLU |
bottleneck conv | 256 | 512 | 3 | 1 | 1 | ReLU |
dec upconv1 | 512 | 256 | 2 | 2 | 0 | ReLU |
dec conv1 | 256 | 256 | 3 | 1 | 1 | ReLU |
dec upconv2 | 256 | 128 | 2 | 2 | 0 | ReLU |
dec conv2 | 128 | 128 | 3 | 1 | 1 | ReLU |
dec upconv3 | 128 | 64 | 2 | 2 | 0 | ReLU |
dec conv3 | 64 | 64 | 3 | 1 | 1 | ReLU |
final conv | 64 | 21 | 1 | 1 | 0 | - |
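Here is a PyTorch sketch of this table. The table lists only the convolutions, so we assume a 2x max-pool after each encoder stage (otherwise the three 2x up-convolutions would not restore the input resolution):

```python
import torch.nn as nn

class SimplifiedUNet(nn.Module):
    def __init__(self, n_classes=21):
        super().__init__()
        def enc(cin, cout):  # conv from the table + an assumed 2x max-pool
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, stride=1, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
            )
        def dec(cin, cout):  # 2x transposed conv + refining conv, per the table
            return nn.Sequential(
                nn.ConvTranspose2d(cin, cout, 2, stride=2), nn.ReLU(),
                nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(),
            )
        self.enc1, self.enc2, self.enc3 = enc(3, 64), enc(64, 128), enc(128, 256)
        self.bottleneck = nn.Sequential(nn.Conv2d(256, 512, 3, padding=1), nn.ReLU())
        self.dec1, self.dec2, self.dec3 = dec(512, 256), dec(256, 128), dec(128, 64)
        self.final = nn.Conv2d(64, n_classes, 1)

    def forward(self, x):
        x = self.enc3(self.enc2(self.enc1(x)))  # downsample 8x while extracting features
        x = self.bottleneck(x)
        x = self.dec3(self.dec2(self.dec1(x)))  # upsample 8x back to the input size
        return self.final(x)
```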
ResNet Transfer Learning FCN
For this architecture we used a ResNet-50 pre-trained on ImageNet as the backbone of a semantic segmentation model. The backbone serves as a feature extractor, to which we append a decoder that upsamples the features and then classifies the pixels. The architecture is as follows:
Layer | In Channels | Out Channels | Kernel | Stride | Padding | Activation |
---|---|---|---|---|---|---|
backbone | - | - | - | - | - | - |
conv1 | 2048 | 1024 | 1 | - | - | ReLU |
conv2 | 1024 | 512 | 1 | - | - | ReLU |
deconv1 | 512 | 256 | 3 | 2 | 1 | ReLU |
deconv2 | 256 | 128 | 3 | 2 | 1 | ReLU |
deconv3 | 128 | 64 | 3 | 2 | 1 | ReLU |
bn1 | - | - | - | - | - | BatchNorm2d |
deconv4 | 64 | 64 | 3 | 2 | 1 | ReLU |
bn2 | - | - | - | - | - | BatchNorm2d |
classifier | 64 | 21 | 1 | - | - | - |
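A sketch of this table in PyTorch, assuming the torchvision ResNet-50 with its average-pool and fully connected layers removed; `output_padding=1` is our assumption so that each transposed convolution exactly doubles the spatial size:

```python
import torch.nn as nn
from torchvision import models

class ResNetTransferFCN(nn.Module):
    def __init__(self, n_classes=21):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Keep everything up to the last residual block; drop avgpool and fc.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.conv1 = nn.Sequential(nn.Conv2d(2048, 1024, 1), nn.ReLU())
        self.conv2 = nn.Sequential(nn.Conv2d(1024, 512, 1), nn.ReLU())
        def up(cin, cout):
            return nn.Sequential(
                nn.ConvTranspose2d(cin, cout, 3, stride=2, padding=1,
                                   output_padding=1),
                nn.ReLU(),
            )
        self.deconv1, self.deconv2, self.deconv3 = up(512, 256), up(256, 128), up(128, 64)
        self.bn1 = nn.BatchNorm2d(64)
        self.deconv4 = up(64, 64)
        self.bn2 = nn.BatchNorm2d(64)
        self.classifier = nn.Conv2d(64, n_classes, 1)

    def forward(self, x):
        x = self.backbone(x)  # 2048 channels at 1/32 of the input resolution
        x = self.conv2(self.conv1(x))
        x = self.bn1(self.deconv3(self.deconv2(self.deconv1(x))))
        x = self.bn2(self.deconv4(x))
        # Note: four 2x upsamplings against the backbone's 32x downsampling leave
        # the output at 1/2 the input resolution; a final upsample back to the
        # input size may be needed (our assumption, not shown in the table).
        return self.classifier(x)  # per-pixel class scores
```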
Fully Connected U-Net
The U-Net is a popular architecture for semantic segmentation. It is made up of an encoder, which extracts features from the image, and a decoder, which upsamples the feature maps so the pixels can be classified. It also uses crop-and-copy skip connections between the encoder and decoder, which pass spatial detail from the encoder directly to the decoder and help the model learn better. The architecture is as follows:
Layer | In Channels | Out Channels | Kernel | Stride | Padding | Activation |
---|---|---|---|---|---|---|
Conv11 | 3 | 64 | 3 | 1 | 1 | ReLU |
Conv1 | 64 | 64 | 3 | 1 | 1 | ReLU |
MaxPool1 | - | - | 2 | 2 | 0 | - |
Conv21 | 64 | 128 | 3 | 1 | 1 | ReLU |
Conv2 | 128 | 128 | 3 | 1 | 1 | ReLU |
MaxPool2 | - | - | 2 | 2 | 0 | - |
Conv31 | 128 | 256 | 3 | 1 | 1 | ReLU |
Conv3 | 256 | 256 | 3 | 1 | 1 | ReLU |
MaxPool3 | - | - | 2 | 2 | 0 | - |
Conv41 | 256 | 512 | 3 | 1 | 1 | ReLU |
Conv4 | 512 | 512 | 3 | 1 | 1 | ReLU |
MaxPool4 | - | - | 2 | 2 | 0 | - |
Bottleneck1 | 512 | 1024 | 3 | 1 | 1 | ReLU |
Bottleneck2 | 1024 | 1024 | 3 | 1 | 1 | ReLU |
ConvTransposed1 | 1024 | 512 | 2 | 2 | 0 | ReLU |
upConv11 | 1024 | 512 | 3 | 1 | 1 | ReLU |
upConv12 | 512 | 512 | 3 | 1 | 1 | ReLU |
ConvTransposed2 | 512 | 256 | 2 | 2 | 0 | ReLU |
upConv21 | 512 | 256 | 3 | 1 | 1 | ReLU |
upConv22 | 256 | 256 | 3 | 1 | 1 | ReLU |
ConvTransposed3 | 256 | 128 | 2 | 2 | 0 | ReLU |
upConv31 | 256 | 128 | 3 | 1 | 1 | ReLU |
upConv32 | 128 | 128 | 3 | 1 | 1 | ReLU |
ConvTransposed4 | 128 | 64 | 2 | 2 | 0 | ReLU |
upConv41 | 128 | 64 | 3 | 1 | 1 | ReLU |
upConv42 | 64 | 64 | 3 | 1 | 1 | ReLU |
softmax | 64 | 21 | 1 | 1 | 0 | - |
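To make the crop-and-copy step concrete, here is a small illustrative sketch (not the full model) of how one decoder stage merges the matching encoder feature map; this concatenation is why upConv11 in the table takes 1024 input channels (512 from the skip connection plus 512 from the decoder):

```python
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF

def crop_and_copy(encoder_feat, decoder_feat):
    # Crop the (possibly larger) encoder map to the decoder map's spatial size,
    # then concatenate along the channel dimension.
    _, _, h, w = decoder_feat.shape
    encoder_feat = TF.center_crop(encoder_feat, [h, w])
    return torch.cat([encoder_feat, decoder_feat], dim=1)

up = nn.ConvTranspose2d(1024, 512, kernel_size=2, stride=2)  # ConvTransposed1
bottleneck = torch.randn(1, 1024, 32, 32)  # illustrative sizes
enc4 = torch.randn(1, 512, 64, 64)         # matching encoder feature map

merged = crop_and_copy(enc4, up(bottleneck))
print(merged.shape)  # torch.Size([1, 1024, 64, 64]) -> input to upConv11
```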
Here is a visual representation of the U-Net Architecture:
Results
Among the four architectures, the ResNet transfer-learning FCN performed the best: we were able to train it to segment the images correctly, while the outputs of the other models still looked close to random after approximately 500 epochs.
Discussion
The ResNet transfer-learning FCN performed the best, as it was the only model able to segment the images correctly; the others still produced near-random outputs after approximately 500 epochs. This is likely because the ResNet backbone was pre-trained on ImageNet, a large dataset, so it had already learned useful image features, whereas the other models were trained locally from scratch and may simply not have had enough training to learn properly. For example, the U-Net was able to recover the shapes of the objects but not their correct class labels (the colors in the segmentation maps), which suggests that more training was required.
Deliverables
For a more in-depth look at the project, you can download the project below, or view the code by clicking the image below.