PSPNet
For image scene semantic segmentation, PSPNet (Pyramid Scene Parsing Network) performs better than other semantic segmentation networks such as FCN, U-Net, and DeepLab.
In this article, we'll discuss PSPNet and its implementation in Keras.
An input image of any shape, usually with dimensions greater than (256, 256), is fed to the network.
The network takes the input image and constructs feature maps for it. Feature maps are extracted by feeding the image through a pretrained (transfer learning) or from-scratch network with dilated convolutions. Large kernels extract more contextual information than small kernels, but at a higher computation cost; dilated convolutions gather information over a large area with a small kernel by using higher dilation rates, while keeping the spatial dimensions the same as the input image. Generally, residual blocks with dilations are used to construct the feature maps. The number of feature maps N is a hyperparameter and needs to be tuned for good results.
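As a rough sketch of this idea in TensorFlow/Keras (the helper name, filter count, and dilation rate below are only illustrative assumptions, not fixed values of the architecture), a residual block with dilated convolutions could look like this:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_dilated_block(x, filters, dilation_rate=2):
    # Two dilated 3x3 convolutions; padding='same' keeps the spatial
    # dimensions equal to the input, as described above.
    shortcut = x
    y = layers.Conv2D(filters, 3, padding='same',
                      dilation_rate=dilation_rate, activation='relu')(x)
    y = layers.Conv2D(filters, 3, padding='same',
                      dilation_rate=dilation_rate)(y)
    # Match the channel count of the shortcut before adding, if needed.
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, padding='same')(shortcut)
    y = layers.Add()([shortcut, y])
    return layers.Activation('relu')(y)
```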
An image contains objects whose sizes range from small to large, located in different regions. Fully Convolutional Networks (FCN), U-Net, and other networks construct feature maps by upsampling and performing segmentation at different levels so that objects of all sizes in all regions are segmented. In PSPNet, to correctly segment objects of all sizes, the feature maps are instead average pooled at different pool sizes.
The output of the Pyramid Pooling Module is the concatenation of the base feature maps from (b) and the upsampled average-pooled feature maps from (c).
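A minimal Keras sketch of such a pyramid pooling module is shown below; the pool grid sizes, the number of reduced channels, and the use of bilinear upsampling follow the description above but are passed in as assumptions, and the sketch assumes the input height and width are statically known and divisible by each grid size.

```python
import tensorflow as tf
from tensorflow.keras import layers

def pyramid_pooling_module(base, pool_grids=(1, 2, 4, 8), reduced_filters=64):
    # base: feature maps from module (b), shape (H, W, N) with H, W known.
    h, w = base.shape[1], base.shape[2]
    branches = [base]
    for size in pool_grids:
        # Average pool into a size x size grid; size=1 is global average pooling.
        p = layers.AveragePooling2D(pool_size=(h // size, w // size))(base)
        # Reduce the channel dimension with a 1x1 convolution.
        p = layers.Conv2D(reduced_filters, 1, padding='same', activation='relu')(p)
        # Upsample back to the base feature-map size with bilinear interpolation.
        p = layers.UpSampling2D(size=(h // size, w // size),
                                interpolation='bilinear')(p)
        branches.append(p)
    # Concatenate the base feature maps with all upsampled pooled maps.
    return layers.Concatenate()(branches)
```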
The dataset we apply PSPNet semantic segmentation to is Kaggle's Cityscapes Image Pairs dataset, which is 106 MB in size.
The dataset contains two folders: a training images (train) folder and a validation images (val) folder. In each folder, every file is 512 pixels wide, 256 pixels tall, with 3 channels, and contains the original (unsegmented) image and the segmented image side by side. We separate the normal images into the arrays train_imgs and valid_imgs, and the segmented images into the arrays train_masks and valid_masks.
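A rough loading sketch is shown below; the directory paths are hypothetical, and it assumes the original image occupies the left half of each 512x256 pair and the segmentation mask the right half.

```python
import os
import numpy as np
from PIL import Image

def load_pairs(folder):
    # Each file is a 512x256 side-by-side pair: assumed original on the left,
    # segmentation mask on the right.
    imgs, masks = [], []
    for fname in sorted(os.listdir(folder)):
        pair = np.array(Image.open(os.path.join(folder, fname)))
        imgs.append(pair[:, :256, :])    # left half: original image
        masks.append(pair[:, 256:, :])   # right half: segmentation mask
    return np.array(imgs), np.array(masks)

# Hypothetical paths to the extracted Kaggle dataset.
train_imgs, train_masks = load_pairs('cityscapes_data/train')
valid_imgs, valid_masks = load_pairs('cityscapes_data/val')
```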
A simple PSPNet architecture can be built with the following parameters (a code sketch follows the list):
Module (b) is constructed with 3 layers of residual blocks with dilated convolutions and outputs 256 feature maps.
Module (c) is defined with pool sizes of global average pooling, (2×2), (4×4), and (8×8), and bilinear interpolation is used for upsampling. At each level the 256 pooled feature maps are reduced to 64 feature maps, which together with the base feature maps gives 512 feature maps in total.
Module (d) is simply a convolution layer applied to the 512 feature maps that outputs a (256, 256, 3)-dimensional feature map, which is flattened into a single array of size 256×256×3.
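Putting the three modules together, a compact Keras sketch of this simple architecture might look like the following; the dilation rates, the initial convolution, and the output activation are assumptions chosen for illustration rather than values fixed by the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_simple_pspnet(input_shape=(256, 256, 3)):
    inputs = layers.Input(shape=input_shape)

    # Module (b): 3 residual blocks with dilated convolutions, 256 feature maps.
    x = layers.Conv2D(256, 3, padding='same', activation='relu')(inputs)
    for dilation in (1, 2, 4):          # dilation rates are illustrative
        shortcut = x
        y = layers.Conv2D(256, 3, padding='same',
                          dilation_rate=dilation, activation='relu')(x)
        y = layers.Conv2D(256, 3, padding='same', dilation_rate=dilation)(y)
        x = layers.Activation('relu')(layers.Add()([shortcut, y]))

    # Module (c): pyramid pooling over global, 2x2, 4x4, and 8x8 grids,
    # each branch reduced to 64 channels and bilinearly upsampled.
    h, w = input_shape[0], input_shape[1]
    branches = [x]
    for size in (1, 2, 4, 8):
        p = layers.AveragePooling2D(pool_size=(h // size, w // size))(x)
        p = layers.Conv2D(64, 1, padding='same', activation='relu')(p)
        p = layers.UpSampling2D(size=(h // size, w // size),
                                interpolation='bilinear')(p)
        branches.append(p)
    x = layers.Concatenate()(branches)   # 256 + 4*64 = 512 feature maps

    # Module (d): a convolution producing a (256, 256, 3) map, flattened
    # into a single vector of size 256*256*3. The sigmoid is an assumption.
    x = layers.Conv2D(3, 1, padding='same', activation='sigmoid')(x)
    outputs = layers.Flatten()(x)
    return Model(inputs, outputs)

model = build_simple_pspnet()
model.summary()
```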
Sub-region average pooling is done at different scales, starting from global average pooling down to smaller sub-regions. After average pooling the N feature maps at n different sizes, the feature maps at each level are reduced to N/n feature maps by performing convolutions.
For instance, if N = 512 feature maps and n = 4 pooling sizes (including global average pooling), then at each level the 512 feature maps are reduced to 128 feature maps.
The feature maps at each level are upsampled to have the same dimensions as the input image. For upsampling, bilinear interpolation or transposed convolution is used instead of simple upsampling.
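Both options are available as standard Keras layers; a small illustration (the input size, channel count, and upsampling factor of 8 are just examples):

```python
from tensorflow.keras import layers, Input

# e.g. a pooled feature map 8x smaller than a 256x256 input
pooled = Input(shape=(32, 32, 64))

# Option 1: fixed bilinear interpolation.
up_bilinear = layers.UpSampling2D(size=(8, 8), interpolation='bilinear')(pooled)

# Option 2: a learned transposed convolution with the same upsampling factor.
up_transpose = layers.Conv2DTranspose(64, kernel_size=8, strides=8,
                                      padding='same')(pooled)
```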
The concatenated feature maps are fed to a convolution layer, and the final class predictions are generated according to how the output layer is constructed, e.g. different channels for different objects, or a single channel.
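For example, the two common output-head choices look like this in Keras (the class count and activations are assumptions for illustration):

```python
from tensorflow.keras import layers

n_classes = 13   # hypothetical number of object classes

# One channel per class, with a per-pixel softmax over the classes.
def multi_channel_head(features):
    return layers.Conv2D(n_classes, 1, activation='softmax')(features)

# A single channel, e.g. for binary (foreground/background) segmentation.
def single_channel_head(features):
    return layers.Conv2D(1, 1, activation='sigmoid')(features)
```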