PSPNet


For image scene semantic segmentation, PSPNet performs better than other semantic segmentation networks such as FCN, U-Net, and DeepLab.

In this article, we'll discuss PSPNet and its implementation in Keras.

Pyramid Scene Parsing Network (PSPNet)

a) Input Image

An input image of any shape, usually with dimensions greater than (256, 256), is fed to the network.

b) Feature Map

This stage takes the input image and constructs feature maps for it. The feature maps are extracted by feeding the image through a network with dilated convolutions, either pretrained (transfer learning) or trained from scratch. Large kernels extract more contextual information than small kernels but at a higher computational cost; with higher dilation rates, dilated convolutions gather information over a large area using a small kernel while keeping the spatial dimensions the same as the input image. Generally, residual blocks with dilations are used to construct the feature maps. The number of feature maps N is a hyperparameter and needs to be tuned for good results.
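
As a quick illustration of the dilation idea (a minimal sketch, not part of the article's network), a 3×3 kernel with dilation rate d has an effective receptive field of k + (k−1)(d−1) = 5 pixels per side for d = 2, while still using only 3×3 weights, and padding='same' preserves the spatial dimensions:

import tensorflow as tf
from tensorflow.keras.layers import Convolution2D

# Dummy batch of one 256x256 RGB image (illustrative only).
x = tf.random.normal((1, 256, 256, 3))

# A 3x3 kernel with dilation_rate=2 has a 5x5 effective receptive field
# but the same number of weights as an ordinary 3x3 convolution.
y = Convolution2D(filters=64, kernel_size=(3, 3), dilation_rate=(2, 2),
                  padding='same')(x)
print(y.shape)  # (1, 256, 256, 64) -- spatial dimensions unchanged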

c) Pyramid Pooling Module

An image contains objects of sizes ranging from small to large in different regions. Fully Convolutional Networks (FCN), U-Net, and other networks construct feature maps by upsampling and performing segmentation at different levels in order to segment all objects in all regions. In PSPNet, to correctly segment objects of all sizes, the feature maps are instead average-pooled at different pool sizes.

Sub-region average pooling is done at different scales, such as global average pooling, (2×2), (3×3), (4×4), (6×6), (8×8). After average pooling the N feature maps at n different scales, the feature maps at each level are reduced to N/n feature maps by performing 1×1 convolutions.

For instance, with N = 512 feature maps and n = 4 scales (global average pooling, (2×2), (4×4), (8×8)), the 512 feature maps at each level are reduced to 128 feature maps.

The N/n feature maps at each level are then upsampled to the same dimensions as the input image. For upsampling, bilinear interpolation or transposed convolutions are used instead of simple nearest-neighbor upsampling.

The output of the Pyramid Pooling Module is the concatenation of the base feature maps from (b) and the upsampled average-pooled feature maps described above.
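
A minimal shape walkthrough of one pyramid level (the sizes here are illustrative, not the article's exact network):

import tensorflow as tf
from tensorflow.keras.layers import AveragePooling2D, Convolution2D, UpSampling2D

# Toy base feature maps: N = 512 channels on a 256x256 grid.
base = tf.random.normal((1, 256, 256, 512))

# Pool at one pyramid scale, reduce 512 -> 128 channels (N/n with n = 4),
# then upsample back to the base resolution with bilinear interpolation.
pooled = AveragePooling2D(pool_size=(4, 4))(base)                    # (1, 64, 64, 512)
reduced = Convolution2D(filters=128, kernel_size=(1, 1))(pooled)     # (1, 64, 64, 128)
upsampled = UpSampling2D(size=4, interpolation='bilinear')(reduced)  # (1, 256, 256, 128)
print(upsampled.shape)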

d) Final Prediction

The 2N concatenated feature maps (the base N maps plus n levels of N/n maps each, i.e. N + N = 2N) are fed to a convolution layer, and the final class predictions are generated according to how the output layer is constructed, e.g. different channels for different objects or a single channel.

Implementation in Keras/TensorFlow

Dataset

We apply semantic segmentation with PSPNet to Kaggle's Cityscapes Image Pairs dataset (106 MB). The dataset contains two folders: a training images folder (train) and a validation images folder (val). Each image has shape (256, 512, 3), i.e. width=512, height=256, channels=3, and contains the original (unsegmented) image and the segmented image side by side. We separate the normal images into train_imgs and valid_imgs and the segmented images into the arrays train_masks and valid_masks.

import os
import cv2
import numpy as np

train_folder="/kaggle/input/cityscapes-image-pairs/cityscapes_data/cityscapes_data/train/"
valid_folder="/kaggle/input/cityscapes-image-pairs/cityscapes_data/cityscapes_data/val/"

def get_images_masks(path):
    # each file holds the input image (left half) and its
    # segmentation mask (right half) side by side
    names=os.listdir(path)
    img_g,img_m=[],[]
    for name in names:
        img=cv2.imread(path+name)
        # scale pixel values to [0, 1] as float32
        img=cv2.normalize(img,None,0,1,cv2.NORM_MINMAX,cv2.CV_32F)
        img=img[:,:,::-1]  # BGR -> RGB
        img_g.append(img[:,:256])  # left half: input image
        img_m.append(np.reshape(img[:,256:],(256*256*3)))  # right half: flattened mask
    return img_g,img_m

train_imgs,train_masks=get_images_masks(train_folder)
valid_imgs,valid_masks=get_images_masks(valid_folder)
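
A quick sanity check of the loaded data (the shapes follow from the slicing above; the image count depends on the dataset split):

# each input image is (256, 256, 3); each mask is flattened to 256*256*3 = 196608 values
print(len(train_imgs), train_imgs[0].shape, train_masks[0].shape)
# prints the image count, (256, 256, 3) and (196608,)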

Network Architecture

A simple PSPNet architecture with the following parameters:

Module (b) is constructed from 3 residual blocks with dilated convolutions and outputs 256 feature maps.

Module (c) is defined with pool sizes of global average pooling, (2×2), (4×4), and (8×8), with bilinear interpolation used for upsampling. Each set of 256 pooled feature maps is reduced to 64 feature maps, for 512 feature maps in total after concatenation.

Module (d) is simply a convolution layer that maps the 512 feature maps to a (256, 256, 3)-dimensional output, which is flattened into a single array of size 256*256*3.

import tensorflow as tf
from tensorflow.keras.layers import (Convolution2D, BatchNormalization, LeakyReLU,
                                     ReLU, Add, GlobalAveragePooling2D,
                                     AveragePooling2D, UpSampling2D, Activation)

def conv_block(X,filters,block):
    # residual block with dilated convolutions
    # adds a skip connection at the end, after the convolution operations on input X

    b = 'block_'+str(block)+'_'
    f1,f2,f3 = filters
    X_skip = X
    # block_a
    X = Convolution2D(filters=f1,kernel_size=(1,1),dilation_rate=(1,1),
                      padding='same',kernel_initializer='he_normal',name=b+'a')(X)
    X = BatchNormalization(name=b+'batch_norm_a')(X)
    X = LeakyReLU(alpha=0.2,name=b+'leakyrelu_a')(X)
    # block_b
    X = Convolution2D(filters=f2,kernel_size=(3,3),dilation_rate=(2,2),
                      padding='same',kernel_initializer='he_normal',name=b+'b')(X)
    X = BatchNormalization(name=b+'batch_norm_b')(X)
    X = LeakyReLU(alpha=0.2,name=b+'leakyrelu_b')(X)
    # block_c
    X = Convolution2D(filters=f3,kernel_size=(1,1),dilation_rate=(1,1),
                      padding='same',kernel_initializer='he_normal',name=b+'c')(X)
    X = BatchNormalization(name=b+'batch_norm_c')(X)
    # skip_conv
    X_skip = Convolution2D(filters=f3,kernel_size=(3,3),padding='same',name=b+'skip_conv')(X_skip)
    X_skip = BatchNormalization(name=b+'batch_norm_skip_conv')(X_skip)
    # block_c + skip_conv
    X = Add(name=b+'add')([X,X_skip])
    X = ReLU(name=b+'relu')(X)
    return X

def base_feature_maps(input_layer):
    # base convolution module to get feature maps for the input image

    # block_1
    base = conv_block(input_layer,[32,32,64],'1')
    # block_2
    base = conv_block(base,[64,64,128],'2')
    # block_3
    base = conv_block(base,[128,128,256],'3')
    return base

def pyramid_feature_maps(input_layer):
    # pyramid pooling module: pool the base maps at four scales, reduce
    # each to 64 channels, and upsample everything back to 256x256

    base = base_feature_maps(input_layer)
    # red: global average pooling over the whole map
    red = GlobalAveragePooling2D(name='red_pool')(base)
    red = tf.keras.layers.Reshape((1,1,256))(red)
    red = Convolution2D(filters=64,kernel_size=(1,1),name='red_1_by_1')(red)
    red = UpSampling2D(size=256,interpolation='bilinear',name='red_upsampling')(red)
    # yellow: 2x2 average pooling
    yellow = AveragePooling2D(pool_size=(2,2),name='yellow_pool')(base)
    yellow = Convolution2D(filters=64,kernel_size=(1,1),name='yellow_1_by_1')(yellow)
    yellow = UpSampling2D(size=2,interpolation='bilinear',name='yellow_upsampling')(yellow)
    # blue: 4x4 average pooling
    blue = AveragePooling2D(pool_size=(4,4),name='blue_pool')(base)
    blue = Convolution2D(filters=64,kernel_size=(1,1),name='blue_1_by_1')(blue)
    blue = UpSampling2D(size=4,interpolation='bilinear',name='blue_upsampling')(blue)
    # green: 8x8 average pooling
    green = AveragePooling2D(pool_size=(8,8),name='green_pool')(base)
    green = Convolution2D(filters=64,kernel_size=(1,1),name='green_1_by_1')(green)
    green = UpSampling2D(size=8,interpolation='bilinear',name='green_upsampling')(green)
    # concatenate base (256) + red/yellow/blue/green (64 each) = 512 maps
    return tf.keras.layers.concatenate([base,red,yellow,blue,green])

def last_conv_module(input_layer):
    X = pyramid_feature_maps(input_layer)
    # map the 512 concatenated feature maps to a 3-channel output
    X = Convolution2D(filters=3,kernel_size=3,padding='same',name='last_conv_3_by_3')(X)
    X = BatchNormalization(name='last_conv_3_by_3_batch_norm')(X)
    # sigmoid keeps outputs in [0,1], matching the normalized masks
    X = Activation('sigmoid',name='last_conv_sigmoid')(X)
    X = tf.keras.layers.Flatten(name='last_conv_flatten')(X)
    return X
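
The article stops at the layer definitions. As a minimal sketch of how the pieces fit together (the optimizer, loss, batch size, and epoch count below are assumptions, not from the original), the model can be assembled and trained like this:

import numpy as np
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model

# assemble the full network: (256, 256, 3) inputs, flattened sigmoid outputs
input_layer = Input(shape=(256, 256, 3))
output_layer = last_conv_module(input_layer)
model = Model(inputs=input_layer, outputs=output_layer)

# MSE against the flattened [0,1]-normalized masks is one simple choice
# given the sigmoid output; the hyperparameters here are illustrative
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(np.array(train_imgs), np.array(train_masks),
          validation_data=(np.array(valid_imgs), np.array(valid_masks)),
          batch_size=8, epochs=10)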

Reference


  • Cityscapes Image Pairs (Kaggle dataset)
  • Semantic Segmentation in PSPNet with Implementation in Keras: https://medium.com/analytics-vidhya/semantic-segmentation-in-pspnet-with-implementation-in-keras-4843d05fc025