Instance segmentation using Mask-RCNN

Posted on November 21, 2024 by Team Staple AI

A pragmatic guide to training a Mask-RCNN model on your custom dataset

In the field of computer vision, image segmentation refers to classifying the object category and extracting a pixel-by-pixel mask of each object in an image. So in addition to object classification and object localization, we also learn exactly which pixels of the image belong to the object of interest.

Mask-RCNN is a deep neural network (an extension of Faster-RCNN) that carries out instance segmentation and was released in 2017 by Facebook. This blog post aims to provide brief, pragmatic guidance on the TensorFlow implementation of Mask-RCNN. So if you are not familiar with CNNs or Mask-RCNN, I recommend going through this blog or reading the original paper by Facebook AI Research before proceeding further.

In general, Mask R-CNN has two stages after a CNN feature extractor extracts image features:

  • The first stage scans the image to generate region proposals, i.e., areas with a higher likelihood of containing an object. This stage is the Region Proposal Network (RPN).
  • In the second stage, the region proposals are classified, and bounding boxes and masks are predicted for the objects of interest.

At the end of this blog, you will be able to do the following tasks:

  • Web Scraping: Scrape images from Google Images to collect your own data.
  • Data Labeling: Label the objects and their masks in the images using the VIA labeling tool.
  • Configure the model: Set up the project for image segmentation in Google Colab with the Matterport implementation of Mask-RCNN on TensorFlow 2.
  • Training and Inference: Train and make predictions with your model on your custom dataset.

For ease of labeling, I chose the task of recognizing footballs in images and extracting their masks. The first step is collecting a dataset; feel free to skip it if you already have one for your application.

Step 1: Web scraping images from Google Images

For scraping images, we need a tool that can simulate a human user manually downloading images from Google. Selenium is an open-source browser-automation tool that gets the job done by driving your web browser from Python (we will use Google Chrome here). You also need the web driver matching your Google Chrome version (check yours under ‘About Google Chrome’). You can download the driver here. For example, I had Version 85.0.4183.121, so I downloaded the ChromeDriver for Chrome 85.

  • Install selenium with pip3 install selenium
  • Download Google Chrome if you don’t already have it.
  • Identify your Chrome version and download the corresponding web driver chromedriver. Place it in the same directory as the Python script scrape_images.py below.

In the script below, Selenium launches a Chrome browser, searches Google for each query word (the QUERIES list in the code), fetches the image URLs from Google, and downloads and saves the images to the local folder scraped_images. For each query, it downloads up to NUM_IMAGES images (100 here). Run the script with python3 scrape_images.py in the terminal; remember to change the query words and the number of images per query. Some images may fail with errors because they are simply inaccessible; just add more query words if you want more images, and filter the results to suit the application you are building. Here is a snippet from the code (it relies on two helpers, get_image_urls and download_image, defined in the full script), but you can download the full script here.

---------------

from selenium import webdriver
import io, os, time
import requests
import hashlib
from PIL import Image

# get_image_urls() and download_image() are defined in the full script.
def search_and_download(query_item, driver_path, target_folder, max_images_to_fetch=10):
    # Launch a Chrome instance and collect image URLs for the query
    with webdriver.Chrome(executable_path=driver_path) as mydriver:
        res = get_image_urls(query_item, max_images_to_fetch, wd=mydriver, sleeptime=0.5)

    # Download images from the fetched URLs
    for elem in res:
        download_image(target_folder, elem)

if __name__ == '__main__':
    QUERIES = ['football', 'soccer', 'soccer kick']
    NUM_IMAGES = 100

    curr_dir = os.getcwd()
    driver_path = os.path.join(curr_dir, 'chromedriver')
    target_folder = os.path.join(curr_dir, 'scraped_images')

    if not os.path.exists(target_folder):
        os.makedirs(target_folder)

    for query in QUERIES:
        search_and_download(query_item=query, driver_path=driver_path,
                            target_folder=target_folder,
                            max_images_to_fetch=NUM_IMAGES)

---------------
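
If you would rather write the download helper yourself, here is a minimal sketch of what a download_image along these lines could look like (my own version, not necessarily the one in the full script): it fetches the image bytes, names the file by a content hash to avoid duplicates, and re-encodes it as JPEG with PIL.

---------------

def download_image(folder_path, url):
    """Sketch of a download helper: fetch, hash-name, and save one image."""
    try:
        image_content = requests.get(url, timeout=10).content
    except Exception as e:
        print(f"ERROR - could not download {url} - {e}")
        return

    try:
        image = Image.open(io.BytesIO(image_content)).convert('RGB')
        # Name the file by a hash of its content to avoid duplicates
        file_name = hashlib.sha1(image_content).hexdigest()[:10] + '.jpg'
        with open(os.path.join(folder_path, file_name), 'wb') as f:
            image.save(f, "JPEG", quality=85)
    except Exception as e:
        print(f"ERROR - could not save {url} - {e}")

---------------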

Step 2: Label the dataset

Although a number of labeling tools are available, I used VIA as it is intuitive, lightweight, and easy to use. You can open it in your browser with this link, upload your dataset, and begin labeling. You can choose different region shapes such as circle, ellipse, rectangle, or polygon; since my object of interest was a football, I chose circles for a few images and polygons for others. Set the region_attributes (for example class_name, class_id, image_quality, etc.) before you start labeling, and learn a few keyboard shortcuts to speed up the labeling process. I prepared two batches for labeling, one for training and one for validation. You can preview the annotations via Annotation → Preview Annotations and export them as a JSON file.

Snapshot of the VIA labeling tool

At the end of the labeling process, my folder structure had two folders, train and val, each with its images and an annot.json annotations file.
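
For reference, a VIA 2.x export looks roughly like the illustrative annot.json below (the coordinates and sizes are made-up). The top-level keys are the file name concatenated with the file size, and each region stores its geometry in shape_attributes — exactly the fields the loading code in Step 4 reads. (In VIA 1.x, regions is a dict rather than a list, which is why the loader later checks for both.)

-----------------

{
  "image1.jpg12345": {
    "filename": "image1.jpg",
    "size": 12345,
    "regions": [
      {
        "shape_attributes": { "name": "circle", "cx": 310, "cy": 245, "r": 38 },
        "region_attributes": { "class_name": "football" }
      },
      {
        "shape_attributes": {
          "name": "polygon",
          "all_points_x": [100, 132, 150, 118],
          "all_points_y": [200, 195, 230, 241]
        },
        "region_attributes": { "class_name": "football" }
      }
    ],
    "file_attributes": {}
  }
}

-----------------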

Step 3: Set up the project

Now that we have our dataset and labels ready, we can set up our Mask-RCNN project. A well-known implementation of Mask-RCNN by Matterport can be found here. Although it is only compatible with TensorFlow 1.3 and Keras 2.1.0, several people have come up with versions adapted to TensorFlow 2, which bundles Keras. I used a version by TomGross, which you can access by cloning the git repo below.

This will create a folder named Mask_RCNN containing the source files and sample scripts. Copy the folder named mrcnn and the COCO weights file mask_rcnn_coco.h5 from the Mask_RCNN folder and place them alongside the football_data folder, so that the directory structure looks like the outline below. Here football_segmentation.ipynb is the Jupyter notebook we will use for training, and logs is a folder we create to hold the trained model checkpoints and TensorBoard information.
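
In outline (reconstructed from the description above):

-----------------

.
├── football_data/
│   ├── train/          # training images + annot.json
│   └── val/            # validation images + annot.json
├── mrcnn/              # copied from the Mask_RCNN repo
├── mask_rcnn_coco.h5   # COCO pre-trained weights
├── logs/               # checkpoints + TensorBoard files
└── football_segmentation.ipynb

-----------------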

Step 4: Model Training

With the directory structure set up in Step 3, we are ready to train the Mask-RCNN model on the football dataset. In football_segmentation.ipynb, first import the necessary packages and modules from the mrcnn folder, and define the root directory, the COCO weights path, and the logs directory. The model and utils modules from mrcnn contain the functions and classes for constructing the Mask-RCNN architecture.

-----------------

import os, sys, json, datetime
import numpy as np
import skimage.draw
import skimage.io

from mrcnn.config import Config
from mrcnn import model as modellib, utils

# Root directory of the project
ROOT_DIR = os.getcwd()
sys.path.append(ROOT_DIR)

# Path to the COCO pre-trained weights file
COCO_WEIGHTS_PATH = os.path.join(ROOT_DIR, "mask_rcnn_coco.h5")

# Directory to save logs and model checkpoints
DEFAULT_LOGS_DIR = os.path.join(ROOT_DIR, "logs")

-----------------

The Matterport implementation ships a base configuration, imported above as Config from mrcnn.config. For our custom dataset, we subclass it and override a few values inherited from the parent class. To see the default values of the hyperparameters (for example, LEARNING_RATE, STEPS_PER_EPOCH, etc.), refer to the parent class in the script mrcnn/config.py. The custom configuration is defined in the FootballConfig class in the code below.

-----------------

class FootballConfig(Config):
    """Configuration for training on the football dataset.
    Derives from the base Config class and overrides some values.
    """
    # Give the configuration a recognizable name
    NAME = "football"

    # Adjust this according to your GPU's memory
    IMAGES_PER_GPU = 1

    # Number of classes (including background)
    NUM_CLASSES = 1 + 1  # Background + football

    # Number of training steps per epoch
    STEPS_PER_EPOCH = 100

    # Skip detections with < 90% confidence
    DETECTION_MIN_CONFIDENCE = 0.9


class FootballDataset(utils.Dataset):

    def load_Football(self, dataset_dir, subset):
        """Load a subset of the football dataset.
        dataset_dir: Root directory of the dataset.
        subset: Subset to load: train or val
        """
        # Add the football class
        self.add_class("football", 1, "football")

        # Choose between the train and validation dataset
        assert subset in ["train", "val"]
        dataset_dir = os.path.join(dataset_dir, subset)

        annotations = json.load(open(os.path.join(dataset_dir, "annot.json")))
        annotations = list(annotations.values())  # don't need the dict keys

        # The VIA tool saves images in the JSON even if they don't have any
        # annotations. Skip unannotated images.
        annotations = [a for a in annotations if a['regions']]

        # Add images
        for a in annotations:
            # Get the x, y coordinates of the points of the polygons that make
            # up the outline of each object instance. These are stored in
            # shape_attributes (see the JSON format above).
            # The if condition is needed to support VIA versions 1.x and 2.x.
            if type(a['regions']) is dict:
                polygons = [r['shape_attributes'] for r in a['regions'].values()]
            else:
                polygons = [r['shape_attributes'] for r in a['regions']]

            # load_mask() needs the image size to convert polygons to masks.
            # Unfortunately, VIA doesn't include it in the JSON, so we must
            # read the image. This is only manageable because the dataset is
            # tiny; alternatively, you could add the image sizes to the
            # annotation JSONs separately after VIA labeling.
            image_path = os.path.join(dataset_dir, a['filename'])
            image = skimage.io.imread(image_path)
            height, width = image.shape[:2]

            self.add_image(
                "football",
                image_id=a['filename'],  # use the file name as a unique image id
                path=image_path,
                width=width, height=height,
                polygons=polygons)

    def load_mask(self, image_id):
        """Generate instance masks for an image.
        Returns:
        masks: A bool array of shape [height, width, instance count] with
            one mask per instance.
        class_ids: a 1D array of class IDs of the instance masks.
        """
        # If not a football dataset image, delegate to the parent class.
        image_info = self.image_info[image_id]
        if image_info["source"] != "football":
            return super(self.__class__, self).load_mask(image_id)

        # Convert polygons to a bitmap mask of shape
        # [height, width, instance_count]
        info = self.image_info[image_id]

        mask = np.zeros([info["height"], info["width"], len(info["polygons"])],
                        dtype=np.uint8)

        for i, p in enumerate(info["polygons"]):
            # Get the indexes of pixels inside the region and set them to 1.
            # If you use any region shape other than circle or polygon, add it here.
            # Note: VIA stores circle centers as (cx, cy) = (col, row) while
            # skimage.draw expects (row, col); shape= clips regions that spill
            # past the image border. In scikit-image >= 0.19, use
            # skimage.draw.disk((cy, cx), r) instead of the removed circle().
            if 'r' in p:
                rr, cc = skimage.draw.circle(p['cy'], p['cx'], p['r'],
                                             shape=mask.shape[:2])
            else:
                rr, cc = skimage.draw.polygon(p['all_points_y'], p['all_points_x'],
                                              shape=mask.shape[:2])
            mask[rr, cc, i] = 1

        # Return the mask and an array of class IDs for each instance. Since
        # we have only one class, we return an array of 1s.
        # (np.bool was removed in newer NumPy; the built-in bool works everywhere.)
        return mask.astype(bool), np.ones([mask.shape[-1]], dtype=np.int32)

-----------------

You can visualize an image and its generated mask to sanity-check the loader. If you save the masks to disk as images rather than plotting them, remember to write 255 instead of 1 where load_mask fills the mask (mask[rr, cc, i] = 1 above), or the mask will appear black. The image and the corresponding mask are shown below.

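A minimal sketch of such a visualization (my own version, using only the standard utils.Dataset API of the FootballDataset class and the ROOT_DIR defined above):

-----------------

import matplotlib.pyplot as plt

# Load one training image and its mask via FootballDataset
dataset = FootballDataset()
dataset.load_Football(os.path.join(ROOT_DIR, "football_data"), "train")
dataset.prepare()

image_id = dataset.image_ids[0]                 # pick any image id
image = dataset.load_image(image_id)
mask, class_ids = dataset.load_mask(image_id)

# Show the image and the union of its instance masks side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
ax1.imshow(image)
ax1.set_title("Image")
ax1.axis("off")
ax2.imshow(mask.max(axis=-1), cmap="gray")      # collapse instances to 2D
ax2.set_title("Mask")
ax2.axis("off")
plt.show()

-----------------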

Image and the corresponding mask

For training, we load the configuration, instantiate the FootballDataset class defined above for both the train and val subsets, and prepare each with the prepare function. The model is created through modellib and initialized with the configuration defined by FootballConfig. We train only the head layers of the model, starting from the COCO weights, for 30 epochs at the default learning rate (the model.train call below); you can tune these hyperparameters. Default callbacks write TensorBoard summaries and save a checkpoint for each epoch in the logs folder.

def train(model, dataset):
    """Train the model."""
    # Training dataset
    dataset_train = FootballDataset()
    dataset_train.load_Football(dataset, "train")
    dataset_train.prepare()

    # Validation dataset
    dataset_val = FootballDataset()
    dataset_val.load_Football(dataset, "val")
    dataset_val.prepare()

    # Since we're using a very small dataset, and starting from
    # COCO-trained weights, we don't need to train for long. Also,
    # no need to train all layers; just the heads should do it.
    model.train(dataset_train, dataset_val,
                learning_rate=config.LEARNING_RATE,
                epochs=30,
                layers='heads')


config = FootballConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir=DEFAULT_LOGS_DIR)

weights_path = COCO_WEIGHTS_PATH
dataset = os.path.join(ROOT_DIR, "football_data")

# Exclude the COCO head layers because they require a matching number of classes
model.load_weights(weights_path, by_name=True, exclude=[
    "mrcnn_class_logits", "mrcnn_bbox_fc",
    "mrcnn_bbox", "mrcnn_mask"])

train(model, dataset)
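
If a run gets interrupted (Colab sessions time out), you don't have to start over: since a checkpoint is saved after every epoch, you can resume from the newest one. The Matterport model class provides find_last() to locate it (note that very old versions of the repo returned a tuple from find_last() rather than a single path). A minimal sketch:

-----------------

# Resume from the most recent checkpoint in the logs folder
model = modellib.MaskRCNN(mode="training", config=config, model_dir=DEFAULT_LOGS_DIR)
last_weights = model.find_last()   # path to the newest .h5 checkpoint
model.load_weights(last_weights, by_name=True)
train(model, dataset)

-----------------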

Training reports the different loss values at each epoch. Since Mask-RCNN involves an RPN (Region Proposal Network) that predicts region proposals, along with class labels, bounding boxes, and mask predictions, there are five different losses to minimize. For more information about the losses, see the Mask-RCNN architecture in this article, read the original paper here, or see the script model.py.

  • rpn_class_loss = RPN anchor classifier loss
  • rpn_bbox_loss = RPN bounding box regression loss
  • mrcnn_class_loss = loss for the classifier head of Mask-RCNN
  • mrcnn_bbox_loss = loss for Mask-RCNN bounding box refinement
  • mrcnn_mask_loss = binary cross-entropy loss for the mask head
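
If one of these losses dominates or lags, the base Config also exposes a LOSS_WEIGHTS dictionary with exactly these five keys (see mrcnn/config.py), which you can override like any other hyperparameter. A sketch with a hypothetical re-weighting:

-----------------

class WeightedFootballConfig(FootballConfig):
    # Hypothetical example: double the weight of the mask loss if the
    # masks lag behind the boxes. Keys match Config.LOSS_WEIGHTS.
    LOSS_WEIGHTS = {
        "rpn_class_loss": 1.0,
        "rpn_bbox_loss": 1.0,
        "mrcnn_class_loss": 1.0,
        "mrcnn_bbox_loss": 1.0,
        "mrcnn_mask_loss": 2.0,
    }

-----------------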

I would suggest training on a GPU because Mask-RCNN is computationally heavy to train on a CPU. If you have a machine with a GPU, you can start training right away; if you don’t, you can use Google Colab, which provides free GPU access. Just mount your Google Drive in your Colab notebook using the code below. You will be asked to follow a link to authorize the mount; copy the authorization code back into the notebook, and the data in your Google Drive becomes accessible in Colab.

from google.colab import drive
drive.mount('/content/gdrive')

The entire training for 30 epochs took about 30 minutes for me (78 training + 21 validation images). If you are training on your own machine, you can follow the training by running tensorboard --logdir=/path_to_logs in the terminal and then opening localhost:6006 in your browser. On Google Colab, you can visualize the losses through TensorBoard using the magic commands below.

%load_ext tensorboard
%tensorboard --logdir=<path_to_logs> --host=127.0.0.1

TensorBoard visualization of the losses

Step 5: Inference

For inference, you can load the model weights from one of the checkpoints saved in the logs folder and visualize the predictions on unseen images. The first element of results (r = results[0] in the code below) is a dictionary with the following keys:

  • rois: Regions of interest (bounding boxes) of the detected objects.
  • masks: Masks of the detected objects.
  • class_ids: Class integer IDs of the detected objects.
  • scores: Confidence score for each predicted class.
import os
import skimage.io
import matplotlib.pyplot as plt
from mrcnn import visualize

ROOT_DIR = os.getcwd()
MODEL_DIR = os.path.join(ROOT_DIR, "logs")
weights_path = os.path.join(MODEL_DIR, "mask_rcnn_football_0030.h5")

class InferenceConfig(FootballConfig):
    # Set the batch size to 1 since we'll be running inference on
    # one image at a time. Batch size = GPU_COUNT * IMAGES_PER_GPU
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

config = InferenceConfig()
# config.display()

# Create the model object in inference mode
model = modellib.MaskRCNN(mode="inference", model_dir=MODEL_DIR, config=config)

# Load the weights trained on the football dataset
model.load_weights(weights_path, by_name=True)

# Load an image from a path
image = skimage.io.imread(os.path.join(ROOT_DIR, 'img.jpg'))

# Run the prediction
results = model.detect([image], verbose=1)

# Visualize the results
r = results[0]
class_names = ['BG', 'football']
visualize.display_instances(image, r['rois'], r['masks'], r['class_ids'],
                            class_names, r['scores'],
                            title="Predictions")
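
Beyond the built-in overlay, the results dictionary is easy to consume directly. As a minimal sketch (my own, assuming at least one detection), here is how you could cut the first detected football out of the image using its mask and box:

-----------------

import numpy as np

# First detection: a bool mask [H, W] and its bounding box [y1, x1, y2, x2]
mask0 = r['masks'][:, :, 0]
y1, x1, y2, x2 = r['rois'][0]

# Zero out everything outside the mask, then crop to the bounding box
football_only = image * mask0[..., np.newaxis]
crop = football_only[y1:y2, x1:x2]
print("score:", r['scores'][0], "class:", class_names[r['class_ids'][0]])

plt.imshow(crop)
plt.axis("off")
plt.show()

-----------------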

The model does a pretty good job of identifying footballs in the images. It can also recognize partially occluded footballs and ones that are difficult to isolate, like in the image below.

In this tutorial, you learned to collect and label data, set up a Mask-RCNN project, and train a model to perform instance segmentation. The labeled data, the entire code, and the trained weights are available in my GitHub repo.

Hope you enjoyed the post! Leave your comments and suggestions below.

References

  • Matterport implementation of Mask-RCNN.
  • Original Mask-RCNN paper.
  • Blog post about object detection and instance segmentation (R-CNN, Fast-RCNN, Faster-RCNN, Mask-RCNN)
  • Blog post about Mask-RCNN
