The Idea

Ginosa Marina (TA) - 01/01/2024

During the Christmas holidays, I took the following photo and immediately thought that it would be cool to remove the people in the background as well as the wooden stick.

While there are plenty of iOS apps that offer magic-erasing capabilities, this feature is usually a premium one, locked behind a subscription.

So, I challenged myself to implement it.

The first step is segmentation: we need to be able to isolate and identify the objects we want to remove. There are many segmentation models out there. The one I chose is **Segment Anything** by Meta AI Research.

The plan was clear: download the model, make it run on my M1 Pro, convert it to CoreML, and plug it into an iOS demo app. But the journey took a different turn.

Goals

  1. Introduce and understand the basic building blocks of the Segment Anything model architecture.
  2. Convert the ML model from PyTorch to CoreML using coremltools (a rough preview of this workflow is sketched right after this list).
  3. Plug the CoreML model into an iOS demo app to implement an end-to-end use case: starting from the photo library, then preprocessing, model inference, postprocessing, and finally output display.
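
Goal 2 boils down to a standard coremltools workflow: trace the PyTorch module with TorchScript, convert the trace to an ML Program, and save it as an .mlpackage. The sketch below is only a generic preview of that workflow; it uses a tiny placeholder module instead of the actual SAM weights, and the input shape is just an assumption.

```python
import torch
import coremltools as ct

# Placeholder module standing in for the SAM component to be converted.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.ReLU())
model.eval()

# SAM-style 1024x1024 RGB input; adjust the shape to the component being converted.
example_input = torch.rand(1, 3, 1024, 1024)
traced = torch.jit.trace(model, example_input)

# Convert the traced graph to an ML Program and save it as an .mlpackage.
mlmodel = ct.convert(
    traced,
    convert_to="mlprogram",
    inputs=[ct.TensorType(name="image", shape=example_input.shape)],
)
mlmodel.save("SegmentAnything.mlpackage")
```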

Segment Anything

Segment Anything comprises an Image Encoder that extracts image embeddings, a Prompt Encoder that encodes prompt points (pixel coordinates), and a Mask Decoder that combines the image and point embeddings to estimate the segmentation mask along with a score of its quality.
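
These three components map directly onto the SamPredictor API of the segment-anything package: set_image runs the image encoder once per image, and each predict call runs the prompt encoder and mask decoder for the given points. A minimal sketch, assuming the package is installed and the ViT-B checkpoint has been downloaded locally:

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Assumes the ViT-B checkpoint file sits next to the script.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = np.zeros((768, 1024, 3), dtype=np.uint8)  # stand-in for a real RGB photo
predictor.set_image(image)                        # image encoder: run once per image

# One foreground click (label 1) at pixel (x, y): prompt encoder + mask decoder.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
print(masks.shape, scores)  # 3 candidate masks with their predicted quality scores
```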

The pre-trained Segment Anything models are considerably large compared to the size of an iOS app, with most of the weight coming from the image encoder. They come in three sizes: ViT-H (2.56 GB), ViT-L (1.25 GB), and ViT-B (375 MB). While the first two are unlikely to be used in an app, the last one, albeit still relatively large, might be feasible.

Upon researching online, I discovered MobileSAM as a viable alternative for use in an app. It incorporates a smaller image encoder trained using distillation from ViT-H, while maintaining the same mask decoder and prompt encoder, resulting in a total weight of just 40 MB.
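
Because the prompt encoder and mask decoder are unchanged, MobileSAM is designed as a drop-in replacement: the only difference is loading the distilled "vit_t" encoder from its checkpoint. A short sketch, with the import and checkpoint path taken from the MobileSAM repository (adjust them to your local setup):

```python
import numpy as np
from mobile_sam import sam_model_registry, SamPredictor

# "vit_t" is the distilled TinyViT image encoder; the whole checkpoint is ~40 MB.
mobile_sam = sam_model_registry["vit_t"](checkpoint="weights/mobile_sam.pt")
mobile_sam.eval()

# Same SamPredictor interface as the original Segment Anything.
predictor = SamPredictor(mobile_sam)
predictor.set_image(np.zeros((768, 1024, 3), dtype=np.uint8))  # stand-in image
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
)
```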

[Figure: Segment Anything architecture diagram]

Model Architecture

Looking at how the SamPredictor is implemented in MobileSAM, we can isolate 5 stages:

1. Preprocessing

2. Image Encoder - Vision Transformer (ViT)
