Unpacking the data-centric AI concepts used in Segment Anything, the first foundation model for image segmentation
Artificial intelligence (AI) has made remarkable progress, especially in the development of foundation models, which are trained on vast amounts of data and can be adapted to a wide range of downstream tasks.
A notable success of foundation models is Large Language Models (LLMs). These models can perform complex tasks with high accuracy, such as language translation, text summarization, and question answering.
Foundation models are also beginning to change the game in Computer Vision. Segment Anything by Meta is a recent development that is causing a stir.
Segment Anything’s success can be attributed to its large labeled dataset, which has played a crucial role in enabling its remarkable performance. The architecture of the model, as described in the Segment Anything paper, is surprisingly simple and lightweight.
In this article, based on insights from our recent survey papers [1,2], let’s take a closer look at Segment Anything through the lens of Data-Centric AI, a growing concept in the data science community.
What can Segment Anything do?
In a nutshell, the image segmentation task is to predict a mask that separates regions of interest in an image, such as an object or a person. Segmentation is a very important task in Computer Vision, as it makes an image more meaningful and easier to analyze.
The difference between Segment Anything and other image segmentation approaches lies in its introduction of prompts to specify what to segment. The prompts can be vague, such as a point or a box.
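To make the idea of a point prompt concrete, here is a minimal, self-contained sketch — not SAM’s algorithm, just a classical flood fill with hypothetical names — in which a single clicked point selects a connected region of similar pixels and yields a binary mask:

```python
from collections import deque

def segment_from_point(image, seed, threshold=0):
    """Toy point-prompt segmentation: flood-fill the region of
    similar pixels around the seed point and return a binary mask."""
    h, w = len(image), len(image[0])
    sr, sc = seed
    target = image[sr][sc]
    mask = [[0] * w for _ in range(h)]
    queue = deque([seed])
    mask[sr][sc] = 1
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not mask[nr][nc] \
                    and abs(image[nr][nc] - target) <= threshold:
                mask[nr][nc] = 1
                queue.append((nr, nc))
    return mask

# A 4x4 image with a bright 2x2 object in the top-left corner.
img = [
    [9, 9, 0, 0],
    [9, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
mask = segment_from_point(img, seed=(0, 0))
# The mask covers exactly the bright 2x2 region around the click.
```

Segment Anything replaces this kind of hand-coded region growing with a learned model, but the interface is the same: a vague prompt in, a mask out.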
What is data-centric AI?
Data-Centric AI is a novel approach to building AI systems that has been gaining ground, championed by Andrew Ng.
Data-Centric AI is the discipline of systematically engineering the data used to build an AI system. —Andrew Ng
Previously, our main focus was to develop better models while keeping the data largely unchanged; this is known as model-centric AI. However, this approach can be problematic in real-world scenarios, as it does not account for problems that can arise in the data, including inaccurate labels, duplicates, and bias. Consequently, overfitting a dataset does not necessarily result in better model behavior.
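As a toy illustration of why unexamined data can mislead a model-centric workflow, the sketch below — all names hypothetical — flags two of the problems mentioned above, exact duplicates and conflicting labels, in a small dataset:

```python
def audit_dataset(examples):
    """Toy data audit: flag exact duplicates and conflicting labels,
    two data problems that model-centric workflows tend to overlook."""
    seen = {}  # feature tuple -> first label seen for those features
    duplicates, conflicts = [], []
    for i, (features, label) in enumerate(examples):
        key = tuple(features)
        if key in seen:
            duplicates.append(i)
            if seen[key] != label:
                conflicts.append(i)  # same input, different label
        else:
            seen[key] = label
    return duplicates, conflicts

data = [
    ([1.0, 2.0], "cat"),
    ([3.0, 4.0], "dog"),
    ([1.0, 2.0], "cat"),  # exact duplicate
    ([3.0, 4.0], "cat"),  # same features, conflicting label
]
dups, bad = audit_dataset(data)
# dups == [2, 3], bad == [3]
```

Real data-centric pipelines use far more sophisticated checks (near-duplicate detection, label-noise estimation), but the principle is the same: inspect and fix the data before tuning the model.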
Data-centric AI, on the other hand, prioritizes improving the quality and quantity of the data used to build AI systems. The focus is on the data itself, with the models kept relatively fixed. Taking a data-centric approach to AI system development holds more promise in real-world applications, as a model’s full potential is determined by the data used for training.
It is crucial to distinguish between “data-centric” and “data-driven” approaches. “Data-driven” methods merely rely on data to guide AI development, but the focus remains on the models rather than on engineering the data, making them fundamentally different from “data-centric” approaches.
The Data-Centric AI framework covers three main goals:
- Training data development involves collecting and generating high-quality and diverse data to facilitate the training of machine learning models.
- Inference data development involves building innovative test sets that provide fine-grained insight into a model, or that unlock specific model capabilities by engineering the data inputs, such as prompt engineering.
- Data maintenance aims to ensure data quality and reliability in a constantly changing environment.
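As a toy example of inference data development, the snippet below — purely illustrative names and templates — engineers a small suite of test inputs from one template, the kind of systematic probing that prompt engineering formalizes:

```python
def build_prompt_suite(template, slots):
    """Toy 'inference data development': generate a small test suite
    of input variants from one template, to probe a model systematically."""
    return [template.format(**combo) for combo in slots]

suite = build_prompt_suite(
    "Segment the {obj} in the {scene}.",
    [
        {"obj": "person", "scene": "street photo"},
        {"obj": "car", "scene": "street photo"},
        {"obj": "person", "scene": "night scene"},
    ],
)
# suite[0] == "Segment the person in the street photo."
```

The point is not the string formatting but the mindset: the test data is engineered deliberately, rather than whatever happened to be left over after training.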
The model used in Segment Anything
The design of the model is surprisingly simple. The model mainly consists of three parts:
- Prompt encoder: This part computes the representation of the prompt, either through positional encodings or with convolutions.
- Image encoder: This part directly uses the Vision Transformer (ViT) with no special modifications.
- Lightweight mask decoder: This part mainly fuses the prompt embedding and the image embedding, using mechanisms such as attention. It is called lightweight because it has only a few layers.
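To make the three-part structure concrete, here is a minimal NumPy sketch of the information flow, with each component reduced to a stand-in: a random linear projection instead of a real ViT, a sinusoidal positional encoding for a point prompt, and simple additive fusion instead of attention. Every name and dimension here is illustrative, not SAM’s actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # shared embedding width (illustrative)

def image_encoder(image):
    """Stand-in for the ViT image encoder: one random linear projection."""
    W = rng.standard_normal((image.size, D)) * 0.01
    return image.reshape(-1) @ W  # (D,) image embedding

def prompt_encoder(point, hw):
    """Stand-in for the prompt encoder: sinusoidal positional
    encoding of a normalized (x, y) point prompt."""
    x, y = point[0] / hw[1], point[1] / hw[0]
    freqs = 2.0 ** np.arange(D // 4)
    return np.concatenate([np.sin(x * freqs), np.cos(x * freqs),
                           np.sin(y * freqs), np.cos(y * freqs)])

def light_mask_decoder(img_emb, prm_emb, hw):
    """Stand-in for the lightweight decoder: fuse the two embeddings
    (additively here, attention in the real model) and project back
    to a per-pixel map of mask logits."""
    fused = img_emb + prm_emb
    W = rng.standard_normal((D, hw[0] * hw[1])) * 0.01
    return (fused @ W).reshape(hw)  # (H, W) mask logits

image = rng.standard_normal((8, 8))
logits = light_mask_decoder(image_encoder(image),
                            prompt_encoder((3, 4), image.shape),
                            image.shape)
```

The shape of the pipeline — heavy image encoder once per image, cheap prompt encoding and decoding per prompt — is what makes interactive use practical.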
The lightweight mask decoder is interesting, as it allows the model to be deployed easily, even on CPUs alone. Below is how the authors of Segment Anything put it:
Surprisingly, we find that a simple design satisfies all three constraints: a powerful image encoder computes an image embedding, a prompt encoder embeds prompts, and then the two information sources are combined in a lightweight mask decoder that predicts segmentation masks.
So the secret to Segment Anything’s great performance is probably not the model’s design, as it’s very simple and lightweight.
Data-Centric AI Concepts in Segment Anything
The core of Segment Anything training lies in a large annotated dataset containing over a billion masks, which is 400 times larger than existing segmentation datasets. How did they achieve this? The authors used a data engine to perform the annotation, which can be broadly divided into three steps:
- Assisted manual annotation: This step can be understood as an active learning process. First, an initial model is trained on public datasets. The annotators then refine the masks predicted by the model. Finally, the model is retrained on the newly annotated data. This loop was repeated six times and ultimately produced 4.3 million mask annotations.
- Semi-automatic annotation: The goal of this step is to increase the diversity of the masks, and it can also be understood as an active learning process. In simple terms, if the model can already generate good masks automatically, human annotators do not need to label them, and human effort can be focused on the masks where the model is not confident enough. The method used to find confident masks is quite interesting: it involves running object detection on the masks from the first step. For example, suppose an image contains 20 possible masks. We first segment it with the current model, but this will likely annotate only some of the masks well. We then need to automatically identify which masks are good (confident). The paper’s approach is to run object detection on the predicted masks to see whether an object can be found in each of them; if so, the corresponding mask is considered confident. Suppose this process identifies eight confident masks; the annotators then label the remaining 12, saving human effort. This process was repeated five times, adding another 5.9 million mask annotations.
- Fully automatic annotation: In a nutshell, this step uses the model trained in the previous steps to annotate data. Several strategies were used to improve annotation quality, including:
(1) filtering out less confident masks based on the predicted Intersection over Union (IoU) value (the model has a head that predicts IoU);
(2) keeping only stable masks, meaning masks that remain largely unchanged when the threshold is moved slightly above or below 0.5. Specifically, for each pixel, the model outputs a value between 0 and 1, and we typically use 0.5 as the threshold to decide whether the pixel belongs to the mask. Stability means that when the threshold is shifted around 0.5 (for example, to 0.45 and 0.55), the resulting masks stay largely the same, indicating that the model’s predicted probabilities sit well away from the 0.5 decision boundary;
(3) performing deduplication with non-maximum suppression (NMS).
This step produced 1.1 billion masks (a more than 100x increase in quantity).
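The three filters above can be sketched as follows. This is a toy reconstruction with assumed thresholds (0.45/0.55 for the stability check, a mask-overlap cutoff for NMS-style deduplication), not the paper’s actual implementation:

```python
import numpy as np

def stable(prob_map, lo=0.45, hi=0.55, min_iou=0.95):
    """Stability filter: the mask must barely change when the 0.5
    threshold is nudged down to `lo` and up to `hi`."""
    a, b = prob_map > lo, prob_map > hi
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return union > 0 and inter / union >= min_iou

def mask_iou(m1, m2):
    inter = np.logical_and(m1, m2).sum()
    union = np.logical_or(m1, m2).sum()
    return inter / union if union else 0.0

def filter_masks(candidates, iou_thresh=0.88, dedup_thresh=0.7):
    """Toy version of the fully automatic stage's three filters:
    (1) drop masks with a low predicted IoU, (2) drop unstable masks,
    (3) greedy NMS-style deduplication by mask overlap."""
    kept = []
    for prob_map, predicted_iou in sorted(candidates,
                                          key=lambda c: -c[1]):
        if predicted_iou < iou_thresh or not stable(prob_map):
            continue
        mask = prob_map > 0.5
        if all(mask_iou(mask, k) < dedup_thresh for k in kept):
            kept.append(mask)
    return kept

confident = np.full((4, 4), 0.9)  # probabilities far from 0.5
wobbly = np.full((4, 4), 0.52)    # hugs the 0.5 boundary -> unstable
candidates = [
    (confident, 0.95),
    (wobbly, 0.95),            # removed by the stability filter
    (confident.copy(), 0.90),  # removed as a duplicate
    (confident.copy(), 0.50),  # removed by the IoU filter
]
kept = filter_masks(candidates)
```

With these inputs, only the first confident mask survives: one candidate fails the stability check, one falls below the predicted-IoU cutoff, and one is suppressed as a duplicate.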
Does this process sound familiar? That’s right: Reinforcement Learning from Human Feedback (RLHF), used in ChatGPT, is quite similar to the process described above. The commonality between the two approaches is that, instead of relying on humans to annotate the data directly, a model is first trained with human input and then used to annotate the data. In RLHF, a reward model is trained to provide rewards for reinforcement learning, while in Segment Anything, the model is trained to annotate images directly.
Summary
Segment Anything’s main contribution lies in its large annotated dataset, which demonstrates the crucial importance of the data-centric AI concept. The success of foundation models in computer vision could be considered inevitable, but it is surprising that it happened so quickly. Going forward, I believe other subfields of AI, and even fields unrelated to AI or computer science, will see foundation models emerge in due course.
Regardless of how technology evolves, improving the quality and quantity of data will always be an effective way to improve AI performance, making the concept of data-centric AI increasingly important.
I hope this article can inspire you in your own work. You can learn more about the data-centric AI framework in the following papers/resources:
If you found this article interesting, you can also check out my previous article: What are the data-centric AI concepts behind GPT models?
Stay tuned!