Have we reached the era of self-supervised learning?
Data flows in every day. People work 24 hours a day, 7 days a week, across every corner of the world. And yet a great deal of data remains unlabeled, waiting to be used by a new model, a new training run, or a new update.
Or perhaps it never will be used, at least not as long as the world operates in a purely supervised manner, where every example must be annotated by hand before a model can learn from it.
The rise of self-supervised learning in recent years has revealed a new direction. Instead of creating annotations for every task, self-supervised learning splits the problem into pretext tasks (see my previous post on pretraining here) and downstream tasks. Pretext tasks focus on extracting representative features from the entire dataset without the guidance of any ground-truth annotations. Still, they require labels that are automatically generated from the dataset, usually through extensive data augmentation. Hence, in this article we use the terms unsupervised learning (the dataset is not annotated) and self-supervised learning (the tasks are supervised by auto-generated labels) interchangeably.
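To make the idea of auto-generated labels concrete, here is a minimal sketch using torchvision transforms. The pipeline and its hyperparameters are illustrative rather than the exact settings of any specific paper, and the helper name `two_views` is my own: two random augmentations of the same image become a positive pair, and that shared origin is the only "label" the pretext task needs.

```python
import torchvision.transforms as T

# A typical self-supervised augmentation pipeline (values are illustrative,
# not the exact recipe of any particular paper).
augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

def two_views(image):
    """Two random augmentations of the same image form a positive pair;
    the 'label' is simply the fact that they share a source image."""
    return augment(image), augment(image)
```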
Contrastive learning is an important category of self-supervised learning. It uses unlabeled datasets and losses that encode contrastive information (e.g., contrastive loss, InfoNCE loss, triplet loss) to train deep networks. Mainstream contrastive learners include SimCLR, SimSiam, and the MOCO series.
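As a reference point, a minimal InfoNCE implementation could look like the sketch below (PyTorch, in-batch negatives, temperature 0.07 as an illustrative default; the function name `info_nce` and the exact formulation are assumptions, and details differ between SimCLR, MOCO, and others):

```python
import torch
import torch.nn.functional as F

def info_nce(queries, keys, temperature=0.07):
    """queries, keys: (N, D) L2-normalized embeddings where queries[i] and
    keys[i] come from two views of the same sample. Each query treats its own
    key as the positive and the other N - 1 keys in the batch as negatives."""
    logits = queries @ keys.t() / temperature              # (N, N) similarity matrix
    targets = torch.arange(queries.size(0), device=queries.device)
    return F.cross_entropy(logits, targets)                # positives lie on the diagonal
```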
MOCO is an abbreviation of “momentum contrast.” The core idea was laid out in the first MOCO paper, which suggests viewing self-supervised learning in computer vision as a dictionary look-up problem:
“(quote from original paper) Computer vision, on the other hand, is more concerned with dictionary building, since the raw signal is in a continuous, high-dimensional space and is not structured for human communication… Although driven by various motivations, these methods (note: recent visual representation learning methods) can be thought of as building dynamic dictionaries… Unsupervised learning trains encoders to perform dictionary look-up: an encoded “query” should be similar to its matching key and dissimilar to others. Learning is formulated as minimizing a contrastive loss.”
In this article, we will do a detailed review of MOCO v1 to v3:
- v1 — the paper “Momentum Contrast for Unsupervised Visual Representation Learning” was published at CVPR 2020. It proposes a momentum update for the key ResNet encoder, together with a queue of sample embeddings, trained with the InfoNCE loss (see the sketch after this list).
- v2 — the paper “Improved Baselines with Momentum Contrastive Learning” followed shortly after, adopting two improvements from SimCLR: a) replacing the FC projection head with a 2-layer MLP head and b) extending the original data augmentation with blurring.
- v3 — the paper “An Empirical Study of Training Self-Supervised Vision Transformers” was published at ICCV 2021. The framework extends the single key-query pair to two key-query pairs, which form a SimSiam-style symmetrized contrastive loss (a sketch follows below). The backbone was also extended from ResNet-only to both ResNet and ViT.
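To preview the v1 mechanics listed above, here is a condensed training-step sketch in the spirit of the pseudocode in the MOCO paper. It is simplified (no shuffling BN, no distributed gather), the function names `momentum_update` and `moco_step` are my own, and the momentum/temperature values are the commonly cited defaults rather than a guaranteed reproduction:

```python
import torch
import torch.nn.functional as F

m, t = 0.999, 0.07            # momentum coefficient and temperature (commonly cited defaults)

@torch.no_grad()
def momentum_update(encoder_q, encoder_k):
    """The key encoder is an exponential moving average of the query encoder."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data = m * p_k.data + (1.0 - m) * p_q.data

def moco_step(encoder_q, encoder_k, queue, x_q, x_k):
    """queue: (K, D) tensor of past key embeddings, used as negatives."""
    q = F.normalize(encoder_q(x_q), dim=1)                  # (N, D) queries
    with torch.no_grad():
        momentum_update(encoder_q, encoder_k)
        k = F.normalize(encoder_k(x_k), dim=1)               # (N, D) keys, no gradient
    l_pos = (q * k).sum(dim=1, keepdim=True)                 # (N, 1) positive logits
    l_neg = q @ queue.t()                                    # (N, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / t
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    loss = F.cross_entropy(logits, labels)                   # positive sits at index 0
    new_queue = torch.cat([k, queue])[: queue.size(0)]       # enqueue new keys, dequeue oldest
    return loss, new_queue
```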
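The symmetrized loss mentioned for v3 can be summarized in a few lines. This is only a sketch under my own naming (`symmetrized_loss`, and a generic `contrastive_loss(q, k)` such as InfoNCE passed in as an argument), not the paper's exact code:

```python
import torch

def symmetrized_loss(encoder_q, encoder_k, x1, x2, contrastive_loss):
    """v3-style symmetric loss: each augmented view serves once as the query
    and once as the key, and the two directional losses are summed."""
    q1, q2 = encoder_q(x1), encoder_q(x2)          # queries from the base encoder
    with torch.no_grad():                          # keys from the momentum encoder,
        k1, k2 = encoder_k(x1), encoder_k(x2)      # no gradient flows through them
    return contrastive_loss(q1, k2) + contrastive_loss(q2, k1)
```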