Representation learning with BYOL and SimSiam

BYOL was the first method to show that useful low-dimensional representations can be learned in a self-supervised way without negative sampling. It inspired a series of simpler architectures, SimSiam among them.

Self-supervised learning (SSL) or representation learning is concerned with finding compressed and meaningful representations of inputs useful for downstream tasks. Due to the high dimensionality of images and the readily available transformations that do not change the semantics, SSL is particularly attractive in computer vision.

A standard goal of SSL in the visual domain is finding representations which are invariant under selected transformations, like random cropping, small rotations, horizontal flipping, etc. Thus, one may envision a loss that minimizes distances between predictions for different augmentations of the same image. Unfortunately, such a loss is prone to collapse: a model that always predicts a constant output attains a perfect score. For a long time it was believed that adding a penalty for small distances to negative samples (i.e., representations of different images) was necessary to prevent such a collapse, which led to the development of algorithms involving (hard) negative sampling.
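The collapse problem can be made concrete with a toy computation. Assuming a plain invariance loss (mean squared distance between the representations of two augmented views), a "collapsed" encoder that maps every input to the same constant vector achieves the minimal loss of zero, while any informative encoder incurs a positive loss:

```python
import numpy as np

def invariance_loss(z1, z2):
    """Naive invariance loss: mean squared distance between the
    representations of two augmented views of the same images."""
    return np.mean(np.sum((z1 - z2) ** 2, axis=1))

rng = np.random.default_rng(0)

# A collapsed encoder outputs the same constant vector for every input.
constant = np.ones((8, 16))                # batch of 8, 16-dim representations
print(invariance_loss(constant, constant))  # -> 0.0, a "perfect" solution

# An informative encoder produces view-dependent representations
# and therefore pays a positive loss.
z1 = rng.normal(size=(8, 16))
z2 = z1 + 0.1 * rng.normal(size=(8, 16))   # slightly perturbed second view
print(invariance_loss(z1, z2) > 0)          # -> True
```

This is why minimizing invariance alone is not enough, and why negative samples (or, as BYOL and SimSiam show, other mechanisms) are needed.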

Figure 1. [Gri20B] Performance of BYOL on ImageNet (linear evaluation) using multiple CNN architectures, in particular ResNet-50 and ResNet-200 ($2\times$), compared to other unsupervised and supervised (Sup.) baselines

In Bootstrap Your Own Latent (BYOL) [Gri20B], it was shown for the first time that SSL is viable without negative sampling. The algorithm is based on the rather curious observation that if a network is trained to mimic the representations of augmented images produced by a different, randomly initialized target network, the resulting learned representations are much better than the random targets themselves: on ImageNet, a linear classifier achieves an 18.8% top-1 accuracy on them, vs. 1.4% on the random targets.

This experimental result gives rise to the sketch of a representation learning algorithm: iteratively improve the fixed target network together with the learned, online network. While a collapse is still theoretically possible with this strategy, it has not been observed experimentally. BYOL adds some engineering on top of this basic idea, highlighted in the architecture diagram: the target is an exponential moving average of previous iterations of the online network, and the online network is extended by an extra predictor head whose output is compared to the target's projection. This kind of technique, where a “student” network is trained to reproduce the outputs of a “teacher” network, is sometimes called knowledge distillation, or simply distillation. See e.g. this nice summary from an NYU class, where non-contrastive SSL algorithms are grouped under “distillation”.
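The two BYOL ingredients described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumed parameter names, not the paper's implementation: the target weights are an exponential moving average (EMA) of the online weights, and the online prediction is compared to the target projection with a normalized mean squared error (which equals 2 − 2·cosine similarity for unit-normalized vectors):

```python
import numpy as np

def ema_update(target_params, online_params, tau=0.99):
    """BYOL target update: theta_target <- tau * theta_target + (1 - tau) * theta_online.
    Note: no gradients flow into the target; it is updated only via this EMA."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]

def byol_loss(p_online, z_target):
    """Normalized MSE between the online prediction and the target projection.
    For unit-normalized rows this equals 2 - 2 * cosine_similarity."""
    p = p_online / np.linalg.norm(p_online, axis=1, keepdims=True)
    z = z_target / np.linalg.norm(z_target, axis=1, keepdims=True)
    return np.mean(np.sum((p - z) ** 2, axis=1))

rng = np.random.default_rng(0)
online = [rng.normal(size=(4, 4))]       # stand-in for the online weights
target = [np.zeros((4, 4))]              # stand-in for the target weights
target = ema_update(target, online)      # target slowly tracks the online network

p = rng.normal(size=(8, 16))
print(byol_loss(p, p))                   # identical directions -> 0.0
```

The slowly moving target is what replaces the randomly initialized, fixed target from the bootstrap experiment above.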

An experimental study in Exploring Simple Siamese Representation Learning (SimSiam) [Che21E] showed that, contrary to the assumption in BYOL, using previous iterations of the online network as the target is not necessary to prevent collapse. One can thus use a simpler architecture, keeping a single network as both online and target and not computing gradients w.r.t. the weights of the “target” branch (a stop-gradient operation). The main improvement over BYOL here is a simpler implementation with comparable performance.
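In code, the SimSiam simplification amounts to a symmetric loss in which the projection on one branch is treated as a constant (the stop-gradient). A minimal NumPy sketch with illustrative function names, where `p_i` stands for the predictor output on view i and `z_i` for the encoder projection of view i:

```python
import numpy as np

def neg_cosine(p, z):
    """D(p, z): negative cosine similarity, averaged over the batch.
    z plays the role of stop_gradient(z): it is treated as a constant target."""
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return -np.mean(np.sum(p * z, axis=1))

def simsiam_loss(p1, p2, z1, z2):
    """Symmetric SimSiam loss: L = D(p1, stopgrad(z2))/2 + D(p2, stopgrad(z1))/2,
    with a single encoder producing z1, z2 and a predictor producing p1, p2."""
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 32))
z2 = rng.normal(size=(8, 32))

# With a hypothetical perfect predictor (p1 == z2, p2 == z1), the loss
# attains its minimum of -1 (up to floating point).
print(simsiam_loss(z2, z1, z1, z2))
```

In an autograd framework this corresponds to calling the equivalent of `detach()` on the target projections; SimSiam's ablations suggest it is this stop-gradient, not the momentum target, that prevents collapse.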

Both BYOL and SimSiam reach linear-classifier accuracies of above 70% on ImageNet, often coming remarkably close to supervised training with comparable networks, see the corresponding diagram. Multiple implementations for both BYOL and SimSiam are available on GitHub, see the references below.

Since then, other, by now very popular algorithms for self-supervised learning without negative pairs have emerged, like Barlow Twins [Zbo21B] and DINO [Car21E]. We plan to cover them in upcoming paper pills - stay tuned!

# References

• [Gri20B] Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning, Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko. Advances in Neural Information Processing Systems 33 (2020)
• [Che21E] Exploring Simple Siamese Representation Learning, Xinlei Chen, Kaiming He. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021)
• [Zbo21B] Barlow Twins: Self-Supervised Learning via Redundancy Reduction, Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, Stéphane Deny. Proceedings of the 38th International Conference on Machine Learning (2021)
• [Car21E] Emerging Properties in Self-Supervised Vision Transformers, Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin. Proceedings of the IEEE/CVF International Conference on Computer Vision (2021)