Representation learning with BYOL and SimSiam

BYOL was the first method to show that useful low-dimensional representations can be learned in a self-supervised way without negative sampling. It inspired a series of simpler architectures, SimSiam among them.

Self-supervised learning (SSL) or representation learning is concerned with finding compressed and meaningful representations of inputs useful for downstream tasks. Due to the high dimensionality of images and the readily available transformations that do not change the semantics, SSL is particularly attractive in computer vision.

A standard goal of SSL in the visual domain is finding representations which are invariant under selected transformations, like random cropping, small rotations, horizontal flipping, etc. Thus, one may envision a loss that minimizes distances between predictions for different augmentations of the same image. Unfortunately, such a loss is prone to collapse: a model that always predicts a constant output attains a perfect score. For a long time it was believed that adding a penalty for small distances to negative samples (i.e., representations of different images) was necessary to prevent such a collapse, which led to the development of algorithms involving (hard) negative sampling.
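The collapse problem can be made concrete with a toy computation. Assuming a plain invariance loss (mean squared distance between the representations of two augmented views), a "collapsed" encoder that maps every input to the same constant vector achieves the minimal loss of zero, while any informative encoder incurs a positive loss:

```python
import numpy as np

def invariance_loss(z1, z2):
    """Naive invariance loss: mean squared distance between the
    representations of two augmented views of the same images."""
    return np.mean(np.sum((z1 - z2) ** 2, axis=1))

rng = np.random.default_rng(0)

# A collapsed encoder outputs the same constant vector for every input.
constant = np.ones((8, 16))                # batch of 8, 16-dim representations
print(invariance_loss(constant, constant))  # -> 0.0, a "perfect" solution

# An informative encoder produces view-dependent representations
# and therefore pays a positive loss.
z1 = rng.normal(size=(8, 16))
z2 = z1 + 0.1 * rng.normal(size=(8, 16))   # slightly perturbed second view
print(invariance_loss(z1, z2) > 0)          # -> True
```

This is why minimizing invariance alone is not enough, and why negative samples (or, as BYOL and SimSiam show, other mechanisms) are needed.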

Figure 1. [Gri20B] Performance of BYOL on ImageNet (linear evaluation) using multiple CNN architectures, in particular ResNet-50 and ResNet-200 ($2\times$), compared to other unsupervised and supervised (Sup.) baselines

In Bootstrap Your Own Latent (BYOL) [Gri20B], it was shown for the first time that SSL is viable without negative sampling. The algorithm is based on the rather curious observation that if a network is trained to mimic the representations of augmented images produced by a different, randomly initialized target network, the resulting learned representations are much better than the random targets themselves: on ImageNet, a linear classifier achieves an 18.8% top-1 accuracy on them, vs. 1.4% on the random targets.

This experimental result gives rise to the sketch of a representation learning algorithm: iteratively improve the fixed target network together with the learned, online network. While a collapse is still theoretically possible with this strategy, it has not been observed experimentally. BYOL adds some engineering on top of this basic idea, highlighted in the architecture diagram: the target is an exponential moving average of previous iterations of the online network, and the online network is extended by an extra predictor head whose output is compared to the target's projection. This kind of technique, where a “student” network is trained to reproduce the outputs of a “teacher” network, is sometimes called knowledge distillation, or simply distillation. See e.g. this nice summary from an NYU class, where non-contrastive SSL algorithms are grouped under “distillation”.
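The two BYOL ingredients described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumed parameter names, not the paper's implementation: the target weights are an exponential moving average (EMA) of the online weights, and the online prediction is compared to the target projection with a normalized mean squared error (which equals 2 − 2·cosine similarity for unit-normalized vectors):

```python
import numpy as np

def ema_update(target_params, online_params, tau=0.99):
    """BYOL target update: theta_target <- tau * theta_target + (1 - tau) * theta_online.
    Note: no gradients flow into the target; it is updated only via this EMA."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]

def byol_loss(p_online, z_target):
    """Normalized MSE between the online prediction and the target projection.
    For unit-normalized rows this equals 2 - 2 * cosine_similarity."""
    p = p_online / np.linalg.norm(p_online, axis=1, keepdims=True)
    z = z_target / np.linalg.norm(z_target, axis=1, keepdims=True)
    return np.mean(np.sum((p - z) ** 2, axis=1))

rng = np.random.default_rng(0)
online = [rng.normal(size=(4, 4))]       # stand-in for the online weights
target = [np.zeros((4, 4))]              # stand-in for the target weights
target = ema_update(target, online)      # target slowly tracks the online network

p = rng.normal(size=(8, 16))
print(byol_loss(p, p))                   # identical directions -> 0.0
```

The slowly moving target is what replaces the randomly initialized, fixed target from the bootstrap experiment above.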

An experimental study in Exploring Simple Siamese Representation Learning (SimSiam) [Che21E] showed that, contrary to the assumption in BYOL, using previous iterations of the online network as the target is not necessary to prevent collapse. One can thus use a simpler architecture, keeping a single network as both online and target and not computing gradients w.r.t. the weights of the “target” branch (a stop-gradient operation). The main improvement over BYOL here is a simpler implementation with comparable performance.
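In code, the SimSiam simplification amounts to a symmetric loss in which the projection on one branch is treated as a constant (the stop-gradient). A minimal NumPy sketch with illustrative function names, where `p_i` stands for the predictor output on view i and `z_i` for the encoder projection of view i:

```python
import numpy as np

def neg_cosine(p, z):
    """D(p, z): negative cosine similarity, averaged over the batch.
    z plays the role of stop_gradient(z): it is treated as a constant target."""
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return -np.mean(np.sum(p * z, axis=1))

def simsiam_loss(p1, p2, z1, z2):
    """Symmetric SimSiam loss: L = D(p1, stopgrad(z2))/2 + D(p2, stopgrad(z1))/2,
    with a single encoder producing z1, z2 and a predictor producing p1, p2."""
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 32))
z2 = rng.normal(size=(8, 32))

# With a hypothetical perfect predictor (p1 == z2, p2 == z1), the loss
# attains its minimum of -1 (up to floating point).
print(simsiam_loss(z2, z1, z1, z2))
```

In an autograd framework this corresponds to calling the equivalent of `detach()` on the target projections; SimSiam's ablations suggest it is this stop-gradient, not the momentum target, that prevents collapse.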

Both BYOL and SimSiam reach linear-classifier accuracies of above 70% on ImageNet, often coming remarkably close to supervised training with comparable networks, see the corresponding diagram. Multiple implementations for both BYOL and SimSiam are available on GitHub, see the references below.

Since then, other, by now very popular algorithms for self-supervised learning without negative pairs have emerged, like Barlow Twins [Zbo21B] and DINO [Car21E]. We plan to cover them in upcoming paper pills - stay tuned!

# References

• [Gri20B] Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning, Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko. Advances in Neural Information Processing Systems 33 (2020)
• [Che21E] Exploring Simple Siamese Representation Learning, Xinlei Chen, Kaiming He. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021)
• [Zbo21B] Barlow Twins: Self-Supervised Learning via Redundancy Reduction, Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, Stéphane Deny. Proceedings of the 38th International Conference on Machine Learning (2021)
• [Car21E] Emerging Properties in Self-Supervised Vision Transformers, Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin. Proceedings of the IEEE/CVF International Conference on Computer Vision (2021)