Self-supervised learning (SSL), or representation learning, is concerned with finding compressed and meaningful representations of inputs that are useful for downstream tasks. Because images are high-dimensional and admit many readily available transformations that leave their semantics unchanged, SSL is particularly attractive in computer vision.

A standard goal of SSL in the visual domain is finding representations which are
invariant under selected transformations, like random cropping, small rotations,
horizontal flipping, etc. Thus, one may envision a loss that minimizes distances
between predictions for different augmentations. Unfortunately, such a loss is
prone to *collapse*: a model that always predicts a constant output achieves a
perfect score. For a long time it was thought that adding a penalty for
small distances to *negative samples* (i.e. representations of different images)
to the loss was needed to prevent such collapse, which led to the development of
algorithms involving (hard) negative sampling.
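To make the collapse concrete, here is a minimal NumPy sketch (all names are illustrative, not from any paper): a "model" that ignores its input entirely achieves the global minimum of a naive invariance loss.

```python
import numpy as np

def invariance_loss(z1, z2):
    """Mean squared distance between representations of two augmented views."""
    return float(np.mean(np.sum((z1 - z2) ** 2, axis=1)))

def constant_encoder(x):
    """A collapsed 'model': it ignores its input entirely."""
    return np.ones((x.shape[0], 16))

rng = np.random.default_rng(0)
view1 = rng.normal(size=(8, 64))  # batch of one augmentation per image
view2 = rng.normal(size=(8, 64))  # a second, different augmentation

# The collapsed encoder trivially achieves the minimum of the loss:
print(invariance_loss(constant_encoder(view1), constant_encoder(view2)))  # → 0.0
```

Any loss that only pulls augmentations together has this degenerate solution, which is what negative samples were meant to rule out.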

In **Bootstrap Your Own Latent** (BYOL) [Gri20B], it was
shown for the first time that SSL is viable without negative sampling. The
algorithm was based on the rather curious observation that if a network is
trained to mimic representations of augmented images from a different, randomly
initialized *target network*, the resulting learned representations are much
better than the random targets. On ImageNet, a linear classifier trained on
these representations achieves 18.8% top-1 accuracy, vs. 1.4% for the random
targets themselves.

This experimental result gives rise to the sketch of a representation learning
algorithm: iteratively improve the fixed target networks together with the
learned, *online networks*. While a collapse is still theoretically possible
with this strategy, it has not been observed experimentally. BYOL uses some
engineering on top of this basic idea that is highlighted in the architecture
diagram. The target's weights are an exponential moving average of previous
iterations of the online network. This kind of technique, where a “student”
network is trained to reproduce the outputs of a “teacher” network, is sometimes
called *knowledge distillation* or simply *distillation*. See e.g. this nice
summary from a NYU class,
where non-contrastive SSL algorithms are grouped under “distillation”.
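The moving-average target update can be sketched in a few lines of NumPy (a simplified stand-in for the per-tensor update a deep learning framework would perform; 0.996 is the base decay rate reported in the BYOL paper, and the function name is ours):

```python
import numpy as np

def ema_update(target, online, tau=0.996):
    """Update each target parameter as an exponential moving average of the
    corresponding online parameter. With tau close to 1, the target network
    changes slowly, providing a stable regression objective."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target, online)]

# Toy example with a single weight tensor:
target = [np.zeros(4)]
online = [np.ones(4)]
target = ema_update(target, online)  # each entry moves by (1 - tau) toward the online weight
```

Only the online network receives gradient updates; in BYOL the decay rate is additionally annealed towards 1 over the course of training.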

An experimental study in **Exploring Simple Siamese Representation Learning** (SimSiam) [Gri20B] showed that, contrary to the assumption in BYOL, using previous iterations of the online network as the target is not necessary to prevent collapse. One can thus use a simpler architecture, keeping a single network as both online and target and not computing gradients w.r.t. the weights of the “target” branch. The main improvement over BYOL is a simpler implementation with comparable performance.
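As a sketch of the resulting objective, here is the symmetrized negative cosine similarity in NumPy; the stop-gradient is mimicked by simply treating the target projections `z1`, `z2` as constants (in an autodiff framework this would be an explicit detach/stop-gradient). Function names are illustrative, not from the paper.

```python
import numpy as np

def neg_cosine(p, z):
    """Negative cosine similarity between predictor outputs p and projections z.
    z is treated as a constant here, mimicking SimSiam's stop-gradient:
    no gradient would flow into the target branch."""
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return float(-np.mean(np.sum(p * z, axis=1)))

def simsiam_loss(p1, z1, p2, z2):
    """Symmetrized loss over two views: each view's predictor output is
    pulled toward the (detached) projection of the other view."""
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)
```

The loss is minimized at -1, reached when the two views' representations are perfectly aligned.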

Both BYOL and SimSiam reach linear-classifier accuracies above 70% on ImageNet, often coming remarkably close to supervised training with comparable networks; see the corresponding diagram. Multiple implementations of both BYOL and SimSiam are available on GitHub, see the references below.

Since then, other algorithms for self-supervised learning without negative pairs have emerged and become very popular, such as Barlow Twins [Zbo21B] and DINO [Car21E]. We plan to cover them in upcoming paper pills - stay tuned!