Note

This page is reference documentation: it explains only the class signature, not how to use it. Please refer to the user guide for the big picture.

nidl.estimators.ssl.utils.DINOProjectionHead

class nidl.estimators.ssl.utils.DINOProjectionHead(input_dim=2048, hidden_dim=2048, bottleneck_dim=256, output_dim=4096, batch_norm=True, freeze_last_layer=-1, norm_last_layer=True)[source]

Bases: ProjectionHead

Projection head used in DINO [1].

The projection head consists of a 3-layer multi-layer perceptron (MLP) with hidden dimension 2048, followed by l2 normalization and a weight-normalized fully connected layer with K dimensions, a design similar to that of SwAV [2].
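To make the data flow concrete, here is a minimal NumPy sketch of that computation: a 3-layer MLP, l2 normalization of the bottleneck, then a weight-normalized output layer. This is an illustration of the design described above, not this class's actual PyTorch implementation; the GELU activation, the omitted biases, and the toy dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (the class defaults are 2048 / 2048 / 256 / 4096).
input_dim, hidden_dim, bottleneck_dim, output_dim = 8, 16, 4, 32

# 3-layer MLP weights (biases omitted for brevity).
w1 = rng.normal(size=(input_dim, hidden_dim)) * 0.1
w2 = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
w3 = rng.normal(size=(hidden_dim, bottleneck_dim)) * 0.1

# Weight-normalized last layer: each of the K output prototypes is a unit vector.
w_last = rng.normal(size=(output_dim, bottleneck_dim))
w_last /= np.linalg.norm(w_last, axis=1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def dino_head_forward(x):
    h = gelu(x @ w1)
    h = gelu(h @ w2)
    z = h @ w3                                        # bottleneck projection
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # l2 normalization
    return z @ w_last.T                               # K-dimensional output

x = rng.normal(size=(2, input_dim))
out = dino_head_forward(x)
print(out.shape)  # (2, 32)
```

Because both the normalized bottleneck vectors and the prototype rows are unit vectors, each output is a cosine similarity and lies in [-1, 1].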

Parameters:
input_dim: int, default=2048

The input dimension of the head.

hidden_dim: int, default=2048

The hidden dimension.

bottleneck_dim: int, default=256

Dimension of the bottleneck in the last layer of the head.

output_dim: int, default=4096

The output dimension of the head.

batch_norm: bool, default=True

Whether to use batch normalization. Should be set to False when using a vision transformer backbone.

freeze_last_layer: int, default=-1

Number of epochs during which the output layer is kept fixed. Typically, doing so during the first epoch helps training. Try increasing this value if the loss does not decrease.

norm_last_layer: bool, default=True

Whether to weight-normalize the last layer of the DINO head. Not normalizing can lead to better performance but can make training unstable.

References

[1]

Caron, M., et al., “Emerging Properties in Self-Supervised Vision Transformers” ICCV, 2021. https://arxiv.org/abs/2104.14294

[2]

Caron, M., et al., “Unsupervised Learning of Visual Features by Contrasting Cluster Assignments”, NeurIPS, 2020. https://arxiv.org/abs/2006.09882

__init__(input_dim=2048, hidden_dim=2048, bottleneck_dim=256, output_dim=4096, batch_norm=True, freeze_last_layer=-1, norm_last_layer=True)[source]

Initializes the DINOProjectionHead with the specified dimensions.

cancel_last_layer_gradients(current_epoch)[source]

Cancel the last layer's gradients to stabilize training.
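The freeze schedule implied by freeze_last_layer can be sketched as follows. The real method zeroes the gradients of the output layer's PyTorch parameters; this stand-in only shows the epoch condition, and the function name is hypothetical.

```python
def should_cancel_last_layer_gradients(current_epoch, freeze_last_layer=-1):
    """Return True while the output layer should stay fixed.

    With the default freeze_last_layer=-1, gradients are never cancelled;
    with freeze_last_layer=1, only the first epoch (epoch 0) is frozen.
    """
    return current_epoch < freeze_last_layer

# With freeze_last_layer=1, only epoch 0 keeps the output layer fixed.
print([should_cancel_last_layer_gradients(e, freeze_last_layer=1)
       for e in range(3)])  # [True, False, False]
```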

forward(x)[source]

Computes one forward pass through the head.