Note
This page is reference documentation. It only describes the class signature, not how to use it. Please refer to the user guide for the big picture.
nidl.estimators.ssl.DINO¶
- class nidl.estimators.ssl.DINO(encoder, encoder_kwargs=None, proj_input_dim=2048, proj_hidden_dim=2048, proj_bottleneck_dim=256, proj_output_dim=4096, proj_batch_norm=True, proj_norm_last_layer=True, num_local_crops=8, student_temperature=0.1, teacher_temperature=0.07, warmup_teacher_temp=0.04, warmup_teacher_temp_epochs=30, base_lambda=0.996, final_lambda=1.0, clip_grad=0.0, freeze_last_layer=0, optimizer='adamW', learning_rate=0.0003, weight_decay=0.0005, exclude_bias_and_norm_wd=True, optimizer_kwargs=None, lr_scheduler='warmup_cosine', lr_scheduler_kwargs=None, **kwargs)[source]¶
Bases: TransformerMixin, BaseEstimator
DINO [1].
DINO (self-Distillation with NO labels) is a self-supervised learning method for vision models. It learns visual representations via knowledge distillation: a student network is trained to match the representations of local and global crops (or “views”) to the representations of the global crops produced by a teacher network. The teacher is updated as an exponential moving average of the student, which avoids representation collapse. The DINO loss is a cross-entropy between the teacher and student output distributions. As a result, DINO does not rely on negative samples as contrastive learning does, and it is less sensitive to batch size than SimCLR.
After training, the teacher model is used at inference to obtain image features.
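The distillation objective described above can be illustrated with a minimal, self-contained sketch. This is a simplification, not the library's DINOLoss: the real loss additionally centers the teacher outputs and averages over view pairs.

```python
# Simplified sketch of the DINO objective on a single pair of views.
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def dino_loss(student_logits, teacher_logits,
              student_temperature=0.1, teacher_temperature=0.07):
    """Cross-entropy between the (sharper) teacher distribution and the
    student distribution; gradients would flow through the student only."""
    p_teacher = softmax(teacher_logits, teacher_temperature)
    p_student = softmax(student_logits, student_temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

# The loss is small when the student matches the teacher's distribution
# and large when it does not.
aligned = dino_loss([2.0, 0.1, -1.0], [2.0, 0.1, -1.0])
misaligned = dino_loss([-1.0, 0.1, 2.0], [2.0, 0.1, -1.0])
```

The lower teacher temperature (0.07 vs. 0.1) sharpens the teacher distribution, which is what gives the distillation target its contrast.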
- Parameters:
- encoder : nn.Module or class
Architecture of the encoder. A PyTorch Module is expected. In general, the uninstantiated class should be passed, although instantiated modules will also work.
- encoder_kwargs : dict or None, default=None
Options for building the encoder (depends on the architecture). Ignored if encoder is already instantiated.
- proj_input_dim : int, default=2048
Projector input dimension. It must match the encoder's output dimension.
- proj_hidden_dim : int, default=2048
Projector hidden dimension.
- proj_bottleneck_dim : int, default=256
Projector bottleneck dimension.
- proj_output_dim : int, default=4096
Projector output dimension.
- proj_batch_norm : bool, default=True
Whether to use batch normalization in the projector. Should be set to False when using a vision transformer backbone.
- proj_norm_last_layer : bool, default=True
Whether to weight-normalize the last layer of the DINO head. Not normalizing leads to better performance but can make training unstable.
- num_local_crops : int, default=8
Number of local views.
- student_temperature : float, default=0.1
Temperature for the student.
- teacher_temperature : float, default=0.07
Final temperature for the teacher.
- warmup_teacher_temp : float, default=0.04
Initial temperature for the teacher network.
- warmup_teacher_temp_epochs : int, default=30
Number of epochs for the warmup phase of the teacher temperature.
- base_lambda : float, default=0.996
Base value of the weighting coefficient in the teacher momentum update (exponential moving average). A cosine annealing schedule is used.
- final_lambda : float, default=1.0
Final value of the weighting coefficient in the teacher momentum update.
- clip_grad : float, default=0.0
Threshold for gradient clipping. A value of 0 disables clipping.
- freeze_last_layer : int, default=0
Number of epochs during which the last layer of the student's projection head is frozen.
- optimizer : {'sgd', 'adam', 'adamW'} or Optimizer, default='adamW'
Optimizer for training the model. If a string is given, it can be:
'sgd': stochastic gradient descent (with optional momentum).
'adam': first-order gradient-based optimizer.
'adamW' (default): Adam with decoupled weight decay regularization (see “Decoupled Weight Decay Regularization”, Loshchilov and Hutter, ICLR 2019).
- learning_rate : float, default=3e-4
Initial learning rate.
- weight_decay : float, default=5e-4
Weight decay in the optimizer.
- exclude_bias_and_norm_wd : bool, default=True
Whether to exclude bias terms and normalization layers from weight decay during optimization.
- optimizer_kwargs : dict or None, default=None
Extra named arguments for the optimizer.
- lr_scheduler : {'none', 'warmup_cosine'}, LRSchedulerPLType or None, default='warmup_cosine'
Learning rate scheduler to use.
- lr_scheduler_kwargs : dict or None, default=None
Extra named arguments for the scheduler. By default, it is set to {"warmup_epochs": 10, "warmup_start_lr": 1e-6, "min_lr": 0.0, "interval": "step"}.
- **kwargs : dict, optional
Extra named arguments for the BaseEstimator class (forwarded to the PyTorch Lightning Trainer), such as max_epochs, max_steps, num_sanity_val_steps, check_val_every_n_epoch, callbacks, etc. See the PL Trainer API for more details.
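The two schedules controlled by warmup_teacher_temp, warmup_teacher_temp_epochs, base_lambda and final_lambda can be sketched as follows. The forms shown (linear warmup for the teacher temperature, cosine annealing for the EMA coefficient) follow the DINO paper; the library's exact implementation may differ in detail.

```python
# Sketch of the teacher-temperature warmup and EMA-coefficient schedules.
import math

def teacher_temperature_schedule(epoch, warmup_teacher_temp=0.04,
                                 teacher_temperature=0.07,
                                 warmup_teacher_temp_epochs=30):
    """Linearly ramp the teacher temperature over the warmup epochs,
    then hold it at its final value."""
    if epoch >= warmup_teacher_temp_epochs:
        return teacher_temperature
    frac = epoch / warmup_teacher_temp_epochs
    return warmup_teacher_temp + frac * (teacher_temperature - warmup_teacher_temp)

def ema_lambda_schedule(step, max_steps, base_lambda=0.996, final_lambda=1.0):
    """Cosine-anneal the EMA coefficient from base_lambda (step 0)
    to final_lambda (last step)."""
    cos = math.cos(math.pi * step / max_steps)
    return final_lambda - (final_lambda - base_lambda) * (cos + 1) / 2
```

With the defaults, the teacher starts close to the student (lambda = 0.996) and ends fully frozen within each step (lambda = 1.0) by the end of training.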
- Attributes:
- encoder: nn.Module
Pointer to the teacher backbone (used at inference).
- student: torch.nn.Module
Student backbone.
- teacher: torch.nn.Module
Teacher backbone.
- student_head: torch.nn.Module
Student head on top of student backbone (only for training).
- teacher_head: torch.nn.Module
Teacher head on top of teacher backbone (only for training).
- loss: DINOLoss
The DINO loss used for training.
- optimizer: torch.optim.Optimizer
Optimizer used for training.
- lr_scheduler: LRSchedulerPLType or None
Learning rate scheduler used for training.
Notes
DINO always assumes 2 global crops (views); adding more global views becomes computationally prohibitive.
References
[1] Caron, M., et al., “Emerging Properties in Self-Supervised Vision Transformers”, ICCV, 2021. https://arxiv.org/abs/2104.14294
- __init__(encoder, encoder_kwargs=None, proj_input_dim=2048, proj_hidden_dim=2048, proj_bottleneck_dim=256, proj_output_dim=4096, proj_batch_norm=True, proj_norm_last_layer=True, num_local_crops=8, student_temperature=0.1, teacher_temperature=0.07, warmup_teacher_temp=0.04, warmup_teacher_temp_epochs=30, base_lambda=0.996, final_lambda=1.0, clip_grad=0.0, freeze_last_layer=0, optimizer='adamW', learning_rate=0.0003, weight_decay=0.0005, exclude_bias_and_norm_wd=True, optimizer_kwargs=None, lr_scheduler='warmup_cosine', lr_scheduler_kwargs=None, **kwargs)[source]¶
- on_train_batch_end(outputs, batch, batch_idx)[source]¶
Performs the teacher momentum update.
- Parameters:
- outputs: dict[str, Any]
The outputs of the training step (ignored).
- batch: Sequence[Any]
A batch of input data (ignored).
- batch_idx: int
The index of the current batch (ignored).
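The momentum update performed by this hook can be sketched in plain Python over flat parameter lists (the estimator itself operates on torch module parameters):

```python
def ema_update(teacher_params, student_params, lam):
    """In-place exponential moving average:
    teacher <- lam * teacher + (1 - lam) * student."""
    for i, (t, s) in enumerate(zip(teacher_params, student_params)):
        teacher_params[i] = lam * t + (1 - lam) * s

teacher = [1.0, 2.0]
student = [0.0, 0.0]
ema_update(teacher, student, lam=0.996)
# The teacher moves only slightly toward the student each batch.
```

Because lambda is close to 1, the teacher changes slowly, which is what stabilizes the distillation targets.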
- training_step(batch, batch_idx, dataloader_idx=0)[source]¶
Perform one training step and compute the training loss.
- Parameters:
- batch: Sequence[Any]
A batch of data in the format [X] or ([X], Y) where [X] is a list of torch.Tensor containing num_large_crops global views (first elements) and num_small_crops local views (last elements). Y are labels (ignored).
- batch_idx: int
The index of the current batch (ignored).
- dataloader_idx: int, default=0
The index of the dataloader (ignored).
- Returns:
- outputs: dict
- Dictionary containing:
“loss”: the DINO loss computed on this batch (scalar);
“z_student”: tensor of shape (n_views, batch_size, n_features);
“z_teacher”: tensor of shape (n_global_views, batch_size, n_features);
“y”: the targets, if any (returned as is).
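The multi-crop batch layout consumed by this step can be illustrated with placeholder shapes (plain tuples stand in for torch.Tensor; the crop sizes 224 and 96 are those used in the DINO paper, not necessarily this estimator's defaults):

```python
# Placeholder "tensors": (batch_size, channels, height, width) shapes only.
def crop(batch_size, size):
    return ("tensor", batch_size, 3, size, size)

batch_size, num_local_crops = 4, 8
global_views = [crop(batch_size, 224) for _ in range(2)]  # always 2 global crops
local_views = [crop(batch_size, 96) for _ in range(num_local_crops)]
X = global_views + local_views  # global views first, local views last
batch = (X, None)               # ([X], Y) format; Y (labels) is ignored

# The student sees all views while the teacher sees only the global views,
# so "z_student" has n_views = 10 entries and "z_teacher" has
# n_global_views = 2 entries.
```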
- transform_step(batch, batch_idx, dataloader_idx=0)[source]¶
Encode the input data into the latent space.
Importantly, we do not apply the projection head here since it is not part of the final model at inference time (only used for training).
- Parameters:
- batch: torch.Tensor
A batch of data that has been generated from test_dataloader. This is given as is to the encoder.
- batch_idx: int
The index of the current batch (ignored).
- dataloader_idx: int, default=0
The index of the dataloader (ignored).
- Returns:
- features: torch.Tensor
The encoded features returned by the encoder.
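As a toy illustration of this inference path (stand-in Python functions; the real method runs the teacher torch module), the key point is that the projection head is bypassed:

```python
def encoder(batch):
    """Stand-in teacher backbone: maps inputs to features."""
    return [x * 2.0 for x in batch]

def projection_head(z):
    """Stand-in DINO head: used during training only, never here."""
    return [sum(z)]

def transform_step(batch):
    """Encode inputs; the projection head is deliberately not applied."""
    return encoder(batch)

features = transform_step([1.0, 2.0, 3.0])
```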
- validation_step(batch, batch_idx, dataloader_idx=0)[source]¶
Perform one validation step and compute the validation loss.
- Parameters:
- batch: Sequence[Any]
A batch of data in the format [X] or ([X], Y) where [X] is a list of torch.Tensor containing num_large_crops global views (first elements) and num_small_crops local views (last elements). Y are labels (ignored).
- batch_idx: int
The index of the current batch (ignored).
- dataloader_idx: int, default=0
The index of the dataloader (ignored).
- Returns:
- outputs: dict
- Dictionary containing:
“loss”: the DINO loss computed on this batch (scalar);
“z_student”: tensor of shape (n_views, batch_size, n_features);
“z_teacher”: tensor of shape (n_global_views, batch_size, n_features);
“y”: the targets, if any.