Note
This page is reference documentation. It only describes the class signature, not how to use it. Please refer to the user guide for the big picture.
nidl.estimators.ssl.DINO¶
- class nidl.estimators.ssl.DINO(encoder, encoder_kwargs=None, proj_input_dim=2048, proj_hidden_dim=2048, proj_bottleneck_dim=256, proj_output_dim=4096, proj_batch_norm=True, proj_norm_last_layer=True, num_local_crops=8, student_temperature=0.1, teacher_temperature=0.07, warmup_teacher_temp=0.04, warmup_teacher_temp_epochs=30, base_lambda=0.996, final_lambda=1.0, clip_grad=0.0, freeze_last_layer=0, optimizer='adamW', learning_rate=0.0003, weight_decay=0.0005, exclude_bias_and_norm_wd=True, optimizer_kwargs=None, lr_scheduler='warmup_cosine', lr_scheduler_kwargs=None, **kwargs)[source]¶
Bases: TransformerMixin, BaseEstimator
DINO [1].
DINO (self-Distillation with NO labels) is a self-supervised learning method for vision models. It learns visual representations via knowledge distillation: a student network is trained to match the representations of local and global crops (or “views”) to the representations of the global crops produced by a teacher network. The teacher is updated as an exponential moving average of the student, which avoids representation collapse. The DINO loss is a cross-entropy between the teacher and student output distributions. As a result, DINO does not rely on negative samples as contrastive learning does, and it is less sensitive to batch size than SimCLR.
After training, the teacher model is used at inference to obtain image features.
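The distillation objective described above can be illustrated with a minimal, self-contained sketch. This is a simplification, not the library's DINOLoss: the real loss additionally centers the teacher outputs and averages over view pairs.

```python
# Simplified sketch of the DINO objective on a single pair of views.
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def dino_loss(student_logits, teacher_logits,
              student_temperature=0.1, teacher_temperature=0.07):
    """Cross-entropy between the (sharper) teacher distribution and the
    student distribution; gradients would flow through the student only."""
    p_teacher = softmax(teacher_logits, teacher_temperature)
    p_student = softmax(student_logits, student_temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

# The loss is small when the student matches the teacher's distribution
# and large when it does not.
aligned = dino_loss([2.0, 0.1, -1.0], [2.0, 0.1, -1.0])
misaligned = dino_loss([-1.0, 0.1, 2.0], [2.0, 0.1, -1.0])
```

The lower teacher temperature (0.07 vs. 0.1) sharpens the teacher distribution, which is what gives the distillation target its contrast.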
- Parameters:
- encoder : nn.Module or class
Architecture of the encoder. A PyTorch Module is expected. In general, the uninstantiated class should be passed, although instantiated modules will also work.
- encoder_kwargs : dict or None, default=None
Options for building the encoder (depends on the architecture). Ignored if encoder is already instantiated.
- proj_input_dim : int, default=2048
Projector input dimension. It must match the encoder's output dimension.
- proj_hidden_dim : int, default=2048
Projector hidden dimension.
- proj_bottleneck_dim : int, default=256
Projector bottleneck dimension.
- proj_output_dim : int, default=4096
Projector output dimension.
- proj_batch_norm : bool, default=True
Whether to use batch normalization in the projector. Should be set to False when using a vision transformer backbone.
- proj_norm_last_layer : bool, default=True
Whether to weight-normalize the last layer of the DINO head. Not normalizing leads to better performance but can make training unstable.
- num_local_crops : int, default=8
Number of local views.
- student_temperature : float, default=0.1
Temperature for the student.
- teacher_temperature : float, default=0.07
Final temperature for the teacher.
- warmup_teacher_temp : float, default=0.04
Initial temperature for the teacher network.
- warmup_teacher_temp_epochs : int, default=30
Number of epochs for the warmup phase of the teacher temperature.
- base_lambda : float, default=0.996
Base value of the weighting coefficient in the teacher momentum update (exponential moving average). A cosine annealing schedule is used.
- final_lambda : float, default=1.0
Final value of the weighting coefficient in the teacher momentum update.
- clip_grad : float, default=0.0
Threshold for gradient clipping. A value of 0 disables clipping.
- freeze_last_layer : int, default=0
Number of epochs during which the last layer of the student's projection head is frozen.
- optimizer : {'sgd', 'adam', 'adamW'} or Optimizer, default='adamW'
Optimizer for training the model. If a string is given, it can be:
'sgd': stochastic gradient descent (with optional momentum).
'adam': first-order gradient-based optimizer.
'adamW' (default): Adam with decoupled weight decay regularization (see “Decoupled Weight Decay Regularization”, Loshchilov and Hutter, ICLR 2019).
- learning_rate : float, default=3e-4
Initial learning rate.
- weight_decay : float, default=5e-4
Weight decay in the optimizer.
- exclude_bias_and_norm_wd : bool, default=True
Whether to exclude bias terms and normalization layers from weight decay during optimization.
- optimizer_kwargs : dict or None, default=None
Extra named arguments for the optimizer.
- lr_scheduler : {'none', 'warmup_cosine'}, LRSchedulerPLType or None, default='warmup_cosine'
Learning rate scheduler to use.
- lr_scheduler_kwargs : dict or None, default=None
Extra named arguments for the scheduler. By default, it is set to {"warmup_epochs": 10, "warmup_start_lr": 1e-6, "min_lr": 0.0, "interval": "step"}.
- **kwargs : dict, optional
Extra named arguments for the BaseEstimator class (forwarded to the PyTorch Lightning Trainer), such as max_epochs, max_steps, num_sanity_val_steps, check_val_every_n_epoch, callbacks, etc. See the PL Trainer API for more details.
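The two schedules controlled by warmup_teacher_temp, warmup_teacher_temp_epochs, base_lambda and final_lambda can be sketched as follows. The forms shown (linear warmup for the teacher temperature, cosine annealing for the EMA coefficient) follow the DINO paper; the library's exact implementation may differ in detail.

```python
# Sketch of the teacher-temperature warmup and EMA-coefficient schedules.
import math

def teacher_temperature_schedule(epoch, warmup_teacher_temp=0.04,
                                 teacher_temperature=0.07,
                                 warmup_teacher_temp_epochs=30):
    """Linearly ramp the teacher temperature over the warmup epochs,
    then hold it at its final value."""
    if epoch >= warmup_teacher_temp_epochs:
        return teacher_temperature
    frac = epoch / warmup_teacher_temp_epochs
    return warmup_teacher_temp + frac * (teacher_temperature - warmup_teacher_temp)

def ema_lambda_schedule(step, max_steps, base_lambda=0.996, final_lambda=1.0):
    """Cosine-anneal the EMA coefficient from base_lambda (step 0)
    to final_lambda (last step)."""
    cos = math.cos(math.pi * step / max_steps)
    return final_lambda - (final_lambda - base_lambda) * (cos + 1) / 2
```

With the defaults, the teacher starts close to the student (lambda = 0.996) and ends fully frozen within each step (lambda = 1.0) by the end of training.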
- Attributes:
- encoder: nn.Module
Pointer to the teacher backbone (used at inference).
- student: torch.nn.Module
Student backbone.
- teacher: torch.nn.Module
Teacher backbone.
- student_head: torch.nn.Module
Student head on top of student backbone (only for training).
- teacher_head: torch.nn.Module
Teacher head on top of teacher backbone (only for training).
- loss: DINOLoss
The DINO loss used for training.
- optimizer: torch.optim.Optimizer
Optimizer used for training.
- lr_scheduler: LRSchedulerPLType or None
Learning rate scheduler used for training.
Notes
DINO always assumes 2 global crops (views); adding more global views becomes computationally prohibitive.
References
[1] Caron, M., et al., “Emerging Properties in Self-Supervised Vision Transformers”, ICCV, 2021. https://arxiv.org/abs/2104.14294
- __init__(encoder, encoder_kwargs=None, proj_input_dim=2048, proj_hidden_dim=2048, proj_bottleneck_dim=256, proj_output_dim=4096, proj_batch_norm=True, proj_norm_last_layer=True, num_local_crops=8, student_temperature=0.1, teacher_temperature=0.07, warmup_teacher_temp=0.04, warmup_teacher_temp_epochs=30, base_lambda=0.996, final_lambda=1.0, clip_grad=0.0, freeze_last_layer=0, optimizer='adamW', learning_rate=0.0003, weight_decay=0.0005, exclude_bias_and_norm_wd=True, optimizer_kwargs=None, lr_scheduler='warmup_cosine', lr_scheduler_kwargs=None, **kwargs)[source]¶
- on_train_batch_end(outputs, batch, batch_idx)[source]¶
Performs the teacher momentum update.
- Parameters:
- outputs: dict[str, Any]
The outputs of the training step (ignored).
- batch: Sequence[Any]
A batch of input data (ignored).
- batch_idx: int
The index of the current batch (ignored).
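The momentum update performed by this hook can be sketched in plain Python over flat parameter lists (the estimator itself operates on torch module parameters):

```python
def ema_update(teacher_params, student_params, lam):
    """In-place exponential moving average:
    teacher <- lam * teacher + (1 - lam) * student."""
    for i, (t, s) in enumerate(zip(teacher_params, student_params)):
        teacher_params[i] = lam * t + (1 - lam) * s

teacher = [1.0, 2.0]
student = [0.0, 0.0]
ema_update(teacher, student, lam=0.996)
# The teacher moves only slightly toward the student each batch.
```

Because lambda is close to 1, the teacher changes slowly, which is what stabilizes the distillation targets.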
- training_step(batch, batch_idx, dataloader_idx=0)[source]¶
Perform one training step and compute the training loss.
- Parameters:
- batch: Sequence[Any]
A batch of data in the format [X] or ([X], Y) where [X] is a list of torch.Tensor containing num_large_crops global views (first elements) and num_small_crops local views (last elements). Y are labels (ignored).
- batch_idx: int
The index of the current batch (ignored).
- dataloader_idx: int, default=0
The index of the dataloader (ignored).
- Returns:
- outputs: dict
- Dictionary containing:
“loss”: the DINO loss computed on this batch (scalar);
“z_student”: tensor of shape (n_views, batch_size, n_features);
“z_teacher”: tensor of shape (n_global_views, batch_size, n_features);
“y”: the targets, if any (returned as is).
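The multi-crop batch layout consumed by this step can be illustrated with placeholder shapes (plain tuples stand in for torch.Tensor; the crop sizes 224 and 96 are those used in the DINO paper, not necessarily this estimator's defaults):

```python
# Placeholder "tensors": (batch_size, channels, height, width) shapes only.
def crop(batch_size, size):
    return ("tensor", batch_size, 3, size, size)

batch_size, num_local_crops = 4, 8
global_views = [crop(batch_size, 224) for _ in range(2)]  # always 2 global crops
local_views = [crop(batch_size, 96) for _ in range(num_local_crops)]
X = global_views + local_views  # global views first, local views last
batch = (X, None)               # ([X], Y) format; Y (labels) is ignored

# The student sees all views while the teacher sees only the global views,
# so "z_student" has n_views = 10 entries and "z_teacher" has
# n_global_views = 2 entries.
```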
- transform_step(batch, batch_idx, dataloader_idx=0)[source]¶
Encode the input data into the latent space.
Importantly, we do not apply the projection head here since it is not part of the final model at inference time (only used for training).
- Parameters:
- batch: torch.Tensor
A batch of data that has been generated from test_dataloader. This is given as is to the encoder.
- batch_idx: int
The index of the current batch (ignored).
- dataloader_idx: int, default=0
The index of the dataloader (ignored).
- Returns:
- features: torch.Tensor
The encoded features returned by the encoder.
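As a toy illustration of this inference path (stand-in Python functions; the real method runs the teacher torch module), the key point is that the projection head is bypassed:

```python
def encoder(batch):
    """Stand-in teacher backbone: maps inputs to features."""
    return [x * 2.0 for x in batch]

def projection_head(z):
    """Stand-in DINO head: used during training only, never here."""
    return [sum(z)]

def transform_step(batch):
    """Encode inputs; the projection head is deliberately not applied."""
    return encoder(batch)

features = transform_step([1.0, 2.0, 3.0])
```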
- validation_step(batch, batch_idx, dataloader_idx=0)[source]¶
Perform one validation step and compute the validation loss.
- Parameters:
- batch: Sequence[Any]
A batch of data in the format [X] or ([X], Y) where [X] is a list of torch.Tensor containing num_large_crops global views (first elements) and num_small_crops local views (last elements). Y are labels (ignored).
- batch_idx: int
The index of the current batch (ignored).
- dataloader_idx: int, default=0
The index of the dataloader (ignored).
- Returns:
- outputs: dict
- Dictionary containing:
“loss”: the DINO loss computed on this batch (scalar);
“z_student”: tensor of shape (n_views, batch_size, n_features);
“z_teacher”: tensor of shape (n_global_views, batch_size, n_features);
“y”: the targets, if any.