Note
This page is a reference documentation. It only explains the class signature, and not how to use it. Please refer to the user guide for the big picture.
nidl.volume.backbones.VisionTransformer3D¶
- class nidl.volume.backbones.VisionTransformer3D(img_size, patch_size, in_chans=1, num_classes=0, global_pool='cls_token', embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0, qkv_bias=True, qk_norm=False, scale_attn_norm=False, scale_mlp_norm=False, proj_bias=True, class_token=True, reg_tokens=0, no_embed_class=False, pre_norm=False, final_norm=True, fc_norm=None, dynamic_img_size=False, pos_embed='learned', drop_rate=0.0, pos_drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, embed_layer=<class 'nidl.volume.backbones.vit3d.PatchEmbed3D'>, norm_layer=<class 'torch.nn.modules.normalization.LayerNorm'>, act_layer=<class 'torch.nn.modules.activation.GELU'>, block_fn=<class 'timm.models.vision_transformer.Block'>, mlp_layer=<class 'timm.layers.mlp.Mlp'>, attn_layer=<class 'timm.layers.attention.Attention'>)[source]¶
Bases:
Module3D Vision Transformer with a timm-like interface.
- Parameters:
- img_sizeint or sequence of int
Input size in
(H, W, D).- patch_sizeint or sequence of int
Patch size in
(H, W, D).- in_chansint, default=1
Number of input channels.
- num_classesint, default=0
Number of output classes. If non-positive, head is identity.
- global_pool{“cls_token”, “avg”, “max”, “avgmax”, “”},
default=”cls_token” Pooling mode. If “”, no pooling is applied and the full token sequence is returned.
- embed_dimint, default=768
Embedding dimension.
- depthint, default=12
Number of transformer blocks.
- num_headsint, default=12
Number of attention heads.
- mlp_ratiofloat, default=4.0
MLP expansion ratio.
- qkv_biasbool, default=True
Whether to use qkv bias.
- class_tokenbool, default=True
Whether to prepend a CLS token.
- reg_tokensint, default=0
Number of register tokens.
- no_embed_classbool, default=False
If
True, position embeddings are defined only on patch tokens.- pre_normbool, default=False
Whether to apply normalization before the transformer blocks.
- fc_normbool or None, default=None
Whether to apply normalization after pooling. If
None, defaults toglobal_pool == "avg".- dynamic_img_sizebool, default=False
Present for interface compatibility. Not used here.
- pos_embed{“learned”, “sincos”, “none”}, default=”learned”
Absolute positional embedding mode.
- drop_ratefloat, default=0.0
Head dropout.
- pos_drop_ratefloat, default=0.0
Positional dropout.
- proj_drop_ratefloat, default=0.0
Projection dropout in each transformer block.
- attn_drop_ratefloat, default=0.0
Attention dropout in each transformer block.
- drop_path_ratefloat, default=0.0
Maximum stochastic depth rate.
- norm_layercallable, default=nn.LayerNorm
Normalization layer constructor.
- act_layercallable, default=nn.GELU
Activation layer constructor.
Notes
This implementation reuses timm’s transformer Block as-is. Only the 3D-specific pieces are implemented locally: patch embedding, 3D positional embeddings, and checkpoint inflation.
- __init__(img_size, patch_size, in_chans=1, num_classes=0, global_pool='cls_token', embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0, qkv_bias=True, qk_norm=False, scale_attn_norm=False, scale_mlp_norm=False, proj_bias=True, class_token=True, reg_tokens=0, no_embed_class=False, pre_norm=False, final_norm=True, fc_norm=None, dynamic_img_size=False, pos_embed='learned', drop_rate=0.0, pos_drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, embed_layer=<class 'nidl.volume.backbones.vit3d.PatchEmbed3D'>, norm_layer=<class 'torch.nn.modules.normalization.LayerNorm'>, act_layer=<class 'torch.nn.modules.activation.GELU'>, block_fn=<class 'timm.models.vision_transformer.Block'>, mlp_layer=<class 'timm.layers.mlp.Mlp'>, attn_layer=<class 'timm.layers.attention.Attention'>)[source]¶
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(x)[source]¶
Forward pass.
- Parameters:
- xtorch.Tensor
Input tensor of shape
(B, C, H, W, D).
- Returns:
- torch.Tensor
Logits or features depending on the classifier head.
- forward_features(x)[source]¶
Compute token features.
- Parameters:
- xtorch.Tensor
Input tensor of shape
(B, C, H, W, D).
- Returns:
- torch.Tensor
Token features of shape
(B, N_total, C).
- forward_head(x, pre_logits=False)[source]¶
Apply pooling and classification head.
- Parameters:
- xtorch.Tensor
Token features of shape
(B, N_total, C).- pre_logitsbool, default=False
Whether to return pooled features before the classifier.
- Returns:
- torch.Tensor
Model outputs.
- forward_intermediates(x, indices=(-1,), norm=False, output_fmt='NLC', intermediates_only=False)[source]¶
Forward and return intermediate block outputs.
- Parameters:
- xtorch.Tensor
Input tensor of shape
(B, C, H, W, D).- indicessequence of int or int, default=(-1,)
Block indices to return. Negative indices are supported.
- normbool, default=False
Whether to apply final normalization to returned intermediates.
- output_fmt{“NLC”}, default=”NLC”
Output format for intermediates.
- intermediates_onlybool, default=False
If
True, return only the intermediate tensors.
- Returns:
- tuple or list
If
intermediates_only=False, returns(final, intermediates)Otherwise returnsintermediates.
- Raises:
- ValueError
If an unsupported output format is requested.
- no_weight_decay()[source]¶
Return parameter names that should typically avoid weight decay.
- Returns:
- set of str
Parameter names exempt from weight decay.
Examples using nidl.volume.backbones.VisionTransformer3D¶
Self-Supervised Learning with I-JEPA on MedMNIST3D