🗂️ Code Structure

This page provides a high-level overview of the main components in ViT-SSL and how they fit together.

vit_core

Core implementation of the Vision Transformer (ViT) and self-supervised variants.

  • patch_embedding.py – splits the image into patches and projects them to the embedding dimension.
  • attention.py – multi-head self-attention layer used inside each transformer block.
  • feed_forward.py – MLP block that follows the attention mechanism.
  • encoder_block.py – combines attention and feed-forward layers into a single transformer block.
  • mlp_head.py – classification head applied to the final representation.
  • vit.py – convenience wrapper assembling the full ViT model.
  • ssl/ – DINO- and SimMIM-specific components.
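
As a rough illustration of the first stage in the list above, the sketch below splits an image into flattened, non-overlapping patches using only the standard library. It covers the splitting step alone and omits the learned projection to the embedding dimension. The function name `split_into_patches` and the nested-list image layout are assumptions made for this example and are not taken from patch_embedding.py, which presumably operates on tensors.

```python
from typing import List

# Height x width x channels, stored as nested lists (an assumption for this sketch).
Image = List[List[List[float]]]


def split_into_patches(image: Image, patch_size: int) -> List[List[float]]:
    """Split an H x W x C image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")

    patches = []
    # Walk the patch grid row by row, left to right.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(patch)
    return patches


# A 4 x 4 single-channel image whose pixel value encodes its position.
image = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch_size=2)
print(len(patches), len(patches[0]))  # 4 patches, each holding 2 * 2 * 1 values
```

Each flattened patch would then be mapped to the embedding dimension and passed through the stacked encoder blocks before reaching the head.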

utils

Utilities used throughout training and evaluation.

  • scripts/ – visualization scripts for supervised models and SimMIM.
  • trainers/ – training loops for the supervised, DINO, and SimMIM methods.
  • logger.py – colored logging based on the rich library.
  • metrics.py – common evaluation metrics.
  • schedulers.py – helper learning rate schedulers.
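
This page does not say which schedules schedulers.py provides. As one common example from ViT training, the sketch below computes a linear warmup followed by cosine decay. The function name `warmup_cosine_lr` and its parameters are hypothetical and may not match the helpers in this repository.

```python
import math


def warmup_cosine_lr(step: int, total_steps: int, warmup_steps: int,
                     base_lr: float, min_lr: float = 0.0) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at a few steps of a hypothetical 100-step run.
for step in (0, 9, 10, 55, 100):
    print(step, round(warmup_cosine_lr(step, 100, 10, base_lr=1e-3), 6))
```

The warmup phase keeps early updates small while the optimizer statistics settle, and the cosine phase then lowers the rate smoothly toward `min_lr`.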

configs

Hydra configuration files. Base configs live under configs/base/ and each method has overrides (e.g. configs/dino/, configs/simmim/). The root config.yaml selects which set of overrides to apply.

For details on each configuration option, see the Configuration Guide section.
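
The layering of base configs and method overrides can be pictured as a recursive dictionary merge in which later values win. The sketch below is a conceptual model only. It does not use Hydra's API, whose composition is richer, and every key and value in it is hypothetical.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_configs(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `base` with `override` applied recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_configs(merged[key], value)
        else:
            # Otherwise the override replaces (or adds) the value.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical keys and values, for illustration only.
base = {"model": {"patch_size": 16, "depth": 12}, "optimizer": {"lr": 1e-3}}
simmim_overrides = {"model": {"mask_ratio": 0.6}, "optimizer": {"lr": 5e-4}}

config = merge_configs(base, simmim_overrides)
print(config)
```

Settings shared by all methods stay in one place under this scheme, and each method directory only needs to state what differs.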

Other directories

  • data/ – small dataset wrappers used for experiments (see data format).
  • evaluators/ – evaluation and helper scripts. See Evaluation Scripts for usage examples.
  • tests/ – pytest unit tests for the core modules.
  • train.py – main entry point that loads a config and launches training.

Use these modules together to experiment with different SSL approaches and fine-tuning scenarios.

For details on each algorithm, see the Training Methods section.