📊 Data Format

This page describes the lightweight dataset wrappers in data/datasets.py and how their input files are organized.


CIFAR10Dataset

  • CSV annotations – two columns: image filename without extension and label.
  • Image folder – root_dir contains the corresponding *.png files.

images/
  00001.png
  00002.png
  ...
train.csv

Each row in train.csv has the form image_name,label, where image_name omits the .png extension. Labels are mapped to integer indices using the unique values found in the file.
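
As a rough illustration of the label mapping, the sketch below parses CSV text and assigns an index to each distinct label. It is not the code from data/datasets.py: the function name build_label_index is hypothetical, and it assumes a header-less file and that indices follow the sorted order of the labels, neither of which is confirmed by this page.

```python
import csv
import io


def build_label_index(csv_text):
    """Parse image_name,label rows and map each distinct label to an index.

    Assumptions (not confirmed by the docs): the file has no header row,
    and indices are assigned in sorted label order.
    """
    reader = csv.reader(io.StringIO(csv_text))
    rows = [tuple(row) for row in reader if row]  # skip blank lines

    # Collect the unique labels and give each one a stable integer index.
    classes = sorted({label for _, label in rows})
    label_to_idx = {label: i for i, label in enumerate(classes)}

    # The CSV stores names without an extension, so append .png here.
    samples = [(f"{name}.png", label_to_idx[label]) for name, label in rows]
    return label_to_idx, samples


csv_text = "00001,cat\n00002,dog\n00003,cat\n"
label_to_idx, samples = build_label_index(csv_text)
```

Sorting the labels before indexing keeps the mapping stable across runs and independent of row order, which is why the sketch does not rely on first-appearance order.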

STL10Dataset

  • JSON annotations – similar two-field structure ["file","label"].
  • Image folder – root_dir points to the directory with .png images.

The JSON file can store absolute paths or paths relative to the image folder. In either case only the final filename component is used, and the image is loaded from root_dir.
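
The filename handling can be sketched as follows. This is an illustration under assumptions, not the wrapper's actual code: resolve_image_paths is a hypothetical helper, and the JSON is assumed to be a list of objects with "file" and "label" keys.

```python
import json
import os


def resolve_image_paths(json_text, root_dir):
    """Return (path, label) pairs with every image resolved inside root_dir.

    Assumption: the annotations are a JSON list of objects with
    "file" and "label" keys.
    """
    entries = json.loads(json_text)
    resolved = []
    for entry in entries:
        # Drop any directory part; only the final filename is kept.
        filename = os.path.basename(entry["file"])
        resolved.append((os.path.join(root_dir, filename), entry["label"]))
    return resolved


json_text = json.dumps([
    {"file": "/data/stl10/train/img_001.png", "label": "bird"},
    {"file": "train/img_002.png", "label": "ship"},
])
pairs = resolve_image_paths(json_text, "images")
```

One consequence of keeping only the filename is that two entries with the same filename in different directories resolve to the same image, so filenames need to be unique within root_dir.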

STL10UnsupervisedDataset

Used for unlabeled data. Simply provide a directory with images. Filenames are sorted alphabetically to create a deterministic order.
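
A minimal sketch of the deterministic ordering, assuming a hypothetical helper list_images that keeps only .png files (the real wrapper may filter differently):

```python
import os
import tempfile


def list_images(root_dir, extension=".png"):
    """Return image filenames in root_dir in sorted (lexicographic) order."""
    return sorted(
        name for name in os.listdir(root_dir)
        if name.lower().endswith(extension)
    )


# Demo on a throwaway directory: creation order differs from sorted order.
with tempfile.TemporaryDirectory() as root_dir:
    for name in ["00010.png", "00002.png", "notes.txt", "00001.png"]:
        open(os.path.join(root_dir, name), "w").close()
    names = list_images(root_dir)
```

os.listdir returns entries in arbitrary order, so the explicit sort is what makes indexing reproducible. The sort is lexicographic, so zero-padded names such as 00002.png keep their numeric order, while unpadded names would not (10.png sorts before 2.png).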

STL10DINODataset

A dataset for DINO pretraining. Like STL10UnsupervisedDataset, it loads images from a folder, but it also generates multiple global and local views using the provided transform functions. Parameters:

  • num_all_views – total number of views to create.
  • num_global_views – how many of those views are global, meaning they cover more than half of the image area.
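
How the two parameters interact can be sketched as below. The helper make_views and its argument names are hypothetical, and the sketch assumes that every view that is not global is local and that global views come first in the returned list. The transforms are plain callables here so the example runs without an image library.

```python
def make_views(image, global_transform, local_transform,
               num_all_views, num_global_views):
    """Build a list of views: global ones first, then local ones.

    Assumption: local views = num_all_views - num_global_views.
    """
    if not 0 <= num_global_views <= num_all_views:
        raise ValueError("num_global_views must be between 0 and num_all_views")

    num_local_views = num_all_views - num_global_views
    views = [global_transform(image) for _ in range(num_global_views)]
    views += [local_transform(image) for _ in range(num_local_views)]
    return views


# Stand-in transforms that tag the input instead of cropping a real image.
views = make_views(
    "img",
    global_transform=lambda x: ("global", x),
    local_transform=lambda x: ("local", x),
    num_all_views=6,
    num_global_views=2,
)
```

With real data, the two callables would be random-crop pipelines whose scale ranges sit above and below half of the image area respectively.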

Directory expectations

  • Images should be stored as .png files.
  • Annotation files (.csv or .json) reside alongside or above the image folder.
  • root_dir passed to the dataset points to the folder containing the images.

These simple structures keep the examples short and focused on the SSL algorithms rather than data loading.