HilDA: Hierarchical Distillation with Diffusion
for Advancing Self-Supervised LiDAR Pre-training

Maciej Wozniak¹ · Jesper Ericsson¹,³ · Hariprasath Govindarajan² · Truls Nyberg¹,³
Thomas Gustafsson³ · Patric Jensfelt¹ · Olov Andersson¹

¹KTH Royal Institute of Technology ²Linköping University ³TRATON AB / Scania
_{*Equal contribution}

CODE COMING SOON :)

Overview

Vision Foundation Models (VFMs) are powerful teachers for camera-to-LiDAR knowledge distillation, but current methods treat them as black boxes — distilling only the final layer and ignoring both the teacher's layer-wise semantic structure and the spatiotemporal information in LiDAR sequences.

HilDA is a self-supervised pre-training framework that captures both the semantic what and the geometric where needed for driving. It combines hierarchical distillation (multi-layer + global context) with a temporal occupancy diffusion objective.

_{Segmentation errors (red) progressively vanish as we add (a) temporal occupancy diffusion, (b) multi-layer distillation, and (c) global context (CLS) distillation.}

Method

From LiDAR sweeps and synchronized multi-view images, a 3D backbone is trained end-to-end with three self-supervised objectives — no task labels:

#	Component	What it does
1	Multi-Layer Distillation	Aligns multiple teacher layers with student layers via calibrated point–pixel correspondences, transferring how features form across the hierarchy.
2	Global Context Distillation	Aligns the VFM's CLS token with a learnable 3D global-context token, injecting scene-level semantics.
3	Temporal Occupancy Diffusion	A conditional diffusion model denoises future BEV occupancy from past + present features, teaching object permanence and scene dynamics.
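
As a rough illustration of the three objectives in the table, the sketch below writes each one as a small NumPy loss. The README does not state the actual loss functions, so the cosine-distance form, the standard noise-prediction (DDPM-style) objective, and every function and argument name here are assumptions. Projection heads that would match student and teacher feature widths are omitted.

```python
import numpy as np

def cosine_loss(student, teacher, eps=1e-8):
    """Mean (1 - cosine similarity) over matched rows of two (N, C) arrays."""
    s = student / (np.linalg.norm(student, axis=1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

def multi_layer_distill_loss(student_layers, teacher_maps, pixel_rc):
    """Component 1 (assumed form): average the loss over paired layers.

    student_layers: list of (N, C) per-point features, one per student layer
    teacher_maps:   list of (H, W, C) image feature maps, one per teacher layer
    pixel_rc:       (N, 2) integer (row, col) of the pixel each point projects to
    """
    rows, cols = pixel_rc[:, 0], pixel_rc[:, 1]
    per_layer = [cosine_loss(s, t[rows, cols])
                 for s, t in zip(student_layers, teacher_maps)]
    return float(np.mean(per_layer))

def global_context_loss(student_token, teacher_cls):
    """Component 2 (assumed form): align a 3D global token (C,) with the CLS token (C,)."""
    return cosine_loss(student_token[None, :], teacher_cls[None, :])

def occupancy_diffusion_loss(future_occ, predict_noise, alpha_bar_t, rng):
    """Component 3 (assumed form): one noise-prediction step on a future BEV grid.

    future_occ:    (H, W) target occupancy in [0, 1]
    predict_noise: callable(noisy_grid) -> (H, W) noise estimate; a real model
                   would also take past/present features and the timestep
    alpha_bar_t:   cumulative noise-schedule value in (0, 1) for the sampled step
    """
    noise = rng.standard_normal(future_occ.shape)
    noisy = np.sqrt(alpha_bar_t) * future_occ + np.sqrt(1.0 - alpha_bar_t) * noise
    return float(np.mean((predict_noise(noisy) - noise) ** 2))
```

In a training loop these three terms would presumably be summed with some weighting. The README does not give the weights, so they are left out here.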

The distillation and diffusion heads are discarded at inference — only the pre-trained backbone transfers to all downstream tasks, with no re-pretraining.
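
Since the code is not yet released, the following is only a minimal sketch of what discarding the pre-training heads could look like. The checkpoint is modeled as a plain dictionary, and the key prefixes (`backbone.`, `distill_head.`, `diffusion_head.`) are hypothetical names chosen for illustration.

```python
def extract_backbone(state, prefix="backbone."):
    """Keep only backbone entries and strip the prefix; head entries are dropped."""
    return {k[len(prefix):]: v for k, v in state.items() if k.startswith(prefix)}

# Hypothetical pre-training checkpoint: parameter name -> weights.
pretrained = {
    "backbone.conv1.weight": [0.1, 0.2],
    "backbone.conv2.weight": [0.3],
    "distill_head.proj.weight": [0.4],    # used during pre-training only
    "diffusion_head.unet.weight": [0.5],  # used during pre-training only
}

backbone_only = extract_backbone(pretrained)
print(sorted(backbone_only))  # ['conv1.weight', 'conv2.weight']
```

The same filtered dictionary would then initialize the backbone for each downstream task, which is what makes a single pre-training run reusable.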

Results

HilDA sets a new state of the art on camera–LiDAR cross-modal distillation and transfers strongly to spatial and spatiotemporal 3D tasks.

Semantic Segmentation

_{Fewer errors than ScaLR; correctly segments rare long-tail cases (scooter driver, person on a truck).}

3D Object Detection

_{Robust detections at long range and under heavy occlusion, where prior distillation baselines miss objects.}

Semantic Occupancy

_{Cleaner, more complete semantic occupancy than ScaLR / CleverDistiller; highest mIoU across a 5-second horizon.}
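
For readers unfamiliar with the metric, mIoU averages the per-class intersection-over-union between predicted and ground-truth labels. The sketch below uses the standard definition on flattened label arrays. Skipping classes that are absent from both arrays is a common convention and an assumption here, not something this README specifies.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes present in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both arrays: skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

# Toy example: six voxels, three classes.
pred = np.array([0, 0, 1, 1, 2, 2])
target = np.array([0, 1, 1, 1, 2, 0])
print(mean_iou(pred, target, num_classes=3))
```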

Cross-Modal Feature Alignment

_{HilDA's 3D feature similarity (bottom) closely matches DINOv2's 2D pattern — strong cross-modal alignment.}
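
A similarity map of this kind is commonly produced by picking one query feature and computing its cosine similarity to every other feature. The sketch below assumes that recipe, since the README does not describe how the figure was generated. The function name and the toy features are illustrative only.

```python
import numpy as np

def similarity_to_query(features, query_index, eps=1e-8):
    """Cosine similarity of every row of (N, C) `features` to the query row."""
    unit = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
    return unit @ unit[query_index]

# Toy per-point features: the first two point the same way, the last is opposite.
feats = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
sims = similarity_to_query(feats, query_index=0)
```

Applying the same function to 2D teacher features and to 3D student features, with a query at corresponding locations, would give two maps that can be compared side by side.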

Recovering Annotation Errors

_{Ground truth mislabels a light pole and signs as "vegetation"; HilDA correctly predicts "manmade".}

To publish via GitHub Pages, push this folder to a repository and enable Pages on the branch root.

Citation

@inproceedings{wozniak2026hilda,
  title     = {HilDA: Hierarchical Distillation with Diffusion for
               Advancing Self-Supervised LiDAR Pre-training},
  author    = {Wozniak, Maciej and Ericsson, Jesper and
               Govindarajan, Hariprasath and Nyberg, Truls and
               Gustafsson, Thomas and Jensfelt, Patric and Andersson, Olov},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}

_{Website template adapted from Nerfies.}
