Maciej Wozniak1,* · Jesper Ericsson1,3,* · Hariprasath Govindarajan2 · Truls Nyberg1,3
Thomas Gustafsson3 · Patric Jensfelt1 · Olov Andersson1
1KTH Royal Institute of Technology 2Linköping University 3TRATON AB / Scania
*Equal contribution
Vision Foundation Models (VFMs) are powerful teachers for camera-to-LiDAR knowledge distillation, but current methods treat them as black boxes — distilling only the final layer and ignoring both the teacher's layer-wise semantic structure and the spatiotemporal information in LiDAR sequences.
HilDA is a self-supervised pre-training framework that captures both the semantic what and the geometric where needed for driving. It combines hierarchical distillation (multi-layer + global context) with a temporal occupancy diffusion objective.
Segmentation errors (red) progressively vanish as we add (a) temporal occupancy diffusion, (b) multi-layer distillation, and (c) global context (CLS) distillation.
From LiDAR sweeps and synchronized multi-view images, a 3D backbone is trained end-to-end with three self-supervised objectives — no task labels:
| # | Component | What it does |
|---|---|---|
| 1 | Multi-Layer Distillation | Aligns multiple teacher layers with student layers via calibrated point–pixel correspondences, transferring how features form across the hierarchy. |
| 2 | Global Context Distillation | Aligns the VFM's CLS token with a learnable 3D global-context token, injecting scene-level semantics. |
| 3 | Temporal Occupancy Diffusion | A conditional diffusion model denoises future BEV occupancy from past + present features, teaching object permanence and scene dynamics. |
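The first two components in the table can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a single pinhole camera with intrinsics `K` and a LiDAR-to-camera extrinsic `T_cam_from_lidar`, a ViT teacher with patch size 14, student features already projected to the teacher's channel width, a cosine loss, and equal weighting of the two terms. All function and parameter names are hypothetical.

```python
import numpy as np

def project_points(points_xyz, K, T_cam_from_lidar, image_hw):
    """Project LiDAR points into one camera; return pixel coords and a validity mask."""
    n = points_xyz.shape[0]
    homo = np.hstack([points_xyz, np.ones((n, 1))])      # (N, 4) homogeneous points
    cam = (T_cam_from_lidar @ homo.T).T[:, :3]           # (N, 3) in the camera frame
    in_front = cam[:, 2] > 1e-6                          # keep points in front of the camera
    z = np.where(in_front, cam[:, 2], 1.0)               # avoid division by zero
    uv = (K @ cam.T).T[:, :2] / z[:, None]               # (N, 2) pixel coordinates (u, v)
    h, w = image_hw
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv, in_front & inside

def cosine_loss(a, b, eps=1e-8):
    """Mean (1 - cosine similarity) over the last axis."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(a * b, axis=-1)))

def hierarchical_distillation_loss(student_layers, teacher_maps, uv, valid,
                                   student_global, teacher_cls, patch=14):
    """Multi-layer point-pixel loss plus a global (CLS) loss.

    student_layers: list of (N, C) per-point features, one per matched layer.
    teacher_maps:   list of (H/patch, W/patch, C) teacher patch features.
    student_global: (C,) global-context token; teacher_cls: (C,) CLS token.
    """
    rows = (uv[valid, 1] // patch).astype(int)           # patch row of each visible point
    cols = (uv[valid, 0] // patch).astype(int)           # patch column of each visible point
    layer_losses = []
    for s_feat, t_map in zip(student_layers, teacher_maps):
        t_feat = t_map[rows, cols]                       # teacher feature under each point
        layer_losses.append(cosine_loss(s_feat[valid], t_feat))
    multi_layer = float(np.mean(layer_losses))
    global_ctx = cosine_loss(student_global, teacher_cls)
    # Equal weighting is an assumption of this sketch.
    return multi_layer + global_ctx, multi_layer, global_ctx
```

With several cameras, the same projection would be repeated per view and the per-view losses averaged. How student and teacher layers are paired is a design choice this sketch leaves to the caller through the order of the two lists.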
The distillation and diffusion heads are discarded at inference — only the pre-trained backbone transfers to all downstream tasks, with no re-pretraining.
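The temporal occupancy diffusion objective (component 3 in the table) can be sketched as a standard DDPM-style training step. The page does not state the noise schedule, the parameterization, or the denoiser architecture, so the linear schedule, the noise-prediction loss, the occupancy scaling to [-1, 1], and the `denoiser(x_t, t, condition)` callable below are all assumptions made for illustration.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal retention for a linear beta schedule (assumed, DDPM-style)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noise_occupancy(x0, t, alpha_bar, rng):
    """Forward process: corrupt a clean future BEV occupancy grid x0 at step t."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

def diffusion_loss(denoiser, x0_future, condition, t, alpha_bar, rng):
    """MSE between the true noise and the noise predicted by a conditional denoiser.

    `condition` stands in for the backbone features of past and present sweeps;
    `denoiser` is a hypothetical callable, for example a small network head.
    """
    x_t, eps = noise_occupancy(x0_future, t, alpha_bar, rng)
    eps_hat = denoiser(x_t, t, condition)
    return float(np.mean((eps_hat - eps) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = make_alpha_bar()
    # Toy future occupancy: 4 future frames of an 8x8 BEV grid, scaled to {-1, +1}.
    x0 = np.where(rng.random((4, 8, 8)) > 0.5, 1.0, -1.0)
    cond = rng.standard_normal((16,))                    # placeholder backbone features
    zero_denoiser = lambda x_t, t, c: np.zeros_like(x_t) # untrained stand-in
    print(diffusion_loss(zero_denoiser, x0, cond, t=500, alpha_bar=alpha_bar, rng=rng))
```

Because the gradient of this loss reaches the backbone only through `condition`, the denoising head itself can be dropped after pre-training, which is consistent with the heads being discarded at inference.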
HilDA sets a new state of the art on camera–LiDAR cross-modal distillation and transfers strongly to spatial and spatiotemporal 3D tasks.
Fewer errors than ScaLR; correctly segments rare long-tail cases (scooter driver, person on a truck).
Robust detections at long range and under heavy occlusion, where prior distillation baselines miss objects.
Cleaner, more complete semantic occupancy than ScaLR / CleverDistiller; highest mIoU across a 5-second horizon.
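For readers unfamiliar with the metric, the following is a minimal sketch of how mIoU could be tracked per forecast step. It assumes integer class labels per voxel, an ignore label of 255, and per-sample averaging. Benchmarks typically accumulate intersections and unions over the whole dataset instead, and the exact protocol used for the results here is not stated on this page.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection-over-union across classes present in pred or target."""
    keep = target != ignore_index                        # drop unlabeled voxels
    pred, target = pred[keep], target[keep]
    ious = []
    for c in range(num_classes):
        union = np.sum((pred == c) | (target == c))
        if union == 0:
            continue                                     # class absent: skip it
        inter = np.sum((pred == c) & (target == c))
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")

def miou_over_horizon(pred_seq, target_seq, num_classes):
    """One mIoU value per future time step, in order."""
    return [mean_iou(p, t, num_classes) for p, t in zip(pred_seq, target_seq)]
```

Plotting the returned list against the forecast time shows how quickly prediction quality degrades along the horizon.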
HilDA's 3D feature similarity (bottom) closely matches DINOv2's 2D pattern — strong cross-modal alignment.
@inproceedings{wozniak2026hilda,
title = {HilDA: Hierarchical Distillation with Diffusion for
Advancing Self-Supervised LiDAR Pre-training},
author = {Wozniak, Maciej and Ericsson, Jesper and
Govindarajan, Hariprasath and Nyberg, Truls and
Gustafsson, Thomas and Jensfelt, Patric and Andersson, Olov},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2026}
}

Website template adapted from Nerfies.

