KTH-RPL/HilDA

HilDA: Hierarchical Distillation with Diffusion
for Advancing Self-Supervised LiDAR Pre-training

arXiv · Project Page

ECCV 2026

Maciej Wozniak¹,* · Jesper Ericsson¹,³,* · Hariprasath Govindarajan² · Truls Nyberg¹,³
Thomas Gustafsson³ · Patric Jensfelt¹ · Olov Andersson¹

¹KTH Royal Institute of Technology   ²Linköping University   ³TRATON AB / Scania
*Equal contribution


CODE COMING SOON :)

Overview

Vision Foundation Models (VFMs) are powerful teachers for camera-to-LiDAR knowledge distillation, but current methods treat them as black boxes — distilling only the final layer and ignoring both the teacher's layer-wise semantic structure and the spatiotemporal information in LiDAR sequences.

HilDA is a self-supervised pre-training framework that captures both the semantic what and the geometric where needed for driving. It combines hierarchical distillation (multi-layer + global context) with a temporal occupancy diffusion objective.

Figure (HilDA teaser): Segmentation errors (red) progressively vanish as we add (a) temporal occupancy diffusion, (b) multi-layer distillation, and (c) global context (CLS) distillation.

Method

Figure: HilDA architecture.

From LiDAR sweeps and synchronized multi-view images, a 3D backbone is trained end-to-end with three self-supervised objectives — no task labels:

| # | Component | What it does |
|---|-----------|--------------|
| 1 | Multi-Layer Distillation | Aligns multiple teacher layers with student layers via calibrated point–pixel correspondences, transferring how features form across the hierarchy. |
| 2 | Global Context Distillation | Aligns the VFM's CLS token with a learnable 3D global-context token, injecting scene-level semantics. |
| 3 | Temporal Occupancy Diffusion | A conditional diffusion model denoises future BEV occupancy from past + present features, teaching object permanence and scene dynamics. |
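
As a rough illustration of the first two objectives, the sketch below projects LiDAR points into one camera image, looks up teacher features at the resulting pixels, and averages a per-layer cosine distance. It is not the HilDA implementation (the code is not yet released). The function names, the cosine loss, the nearest-pixel lookup, and the assumption that teacher feature maps are already resized to image resolution and match the student's channel width are all illustrative choices.

```python
import numpy as np


def project_points(points_xyz, cam_from_lidar, intrinsics, image_hw):
    """Project LiDAR points (N, 3) to pixels; returns (uv, valid_mask).

    cam_from_lidar is a 4x4 extrinsic matrix, intrinsics a 3x3 pinhole
    matrix. Both are assumed to come from sensor calibration.
    """
    n = points_xyz.shape[0]
    homo = np.concatenate([points_xyz, np.ones((n, 1))], axis=1)  # (N, 4)
    cam = (cam_from_lidar @ homo.T).T[:, :3]                      # camera frame
    in_front = cam[:, 2] > 1e-6
    depth = np.where(in_front, cam[:, 2], 1.0)  # placeholder depth avoids /0
    uv = (intrinsics @ cam.T).T[:, :2] / depth[:, None]
    h, w = image_hw
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv, in_front & inside


def cosine_distance(a, b, eps=1e-8):
    """1 - cosine similarity along the last axis."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return 1.0 - np.sum(a * b, axis=-1)


def multi_layer_distill_loss(student_feats, teacher_maps, uv, valid):
    """Average cosine distance over paired (student, teacher) layers.

    student_feats: list of (N, C) per-point features, one per student layer.
    teacher_maps:  list of (H, W, C) teacher feature maps, one per layer
                   (assumed already upsampled to image resolution).
    """
    cols = np.floor(uv[valid, 0]).astype(int)
    rows = np.floor(uv[valid, 1]).astype(int)
    per_layer = []
    for s, t in zip(student_feats, teacher_maps):
        target = t[rows, cols]  # teacher feature at each point's pixel
        per_layer.append(cosine_distance(s[valid], target).mean())
    return float(np.mean(per_layer))


def global_context_loss(student_global_token, teacher_cls_token):
    """Cosine distance between the 3D global token and the teacher CLS token."""
    return float(cosine_distance(student_global_token, teacher_cls_token))
```

In a real training loop these terms would be computed on framework tensors with gradients and summed with the diffusion objective; the loss weights are not stated here and would be a further assumption.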

The distillation and diffusion heads are discarded at inference — only the pre-trained backbone transfers to all downstream tasks, with no re-pretraining.
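
The temporal occupancy diffusion objective from the table above can be sketched as one training step of a conditional denoiser. The exact diffusion parameterization is not specified here, so this sketch assumes the standard DDPM noise-prediction loss with a linear beta schedule. `denoiser`, `cond_feats`, and the schedule values are hypothetical placeholders, and the future BEV occupancy grid is assumed to be scaled to [-1, 1].

```python
import numpy as np


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t ~ q(x_t | x_0) and return it with the noise that was used."""
    noise = rng.standard_normal(x0.shape)
    a = alpha_bars[t]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise
    return x_t, noise


def diffusion_loss(denoiser, future_occ, cond_feats, alpha_bars, rng):
    """One noise-prediction step on a future BEV occupancy grid.

    denoiser(x_t, t, cond_feats) is a hypothetical model that predicts the
    injected noise, conditioned on past + present backbone features.
    """
    t = int(rng.integers(0, len(alpha_bars)))  # random diffusion timestep
    x_t, noise = add_noise(future_occ, t, alpha_bars, rng)
    pred = denoiser(x_t, t, cond_feats)
    return float(np.mean((pred - noise) ** 2))
```

Because the denoiser only sees the future occupancy through its noisy version, it has to rely on the conditioning features to predict the scene ahead, which is one way to read the "object permanence and scene dynamics" claim.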

Results

HilDA sets a new state of the art on camera–LiDAR cross-modal distillation and transfers strongly to spatial and spatiotemporal 3D tasks.

Semantic Segmentation

Figure (Segmentation comparison): Fewer errors than ScaLR; correctly segments rare long-tail cases (scooter driver, person on a truck).

3D Object Detection

Figure (3D detection comparison): Robust detections at long range and under heavy occlusion, where prior distillation baselines miss objects.

Semantic Occupancy

Figure (Semantic occupancy): Cleaner, more complete semantic occupancy than ScaLR / CleverDistiller; highest mIoU across a 5-second horizon.
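
For readers unfamiliar with the metric, mIoU (mean intersection over union) averages the per-class overlap between predicted and ground-truth labels. The helper below is a generic sketch of that definition, not the evaluation script behind the reported numbers; the function name and the `ignore_index` convention are assumptions.

```python
import numpy as np


def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean IoU over classes that appear in the prediction or ground truth."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    if ignore_index is not None:
        keep = target != ignore_index  # drop unlabeled voxels/points
        pred, target = pred[keep], target[keep]
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:  # skip classes absent from both pred and target
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")
```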

Cross-Modal Feature Alignment

Figure (Cross-modal feature similarity): HilDA's 3D feature similarity (bottom) closely matches DINOv2's 2D pattern, indicating strong cross-modal alignment.
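
A similarity visualization of this kind is commonly produced by picking one query feature and computing its cosine similarity to every other feature, once for 2D patches and once for 3D points. The sketch below shows that computation under this assumption; how the figure was actually generated is not described here, and `similarity_to_query` is a hypothetical helper.

```python
import numpy as np


def similarity_to_query(feats, query_index, eps=1e-8):
    """Cosine similarity of every feature row (N, C) to one query row."""
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + eps)
    return normed @ normed[query_index]
```

Applying the same function to image-patch features and to per-point LiDAR features, with the query chosen at corresponding locations, yields two maps that can be compared side by side.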

Recovering Annotation Errors

Figure (Annotation error recovery): Ground truth mislabels a light pole and signs as "vegetation"; HilDA correctly predicts "manmade".

To publish via GitHub Pages, push this folder to a repository and enable Pages on the branch root.

Citation

@inproceedings{wozniak2026hilda,
  title     = {HilDA: Hierarchical Distillation with Diffusion for
               Advancing Self-Supervised LiDAR Pre-training},
  author    = {Wozniak, Maciej and Ericsson, Jesper and
               Govindarajan, Hariprasath and Nyberg, Truls and
               Gustafsson, Thomas and Jensfelt, Patric and Andersson, Olov},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}

Website template adapted from Nerfies.
