Open-vocabulary 3D object search using frozen point cloud encoders (Concerto) and a CLIP-aligned MLP translation head. Validated on S3DIS Area 5 and custom in-the-wild scans.
- Overview
- Architecture
- Repository Structure
- Dependencies
- Setup & Installation
- Data Preparation
- Training the Translation Head
- Evaluation
- In-the-Wild Demo
- Team & Acknowledgements
- References
This project builds and evaluates a pipeline for open-world text-based object search in 3D point cloud scenes. The core idea:
- Frozen 3D encoder: We use Concerto Small (pretrained on large-scale 3D data) to extract per-point features from indoor scans. The backbone is never fine-tuned.
- CLIP-aligned MLP translation head: A lightweight 3-layer MLP maps Concerto's 3D feature space into CLIP's text embedding space. It is trained with supervision on pairs of S3DIS labels and their CLIP text embeddings.
- Open-vocabulary querying: At inference, a user provides a free-text query (e.g., "red chair", "lamp", "whiteboard"). The query is embedded by CLIP's text encoder and matched against the translated per-point features via cosine similarity, producing a heatmap over the point cloud.
- In-the-wild generalization: We test the pipeline on custom scenes captured and exported as `.ply`. Polycam was initially considered but dropped because of its paywall; any `.ply` point cloud with XYZ and RGB channels can be used to test the model.
┌──────────────┐     ┌─────────────────┐     ┌──────────────────┐
│ Point Cloud  │────▶│ Concerto Small  │────▶│ Per-point 3D     │
│ (XYZ + RGB)  │     │ (frozen, 39M)   │     │ features (D=896) │
└──────────────┘     └─────────────────┘     └────────┬─────────┘
                                                      │
                                                      ▼
                                             ┌──────────────────┐
                                             │ MLP Translation  │
                                             │ Head (trainable, │
                                             │ 3 layers)        │
                                             └────────┬─────────┘
                                                      │
                                                      ▼
┌──────────────┐     ┌─────────────────┐     ┌──────────────────┐
│ Text query   │────▶│ CLIP Text       │────▶│ Cosine sim /     │
│ "red chair"  │     │ Encoder (frozen)│     │ heatmap on PC    │
└──────────────┘     └─────────────────┘     └──────────────────┘
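The final matching step of the pipeline can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code in `src/`: the function name `similarity_heatmap` is hypothetical, and the random arrays stand in for the MLP outputs and the CLIP text embedding (both assumed to be 512-dimensional, matching the MLP output size).

```python
import numpy as np

def similarity_heatmap(point_feats: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Cosine similarity between each translated point feature and one text embedding."""
    # L2-normalise both sides so that the dot product equals the cosine similarity.
    p_norm = np.clip(np.linalg.norm(point_feats, axis=1, keepdims=True), 1e-8, None)
    t_norm = max(float(np.linalg.norm(text_emb)), 1e-8)
    return (point_feats / p_norm) @ (text_emb / t_norm)  # shape (N,), values in [-1, 1]

# Random stand-ins (assumption): real inputs would be the translated per-point
# features and the CLIP embedding of the user's query.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 512))
query = rng.normal(size=512)

scores = similarity_heatmap(feats, query)
top = np.argsort(scores)[::-1][:10]  # indices of the 10 best-matching points
```

The per-point `scores` array is what gets rendered as the heatmap; `top` shows how the same scores can drive a top-k retrieval.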
Deep_learning_project/
├── README.md                       # This file
├── LICENSE
├── .gitignore
├── pyproject.toml                  # Project dependencies for Python >= 3.11
├── uv.lock                         # Lockfile generated by uv
├── .env                            # Environment variables
├── readme_dataset.txt              # Information on dataset format
│
├── configs/                        # Training & eval config files (YAML)
│   └── train_mlp_s3dis.yaml        # Configuration with input_dim=896
│
├── src/                            # Core Python source code
│   ├── encoder.py                  # Concerto feature extraction wrapper
│   ├── translation_head.py         # MLP definition & forward pass
│   ├── clip_utils.py               # CLIP text embedding helpers
│   ├── dataset.py                  # S3DIS + custom dataset loaders
│   ├── train.py                    # Training loop for the MLP head
│   ├── evaluate.py                 # Quantitative evaluation
│   ├── evaluate_labels.py          # Label-specific evaluation
│   └── visualize.py                # 3D heatmap visualization utilities
│
├── notebooks/                      # Colab notebooks (run sequentially)
│   ├── pyproject.toml              # Project dependencies for Python 3.10 (Colab/spconv)
│   ├── 01_setup_and_data.ipynb
│   ├── 02_feature_extraction.ipynb
│   ├── 03_train_mlp.ipynb
│   ├── 04_evaluate.ipynb
│   ├── 04b_evaluate_labels.ipynb
│   ├── 05_demo.ipynb
│   └── 06_visualize_room.ipynb
│
├── scripts/                        # CLI utility scripts
│   ├── extract_features.py         # Batch feature extraction
│   ├── prepare_s3dis.py            # S3DIS preprocessing
│   ├── export_polycam.py           # .ply to pipeline format
│   ├── demo.py                     # Script for running demo
│   └── visualize_concerto_pca.py   # PCA visualization of Concerto features
│
├── tests/                          # Unit & smoke tests
├── docs/                           # Documentation & papers
├── data/                           # Symlinked from Google Drive
├── features/                       # Extracted Concerto features (.npz)
└── presentation/                   # Final slides & demo materials
We use uv for package management because it is significantly faster than standard pip.
There are two pyproject.toml files in this repository:
- `pyproject.toml` (root): for standard environments (Python >= 3.11).
- `notebooks/pyproject.toml`: specifically for Google Colab; it pins Python 3.10 to stay compatible with `spconv` (required by Concerto).
Key dependencies include:
- `torch` (≥ 2.1)
- `pointcept` (Concerto encoder & data utilities)
- `open_clip_torch` (CLIP text encoder)
- `spconv-cu120` (sparse convolution backend for Colab)
- `open3d`, `numpy`, `scipy`, `plotly`
This project is designed to run from Jupyter notebooks on Google Colab, which effectively serves as a virtual machine. Almost all code is executed with `uv run` inside the notebooks.
For the notebooks to run seamlessly, you should have the data structured in your Google Drive as follows:
Drive/
└── DL_Project/
    ├── data/
    │   ├── s3dis_raw/          # Raw S3DIS files
    │   └── s3dis_processed/    # Preprocessed S3DIS files
    └── checkpoints/            # Trained models
- Clone the repo in Colab and navigate to the project directory.
- Mount Google Drive to access data and checkpoints.
- Run the notebooks sequentially (`notebooks/01` to `notebooks/06`).
The notebooks will automatically install dependencies using uv (based on notebooks/pyproject.toml), set up the necessary symlinks (e.g., ./data -> /content/drive/MyDrive/DL_Project/data), and execute the pipeline steps via !uv run.
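Before running the notebooks, it can help to confirm that the Drive folder matches the expected layout. The sketch below is an optional helper, not part of the repository: `missing_dirs` and `EXPECTED` are hypothetical names, and the demo runs against a throwaway directory so that it works outside Colab.

```python
from pathlib import Path
import tempfile

# Sub-directories the notebooks expect under DL_Project/ (taken from the layout above).
EXPECTED = ("data/s3dis_raw", "data/s3dis_processed", "checkpoints")

def missing_dirs(project_root: Path) -> list[str]:
    """Return the expected sub-directories that do not exist under project_root."""
    return [rel for rel in EXPECTED if not (project_root / rel).is_dir()]

# Demo on a temporary directory that only contains data/s3dis_raw.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "DL_Project"
    (root / "data" / "s3dis_raw").mkdir(parents=True)
    demo_missing = missing_dirs(root)

print(demo_missing)  # ['data/s3dis_processed', 'checkpoints']
```

In Colab, the same check would be called as `missing_dirs(Path("/content/drive/MyDrive/DL_Project"))` once Drive is mounted.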
- Download S3DIS from the Stanford website (requires form).
- Place the raw data in your Google Drive at `Drive > DL_Project > data > s3dis_raw/`.
- In `notebooks/01_setup_and_data.ipynb`, the preprocessing script is executed: `!uv run scripts/prepare_s3dis.py --input data/s3dis_raw --output data/s3dis_processed`
- Obtain any `.ply` point cloud with XYZ and RGB values (e.g., via 3D scanning apps or datasets).
- Place it in the appropriate folder (or update paths in `05_demo.ipynb`).
- (Optional) Process using `scripts/export_polycam.py` if specific formatting is required.
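To check up front whether a custom scan carries the required XYZ and RGB channels, the PLY header can be inspected without loading the points. The following is a standard-library sketch with hypothetical helper names; it assumes the common property names `x`, `y`, `z`, `red`, `green`, `blue`, which some exporters replace with other names.

```python
import io

REQUIRED = {"x", "y", "z", "red", "green", "blue"}

def ply_vertex_properties(stream) -> list[str]:
    """Read a PLY header from a binary stream and return the vertex property names."""
    if stream.readline().strip() != b"ply":
        raise ValueError("not a PLY file")
    props, in_vertex = [], False
    for raw in stream:
        tokens = raw.decode("ascii", errors="replace").split()
        if not tokens:
            continue
        if tokens[0] == "end_header":
            break  # stop before the (possibly binary) payload
        if tokens[0] == "element":
            in_vertex = len(tokens) > 1 and tokens[1] == "vertex"
        elif tokens[0] == "property" and in_vertex:
            props.append(tokens[-1])  # the property name is the last token
    return props

def has_xyz_rgb(stream) -> bool:
    """True if the vertex element declares all of x, y, z, red, green, blue."""
    return REQUIRED <= set(ply_vertex_properties(stream))

# In-memory example header; the header is ASCII even for binary PLY files.
header = (
    b"ply\nformat binary_little_endian 1.0\nelement vertex 2\n"
    b"property float x\nproperty float y\nproperty float z\n"
    b"property uchar red\nproperty uchar green\nproperty uchar blue\n"
    b"end_header\n"
)
ok = has_xyz_rgb(io.BytesIO(header))
```

For a file on disk, the same function would be called as `has_xyz_rgb(f)` inside `with open(path, "rb") as f:`.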
Executed primarily via `notebooks/03_train_mlp.ipynb`:
`!uv run src/train.py --config configs/train_mlp_s3dis.yaml`
Key hyperparameters based on our final runs:
- Input Dimensionality: 896 (from Concerto)
- MLP layers: 3 layers (896 → 512 → 512 → 512, with GELU + dropout 0.1)
- Loss: MSE between predicted embeddings and CLIP text embeddings of ground-truth labels.
- Optimizer: AdamW, lr=1e-3, weight decay=1e-4
- Epochs: 40
- Batch size: 16384 (Adjusted to fit T4 VRAM ~15GB)
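The hyperparameters above can be read as the following PyTorch sketch of the head and a single optimization step. It is an illustration under stated assumptions, not the contents of `src/translation_head.py` or `src/train.py`: the class name `TranslationHead`, the exact placement of GELU and dropout, and the random tensors standing in for Concerto features and CLIP text embeddings are all assumptions.

```python
import torch
from torch import nn

class TranslationHead(nn.Module):
    """3-layer MLP, 896 -> 512 -> 512 -> 512, with GELU and dropout between layers."""

    def __init__(self, in_dim: int = 896, hidden: int = 512, out_dim: int = 512, p_drop: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.GELU(), nn.Dropout(p_drop),
            nn.Linear(hidden, out_dim),  # no activation on the output embedding (assumption)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

torch.manual_seed(0)
num_classes = 13                              # S3DIS defines 13 semantic classes
text_emb = torch.randn(num_classes, 512)      # stand-in for CLIP text embeddings of the labels
feats = torch.randn(64, 896)                  # stand-in for a batch of Concerto point features
labels = torch.randint(0, num_classes, (64,)) # stand-in for ground-truth labels

head = TranslationHead()
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.MSELoss()

pred = head(feats)                      # (64, 512) predicted embeddings
loss = loss_fn(pred, text_emb[labels])  # each point regresses its label's text embedding
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Indexing `text_emb[labels]` is what turns per-point class labels into per-point regression targets, so only one text embedding per class has to be computed.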
Executed primarily via `notebooks/04_evaluate.ipynb`:
`!uv run src/evaluate.py --config configs/eval_s3dis.yaml --split area5`
Metrics:
- mIoU (semantic segmentation via nearest-label matching)
- Top-k retrieval accuracy (given a text query, what % of top-k points belong to the correct class)
- Qualitative heatmaps (per-query 3D visualizations)
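The two quantitative metrics can be sketched in NumPy as follows. This is one plausible reading of the definitions above, not the code in `src/evaluate.py`: the function names are hypothetical, and skipping classes that appear in neither prediction nor ground truth when averaging IoU is an assumption.

```python
import numpy as np

def predict_labels(point_feats: np.ndarray, label_embs: np.ndarray) -> np.ndarray:
    """Nearest-label matching: assign each point the label with the highest cosine similarity."""
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    t = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    return np.argmax(p @ t.T, axis=1)

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean intersection-over-union over the classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:  # skip classes absent from both (assumption)
            ious.append(inter / union)
    return float(np.mean(ious))

def topk_accuracy(scores: np.ndarray, gt: np.ndarray, target_class: int, k: int) -> float:
    """Fraction of the k highest-scoring points whose ground-truth label is target_class."""
    top = np.argsort(scores)[::-1][:k]
    return float(np.mean(gt[top] == target_class))

# Toy example with 5 points and 3 classes.
gt = np.array([0, 0, 1, 1, 2])
pred = np.array([0, 1, 1, 1, 2])
miou = mean_iou(pred, gt, num_classes=3)  # (1/2 + 2/3 + 1/1) / 3

scores = np.array([0.9, 0.1, 0.8, 0.7, 0.2])  # similarity of each point to a class-1 query
acc = topk_accuracy(scores, gt, target_class=1, k=2)  # top-2 points are 0 and 2, so 0.5
```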
The notebooks/05_demo.ipynb notebook provides an interactive demo:
- Load a custom `.ply` scan
- Extract Concerto features (frozen)
- Apply the trained MLP translation head
- Enter a free-text query → visualize the heatmap on the 3D scene using interactive Plotly figures.
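For the last step, the per-point similarity scores need to become colors. A minimal sketch, assuming a simple blue-to-red ramp after min-max normalization (the function name `scores_to_colors` and the color ramp are illustrative choices, not necessarily what the demo notebook does):

```python
import numpy as np

def scores_to_colors(scores) -> np.ndarray:
    """Map similarity scores to RGB colors in [0, 1]: blue for the lowest, red for the highest."""
    s = np.asarray(scores, dtype=float)
    lo, hi = s.min(), s.max()
    # Min-max normalise; fall back to all zeros when every score is identical.
    t = (s - lo) / (hi - lo) if hi > lo else np.zeros_like(s)
    return np.stack([t, np.zeros_like(t), 1.0 - t], axis=1)  # columns: R, G, B

colors = scores_to_colors([-1.0, 0.0, 1.0])
```

With Plotly, the raw scores can alternatively be passed as the marker `color` of a `Scatter3d` trace together with a `colorscale`, which avoids computing RGB values by hand.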
| Member | Role |
|---|---|
| Ricardo | Lead engineer — feature extraction, evaluation |
| Leonardo | Encoder integration, MLP architecture, training pipeline |
| Adrian | Data preparation, demo notebook |
| Matteo | Evaluation scripts, visualization, presentation & slides |
Course: Deep Learning — Master's program
Compute: Google Colab free tier (NVIDIA T4 GPU)
- Concerto: Zhang et al., Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations, NeurIPS 2025 (code on GitHub, weights on HuggingFace)
- CLIP: Radford et al., Learning Transferable Visual Models From Natural Language Supervision, 2021
- S3DIS: Armeni et al., 3D Semantic Parsing of Large-Scale Indoor Spaces, CVPR 2016
- Pointcept: github.com/Pointcept/Pointcept