Open-World Text-Based 3D Object Search

Open-vocabulary 3D object search using frozen point cloud encoders (Concerto) and a CLIP-aligned MLP translation head. Validated on S3DIS Area 5 and custom in-the-wild scans.

Overview

This project builds and evaluates a pipeline for open-world text-based object search in 3D point cloud scenes. The core idea:

Frozen 3D encoder — We use Concerto Small (pretrained on large-scale 3D data) to extract per-point features from indoor scans. The backbone is never finetuned.
CLIP-aligned MLP translation head — A lightweight 3-layer MLP maps Concerto's 3D feature space into CLIP's text embedding space. This is trained with supervision on S3DIS label↔CLIP-text-embedding pairs.
Open-vocabulary querying — At inference, a user provides a free-text query (e.g., "red chair", "lamp", "whiteboard"). The query is embedded by CLIP's text encoder and matched against the translated per-point features via cosine similarity, producing a heatmap over the point cloud.
In-the-wild generalization — We test the pipeline on custom scenes captured and exported as .ply. Polycam was initially considered but not used due to paywalls. However, any .ply point cloud with XYZ and RGB channels can be used to test the model.
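The querying step described above reduces to a cosine similarity between every translated point feature and a single text embedding. The following is a minimal NumPy sketch of that step, not the repository's implementation: `query_heatmap` is a hypothetical helper, and the arrays are random stand-ins for the MLP outputs and the CLIP text embedding (512-dimensional, matching the MLP output size listed under the training hyperparameters).

```python
import numpy as np

def query_heatmap(point_feats: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Cosine similarity between each translated point feature and one text embedding."""
    # L2-normalize both sides so the dot product equals the cosine similarity.
    p = point_feats / (np.linalg.norm(point_feats, axis=1, keepdims=True) + 1e-8)
    t = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    return p @ t  # shape (N,), values in [-1, 1]

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 512))   # stand-in for translated per-point features
query = rng.normal(size=512)           # stand-in for a CLIP text embedding

# Plant 10 points close to the query so the heatmap has a clear hot spot.
feats[:10] = query + 0.01 * rng.normal(size=(10, 512))

scores = query_heatmap(feats, query)
top10 = np.argsort(-scores)[:10]       # indices of the 10 best-matching points
```

Normalizing once and using a matrix-vector product keeps the query cost linear in the number of points, so the same translated features can be reused for many text queries.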

Architecture

┌──────────────┐     ┌─────────────────┐     ┌──────────────────┐
│  Point Cloud │────▶│  Concerto Small │────▶│  Per-point 3D    │
│  (XYZ + RGB) │     │  (frozen, 39M)  │     │  features (D=896)│
└──────────────┘     └─────────────────┘     └───────┬──────────┘
                                                     │
                                                     ▼
                                             ┌────────────────┐
                                             │ MLP Translation│
                                             │ Head (trainable│
                                             │ 3 layers)      │
                                             └───────┬────────┘
                                                     │
                                                     ▼
┌──────────────┐     ┌─────────────────┐     ┌──────────────────┐
│  Text query  │────▶│  CLIP Text      │────▶│  Cosine sim /    │
│ "red chair"  │     │ Encoder (frozen)│     │  heatmap on PC   │
└──────────────┘     └─────────────────┘     └──────────────────┘

Repository Structure

Deep_learning_project/
├── README.md                      # This file
├── LICENSE
├── .gitignore
├── pyproject.toml                 # Project dependencies for Python >= 3.11
├── uv.lock                        # Lockfile generated by uv
├── .env                           # Environment variables
├── readme_dataset.txt             # Information on dataset format
│
├── configs/                       # Training & eval config files (YAML)
│   └── train_mlp_s3dis.yaml       # Configuration with input_dim=896
│
├── src/                           # Core Python source code
│   ├── encoder.py                 # Concerto feature extraction wrapper
│   ├── translation_head.py        # MLP definition & forward pass
│   ├── clip_utils.py              # CLIP text embedding helpers
│   ├── dataset.py                 # S3DIS + custom dataset loaders
│   ├── train.py                   # Training loop for the MLP head
│   ├── evaluate.py                # Quantitative evaluation
│   ├── evaluate_labels.py         # Label-specific evaluation
│   └── visualize.py               # 3D heatmap visualization utilities
│
├── notebooks/                     # Colab notebooks (run sequentially)
│   ├── pyproject.toml             # Project dependencies for Python 3.10 (Colab/spconv)
│   ├── 01_setup_and_data.ipynb
│   ├── 02_feature_extraction.ipynb
│   ├── 03_train_mlp.ipynb
│   ├── 04_evaluate.ipynb
│   ├── 04b_evaluate_labels.ipynb
│   ├── 05_demo.ipynb
│   └── 06_visualize_room.ipynb
│
├── scripts/                       # CLI utility scripts
│   ├── extract_features.py        # Batch feature extraction
│   ├── prepare_s3dis.py           # S3DIS preprocessing
│   ├── export_polycam.py          # .ply to pipeline format 
│   ├── demo.py                    # Script for running demo
│   └── visualize_concerto_pca.py  # PCA visualization of Concerto features
│
├── tests/                         # Unit & smoke tests
├── docs/                          # Documentation & papers
├── data/                          # Symlinked from Google Drive
├── features/                      # Extracted Concerto features (.npz)
└── presentation/                  # Final slides & demo materials

Dependencies

We use uv for package management because it is significantly faster than standard pip.

There are two pyproject.toml files in this repository:

pyproject.toml (root): For standard environments (Python >= 3.11).
notebooks/pyproject.toml: Specifically for Google Colab, forcing Python 3.10 to maintain compatibility with spconv (required by Concerto).

Key dependencies include:

torch (≥ 2.1)
pointcept (Concerto encoder & data utilities)
open_clip_torch (CLIP text encoder)
spconv-cu120 (Sparse convolution backend for Colab)
open3d, numpy, scipy, plotly

Setup & Installation

This project is designed to run in Jupyter notebooks on Google Colab, which effectively serves as a virtual machine. Almost all code is executed with uv run from inside the notebooks.

Drive Folder Structure

For the notebooks to run seamlessly, you should have the data structured in your Google Drive as follows:

Drive/
└── DL_Project/
    ├── data/
    │   ├── s3dis_raw/             # Raw S3DIS files
    │   └── s3dis_processed/       # Preprocessed S3DIS files
    └── checkpoints/               # Trained models

Execution Steps

Clone the repo in Colab and navigate to the project directory.
Mount Google Drive to access data and checkpoints.
Run the notebooks sequentially (notebooks/01 to notebooks/06).

The notebooks will automatically install dependencies using uv (based on notebooks/pyproject.toml), set up the necessary symlinks (e.g., ./data -> /content/drive/MyDrive/DL_Project/data), and execute the pipeline steps via !uv run.
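The symlink step can be sketched in Python as follows. This is an illustration under assumptions rather than the code the notebooks actually run: `link_data_dir` is a hypothetical helper, and only the two paths in the trailing comment come from this README.

```python
from pathlib import Path

def link_data_dir(drive_data: Path, link: Path) -> Path:
    """Point `link` at `drive_data`, replacing a stale symlink if one exists."""
    if link.is_symlink():
        link.unlink()  # remove an old link; a real directory is never deleted
    elif link.exists():
        raise FileExistsError(f"{link} exists and is not a symlink")
    link.symlink_to(drive_data, target_is_directory=True)
    return link

# In Colab, after mounting Drive, this would be called as:
# link_data_dir(Path("/content/drive/MyDrive/DL_Project/data"), Path("data"))
```

Refusing to overwrite a real directory guards against accidentally hiding locally downloaded data behind the link.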

Data Preparation

S3DIS Area 5

Download S3DIS from the Stanford website (access requires filling out a request form).
Place the raw data in your Google Drive at Drive > DL_Project > data > s3dis_raw/.

In notebooks/01_setup_and_data.ipynb, the preprocessing script is executed:

!uv run scripts/prepare_s3dis.py --input data/s3dis_raw --output data/s3dis_processed

Custom In-the-Wild Scan

Obtain any .ply point cloud with XYZ and RGB values (e.g., via 3D scanning apps or datasets).
Place it in the appropriate folder (or update paths in 05_demo.ipynb).
(Optional) Process using scripts/export_polycam.py if specific formatting is required.
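To check that a scan carries the required XYZ and RGB channels before running the demo, a small reader like the one below can help. It is a sketch under stated assumptions, not what scripts/export_polycam.py does: `read_ascii_ply` is a hypothetical helper, it handles ASCII PLY files only, it assumes the vertex element comes first in the file, and it assumes colors are stored as 0-255 values named red, green and blue. Binary PLY files need a full reader such as the one in open3d.

```python
import numpy as np

def read_ascii_ply(path: str) -> tuple[np.ndarray, np.ndarray]:
    """Return (xyz [N,3] float32, rgb [N,3] float32 in 0..1) from an ASCII PLY file."""
    with open(path, "r", encoding="ascii", errors="replace") as f:
        if f.readline().strip() != "ply":
            raise ValueError("not a PLY file")
        n_vertices, props, in_vertex = 0, [], False
        for line in f:
            tok = line.split()
            if not tok:
                continue
            if tok[0] == "format" and tok[1] != "ascii":
                raise ValueError("only ASCII PLY is handled in this sketch")
            if tok[0] == "element":
                in_vertex = tok[1] == "vertex"
                if in_vertex:
                    n_vertices = int(tok[2])
            elif tok[0] == "property" and in_vertex:
                props.append(tok[-1])  # the property name is the last token
            elif tok[0] == "end_header":
                break
        # Assumption: vertex rows directly follow the header.
        rows = [f.readline().split() for _ in range(n_vertices)]
    data = np.array(rows, dtype=np.float64)
    col = {name: i for i, name in enumerate(props)}
    missing = [k for k in ("x", "y", "z", "red", "green", "blue") if k not in col]
    if missing:
        raise ValueError(f"missing vertex properties: {missing}")
    xyz = data[:, [col["x"], col["y"], col["z"]]].astype(np.float32)
    rgb = data[:, [col["red"], col["green"], col["blue"]]].astype(np.float32) / 255.0
    return xyz, rgb
```

A scan without color raises a ValueError that lists the missing properties, which is easier to diagnose than a shape mismatch later in the pipeline.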

Training the Translation Head

Executed primarily via notebooks/03_train_mlp.ipynb:

!uv run src/train.py --config configs/train_mlp_s3dis.yaml

Key hyperparameters based on our final runs:

Input Dimensionality: 896 (from Concerto)
MLP layers: 3 layers (896 → 512 → 512 → 512, with GELU + dropout 0.1)
Loss: MSE between predicted embeddings and CLIP text embeddings of ground-truth labels.
Optimizer: AdamW, lr=1e-3, weight decay=1e-4
Epochs: 40
Batch size: 16384 (chosen to fit within the T4's ~15 GB of VRAM)
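The hyperparameters above translate into a short PyTorch sketch. This is an illustration under assumptions and not the contents of src/translation_head.py or src/train.py: the class name `TranslationHead`, the exact placement of GELU and dropout, and the random stand-in tensors are all hypothetical. Only the layer sizes, loss, optimizer and its settings come from the list above; the batch is shrunk to 256 points to keep the example quick.

```python
import torch
from torch import nn

class TranslationHead(nn.Module):
    """3-layer MLP: 896 -> 512 -> 512 -> 512 with GELU and dropout (placement assumed)."""
    def __init__(self, in_dim=896, hidden=512, out_dim=512, p_drop=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.GELU(), nn.Dropout(p_drop),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

torch.manual_seed(0)
num_classes = 13                                  # S3DIS has 13 semantic classes
label_text_emb = torch.randn(num_classes, 512)    # stand-in for CLIP label embeddings
feats = torch.randn(256, 896)                     # stand-in for Concerto features
labels = torch.randint(0, num_classes, (256,))    # stand-in for per-point labels

head = TranslationHead()
opt = torch.optim.AdamW(head.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.MSELoss()

losses = []
for _ in range(20):
    opt.zero_grad()
    # Each point's target is the text embedding of its ground-truth label.
    loss = loss_fn(head(feats), label_text_emb[labels])
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

Because the Concerto backbone is frozen, its features can be extracted once and cached, so a loop like this only ever touches the small MLP.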

Evaluation

Executed primarily via notebooks/04_evaluate.ipynb:

!uv run src/evaluate.py --config configs/eval_s3dis.yaml --split area5

Metrics:

mIoU (semantic segmentation via nearest-label matching)
Top-k retrieval accuracy (given a text query, the percentage of the top-k scoring points that belong to the correct class)
Qualitative heatmaps (per-query 3D visualizations)
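The two quantitative metrics can be sketched in NumPy as below. The function names are hypothetical and src/evaluate.py may differ in details; in particular, averaging IoU only over classes that occur in the prediction or the ground truth is one common convention, assumed here.

```python
import numpy as np

def predict_labels(point_emb, label_emb):
    """Assign each point the label whose text embedding is most cosine-similar."""
    p = point_emb / (np.linalg.norm(point_emb, axis=1, keepdims=True) + 1e-8)
    t = label_emb / (np.linalg.norm(label_emb, axis=1, keepdims=True) + 1e-8)
    return np.argmax(p @ t.T, axis=1)

def mean_iou(pred, gt, num_classes):
    """Mean IoU over classes that occur in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        union = np.sum((pred == c) | (gt == c))
        if union > 0:
            ious.append(np.sum((pred == c) & (gt == c)) / union)
    return float(np.mean(ious))

def topk_accuracy(scores, gt, target_class, k):
    """Fraction of the k highest-scoring points whose label is target_class."""
    topk = np.argsort(-scores)[:k]
    return float(np.mean(gt[topk] == target_class))

# Toy scene: 3 labels with one-hot "text embeddings", 6 points.
label_emb = np.eye(3)
gt = np.array([0, 0, 1, 1, 2, 2])
point_emb = label_emb[gt] + 0.1      # perturbed, still closest to the true label
point_emb[5] = label_emb[1]          # one class-2 point is mistaken for class 1

pred = predict_labels(point_emb, label_emb)
miou = mean_iou(pred, gt, num_classes=3)

# Query for class 1: cosine score of every point against that label embedding.
p = point_emb / np.linalg.norm(point_emb, axis=1, keepdims=True)
acc = topk_accuracy(p @ label_emb[1], gt, target_class=1, k=3)
```

In the toy scene the per-class IoUs are 1, 2/3 and 1/2, and the top-3 retrieval for class 1 contains one wrong point.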

In-the-Wild Demo

The notebooks/05_demo.ipynb notebook provides an interactive demo:

Load a custom .ply scan
Extract Concerto features (frozen)
Apply the trained MLP translation head
Enter a free-text query → visualize the heatmap on the 3D scene using interactive Plotly figures.
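For the final visualization step, per-point scores have to be turned into colors. The sketch below shows one possible mapping and is not the scheme used in src/visualize.py: `heatmap_colors` and its percentile thresholds are hypothetical. It rescales scores between two percentiles and blends red over a grayscale version of the scan.

```python
import numpy as np

def heatmap_colors(rgb, scores, low_pct=50.0, high_pct=99.0):
    """Blend red over a grayscale scan in proportion to the query score."""
    lo, hi = np.percentile(scores, [low_pct, high_pct])
    # Blend weight per point, shape (N, 1), clipped to [0, 1].
    w = np.clip((scores - lo) / max(hi - lo, 1e-8), 0.0, 1.0)[:, None]
    gray = np.repeat(rgb.mean(axis=1, keepdims=True), 3, axis=1)
    red = np.array([1.0, 0.0, 0.0])
    return (1.0 - w) * gray + w * red

rng = np.random.default_rng(0)
rgb = rng.uniform(size=(500, 3))     # stand-in for scan colors in 0..1
scores = rng.normal(size=500)        # stand-in for cosine-similarity scores
colors = heatmap_colors(rgb, scores)
```

Clipping at percentiles rather than at the raw minimum and maximum keeps a few outlier points from washing out the rest of the heatmap. Passing the raw scores to the plotting library with a colorscale is an equally valid choice.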

Team & Acknowledgements

Member	Role
Ricardo	Lead engineer — feature extraction, evaluation
Leonardo	Encoder integration, MLP architecture, training pipeline
Adrian	Data preparation, demo notebook
Matteo	Evaluation scripts, visualization, presentation & slides

Course: Deep Learning — Master's program
Compute: Google Colab free tier (NVIDIA T4 GPU)

References

Concerto: Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations (GitHub | HuggingFace)
CLIP: Radford et al., Learning Transferable Visual Models From Natural Language Supervision, 2021
S3DIS: Armeni et al., 3D Semantic Parsing of Large-Scale Indoor Spaces, CVPR 2016
Pointcept: github.com/Pointcept/Pointcept
