Gandata/Deep_learning_project

Open-World Text-Based 3D Object Search

Open-vocabulary 3D object search using frozen point cloud encoders (Concerto) and a CLIP-aligned MLP translation head. Validated on S3DIS Area 5 and custom in-the-wild scans.



Overview

This project builds and evaluates a pipeline for open-world text-based object search in 3D point cloud scenes. The core idea:

  1. Frozen 3D encoder — We use Concerto Small (pretrained on large-scale 3D data) to extract per-point features from indoor scans. The backbone is never finetuned.
  2. CLIP-aligned MLP translation head — A lightweight 3-layer MLP maps Concerto's 3D feature space into CLIP's text embedding space. The head is trained with supervision on pairs of S3DIS labels and the CLIP text embeddings of those labels.
  3. Open-vocabulary querying — At inference, a user provides a free-text query (e.g., "red chair", "lamp", "whiteboard"). The query is embedded by CLIP's text encoder and matched against the translated per-point features via cosine similarity, producing a heatmap over the point cloud.
  4. In-the-wild generalization — We test the pipeline on custom scenes captured and exported as .ply. Polycam was initially considered but dropped because of its paywall; any .ply point cloud with XYZ and RGB channels can be used to test the model.
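
The querying step (item 3) can be sketched with NumPy as follows. This is a minimal illustration, not the repository's code: the function name query_heatmap is hypothetical, and random arrays stand in for the translated per-point features and the CLIP text embedding (both assumed to be 512-dimensional, matching the MLP output size listed under training).

```python
import numpy as np

def query_heatmap(point_feats: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Cosine similarity between each translated point feature and one text embedding."""
    # L2-normalize both sides so that the dot product equals cosine similarity.
    feat_norms = np.linalg.norm(point_feats, axis=1, keepdims=True)
    feats = point_feats / np.maximum(feat_norms, 1e-12)
    text = text_emb / max(float(np.linalg.norm(text_emb)), 1e-12)
    return feats @ text  # shape (N,), values in [-1, 1]

# Stand-in data (assumption): 1,000 points, 512-d features, one 512-d query embedding.
rng = np.random.default_rng(0)
point_feats = rng.normal(size=(1000, 512)).astype(np.float32)
text_emb = rng.normal(size=512).astype(np.float32)

heat = query_heatmap(point_feats, text_emb)

# Rescale to [0, 1] so the scores can be mapped to colours on the point cloud.
heat01 = (heat - heat.min()) / (heat.max() - heat.min())

# Indices of the 10 best-matching points, highest similarity first.
top_idx = np.argsort(heat)[::-1][:10]
```

In the real pipeline, point_feats would come from the MLP translation head and text_emb from the CLIP text encoder.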

Architecture

┌──────────────┐     ┌─────────────────┐     ┌──────────────────┐
│  Point Cloud │────▶│  Concerto Small │────▶│  Per-point 3D    │
│  (XYZ + RGB) │     │  (frozen, 39M)  │     │  features (D=896)│
└──────────────┘     └─────────────────┘     └───────┬──────────┘
                                                     │
                                                     ▼
                                            ┌─────────────────┐
                                            │ MLP Translation │
                                            │ Head (trainable,│
                                            │ 3 layers)       │
                                            └────────┬────────┘
                                                     │
                                                     ▼
┌──────────────┐     ┌─────────────────┐     ┌──────────────────┐
│  Text query  │────▶│  CLIP Text      │────▶│  Cosine sim /    │
│ "red chair"  │     │ Encoder (frozen)│     │  heatmap on PC   │
└──────────────┘     └─────────────────┘     └──────────────────┘

Repository Structure

Deep_learning_project/
├── README.md                      # This file
├── LICENSE
├── .gitignore
├── pyproject.toml                 # Project dependencies for Python >= 3.11
├── uv.lock                        # Lockfile generated by uv
├── .env                           # Environment variables
├── readme_dataset.txt             # Information on dataset format
│
├── configs/                       # Training & eval config files (YAML)
│   └── train_mlp_s3dis.yaml       # Configuration with input_dim=896
│
├── src/                           # Core Python source code
│   ├── encoder.py                 # Concerto feature extraction wrapper
│   ├── translation_head.py        # MLP definition & forward pass
│   ├── clip_utils.py              # CLIP text embedding helpers
│   ├── dataset.py                 # S3DIS + custom dataset loaders
│   ├── train.py                   # Training loop for the MLP head
│   ├── evaluate.py                # Quantitative evaluation
│   ├── evaluate_labels.py         # Label-specific evaluation
│   └── visualize.py               # 3D heatmap visualization utilities
│
├── notebooks/                     # Colab notebooks (run sequentially)
│   ├── pyproject.toml             # Project dependencies for Python 3.10 (Colab/spconv)
│   ├── 01_setup_and_data.ipynb
│   ├── 02_feature_extraction.ipynb
│   ├── 03_train_mlp.ipynb
│   ├── 04_evaluate.ipynb
│   ├── 04b_evaluate_labels.ipynb
│   ├── 05_demo.ipynb
│   └── 06_visualize_room.ipynb
│
├── scripts/                       # CLI utility scripts
│   ├── extract_features.py        # Batch feature extraction
│   ├── prepare_s3dis.py           # S3DIS preprocessing
│   ├── export_polycam.py          # .ply to pipeline format 
│   ├── demo.py                    # Script for running demo
│   └── visualize_concerto_pca.py  # PCA visualization of Concerto features
│
├── tests/                         # Unit & smoke tests
├── docs/                          # Documentation & papers
├── data/                          # Symlinked from Google Drive
├── features/                      # Extracted Concerto features (.npz)
└── presentation/                  # Final slides & demo materials

Dependencies

We use uv for package management because it is significantly faster than standard pip.

There are two pyproject.toml files in this repository:

  1. pyproject.toml (root): For standard environments (Python >= 3.11).
  2. notebooks/pyproject.toml: Specifically for Google Colab, forcing Python 3.10 to maintain compatibility with spconv (required by Concerto).

Key dependencies include:

  • torch (≥ 2.1)
  • pointcept (Concerto encoder & data utilities)
  • open_clip_torch (CLIP text encoder)
  • spconv-cu120 (Sparse convolution backend for Colab)
  • open3d, numpy, scipy, plotly

Setup & Installation

This project is designed to run in Jupyter notebooks on Google Colab, which effectively serves as a virtual machine. Almost all code is executed with uv run inside the notebooks.

Drive Folder Structure

For the notebooks to run seamlessly, you should have the data structured in your Google Drive as follows:

Drive/
└── DL_Project/
    ├── data/
    │   ├── s3dis_raw/             # Raw S3DIS files
    │   └── s3dis_processed/       # Preprocessed S3DIS files
    └── checkpoints/               # Trained models

Execution Steps

  1. Clone the repo in Colab and navigate to the project directory.
  2. Mount Google Drive to access data and checkpoints.
  3. Run the notebooks sequentially (notebooks/01 to notebooks/06).

The notebooks will automatically install dependencies using uv (based on notebooks/pyproject.toml), set up the necessary symlinks (e.g., ./data -> /content/drive/MyDrive/DL_Project/data), and execute the pipeline steps via !uv run.
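
The symlink step can be sketched as below. This is a hedged illustration rather than the notebooks' actual code: link_data_dir is a hypothetical helper, and the Colab paths in the trailing comment are taken from the example above.

```python
import os
from pathlib import Path

def link_data_dir(target: Path, link: Path) -> Path:
    """Create (or refresh) a symlink `link` that points at the directory `target`."""
    if not target.is_dir():
        raise FileNotFoundError(f"Drive folder not found: {target}")
    if link.is_symlink():
        # Replace a stale link left over from an earlier session.
        link.unlink()
    elif link.exists():
        # Refuse to overwrite a real file or directory.
        raise FileExistsError(f"{link} exists and is not a symlink")
    os.symlink(target, link, target_is_directory=True)
    return link

# In Colab, after mounting Drive with google.colab.drive.mount("/content/drive"):
# link_data_dir(Path("/content/drive/MyDrive/DL_Project/data"), Path("data"))
```

Refreshing an existing link keeps the helper safe to re-run when a Colab session restarts.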


Data Preparation

S3DIS Area 5

  1. Download S3DIS from the Stanford website (access requires submitting a request form).
  2. Place the raw data in your Google Drive at Drive > DL_Project > data > s3dis_raw/.
  3. In notebooks/01_setup_and_data.ipynb, the preprocessing script is executed:
    !uv run scripts/prepare_s3dis.py --input data/s3dis_raw --output data/s3dis_processed

Custom In-the-Wild Scan

  1. Obtain any .ply point cloud with XYZ and RGB values (e.g., via 3D scanning apps or datasets).
  2. Place it in the appropriate folder (or update paths in 05_demo.ipynb).
  3. (Optional) Process using scripts/export_polycam.py if specific formatting is required.

Training the Translation Head

Executed primarily via notebooks/03_train_mlp.ipynb:

!uv run src/train.py --config configs/train_mlp_s3dis.yaml

Key hyperparameters based on our final runs:

  • Input Dimensionality: 896 (from Concerto)
  • MLP layers: 3 layers (896 → 512 → 512 → 512, with GELU + dropout 0.1)
  • Loss: MSE between predicted embeddings and CLIP text embeddings of ground-truth labels.
  • Optimizer: AdamW, lr=1e-3, weight decay=1e-4
  • Epochs: 40
  • Batch size: 16384 (chosen to fit within the T4's ~15 GB of VRAM)
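
To make the layer sizes and the loss concrete, here is a framework-free NumPy sketch of the forward pass and MSE objective. It is only an illustration of the shapes listed above: the repository's own implementation lives in src/translation_head.py and is not reproduced here. The initialization scheme, the presence of bias terms, the tanh approximation of GELU, and all function names are assumptions. Dropout is omitted because it is inactive at inference.

```python
import numpy as np

def gelu(x: np.ndarray) -> np.ndarray:
    # Tanh approximation of GELU (assumption; the exact variant used is not stated).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def init_mlp(dims=(896, 512, 512, 512), seed=0):
    """Return a list of (weight, bias) pairs for consecutive layer sizes."""
    rng = np.random.default_rng(seed)
    params = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        w = rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_in, d_out)).astype(np.float32)
        b = np.zeros(d_out, dtype=np.float32)
        params.append((w, b))
    return params

def forward(params, x: np.ndarray) -> np.ndarray:
    """896 -> 512 -> 512 -> 512, with GELU after every layer except the last."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = gelu(x)  # during training, dropout (p=0.1) would follow here
    return x

def mse_loss(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.mean((pred - target) ** 2))

params = init_mlp()
n_params = sum(w.size + b.size for w, b in params)

# Stand-in batch: 8 Concerto features and the CLIP text embeddings of their labels.
rng = np.random.default_rng(1)
feats = rng.normal(size=(8, 896)).astype(np.float32)
targets = rng.normal(size=(8, 512)).astype(np.float32)

pred = forward(params, feats)
loss = mse_loss(pred, targets)
```

Under these assumptions the head has just under one million trainable parameters, which is small next to the 39M-parameter frozen encoder.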

Evaluation

Executed primarily via notebooks/04_evaluate.ipynb:

!uv run src/evaluate.py --config configs/eval_s3dis.yaml --split area5

Metrics:

  • mIoU (semantic segmentation via nearest-label matching)
  • Top-k retrieval accuracy (given a text query, the percentage of the k highest-scoring points that belong to the correct class)
  • Qualitative heatmaps (per-query 3D visualizations)
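
The two quantitative metrics can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the code in src/evaluate.py: the function names are hypothetical, classes absent from both prediction and ground truth are skipped when averaging IoU, and the toy embeddings are one-hot vectors rather than real CLIP embeddings.

```python
import numpy as np

def predict_labels(point_emb: np.ndarray, label_emb: np.ndarray) -> np.ndarray:
    """Assign each point the label whose text embedding is most cosine-similar."""
    p = point_emb / np.linalg.norm(point_emb, axis=1, keepdims=True)
    t = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    return np.argmax(p @ t.T, axis=1)

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return float(np.mean(ious))

def topk_accuracy(scores: np.ndarray, gt: np.ndarray, target_class: int, k: int) -> float:
    """Fraction of the k highest-scoring points whose ground-truth label is target_class."""
    top = np.argsort(scores)[::-1][:k]
    return float(np.mean(gt[top] == target_class))

# Toy example: 3 classes, 6 points, one point (index 3) embedded near the wrong label.
label_emb = np.eye(3)
gt = np.array([0, 0, 1, 1, 2, 2])
point_emb = label_emb[[0, 0, 1, 2, 2, 2]]

pred = predict_labels(point_emb, label_emb)
miou = mean_iou(pred, gt, num_classes=3)

# Query for class 2: score every point against that label's embedding.
scores = point_emb @ label_emb[2]
acc_at_3 = topk_accuracy(scores, gt, target_class=2, k=3)
```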

In-the-Wild Demo

The notebooks/05_demo.ipynb notebook provides an interactive demo:

  1. Load a custom .ply scan
  2. Extract Concerto features (frozen)
  3. Apply the trained MLP translation head
  4. Enter a free-text query → visualize the heatmap on the 3D scene using interactive Plotly figures.

Team & Acknowledgements

Member    Role
Ricardo   Lead engineer: feature extraction, evaluation
Leonardo  Encoder integration, MLP architecture, training pipeline
Adrian    Data preparation, demo notebook
Matteo    Evaluation scripts, visualization, presentation & slides

Course: Deep Learning — Master's program
Compute: Google Colab free tier (NVIDIA T4 GPU)


References

  1. Concerto: Concerto: Cooperative Contrastive Pretraining for 3D Point Cloud Understanding (GitHub | HuggingFace)
  2. CLIP: Radford et al., Learning Transferable Visual Models From Natural Language Supervision, 2021
  3. S3DIS: Armeni et al., 3D Semantic Parsing of Large-Scale Indoor Spaces, CVPR 2016
  4. Pointcept: github.com/Pointcept/Pointcept
