Skip to content

Latest commit

 

History

History
709 lines (519 loc) · 29.7 KB

File metadata and controls

709 lines (519 loc) · 29.7 KB

🧭 Comprehensive Tutorial: Understanding and Visualizing the Waymo Open Dataset v2

This tutorial provides an in-depth, practical, and mathematical explanation of how to interpret, transform, and visualize Waymo v2.1 LiDAR and 3D box data.
It's written for researchers and developers who want to deeply understand how Waymo structures its data, what coordinate frames are used, and how to correctly align LiDAR, camera, and annotations.

Code Reference: This tutorial is complemented by the comprehensive data inspection utilities in waymo_parquet_inspector.py, which provides detailed field analysis and validation tools for all Waymo data components.


Table of Contents

  1. Dataset Overview
  2. File Structure & Contents
  3. Coordinate Frames
  4. Camera Data Components
  5. LiDAR Data Components
  6. Calibration Data
  7. LiDAR-to-Vehicle Transform Mathematics
  8. 3D Box Definitions
  9. Coordinate Alignment: LiDAR ↔ Box
  10. 2D and 3D Box Visualization
  11. Common Pitfalls & Debug Tips
  12. References

1️⃣ Dataset Overview

📊 Dataset Scale & Format

Waymo Open Dataset (WOD) provides synchronized LiDAR and camera data with ground-truth 3D bounding boxes.

  • 2,030 segments of 20s each, collected at 10Hz (390,000 frames)
  • Diverse geographies and conditions
  • Perception object assets data in a modular format (v2.0.0)
  • Extracted perception objects from multi-sensor data: all 5 cameras and the top lidar

🔧 Sensor Configuration

LiDAR Sensors:

  • 1 mid-range lidar
  • 4 short-range lidars

Camera Sensors:

  • 5 cameras (front and sides)

Data Synchronization:

  • Synchronized lidar and camera data
  • Lidar to camera projections
  • Sensor calibrations and vehicle poses

🏷️ Annotation Categories

3D Bounding Box Labels

  • 4 object classes: Vehicles, Pedestrians, Cyclists, Signs
  • High-quality labels for lidar data: 1,200 segments
  • 12.6M 3D bounding box labels with tracking IDs on lidar data

2D Bounding Box Labels

  • High-quality labels for camera data: 1,000 segments
  • 11.8M 2D bounding box labels with tracking IDs on camera data

2D Video Panoptic Segmentation

  • Subset: 100k camera images
  • 28 classes including:
    • Vehicles: Car, Bus, Truck, Other Large Vehicle, Trailer, Ego Vehicle, Motorcycle, Bicycle
    • People: Pedestrian, Cyclist, Motorcyclist
    • Animals: Ground Animal, Bird
    • Infrastructure: Pole, Sign, Traffic Light, Construction Cone, Pedestrian Object, Building
    • Road Elements: Road, Sidewalk, Road Marker, Lane Marker
    • Environment: Vegetation, Sky, Ground
    • Motion States: Static, Dynamic
  • Instance segmentation labels for Vehicle, Pedestrian and Cyclist classes
  • Consistent both across cameras and over time

Key Point Labels

  • 2 object classes: Pedestrians and Cyclists
  • 14 key points from nose to ankle
  • 200k object frames with 2D key point labels
  • 10k object frames with 3D key point labels

3D Semantic Segmentation

  • Segmentation labels: 1,150 segments
  • 23 classes including:
    • Vehicles: Car, Truck, Bus, Other Vehicle
    • People: Motorcyclist, Bicyclist, Pedestrian
    • Objects: Bicycle, Motorcycle, Sign, Traffic Light, Pole, Construction Cone
    • Environment: Building, Vegetation, Tree Trunk, Curb
    • Road Elements: Road, Lane Marker, Walkable, Sidewalk, Other Ground
    • Undefined: Undefined

🗺️ Map Data

  • 3D road graph data for each segment
  • Includes: lane centers, lane boundaries, road boundaries, crosswalks, speed bumps, stop signs, and entrances to driveways

🔗 Cross-Modal Associations

  • Association of 2D and 3D bounding boxes
  • Corresponding object IDs provided for 2 object classes: Pedestrians and Cyclists

🎯 Challenge Data

  • 3D Camera-Only Detection Challenge: 80 segments of 20s camera imagery

🚀 Advanced Features

LiDAR features include:

  • 3D point cloud sequences that support 3D object shape reconstruction

Camera features include:

  • Sequences of camera patches from the most_visible_camera
  • Projected lidar returns on the corresponding camera
  • Per-pixel camera rays information
  • Auto-labeled 2D panoptic segmentation that supports object NeRF reconstruction

The v2.0.1 Parquet version is designed for efficient columnar access and includes:

Component Folder Description Schema Columns
Camera Images camera_image/ RGB frames with pose/velocity metadata 15 columns with binary JPEG/PNG data
Camera Boxes camera_box/ 2D bounding boxes in pixel coordinates 12 columns with detection annotations
Camera Calibration camera_calibration/ Intrinsics + extrinsics for all cameras 15 columns with calibration matrices
LiDAR Range Images lidar/ Raw per-LiDAR sensor range images 7 columns with flattened range data
3D Boxes lidar_box/ Ground-truth boxes in Vehicle Frame 21 columns with 3D object state
LiDAR Calibration lidar_calibration/ Beam inclinations and extrinsic transforms 6 columns with sensor parameters
LiDAR Poses lidar_pose/ Per-pixel transforms LiDAR → Vehicle → World Pose transformation matrices
Segmentation lidar_segmentation/ Point-wise semantic labels Semantic segmentation masks
Projections lidar_camera_projection/ LiDAR-to-camera pixel mappings Cross-modal alignment data

2️⃣ File Structure & Contents

Each segment is approximately 20 seconds long, split into multiple Parquet files with standardized naming:

segment_id_start_end.parquet
Example: 10017090168044687777_6380_000_6400_000.parquet

Inside each data folder (e.g., training/lidar/), files contain rows corresponding to sensor measurements at specific timestamps.

Parquet Schema Structure

All Waymo v2.01 data follows a consistent schema pattern:

Key Fields (Common across all data types):
├── index: Unique row identifier (String)
├── key.segment_context_name: Segment ID (String)  
├── key.frame_timestamp_micros: Timestamp in microseconds (Int64)
└── key.[sensor]_name: Sensor identifier (Int8)

Component Fields:
└── [ComponentType].[field_hierarchy]: Actual data values
    ├── Scalar values: Direct numeric/string data
    ├── List values: Arrays (e.g., transformation matrices)
    └── Nested structures: Complex hierarchical data

Note: The [ComponentType] follows the pattern [SensorType]Component (e.g., CameraImageComponent, LiDARBoxComponent). See the inspector code for detailed field analysis of each component type.

LiDAR IDs (v2.1 five-sensor setup):

ID Location Yaw (deg) Position (m)
1 Roof edge / back-right +148° [1.43, 0.0, 2.18]
2 Front bumper [4.07, 0.0, 0.69]
3 Left side +90° [3.25, +1.02, 0.98]
4 Right side −90° [3.25, −1.02, 0.98]
5 Rear 180° [−1.15, 0.0, 0.46]

⚠️ v2.1 no longer includes the 360° Top LiDAR used in early WOD versions.


3️⃣ Coordinate Frames

Understanding coordinate systems is the foundation for correct visualization and data alignment.

Vehicle Frame (Primary Reference Frame)

  • Origin: Vehicle center (geometric center of the ego vehicle)
  • Axes:
    • +X: Forward direction (vehicle's front)
    • +Y: Left direction (driver's left)
    • +Z: Upward direction (towards sky)
  • Usage: All 3D bounding boxes and calibration extrinsics are defined in this frame
  • Mathematical Representation: Right-handed coordinate system

$$\mathbf{p}_{\text{vehicle}} = \begin{bmatrix} x \ y \ z \end{bmatrix} \text{ where } \begin{cases} x > 0 & \text{forward} \ y > 0 & \text{left} \ z > 0 & \text{up} \end{cases}$$

LiDAR Frame (Sensor-Specific)

  • Origin: Individual LiDAR sensor center
  • Axes: Same orientation as vehicle frame but translated/rotated
    • +X: Sensor's forward direction
    • +Y: Sensor's left direction
    • +Z: Sensor's upward direction
  • Usage: Raw range image data is initially in this frame
  • Transform: Each LiDAR has a unique extrinsic transformation to vehicle frame

Camera Frame (OpenCV Convention)

  • Origin: Camera optical center
  • Axes:
    • +X: Right direction (image columns)
    • +Y: Down direction (image rows)
    • +Z: Forward direction (optical axis, into the scene)
  • Usage: Camera images and 2D bounding boxes
  • Projection: 3D points project to 2D image coordinates via intrinsic matrix

$$\begin{bmatrix} u \ v \ 1 \end{bmatrix} = \mathbf{K} \begin{bmatrix} X_c/Z_c \ Y_c/Z_c \ 1 \end{bmatrix}$$

where $\mathbf{K}$ is the camera intrinsic matrix and $(X_c, Y_c, Z_c)$ are coordinates in camera frame.


4️⃣ Camera Data Components

Camera Images (camera_image/)

Schema: 15 columns containing RGB image data and comprehensive metadata

Field Type Description
index String Unique row identifier
key.segment_context_name String Segment/sequence identifier
key.frame_timestamp_micros Int64 Frame timestamp in microseconds
key.camera_name Int8 Camera ID (0-4 for FRONT, FRONT_LEFT, FRONT_RIGHT, SIDE_LEFT, SIDE_RIGHT)
[CameraImageComponent].image Binary JPEG/PNG compressed image bytes
[CameraImageComponent].pose.transform List[Double] 4×4 transformation matrix (16 elements)
[CameraImageComponent].velocity.linear_velocity.{x,y,z} Double Linear velocity components (m/s)
[CameraImageComponent].velocity.angular_velocity.{x,y,z} Double Angular velocity components (rad/s)
[CameraImageComponent].pose_timestamp Double Pose measurement timestamp
[CameraImageComponent].rolling_shutter_params.shutter Double Rolling shutter timing parameter

Usage Example:

# Extract image from binary data
image_bytes = row['[CameraImageComponent].image']
pil_image = Image.open(io.BytesIO(image_bytes))

# Extract pose matrix (4x4 transformation)
pose_flat = row['[CameraImageComponent].pose.transform']  # 16 elements
pose_matrix = np.array(pose_flat).reshape(4, 4)

Camera Boxes (camera_box/)

Schema: 12 columns containing 2D bounding box annotations

Field Type Description
key.camera_object_id String Unique object identifier per camera
[CameraBoxComponent].box.center.{x,y} Double Bounding box center coordinates (pixels)
[CameraBoxComponent].box.size.{x,y} Double Bounding box dimensions (width, height in pixels)
[CameraBoxComponent].type Int8 Object class type ID
[CameraBoxComponent].difficulty_level.detection Int8 Detection difficulty rating (1-5)
[CameraBoxComponent].difficulty_level.tracking Int8 Tracking difficulty rating (1-5)

Camera Calibration (camera_calibration/)

Schema: 15 columns containing intrinsic and extrinsic calibration parameters

Field Type Description
[CameraCalibrationComponent].intrinsic.f_u Double Focal length in u direction (pixels)
[CameraCalibrationComponent].intrinsic.f_v Double Focal length in v direction (pixels)
[CameraCalibrationComponent].intrinsic.c_u Double Principal point u coordinate (pixels)
[CameraCalibrationComponent].intrinsic.c_v Double Principal point v coordinate (pixels)
[CameraCalibrationComponent].intrinsic.k1,k2,k3 Double Radial distortion coefficients
[CameraCalibrationComponent].intrinsic.p1,p2 Double Tangential distortion coefficients
[CameraCalibrationComponent].extrinsic.transform List[Double] 4×4 camera-to-vehicle transformation
[CameraCalibrationComponent].width Int32 Image width in pixels
[CameraCalibrationComponent].height Int32 Image height in pixels

Intrinsic Matrix Construction: $$\mathbf{K} = \begin{bmatrix} f_u & 0 & c_u \ 0 & f_v & c_v \ 0 & 0 & 1 \end{bmatrix}$$


5️⃣ LiDAR Data Components

LiDAR Range Images (lidar/)

Schema: 11 columns containing range image data and sensor metadata

Each LiDAR captures a range image instead of a raw point cloud. This is a 2D representation where each pixel encodes distance, intensity, and other measurements.

Field Type Description
index String Unique row identifier
key.segment_context_name String Segment/sequence identifier
key.frame_timestamp_micros Int64 Frame timestamp in microseconds
key.laser_name Int8 LiDAR sensor ID (0-4 for TOP, FRONT, SIDE_LEFT, SIDE_RIGHT, REAR)
[LiDARComponent].range_image_return1.range Binary First return range data (compressed)
[LiDARComponent].range_image_return1.intensity Binary First return intensity data (compressed)
[LiDARComponent].range_image_return1.elongation Binary First return elongation data (compressed)
[LiDARComponent].range_image_return2.range Binary Second return range data (compressed)
[LiDARComponent].range_image_return2.intensity Binary Second return intensity data (compressed)
[LiDARComponent].range_image_return2.elongation Binary Second return elongation data (compressed)
[LiDARComponent].camera_projection_exclusion_mask Binary Exclusion mask for camera projections

Range Image Structure:

  • Dimensions: Typically H×W where H varies by sensor (64-200 rows), W is azimuth resolution
  • Encoding: Each pixel encodes distance measurement in meters
  • Returns: Two returns per laser beam (first and second reflection)
  • Compression: Data is compressed using Waymo's proprietary format

6️⃣ LiDAR-to-Vehicle Transform Mathematics

Converting LiDAR range images to 3D point clouds in the vehicle coordinate frame requires several mathematical transformations.

Step 1: Range Image to Spherical Coordinates

Each pixel $(u, v)$ in the range image corresponds to spherical coordinates:

$$\begin{align} \text{azimuth} &= \frac{2\pi \cdot u}{W} - \pi \\ \text{inclination} &= \text{beam_inclinations}[v] \\ \text{range} &= \text{range_image}[v, u] \end{align}$$

where $W$ is the azimuth resolution (typically 2650 for Waymo LiDAR).

Step 2: Spherical to Cartesian Conversion (LiDAR Frame)

Convert spherical coordinates to 3D Cartesian coordinates in the LiDAR sensor frame:

$$\begin{align} x_{\text{lidar}} &= \text{range} \cdot \cos(\text{inclination}) \cdot \cos(\text{azimuth}) \\ y_{\text{lidar}} &= \text{range} \cdot \cos(\text{inclination}) \cdot \sin(\text{azimuth}) \\ z_{\text{lidar}} &= \text{range} \cdot \sin(\text{inclination}) \end{align}$$

Step 3: LiDAR-to-Vehicle Transformation

Apply the extrinsic calibration matrix to transform from LiDAR frame to vehicle frame:

$$\begin{bmatrix} x_{\text{vehicle}} \ y_{\text{vehicle}} \ z_{\text{vehicle}} \ 1 \end{bmatrix} = \mathbf{T}_{\text{lidar→vehicle}} \begin{bmatrix} x_{\text{lidar}} \ y_{\text{lidar}} \ z_{\text{lidar}} \ 1 \end{bmatrix}$$

where $\mathbf{T}_{\text{lidar→vehicle}}$ is the 4×4 transformation matrix from the calibration data:

$$\mathbf{T}_{\text{lidar→vehicle}} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \ \mathbf{0}^T & 1 \end{bmatrix}$$

with $\mathbf{R} \in \mathbb{R}^{3 \times 3}$ being the rotation matrix and $\mathbf{t} \in \mathbb{R}^{3}$ the translation vector.

Complete Transformation Pipeline

The complete transformation from range image pixel to vehicle coordinates:

$$\mathbf{p}_{\text{vehicle}} = \mathbf{T}_{\text{lidar→vehicle}} \cdot \mathbf{f}_{\text{spherical→cartesian}}(\text{range}[v,u], \text{azimuth}(u), \text{inclination}(v))$$

Implementation Note: The waymo_parquet_inspector.py script provides detailed field analysis for understanding the exact data formats and transformations.

LiDAR Boxes (lidar_box/)

Schema: 18 columns containing 3D bounding box annotations

Field Type Description
key.laser_object_id String Unique object identifier per LiDAR
[LiDARBoxComponent].box.center.{x,y,z} Double 3D bounding box center in vehicle frame (meters)
[LiDARBoxComponent].box.size.{x,y,z} Double 3D bounding box dimensions (length, width, height in meters)
[LiDARBoxComponent].box.heading Double Object orientation angle (radians)
[LiDARBoxComponent].type Int8 Object class type ID
[LiDARBoxComponent].id String Persistent object tracking ID
[LiDARBoxComponent].detection_difficulty_level Int8 Detection difficulty rating (1-5)
[LiDARBoxComponent].tracking_difficulty_level Int8 Tracking difficulty rating (1-5)
[LiDARBoxComponent].num_lidar_points_in_box Int32 Number of LiDAR points inside the box

3D Box Representation: $$\mathbf{Box} = {\mathbf{c}, \mathbf{s}, \theta} \text{ where } \begin{cases} \mathbf{c} = [c_x, c_y, c_z]^T & \text{center position} \ \mathbf{s} = [s_x, s_y, s_z]^T & \text{size (L×W×H)} \ \theta & \text{heading angle} \end{cases}$$

LiDAR Calibration (lidar_calibration/)

Schema: 8 columns containing sensor calibration parameters

Field Type Description
[LiDARCalibrationComponent].extrinsic.transform List[Double] 4×4 LiDAR-to-vehicle transformation matrix
[LiDARCalibrationComponent].beam_inclinations List[Double] Vertical beam angle inclinations (radians)
[LiDARCalibrationComponent].beam_inclination_min Double Minimum beam inclination angle
[LiDARCalibrationComponent].beam_inclination_max Double Maximum beam inclination angle

7️⃣ Additional Data Components

Projected LiDAR (projected_lidar_labels/)

Schema: 14 columns containing LiDAR points projected onto camera images

Field Type Description
[ProjectedLiDARLabelsComponent].box.center.{x,y} Double Projected 2D box center (pixels)
[ProjectedLiDARLabelsComponent].box.size.{x,y} Double Projected 2D box size (pixels)
[ProjectedLiDARLabelsComponent].type Int8 Object class type ID
[ProjectedLiDARLabelsComponent].id String Object tracking ID
[ProjectedLiDARLabelsComponent].detection_difficulty_level Int8 Detection difficulty (1-5)

Segmentation Labels (lidar_segmentation/)

Schema: 7 columns containing point-wise semantic segmentation

Field Type Description
[LiDARSegmentationLabelComponent].pointcloud_to_image_projection Binary Point-to-pixel mapping data
[LiDARSegmentationLabelComponent].segmentation_label Binary Per-point semantic labels
[LiDARSegmentationLabelComponent].instance_id_to_global_id_mapping Binary Instance ID mappings

Statistics (stats/)

Schema: 9 columns containing frame-level statistics and metadata

Field Type Description
[StatsComponent].location String Geographic location identifier
[StatsComponent].time_of_day String Time period (Dawn, Day, Dusk, Night)
[StatsComponent].weather String Weather conditions
[StatsComponent].camera_object_counts List[Int32] Object counts per camera
[StatsComponent].lidar_object_counts List[Int32] Object counts per LiDAR

Usage for Data Analysis:

# Filter by weather conditions
sunny_frames = df[df['[StatsComponent].weather'] == 'sunny']

# Analyze object distribution
total_objects = df['[StatsComponent].lidar_object_counts'].apply(sum)
---

## 8️⃣ Range Image Processing

Each pixel encodes `(range, intensity, elongation, ...)` data in compressed binary format.

**Typical Range Image Dimensions**:

| Sensor | Shape (H, W, C) | Field of View |
|---------|-----------------|---------------|
| TOP LiDAR (#0) | 64 × 2650 × 4 | 360° horizontal |
| FRONT LiDAR (#1) | 200 × 600 × 4 | ~100° horizontal |
| SIDE_LEFT (#2) | 200 × 600 × 4 | ~100° horizontal |
| SIDE_RIGHT (#3) | 200 × 600 × 4 | ~100° horizontal |
| REAR (#4) | 200 × 600 × 4 | ~100° horizontal |

### Range Image to Point Cloud Conversion

**Step 1**: Decode the compressed range/intensity data from binary format
**Step 2**: Apply coordinate transformations to get 3D points in vehicle frame
**Step 3**: Filter invalid points (range = 0)

---

## 9️⃣ LiDAR Calibration Mathematics

### Extrinsic Transformation Matrix

The extrinsic matrix $\mathbf{T}_{V \leftarrow L}$ transforms points from LiDAR frame to vehicle frame:

$$\mathbf{T}_{V \leftarrow L} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^T & 1 \end{bmatrix}$$

**Storage Format**: Row-major order (`order="C"`) in Parquet, stored as 16-element list

**Example Transformation Matrix**:

[[-0.8478, -0.5304, -0.0025, 1.43 ], [ 0.5304, -0.8478, 0.0002, 0.00 ], [-0.0022, -0.0012, 1.0000, 2.184], [ 0.0000, 0.0000, 0.0000, 1.0000]]

→ Rotation yaw ≈ 148°, translation ≈ (1.43, 0.0, 2.18) meters

### Beam Inclination Angles

Vertical angles for each row in the range image, typically distributed linearly between minimum and maximum inclination values.

$$\theta_v = \text{beam\_inclinations}[v] \text{ for row } v \in [0, H-1]$$

---

## 🔟 Complete LiDAR Processing Pipeline

### Mathematical Transformation Steps

For each LiDAR pixel at position $(u, v)$ with range $r$:

**Step 1: Spherical Coordinates**
$$\begin{align}
\phi &= \frac{2\pi \cdot u}{W} - \pi \quad \text{(azimuth angle)} \\
\theta &= \text{beam\_inclinations}[v] \quad \text{(inclination angle)} \\
r &= \text{range\_image}[v, u] \quad \text{(distance in meters)}
\end{align}$$

**Step 2: LiDAR Frame Cartesian Coordinates**
$$\begin{align}
x_L &= r \cos(\theta) \cos(\phi) \\
y_L &= r \cos(\theta) \sin(\phi) \\
z_L &= r \sin(\theta)
\end{align}$$

**Step 3: Homogeneous Coordinates**
$$\mathbf{p}_L = \begin{bmatrix} x_L \\ y_L \\ z_L \\ 1 \end{bmatrix}$$

**Step 4: Transform to Vehicle Frame**
$$\mathbf{p}_V = \mathbf{T}_{V \leftarrow L} \cdot \mathbf{p}_L$$

### Implementation in NumPy

```python
# Create homogeneous coordinate matrix
pts_h = np.stack([x_L, y_L, z_L, np.ones_like(z_L)], axis=-1).reshape(-1, 4)

# Transform to vehicle frame (do NOT invert the matrix)
xyz_vehicle = (pts_h @ extrinsic_matrix.T)[:, :3]

Important: The dataset stores LiDAR→Vehicle transforms directly. Do not invert the matrix.


1️⃣1️⃣ 3D Bounding Box Specifications

Box Parameters in Vehicle Frame

Each 3D bounding box in lidar_box/ is defined by:

Parameter Field Description
Center [LiDARBoxComponent].box.center.{x,y,z} Box center position (meters)
Size [LiDARBoxComponent].box.size.{x,y,z} Length (X), Width (Y), Height (Z)
Heading [LiDARBoxComponent].box.heading Yaw angle (radians, CCW from +X axis)
Type [LiDARBoxComponent].type Object class (vehicle, pedestrian, cyclist)

Important Note: Box center Z-coordinate represents the object's geometric center, not the bottom.

3D Box Mathematical Representation

$$\mathbf{Box} = {\mathbf{c}, \mathbf{s}, \psi}$$

where:

  • $\mathbf{c} = [c_x, c_y, c_z]^T$ is the center position
  • $\mathbf{s} = [s_x, s_y, s_z]^T$ is the size vector (length × width × height)
  • $\psi$ is the heading angle (yaw rotation about Z-axis)

1️⃣2️⃣ Multi-Sensor Data Fusion

Coordinate Alignment Process

Step 1: Decode range images from all 5 LiDAR sensors Step 2: Transform each sensor's points to vehicle frame using respective extrinsics Step 3: Merge all point clouds into unified coordinate system

# Process each LiDAR sensor
all_points = []
for sensor_id in range(5):  # 0=TOP, 1=FRONT, 2=SIDE_LEFT, 3=SIDE_RIGHT, 4=REAR
    # Extract sensor-specific data
    range_data = decode_range_image(sensor_data[sensor_id])
    extrinsic = get_extrinsic_matrix(sensor_id)
    
    # Transform to vehicle frame
    points_vehicle = transform_to_vehicle_frame(range_data, extrinsic)
    all_points.append(points_vehicle)

# Merge all sensors
merged_pointcloud = np.concatenate(all_points, axis=0)

Result: Both point cloud and 3D boxes are now in the same vehicle coordinate frame and align perfectly.


1️⃣3️⃣ Visualization and Projection

3D Visualization with Open3D

import open3d as o3d
import numpy as np

# Create point cloud visualization
pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(xyz_vehicle))
pcd.paint_uniform_color([0.6, 0.6, 0.6])
geometries = [pcd]

# Add 3D bounding boxes
for box_data in boxes_3d:
    x, y, z = box_data['center']
    dx, dy, dz = box_data['size'] 
    yaw = box_data['heading']
    
    # Create rotation matrix for yaw
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]], dtype=np.float32)
    
    # Create oriented bounding box
    obb = o3d.geometry.OrientedBoundingBox(
        center=[x, y, z], 
        R=R, 
        extent=[dx, dy, dz]
    )
    obb.color = (1, 0, 0)  # Red color
    geometries.append(obb)

# Add coordinate frame
axis = o3d.geometry.TriangleMesh.create_coordinate_frame(size=5.0)
geometries.append(axis)

# Visualize
o3d.visualization.draw_geometries(geometries)

2D Projection onto Camera Images

Mathematical Projection Pipeline:

  1. Transform 3D points to camera frame: $$\mathbf{p}C = \mathbf{T}{C \leftarrow V} \cdot \mathbf{p}_V$$

  2. Project to image plane: $$\begin{bmatrix} u \ v \ 1 \end{bmatrix} = \mathbf{K} \begin{bmatrix} X_C/Z_C \ Y_C/Z_C \ 1 \end{bmatrix}$$

  3. Apply distortion correction (if needed): $$\begin{align} r^2 &= u_n^2 + v_n^2 \ u_d &= u_n(1 + k_1r^2 + k_2r^4 + k_3r^6) + 2p_1u_nv_n + p_2(r^2 + 2u_n^2) \ v_d &= v_n(1 + k_1r^2 + k_2r^4 + k_3r^6) + p_1(r^2 + 2v_n^2) + 2p_2u_nv_n \end{align}$$

def project_3d_to_2d(points_3d, camera_intrinsic, camera_extrinsic):
    """Project 3D points to camera image coordinates"""
    # Transform to camera frame
    vehicle_to_camera = np.linalg.inv(camera_extrinsic)
    points_homogeneous = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    points_camera = (vehicle_to_camera @ points_homogeneous.T).T[:, :3]
    
    # Project to image plane
    points_2d_homogeneous = (camera_intrinsic @ points_camera.T).T
    image_points = points_2d_homogeneous[:, :2] / points_2d_homogeneous[:, 2:3]
    depths = points_camera[:, 2]
    
    return image_points, depths

1️⃣4️⃣ Common Issues and Solutions

Troubleshooting Guide

Problem Cause Solution
Box appears "floating" above ground LiDAR mounted ~2m high, box Z is object center This is normal behavior
Box appears "in front of" points Using single LiDAR sensor only Merge all 5 LiDAR sensors
Point cloud mirrored/flipped Used np.linalg.inv(extrinsic) Use extrinsic.T for matrix multiplication
Translation values all zeros Used order='F' for reshape Use order='C' (row-major)
Beam angles incorrect Reused wrong beam inclinations Read sensor-specific beam ranges
Point cloud appears "warped" Mixed sensors with wrong extrinsics Verify yaw angle per LiDAR sensor

Best Practices

Recommended Settings:

  • Use order="C" for array reshaping
  • Apply extrinsic.T for transformations (do not invert)
  • Set flip_rows=True, flip_cols=False for range image processing
  • Use azimuth = np.linspace(np.pi, -np.pi, W) for azimuth calculation

Validation Checks:

  • Verify point cloud and boxes align in 3D visualization
  • Check that merged multi-LiDAR coverage is 360°
  • Ensure camera projections fall within image boundaries
  • Validate coordinate frame orientations match expected directions

1️⃣5️⃣ References and Resources

Official Documentation

Community Tools and Converters

  • OpenCOOD: Multi-modal 3D detection framework with Waymo support
  • OpenMMLab: MMDetection3D parser examples
  • Waymo2KITTI: Format conversion utilities (GitHub community)

Data Analysis Tools

  • Field Inspector: - Comprehensive schema analysis
  • Visualization Scripts: Open3D and Matplotlib integration examples

✨ Summary

Complete Processing Pipeline

Step Action Coordinate Frame
1 Decode range image LiDAR frame
2 Apply extrinsic transform Vehicle frame
3 Merge all sensors Vehicle frame
4 Visualize with boxes Vehicle frame
5 Project to cameras Camera/Image frame

Key Mathematical Transformations

$$\text{Range Image} \xrightarrow{\text{spherical→cartesian}} \text{LiDAR Frame} \xrightarrow{\mathbf{T}_{V \leftarrow L}} \text{Vehicle Frame} \xrightarrow{\mathbf{T}_{C \leftarrow V}} \text{Camera Frame}$$

When implemented correctly, the merged multi-LiDAR point cloud aligns perfectly with Waymo's 3D bounding boxes and camera images, enabling robust multi-modal perception and analysis.