RTX5090 Blackwell support #305

**Summary**
The official `context_chat_backend` Docker images do not support NVIDIA Blackwell GPUs (RTX 5090, RTX 5080, etc.) because the bundled llama-cpp-python wheel is compiled against CUDA 12.2 without sm_120 architecture support.

**Environment**
- GPU: NVIDIA GeForce RTX 5090 (compute capability 12.0 / sm_120)
- Driver: 595.58.03
- CUDA: 13.2
- context_chat_backend: 5.3.0
- context_chat: 5.3.1
- Nextcloud: 33.x

**Problem**
The official image `ghcr.io/nextcloud/context_chat_backend:5.3.0` is based on `nvidia/cuda:12.2.2-runtime-ubuntu22.04` and downloads a prebuilt llama-cpp-python wheel compiled for CUDA 12.2 without Blackwell support:

```
CUDA : ARCHS = 500,520,530,600,610,620,700,720,750,800,860,870,890,900
# Missing: 1000 (sm_100), 1200 (sm_120)
```
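
To check whether a given build covers a particular GPU, the device's compute capability can be compared against the `ARCHS` list in that log line. The sketch below assumes the encoding visible in the logs in this issue (compute capability X.Y is listed as X*100 + Y*10, so 12.0 appears as 1200). `parse_archs` and `supports` are hypothetical helper names for illustration, not part of llama-cpp-python, and the check is an exact match only.

```python
import re


def parse_archs(log_line: str) -> set[int]:
    """Extract the architecture codes from a 'CUDA : ARCHS = ...' log line."""
    match = re.search(r"ARCHS\s*=\s*([\d,]+)", log_line)
    if match is None:
        return set()
    return {int(code) for code in match.group(1).split(",") if code}


def supports(log_line: str, major: int, minor: int) -> bool:
    """True if the build lists an exact entry for compute capability major.minor."""
    # Assumed encoding: 12.0 -> 1200, 8.9 -> 890.
    return major * 100 + minor * 10 in parse_archs(log_line)


# Log line from the official 5.3.0 image, copied from the block above.
official = "CUDA : ARCHS = 500,520,530,600,610,620,700,720,750,800,860,870,890,900"

print(supports(official, 12, 0))  # RTX 5090 (sm_120)
print(supports(official, 8, 9))   # Ada (sm_89), for comparison
```

For the line above this prints `False` for sm_120 and `True` for sm_89.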

As a result:
- GPU utilization stays at 0%
- Only 746 MiB of 32607 MiB VRAM is used
- Power draw: 7W of 575W
- All embedding computation falls back to CPU
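
These symptoms can be reproduced by polling `nvidia-smi` while indexing is running. The following is a minimal sketch, assuming `nvidia-smi` is on the `PATH` and that all four queried fields return numeric values (some setups report `[N/A]` for power draw, which this sketch does not handle). `parse_gpu_stats` and `query_gpu_stats` are hypothetical helper names, and the sample line is illustrative, built from the figures listed above.

```python
import subprocess

FIELDS = ["utilization.gpu", "memory.used", "memory.total", "power.draw"]


def parse_gpu_stats(csv_line: str) -> dict[str, float]:
    """Parse one line of 'csv,noheader,nounits' output into a field -> value map."""
    values = [float(value.strip()) for value in csv_line.split(",")]
    return dict(zip(FIELDS, values))


def query_gpu_stats() -> dict[str, float]:
    """Query the first GPU via nvidia-smi (needs an NVIDIA driver on the host)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--query-gpu={','.join(FIELDS)}",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_stats(result.stdout.splitlines()[0])


# Illustrative sample using the numbers reported in this issue.
sample = "0, 746, 32607, 7.00"
stats = parse_gpu_stats(sample)
vram_fraction = stats["memory.used"] / stats["memory.total"]
print(f"GPU util: {stats['utilization.gpu']:.0f}%  VRAM used: {vram_fraction:.1%}")
```

On a machine with the GPU attached, calling `query_gpu_stats()` during an indexing run would show whether utilization and memory use stay near idle.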

The RTX 5090 requires **CUDA 12.8+** and **sm_120** for native Blackwell support.

**What already works**
The `master` branch (5.4.0-beta0) has a new multi-stage Dockerfile using CUDA 12.8, which meets the minimum CUDA version for Blackwell. However, the final stage still defaults to `FROM runtime-cpu AS final` instead of `FROM runtime-cuda AS final`, so the default build does not use the CUDA runtime stage.

With two small fixes to the master Dockerfile:
```dockerfile
# 1. Add Blackwell architecture
ENV CMAKE_CUDA_ARCHITECTURES="89;90;100;120"

# 2. Use CUDA runtime as final stage (was: FROM runtime-cpu AS final)
FROM runtime-cuda AS final
```
...the RTX 5090 works correctly:
```
CUDA : ARCHS = 500,610,700,750,800,860,890,1200 | BLACKWELL_NATIVE_FP4 = 1
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0
```

**Request**
1. Change `FROM runtime-cpu AS final` to `FROM runtime-cuda AS final` in the master Dockerfile (the current default looks like a bug)
2. Add `120` (sm_120) to `CMAKE_CUDA_ARCHITECTURES` for Blackwell support
3. Consider publishing a CUDA-specific image tag, e.g. `context_chat_backend:5.3.x-cuda`, for GPU users
4. Publish a `v5.3.1` git tag to match the `context_chat` frontend app version

**Additional Notes**
- The RTX 5090 has 32 GB VRAM which is ideal for large embedding workloads
- With proper Blackwell support, indexing performance should improve dramatically
- `BLACKWELL_NATIVE_FP4 = 1` confirms native FP4 Tensor Core support is available

Thank you for the great work on this project!

