RTX5090 Blackwell support #305

@HKNDerRollo

Summary
The official context_chat_backend Docker images do not support NVIDIA Blackwell GPUs (RTX 5090, RTX 5080, etc.) because they are compiled against CUDA 12.2 without sm_120 architecture support.

Environment

  • GPU: NVIDIA GeForce RTX 5090 (compute capability 12.0 / sm_120)
  • Driver: 595.58.03
  • CUDA: 13.2
  • context_chat_backend: 5.3.0
  • context_chat: 5.3.1
  • Nextcloud: 33.x

Problem
The official image ghcr.io/nextcloud/context_chat_backend:5.3.0 is based on nvidia/cuda:12.2.2-runtime-ubuntu22.04 and downloads a prebuilt llama-cpp-python wheel compiled for CUDA 12.2 without Blackwell support:

CUDA : ARCHS = 500,520,530,600,610,620,700,720,750,800,860,870,890,900
# Missing: 1000 (sm_100), 1200 (sm_120)

As a result:

  • GPU utilization stays at 0%
  • Only 746 MiB of 32607 MiB VRAM is used
  • Power draw: 7W of 575W
  • All embedding computation falls back to CPU

The RTX 5090 requires CUDA 12.8+ and sm_120 for native Blackwell support.
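The symptoms listed above can be checked from the host or from inside the container while indexing is running. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names (`GpuSample`, `parse_sample`, `read_samples`) are hypothetical and not part of context_chat_backend.

```python
import subprocess
from dataclasses import dataclass

# Standard nvidia-smi query fields; values are reported in %, MiB, MiB and W.
QUERY_FIELDS = "utilization.gpu,memory.used,memory.total,power.draw"


@dataclass
class GpuSample:
    utilization_pct: float
    memory_used_mib: float
    memory_total_mib: float
    power_draw_w: float


def parse_sample(csv_line: str) -> GpuSample:
    """Parse one line of `nvidia-smi --format=csv,noheader,nounits` output.

    Assumes all four fields are numeric; some GPUs report "[N/A]" for
    power.draw, which would raise ValueError here.
    """
    util, used, total, power = (float(v) for v in csv_line.split(","))
    return GpuSample(util, used, total, power)


def read_samples() -> list[GpuSample]:
    """Run nvidia-smi once and return one sample per GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_sample(line) for line in out.splitlines() if line.strip()]
```

If `read_samples()` keeps returning values close to the ones reported here (0% utilization, 746 of 32607 MiB, 7 W) while embeddings are being computed, the work is most likely running on the CPU.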

What already works
The master branch (5.4.0-beta0) has a new multi-stage Dockerfile based on CUDA 12.8, which is new enough for Blackwell. However, the final stage still defaults to FROM runtime-cpu AS final instead of FROM runtime-cuda AS final.

With two small fixes to the master Dockerfile:

# 1. Add Blackwell architecture
ENV CMAKE_CUDA_ARCHITECTURES="89;90;100;120"

# 2. Use the CUDA runtime as the final stage (was: FROM runtime-cpu AS final)
FROM runtime-cuda AS final

...the RTX 5090 works correctly:

CUDA : ARCHS = 500,610,700,750,800,860,890,1200 | BLACKWELL_NATIVE_FP4 = 1
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0
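The two `ARCHS` lines quoted in this issue can be compared programmatically. The sketch below assumes the log format shown here, where each entry is the compute capability encoded as major * 100 + minor * 10 (so sm_120 appears as 1200); `parse_archs` and `has_native_arch` are hypothetical helper names. It only checks for a native entry and does not account for any PTX-based fallback.

```python
import re


def parse_archs(log_line: str) -> set[int]:
    """Extract the architecture codes from a 'CUDA : ARCHS = ...' log line."""
    match = re.search(r"ARCHS\s*=\s*([\d,]+)", log_line)
    if match is None:
        return set()
    return {int(code) for code in match.group(1).split(",") if code}


def has_native_arch(log_line: str, major: int, minor: int) -> bool:
    """Return True if the build lists the given compute capability.

    Assumes the encoding seen in this issue: 12.0 -> 1200, 8.9 -> 890.
    """
    return major * 100 + minor * 10 in parse_archs(log_line)


# Log lines copied from this issue.
official_image = (
    "CUDA : ARCHS = 500,520,530,600,610,620,700,720,750,800,860,870,890,900"
)
patched_build = (
    "CUDA : ARCHS = 500,610,700,750,800,860,890,1200 | BLACKWELL_NATIVE_FP4 = 1"
)

print(has_native_arch(official_image, 12, 0))
print(has_native_arch(patched_build, 12, 0))
```

The first call reports that the official 5.3.0 image has no sm_120 entry, and the second that the patched master build does.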

Request

  1. Change FROM runtime-cpu AS final to FROM runtime-cuda AS final in the master Dockerfile (this seems like a bug)
  2. Add sm_120 to CMAKE_CUDA_ARCHITECTURES for Blackwell support
  3. Consider publishing a CUDA-specific image tag, e.g. context_chat_backend:5.3.x-cuda, for GPU users
  4. Publish a v5.3.1 git tag to match the context_chat frontend app version

Additional Notes

  • The RTX 5090 has 32 GB VRAM which is ideal for large embedding workloads
  • With proper Blackwell support, indexing performance should improve dramatically
  • BLACKWELL_NATIVE_FP4 = 1 confirms native FP4 Tensor Core support is available

Thank you for the great work on this project!

Labels: enhancement (New feature or request)