[RFD]: Artifact Library Service Architecture and MinIO Integration for OpenCHAMI #129

@bmcdonald3

Decision Goal

Decide whether OpenCHAMI should pursue a dedicated metadata broker pattern for managing HPC boot and update artifacts, and choose the initial storage backend implementation. The specific decisions requested are:

  1. Should OpenCHAMI adopt an S3-compatible object storage model for cluster artifacts?
  2. Should we accept the existing Fabrica-based Artifact Library Service (ALS) codebase as an MVP to prototype this architecture?
  3. Should we standardize on MinIO as the reference storage backend for the MVP?

Category

Architecture

Stakeholders / Affected Areas

No response

Decision Needed By

No response

Problem Statement

OpenCHAMI currently lacks a scalable, standardized system to securely store and distribute artifacts (OS images, container images, boot files, firmware binaries) across the cluster.
During the initial development of the Firmware Management Service (FMS), a standard HTTP server pointing to a local filesystem was used to serve payload files directly to BMCs. While this was sufficient for early prototyping and functional testing, it does not scale to cluster-wide use. Several OpenCHAMI use cases would benefit from a centralized solution for object storage and distributed access.

Proposed Solution

Prototype a centralized artifact distribution architecture divided into two distinct layers: a physical S3-compatible storage backend, and a lightweight metadata brokering service (the Artifact Library Service).
The MVP would focus on using MinIO as the backend and the Fabrica-based ALS codebase to track metadata and delegate access via a valet key pattern (Presigned URLs).
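
To make the valet key idea concrete, here is a minimal sketch of a time-limited signed URL. It is deliberately simplified: it uses a plain HMAC over the path and expiry, not the AWS Signature Version 4 scheme that MinIO and other S3-compatible backends actually use for Presigned URLs. The secret, function names, and query parameter names are all hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical shared secret. A real S3 backend signs with keys derived
# from an access key pair, not a single static value like this.
SECRET = b"demo-secret"


def sign_url(base_url: str, ttl_seconds: int, now: Optional[int] = None) -> str:
    """Append an expiry timestamp and an HMAC signature to base_url.

    Assumes base_url has no query string of its own.
    """
    now = int(time.time()) if now is None else now
    expires = now + ttl_seconds
    path = urlparse(base_url).path
    message = f"GET\n{path}\n{expires}".encode()
    signature = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expires, "signature": signature})
    return f"{base_url}?{query}"


def verify_url(url: str, now: Optional[int] = None) -> bool:
    """Check what the storage side would check: signature and expiry."""
    now = int(time.time()) if now is None else now
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    try:
        expires = int(query["expires"][0])
        signature = query["signature"][0]
    except (KeyError, ValueError):
        return False
    message = f"GET\n{parsed.path}\n{expires}".encode()
    expected = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking signature bytes via timing.
    return hmac.compare_digest(signature, expected) and now <= expires
```

The point of the pattern is that the client holding the URL never sees storage credentials: the signature covers one object path and one expiry, so changing either invalidates the URL.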

  1. The Storage Backend (MinIO)

MinIO is proposed as the reference S3-compatible storage backend for all immutable files. It treats everything as Binary Large Objects (BLOBs), easily accommodating HPC asset sizes:

  • OS / Container Images: 50MB - 5GB+
  • Boot Files (vmlinuz/initramfs): 20MB - 150MB
  • Firmware Binaries: 1MB - 100MB
  • Config Artifacts: < 1MB

  2. The Access Broker (Artifact Library Service)

Rather than services managing files themselves, the ALS acts as a metadata catalog.

  • State Tracking: Maps logical artifact names to physical S3 buckets and object keys.
  • Validation: Periodically queries the storage backend to verify physical file existence, byte sizes, and SHA-256 hashes against expected values.
  • Access Delegation: Generates cryptographic Presigned URLs (e.g., valid for 60 minutes) granting downstream clients direct, read-only HTTP access to specific objects.
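
The state tracking and validation responsibilities above could look roughly like the following sketch. The `ArtifactRecord` fields and the `validate` helper are hypothetical and are not taken from the ALS prototype. An in-memory dictionary stands in for the storage backend so the example runs without MinIO.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ArtifactRecord:
    """Hypothetical catalog entry: a logical name mapped to a physical
    bucket/key, plus the size and hash the object is expected to have."""
    name: str
    bucket: str
    key: str
    size_bytes: int
    sha256: str


def validate(record: ArtifactRecord,
             backend: Dict[Tuple[str, str], bytes]) -> List[str]:
    """Return a list of problems; an empty list means the object matches.

    `backend` is an in-memory stand-in for object storage, keyed by
    (bucket, key). A real implementation would query the S3 API instead.
    """
    data = backend.get((record.bucket, record.key))
    if data is None:
        return ["missing object"]
    problems = []
    if len(data) != record.size_bytes:
        problems.append("size mismatch")
    if hashlib.sha256(data).hexdigest() != record.sha256:
        problems.append("sha256 mismatch")
    return problems
```

For multi-gigabyte images a real validator would likely check existence and size from object metadata first and stream the content for hashing, rather than holding the whole object in memory as this sketch does.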

Example Integration (Firmware Management Service)
In this model, OpenCHAMI services coordinate metadata, but avoid touching the heavy binaries:

  1. FMS queries the ALS for the BIOS v1.2 record.
  2. ALS returns the payload metadata and a securely signed 60-minute Presigned URL.
  3. FMS connects to the target BMC via Redfish and submits the update job, injecting the Presigned URL into the ImageURI payload.
  4. The physical BMC bypasses FMS entirely, connecting directly to MinIO using the URL to download the binary.
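
Step 3 of the flow above can be sketched as follows. The action path and the `ImageURI` and `TransferProtocol` fields come from the standard Redfish UpdateService `SimpleUpdate` action. The function name and host names are hypothetical, and BMC authentication and TLS handling are omitted. The request is only constructed, never sent.

```python
import json
import urllib.request
from urllib.parse import urlparse

# Standard Redfish action path for pull-style firmware updates.
SIMPLE_UPDATE_PATH = "/redfish/v1/UpdateService/Actions/UpdateService.SimpleUpdate"


def build_simple_update(bmc_host: str, presigned_url: str) -> urllib.request.Request:
    """Build (but do not send) the Redfish request that hands the URL to the BMC."""
    body = {
        "ImageURI": presigned_url,
        # Derive the protocol from the URL scheme, e.g. "HTTP" or "HTTPS".
        "TransferProtocol": urlparse(presigned_url).scheme.upper(),
    }
    return urllib.request.Request(
        url=f"https://{bmc_host}{SIMPLE_UPDATE_PATH}",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

FMS only ever handles this small JSON body. The firmware bytes travel directly from the storage backend to the BMC, which is what keeps FMS out of the data path.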

Alternatives Considered

Alternative Storage Backends:

  • Ceph: Ceph is a heavily adopted standard in enterprise HPC environments and provides excellent S3 compatibility. However, deploying and managing Ceph is difficult. I think it's too heavy for an OpenCHAMI MVP, though the proposed architecture's use of standard S3 APIs means sites could swap MinIO for Ceph in production.
  • Public Cloud S3 (AWS/GCP): Offloading storage entirely to the cloud simplifies local management, but it is not ideal for air-gapped systems.
  • MinIO (Proposed for MVP): Selected because it is open-source, lightweight to deploy as a containerized service, S3-compliant, and capable of scaling horizontally. It provides an easier developer experience while remaining robust enough for many production HPC sites.

Alternative Architectures:

  • Direct S3 Integration in FMS: We could compile S3 SDK dependencies and URL generation directly into FMS or boot orchestration services. However, if multiple services each track their own assets, then storage credentials, hash validation, and inventory audits end up duplicated and scattered across services.
  • Continuing with Local HTTP Servers: Rejected because it does not scale. A local-filesystem HTTP server attached to an FMS pod becomes a bottleneck once many clients download concurrently.

Other Considerations

  • Garbage Collection: The current ALS prototype tracks state but does not define who is responsible for deleting old or orphaned objects from MinIO. We need to decide if ALS should eventually manage object retention policies, or if that is left to external operators.
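
Whoever ends up owning retention, the core check is a comparison between what the backend holds and what the catalog references. The sketch below shows one possible shape for that check. The function name, the input format, and the grace period (so that objects uploaded before their catalog record exists are not flagged) are assumptions for illustration, not part of the current prototype.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set


def find_orphans(stored: Dict[str, datetime],
                 cataloged_keys: Set[str],
                 now: datetime,
                 grace: timedelta = timedelta(hours=24)) -> List[str]:
    """Return keys that exist in storage but have no catalog record.

    `stored` maps object key -> last-modified time, as an object listing
    would report it. Objects newer than the grace period are skipped.
    """
    cutoff = now - grace
    return sorted(
        key for key, modified in stored.items()
        if key not in cataloged_keys and modified <= cutoff
    )
```

This only reports candidates. Whether they are deleted by ALS, by a backend lifecycle policy, or by an operator is the open question.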

Related Docs / PRs

Prototype with MinIO - https://github.com/bmcdonald3/library-service
Firmware service that would use this instead of the built-in HTTP server - https://github.com/bmcdonald3/firmware-updater
