GitHub - BAL-DMU/datorcloud: DatorCloud - Multimodal Data Management Platform

Lightweight, self-hosted framework for managing, querying, and sharing multimodal research data.


DatorCloud is a Python framework developed at Balgrist University Hospital and the OR-X Translational Center for Surgery. It pairs DuckDB for fast SQL-style analytics with MinIO for S3-compatible object storage, exposing five single-responsibility components, a high-level orchestrator, a CLI, and a ready-to-run Dagster workspace.

Key features

Multimodal data management — organize images, video, sensor streams, and clinical records in a structured, browsable framework.
Unified dataset catalog — explore datasets by project, researcher, or experimental context.
Composable, testable architecture — five components, dependency-injected, with a 30-test suite that runs entirely against in-memory fakes.
Three matched surfaces — Python orchestrator, datorcloud CLI, and Dagster assets all hit the same pipeline.
.env-driven configuration — no credentials are hard-coded; every component validates that real values were supplied.

Installation

From source (recommended during development):

pip install -e ".[dagster,test]"

This installs the datorcloud Python package and the datorcloud CLI in editable mode.

Configuration

Copy the template and fill in your MinIO credentials and storage paths:

cp .env.example .env

Required variables (the components raise a clear ValueError when these are missing):

Variable	Purpose
`S3_ENDPOINT`	MinIO host:port (no scheme).
`S3_ACCESS_KEY`	MinIO access key.
`S3_SECRET_KEY`	MinIO secret key.
`DATA_LAKE_PATH`	Host path used for raw datasets.
`RETRIEVED_DATA_PATH`	Host path for files written by `retrieve`.
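
The fail-fast behaviour described above can be sketched in a few lines. This is only an illustration of the idea, not DatorCloud's actual validation code: the function name `check_required_env` and the exact error message are assumptions, and only the five variable names come from the table.

```python
import os

# Variable names taken from the table above.
REQUIRED_VARS = (
    "S3_ENDPOINT",
    "S3_ACCESS_KEY",
    "S3_SECRET_KEY",
    "DATA_LAKE_PATH",
    "RETRIEVED_DATA_PATH",
)


def check_required_env(environ=None):
    """Return the required settings, or raise ValueError naming what is missing.

    Hypothetical helper: empty or whitespace-only values count as missing.
    """
    env = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
    if missing:
        raise ValueError("Missing required settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `check_required_env()` once at startup, after `.env` has been loaded, surfaces every missing variable in a single error instead of one at a time.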

Usage

Orchestrated (recommended)

DatorCloudOrchestrator.from_env() loads .env and wires every component for you. Pass keyword overrides for anything you want to change.

from datorcloud.core import DatorCloudOrchestrator

orchestrator = DatorCloudOrchestrator.from_env(
    data_bucket="orx-datalake",
    metadata_bucket="orx-metadata",
)

orchestrator.upload_datasets({"my-dataset": "./dataspaces/data_lake/my-dataset"})
orchestrator.generate_and_upload_metadata(
    dataset_dirs={"my-dataset": "./dataspaces/data_lake/my-dataset"},
    output_file="./dataspaces/data_lake/metadata.csv",
    object_name="metadata.csv",
)

results = orchestrator.query_metadata(filters={"camera_id": "camera01"}, limit=10)
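
How `filters` and `limit` reach DuckDB is not spelled out here. One plausible approach, sketched below under stated assumptions, is to turn the dictionary into a parameterized `WHERE` clause. The helper `build_filter_query` and the table name `metadata` are hypothetical and are not part of the DatorCloud API.

```python
import re

# Column names are interpolated into the SQL text, so restrict them to
# plain identifiers; values always travel as bound parameters.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_filter_query(filters, limit=None, table="metadata"):
    """Build (sql, params) for equality filters joined with AND."""
    for column in filters:
        if not _IDENTIFIER.match(column):
            raise ValueError(f"Unsafe column name: {column!r}")
    sql = f"SELECT * FROM {table}"
    if filters:
        sql += " WHERE " + " AND ".join(f"{column} = ?" for column in filters)
    if limit is not None:
        sql += f" LIMIT {int(limit)}"  # int() rejects non-numeric input
    return sql, list(filters.values())


sql, params = build_filter_query({"camera_id": "camera01"}, limit=10)
# sql    -> "SELECT * FROM metadata WHERE camera_id = ? LIMIT 10"
# params -> ["camera01"]
```

The resulting pair can be handed to any DB-API style `execute(sql, params)` call that accepts `?` placeholders, which DuckDB's Python client does.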

Individual components

Each component is also usable on its own. Credentials are required arguments; loading them from .env keeps secrets out of the source tree.

import os
from dotenv import load_dotenv

from datorcloud import (
    MinioObjectComponent,
    MetadataGeneratorComponent,
    MetadataStorageComponent,
)

load_dotenv()

minio = MinioObjectComponent(
    endpoint=os.environ.get("S3_ENDPOINT", "minio:9090"),
    access_key=os.environ["S3_ACCESS_KEY"],
    secret_key=os.environ["S3_SECRET_KEY"],
)

minio.upload_directory(
    local_directory="./dataspaces/data_lake/my-dataset",
    bucket_name="orx-datalake",
    prefix="my-dataset",
)

generator = MetadataGeneratorComponent()
storage = MetadataStorageComponent(minio_component=minio, metadata_bucket="orx-metadata")
storage.create_metadata_and_store(
    metadata_generator_component=generator,
    dataset_dirs={"my-dataset": "./dataspaces/data_lake/my-dataset"},
    local_file_path="./dataspaces/data_lake/metadata.csv",
    object_name="metadata.csv",
)
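
Because `MetadataStorageComponent` receives its MinIO component as an argument, a test can pass an in-memory stand-in instead. The sketch below illustrates that pattern with two hypothetical classes, `InMemoryObjectStore` and `CsvMetadataStore`. Their method names are assumptions chosen for the example and do not mirror the real component interfaces.

```python
import csv
import io


class InMemoryObjectStore:
    """Hypothetical fake object store: keeps uploaded bytes in a dict."""

    def __init__(self):
        self.objects = {}

    def put_bytes(self, bucket_name, object_name, data):
        self.objects[(bucket_name, object_name)] = bytes(data)

    def get_bytes(self, bucket_name, object_name):
        return self.objects[(bucket_name, object_name)]


class CsvMetadataStore:
    """Hypothetical component that is handed its store (dependency injection)."""

    def __init__(self, object_store, metadata_bucket):
        self.object_store = object_store
        self.metadata_bucket = metadata_bucket

    def store_rows(self, rows, object_name):
        # Serialize non-empty rows to CSV in memory, then upload the bytes.
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=sorted(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
        self.object_store.put_bytes(
            self.metadata_bucket, object_name, buffer.getvalue().encode("utf-8")
        )


# No MinIO server needed: the fake records what would have been uploaded.
fake = InMemoryObjectStore()
component = CsvMetadataStore(fake, metadata_bucket="orx-metadata")
component.store_rows([{"camera_id": "camera01", "file": "a.png"}], "metadata.csv")
stored = fake.get_bytes("orx-metadata", "metadata.csv").decode("utf-8")
```

A test can then assert on `stored` directly, which keeps the suite fast and independent of network services.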

Dagster

DatorCloudResource reads credentials and storage paths from .env via Pydantic default_factory, so the resource works out of the box when .env is loaded.

from dagster import AssetSelection, Definitions, define_asset_job
from datorcloud.dagster import DatorCloudResource, component_assets

resource = DatorCloudResource(
    data_bucket="orx-datalake",
    metadata_bucket="orx-metadata",
)

datorcloud_job = define_asset_job(
    name="datorcloud_workflow_job",
    selection=AssetSelection.assets(*component_assets),
)

defs = Definitions(
    assets=component_assets,
    jobs=[datorcloud_job],
    resources={"datorcloud": resource},
)
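
The `default_factory` mechanism can be illustrated with the standard library's `dataclasses`, which supports the same idea as Pydantic's `Field(default_factory=...)`: the default is computed when the object is created, not when the class is defined. `StorageSettings` below is a hypothetical stand-in, not `DatorCloudResource`, and its field names are assumptions.

```python
import os
from dataclasses import dataclass, field


def _env(name):
    """Return a zero-argument factory that reads `name` at instantiation time."""
    return lambda: os.environ.get(name, "")


@dataclass
class StorageSettings:
    # Each default is looked up when StorageSettings() is called, so values
    # loaded from .env before that point are picked up automatically.
    endpoint: str = field(default_factory=_env("S3_ENDPOINT"))
    access_key: str = field(default_factory=_env("S3_ACCESS_KEY"))
    secret_key: str = field(default_factory=_env("S3_SECRET_KEY"))
    data_bucket: str = "orx-datalake"

    def __post_init__(self):
        # Reject empty values rather than failing later on first use.
        required = ("endpoint", "access_key", "secret_key")
        missing = [name for name in required if not getattr(self, name)]
        if missing:
            raise ValueError("Missing settings: " + ", ".join(missing))
```

Explicit keyword arguments still win over the environment, so `StorageSettings(endpoint="other:9000")` overrides `S3_ENDPOINT` for that instance only.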

CLI

datorcloud upload   --dataset 4dor-dataset=./dataspaces/data_lake/4dor-dataset
datorcloud metadata --dataset 4dor-dataset=./dataspaces/data_lake/4dor-dataset
datorcloud query    --filter camera_id=camera01 --limit 10
datorcloud retrieve --dataset 4dor-dataset --filter camera_id=camera01 --max-files 5

The four subcommands map 1:1 to the orchestrator methods and to the four Dagster assets:

upload_datasets — upload datasets to MinIO
generate_metadata — extract per-file metadata and persist it
query_metadata — filter the metadata catalog
retrieve_objects — download every object matched by a query
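
The `name=path` and `key=value` arguments shown in the commands above could be parsed along these lines. This is a minimal `argparse` sketch covering only two of the subcommands. The program name `datorcloud-sketch` and the helper `key_value` are made up for the example, and the real CLI may be implemented differently.

```python
import argparse


def key_value(text):
    """Split 'KEY=VALUE' once; argparse turns the error into a usage message."""
    key, sep, value = text.partition("=")
    if not sep or not key or not value:
        raise argparse.ArgumentTypeError(f"expected KEY=VALUE, got {text!r}")
    return key, value


def build_parser():
    parser = argparse.ArgumentParser(prog="datorcloud-sketch")
    sub = parser.add_subparsers(dest="command", required=True)

    upload = sub.add_parser("upload")
    # Repeatable: each --dataset adds one (name, path) pair.
    upload.add_argument("--dataset", type=key_value, action="append", required=True)

    query = sub.add_parser("query")
    query.add_argument("--filter", type=key_value, action="append", default=[])
    query.add_argument("--limit", type=int, default=None)
    return parser


args = build_parser().parse_args(
    ["query", "--filter", "camera_id=camera01", "--limit", "10"]
)
filters = dict(args.filter)  # {"camera_id": "camera01"}
```

Converting the collected pairs with `dict(...)` yields the same dictionary shape that the orchestrator examples pass as `filters` and `dataset_dirs`.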

See examples/ for runnable end-to-end scripts and docs/04_user_guide/tutorial_4dor.md for a full walkthrough on the bundled multi-camera surgical dataset.

Project layout

datorcloud/                  Python package
  components/                  Five single-responsibility components
  core/                        DatorCloudOrchestrator (+ from_env factory)
  dagster/                     ConfigurableResource + @asset definitions
  cli.py                       `datorcloud` command entry point
tests/                       pytest suite (30 tests, in-memory fakes)
examples/                    Runnable scripts (basic, components, dagster)
docs/                        MkDocs site
build/                       Dockerfiles for the Compose stack
dataspaces/                  Project storage skeleton (data_lake, ...)
docker-compose.yml           MinIO + DuckDB + Dagster + python-runner + cli

Testing

pip install -e ".[test]"
pytest -q

You should see 30 passed. The Dagster materialization test auto-skips when dagster is not installed.
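
How the suite implements the auto-skip is not shown here. A common way to get that behaviour is to check whether the optional package can be imported and skip otherwise. The sketch below uses only the standard library (`importlib` and `unittest`) so that it runs anywhere; in a pytest suite, `pytest.importorskip("dagster")` serves the same purpose. The test class name is hypothetical.

```python
import importlib.util
import unittest

# find_spec returns None when the top-level package is not installed.
HAS_DAGSTER = importlib.util.find_spec("dagster") is not None


@unittest.skipUnless(HAS_DAGSTER, "dagster is not installed")
class DagsterMaterializationTest(unittest.TestCase):
    def test_dagster_is_importable(self):
        # Only reached when the optional dependency is present.
        import dagster

        self.assertEqual(dagster.__name__, "dagster")
```

A skipped test is reported separately from failures, so the run stays green on machines without the optional dependency.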

Authors

Dr. John Anderson Garcia-Henao - main author and creator.
DatorCloud contributors - Digital Medicine Unit / OR-X Translational Center for Surgery, Balgrist University Hospital.

License

BSD 3-Clause — see LICENSE.
