feat(self-packaging #44): embed:// DuckDB filesystem #54

Merged
jrosskopf merged 1 commit into main from feature/gh-44-embed-duckdb-fs
May 22, 2026
@jrosskopf
Contributor

Part of epic #40. Stacked on #53 which is stacked on #52 which is stacked on #51.

Summary

Adds the DuckDB-side reader for the embed scheme. After this PR
lands, the bundle is fully observable from both halves of flapi:
config loading via IFileProvider (#43) AND SQL-side read_csv() /
read_parquet() / read_json() / etc. via DuckDB's
VirtualFileSystem.

  • src/include/duckdb_embed_fs.hpp + src/duckdb_embed_fs.cpp --
    EmbeddedFileSystem subclassing duckdb::FileSystem, plus the
    RegisterEmbeddedFileSystem() helper that wires it onto the
    running DuckDB instance.
  • src/main.cpp -- one new call after initializeDatabase, logged
    at INFO when it actually registers, silent when there's no bundle.

Why two file-system layers (and not one)

IFileProvider covers config loading and SQL-template reading from
flapi C++ code. But SQL templates themselves contain DuckDB calls
like read_csv('data/foo.csv'), and those bypass IFileProvider
entirely -- they go to DuckDB's VirtualFileSystem. Hence two
adapters over one shared std::shared_ptr<const ArchiveEntries>:
one decompressed map, two readers.
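
A minimal sketch of the "one decompressed map, two readers" arrangement, using only the standard library. The class names `ConfigReader` and `SqlSideReader`, their methods, and the `std::map<std::string, std::string>` layout are assumptions made for illustration; they are not the flapi types.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for the shared archive: entry path -> decompressed bytes.
using ArchiveEntries = std::map<std::string, std::string>;
using SharedArchive = std::shared_ptr<const ArchiveEntries>;

// Hypothetical config-side reader (the role IFileProvider plays in this PR).
class ConfigReader {
public:
    explicit ConfigReader(SharedArchive archive) : archive_(std::move(archive)) {
        assert(archive_ != nullptr);
    }
    // Returns the entry's bytes, or nullopt when the path is not in the archive.
    std::optional<std::string> ReadFile(const std::string& path) const {
        const auto it = archive_->find(path);
        if (it == archive_->end()) return std::nullopt;
        return it->second;
    }
private:
    SharedArchive archive_;
};

// Hypothetical SQL-side reader: accepts embed:// URLs and looks up the same map.
class SqlSideReader {
public:
    explicit SqlSideReader(SharedArchive archive) : archive_(std::move(archive)) {
        assert(archive_ != nullptr);
    }
    // Returns the entry's size, or nullopt for a non-embed:// or unknown path.
    std::optional<std::size_t> FileSize(const std::string& url) const {
        const std::string scheme = "embed://";
        if (url.compare(0, scheme.size(), scheme) != 0) return std::nullopt;
        const auto it = archive_->find(url.substr(scheme.size()));
        if (it == archive_->end()) return std::nullopt;
        return it->second.size();
    }
private:
    SharedArchive archive_;
};
```

Both readers hold a `shared_ptr` to the same immutable map, so the archive is decompressed once and stays alive as long as either reader does.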

The override set

The spike's hard-won lesson:

Implementing only OpenFile / Read / Seek / GetFileSize is
not enough to satisfy read_csv(). The integration also requires
Glob() (read_csv expands paths before opening) and
SeekPosition(). The base class throws "not implemented" rather
than no-oping.
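
The failure mode can be modelled without DuckDB. In the sketch below, `BaseFs`, `PartialFs`, `CompleteFs`, and `CanServeRead` are all hypothetical names; the only behaviour borrowed from the text is that un-overridden methods throw instead of silently doing nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical base class: operations that are not overridden throw.
class BaseFs {
public:
    virtual ~BaseFs() = default;
    virtual std::vector<std::string> Glob(const std::string&) {
        throw std::logic_error("Glob: not implemented");
    }
    virtual std::size_t SeekPosition() {
        throw std::logic_error("SeekPosition: not implemented");
    }
};

// Overrides neither method, like a subclass that stops at open/read/seek/size.
class PartialFs : public BaseFs {};

// Overrides both methods with trivial bodies.
class CompleteFs : public BaseFs {
public:
    std::vector<std::string> Glob(const std::string& pattern) override {
        return {pattern};
    }
    std::size_t SeekPosition() override { return 0; }
};

// Mimics a caller that expands the path and queries the cursor before reading.
bool CanServeRead(BaseFs& fs, const std::string& path) {
    assert(!path.empty());
    try {
        const auto paths = fs.Glob(path);
        fs.SeekPosition();
        return !paths.empty();
    } catch (const std::logic_error&) {
        return false;
    }
}
```

With this model, `CanServeRead` fails for `PartialFs` and succeeds for `CompleteFs`, which mirrors why the override set had to grow beyond the four obvious methods.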

So we override:

| Method | Why |
| --- | --- |
| `OpenFile` | returns an `EmbeddedFileHandle` over the entry's bytes |
| `Read(handle, buf, n)` | streaming read, advances position |
| `Read(handle, buf, n, location)` | positional read, does not move the cursor |
| `GetFileSize` | entry's byte count |
| `Seek` / `SeekPosition` | streaming-position management |
| `Glob` | expands `embed://data/*.csv` to the matching entries (sorted) |
| `CanHandleFile` | claims `embed://` paths so `VirtualFileSystem` routes them here |
| `FileExists` | sanity helper |
| `OnDiskFile` | `false` (relevant for DuckDB's random-read optimisations) |
| `CanSeek` | `true` |
| `GetName` | `embed` |
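
The difference between the two `Read` variants and the `Seek` / `SeekPosition` pair can be sketched over a plain in-memory buffer. `MemHandle` and the free functions below are hypothetical stand-ins written against the standard library only; they do not use DuckDB's signatures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical per-open-file state: a non-owning view of the bytes plus a cursor.
struct MemHandle {
    const char* data;
    std::size_t size;
    std::size_t position = 0;
};

// Streaming read: copies up to n bytes from the cursor and advances it.
// Returns 0 once the cursor has reached the end of the entry.
std::size_t ReadStreaming(MemHandle& h, char* buf, std::size_t n) {
    assert(h.position <= h.size);
    const std::size_t count = std::min(n, h.size - h.position);
    std::memcpy(buf, h.data + h.position, count);
    h.position += count;
    return count;
}

// Positional read: copies up to n bytes starting at `location`.
// The cursor is left untouched.
std::size_t ReadAt(const MemHandle& h, char* buf, std::size_t n,
                   std::size_t location) {
    if (location >= h.size) return 0;
    const std::size_t count = std::min(n, h.size - location);
    std::memcpy(buf, h.data + location, count);
    return count;
}

// Seek moves the cursor (clamped to the entry size); SeekPosition reports it.
void Seek(MemHandle& h, std::size_t location) {
    h.position = std::min(location, h.size);
}
std::size_t SeekPosition(const MemHandle& h) { return h.position; }
```

Keeping the cursor in the handle rather than in the filesystem object means several handles can be open over the same entry without interfering with each other.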

Registration in main.cpp

Once after initializeDatabase():

if (RegisterEmbeddedFileSystem()) {
    CROW_LOG_INFO << "Registered embed:// DuckDB filesystem";
}

The helper is a no-op when there is no bundle or the database is
not yet up.

Each failure mode silently returns false; the binary keeps running.

Tests (10 cases / proof-of-life included)

| # | Test | What it asserts |
| --- | --- | --- |
| 1 | `CanHandleFile` | recognises `embed://`, rejects `/`, `s3://`, bare paths |
| 2 | `OpenFile` for known entry | non-null handle, correct size |
| 3 | `OpenFile` for missing entry | throws `IOException` |
| 4 | streaming `Read` | bytes returned, position advances, EOF returns 0 |
| 5 | positional `Read` | bytes returned, position unchanged |
| 6 | `Seek` + `SeekPosition` | round-trip |
| 7 | `Glob('embed://data/*.csv')` | both CSVs, sorted |
| 8 | `Glob` exact-match / missing | single result / empty |
| 9 | proof-of-life: `read_csv` | in-memory DuckDB + register + `SELECT name FROM read_csv('embed://data/people.csv') ORDER BY name` returns Alice/Bob/Carol |
| 10 | proof-of-life: glob `read_csv` | `read_csv('embed://data/*.csv', union_by_name=true)` returns 6 rows |

10/10 Test #77: EmbeddedFileSystem proof-of-life: read_csv with embed:// glob_test  Passed
100% tests passed, 0 tests failed out of 10

The two "proof-of-life" tests are spike behaviour #9 -- the
hardest-won evidence that the design actually works. They now run
against the flapi build of DuckDB, not the spike's standalone
amalgamation, so this is real validation that the integration
transplants cleanly.

Test plan

  • 10 new tests green.
  • Existing 588 Catch2 tests still pass; same 2 pre-existing
    DuckDB AddressSanitizer leaks on main remain unchanged.
  • Smoke: flapi --validate-config -c examples/flapi.yaml runs
    with no "Registered embed://" log line (no bundle in this
    binary -- filesystem mode untouched).
  • CI cross-platform once the stack lands.

Closes #44. Part of #40. Stacked on #53 -> #52 -> #51.

Part of #40, depends on #43 (shared ArchiveContents). Lets SQL
templates say `read_csv('embed://data/cities.csv')` and have them
resolve to the in-memory archive when the binary is bundled.

- `src/include/duckdb_embed_fs.hpp` + `src/duckdb_embed_fs.cpp`:
  - `EmbeddedFileSystem` subclasses `duckdb::FileSystem`
  - `EmbeddedFileHandle` is the per-open-file state (non-owning data
    pointer + streaming position); shares the same
    `std::shared_ptr<const ArchiveEntries>` set by #43 so there's one
    decompressed map for both readers
  - overrides: OpenFile, Read (streaming + positional), GetFileSize,
    Seek, **SeekPosition**, **Glob**, FileExists, CanHandleFile,
    OnDiskFile, CanSeek, GetName
  - `Glob` accepts both exact paths and `*` / `?` patterns; returns
    paths preserved with the `embed://` prefix and sorted
  - `RegisterEmbeddedFileSystem()` helper: looks up
    `FileProviderFactory::GetBundleContents()`, walks via
    `DatabaseManager::getInstance()->getConnection()` to reach the
    `duckdb::Connection` wrapper, fetches the VFS via
    `FileSystem::GetFileSystem(context)`, and calls
    `RegisterSubSystem` on it; no-op when no bundle / DB not yet up.

- `src/main.cpp`: after `initializeDatabase`, call
  `RegisterEmbeddedFileSystem()` once. Logged at INFO when it
  registers; silent when there's no bundle.
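
The `Glob` behaviour described in the list above (exact paths plus `*` / `?` patterns, results sorted and still carrying the `embed://` prefix) can be sketched as follows. `WildcardMatch` and `GlobEmbedded` are hypothetical helpers, and as a simplification `*` here also matches `/`, which may differ from the real implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Wildcard match: '*' matches any run of characters (including none),
// '?' matches exactly one character. Simplification: '*' also crosses '/'.
bool WildcardMatch(const std::string& pattern, const std::string& text) {
    std::size_t p = 0, t = 0;
    std::size_t star = std::string::npos;  // index of the last '*' seen
    std::size_t mark = 0;                  // text index that '*' currently covers up to
    while (t < text.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            star = p++;
            mark = t;
        } else if (p < pattern.size() &&
                   (pattern[p] == '?' || pattern[p] == text[t])) {
            ++p;
            ++t;
        } else if (star != std::string::npos) {
            // Backtrack: let the last '*' absorb one more character.
            p = star + 1;
            t = ++mark;
        } else {
            return false;
        }
    }
    while (p < pattern.size() && pattern[p] == '*') ++p;
    return p == pattern.size();
}

// Matches an embed:// pattern against bare entry paths and returns the
// matching paths with the embed:// prefix restored, sorted.
std::vector<std::string> GlobEmbedded(const std::vector<std::string>& entry_paths,
                                      const std::string& pattern) {
    const std::string scheme = "embed://";
    assert(pattern.compare(0, scheme.size(), scheme) == 0);
    const std::string bare = pattern.substr(scheme.size());
    std::vector<std::string> matches;
    for (const auto& path : entry_paths) {
        if (WildcardMatch(bare, path)) matches.push_back(scheme + path);
    }
    std::sort(matches.begin(), matches.end());
    return matches;
}
```

A pattern without wildcards falls out of the same code path: it matches only the identical entry, so an exact path yields one result and a missing path yields an empty list.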

The Glob + SeekPosition overrides are the spike-found requirement --
the base `FileSystem` throws "not implemented" rather than no-oping
for both, and `read_csv()` exercises both before returning a row.

Tests (`test/cpp/duckdb_embed_fs_test.cpp`, 10 cases):
- CanHandleFile recognises embed:// paths
- OpenFile returns handle / throws on missing
- Read (streaming) advances position; second read returns 0 at EOF
- Read (positional) does not move the cursor
- Seek + SeekPosition agree
- Glob expands `*.csv` to the right entries (sorted)
- Glob exact-match for non-wildcard / empty for missing
- **proof-of-life:** spin up an in-memory DuckDB, register the
  embed FS, run `SELECT name FROM read_csv('embed://data/people.csv')
  ORDER BY name`, assert Alice / Bob / Carol
- **proof-of-life with glob:** same with
  `read_csv('embed://data/*.csv', union_by_name=true)` returning 6 rows

This is the spike's behaviour #9 (the most important test of the
whole epic), and it now runs natively against the flapi build of
DuckDB rather than the spike's standalone amalgamation.

Verified:
- 10/10 new tests pass.
- Existing 588 tests still pass; same 2 pre-existing DuckDB ASan
  leaks on main (`QueryExecutor type coverage`, `DuckDBResult RAII`)
  remain unchanged.
- Filesystem-mode smoke test (`flapi --validate-config -c
  examples/flapi.yaml`) loads cleanly with no embed FS messages.

Closes #44.
@jrosskopf jrosskopf force-pushed the feature/gh-44-embed-duckdb-fs branch from 6b4007e to 1f0bb8f Compare May 22, 2026 13:38
@jrosskopf jrosskopf marked this pull request as ready for review May 22, 2026 14:25
@jrosskopf jrosskopf merged commit 969df3d into main May 22, 2026
17 checks passed
@jrosskopf jrosskopf deleted the feature/gh-44-embed-duckdb-fs branch May 22, 2026 14:25

Development

Successfully merging this pull request may close these issues.

Self-packaging #4: embed:// DuckDB FileSystem (with Glob + SeekPosition)

1 participant