
ci: speed up secret CI from 65 min to ~20 min #292

Merged
luarss merged 10 commits into The-OpenROAD-Project:master from luarss:fast-secret-ci
Jun 6, 2026

@luarss (Collaborator) commented Jun 6, 2026

Summary

  • Parallelise jobs: split the monolithic build-backend-docker job into lint-backend, lint-frontend, lint-evaluation (free ubuntu-latest runners, run in parallel), test (self-hosted), and docker-eval (self-hosted, runs after all others pass) — mirrors the structure already used in ci.yaml
  • Parallel tests: make test now runs pytest -n auto via the already-installed pytest-xdist; cuts the 349-test suite from ~10 min to ~2–3 min on a multicore runner
  • Skip HF dataset clone in Docker build: added SKIP_HF_DOWNLOAD build arg to backend/Dockerfile; CI passes true and bind-mounts a runner-cached ./data dir instead — saves 5–15 min per build; production behaviour unchanged (arg defaults to false)
  • Reduce healthcheck window: new docker-compose.ci.yml override sets HEALTHCHECK_START_PERIOD=300s (down from 1200s) with tighter interval=10s/retries=30
  • Remove dev deps from production image: uv sync --dev → uv sync in Dockerfile; test/lint tools are no longer baked into the image
  • Remove stale HF download from test step: source_list.json was committed in #288 (ci: parallelise jobs, remove HF download, commit source_list fixture); the redundant huggingface-cli download in the unit-test step is dropped
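
As a rough check on the healthcheck numbers in the summary, the sketch below estimates how long a container that never becomes healthy can run before Docker marks it unhealthy. It assumes Docker's documented healthcheck behaviour (probe failures during the start period do not count toward the retry limit) and ignores the per-probe timeout. The function name is illustrative and does not come from this PR.

```python
def worst_case_unhealthy_after(start_period_s: int, interval_s: int, retries: int) -> int:
    """Approximate seconds before a never-healthy container is marked unhealthy.

    Failures inside the start period are not counted; after it, `retries`
    consecutive failures spaced `interval_s` apart are needed.
    """
    return start_period_s + interval_s * retries


# Values from the CI override described above: 300s start period,
# 10s interval, 30 retries.
ci_budget_s = worst_case_unhealthy_after(start_period_s=300, interval_s=10, retries=30)
print(f"CI override gives up after roughly {ci_budget_s} s")
```

Under these assumptions the CI override bounds a failed start at about ten minutes, where the old 1200s start period alone was twenty.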

@luarss luarss force-pushed the fast-secret-ci branch 2 times, most recently from 051caa8 to e4f6797 June 6, 2026 04:19
luarss and others added 9 commits June 6, 2026 04:20
…ROAD-Project#286)

fix: patch Python CVEs in all pyproject.toml and uv.lock files

Fixes the following CVEs by bumping minimum version constraints:
- aiohttp: 3.13.4 -> 3.14.0 (CVE-2026-34993, CVE-2026-47265)
- pyarrow: 19/20/21.x -> 23.0.1+ (PYSEC-2026-113)
- pygments: 2.19.2 -> 2.20.0 (CVE-2026-4539)
- PyJWT: 2.12.1 -> 2.13.0 (PYSEC-2026-175/177/178/179)
- starlette: 0.46/0.50.x -> 1.2.1 (PYSEC-2026-161, CVE-2025-54121, CVE-2025-62727)
- fastapi: 0.115.14 -> 0.136.3 (frontend, to pull in starlette 1.x)

aiohttp and pyarrow are added as explicit constraints to force
transitive dependency updates. torch 2.9.0 (PYSEC-2026-139) has no
upstream fix available yet.

https://claude.ai/code/session_01GeF1r9TL34WDNuRp3ny8iR

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Jack Luar <jluar@precisioninno.com>
…ROAD-Project#287)

Add tenacity retry with exponential backoff to FAISSVectorDatabase._add_to_db
and process_json to handle transient rate-limit errors from the Google
Generative AI / Vertex AI embedding API during backend startup.
Promotes tenacity to an explicit direct dependency.

Signed-off-by: Jack Luar <jluar@precisioninno.com>
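
The retry behaviour this commit describes can be sketched with the standard library alone. This is a plain-Python stand-in for the idea, not the tenacity-based code the commit adds. `RateLimitError`, `retry_with_backoff`, and the default attempt count are hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for a transient rate-limit error from the embedding API."""


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitError with exponentially growing waits."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Waits grow as 1s, 2s, 4s, ... and are capped at max_delay.
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
    raise AssertionError("unreachable")


# Usage: a fake embedding call that is rate-limited twice, then succeeds.
calls = {"n": 0}


def flaky_embed() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("429")
    return "embedded"


waits: list = []
result = retry_with_backoff(flaky_embed, sleep=waits.append)  # record waits instead of sleeping
```

Injecting `sleep` keeps the example fast to run and makes the backoff schedule easy to inspect.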
…he-OpenROAD-Project#288)

* ci: parallelise jobs, remove HF download, commit source_list fixture

- Split monolithic build-backend-docker into lint-backend, lint-frontend,
  lint-evaluation (parallel), test (needs lint-backend only), and
  docker-build (needs test + frontend/evaluation lint)
- Scope init-dev to each job's module to avoid redundant uv sync calls
- Remove huggingface-cli download; commit backend/data/source_list.json
  as a static test fixture instead
- Add --parallel to docker compose build
- Add check-ci root Makefile target (ruff+mypy per module, no pre-commit)
- Unignore backend/data/source_list.json in .gitignore

Signed-off-by: Jack Luar <jluar@precisioninno.com>
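
To make the dependency structure above concrete, the following sketch models the jobs' needs as a graph and groups them into waves that could run concurrently. It is an illustration of the ordering only, not the workflow file itself; `execution_waves` is a hypothetical helper. A real workflow starts each job as soon as its own needs finish, so the waves are a simplification.

```python
from graphlib import TopologicalSorter

# Job -> the jobs it needs, as listed in the commit message.
needs = {
    "lint-backend": set(),
    "lint-frontend": set(),
    "lint-evaluation": set(),
    "test": {"lint-backend"},
    "docker-build": {"test", "lint-frontend", "lint-evaluation"},
}


def execution_waves(graph: dict) -> list:
    """Group jobs into waves; jobs in the same wave have no mutual dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose needs are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(needs)
for number, jobs in enumerate(waves, start=1):
    print(number, jobs)
```

The three lint jobs land in the first wave, which is where the parallel speed-up comes from; `test` and `docker-build` follow in sequence.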

* ci: run lint jobs on ubuntu-latest to reduce self-hosted contention

Lint jobs only need Python + uv — no Docker or local services. Moving
them to ubuntu-latest frees self-hosted runners for test and docker-build.

Signed-off-by: Jack Luar <jluar@precisioninno.com>

---------

Signed-off-by: Jack Luar <jluar@precisioninno.com>
- Split monolithic build-backend-docker job into lint-backend,
  lint-frontend, lint-evaluation (ubuntu-latest), test, and
  docker-eval jobs; lint jobs run on free runners in parallel
- Remove redundant HF source_list.json download from test step
- Add docker-compose.ci.yml override: skips HF clone via
  SKIP_HF_DOWNLOAD build arg, bind-mounts pre-downloaded ./data,
  reduces healthcheck start_period from 1200s to 300s
- Add docker-up-ci / docker-down-ci Makefile targets using the
  CI compose override
- Use pytest -n auto in make test to parallelise 349 tests via
  already-installed pytest-xdist
- Add SKIP_HF_DOWNLOAD ARG to backend/Dockerfile so production
  builds still clone the dataset; CI skips it and mounts instead
- Change uv sync --dev to uv sync in Dockerfile to omit dev
  tools from the production image

Signed-off-by: Jack Luar <jluar@precisioninno.com>

Signed-off-by: Jack Luar <jluar@precisioninno.com>
pip is not on PATH on the self-hosted runner; the docker-eval job
had no uv setup step. Add Install uv before the checkout and switch
to uv tool install huggingface-hub so huggingface-cli is available.

Signed-off-by: Jack Luar <jluar@precisioninno.com>

Signed-off-by: Jack Luar <jluar@precisioninno.com>
…enROAD-Project#289)

Split document embedding into 100-chunk batches with a 1s delay
between batches so a 429 only retries one batch (~1 API call) rather
than restarting FAISS.from_documents from scratch (~87 calls). Also
raise retry wait times from max 120s to max 600s to give the quota
time to reset before the next attempt.

Signed-off-by: Jack Luar <jluar@precisioninno.com>
Signed-off-by: Jack Luar <jluar@precisioninno.com>
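
The batching idea in this commit can be sketched as follows: embed in fixed-size batches, pause between batches, and retry only the batch that hit a rate limit. This is a minimal sketch under stated assumptions, not the project's code. `embed_in_batches`, `RateLimitError`, the retry count, and the fixed retry wait are hypothetical; only the batch size of 100 and the 1 s delay come from the commit message.

```python
import time
from typing import Callable, Iterator, List, Sequence


class RateLimitError(Exception):
    """Hypothetical stand-in for a 429 response from the embedding API."""


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    chunks: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[int]],
    batch_size: int = 100,   # from the commit message
    delay_s: float = 1.0,    # from the commit message
    max_attempts: int = 3,   # assumption for illustration
    retry_wait_s: float = 5.0,  # assumption; the commit uses longer backoff waits
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Embed chunks batch by batch; a rate limit re-runs only the failing batch."""
    vectors: List[int] = []
    for index, batch in enumerate(batched(chunks, batch_size)):
        if index > 0:
            sleep(delay_s)  # pause between batches to stay under the quota
        for attempt in range(1, max_attempts + 1):
            try:
                vectors.extend(embed_batch(batch))
                break
            except RateLimitError:
                if attempt == max_attempts:
                    raise
                sleep(retry_wait_s)
    return vectors


# Usage: 250 chunks, with the second API call rate-limited once.
call_sizes: List[int] = []


def fake_embed(batch: Sequence[str]) -> List[int]:
    call_sizes.append(len(batch))
    if len(call_sizes) == 2:
        raise RateLimitError("429")
    return [len(text) for text in batch]  # fake "vector": the text length


chunks = [f"chunk-{i}" for i in range(250)]
vectors = embed_in_batches(chunks, fake_embed, sleep=lambda seconds: None)
```

In the usage example the rate limit costs one extra call for the second batch; the first batch is never re-embedded.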
* fix: defer graph init to background, prevent health-check timeout

Move RetrieverGraph construction out of module-level import in
conversations.py and into a background thread spawned during the
FastAPI lifespan. This lets the server start instantly so the
Docker health-check passes within the reduced 30 s start_period
(instead of timing out after 22+ min waiting for FAISS embedding).

- Replace module-level rg = RetrieverGraph(...) with lazy singleton
  (get_graph / start_graph_init / reset_graph_state_for_testing)
- Add /conversations/ready readiness probe returning 'ready' or
  'initializing'
- Conversation endpoints return 503 / stream error when graph is
  not yet initialized
- Add readiness poll loop (30 min, 10 s intervals) before Run LLM CI
  step in ci-secret.yaml
- Reduce Docker healthcheck start_period default from 1200 s to 30 s
- Update streaming tests to use new public reset_graph_state_for_testing()

Signed-off-by: Jack Luar <jluar@precisioninno.com>
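
The lazy-singleton pattern described above might look roughly like this. The function names `get_graph`, `start_graph_init`, and `reset_graph_state_for_testing` come from the commit message; everything else (the `GraphNotReady` exception, `readiness`, the placeholder builder, and the locking details) is an assumption made for illustration, and FastAPI is left out so the sketch runs with the standard library only.

```python
import threading
from typing import Callable, Optional


class GraphNotReady(RuntimeError):
    """Hypothetical error an endpoint could translate into an HTTP 503."""


_graph: Optional[object] = None
_init_thread: Optional[threading.Thread] = None
_ready = threading.Event()
_lock = threading.Lock()


def _build_graph() -> object:
    """Placeholder for the slow RetrieverGraph(...) construction."""
    return {"name": "retriever-graph"}


def start_graph_init(builder: Callable[[], object] = _build_graph) -> threading.Thread:
    """Start building the graph in a background thread (at most once)."""
    global _init_thread

    def _worker() -> None:
        global _graph
        graph = builder()  # slow work happens off the startup path
        with _lock:
            _graph = graph
        _ready.set()

    with _lock:
        if _init_thread is None:
            _init_thread = threading.Thread(target=_worker, daemon=True)
            _init_thread.start()
        return _init_thread


def get_graph() -> object:
    """Return the graph, or raise if initialization has not finished."""
    if not _ready.is_set():
        raise GraphNotReady("initializing")
    return _graph


def readiness() -> str:
    """Value a readiness probe could return."""
    return "ready" if _ready.is_set() else "initializing"


def reset_graph_state_for_testing() -> None:
    """Drop the singleton so each test starts from a clean state."""
    global _graph, _init_thread
    with _lock:
        _graph = None
        _init_thread = None
    _ready.clear()


# Usage: the server can answer health checks before the graph exists.
status_before = readiness()
start_graph_init().join()  # a real server would not block here
status_after = readiness()
```

Raising from `get_graph` while the event is unset is what lets conversation endpoints fail fast instead of blocking on the embedding step.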

---------

Signed-off-by: Jack Luar <jluar@precisioninno.com>
Signed-off-by: Jack Luar <jluar@precisioninno.com>
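
The readiness poll loop mentioned in the commit (30 min budget, 10 s intervals) can be sketched in Python as below. The workflow's actual step is not shown in this PR text, so this is an illustrative stand-in. `wait_until_ready` and the injected `probe` are hypothetical; in practice the probe would query the /conversations/ready endpoint.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], str],
    timeout_s: float = 30 * 60,  # 30 min, from the commit message
    interval_s: float = 10,      # 10 s, from the commit message
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll probe until it returns 'ready' or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe() == "ready":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


class FakeClock:
    """Deterministic clock so the example runs without real waiting."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


# Usage: the service reports 'initializing' twice, then 'ready'.
fake = FakeClock()
statuses = iter(["initializing", "initializing", "ready"])
became_ready = wait_until_ready(lambda: next(statuses), sleep=fake.sleep, clock=fake)
elapsed_s = fake.now
```

Returning a boolean rather than raising leaves it to the caller to decide whether a timeout should fail the CI step.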
…he-OpenROAD-Project#291)

- Split monolithic build-backend-docker job into lint-backend,
  lint-frontend, lint-evaluation (ubuntu-latest), test, and
  docker-eval jobs; lint jobs run on free runners in parallel
- Remove redundant HF source_list.json download from test step
- Add docker-compose.ci.yml override: skips HF clone via
  SKIP_HF_DOWNLOAD build arg, bind-mounts pre-downloaded ./data,
  reduces healthcheck start_period from 1200s to 300s
- Add docker-up-ci / docker-down-ci Makefile targets using the
  CI compose override
- Use pytest -n auto in make test to parallelise 349 tests via
  already-installed pytest-xdist
- Add SKIP_HF_DOWNLOAD ARG to backend/Dockerfile so production
  builds still clone the dataset; CI skips it and mounts instead
- Change uv sync --dev to uv sync in Dockerfile to omit dev
  tools from the production image

Signed-off-by: Jack Luar <jluar@precisioninno.com>
Signed-off-by: Song Luar <jluar@precisioninno.com>
@luarss luarss merged commit 369af13 into The-OpenROAD-Project:master Jun 6, 2026
1 check passed
@luarss luarss deleted the fast-secret-ci branch June 6, 2026 04:25