Improve EverCore Python quality gates by hylarucoder · Pull Request #233 · EverMind-AI/EverOS

hylarucoder · 2026-05-26T11:23:54Z

Summary

Added EverCore developer quality gates for Ruff, ty, and Pyright through Makefile targets and pyproject.toml configuration.
Cleaned up the initial Ruff/formatter baseline, including import cleanup, formatting, safer type checks, duplicate prompt registration, DTO re-export compatibility, and a real possibly-unresolved reference found by ty.
Updated the test command to run through uv, added the missing psutil test dependency, and refreshed uv.lock for the new dev tooling.

Area

Verification

cd methods/EverCore
make ruff
make typecheck
make typecheck-pyright
make test

make ruff passed.
make typecheck passed.
make typecheck-pyright completed with 0 errors and 8 existing warnings.
make test now runs under the uv-managed Python environment, but the full local suite is not green in this checkout: 1574 passed, 6 skipped, 164 failed, and 160 errored. The failures are concentrated in integration/repository/E2E suites that require local Mongo/Redis/Elasticsearch/Milvus/provider configuration; for example, the delete API integration suite used none:27017 and MONGO_DB=None.

Checklist

I kept the change scoped to the relevant area.
I updated docs, examples, or setup notes when behavior changed.
I added or updated tests when the change affects behavior.
I did not commit secrets, .env files, dependency folders, or generated output.
Active relative links in Markdown files resolve.

Notes for Reviewers

The main reviewer focus should be the initial rule baseline. Ruff is now passing, while ty is intentionally limited to possibly-unresolved-reference so this first rollout does not fail on the current broader type-checking baseline.

By submitting this pull request, I agree that my contribution is licensed under
the Apache License 2.0.

- ruff with FastAPI-style rule set (E/W/F/B/C4/ASYNC); `make ruff` and `make ruff-fix` targets. Keeps black/isort during transition. - ty as the default typechecker (~10x faster than pyright); pyright retained as `make typecheck-pyright`. [tool.ty.rules] enables possibly-unresolved-reference, which catches real bugs surfaced in the audit (milvus_start, user_id_list). - pytest-asyncio: pin asyncio_mode = "strict" explicitly so the current default doesn't shift under us on upstream release. - load_env.py: use os.pathsep instead of hardcoded ":" in PYTHONPATH manipulation so the env loader works on Windows.

- memory_manager.get_vector_search_results: initialize milvus_start before the try so the except branch can always compute the stage duration. Previously a failure before line 729 would raise UnboundLocalError, masking the real exception. - mem_memorize._trigger_profile_extraction: initialize user_id_list before the try so the recovery branch can iterate participants safely on early failure. - mem_memorize: `request == None` -> `request is None` (E711). - mem_memorize / memory_manager / mem_db_operations: remove duplicate imports of traceback, dataclass, RawDataType, MemorizeConfig, DEFAULT_MEMORIZE_CONFIG, Any, MemoryType (F811). - retrieval_utils: bare `except:` -> `except Exception:` to stop swallowing KeyboardInterrupt / SystemExit (E722). - i18n_tool: declare OpenAIProvider under TYPE_CHECKING so the forward-reference annotations resolve without forcing memory_layer to be importable at module load.

Two classes named SearchMemoriesResponse were defined in api_specs/dtos/memory.py: a GET-endpoint variant at line 774 (BaseApiResponse[RetrieveMemResponse]) and a POST-endpoint variant at line 1697 (BaseApiResponse[SearchMemoriesResponseData]). The second definition silently shadowed the first, making the GET variant dead code. The POST variant is the one wired to memory_search_controller (it instantiates `SearchMemoriesResponse(data=response_data)` where response_data is SearchMemoriesResponseData), so keep that one and remove the shadowed GET variant + its example schema.

Same "assigned inside try, used in except / after-try" pattern as the earlier milvus_start / user_id_list fixes: - mem_memorize._trigger_profile_extraction: initialize profile_repo to None before the try, and skip the last_updated_ts advance when the DI lookup never ran (failure during lazy imports). - memory_layer.memory_manager: initialize display_name to None before the if/else so the else-branch loop falling through (no matching sender_id) doesn't leave it unbound at the downstream call site. - milvus_collection_base: pull `new_coll = Collection(...)` outside the index-creation try so an index-warning path can't shadow the collection-creation success with a NameError on `return new_coll`. - hmac_signature_middleware: bind timestamp_header / nonce_header / signature_header to None before the try, so the except branch's error logging works even if request.headers itself raises. - redis_length_cache_manager, redis_windows_cache_manager: bind `message = None` at the top of each loop iteration so the except branch can format its warning without a NameError when indexing raises before the assignment. Incidentally includes ruff F541 cleanup (f-strings without placeholders -> plain strings) in mem_memorize.py from the same auto-fix pass.

Mechanical cleanup from `ruff check --fix --select W291,W293,F541, C420,B010,B009`: - W293 blank-line-with-whitespace: removed trailing whitespace on otherwise-empty lines (~126 occurrences). - W291 trailing-whitespace: trimmed line-end whitespace. - F541 f-string-missing-placeholders: rewrote `f"..."` literals with no interpolation as plain strings (~22 occurrences). - C420 unnecessary-dict-comprehension-for-iterable, B010 set-attr- with-constant, B009 get-attr-with-constant: simplified a handful of unnecessary dict comprehensions and setattr/getattr-with- constant-attribute calls into direct attribute access. No behavior changes. The remaining 144 W293 / 9 W291 occurrences are inside docstrings or string literals and require --unsafe-fixes, which is intentionally not applied.

Mechanical cleanup via `ruff check --select F401 --fix --unsafe-fixes`. 138 of 139 unused imports removed across 67 files. Audited removals before committing: - DI decorators (`from core.di.decorators import service / component / repository`) were imported in a few adapter modules but never used as `@service` etc. `core.di.decorators` is a pure decorator factory module with no module-load side effects, so removing the dangling imports is safe. - stdlib (`os`, `time`, `json`, `random`, `uuid`, `numpy`, `jieba`, ...) and typing helpers (`Optional`, `List`, `Tuple`, ...) form the bulk of removals. - `__init__.py` files are intentionally exempt (per the `[tool.ruff.lint.per-file-ignores]` rule allowing re-exports). The one remaining F401 (`pyinstrument` in profile_middleware.py) is an availability-probe import — `try: import pyinstrument` — and is suppressed with `# noqa: F401` since the import is the side effect. Net impact: ruff baseline 387 -> 244 errors. ty count unchanged at 891 (F401 doesn't affect type inference). All src/*.py files compile successfully.

Fixes 5 ruff ASYNC violations. The first is a real bug; the others are dev-script cleanup for consistency now that aiofiles is already a project dependency. - core/oxm/es/migration/utils.py:177,182: `wait_for_task_completion` polls Elasticsearch task status inside an async function but used `time.sleep(5)` and `time.sleep(10)` on error, freezing the entire event loop on each iteration. ES rebuilds can run for minutes-to-hours, so this would stall every other coroutine in the process (FastAPI lifespan, background tasks, etc.). Switch to `await asyncio.sleep(...)`. - devops_scripts/i18n/i18n_tool.py:663,701 and devops_scripts/sensitive_info/sensitive_info_tool.py:477: Replace synchronous `with open(...)` / `f.read()` / `f.write()` with `async with aiofiles.open(...)` / `await f.read()` / `await f.write()`. These are dev-tool scripts where blocking IO impact is negligible, but the conversion is mechanical and removes the noise from future lint reports.

Mechanical cleanup of 190 ruff violations across 39 files. Manually audited the auto-fixes; the noteworthy decisions: - W293/W291 inside src/memory_layer/prompts/* was REVERTED. Those triple-quoted strings are LLM prompt templates and the trailing whitespace is part of the literal sent to the model. Three files restored to HEAD: prompts/__init__.py, en/atomic_fact_prompts.py, zh/conv_prompts.py. - F841 "unused variable" auto-fix mostly deleted assignments where the RHS is pure (literals, len(), sorted(), str()). For cases where ruff conservatively kept the RHS as an orphan expression (`get_tenant_aware_collection_name(origin_alias_name)` etc.), manually verified those functions are pure and deleted the expression lines too: - test_di_scanner.py:284 (`len(container._named_beans)`) - tenant_aware_collection_with_suffix.py:350 - user_profile_milvus_converter.py:52 The `job = await pool.enqueue_job(...)` case correctly kept the awaited call -- only the unused name binding was removed. - E731 lambda-to-def conversion in milvus_rebuild_collection.py produced wrong inner indentation; fixed to PEP 8 4-space. - C408 dict(...) -> {...}, B905 zip(..., strict=False), E401 multiple-imports-on-one-line, and the remaining W291/W293 outside prompts were applied verbatim. Result: ruff 239 -> 56 errors. All src/ files compile. ty 891 -> 839 (F841 deletions remove type tracking on those vars).

The previous --unsafe-fixes commit (dabc3c4) auto-removed assignments that ruff classified as unused locals. Several of those were not safe to drop on a purely-syntactic read: - agentic_layer/memory_manager.py: three `memory_type = (...)` bindings in `retrieve_mem_keyword / _vector / _hybrid`. These were paired with `start_time = time.perf_counter()` in the original metrics-labelling commit (b3958cb). The paired start_time was removed during a later metrics refactor that pushed recording into the inner `get_*_search_results` helpers; the memory_type binding got orphaned but the value is still a useful debugging breadcrumb at the wrapper layer. - core/asynctasks/task_manager.py: `job = await pool.enqueue_job(...)` — keeping the name binding aids debugging and any future follow-up that needs the job handle. - memory_manager.py `extend_data = fields['extend_data']`, agent_skill_milvus_converter `content = source_doc.content or ""`, episode_memory_extractor `default_title = "Conversation Episode"`, test_di_scanner `initial_bean_count = len(...)`, tenant_aware_collection_with_suffix `tenant_aware_alias_name = ...`, user_profile_milvus_converter `doc_id = str(source_doc.id) ...`: same story — likely vestiges of in-flight work or intentional reads kept for clarity. Restoring them is cheap; auto-removing was the wrong default. Add F841 to the ruff `ignore` list in pyproject.toml so future `--unsafe-fixes` runs won't reintroduce the same destructive cleanup. F841 can still be surfaced explicitly via `ruff check --select F841` when a human wants to triage them; we just don't auto-fix.

Claude Code loads CLAUDE.local.md alongside CLAUDE.md and lets it take precedence (closer-to-cwd loads last). Use it for personal workflow rules that shouldn't bind other contributors, while CLAUDE.md / AGENTS.md keep the shared team conventions.

Two new dev_docs articles drafted from a project-wide code review: - exception_handling_analysis.md: full audit of 781 try blocks and 17 hand-written retry loops. Catalogues 7 anti-patterns and 6 retry types with code references, performance/correctness analysis, and a sprint-organised action plan. - code_quality_roadmap.md: 8-week three-phase execution plan (observability -> tests -> try-catch cleanup), with per-phase acceptance criteria and rollback paths.

K8s-style liveness vs readiness separation, per the observability roadmap (Phase 1, T1.1). - /livez: process heartbeat, always 200. Suitable for liveness probes; a failure means the process should be restarted. - /readyz: runs four downstream probes (Redis, MongoDB, Elasticsearch, Milvus) in parallel under a 2 s timeout. Returns 200 only when all succeed; 503 with a per-dependency breakdown otherwise. Suitable for readiness probes and load-balancer drain checks. - /health: legacy alias for /livez, preserved for backward compatibility. Each probe also publishes the evercore_dependency_healthy{name="..."} Prometheus gauge so alerts can fire without polling /readyz externally. Probe definitions live in a sibling probes.py module; adding a new downstream is one async function plus an entry in default_probes(). prometheus_middleware skips /livez and /readyz so probe traffic does not dominate request-rate dashboards. Smoke-tested locally: happy path returns 200 with all four deps healthy; pointing REDIS_PORT at a dead port correctly drives /readyz to 503 with redis=False detail, /livez stays 200, and the gauge for redis drops to 0 while others remain 1.

Per Phase 1, T1.3 of the observability roadmap. Before this change, four independent classifiers (memorize/retrieve/rerank/vectorize) each did string-matching on str(error).lower() and fell through to the literal "unknown" (or "error") bucket. Cross-service alerts could not group failures the same way, and any unfamiliar exception class was invisible behind the catch-all label. The new core/observation/error_classification.py is the single source of truth: it still emits the existing semantic labels (timeout, rate_limit, validation_error, connection_error, not_found) when those match, and otherwise returns the exception class name in snake_case. The four call sites now delegate to it. Effect on metrics: - error_type='unknown' / error_type='error' goes away entirely. - A previously-anonymous KeyError or pydantic.ValidationError now shows up as error_type='key_error' / 'validation_error' and can be alerted on. - Vectorize-specific failures previously bucketed as 'api_error' now surface as 'vectorize_error', which is more actionable. Unit-tested with 9 representative exception shapes; ruff + compile clean. No call-site behaviour change beyond the label string.

Per Phase 1, T1.2 of the observability roadmap. The logger module now selects its output format from the LOG_FORMAT env var: - LOG_FORMAT=text (default) keeps the existing human-readable format used by local development and tail-based debugging. - LOG_FORMAT=json switches to one JSON object per line via the new JsonFormatter. Each record carries ts (ISO 8601 ms + Z), level, logger, msg, file, request_id, and any extra= mapping the caller attached. Tracebacks live in a structured exc_info field instead of being concatenated into msg. ContextFilter (formerly RequestIdFilter) now also forwards tenant_id, user_id, group_id and session_id from the app_info contextvar onto the LogRecord, so JSON output captures them automatically when they are populated by middleware. The change is fully backward compatible: all 247 existing call sites keep working unchanged in both formats. Structured fields can be adopted incrementally via logger.info(msg, extra={...}) without touching the rest. Implemented with zero new dependencies (stdlib logging + json). A future migration to structlog can layer on top without churning callers again. Smoke-tested locally: with LOG_FORMAT=json, /readyz emits per-probe JSON lines that share the same request_id across the Redis / MongoDB / Elasticsearch / Milvus checks, which is exactly the cross-call correlation the roadmap calls out as Phase 1's headline win.

Closes Phase 1 T1.5 of the code-quality roadmap. - docs/dev_docs/slo_definitions.md: four load-bearing SLOs (memorize success rate, retrieve p95 latency, LLM error budget, dependency availability) wired to existing metric names. Includes PromQL, severity guidance, and an error-budget policy. - docs/grafana/evercore_slo_overview.json: importable dashboard with one stat per SLO plus 5-min burn-rate timeseries panels.

Closes Phase 1 T1.4 of the code-quality roadmap. - core/observation/tracing/otel.py: soft-imports the OTel SDK; init_tracing is a no-op unless OTEL_EXPORTER_OTLP_ENDPOINT is set AND the deps are installed. Installs FastAPI/httpx/redis/pymongo instrumentations when available. Default sampler is ParentBased(TraceIdRatioBased(0.1)). - trace_logger decorator now opens a real span when tracing is active; unchanged behavior when inactive (zero perf cost). - base_app wires init_tracing(app) at the end of create_base_app. - pyproject adds an optional `otel` dependency group. Activate with `uv sync --group otel`; default sync does NOT pull these in.

…heck Closes Phase 2 T2.1, T2.2, T2.3, T2.5 of the code-quality roadmap. - evercore-smoke.yml: runs unit-marker pytest with coverage XML upload and a non-blocking ty typecheck step. Both `continue-on-error: true` until tests are explicitly tagged per T2.4. - Makefile: adds test-unit / test-integration / test-e2e / test-cov targets. - tests/conftest.py: auto-tags tests by file path (tests/integration/** → integration; *_e2e.py → e2e; rest → unit). Explicit markers still win. - pytest.ini: fixes section header from [tool:pytest] (legacy setup.cfg form) to [pytest], so pytest actually reads markers and other settings. Registers e2e marker. - pyproject.toml: removes duplicate [tool.pytest.ini_options] section; pytest >= 8 silently ignored it whenever pytest.ini existed.

Closes Phase 3 W5 of the code-quality roadmap. Phase 3 W6 (retry refactor) and W7 (catch-all overhaul + custom exception hierarchy) are intentionally not in this commit — both require 24-hour staging soak per roadmap §4.2. Substantive changes - Replace 20 `traceback.print_exc()` call sites with `logger.exception(...)`; remove now-unused `import traceback`. Affected: bootstrap, biz_layer/mem_memorize, mem_db_operations, core/di/scanner, core/component/mongodb_client_factory, core/request/timeout_background, core/oxm/es/migration/utils, and 7 devops_scripts/data_fix scripts. - Consolidate duplicate `RetryConfig` classes. The canonical implementation lives in core/longjob/interfaces (richer API: jitter, backoff_multiplier, retry_on_fatal). The dataclass in core/asynctasks/task_manager is removed and the module re-imports from interfaces. No external consumers, safe merge. Ruff baseline - Enable G (flake8-logging-format) and BLE (flake8-blind-except). - 930 `# noqa: G004` / `# noqa: BLE001` markers applied to existing violations via `ruff check --add-noqa`. New code is held to the rule; baseline shrinks naturally as files are touched. Roadmap status - docs/dev_docs/code_quality_roadmap.md §0 updated to reflect Phase 1 done, Phase 2 partial, Phase 3 W5 done; explicit hand-off note for the W6/W7 boundary.

Adds opt-in performance benchmarks for the two hot paths called out in roadmap §3.2: hybrid retrieval and end-to-end memorize. Per roadmap guidance, benchmarks are NOT wired into CI yet — the first ~3 months are baseline collection only, with thresholds added later. - tests/benchmarks/test_retrieve_hybrid_benchmark.py: measures MemoryManager.retrieve_mem_hybrid; tied to the retrieve_p95_latency SLO (target p95 < 500 ms). - tests/benchmarks/test_memorize_benchmark.py: measures end-to-end memorize for a single message. - Both skip cleanly when RUN_BENCHMARKS is unset or when the DI container / DTOs can't be imported — collection-only is safe in CI. - Adds optional `bench` dependency group (pytest-benchmark). Default `uv sync` does not pull it in; activate with `uv sync --group bench`. - `make benchmark` target sets RUN_BENCHMARKS=1 and prints the min/mean/median/p95/max columns.

Records that T2.6 (benchmark scaffolding) landed and names T2.4 (unit-test backfill) as the next-up item gating Phase 3 W6/W7.

…nch group Picked up by uv during T2.6 verification. No behavior change for the default sync — these only activate under `uv sync --group bench`.

… layer Closes Phase 2 T2.4 of the code-quality roadmap. agent_skill_extractor was already covered by 193 existing tests, so the gap was in two modules: - tests/test_agentic_memory_manager.py (14 tests) Pins the swallow-and-return-empty contract on retrieve_mem_hybrid / retrieve_mem so Phase 3 W7's intentional change is a deliberate, reviewed diff. Also covers _classify_retrieve_error for the most common failure modes. Surfaces one xfail: the audit-flagged possibly-unbound bug where retrieve_mem(None) enters the fallback path but crashes inside QueryMetadata.from_request(None). Strict-xfail so Phase 3 W7's fix flips it naturally to xpass. - tests/test_biz_mem_memorize.py (15 tests) Covers if_memorize, _is_agent_case_quality_sufficient, _should_skip_atomic_fact_for_agent, _clone_episodes_for_users, and the _save_agent_case catch-all (returns 0 on save / conversion failure). Pure-logic tests; no DB or LLM fixtures required. Both files pass `ruff check`, `ty check`, and run cleanly under `pytest -m unit` without docker.

Two W7-subset items that are safely landable without staging soak: 1. core/errors/ — typed exception hierarchy. EverCoreError is the common root so controllers can install a single except EverCoreError fallback while pattern-matching subtypes higher in the chain. Subclasses cover: - MemorizeError / MemoryCellPersistFailed — write-path failures - ExtractionError / LLMOutputFormatError — LLM-side failures, with LLMOutputFormatError carved out for the Type-B "feedback retry" pattern (parse failure feeds into next prompt) - RetrieveError / SearchBackendUnavailable — read-path failures - ProviderUnavailable — root-level (both paths can hit a provider outage), carries an explicit `provider` field for metric/label attribution Every class carries `tenant_id`, `cause`, and arbitrary `context` kwargs so structured logging and Sentry grouping work without parsing message strings. Purely additive — no callers migrated. 15 tests pin the hierarchy shape and attribute contract. 2. Fix audit-flagged possibly-unbound bug in MemoryManager.retrieve_mem. When the request was None, the validation ValueError was caught by the broad fallback, but the fallback then crashed inside QueryMetadata.from_request(None). Short-circuited None at the top so the swallow-and-empty contract is honored. The xfail in tests/test_agentic_memory_manager.py::test_no_request_returns_empty_response flips to xpass and is no longer marked. Roadmap §0 status updated. W6 retry refactor remains deferred — openai_provider._execute_with_retry's key-rotation + status-code dispatch + per-error metric labels aren't a tenacity drop-in and need behavior-equivalence tests plus 24h staging soak per §4.2.

hylarucoder added 24 commits May 26, 2026 18:45

Finish EverCore quality gate setup

783c842

docs(EverCore): update code-quality roadmap status (T2.6 done)

a2cd973

Records that T2.6 (benchmark scaffolding) landed and names T2.4 (unit-test backfill) as the next-up item gating Phase 3 W6/W7.

chore(EverCore): lock pytest-benchmark and transitive deps for the be…

dc6bed7

…nch group Picked up by uv during T2.6 verification. No behavior change for the default sync — these only activate under `uv sync --group bench`.

hylarucoder added the do-not-merge label May 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve EverCore Python quality gates#233

Improve EverCore Python quality gates#233
hylarucoder wants to merge 24 commits into
mainfrom
fix/python-best-practices

hylarucoder commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant