Skip to content

Improve EverCore Python quality gates#233

Open
hylarucoder wants to merge 24 commits into
mainfrom
fix/python-best-practices
Open

Improve EverCore Python quality gates#233
hylarucoder wants to merge 24 commits into
mainfrom
fix/python-best-practices

Conversation

@hylarucoder
Copy link
Copy Markdown
Collaborator

Summary

  • Added EverCore developer quality gates for Ruff, ty, and Pyright through Makefile targets and pyproject.toml configuration.
  • Cleaned up the initial Ruff/formatter baseline, including import cleanup, formatting, safer type checks, duplicate prompt registration, DTO re-export compatibility, and a real possibly-unresolved reference found by ty.
  • Updated the test command to run through uv, added the missing psutil test dependency, and refreshed uv.lock for the new dev tooling.

Area

  • Architecture method
  • Benchmark
  • Use case
  • Documentation
  • Developer experience
  • CI, build, or release

Verification

cd methods/EverCore
make ruff
make typecheck
make typecheck-pyright
make test

make ruff passed.
make typecheck passed.
make typecheck-pyright completed with 0 errors and 8 existing warnings.
make test now runs under the uv-managed Python environment, but the full local suite is not green in this checkout: 1574 passed, 6 skipped, 164 failed, and 160 errored. The failures are concentrated in integration/repository/E2E suites that require local Mongo/Redis/Elasticsearch/Milvus/provider configuration; for example, the delete API integration suite used none:27017 and MONGO_DB=None.

Checklist

  • I kept the change scoped to the relevant area.
  • I updated docs, examples, or setup notes when behavior changed.
  • I added or updated tests when the change affects behavior.
  • I did not commit secrets, .env files, dependency folders, or generated output.
  • Active relative links in Markdown files resolve.

Notes for Reviewers

The main reviewer focus should be the initial rule baseline. Ruff is now passing, while ty is intentionally limited to possibly-unresolved-reference so this first rollout does not fail on the current broader type-checking baseline.

By submitting this pull request, I agree that my contribution is licensed under
the Apache License 2.0.

- ruff with FastAPI-style rule set (E/W/F/B/C4/ASYNC); `make ruff`
  and `make ruff-fix` targets. Keeps black/isort during transition.
- ty as the default typechecker (~10x faster than pyright); pyright
  retained as `make typecheck-pyright`. [tool.ty.rules] enables
  possibly-unresolved-reference, which catches real bugs surfaced
  in the audit (milvus_start, user_id_list).
- pytest-asyncio: pin asyncio_mode = "strict" explicitly so the
  current default doesn't shift under us on upstream release.
- load_env.py: use os.pathsep instead of hardcoded ":" in PYTHONPATH
  manipulation so the env loader works on Windows.
- memory_manager.get_vector_search_results: initialize milvus_start
  before the try so the except branch can always compute the stage
  duration. Previously a failure before line 729 would raise
  UnboundLocalError, masking the real exception.
- mem_memorize._trigger_profile_extraction: initialize user_id_list
  before the try so the recovery branch can iterate participants
  safely on early failure.
- mem_memorize: `request == None` -> `request is None` (E711).
- mem_memorize / memory_manager / mem_db_operations: remove duplicate
  imports of traceback, dataclass, RawDataType, MemorizeConfig,
  DEFAULT_MEMORIZE_CONFIG, Any, MemoryType (F811).
- retrieval_utils: bare `except:` -> `except Exception:` to stop
  swallowing KeyboardInterrupt / SystemExit (E722).
- i18n_tool: declare OpenAIProvider under TYPE_CHECKING so the
  forward-reference annotations resolve without forcing memory_layer
  to be importable at module load.
Two classes named SearchMemoriesResponse were defined in
api_specs/dtos/memory.py: a GET-endpoint variant at line 774
(BaseApiResponse[RetrieveMemResponse]) and a POST-endpoint variant
at line 1697 (BaseApiResponse[SearchMemoriesResponseData]). The
second definition silently shadowed the first, making the GET
variant dead code.

The POST variant is the one wired to memory_search_controller
(it instantiates `SearchMemoriesResponse(data=response_data)`
where response_data is SearchMemoriesResponseData), so keep that
one and remove the shadowed GET variant + its example schema.
Same "assigned inside try, used in except / after-try" pattern as the
earlier milvus_start / user_id_list fixes:

- mem_memorize._trigger_profile_extraction: initialize profile_repo
  to None before the try, and skip the last_updated_ts advance when
  the DI lookup never ran (failure during lazy imports).
- memory_layer.memory_manager: initialize display_name to None before
  the if/else so the else-branch loop falling through (no matching
  sender_id) doesn't leave it unbound at the downstream call site.
- milvus_collection_base: pull `new_coll = Collection(...)` outside
  the index-creation try so an index-warning path can't shadow the
  collection-creation success with a NameError on `return new_coll`.
- hmac_signature_middleware: bind timestamp_header / nonce_header /
  signature_header to None before the try, so the except branch's
  error logging works even if request.headers itself raises.
- redis_length_cache_manager, redis_windows_cache_manager: bind
  `message = None` at the top of each loop iteration so the except
  branch can format its warning without a NameError when indexing
  raises before the assignment.

Incidentally includes ruff F541 cleanup (f-strings without
placeholders -> plain strings) in mem_memorize.py from the same
auto-fix pass.
Mechanical cleanup from `ruff check --fix --select W291,W293,F541,
C420,B010,B009`:

- W293 blank-line-with-whitespace: removed trailing whitespace on
  otherwise-empty lines (~126 occurrences).
- W291 trailing-whitespace: trimmed line-end whitespace.
- F541 f-string-missing-placeholders: rewrote `f"..."` literals with
  no interpolation as plain strings (~22 occurrences).
- C420 unnecessary-dict-comprehension-for-iterable, B010 set-attr-
  with-constant, B009 get-attr-with-constant: simplified a handful
  of unnecessary dict comprehensions and setattr/getattr-with-
  constant-attribute calls into direct attribute access.

No behavior changes. The remaining 144 W293 / 9 W291 occurrences
are inside docstrings or string literals and require --unsafe-fixes,
which is intentionally not applied.
Mechanical cleanup via `ruff check --select F401 --fix
--unsafe-fixes`. 138 of 139 unused imports removed across 67 files.

Audited removals before committing:

- DI decorators (`from core.di.decorators import service / component /
  repository`) were imported in a few adapter modules but never used
  as `@service` etc. `core.di.decorators` is a pure decorator factory
  module with no module-load side effects, so removing the dangling
  imports is safe.
- stdlib (`os`, `time`, `json`, `random`, `uuid`, `numpy`, `jieba`,
  ...) and typing helpers (`Optional`, `List`, `Tuple`, ...) form
  the bulk of removals.
- `__init__.py` files are intentionally exempt (per the
  `[tool.ruff.lint.per-file-ignores]` rule allowing re-exports).

The one remaining F401 (`pyinstrument` in profile_middleware.py) is
an availability-probe import — `try: import pyinstrument` — and is
suppressed with `# noqa: F401` since the import is the side effect.

Net impact: ruff baseline 387 -> 244 errors. ty count unchanged at
891 (F401 doesn't affect type inference). All src/*.py files
compile successfully.
Fixes 5 ruff ASYNC violations. The first is a real bug; the others
are dev-script cleanup for consistency now that aiofiles is already
a project dependency.

- core/oxm/es/migration/utils.py:177,182:
  `wait_for_task_completion` polls Elasticsearch task status inside
  an async function but used `time.sleep(5)` and `time.sleep(10)`
  on error, freezing the entire event loop on each iteration. ES
  rebuilds can run for minutes-to-hours, so this would stall every
  other coroutine in the process (FastAPI lifespan, background
  tasks, etc.). Switch to `await asyncio.sleep(...)`.

- devops_scripts/i18n/i18n_tool.py:663,701 and
  devops_scripts/sensitive_info/sensitive_info_tool.py:477:
  Replace synchronous `with open(...)` / `f.read()` / `f.write()`
  with `async with aiofiles.open(...)` / `await f.read()` /
  `await f.write()`. These are dev-tool scripts where blocking
  IO impact is negligible, but the conversion is mechanical and
  removes the noise from future lint reports.
Mechanical cleanup of 190 ruff violations across 39 files. Manually
audited the auto-fixes; the noteworthy decisions:

- W293/W291 inside src/memory_layer/prompts/* was REVERTED. Those
  triple-quoted strings are LLM prompt templates and the trailing
  whitespace is part of the literal sent to the model. Three files
  restored to HEAD: prompts/__init__.py, en/atomic_fact_prompts.py,
  zh/conv_prompts.py.

- F841 "unused variable" auto-fix mostly deleted assignments where
  the RHS is pure (literals, len(), sorted(), str()). For cases
  where ruff conservatively kept the RHS as an orphan expression
  (`get_tenant_aware_collection_name(origin_alias_name)` etc.),
  manually verified those functions are pure and deleted the
  expression lines too:
    - test_di_scanner.py:284 (`len(container._named_beans)`)
    - tenant_aware_collection_with_suffix.py:350
    - user_profile_milvus_converter.py:52

  The `job = await pool.enqueue_job(...)` case correctly kept the
  awaited call -- only the unused name binding was removed.

- E731 lambda-to-def conversion in milvus_rebuild_collection.py
  produced wrong inner indentation; fixed to PEP 8 4-space.

- C408 dict(...) -> {...}, B905 zip(..., strict=False), E401
  multiple-imports-on-one-line, and the remaining W291/W293 outside
  prompts were applied verbatim.

Result: ruff 239 -> 56 errors. All src/ files compile. ty 891 -> 839
(F841 deletions remove type tracking on those vars).
The previous --unsafe-fixes commit (dabc3c4) auto-removed assignments
that ruff classified as unused locals. Several of those were not safe
to drop on a purely-syntactic read:

- agentic_layer/memory_manager.py: three `memory_type = (...)`
  bindings in `retrieve_mem_keyword / _vector / _hybrid`. These
  were paired with `start_time = time.perf_counter()` in the
  original metrics-labelling commit (b3958cb). The paired
  start_time was removed during a later metrics refactor that
  pushed recording into the inner `get_*_search_results` helpers;
  the memory_type binding got orphaned but the value is still a
  useful debugging breadcrumb at the wrapper layer.
- core/asynctasks/task_manager.py: `job = await pool.enqueue_job(...)`
  — keeping the name binding aids debugging and any future
  follow-up that needs the job handle.
- memory_manager.py `extend_data = fields['extend_data']`,
  agent_skill_milvus_converter `content = source_doc.content or ""`,
  episode_memory_extractor `default_title = "Conversation Episode"`,
  test_di_scanner `initial_bean_count = len(...)`,
  tenant_aware_collection_with_suffix `tenant_aware_alias_name = ...`,
  user_profile_milvus_converter `doc_id = str(source_doc.id) ...`:
  same story — likely vestiges of in-flight work or intentional
  reads kept for clarity. Restoring them is cheap; auto-removing
  was the wrong default.

Add F841 to the ruff `ignore` list in pyproject.toml so future
`--unsafe-fixes` runs won't reintroduce the same destructive cleanup.
F841 can still be surfaced explicitly via `ruff check --select F841`
when a human wants to triage them; we just don't auto-fix.
Claude Code loads CLAUDE.local.md alongside CLAUDE.md and lets it
take precedence (closer-to-cwd loads last). Use it for personal
workflow rules that shouldn't bind other contributors, while
CLAUDE.md / AGENTS.md keep the shared team conventions.
Two new dev_docs articles drafted from a project-wide code review:

- exception_handling_analysis.md: full audit of 781 try blocks and
  17 hand-written retry loops. Catalogues 7 anti-patterns and 6
  retry types with code references, performance/correctness
  analysis, and a sprint-organised action plan.

- code_quality_roadmap.md: 8-week three-phase execution plan
  (observability -> tests -> try-catch cleanup), with per-phase
  acceptance criteria and rollback paths.
K8s-style liveness vs readiness separation, per the observability
roadmap (Phase 1, T1.1).

- /livez: process heartbeat, always 200. Suitable for liveness probes;
  a failure means the process should be restarted.
- /readyz: runs four downstream probes (Redis, MongoDB, Elasticsearch,
  Milvus) in parallel under a 2 s timeout. Returns 200 only when all
  succeed; 503 with a per-dependency breakdown otherwise. Suitable for
  readiness probes and load-balancer drain checks.
- /health: legacy alias for /livez, preserved for backward compatibility.

Each probe also publishes the evercore_dependency_healthy{name="..."}
Prometheus gauge so alerts can fire without polling /readyz externally.

Probe definitions live in a sibling probes.py module; adding a new
downstream is one async function plus an entry in default_probes().

prometheus_middleware skips /livez and /readyz so probe traffic does
not dominate request-rate dashboards.

Smoke-tested locally: happy path returns 200 with all four deps
healthy; pointing REDIS_PORT at a dead port correctly drives /readyz
to 503 with redis=False detail, /livez stays 200, and the gauge for
redis drops to 0 while others remain 1.
Per Phase 1, T1.3 of the observability roadmap.

Before this change, four independent classifiers
(memorize/retrieve/rerank/vectorize) each did string-matching on
str(error).lower() and fell through to the literal "unknown" (or
"error") bucket. Cross-service alerts could not group failures the
same way, and any unfamiliar exception class was invisible behind the
catch-all label.

The new core/observation/error_classification.py is the single source
of truth: it still emits the existing semantic labels (timeout,
rate_limit, validation_error, connection_error, not_found) when those
match, and otherwise returns the exception class name in snake_case.
The four call sites now delegate to it.

Effect on metrics:
- error_type='unknown' / error_type='error' goes away entirely.
- A previously-anonymous KeyError or pydantic.ValidationError now
  shows up as error_type='key_error' / 'validation_error' and can be
  alerted on.
- Vectorize-specific failures previously bucketed as 'api_error' now
  surface as 'vectorize_error', which is more actionable.

Unit-tested with 9 representative exception shapes; ruff + compile
clean. No call-site behaviour change beyond the label string.
Per Phase 1, T1.2 of the observability roadmap.

The logger module now selects its output format from the LOG_FORMAT
env var:

- LOG_FORMAT=text (default) keeps the existing human-readable format
  used by local development and tail-based debugging.
- LOG_FORMAT=json switches to one JSON object per line via the new
  JsonFormatter. Each record carries ts (ISO 8601 ms + Z), level,
  logger, msg, file, request_id, and any extra= mapping the caller
  attached. Tracebacks live in a structured exc_info field instead of
  being concatenated into msg.

ContextFilter (formerly RequestIdFilter) now also forwards tenant_id,
user_id, group_id and session_id from the app_info contextvar onto
the LogRecord, so JSON output captures them automatically when they
are populated by middleware.

The change is fully backward compatible: all 247 existing call sites
keep working unchanged in both formats. Structured fields can be
adopted incrementally via logger.info(msg, extra={...}) without
touching the rest.

Implemented with zero new dependencies (stdlib logging + json). A
future migration to structlog can layer on top without churning
callers again.

Smoke-tested locally: with LOG_FORMAT=json, /readyz emits per-probe
JSON lines that share the same request_id across the Redis / MongoDB /
Elasticsearch / Milvus checks, which is exactly the cross-call
correlation the roadmap calls out as Phase 1's headline win.
Closes Phase 1 T1.5 of the code-quality roadmap.

- docs/dev_docs/slo_definitions.md: four load-bearing SLOs (memorize
  success rate, retrieve p95 latency, LLM error budget, dependency
  availability) wired to existing metric names. Includes PromQL,
  severity guidance, and an error-budget policy.
- docs/grafana/evercore_slo_overview.json: importable dashboard with
  one stat per SLO plus 5-min burn-rate timeseries panels.
Closes Phase 1 T1.4 of the code-quality roadmap.

- core/observation/tracing/otel.py: soft-imports the OTel SDK; init_tracing
  is a no-op unless OTEL_EXPORTER_OTLP_ENDPOINT is set AND the deps are
  installed. Installs FastAPI/httpx/redis/pymongo instrumentations when
  available. Default sampler is ParentBased(TraceIdRatioBased(0.1)).
- trace_logger decorator now opens a real span when tracing is active;
  unchanged behavior when inactive (zero perf cost).
- base_app wires init_tracing(app) at the end of create_base_app.
- pyproject adds an optional `otel` dependency group. Activate with
  `uv sync --group otel`; default sync does NOT pull these in.
…heck

Closes Phase 2 T2.1, T2.2, T2.3, T2.5 of the code-quality roadmap.

- evercore-smoke.yml: runs unit-marker pytest with coverage XML upload
  and a non-blocking ty typecheck step. Both `continue-on-error: true`
  until tests are explicitly tagged per T2.4.
- Makefile: adds test-unit / test-integration / test-e2e / test-cov
  targets.
- tests/conftest.py: auto-tags tests by file path
  (tests/integration/** → integration; *_e2e.py → e2e; rest → unit).
  Explicit markers still win.
- pytest.ini: fixes section header from [tool:pytest] (legacy
  setup.cfg form) to [pytest], so pytest actually reads markers
  and other settings. Registers e2e marker.
- pyproject.toml: removes duplicate [tool.pytest.ini_options]
  section; pytest >= 8 silently ignored it whenever pytest.ini existed.
Closes Phase 3 W5 of the code-quality roadmap. Phase 3 W6 (retry
refactor) and W7 (catch-all overhaul + custom exception hierarchy)
are intentionally not in this commit — both require 24-hour staging
soak per roadmap §4.2.

Substantive changes
- Replace 20 `traceback.print_exc()` call sites with
  `logger.exception(...)`; remove now-unused `import traceback`.
  Affected: bootstrap, biz_layer/mem_memorize, mem_db_operations,
  core/di/scanner, core/component/mongodb_client_factory,
  core/request/timeout_background, core/oxm/es/migration/utils,
  and 7 devops_scripts/data_fix scripts.
- Consolidate duplicate `RetryConfig` classes. The canonical
  implementation lives in core/longjob/interfaces (richer API:
  jitter, backoff_multiplier, retry_on_fatal). The dataclass in
  core/asynctasks/task_manager is removed and the module re-imports
  from interfaces. No external consumers, safe merge.

Ruff baseline
- Enable G (flake8-logging-format) and BLE (flake8-blind-except).
- 930 `# noqa: G004` / `# noqa: BLE001` markers applied to existing
  violations via `ruff check --add-noqa`. New code is held to the
  rule; baseline shrinks naturally as files are touched.

Roadmap status
- docs/dev_docs/code_quality_roadmap.md §0 updated to reflect
  Phase 1 done, Phase 2 partial, Phase 3 W5 done; explicit
  hand-off note for the W6/W7 boundary.
Adds opt-in performance benchmarks for the two hot paths called out in
roadmap §3.2: hybrid retrieval and end-to-end memorize. Per roadmap
guidance, benchmarks are NOT wired into CI yet — the first ~3 months
are baseline collection only, with thresholds added later.

- tests/benchmarks/test_retrieve_hybrid_benchmark.py: measures
  MemoryManager.retrieve_mem_hybrid; tied to the retrieve_p95_latency
  SLO (target p95 < 500 ms).
- tests/benchmarks/test_memorize_benchmark.py: measures end-to-end
  memorize for a single message.
- Both skip cleanly when RUN_BENCHMARKS is unset or when the DI
  container / DTOs can't be imported — collection-only is safe in CI.
- Adds optional `bench` dependency group (pytest-benchmark). Default
  `uv sync` does not pull it in; activate with `uv sync --group bench`.
- `make benchmark` target sets RUN_BENCHMARKS=1 and prints the
  min/mean/median/p95/max columns.
Records that T2.6 (benchmark scaffolding) landed and names T2.4
(unit-test backfill) as the next-up item gating Phase 3 W6/W7.
…nch group

Picked up by uv during T2.6 verification. No behavior change for the
default sync — these only activate under `uv sync --group bench`.
… layer

Closes Phase 2 T2.4 of the code-quality roadmap. agent_skill_extractor
was already covered by 193 existing tests, so the gap was in two
modules:

- tests/test_agentic_memory_manager.py (14 tests)
  Pins the swallow-and-return-empty contract on
  retrieve_mem_hybrid / retrieve_mem so Phase 3 W7's intentional change
  is a deliberate, reviewed diff. Also covers _classify_retrieve_error
  for the most common failure modes. Surfaces one xfail: the
  audit-flagged possibly-unbound bug where retrieve_mem(None) enters
  the fallback path but crashes inside QueryMetadata.from_request(None).
  Strict-xfail so Phase 3 W7's fix flips it naturally to xpass.

- tests/test_biz_mem_memorize.py (15 tests)
  Covers if_memorize, _is_agent_case_quality_sufficient,
  _should_skip_atomic_fact_for_agent, _clone_episodes_for_users, and
  the _save_agent_case catch-all (returns 0 on save / conversion
  failure). Pure-logic tests; no DB or LLM fixtures required.

Both files pass `ruff check`, `ty check`, and run cleanly under
`pytest -m unit` without docker.
Two W7-subset items that are safely landable without staging soak:

1. core/errors/ — typed exception hierarchy.

   EverCoreError is the common root so controllers can install a single
   except EverCoreError fallback while pattern-matching subtypes
   higher in the chain. Subclasses cover:

   - MemorizeError / MemoryCellPersistFailed — write-path failures
   - ExtractionError / LLMOutputFormatError — LLM-side failures, with
     LLMOutputFormatError carved out for the Type-B "feedback retry"
     pattern (parse failure feeds into next prompt)
   - RetrieveError / SearchBackendUnavailable — read-path failures
   - ProviderUnavailable — root-level (both paths can hit a provider
     outage), carries an explicit `provider` field for metric/label
     attribution

   Every class carries `tenant_id`, `cause`, and arbitrary `context`
   kwargs so structured logging and Sentry grouping work without
   parsing message strings. Purely additive — no callers migrated.
   15 tests pin the hierarchy shape and attribute contract.

2. Fix audit-flagged possibly-unbound bug in MemoryManager.retrieve_mem.

   When the request was None, the validation ValueError was caught by
   the broad fallback, but the fallback then crashed inside
   QueryMetadata.from_request(None). Short-circuited None at the top
   so the swallow-and-empty contract is honored. The xfail in
   tests/test_agentic_memory_manager.py::test_no_request_returns_empty_response
   flips to xpass and is no longer marked.

Roadmap §0 status updated. W6 retry refactor remains deferred —
openai_provider._execute_with_retry's key-rotation + status-code
dispatch + per-error metric labels aren't a tenacity drop-in and need
behavior-equivalence tests plus 24h staging soak per §4.2.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant