diff --git a/.agents/skills/propose/SKILL.md b/.agents/skills/propose/SKILL.md
index 7083f30f..a93badca 100644
--- a/.agents/skills/propose/SKILL.md
+++ b/.agents/skills/propose/SKILL.md
@@ -234,15 +234,6 @@ Docs-only; baseline unchanged.
 - `TOOL-NAME-PROPOSE.md`
 - `ARCHITECTURE-CHANGE-PROPOSE.md`
 
-## Final checklist
-
-- [ ] Proposal file lives under `propose/active/`
-- [ ] Problem statement includes concrete examples
-- [ ] Schema/ontology/re-index impact is explicit
-- [ ] Open questions include `[TBD]` with recommendations
-- [ ] Out-of-scope section is present
-- [ ] Sequencing/follow-up path is clear
-
 ## Key Principles
 
 - **One question at a time** — Don't overwhelm with multiple questions
diff --git a/README.md b/README.md
index fe565d8f..59f84698 100644
--- a/README.md
+++ b/README.md
@@ -92,7 +92,9 @@ With the package installed, the console script `java-codebase-rag-mcp` is on you
 claude mcp add --transport stdio java-codebase-rag -- java-codebase-rag-mcp
 ```
 
-Then set env vars (`JAVA_CODEBASE_RAG_INDEX_DIR`, `JAVA_CODEBASE_RAG_SOURCE_ROOT`, `SBERT_MODEL`, …) in `.mcp.json` or your shell profile. For a project-scoped `.mcp.json` template, see [`mcp.json.example`](./mcp.json.example). Official docs: [Claude Code settings](https://docs.anthropic.com/en/docs/claude-code/settings).
+**Zero-env-var configuration:** The tool automatically walks up the directory tree to find `.java-codebase-rag.yml`, so you don't need to set `JAVA_CODEBASE_RAG_SOURCE_ROOT` when working from within a project. Just place the config file at your project root and the tool will find it. See [`mcp.json.example`](./mcp.json.example) for the minimal configuration.
+
+If you need to override defaults, you can set env vars (`JAVA_CODEBASE_RAG_INDEX_DIR`, `JAVA_CODEBASE_RAG_SOURCE_ROOT`, `SBERT_MODEL`, …) in `.mcp.json` or your shell profile. For a full configuration template, see [`mcp.json.example`](./mcp.json.example). Official docs: [Claude Code settings](https://docs.anthropic.com/en/docs/claude-code/settings).
 
 ### Claude Desktop
 
diff --git a/docs/CONFIGURATION.md b/docs/CONFIGURATION.md
index 8a80cd31..b6b2cde9 100644
--- a/docs/CONFIGURATION.md
+++ b/docs/CONFIGURATION.md
@@ -22,6 +22,25 @@ For the architecture rationale (the GPS metaphor, three-layer design, future wor
 
 The operator-facing surface is **six** variables (plus MCP-only `JAVA_CODEBASE_RAG_SOURCE_ROOT` below). Precedence for knobs that also exist as CLI flags or YAML entries is **CLI flag > env var > YAML > built-in default** (see [`JAVA-CODEBASE-RAG-CLI.md`](./JAVA-CODEBASE-RAG-CLI.md)).
 
+### Config file discovery (walk-up)
+
+The tool automatically walks up the directory tree from the current working directory to find `.java-codebase-rag.yml` (or `.yaml`), similar to how Git finds `.git`. This means you can run CLI commands and MCP queries from any subdirectory within your project — the tool will locate the config file automatically.
+
+**Walk-up behavior:**
+- Starts from the current working directory and walks up the directory tree
+- Stops at `$HOME` (inclusive — checks `$HOME` itself but doesn't walk past it)
+- First match wins (closest config to cwd, not "most specific" or "deepest")
+- If no config is found, falls back to using the current directory
+
+**Precedence for source root resolution:**
+1. CLI flag `--source-root` (highest priority)
+2. Environment variable `JAVA_CODEBASE_RAG_SOURCE_ROOT`
+3. YAML field `source_root` (resolved relative to config directory)
+4. Walk-up discovery result (config directory itself)
+5. Current working directory (fallback)
+
+This walk-up behavior means you no longer need to set environment variables or pass flags when working from within a project — the tool finds the config automatically.
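As a rough illustration of the precedence list above (an editorial sketch, not code from this patch): the helper below, under the hypothetical name `resolve_source_root`, receives the CLI value, the env value, the YAML `source_root` value, and the walk-up result as plain arguments and returns the first one that applies. It assumes POSIX-style paths and normalizes with `os.path.normpath` rather than `Path.resolve()`, so the outcome does not depend on the local filesystem.

```python
import os
from pathlib import Path
from typing import Optional


def _absolute(path: Path, base: Path) -> Path:
    """Anchor a relative path at base, then collapse '.' and '..' segments."""
    joined = path if path.is_absolute() else base / path
    return Path(os.path.normpath(joined))


def resolve_source_root(
    cli_flag: Optional[str],
    env_value: Optional[str],
    yaml_value: Optional[str],
    config_dir: Optional[Path],
    cwd: Path,
) -> Path:
    """Pick the effective source root; hypothetical helper mirroring the list above."""
    # 1. CLI flag --source-root (relative values resolve against cwd).
    if cli_flag:
        return _absolute(Path(cli_flag).expanduser(), cwd)
    # 2. Environment variable JAVA_CODEBASE_RAG_SOURCE_ROOT.
    if env_value and env_value.strip():
        return _absolute(Path(env_value.strip()).expanduser(), cwd)
    # 3. YAML source_root, resolved relative to the config directory.
    if config_dir is not None and yaml_value and yaml_value.strip():
        return _absolute(Path(yaml_value.strip()).expanduser(), config_dir)
    # 4. Walk-up discovery result: the config directory itself.
    if config_dir is not None:
        return config_dir
    # 5. Fallback: the current working directory.
    return cwd
```

Passing every input explicitly keeps the ordering easy to unit-test; the real resolver in this patch reads the environment and the YAML file itself.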
+
 | Variable | Purpose |
 |---|---|
 | `JAVA_CODEBASE_RAG_INDEX_DIR` | Local filesystem **directory** for Lance tables, the Kuzu file `code_graph.kuzu`, and cocoindex state (`cocoindex.db`). Not a `lancedb://` or cloud URI — use a path. Default: `./.java-codebase-rag/` under the resolved Java tree root. |
@@ -58,6 +77,14 @@ A single file at the project root (the directory you pass as `--source-root`, or
 
 # -------- Core knobs (mirror env vars; precedence: CLI > env > YAML > default) --------
 
+# Source root: the Java project root. Useful when the config file lives
+# separately from the Java source code (e.g., monorepo with configs at repo root).
+# - Tilde (`~`) is expanded; `$VAR` is NOT (use absolute paths or `~`).
+# - Relative paths resolve against the config file's parent directory, not cwd.
+# - Env: JAVA_CODEBASE_RAG_SOURCE_ROOT. CLI: --source-root.
+# - Default: the directory containing this config file (for walk-up discovery).
+# source_root: ../java-project
+
 # Index directory: where Lance tables, code_graph.kuzu, and cocoindex.db live.
 # - Tilde (`~`) is expanded; `$VAR` is NOT (use absolute paths or `~`).
 # - Relative paths resolve against source_root, not cwd.
@@ -95,6 +122,25 @@ microservice_roots:
   - chat-orchestrator
   - ranking
 
+# Automatic microservice scope for queries (MCP server only)
+# When working from a microservice subdirectory, queries automatically scope
+# to that microservice — no manual filter needed. This provides correct
+# codebase boundaries for agents working on specific microservices.
+#
+# Behavior:
+# - At microservice root or inside a microservice subdirectory:
+#     → Queries automatically scoped to that microservice
+# - At project root (above all microservices):
+#     → Queries span all microservices with an advisory message
+# - Explicit microservice filters always override auto-detected scope
+#
+# The MCP server logs scope detection at startup:
+#   [scope] Detected microservice: chat-core
+#   [scope] Queries scoped to chat-core
+# Or at system level:
+#   [scope] No microservice detected (at project root)
+#   [scope] Queries will span all microservices
+
 # -------- Cross-service edge resolution --------
 
 # How the resolver treats auto-detected cross-service call edges. See §4.2.
diff --git a/graph_enrich.py b/graph_enrich.py
index 9b9c714a..6135a97c 100644
--- a/graph_enrich.py
+++ b/graph_enrich.py
@@ -1565,6 +1565,35 @@ def microservice_for_path(
     return ""
 
 
+def detect_microservice_from_path(cwd: Path, source_root: Path) -> str | None:
+    """Detect microservice from cwd for query-time auto-scope.
+
+    Returns None if cwd is outside source_root, cwd IS source_root (system level),
+    or no microservice is detected. Otherwise returns the microservice name.
+    """
+    cwd_resolved = cwd.resolve()
+    source_resolved = source_root.resolve()
+
+    # Check if cwd is outside source_root
+    try:
+        cwd_resolved.relative_to(source_resolved)
+    except ValueError:
+        return None
+
+    # Check if cwd IS source_root (at system level, no specific scope)
+    if cwd_resolved == source_resolved:
+        return None
+
+    # Check if cwd itself matches a YAML override (directory name matches microservice_roots)
+    overrides = load_microservice_overrides(source_resolved)
+    if overrides and cwd_resolved.name in overrides:
+        return cwd_resolved.name
+
+    # Call existing microservice_for_path to detect microservice from build markers
+    ms = microservice_for_path(str(cwd_resolved), source_resolved)
+    return ms if ms else None
+
+
 # ---------- chunk enrichment ----------
 
 
diff --git a/java_codebase_rag/cli.py b/java_codebase_rag/cli.py
index 27ad800a..a3281e71 100644
--- a/java_codebase_rag/cli.py
+++ b/java_codebase_rag/cli.py
@@ -229,6 +229,18 @@ def _add_verbosity_flags(p: argparse.ArgumentParser) -> None:
 
 def _cmd_init(args: argparse.Namespace) -> int:
     cfg = _resolved_from_ns(args)
+    # Check for parent config
+    from java_codebase_rag.config import discover_project_root, YAML_CONFIG_FILENAMES
+    parent_config_dir = discover_project_root(cfg.source_root.parent)
+    if parent_config_dir is not None:
+        parent_config = parent_config_dir / YAML_CONFIG_FILENAMES[0]
+        if not parent_config.is_file():
+            parent_config = parent_config_dir / YAML_CONFIG_FILENAMES[1]
+        print(
+            f"Warning: found existing config at {parent_config}. "
+            f"Creating a new project here will create a separate index.",
+            file=sys.stderr,
+        )
     _startup_hints(cfg)
     cfg.apply_to_os_environ()
     occupied, paths = index_dir_has_existing_artifacts(cfg.index_dir)
diff --git a/java_codebase_rag/config.py b/java_codebase_rag/config.py
index d9550b3f..ada9a540 100644
--- a/java_codebase_rag/config.py
+++ b/java_codebase_rag/config.py
@@ -123,6 +123,33 @@ def find_yaml_config_file(source_root: Path) -> Path | None:
     return None
 
 
+def discover_project_root(start: Path) -> Path | None:
+    """Walk up from start to find the directory containing a config file.
+
+    First match wins (closest to start). Stops at $HOME inclusive — checks $HOME
+    itself but does not walk past it. Returns None if no config found.
+    """
+    start = start.resolve()
+    home = Path.home().resolve()
+
+    current = start
+    while True:
+        # Check if current directory contains a config file
+        if find_yaml_config_file(current) is not None:
+            return current
+
+        # Stop if we've reached home (check home itself, but don't walk past it)
+        if current == home:
+            return None
+
+        # Stop if we've reached filesystem root
+        parent = current.parent
+        if parent == current:
+            return None
+
+        current = parent
+
+
 def load_yaml_mapping(source_root: Path) -> dict[str, Any]:
     path = find_yaml_config_file(source_root)
     if path is None:
@@ -277,8 +304,36 @@ def resolve_operator_config(
     cli_embedding_model: str | None = None,
     cli_embedding_device: str | None = None,
 ) -> ResolvedOperatorConfig:
-    root = (source_root or Path.cwd()).expanduser().resolve()
-    yaml_dict = load_yaml_mapping(root)
+    # Phase 1: Find the config file directory
+    if source_root is not None:
+        # CLI flag provided: use it as both config_dir and effective source_root
+        # (skip YAML source_root check - CLI wins)
+        root = source_root.expanduser().resolve()
+        config_dir = root
+        yaml_dict = load_yaml_mapping(config_dir)
+    else:
+        # Check env var first
+        env_raw = os.environ.get(ENV_SOURCE_ROOT, "").strip()
+        if env_raw:
+            root = Path(env_raw).expanduser().resolve()
+            config_dir = root
+            yaml_dict = load_yaml_mapping(config_dir)
+        else:
+            # Walk up to find config dir
+            discovered = discover_project_root(Path.cwd())
+            config_dir = discovered if discovered is not None else Path.cwd().resolve()
+            yaml_dict = load_yaml_mapping(config_dir)
+
+    # Phase 2: Resolve effective source root
+    # YAML source_root (resolved relative to config dir) applies only without a CLI flag or env var
+    explicit = source_root is not None or bool(os.environ.get(ENV_SOURCE_ROOT, "").strip())
+    yaml_source_root = None if explicit else yaml_dict.get("source_root")
+    if isinstance(yaml_source_root, str) and yaml_source_root.strip():
+        yroot = Path(yaml_source_root.strip()).expanduser()
+        root = yroot.resolve() if yroot.is_absolute() else (config_dir / yroot).resolve()
+    else:
+        root = config_dir
+
     index_dir, index_src = _resolve_index_dir_path(
         source_root=root, cli_index_dir=cli_index_dir, yaml_dict=yaml_dict
     )
diff --git a/mcp.json.example b/mcp.json.example
index 7a56372c..4dea1683 100644
--- a/mcp.json.example
+++ b/mcp.json.example
@@ -11,3 +11,28 @@
     }
   }
 }
+
+// Minimal configuration with walk-up discovery (no env vars required):
+// The tool walks up from the current directory to find .java-codebase-rag.yml,
+// then uses the config's source_root (or the config directory itself) to find the index.
+// Just omit the "env" section entirely:
+//
+// {
+//   "mcpServers": {
+//     "java-codebase-rag": {
+//       "type": "stdio",
+//       "command": "java-codebase-rag-mcp"
+//     }
+//   }
+// }
+//
+// For Claude Code (which uses the same MCP protocol but different config format),
+// the minimal configuration in .mcp.json is similar:
+//
+// {
+//   "mcpServers": {
+//     "java-codebase-rag": {
+//       "command": "java-codebase-rag-mcp"
+//     }
+//   }
+// }
diff --git a/plans/active/PLAN-DIRS-HIERARCHY.md b/plans/active/PLAN-DIRS-HIERARCHY.md
new file mode 100644
index 00000000..db51e304
--- /dev/null
+++ b/plans/active/PLAN-DIRS-HIERARCHY.md
@@ -0,0 +1,312 @@
+# Plan: Walk-up config discovery, configurable source root, and microservice auto-scope
+
+Status: **active (planning)**. This plan implements
+[`propose/active/DIRS-HIERARCHY-PROPOSE.md`](../../propose/active/DIRS-HIERARCHY-PROPOSE.md)
+as a single PR.
+
+Depends on: none.
+
+## Goal
+
+- Users can run CLI and MCP commands from any subdirectory within their project — the tool walks up to find `.java-codebase-rag.yml`, like git finds `.git`.
+- The YAML config gains an optional `source_root` field so the config can live separately from the Java source code, and the index dir auto-derives from the resolved source root.
+- When working from a microservice subdirectory, queries automatically scope to that microservice — no manual filter needed. Agents see correct codebase boundaries.
+- Existing workflows where cwd = config dir produce identical behavior. No breaking changes.
+
+## Principles (do not relitigate in review)
+
+- **First match wins.** Closest config to cwd, not "most specific" or "deepest". Matches git behavior.
+- **`$HOME` is the inclusive boundary.** Check `$HOME` itself, do not walk past it.
+- **Walk-up is always-on** when no explicit source root is given (CLI flag or env var). No `--walk-up` opt-out flag.
+- **YAML `source_root` resolves relative to the config file directory.** CLI `--source-root` resolves relative to cwd. Different resolution bases are intentional — the precedence table handles priority. +- **Index dir follows source root.** Default index dir = `/.java-codebase-rag/`. This does not change; walk-up just changes how source root itself is found. +- **No changes to `init` behavior** beyond a soft warning when a parent config exists. +- **No changes to indexing, query, or graph-building logic.** This is config discovery only. +- **Microservice scope matches working context.** Auto-scope queries to the detected microservice when inside a microservice directory. Explicit filters always override auto-detected scope. + +## PR breakdown - overview + +| PR | Scope | Ontology bump | Areas of concern | Test buckets | Independent of | +| --- | --- | --- | --- | --- | --- | +| PR-1 | Walk-up config discovery + `source_root` YAML field | none | Precedence chain correctness; server/CLI parity; boundary conditions ($HOME, root) | unit tests for discovery + precedence + integration | — | + +Landing order: **PR-1** (single PR). + +## Resolved design decisions + +| Topic | Decision | +| --- | --- | +| Config file name checked | `.java-codebase-rag.yml` and `.java-codebase-rag.yaml` (existing `YAML_CONFIG_FILENAMES` tuple) | +| Boundary for walk-up | `$HOME` inclusive — check `$HOME` itself but do not walk past it | +| YAML field name | `source_root` — same name as CLI flag for conceptual consistency | +| Resolution base for YAML `source_root` | Config file's parent directory (not cwd) | +| `init` behavior | Unchanged — creates config + index in the specified directory. Only adds a soft warning if a parent config is detected | +| Multiple nested configs | First match wins (closest to cwd). 
Mirrors git behavior | +| New function location | `config.py` — all config resolution logic lives there already | +| `discover_project_root` return | `Path | None` — returns the directory containing the config file, not the config file path itself | +| Microservice detection approach | Reuse indexing logic — use `microservice_for_path()` from `graph_enrich.py` for consistency | +| Microservice scope behavior | Scoped inside, all at root — auto-scope inside microservice, all microservices at root with advisory | +| Microservice auto-scope implementation | Server-level scope injection — detect at startup, inject into queries, explicit filters override | +| Microservice detection location | `graph_enrich.py` — new `detect_microservice_from_path()` function | +| Scope manager location | `server.py` — new `ScopeManager` class for caching and applying auto-scope | + +--- + +# PR-1 — Walk-up config discovery and configurable source root + +## File-by-file changes + +### 1. `java_codebase_rag/config.py` + +**New function: `discover_project_root(start: Path) -> Path | None`** + +- Canonicalize `start` via `Path.resolve()` +- Walk from `start` upward, checking each directory for files matching `YAML_CONFIG_FILENAMES` +- First match returns that directory (the parent of the found config file, not the file path itself) +- Stop at `$HOME` (inclusive — check `$HOME` itself) or filesystem root. Do not walk past `$HOME`. +- Return `None` if no config found + +**Modify: `resolve_operator_config()` — two-phase resolution** + +The core change is separating *config file discovery* from *effective source root resolution*. The exact sequence: + +1. **Phase 1 — find the config file directory.** If `source_root` is provided (CLI flag or env var), the config dir = that value (no walk-up). Otherwise, call `discover_project_root(Path.cwd())`. If walk-up found a config dir, use it. Otherwise fall back to `Path.cwd().resolve()` (unchanged behavior). +2. 
**Load YAML** from the config dir via `load_yaml_mapping(config_dir)`. +3. **Phase 2 — resolve effective source root.** Check for a `source_root` key in the YAML. If present, resolve it relative to the config dir (not cwd). The effective source root is then: + - CLI `--source-root` (already handled — `source_root` is not `None` in phase 1, so phase 2 is skipped) + - env `JAVA_CODEBASE_RAG_SOURCE_ROOT` (checked before walk-up in both `server.py` and `resolve_operator_config`) + - YAML `source_root` (resolved relative to config dir) + - Walk-up discovery result (= config dir itself, which is the default when no YAML override) + - `Path.cwd()` (no config found, no YAML override) +4. **Derive index dir** from the effective source root via `_resolve_index_dir_path()`. No edits to `_resolve_index_dir_path` itself — the caller ensures the effective source root (after YAML resolution) is what gets passed through. + +**Note:** Do NOT introduce a `find_config_dir` wrapper. The two-phase logic lives directly in `resolve_operator_config()` for clarity. The only new public function is `discover_project_root()`. + +### 2. `server.py` + +**Modify: `_project_root()`** + +- Current logic: env var → cwd fallback +- New logic: env var → `discover_project_root(Path.cwd())` → cwd fallback +- Import `discover_project_root` from `java_codebase_rag.config` + +**Modify: `_resolve_lancedb_uri()`** + +- Currently falls back to `Path.cwd() / ".java-codebase-rag"` when `JAVA_CODEBASE_RAG_INDEX_DIR` is unset. +- After walk-up, this should use the discovered source root (via `_project_root()`) for consistency. +- The server's `list_code_index_tables_payload()` calls `resolve_operator_config(source_root=_project_root())`, so index dir is derived from the effective source root. But `_resolve_lancedb_uri()` is called independently in some paths. Ensure it uses `_project_root()` instead of raw `Path.cwd()` when the env var is unset. + +### 3. 
`java_codebase_rag/cli.py` + +**Modify: `_parse_source_root()` / `_resolved_from_ns()`** + +- `_parse_source_root()` stays the same (returns `None` when `--source-root` is not given) +- `_resolved_from_ns()` already passes `source_root=root` to `resolve_operator_config()` — walk-up logic in `resolve_operator_config()` handles the `None` case + +**Modify: `init` command handler** + +- After resolving `cfg = _resolved_from_ns(args)`, check for a parent config by calling `discover_project_root(cfg.source_root.parent)` — this checks whether a config exists in any ancestor of the *resolved source root* (not the config dir, since `init` creates the config at the source root) +- If found, emit a soft warning to stderr: + > Warning: found existing config at `[parent]/.java-codebase-rag.yml`. Creating a new project here will create a separate index. + +### 4. `graph_enrich.py` + +**New function: `detect_microservice_from_path(cwd: Path, source_root: Path) -> str | None`** + +- Check if cwd is outside source_root → return None +- Check if cwd IS source_root → return None (at system level, no specific scope) +- Otherwise, call existing `microservice_for_path(cwd, source_root, overrides)` to detect microservice +- Return microservice name or None if not found + +**Purpose:** Reuse indexing logic for microservice detection, ensuring query-time scope matches index-time boundaries. + +### 5. 
`server.py` + +**New class: `ScopeManager`** + +```python +class ScopeManager: + def __init__(self, source_root: Path): + self.source_root = source_root + self.default_scope: str | None = self._detect_scope() + self._log_detection() + + def _detect_scope(self) -> str | None: + from graph_enrich import detect_microservice_from_path + return detect_microservice_from_path(Path.cwd(), self.source_root) + + def _log_detection(self) -> None: + if self.default_scope: + print(f"[scope] Detected microservice: {self.default_scope}", file=sys.stderr) + print(f"[scope] Queries scoped to {self.default_scope}", file=sys.stderr) + else: + print(f"[scope] No microservice detected (at project root)", file=sys.stderr) + print(f"[scope] Queries will span all microservices", file=sys.stderr) + + def apply_auto_scope(self, filter: NodeFilter | dict | None) -> NodeFilter | dict | None: + if self.default_scope is None: + return filter + # Convert to dict for manipulation + if filter is None: + filter_dict = {} + elif isinstance(filter, NodeFilter): + filter_dict = filter.model_dump(exclude_none=True) + else: + filter_dict = dict(filter) + # Only inject if user didn't specify microservice + if "microservice" not in filter_dict: + filter_dict["microservice"] = self.default_scope + return filter_dict +``` + +**Modify: server initialization** + +- After `_project_root()` resolves source_root, create module-level `_scope_manager = ScopeManager(source_root)` + +**Modify: each MCP tool wrapper function** + +- Before calling the underlying `mcp_v2` function, apply auto-scope: + ```python + scoped_filter = _scope_manager.apply_auto_scope(_coerce_filter(filter)) + return search_v2(query=query, filter=scoped_filter, ...) + ``` + +**Modify: add advisory when at system level** + +- In tool wrappers, when `_scope_manager.default_scope is None` and filter has no explicit microservice, include advisory in response + +### 6. `tests/test_microservice_scope.py` (NEW FILE) + +### 4. 
`mcp.json.example` + +- Add a comment block showing the minimal zero-env-var config +- Keep the existing full example as an alternative +- Show both Claude Desktop and Claude Code variants + +### 5. `README.md` + +- Update the MCP host wiring section to mention walk-up discovery +- Document the `source_root` YAML field +- Update the minimal `.mcp.json` example to show that env vars are now optional + +### 6. `docs/CONFIGURATION.md` + +- Add `source_root` to the YAML config reference table +- Document the walk-up discovery behavior +- Update the precedence chain table to include the YAML `source_root` field + +## Tests for PR-1 + +All new tests go in **`tests/test_config.py`** (new file). Tests that exercise `_project_root()` in `server.py` go in **`tests/test_mcp_server_project_root.py`** (new file) to keep MCP test concerns separate. + +**Test file organization justification:** +- `test_config.py` — Pure unit tests for config discovery and resolution logic (no server/process dependencies) +- `test_mcp_server_project_root.py` — Integration test that specifically exercises server.py's `_project_root()` function in the MCP server context +- **Rationale**: Keeping these separate prevents test pollution — config tests remain fast and isolated, while server integration tests can assume MCP server context and potentially mock server internals +- **Alternative considered**: Adding server tests to `test_config.py` would require importing server.py and its dependencies, making config tests slower and more brittle + +### Config discovery tests (`tests/test_config.py`) + +1. `test_discover_project_root_finds_config_in_cwd` — config in cwd, returns cwd +2. `test_discover_project_root_walks_up` — config in parent, returns parent +3. `test_discover_project_root_stops_at_home_boundary` — config in `$HOME` itself, walk-up from subdirectory of `$HOME` finds it (inclusive boundary) +4. 
`test_discover_project_root_not_found_above_home` — no config anywhere under `$HOME`, returns `None` +5. `test_discover_project_root_not_found` — no config anywhere, returns `None` +6. `test_discover_project_root_first_match_wins` — configs at two levels (cwd subdirectory has one, parent has another), closest to cwd wins + +### Source root resolution tests (`tests/test_config.py`) + +7. `test_source_root_from_yaml_relative` — `source_root: ../` resolves to parent of config dir +8. `test_source_root_from_yaml_absolute` — `source_root: /abs/path` resolves to absolute path +9. `test_source_root_precedence_cli_over_yaml` — CLI flag wins over YAML `source_root` +10. `test_source_root_precedence_yaml_over_discovery` — YAML `source_root` wins over config dir default +11. `test_source_root_precedence_env_over_yaml` — env var wins over YAML `source_root` +12. `test_existing_behavior_unchanged` — no walk-up, cwd = config dir → identical behavior to today + +### Server integration test (`tests/test_mcp_server_project_root.py`) + +13. `test_project_root_uses_discover_when_env_unset` — `_project_root()` returns discovered config dir when `JAVA_CODEBASE_RAG_SOURCE_ROOT` is unset + +### Microservice scope detection tests (`tests/test_microservice_scope.py`) + +14. `test_detect_microservice_deep_inside` — Deep inside microservice directory detects that microservice +15. `test_detect_microservice_at_microservice_root` — At microservice root detects that microservice +16. `test_detect_microservice_at_system_root` — At system root returns None (no specific scope) +17. `test_detect_microservice_outside_source` — Outside source_root returns None +18. `test_apply_scope_when_filter_none` — No filter provided injects auto-detected scope +19. `test_apply_scope_when_filter_exists_no_microservice` — Filter without microservice gets auto-scope injected +20. `test_apply_scope_preserves_explicit_microservice` — Explicit microservice not overridden +21. 
`test_apply_scope_no_default` — No auto-detected scope leaves filter unchanged + +## Definition of done (PR-1) + +- [ ] `discover_project_root()` works with first-match-wins semantics, stops at `$HOME` (inclusive) +- [ ] `source_root` YAML field is parsed and resolved relative to config dir +- [ ] Precedence chain: CLI > env > YAML > discovery > cwd +- [ ] `_project_root()` in `server.py` uses walk-up when env var is unset +- [ ] `_resolve_lancedb_uri()` in `server.py` uses `_project_root()` instead of raw `Path.cwd()` for fallback +- [ ] CLI commands work from subdirectories (walk-up finds config) +- [ ] `init` emits soft warning when parent config detected +- [ ] `detect_microservice_from_path()` correctly detects microservice from cwd +- [ ] `ScopeManager` detects and caches microservice scope at server startup +- [ ] Tool wrappers apply auto-scope when no explicit microservice filter provided +- [ ] Advisory messages shown when queries span multiple microservices +- [ ] All 21 named tests pass (13 discovery + precedence + 8 microservice scope) +- [ ] Existing test suite passes (no regressions) +- [ ] `mcp.json.example` shows minimal zero-env-var config +- [ ] README and CONFIGURATION docs updated + +## Implementation step list + +| # | Step | File(s) | Done when | +| - | - | - | - | +| 1 | Add `discover_project_root(start)` | `config.py`, `tests/test_config.py` | Tests 1–6 pass | +| 2 | Add `source_root` YAML field parsing and resolution | `config.py`, `tests/test_config.py` | Tests 7–8 pass | +| 3 | Wire precedence chain in `resolve_operator_config()` | `config.py`, `tests/test_config.py` | Tests 9–12 pass | +| 4 | Update `_project_root()` to use walk-up | `server.py`, `tests/test_mcp_server_project_root.py` | Test 13 passes; server resolves source root via walk-up when env var unset | +| 5 | Update `_resolve_lancedb_uri()` to use `_project_root()` fallback | `server.py` | Lance URI and source root derive from same discovered root | +| 6 | Add `init` 
parent-config warning | `cli.py` | `init` prints warning when parent config exists | +| 7 | Add `detect_microservice_from_path()` | `graph_enrich.py`, `tests/test_microservice_scope.py` | Tests 14–17 pass | +| 8 | Add `ScopeManager` class | `server.py`, `tests/test_microservice_scope.py` | Tests 18–21 pass | +| 9 | Wire `ScopeManager` into tool wrappers | `server.py` | Auto-scope applied to queries | +| 10 | Add advisory messages for system-level queries | `server.py` | Advisory shown when no microservice detected | +| 11 | Update `mcp.json.example` | `mcp.json.example` | Shows minimal zero-env-var config | +| 12 | Update README and CONFIGURATION docs | `README.md`, `docs/CONFIGURATION.md` | Walk-up, `source_root`, and auto-scope documented | +| 13 | Run full validation | all | `ruff check` + `pytest tests -v` green | + +**Documentation timing:** +- Documentation updates (step 12) should happen AFTER implementation is complete and tests pass +- This ensures docs accurately reflect the final implementation +- mcp.json.example (step 11) can be updated in parallel with implementation as it's straightforward + +--- + +# Cross-PR risks and mitigations + +N/A — single PR. + +# Out of scope + +- Auto-detecting multiple systems and splitting indexes +- Changing index directory structure +- Global config or project registry +- Changes to indexing, query, or graph-building logic +- `init` command behavior changes beyond the parent-config warning +- Changes to `build_ast_graph.py` or `search_lancedb.py` +- CLI-level microservice auto-scope (MCP server only for now) +- Dynamic microservice scope re-detection if cwd changes during a long-running session (server restart required) +- Config-based microservice boundary overrides (use existing `microservice_roots` YAML field instead) + +# Whole-plan done definition + +1. All 21 named tests pass (13 discovery + precedence, 8 microservice scope). +2. Existing test suite passes without `JAVA_CODEBASE_RAG_RUN_HEAVY`. +3. 
`ruff check .` is clean. +4. CLI and MCP server both resolve source root via walk-up from subdirectories. +5. MCP server auto-detects microservice scope and applies it to queries. +6. `mcp.json.example` shows a zero-env-var configuration. +7. README and CONFIGURATION docs reflect walk-up, `source_root`, and auto-scope behavior. + +# Tracking + +- `PR-1`: _pending_ \ No newline at end of file diff --git a/propose/active/DIRS-HIERARCHY-PROPOSE.md b/propose/active/DIRS-HIERARCHY-PROPOSE.md new file mode 100644 index 00000000..6afaad39 --- /dev/null +++ b/propose/active/DIRS-HIERARCHY-PROPOSE.md @@ -0,0 +1,407 @@ +# DIRS-HIERARCHY — Walk-up config discovery and configurable source root + +**Status**: proposal — not yet implemented. +**Author**: Dmitry Teryaev +**Date**: 2026-06-06 + +## TL;DR + +- **The call**: add walk-up config discovery (like git) so the tool finds `.java-codebase-rag.yml` in any parent directory, and add a `source_root` field to the YAML config so the config can live separately from the source code. +- **Why**: the tool currently couples three things — config file location, source code location, and cwd. All three must be the same directory. Users who organize projects in varied directory structures hit walls: running `init` from a multi-system parent creates a mixed index, using MCP from a microservice subdirectory can't find the config, and placing the config in a separate context directory requires `--source-root` on every invocation. +- **Scope**: config discovery and source root resolution. Both CLI and MCP server use the same walk-up logic, eliminating the need for env vars in `.mcp.json`. No changes to indexing, query, or graph-building logic. No changes to `init` beyond a warning when a parent config exists. +- **Migration**: 1 PR. No breaking changes. Existing workflows where cwd = config dir continue to work identically. Existing `.mcp.json` files with env vars continue to work (env vars are overrides). 
New workflows (running from subdirectories, zero-config MCP) unlock. + +## 1. Problem statement + +Three real user scenarios that break today: + +**User A** — multi-system parent directory: +``` +IdeaProjects/ + .java-codebase-rag.yml + System-A/ microservice-A-1/ microservice-A-2/ + System-B/ microservice-B-1/ microservice-B-2/ +``` +Running `init` from `IdeaProjects/` indexes ALL Java files from all systems into one giant mixed index. The tool doesn't recognize project boundaries. + +**User B** — working from a microservice subdirectory: +``` +IdeaProjects/ + System-C/ + .java-codebase-rag.yml + microservice-C-1/ + microservice-C-2/ +``` +`init` runs correctly from `System-C/`. But then `cd microservice-C-1/` and starting the MCP server — the tool looks for config only in cwd (`microservice-C-1/`), doesn't find it, fails. + +**User B+** — microservice scope leakage (NEW): +After walk-up is implemented, `cd microservice-C-1/` and starting MCP works — the config is found at `System-C/`. However, queries return results from ALL microservices (`microservice-C-1/`, `microservice-C-2/`, etc.) because the index spans the entire system. An agent working inside `microservice-C-1/` sees code from `microservice-C-2/`, which can mislead it about the codebase boundaries and cause incorrect conclusions. + +**User C** — config lives separately from source code: +``` +IdeaProjects/ + System-D/ + system-D-context/ + .java-codebase-rag.yml + microservice-D-1/ + microservice-D-2/ +``` +Config is in `system-D-context/`, code is at `../` via `--source-root`. The `--source-root` flag works but must be passed on every invocation. No way to persist this in the config. + +**Root cause**: `find_yaml_config_file()` only checks the exact `source_root` directory. No walking up. And the config has no `source_root` field, so the only way to point to code elsewhere is the `--source-root` flag or env var. + +## 2. Design principles + +1. 
**cwd independence.** The tool should work from any subdirectory of the project, not just from the directory containing the config. +2. **Config is the anchor.** The presence of `.java-codebase-rag.yml` defines a project boundary. The tool walks up to find it (like git finds `.git`). +3. **Source root is configurable but has a sane default.** Default source root = config file's parent directory. Override via YAML `source_root` field, env var, or CLI flag. +4. **Index follows source root.** The index directory always lives at `<source_root>/.java-codebase-rag/`. Config and source root can be in different places, but the index stays with the code. +5. **No breaking changes.** Existing workflows where cwd = config dir must produce identical behavior. The walk-up is additive — it only fires when the config isn't found in cwd. +6. **Microservice scope should match working context.** When an agent works inside a microservice directory, queries should automatically scope to that microservice. No manual filter required. + +## 3. Proposed solution + +Once walk-up finds the config file, everything else is derivable — no env vars needed: + +``` +config found at System-C/.java-codebase-rag.yml + → source root = System-C/ (or source_root from YAML) + → index dir = System-C/.java-codebase-rag/ +``` + +Both CLI and MCP server follow the same discovery path. The minimal `.mcp.json` becomes: + +```json +{ + "mcpServers": { + "java-codebase-rag": { + "type": "stdio", + "command": "java-codebase-rag-mcp" + } + } +} +``` + +This works because MCP hosts set cwd to the workspace directory at server startup: +- **Claude Code** — cwd = workspace directory. Walk-up from there finds the config. +- **VS Code / Cursor** — cwd = workspace root. Same. +- **Claude Desktop** — cwd is less predictable. Users can still set `JAVA_CODEBASE_RAG_SOURCE_ROOT` as an optional override. + +Both `JAVA_CODEBASE_RAG_SOURCE_ROOT` and `JAVA_CODEBASE_RAG_INDEX_DIR` become **optional overrides** rather than requirements. 
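The discovery-then-derive flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `walk_up_for_config` and `derive_paths` are hypothetical names (the proposal calls the real function `discover_project_root`), and the `home` parameter is an assumption added only so the boundary can be exercised in a test.

```python
from pathlib import Path
from typing import Optional, Tuple

# File names taken from the proposal; the .yml variant is checked first.
CONFIG_NAMES = (".java-codebase-rag.yml", ".java-codebase-rag.yaml")


def walk_up_for_config(start: Path, home: Optional[Path] = None) -> Optional[Path]:
    """Return the closest directory at or above `start` that holds a config file.

    Hypothetical stand-in for the proposed discover_project_root().
    Error handling (unreadable directories, unresolvable home) is omitted.
    """
    boundary = (home if home is not None else Path.home()).resolve()
    current = start.resolve()
    while True:
        if any((current / name).is_file() for name in CONFIG_NAMES):
            return current
        # Inclusive boundary: home itself was checked just above, so stop here.
        # The second condition detects the filesystem root.
        if current == boundary or current.parent == current:
            return None
        current = current.parent


def derive_paths(config_dir: Path) -> Tuple[Path, Path]:
    """Default case only: source root is the config dir, index sits beside the code."""
    source_root = config_dir
    return source_root, source_root / ".java-codebase-rag"
```

Because the first match returns immediately, a nested config closer to cwd always shadows one higher up, which is the first-match-wins rule the proposal relies on.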
+ +### 3.1 Walk-up config discovery + +New function `discover_project_root(start: Path) -> Path | None` in `config.py`: + +**Traversal algorithm:** + +1. **Initialize**: Set `current = start.resolve()` (canonicalize via `Path.resolve()` to handle symlinks) +2. **Check home boundary**: Get `home = Path.home()`. If `home` cannot be resolved, log warning and use filesystem root `/` as boundary +3. **Loop**: While `current` exists and is not past boundary: + a. Check for config files in order: `.java-codebase-rag.yml`, then `.java-codebase-rag.yaml` + b. **If both exist in same directory**: Prefer `.yml` over `.yaml` (establish precedence order) + c. **If config found**: Return `current` (the directory containing the config, not the config file itself) + d. **If `current == home`**: Break (check home itself, then stop — inclusive boundary) + e. **If `current.parent == current`**: Break (reached filesystem root) + f. **Move to parent**: `current = current.parent` +4. **Return None**: No config found + +**Error handling:** + +- **Permission denied on directory**: Log warning at WARNING level, continue to parent +- **Home directory inaccessible**: Log warning, fall back to filesystem root boundary +- **Config file exists but unreadable**: Log error, continue as if not found (same as missing) + +**Stopping conditions:** + +The walk-up stops when any of these conditions is met: +- Config file is found and successfully read +- Current directory equals home directory (after checking it) +- Current directory has no parent (filesystem root reached) + +**First match wins** (closest to cwd): if nested configs exist at multiple levels (e.g. `System-A/.java-codebase-rag.yml` and `IdeaProjects/.java-codebase-rag.yml`), the one closest to cwd is used. This mirrors git's behavior when nested `.git` directories exist. + +**Boundary rationale**: The home directory is the natural project root on user workstations. On CI/CD (`/root`, `/home/runner`) this is equally appropriate. 
Configs above home are almost certainly unrelated to the current project. Use `Path.home()` for cross-platform compatibility (returns `$HOME` on Unix/macOS, `%USERPROFILE%` on Windows). + +**Returns**: The directory containing the config file, or `None` if no config found. + +The function is a pure discovery step — it finds where the config lives, nothing more. It does not parse the config or resolve source roots. + +### 3.2 `source_root` field in config YAML + +New optional top-level field: + +```yaml +# Optional: override where Java source code lives. +# Relative paths resolve relative to the config file's directory. +# Default: the directory containing this config file. +source_root: ../ +``` + +Resolution is straightforward: `Path(config_dir) / source_root`. For the example above, if config is at `system-D-context/.java-codebase-rag.yml`, then `source_root: ../` resolves to `System-D/`. + +**Note on resolution base**: the YAML `source_root` field resolves relative to the config file's directory, while the CLI `--source-root` flag resolves relative to cwd. These are intentionally different resolution bases — the YAML field is a portable declaration ("my code is one level up from this config"), while the CLI flag is an absolute or cwd-relative override. The precedence table in §3.3 handles priority; the resolution base difference is a non-issue because each source resolves independently before comparison. 
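The two resolution bases can be illustrated with a short sketch. The helper names below are hypothetical and exist only for this example; the PR itself performs this resolution inside `resolve_operator_config()` and the CLI, whose exact code is not reproduced here.

```python
from pathlib import Path


def resolve_yaml_source_root(config_dir: Path, value: str) -> Path:
    """YAML `source_root`: relative values are anchored at the config directory.

    Joining an absolute `value` onto `config_dir` yields `value` itself,
    so absolute paths pass through unchanged.
    """
    return (config_dir / value).resolve()


def resolve_cli_source_root(cwd: Path, value: str) -> Path:
    """CLI `--source-root`: relative values are anchored at the working directory."""
    return (cwd / value).resolve()
```

With a config in `System-D/system-D-context/` and `source_root: ../`, the first helper yields `System-D/` no matter where the command runs, while `--source-root ../` passed from `System-D/services/api/` yields `System-D/services/`.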
+ +### 3.3 Full precedence chain for source root + +| Priority | Source | Example | +|---|---|---| +| 1 (highest) | `--source-root` CLI flag | `--source-root /other/path` | +| 2 | `JAVA_CODEBASE_RAG_SOURCE_ROOT` env var | `export JAVA_CODEBASE_RAG_SOURCE_ROOT=/other/path` | +| 3 | `source_root` field in YAML config | `source_root: ../` | +| 4 | Walk-up discovery result (config file's parent dir) | Config at `System-C/.java-codebase-rag.yml` → source root = `System-C/` | +| 5 (lowest) | `Path.cwd()` (unchanged fallback) | No config found anywhere | + +### 3.4 Path resolution base differences + +**Important**: The YAML `source_root` field and the CLI `--source-root` flag resolve relative paths from different bases: + +| Source | Resolution base | Example `../` from `services/api/` | +|---|---|---| +| YAML `source_root: ../` | Config file's directory | If config is at `System-C/.java-codebase-rag.yml`, resolves to `System-C/../` (parent of System-C) | +| CLI `--source-root ../` | Current working directory | If cwd is `services/api/`, resolves to `services/` | +| Env var `JAVA_CODEBASE_RAG_SOURCE_ROOT=../` | Current working directory | If cwd is `services/api/`, resolves to `services/` | + +**How precedence interacts with resolution bases:** + +Each source in the precedence chain resolves independently using its own resolution base: + +1. **CLI flag**: Resolves relative to cwd at invocation time +2. **Env var**: Resolves relative to cwd at server startup/CLI invocation +3. **YAML field**: Resolves relative to config file's directory (discovered via walk-up) +4. **Walk-up result**: Already an absolute path (the config directory) +5. **cwd fallback**: Current working directory at time of resolution + +**Key point**: The precedence chain selects ONE source, then that source is resolved. The different resolution bases do not interact with each other — they apply at different stages of selection and resolution. 
For example, if a YAML `source_root: ../` is selected, it resolves relative to the config dir, regardless of what cwd might be. + +**Why this design works:** +- **YAML field** is a portable declaration tied to the config file's location ("my code is one level up from this config") +- **CLI flag** is a runtime override relative to where the command is executed +- **Env var** follows CLI convention (cwd-relative) for consistency + +Example showing the difference: +``` +# Directory structure +System-C/ + .java-codebase-rag.yml # Contains: source_root: src/ + services/ + api/ # Cwd is here + +# YAML source_root resolves to: +System-C/src/ + +# CLI --source-root ../ resolves to (from System-C/services/api/): +System-C/services/ +``` + +### 3.5 Where changes happen + +**`config.py`**: +- Add `discover_project_root(start: Path) -> Path | None` +- Add `find_config_dir(source_root: Path | None) -> Path` — returns the effective project root by combining walk-up discovery with the precedence chain +- Update `resolve_operator_config()` to read `source_root` from YAML and resolve it relative to config dir +- When `source_root` param is `None` (no CLI flag, no env var), the function discovers the project root via walk-up, then reads `source_root` from the discovered YAML, then falls back to cwd + +**`server.py`**: +- Update `_project_root()` to call `discover_project_root()` before falling back to cwd. Env var still takes precedence. +- Update `_resolve_lancedb_uri()` to use `_project_root()` instead of raw `Path.cwd()` when `JAVA_CODEBASE_RAG_INDEX_DIR` is unset. This ensures consistency: index dir and source root derive from the same discovered location. +- **When `JAVA_CODEBASE_RAG_INDEX_DIR` is set but `JAVA_CODEBASE_RAG_SOURCE_ROOT` is not**: The index dir uses the env var value (absolute path or resolved relative to cwd), while source_root uses walk-up discovery. 
This is intentional — the index dir env var is an explicit override for where the index lives, independent of source root discovery. + +**`cli.py`**: +- Update `_resolved_from_ns()` to use walk-up discovery when `--source-root` is not provided. CLI flag still takes precedence. + +**`init` command**: no behavior change. The `init` command creates config + index in the specified directory as before. Walk-up only helps find existing configs. Add a soft warning if a parent config is detected. + +### 3.8 Microservice auto-scope + +**Problem:** When working from a microservice subdirectory, queries return results from the entire system index, which includes all microservices. This can mislead agents by showing code outside their current context. + +**Solution:** Automatically detect the current microservice from cwd and apply it as a filter to all queries. + +**Detection logic (Option A - reuse indexing logic):** + +Microservice detection uses the same logic as indexing, with an important addition for the source_root level case: + +1. **Check if cwd equals source_root**: If `cwd.resolve() == source_root.resolve()`, return `None` (at system level, no specific scope). This is NEW behavior specific to auto-scope — it ensures that working at the project root shows all microservices rather than arbitrarily scoping to one. +2. **Check if cwd outside source_root**: If cwd is not under source_root, return `None` (outside project context) +3. **Walk up to find outermost build marker**: From cwd, walk up to find the outermost build marker (pom.xml, build.gradle, etc.) under source_root +4. **Resolve to microservice name**: Use `microservice_for_path()` from `graph_enrich.py` with the detected build marker path +5. **Check YAML overrides**: Apply `microservice_roots` YAML config if present + +**Why the source_root check is needed:** + +During indexing, `microservice_for_path()` returns the first path segment when no build marker is found. 
This is appropriate for indexing (every file belongs to some microservice). But for query-time scoping, working at the project root should show ALL microservices, not arbitrarily scope to the first one. The source_root equality check implements this semantic difference. + +**Scope behavior (Option B - scoped inside, all at root):** + +- When inside a microservice directory → auto-scope queries to that microservice +- When at source_root level → queries span all microservices (with advisory message) +- Explicit `filter={"microservice": "..."}` always overrides auto-detected scope + +**Implementation (Approach 1 - server-level scope injection):** + +The MCP server detects microservice at startup and caches it as "default_scope". For each query: + +1. Check if user provided explicit `microservice` filter +2. If yes → use user's filter (explicit wins) +3. If no → inject auto-detected scope + +**Components:** + +1. **`detect_microservice_from_path(cwd, source_root)`** in `graph_enrich.py` + - Returns microservice name or None (at source_root level) + - Reuses existing `microservice_for_path()` logic + +2. **`ScopeManager` class** in `server.py` + - Initialized after source_root resolution + - Detects and caches default_scope at server startup + - Logs detection at INFO level + - `apply_auto_scope(filter)` method injects scope when needed + - **Scope lifecycle**: Scope is detected once at server startup and cached for the server's lifetime. If the user changes directories during a long-running MCP session, the scope will NOT be re-detected automatically. Users who change directories should restart the MCP server to get updated scope detection. This is a known limitation documented in README. + +3. 
**Tool wrapper integration** in `server.py` + - Each MCP tool wrapper calls `apply_auto_scope()` before passing filter to underlying function + +**Error handling:** + +| Scenario | Behavior | +|----------|----------| +| source_root cannot be resolved | Log warning, continue with no auto-scope | +| microservice detection fails | Log warning, continue with no auto-scope | +| cwd outside source_root | No scope applied (None) | +| User provides invalid microservice | Existing validation catches it | + +**Logging and advisories:** + +- INFO-level logging shows detected microservice at startup +- Exact log messages: + - When scope detected: `[scope] Detected microservice: {microservice_name}` then `[scope] Queries scoped to {microservice_name}` + - When no scope (at source_root): `[scope] No microservice detected (at project root)` then `[scope] Queries will span all microservices` +- Advisory message (shown in MCP response advisories field when at source_root and no explicit microservice filter): `Query results span multiple microservices. Use filter='{"microservice": "..."}' to scope to a specific service.` + +### 3.6 Error messages + +**No config found (MCP/query/index commands)**: +> No `.java-codebase-rag.yml` found in `[cwd]` or any parent directory (stopped at home). Run `java-codebase-rag init` in your project root first. + +**`init` finds existing config in parent (soft warning)**: +> Warning: found existing config at `[parent]/.java-codebase-rag.yml`. Creating a new project here will create a separate index. + +### 3.7 What each user scenario looks like after + +**User A** — runs `init` from each `System-X/` directory separately. Then uses MCP from any subdirectory — walk-up finds the config for the current system. No more mixed indexes. + +**User B** — runs `init` from `System-C/`. Then `cd`s to `microservice-C-1/` and starts MCP. Walk-up finds `System-C/.java-codebase-rag.yml`, source root defaults to `System-C/`. Works. 
+ +**User C** — creates config at `system-D-context/.java-codebase-rag.yml` with `source_root: ../`. Runs `init` from `system-D-context/`. Walk-up from any subdirectory finds the config. Source root = `System-D/`. Index at `System-D/.java-codebase-rag/`. + +**User B+** — runs `init` from `System-C/`. Then `cd`s to `microservice-C-1/` and starts MCP. Walk-up finds `System-C/.java-codebase-rag.yml`, source root defaults to `System-C/`. Microservice auto-scope detects `microservice-C-1` from cwd and applies it automatically. Queries return only results from `microservice-C-1`. Agent sees correct codebase boundaries. + +## 4. Scope + +- Config file discovery via walk-up +- `source_root` field in YAML config +- Updated precedence chain +- Integration in CLI and MCP server +- `JAVA_CODEBASE_RAG_INDEX_DIR` and `JAVA_CODEBASE_RAG_SOURCE_ROOT` env vars become optional (still supported as overrides) +- `mcp.json.example` updated to show minimal zero-env-var config +- Clear error messages when config is not found +- Soft warning during `init` when a parent config exists +- Logging of discovered config file path at INFO level (for debugging discovery issues) +- Optional `--debug` / `--verbose` flag that prints the full discovery path and resolution chain +- README documentation of YAML vs CLI path resolution base differences with examples +- **NEW:** Microservice auto-scope detection and application +- **NEW:** `ScopeManager` class in server.py +- **NEW:** `detect_microservice_from_path()` function in graph_enrich.py +- **NEW:** Advisory messages when queries span multiple microservices + +## 5. Schema / Ontology / Re-index impact + +- Ontology bump: not required +- Re-index required: no. The index structure and content are unchanged. +- Config surface changes: new optional `source_root` field in YAML. Fully backward-compatible — existing configs without this field continue to work identically. + +## 6. 
Tests / Validation + +- `test_discover_project_root_finds_config_in_cwd` — config in cwd, returns cwd +- `test_discover_project_root_walks_up` — config in parent, returns parent +- `test_discover_project_root_stops_at_home` — config in $HOME is found (inclusive boundary), returns $HOME; directories above $HOME are never checked +- `test_discover_project_root_stops_at_windows_userprofile` — config in %USERPROFILE% is found (inclusive boundary), returns %USERPROFILE% (Windows-specific) +- `test_discover_project_root_not_found` — no config anywhere, returns None +- `test_discover_project_root_cross_platform_home` — `Path.home()` correctly identifies home on both Unix and Windows +- `test_source_root_from_yaml_relative` — `source_root: ../` resolves to parent of config dir +- `test_source_root_from_yaml_absolute` — `source_root: /abs/path` resolves to absolute path +- `test_source_root_precedence_cli_over_yaml` — CLI flag wins over YAML `source_root` +- `test_source_root_precedence_yaml_over_discovery` — YAML `source_root` wins over config dir default +- `test_source_root_precedence_env_over_yaml` — env var wins over YAML `source_root` +- `test_existing_behavior_unchanged` — no walk-up, cwd = config dir → identical behavior to today +- `test_discover_project_root_with_symlinks` — symlinked config dirs are handled correctly +- `test_yaml_relative_path_resolution_base` — YAML `source_root: ../` resolves relative to config dir, not cwd +- `test_cli_flag_resolution_base` — `--source-root ../` resolves relative to cwd, not config dir +- **NEW:** `test_detect_microservice_deep_inside` — Deep inside microservice directory detects that microservice +- **NEW:** `test_detect_microservice_at_microservice_root` — At microservice root detects that microservice +- **NEW:** `test_detect_microservice_at_system_root` — At system root returns None (no specific scope) +- **NEW:** `test_detect_microservice_outside_source` — Outside source_root returns None +- **NEW:** `test_apply_scope_when_filter_none` — No filter provided injects auto-detected scope +- **NEW:** 
`test_apply_scope_when_filter_exists_no_microservice` — Filter without microservice gets auto-scope injected +- **NEW:** `test_apply_scope_preserves_explicit_microservice` — Explicit microservice not overridden +- **NEW:** `test_apply_scope_no_default` — No auto-detected scope leaves filter unchanged + +## 7. Open questions + +None — all key decisions resolved during brainstorming. + +## 8. Out of scope + +- Auto-detecting multiple systems and splitting indexes +- Changing index directory structure +- Global config or project registry +- Changes to indexing, query, or graph-building logic +- `init` command behavior changes (beyond the parent-config warning) +- CLI-level microservice auto-scope (MCP server only for now) +- Dynamic microservice scope re-detection if cwd changes during a long-running session (server restart required) +- Config-based microservice boundary overrides (use existing `microservice_roots` YAML field instead) + +## 9. Risks and mitigations + +| Risk | Mitigation | +|---|---| +| Walk-up finds wrong config in a shared parent (e.g. `IdeaProjects/.java-codebase-rag.yml` when user meant `System-A/`) | `init` warns when a parent config exists. First-match-wins means the closest config is always preferred. If a stray config exists at a high level, it's only found when no closer config exists. | +| Symlink cycles during walk-up | `Path.resolve()` canonicalizes the path before walking. The `parent` chain on resolved paths cannot cycle. | +| Symlinked config directories skip intended config | Use `Path.resolve()` selectively — resolve for cycle detection but preserve the original logical path for config lookup. This ensures configs in symlinked dirs are still found. | +| Performance of filesystem stat calls in deep directory trees | Each step is a single `is_file()` check. Even at 20 levels deep, this is negligible compared to the embedding/indexing work the tool already does. 
| +| Home directory boundary varies by platform | Use `Path.home()` which returns the user home directory cross-platform (`$HOME` on Unix/macOS, `%USERPROFILE%` on Windows). The boundary is checked inclusively. | +| Nested configs create confusion (which one is active?) | First-match-wins is simple and matches git's behavior. The tool logs at INFO level which config file was discovered to aid debugging. | + +## 10. Decisions taken + +1. **First match wins** — closest config to cwd, not "most specific" or "deepest". Matches git behavior. No heuristic for picking among multiple configs. +2. **Home directory is inclusive boundary** — check home itself, don't go past it. Use `Path.home()` for cross-platform compatibility (`$HOME` on Unix/macOS, `%USERPROFILE%` on Windows). Avoids finding configs in `/` or system directories. +3. **YAML field named `source_root`** — same name as the CLI flag for conceptual consistency, despite different resolution bases. The alternative (`project_root`, `code_dir`) would add a new concept where none is needed. +4. **Walk-up is a separate pre-step** — not integrated into `resolve_operator_config()`. Cleaner separation, easier to test, lower risk to existing resolution logic. +5. **No changes to `init` beyond the parent-config warning** — `init` creates config + index as before and only adds a soft warning when a parent config is detected. The walk-up only helps find existing configs from subdirectories. +6. **No `--walk-up` opt-out flag** — walk-up is always-on when no explicit source root is given. If a user hits the wrong config, the fix is to move or remove the stray config file, not to add a flag. +7. **Config discovery is logged** — INFO-level log message shows which config file was discovered. Optional `--debug` flag prints full discovery path and resolution chain for troubleshooting. +8. **Microservice detection reuses indexing logic** — Use the same `microservice_for_path()` function that determines microservice boundaries during indexing. Ensures consistency between indexing and querying. +9. 
**Microservice scope is scoped inside, all at root** — Auto-scope applies when inside a microservice directory. At source_root level, queries span all microservices with an advisory message. +10. **Server-level scope injection** — MCP server detects microservice at startup and injects into queries. Explicit filters always override auto-detected scope. + +## 11. Migration plan — 1 PR + +Single PR containing: +1. `discover_project_root()` function in `config.py` using `Path.home()` for cross-platform home detection +2. `source_root` YAML field parsing in `resolve_operator_config()` +3. Updated `_project_root()` in `server.py` +4. Updated `_resolved_from_ns()` in `cli.py` +5. Index dir auto-derived from discovered source root (no env var needed) +6. Soft warning in `init` when parent config detected +7. INFO-level logging of discovered config file path +8. Optional `--debug` / `--verbose` flag that prints full discovery path and resolution chain +9. All tests from §6 (including Windows-specific tests) +10. `mcp.json.example` updated to show minimal zero-env-var config +11. README update documenting the new behavior and YAML vs CLI path resolution differences with examples +12. **NEW:** `detect_microservice_from_path()` function in `graph_enrich.py` +13. **NEW:** `ScopeManager` class in `server.py` +14. **NEW:** Microservice auto-scope integration in tool wrappers +15. **NEW:** Advisory messages when queries span multiple microservices +16. 
**NEW:** All microservice auto-scope tests from §6 diff --git a/server.py b/server.py index 31f67306..65f737c8 100644 --- a/server.py +++ b/server.py @@ -16,7 +16,12 @@ emit_vectors_finish, emit_vectors_start, ) -from java_codebase_rag.config import emit_legacy_env_hints_if_present, resolved_sbert_model_for_process_env, resolve_operator_config +from java_codebase_rag.config import ( + discover_project_root, + emit_legacy_env_hints_if_present, + resolved_sbert_model_for_process_env, + resolve_operator_config, +) from kuzu_queries import KuzuGraph, resolve_kuzu_path from mcp.server.fastmcp import FastMCP from pydantic import BaseModel, Field @@ -91,10 +96,49 @@ class IndexInfoOutput(BaseModel): graph: GraphMetaOutput +# Module-level scope manager, initialized in main() +_scope_manager: ScopeManager | None = None + + +class ScopeManager: + """Manages automatic microservice scope detection and injection.""" + + def __init__(self, source_root: Path): + self.source_root = source_root + self.default_scope: str | None = self._detect_scope() + self._log_detection() + + def _detect_scope(self) -> str | None: + from graph_enrich import detect_microservice_from_path + return detect_microservice_from_path(Path.cwd(), self.source_root) + + def _log_detection(self) -> None: + if self.default_scope: + print(f"[scope] Detected microservice: {self.default_scope}", file=sys.stderr) + print(f"[scope] Queries scoped to {self.default_scope}", file=sys.stderr) + else: + print("[scope] No microservice detected (at project root)", file=sys.stderr) + print("[scope] Queries will span all microservices", file=sys.stderr) + + def apply_auto_scope(self, node_filter: dict[str, Any] | None) -> dict[str, Any] | None: + """Apply auto-detected scope to filter if no explicit microservice is set.""" + if self.default_scope is None: + return node_filter + # Convert to dict for manipulation + if node_filter is None: + filter_dict = {} + else: + filter_dict = dict(node_filter) + # Only inject if user 
didn't specify microservice + if "microservice" not in filter_dict: + filter_dict["microservice"] = self.default_scope + return filter_dict + + def _resolve_lancedb_uri() -> str: raw = os.environ.get("JAVA_CODEBASE_RAG_INDEX_DIR", "").strip() if not raw: - raw = str((Path.cwd() / ".java-codebase-rag").resolve()) + raw = str((_project_root() / ".java-codebase-rag").resolve()) p = Path(raw).expanduser() if not str(raw).startswith(("s3://", "gs://", "az://")): try: @@ -108,7 +152,8 @@ def _project_root() -> Path: env = os.environ.get("JAVA_CODEBASE_RAG_SOURCE_ROOT", "").strip() if env: return Path(env).expanduser().resolve() - return Path.cwd().resolve() + discovered = discover_project_root(Path.cwd()) + return discovered if discovered is not None else Path.cwd().resolve() def _cocoindex_subprocess_env(project_root: Path) -> dict[str, str]: @@ -370,6 +415,7 @@ async def search( ), ), ) -> mcp_v2.SearchOutput: + scoped_filter = _scope_manager.apply_auto_scope(filter) if _scope_manager else filter return await asyncio.to_thread( mcp_v2.search_v2, query, @@ -378,7 +424,7 @@ async def search( limit, offset, path_contains, - filter, + scoped_filter, None, ) @@ -413,7 +459,8 @@ async def find( limit: int = Field(default=25, ge=1, le=500, description="Max nodes to return"), offset: int = Field(default=0, ge=0, le=499, description="Skip this many nodes (pagination)"), ) -> mcp_v2.FindOutput: - return await asyncio.to_thread(mcp_v2.find_v2, kind, filter, limit, offset, None) + scoped_filter = _scope_manager.apply_auto_scope(filter) if _scope_manager else filter + return await asyncio.to_thread(mcp_v2.find_v2, kind, scoped_filter, limit, offset, None) @mcp.tool( name="describe", @@ -525,6 +572,7 @@ async def neighbors( ), ), ) -> mcp_v2.NeighborsOutput: + scoped_filter = _scope_manager.apply_auto_scope(filter) if _scope_manager else filter return await asyncio.to_thread( mcp_v2.neighbors_v2, ids, @@ -532,7 +580,7 @@ async def neighbors( edge_types, limit, offset, - filter, + 
scoped_filter, edge_filter, include_unresolved, dedup_calls,
@@ -580,6 +628,10 @@ def main() -> None:
     cfg.apply_to_os_environ()
     mcp_v2.set_hints_enabled(cfg.hints_enabled)
+    # Initialize scope manager for automatic microservice detection
+    global _scope_manager
+    _scope_manager = ScopeManager(cfg.source_root)
+
     asyncio.run(create_mcp_server().run_stdio_async())
diff --git a/tests/test_config.py b/tests/test_config.py
new file mode 100644
index 00000000..75035725
--- /dev/null
+++ b/tests/test_config.py
@@ -0,0 +1,185 @@
+"""Tests for config discovery and resolution logic."""
+
+from pathlib import Path
+from java_codebase_rag.config import (
+    discover_project_root,
+    YAML_CONFIG_FILENAMES,
+    resolve_operator_config,
+)
+
+
+class TestDiscoverProjectRoot:
+    """Tests for discover_project_root walk-up behavior."""
+
+    def test_discover_project_root_finds_config_in_cwd(self, tmp_path):
+        """Config in cwd returns cwd."""
+        config_file = tmp_path / YAML_CONFIG_FILENAMES[0]
+        config_file.write_text("# test config")
+
+        result = discover_project_root(tmp_path)
+        assert result == tmp_path
+
+    def test_discover_project_root_walks_up(self, tmp_path):
+        """Config in parent returns parent."""
+        subdir = tmp_path / "subdir"
+        subdir.mkdir()
+        config_file = tmp_path / YAML_CONFIG_FILENAMES[0]
+        config_file.write_text("# test config")
+
+        result = discover_project_root(subdir)
+        assert result == tmp_path
+
+    def test_discover_project_root_stops_at_home_boundary(self, tmp_path, monkeypatch):
+        """Config at $HOME itself is found when walking up from subdirectory."""
+        # Create a fake home under tmp_path
+        fake_home = tmp_path / "home"
+        fake_home.mkdir()
+        project_dir = fake_home / "project"
+        project_dir.mkdir()
+
+        config_file = fake_home / YAML_CONFIG_FILENAMES[0]
+        config_file.write_text("# test config at home")
+
+        # Mock HOME to point to our fake home
+        monkeypatch.setenv("HOME", str(fake_home))
+
+        result = discover_project_root(project_dir)
+        assert result == fake_home
+
+    def test_discover_project_root_not_found_above_home(self, tmp_path, monkeypatch):
+        """No config anywhere under $HOME returns None."""
+        fake_home = tmp_path / "home"
+        fake_home.mkdir()
+        project_dir = fake_home / "project"
+        project_dir.mkdir()
+
+        monkeypatch.setenv("HOME", str(fake_home))
+
+        result = discover_project_root(project_dir)
+        assert result is None
+
+    def test_discover_project_root_not_found(self, tmp_path):
+        """No config anywhere returns None."""
+        result = discover_project_root(tmp_path)
+        assert result is None
+
+    def test_discover_project_root_first_match_wins(self, tmp_path):
+        """Configs at two levels - closest to cwd wins."""
+        subdir = tmp_path / "subdir"
+        subdir.mkdir()
+        subsubdir = subdir / "subsub"
+        subsubdir.mkdir()
+
+        # Config at both levels
+        parent_config = tmp_path / YAML_CONFIG_FILENAMES[0]
+        parent_config.write_text("# parent config")
+        child_config = subdir / YAML_CONFIG_FILENAMES[1]  # Use .yaml variant
+        child_config.write_text("# child config")
+
+        result = discover_project_root(subsubdir)
+        # Should find the closest config (subdir), not the parent (tmp_path)
+        assert result == subdir
+
+
+class TestSourceRootFromYaml:
+    """Tests for source_root YAML field parsing and resolution."""
+
+    def test_source_root_from_yaml_relative(self, tmp_path, monkeypatch):
+        """source_root: ../ resolves to parent of config dir."""
+        # Clean environment from conftest.py session fixture
+        monkeypatch.delenv("JAVA_CODEBASE_RAG_INDEX_DIR", raising=False)
+        monkeypatch.delenv("JAVA_CODEBASE_RAG_SOURCE_ROOT", raising=False)
+
+        config_file = tmp_path / YAML_CONFIG_FILENAMES[0]
+        config_file.write_text("source_root: ../")
+
+        # Change cwd to tmp_path so walk-up finds this config
+        monkeypatch.chdir(tmp_path)
+
+        # source_root=None triggers walk-up discovery + YAML parsing
+        result = resolve_operator_config(source_root=None)
+        # source_root should be the parent of tmp_path
+        assert result.source_root == tmp_path.parent
+
+    def test_source_root_from_yaml_absolute(self, tmp_path, monkeypatch):
+        """source_root: /abs/path resolves to absolute path."""
+        # Clean environment from conftest.py session fixture
+        monkeypatch.delenv("JAVA_CODEBASE_RAG_INDEX_DIR", raising=False)
+        monkeypatch.delenv("JAVA_CODEBASE_RAG_SOURCE_ROOT", raising=False)
+
+        config_file = tmp_path / YAML_CONFIG_FILENAMES[0]
+        absolute_path = "/some/absolute/path"
+        config_file.write_text(f"source_root: {absolute_path}")
+
+        # Change cwd to tmp_path so walk-up finds this config
+        monkeypatch.chdir(tmp_path)
+
+        # source_root=None triggers walk-up discovery + YAML parsing
+        result = resolve_operator_config(source_root=None)
+        assert result.source_root == Path(absolute_path)
+
+
+class TestSourceRootPrecedence:
+    """Tests for source_root precedence chain."""
+
+    def test_source_root_precedence_cli_over_yaml(self, tmp_path, monkeypatch):
+        """CLI flag wins over YAML source_root."""
+        config_file = tmp_path / YAML_CONFIG_FILENAMES[0]
+        config_file.write_text("source_root: /yaml/path")
+
+        cli_root = tmp_path / "cli_root"
+        cli_root.mkdir()
+
+        result = resolve_operator_config(source_root=cli_root)
+        # CLI flag should win
+        assert result.source_root == cli_root
+
+    def test_source_root_precedence_yaml_over_discovery(self, tmp_path, monkeypatch):
+        """YAML source_root wins over config dir default."""
+        # Clean environment from conftest.py session fixture
+        monkeypatch.delenv("JAVA_CODEBASE_RAG_INDEX_DIR", raising=False)
+        monkeypatch.delenv("JAVA_CODEBASE_RAG_SOURCE_ROOT", raising=False)
+
+        config_file = tmp_path / YAML_CONFIG_FILENAMES[0]
+        config_file.write_text("source_root: /yaml/root")
+
+        # Change cwd to tmp_path so walk-up finds this config
+        monkeypatch.chdir(tmp_path)
+
+        # source_root=None triggers walk-up discovery
+        result = resolve_operator_config(source_root=None)
+        # YAML should override the discovered config dir
+        assert result.source_root == Path("/yaml/root")
+
+    def test_source_root_precedence_env_over_yaml(self, tmp_path, monkeypatch):
+        """env var wins over YAML source_root."""
+        config_file = tmp_path / YAML_CONFIG_FILENAMES[0]
+        config_file.write_text("source_root: /yaml/path")
+
+        env_root = tmp_path / "env_root"
+        env_root.mkdir()
+        monkeypatch.setenv("JAVA_CODEBASE_RAG_SOURCE_ROOT", str(env_root))
+
+        result = resolve_operator_config(source_root=None)
+        # Env var should win
+        assert result.source_root == env_root
+
+    def test_existing_behavior_unchanged(self, tmp_path, monkeypatch):
+        """No walk-up, cwd = config dir → identical behavior to today."""
+        # Clean environment from conftest.py session fixture
+        monkeypatch.delenv("JAVA_CODEBASE_RAG_INDEX_DIR", raising=False)
+        monkeypatch.delenv("JAVA_CODEBASE_RAG_SOURCE_ROOT", raising=False)
+
+        # Create a config at cwd
+        config_file = tmp_path / YAML_CONFIG_FILENAMES[0]
+        config_file.write_text("# test config")
+
+        # Set cwd to tmp_path
+        monkeypatch.chdir(tmp_path)
+
+        # Call with source_root=tmp_path (old behavior: explicit root)
+        result = resolve_operator_config(source_root=tmp_path)
+        assert result.source_root == tmp_path
+
+        # Also test that index_dir derives from source_root
+        assert result.index_dir == tmp_path / ".java-codebase-rag"
diff --git a/tests/test_mcp_server_project_root.py b/tests/test_mcp_server_project_root.py
new file mode 100644
index 00000000..48d90d8d
--- /dev/null
+++ b/tests/test_mcp_server_project_root.py
@@ -0,0 +1,25 @@
+"""Tests for server.py _project_root() function in the MCP server context."""
+
+from java_codebase_rag.config import YAML_CONFIG_FILENAMES
+
+
+class TestProjectRoot:
+    """Tests for _project_root() walk-up behavior."""
+
+    def test_project_root_uses_discover_when_env_unset(self, tmp_path, monkeypatch):
+        """_project_root() returns discovered config dir when JAVA_CODEBASE_RAG_SOURCE_ROOT is unset."""
+        # Ensure env var is unset
+        monkeypatch.delenv("JAVA_CODEBASE_RAG_SOURCE_ROOT", raising=False)
+
+        # Create a config file
+        config_file = tmp_path / YAML_CONFIG_FILENAMES[0]
+        config_file.write_text("# test config")
+
+        # Change cwd to tmp_path
+        monkeypatch.chdir(tmp_path)
+
+        # Import _project_root after setting up the environment
+        from server import _project_root
+
+        result = _project_root()
+        assert result == tmp_path
diff --git a/tests/test_microservice_scope.py b/tests/test_microservice_scope.py
new file mode 100644
index 00000000..0200b59e
--- /dev/null
+++ b/tests/test_microservice_scope.py
@@ -0,0 +1,135 @@
+"""Tests for microservice scope detection and ScopeManager."""
+
+from graph_enrich import detect_microservice_from_path
+
+
+class TestDetectMicroserviceFromPath:
+    """Tests for detect_microservice_from_path() function."""
+
+    def test_detect_microservice_deep_inside(self, tmp_path):
+        """Deep inside microservice directory detects that microservice."""
+        # Create a microservice structure
+        ms_dir = tmp_path / "microservice-a"
+        ms_dir.mkdir()
+        sub_dir = ms_dir / "src" / "main"
+        sub_dir.mkdir(parents=True)
+
+        # Add a build marker to the microservice directory
+        (ms_dir / "pom.xml").write_text("")
+
+        result = detect_microservice_from_path(sub_dir, tmp_path)
+        assert result == "microservice-a"
+
+    def test_detect_microservice_at_microservice_root(self, tmp_path):
+        """Directly under the microservice root detects that microservice."""
+        ms_dir = tmp_path / "microservice-b"
+        ms_dir.mkdir()
+
+        # Add a build marker
+        (ms_dir / "build.gradle").write_text("plugins { id 'java' }")
+
+        # Use a subdirectory inside the microservice (not the root itself)
+        sub_dir = ms_dir / "src"
+        sub_dir.mkdir()
+
+        result = detect_microservice_from_path(sub_dir, tmp_path)
+        assert result == "microservice-b"
+
+    def test_detect_microservice_at_system_root(self, tmp_path):
+        """At system root returns None (no specific scope)."""
+        result = detect_microservice_from_path(tmp_path, tmp_path)
+        assert result is None
+
+    def test_detect_microservice_outside_source(self, tmp_path):
+        """Outside source_root returns None."""
+        outside_dir = tmp_path.parent / "outside"
+        outside_dir.mkdir(parents=True, exist_ok=True)
+
+        result = detect_microservice_from_path(outside_dir, tmp_path)
+        assert result is None
+
+
+class TestScopeManager:
+    """Tests for ScopeManager class."""
+
+    def test_apply_scope_when_filter_none(self, tmp_path):
+        """No filter provided injects auto-detected scope."""
+        # Create a microservice structure
+        ms_dir = tmp_path / "microservice-a"
+        ms_dir.mkdir(parents=True, exist_ok=True)
+        (ms_dir / "pom.xml").write_text("")
+
+        from server import ScopeManager
+        mgr = ScopeManager(tmp_path)
+        mgr.default_scope = "microservice-a"  # Simulate detection
+
+        result = mgr.apply_auto_scope(None)
+        assert result == {"microservice": "microservice-a"}
+
+    def test_apply_scope_when_filter_exists_no_microservice(self, tmp_path):
+        """Filter without microservice gets auto-scope injected."""
+        from server import ScopeManager
+        mgr = ScopeManager(tmp_path)
+        mgr.default_scope = "microservice-b"  # Simulate detection
+
+        filter_dict = {"role": "Controller"}
+        result = mgr.apply_auto_scope(filter_dict)
+        assert result == {"role": "Controller", "microservice": "microservice-b"}
+
+    def test_apply_scope_preserves_explicit_microservice(self, tmp_path):
+        """Explicit microservice not overridden."""
+        from server import ScopeManager
+        mgr = ScopeManager(tmp_path)
+        mgr.default_scope = "microservice-a"  # Simulate detection
+
+        filter_dict = {"microservice": "microservice-c"}
+        result = mgr.apply_auto_scope(filter_dict)
+        assert result == {"microservice": "microservice-c"}
+
+    def test_apply_scope_no_default(self, tmp_path):
+        """No auto-detected scope leaves filter unchanged."""
+        from server import ScopeManager
+        mgr = ScopeManager(tmp_path)
+        mgr.default_scope = None  # No detection
+
+        filter_dict = {"role": "Controller"}
+        result = mgr.apply_auto_scope(filter_dict)
+        assert result == {"role": "Controller"}
+
+    def test_detect_scope_with_yaml_overrides(self, tmp_path):
+        """Test that detect_microservice_from_path respects YAML microservice_roots."""
+        # Create a project structure with a YAML config that specifies microservice_roots
+        config_file = tmp_path / ".java-codebase-rag.yml"
+        config_file.write_text("microservice_roots:\n - custom-ms-name\n")
+
+        # Create a directory that matches the override name (but no build marker)
+        custom_ms_dir = tmp_path / "custom-ms-name"
+        custom_ms_dir.mkdir()
+
+        # Even without a build marker, the YAML override should detect this as a microservice
+        from graph_enrich import detect_microservice_from_path
+        result = detect_microservice_from_path(custom_ms_dir, tmp_path)
+
+        # Should detect the microservice based on YAML override
+        assert result == "custom-ms-name"
+
+    def test_detect_scope_integration(self, tmp_path):
+        """Test real detection flow: ScopeManager.__init__ → detect_microservice_from_path → microservice_for_path."""
+        # Create a microservice structure
+        ms_dir = tmp_path / "microservice-a"
+        ms_dir.mkdir(parents=True, exist_ok=True)
+        (ms_dir / "pom.xml").write_text("")
+
+        # Create a ScopeManager with real detection (no manual override)
+        from server import ScopeManager
+        mgr = ScopeManager(tmp_path)
+
+        # The detection should have found the microservice
+        # (assuming we're at the project root, not inside the microservice)
+        # When at tmp_path (project root), default_scope should be None
+        assert mgr.default_scope is None
+
+        # Test that apply_auto_scope doesn't inject when no scope detected
+        filter_dict = {"role": "Controller"}
+        result = mgr.apply_auto_scope(filter_dict)
+        assert result == {"role": "Controller"}
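
Reviewer note: the `TestScopeManager` cases above pin down a small merge contract for `apply_auto_scope`. The sketch below is one possible reading of that contract, not the code in `server.py`. `ScopeManagerSketch` is a hypothetical stand-in, its constructor takes the scope directly instead of detecting it from a path, and returning `None` when there is neither a filter nor a detected scope is an assumption the tests do not cover.

```python
from typing import Optional


class ScopeManagerSketch:
    """Hypothetical stand-in for ScopeManager; not the project's implementation."""

    def __init__(self, default_scope: Optional[str] = None) -> None:
        # The real class is constructed from a source root and detects the
        # scope itself; here the scope is passed in directly (assumption).
        self.default_scope = default_scope

    def apply_auto_scope(self, filter_dict: Optional[dict]) -> Optional[dict]:
        # No detected scope: hand the caller's filter back untouched.
        if self.default_scope is None:
            return filter_dict
        # Copy so the caller's dict is never mutated.
        merged = dict(filter_dict or {})
        # setdefault keeps an explicit "microservice" key if one was given.
        merged.setdefault("microservice", self.default_scope)
        return merged


mgr = ScopeManagerSketch("microservice-a")
print(mgr.apply_auto_scope({"role": "Controller"}))
```

Copying before merging means a filter object reused across queries never picks up a stale scope, which may or may not match what the real implementation does.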