Skip to content

Commit 21b7366

Browse files
committed
Merge remote-tracking branch 'origin/main' into claude/security-code-review-C1C5t
2 parents 798e93a + f760d87 commit 21b7366

41 files changed

Lines changed: 1089 additions & 213 deletions

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

CHANGELOG.md

Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,21 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
99

1010
### Added
1111

12+
- **Mypy: type analyzer, shared, agents, badges, worker_uploads; introduce shared protocols** (Issue #1335): Brought the five smaller, interface-rich target packages over the ≥70% return-annotation bar called for by the issue and seeded `opencontractserver/types/protocols.py` with the four protocols requested in the scope:
13+
- `VectorStoreProtocol` — minimum surface (`search` / `async_search`) implemented by `CoreAnnotationVectorStore` (`opencontractserver/llms/vector_stores/core_vector_stores.py`); imported and re-exported from that module so consumers can annotate against the protocol rather than the concrete dataclass.
14+
- `PipelineComponentProtocol``title` / `description` / `author` / `dependencies` surface that the pipeline registry duck-types against; imported from `opencontractserver/pipeline/base/base_component.py` so any future parser/embedder/thumbnailer registered outside the inheritance hierarchy still type-checks against the same contract.
15+
- `ToolProtocol``name` / `description` / `parameters` / `requires_approval` mirror of `CoreTool` (`opencontractserver/llms/tools/tool_factory.py`); framework adapters can accept any object satisfying it, no inheritance required.
16+
- `PermissionedQueryManagerProtocol``visible_to_user(user) -> QuerySet` contract that `BaseVisibilityManager`, `PermissionManager`, `DocumentManager`, `AnnotationManager`, and `NoteManager` all satisfy (`opencontractserver/shared/Managers.py`); imported there so callers receiving "a permissioned manager" can type against the protocol instead of a concrete class.
17+
- `StreamObserverProtocol` — duplicated `__call__(event)` shape from `opencontractserver/llms/types.StreamObserver` for callers in non-LLM modules (notifications, websockets) that need the contract without importing `llms.types`.
18+
- **Coverage delta** (return-annotation coverage measured by AST walk, excluding `__init__.py`):
19+
- `analyzer/`: 12.5% → **87.5%** (target ≥70%)
20+
- `shared/`: 38.1% → **96.3%** (target ≥70%)
21+
- `agents/`: 34.4% → **71.9%** (target ≥70%)
22+
- `badges/`: 47.1% → **100%** (target ≥70%)
23+
- `worker_uploads/`: 46.4% → **100%** (target ≥70%)
24+
- **Files touched** (annotations only, zero behavior changes): `analyzer/{apps,checks,signals,startup,utils,admin,admin_views}.py`, `analyzer/management/commands/sync_doc_analyzers.py`, `agents/{apps,admin,memory}.py`, `badges/{apps,signals,models}.py`, `worker_uploads/{apps,auth,serializers,views,models,tasks}.py`, `shared/{utils,defaults,fields,mixins,Managers,QuerySets,decorators}.py`, plus the four protocol-consumer wirings above.
25+
- **Bare-generic promotion**: every `dict` / `list` / `set` in newly-touched public signatures was widened to a parametrised form (`dict[str, Any]`, `list[Any]`, `dict[str, int]`, etc.). The `HasEmbeddingMixin` docstring example was tightened to match (`-> dict[str, Any]`).
26+
- **`# type: ignore` audit**: every bare `# type: ignore` comment in the codebase now carries a specific error code. `opencontractserver/llms/tools/core_tools.py:413,416,418,420` (channels/asgiref import + `partial` kwarg-aware wrapper) tightened to `[import-not-found]` / `[call-arg]` with a one-line comment explaining why; `opencontractserver/utils/embeddings.py:171` to `[attr-defined]`; `opencontractserver/tests/test_pipeline_utils.py:395,423,431,438,446` cleaned up the duplicated `# type: ignore; type: ignore` markers and scoped them to `[import-not-found]`; `opencontractserver/tests/test_core_tool_factory.py:34,35,47` widened from bare to `[attr-defined]`. The total count went from 62 → 61, and the ratio of bare-to-scoped dropped to zero.
1227
- **Return-type annotations across `config/graphql/` resolvers and mutations** (Issue #1332, follow-up to #1331): The largest, least-typed subtree in the backend (459 function definitions, ~4.4% return-annotation coverage at baseline) is now at 100% return-annotation coverage. Touched files include every `*_mutations.py`, every `*_queries.py`, every `*_types.py`, plus `filters.py`, `base.py`, `base_types.py`, `security.py`, `optimized_file_resolvers.py`, `permissioning/permission_annotator/{middleware,mixins,utils}.py`, and the small utility modules. No behavioral changes — annotations only.
1328
- **`mutate(...)` on `graphene.Mutation` subclasses**: typed as forward references to the enclosing class (`-> "ClassName"`). Discovered and fixed the latent bug in `config/graphql/analysis_mutations.py:179` (`DeleteAnalysisMutation.mutate`) where the success path had no `return` statement; annotation is `-> "DeleteAnalysisMutation | None"` and an explicit `return None` was added to preserve the original implicit-None behavior on success.
1429
- **`resolve_*` methods**: typed as `-> Any` by default, refined where the GraphQL field type makes the runtime return obvious (e.g. `resolve_in_use -> bool`, `resolve_datacell_count -> int`).
@@ -23,6 +38,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
2338

2439
### Fixed
2540

41+
- **Extraction grounding follow-up** (Issue #1246, follow-up to original #1245 grounding pipeline):
42+
- **Bug — silent `page=1` fallback corrupted multi-page PDF grounding** (`opencontractserver/utils/extraction_grounding.py`, `_create_pdf_annotation`): when PlasmaPDF could not determine a page for a span, the previous code logged a warning and saved the annotation on page 1 anyway. For multi-page PDFs this produced a structurally incorrect annotation pinned to the wrong page (and therefore the wrong bounding box context), so users clicking through to the source landed on a different page than the one containing the extracted text. Fixed: `_create_pdf_annotation` now raises `ValueError` inside its `transaction.atomic()` savepoint, the savepoint rolls back, and the outer per-result `try/except` in `_create_grounding_annotations` logs it as a failed grounding attempt. Best-effort grounding is preserved (other annotations in the batch are unaffected) but no annotation is ever saved with a wrong page.
43+
- **Bug — label-set lookup outside the per-annotation guard caused all-or-nothing failure** (`opencontractserver/utils/extraction_grounding.py`, `_create_grounding_annotations`): `corpus.ensure_label_and_labelset(...)` was invoked once before the per-annotation `try/with transaction.atomic()` loop. A failure to materialise the label-set (e.g. a transient DB error or a pre-existing constraint conflict) propagated out, was caught by the outer `try/except` in `data_extract_tasks.py`, and silently dropped *all* groundings for the datacell. Moved the call inside the savepoint so a label-lookup failure only skips the affected annotation.
44+
- **Bug — duplicate `OC_EXTRACT_SOURCE` annotations on Celery retry** (`opencontractserver/utils/extraction_grounding.py`, `_create_pdf_annotation` & `_create_span_annotation`): nothing prevented the grounding pipeline from creating fresh annotations and re-linking them via `datacell.sources.add(*annotations)` if `ground_extraction_to_annotations` ran twice on the same datacell (Celery retry after partial failure was the realistic trigger). Replaced the construct-then-`save()` flow with `Annotation.objects.get_or_create()` keyed on `(document, annotation_label, annotation_type, raw_text, …)` so retries reuse existing rows. `datacell.sources` is a `ManyToManyField`, so re-linking the same row is already a no-op once the row is shared.
45+
- **Constant — extracted `DOCX_MIME_TYPE`** (`opencontractserver/constants/document_processing.py`): the long `application/vnd.openxmlformats-officedocument.wordprocessingml.document` literal previously lived inline in `_load_document_text_and_layer`. Per the project's no-magic-strings rule it now sits next to `MARKDOWN_MIME_TYPE` and is imported from one place.
46+
- **Type annotations** (`opencontractserver/utils/extraction_grounding.py`): `Document`, `Corpus`, `Datacell`, `Annotation`, and `AnnotationLabel` parameters and return types added via a `TYPE_CHECKING` block on every public and helper function. No runtime change.
47+
- **Documentation** — the `page=1` placeholder for SPAN_LABEL annotations (text/DOCX) is now documented in the function docstring, explaining that the `txt_extract_file` pipeline does not preserve a page-break map and the actual location lives in the character offsets in `json`.
48+
- **Tests**`opencontractserver/tests/test_extraction_grounding.py`:
49+
- `TestGroundingPipelinePDFIntegration` (new class): builds a synthetic two-page PAWLS payload (no real PDF binary needed), runs grounding through `build_translation_layer`, and verifies (a) annotations land on the correct page, (b) re-running grounding is idempotent, and (c) when PlasmaPDF returns `page=None` the annotation is **skipped** instead of being saved on page 1.
50+
- `test_ground_text_document_is_idempotent`: regression for the duplicate-annotation bug on the SPAN_LABEL path.
51+
2652
- **`CreateCorpusActionModal` opened with the wrong default agent instructions for document triggers** (Issue #1385, `frontend/src/components/corpuses/CreateCorpusActionModal.tsx:136-144,168-171`): the `inlineAgentInstructions` state was initialised with `DEFAULT_MODERATOR_INSTRUCTIONS` even though the default trigger is `add_document` (a document trigger). The trigger-change handler at line 611 swaps to `DEFAULT_DOCUMENT_AGENT_INSTRUCTIONS`, but a user who created an inline agent on the default-selected trigger without first re-selecting the trigger would submit the moderator copy as the new agent's system instructions. Initialised both the `useState` default and `resetForm()` to `DEFAULT_DOCUMENT_AGENT_INSTRUCTIONS` so the pre-interaction value matches the default trigger. Updated `frontend/tests/CreateCorpusActionModal.ct.tsx` "inline-agent create: full happy path" mutation mock to expect `DEFAULT_DOCUMENT_AGENT_INSTRUCTIONS` — the previous mock variable masked this bug because `MockedProvider` was matching the stale moderator default rather than the trigger-appropriate one.
2753

2854
### Changed

opencontractserver/agents/admin.py

Lines changed: 11 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,6 @@
11
from django.contrib import admin
2+
from django.db.models import QuerySet
3+
from django.http import HttpRequest
24
from django.utils.html import format_html
35
from guardian.admin import GuardedModelAdmin
46

@@ -146,7 +148,7 @@ class AgentActionResultAdmin(GuardedModelAdmin):
146148
),
147149
)
148150

149-
def status_badge(self, obj):
151+
def status_badge(self, obj: AgentActionResult) -> str:
150152
"""Display status as a colored badge."""
151153
colors = {
152154
"pending": "#6c757d", # gray
@@ -165,7 +167,7 @@ def status_badge(self, obj):
165167
status_badge.short_description = "Status"
166168
status_badge.admin_order_field = "status"
167169

168-
def corpus_action_link(self, obj):
170+
def corpus_action_link(self, obj: AgentActionResult) -> str:
169171
"""Link to the corpus action in admin."""
170172
if obj.corpus_action:
171173
name = (
@@ -182,7 +184,7 @@ def corpus_action_link(self, obj):
182184

183185
corpus_action_link.short_description = "Corpus Action"
184186

185-
def document_link(self, obj):
187+
def document_link(self, obj: AgentActionResult) -> str:
186188
"""Link to the document in admin."""
187189
if obj.document:
188190
title = (
@@ -199,7 +201,7 @@ def document_link(self, obj):
199201

200202
document_link.short_description = "Document"
201203

202-
def tools_count(self, obj):
204+
def tools_count(self, obj: AgentActionResult) -> str:
203205
"""Display count of tools executed."""
204206
if obj.tools_executed:
205207
count = len(obj.tools_executed)
@@ -212,7 +214,7 @@ def tools_count(self, obj):
212214

213215
tools_count.short_description = "Tools"
214216

215-
def duration_display(self, obj):
217+
def duration_display(self, obj: AgentActionResult) -> str:
216218
"""Display execution duration."""
217219
duration = obj.duration_seconds
218220
if duration is not None:
@@ -228,7 +230,7 @@ def duration_display(self, obj):
228230

229231
duration_display.short_description = "Duration"
230232

231-
def agent_response_display(self, obj):
233+
def agent_response_display(self, obj: AgentActionResult) -> str:
232234
"""Display agent response with formatting."""
233235
if obj.agent_response:
234236
# Truncate very long responses in admin
@@ -240,7 +242,7 @@ def agent_response_display(self, obj):
240242

241243
agent_response_display.short_description = "Agent Response"
242244

243-
def tools_executed_display(self, obj):
245+
def tools_executed_display(self, obj: AgentActionResult) -> str:
244246
"""Display tools executed as formatted JSON."""
245247
if obj.tools_executed:
246248
import json
@@ -254,7 +256,7 @@ def tools_executed_display(self, obj):
254256

255257
tools_executed_display.short_description = "Tools Executed"
256258

257-
def execution_metadata_display(self, obj):
259+
def execution_metadata_display(self, obj: AgentActionResult) -> str:
258260
"""Display execution metadata as formatted JSON."""
259261
if obj.execution_metadata:
260262
import json
@@ -265,7 +267,7 @@ def execution_metadata_display(self, obj):
265267

266268
execution_metadata_display.short_description = "Execution Metadata"
267269

268-
def get_queryset(self, request):
270+
def get_queryset(self, request: HttpRequest) -> QuerySet[AgentActionResult]:
269271
"""Optimize queryset with select_related."""
270272
return (
271273
super()

opencontractserver/agents/apps.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -3,6 +3,6 @@
33

44

55
class AgentsConfig(AppConfig):
6-
default_auto_field = "django.db.models.BigAutoField"
7-
name = "opencontractserver.agents"
6+
default_auto_field: str = "django.db.models.BigAutoField"
7+
name: str = "opencontractserver.agents"
88
verbose_name = _("Agents")

opencontractserver/agents/memory.py

Lines changed: 9 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@
1111
import logging
1212
import re
1313
from datetime import datetime, timezone
14-
from typing import TYPE_CHECKING
14+
from typing import TYPE_CHECKING, Any
1515

1616
from channels.db import database_sync_to_async
1717
from django.core.files.base import ContentFile
@@ -74,7 +74,7 @@ def _build_empty_memory(corpus_id: int) -> str:
7474
# ---------------------------------------------------------------------------
7575

7676

77-
async def get_or_create_memory_document(corpus: Corpus, user) -> Document:
77+
async def get_or_create_memory_document(corpus: Corpus, user: Any) -> Document:
7878
"""Get the existing memory document or create a new empty one.
7979
8080
If the corpus already has a valid ``memory_document``, return it.
@@ -95,7 +95,7 @@ async def get_or_create_memory_document(corpus: Corpus, user) -> Document:
9595

9696
from opencontractserver.corpuses.models import Corpus as CorpusModel
9797

98-
def _get_or_create_sync():
98+
def _get_or_create_sync() -> Document:
9999
# Phase 1: Check under lock whether a memory document already exists.
100100
with transaction.atomic():
101101
locked = CorpusModel.objects.select_for_update().get(pk=corpus.pk)
@@ -164,7 +164,7 @@ async def read_memory_content(corpus: Corpus) -> str:
164164
if not corpus.memory_document_id:
165165
return ""
166166

167-
def _read():
167+
def _read() -> str:
168168
doc = corpus.memory_document
169169
if doc is None or not doc.txt_extract_file:
170170
return ""
@@ -187,7 +187,9 @@ def _read():
187187
return await database_sync_to_async(_read)()
188188

189189

190-
async def update_memory_content(corpus: Corpus, new_content: str, user) -> Document:
190+
async def update_memory_content(
191+
corpus: Corpus, new_content: str, user: Any
192+
) -> Document:
191193
"""Update the memory document with new content.
192194
193195
Creates the memory document if it doesn't exist. Overwrites the
@@ -215,7 +217,7 @@ async def update_memory_content(corpus: Corpus, new_content: str, user) -> Docum
215217

216218
doc = await get_or_create_memory_document(corpus, user)
217219

218-
def _update():
220+
def _update() -> Document:
219221
from django.db import transaction
220222

221223
with transaction.atomic():
@@ -431,7 +433,7 @@ def merge_curation_into_memory(
431433
current_content: str,
432434
collection_patterns: list[str],
433435
query_patterns: list[str],
434-
refinements: list[dict],
436+
refinements: list[dict[str, Any]],
435437
) -> str:
436438
"""Merge curation results into the existing memory document.
437439

opencontractserver/analyzer/admin.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
from django.contrib import admin
2-
from django.urls import path
2+
from django.urls import URLPattern, path
33
from guardian.admin import GuardedModelAdmin
44

55
from opencontractserver.analyzer.admin_views import AnalyzerSyncView
@@ -16,7 +16,7 @@ class AnalyzerAdmin(GuardedModelAdmin):
1616
list_display = ["id", "description", "task_name", "host_gremlin"]
1717
change_list_template = "admin/analyzer/analyzer_changelist.html"
1818

19-
def get_urls(self):
19+
def get_urls(self) -> list[URLPattern]:
2020
urls = super().get_urls()
2121
custom_urls = [
2222
path(

opencontractserver/analyzer/admin_views.py

Lines changed: 9 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,8 @@
1+
from typing import Any
2+
13
from django.contrib import messages
24
from django.contrib.admin.views.decorators import staff_member_required
5+
from django.http import HttpRequest, HttpResponse
36
from django.shortcuts import redirect, render
47
from django.urls import reverse
58
from django.utils.decorators import method_decorator
@@ -17,11 +20,11 @@
1720
class AnalyzerSyncView(View):
1821
"""Custom admin view for syncing doc analyzer tasks"""
1922

20-
template_name = "admin/analyzer/analyzer_sync.html"
23+
template_name: str = "admin/analyzer/analyzer_sync.html"
2124

22-
def get_available_analyzers(self):
25+
def get_available_analyzers(self) -> list[dict[str, Any]]:
2326
"""Get info about all available doc analyzer tasks"""
24-
analyzers = []
27+
analyzers: list[dict[str, Any]] = []
2528

2629
for task_name in celery_app.tasks.keys():
2730
analyzer_task = get_doc_analyzer_task_by_name(task_name)
@@ -48,8 +51,8 @@ def get_available_analyzers(self):
4851

4952
return sorted(analyzers, key=lambda x: (x["exists"], x["task_name"]))
5053

51-
def get(self, request):
52-
context = {
54+
def get(self, request: HttpRequest) -> HttpResponse:
55+
context: dict[str, Any] = {
5356
"title": "Sync Doc Analyzer Tasks",
5457
"analyzers": self.get_available_analyzers(),
5558
"opts": Analyzer._meta,
@@ -59,7 +62,7 @@ def get(self, request):
5962
}
6063
return render(request, self.template_name, context)
6164

62-
def post(self, request):
65+
def post(self, request: HttpRequest) -> HttpResponse:
6366
if not request.user.has_perm("analyzer.add_analyzer"):
6467
messages.error(request, "You don't have permission to create analyzers.")
6568
return redirect(reverse("admin:analyzer_sync"))

opencontractserver/analyzer/apps.py

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -3,10 +3,10 @@
33

44

55
class AnnotationsConfig(AppConfig):
6-
default_auto_field = "django.db.models.BigAutoField"
7-
name = "opencontractserver.analyzer"
6+
default_auto_field: str = "django.db.models.BigAutoField"
7+
name: str = "opencontractserver.analyzer"
88

9-
def ready(self):
9+
def ready(self) -> None:
1010
try:
1111
import opencontractserver.analyzer.signals # noqa F401
1212
from opencontractserver.analyzer.models import GremlinEngine

opencontractserver/analyzer/checks.py

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,13 +1,15 @@
1+
from typing import Any
2+
13
from django.core.checks import Warning, register
24

35

46
@register()
5-
def check_unsynced_analyzers(app_configs, **kwargs):
7+
def check_unsynced_analyzers(app_configs: Any, **kwargs: Any) -> list[Warning]:
68
"""
79
Check if there are doc_analyzer_task decorated functions
810
that haven't been synced to the database.
911
"""
10-
warnings = []
12+
warnings: list[Warning] = []
1113

1214
try:
1315
from opencontractserver.analyzer.models import Analyzer

opencontractserver/analyzer/management/commands/sync_doc_analyzers.py

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,7 @@
1+
from typing import Any
2+
13
from django.contrib.auth import get_user_model
2-
from django.core.management.base import BaseCommand
4+
from django.core.management.base import BaseCommand, CommandParser
35

46
from opencontractserver.analyzer.models import Analyzer
57
from opencontractserver.analyzer.utils import auto_create_doc_analyzers
@@ -8,14 +10,14 @@
810
class Command(BaseCommand):
911
help = "Synchronize doc_analyzer_task decorated functions with Analyzer database"
1012

11-
def add_arguments(self, parser):
13+
def add_arguments(self, parser: CommandParser) -> None:
1214
parser.add_argument(
1315
"--dry-run",
1416
action="store_true",
1517
help="Show what would be created without making changes",
1618
)
1719

18-
def handle(self, *args, **options):
20+
def handle(self, *args: Any, **options: Any) -> None:
1921
UserModel = get_user_model()
2022

2123
if options["dry_run"]:

0 commit comments

Comments
 (0)