Skip to content

Commit 9d8226e

Browse files
committed
Merge remote-tracking branch 'origin/main' into pr-1239-clean
Resolves divergent-history merge against main. PR's prior main-merge commit (0883061) was tree-identical to main's f3ae088 but had a different SHA, so naive merge produced 74 spurious "both added" conflicts. Used the equivalent main commit as the virtual merge base to reduce that to one real content conflict in opencontractserver/llms/vector_stores/core_vector_stores.py: - Combined reranker imports (ours) with VectorStoreProtocol import (theirs) at module top. - Kept the rerank-aware return in the empty-query-text branch (ours) and added the __all__ re-export from theirs at module level.
2 parents 35cc004 + f760d87 commit 9d8226e

41 files changed

Lines changed: 1089 additions & 213 deletions

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

CHANGELOG.md

Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,21 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
99

1010
### Added
1111

12+
- **Mypy: type analyzer, shared, agents, badges, worker_uploads; introduce shared protocols** (Issue #1335): Brought the five smaller, interface-rich target packages over the ≥70% return-annotation bar called for by the issue and seeded `opencontractserver/types/protocols.py` with the four protocols requested in the scope:
13+
- `VectorStoreProtocol` — minimum surface (`search` / `async_search`) implemented by `CoreAnnotationVectorStore` (`opencontractserver/llms/vector_stores/core_vector_stores.py`); imported and re-exported from that module so consumers can annotate against the protocol rather than the concrete dataclass.
14+
- `PipelineComponentProtocol``title` / `description` / `author` / `dependencies` surface that the pipeline registry duck-types against; imported from `opencontractserver/pipeline/base/base_component.py` so any future parser/embedder/thumbnailer registered outside the inheritance hierarchy still type-checks against the same contract.
15+
- `ToolProtocol``name` / `description` / `parameters` / `requires_approval` mirror of `CoreTool` (`opencontractserver/llms/tools/tool_factory.py`); framework adapters can accept any object satisfying it, no inheritance required.
16+
- `PermissionedQueryManagerProtocol``visible_to_user(user) -> QuerySet` contract that `BaseVisibilityManager`, `PermissionManager`, `DocumentManager`, `AnnotationManager`, and `NoteManager` all satisfy (`opencontractserver/shared/Managers.py`); imported there so callers receiving "a permissioned manager" can type against the protocol instead of a concrete class.
17+
- `StreamObserverProtocol` — duplicated `__call__(event)` shape from `opencontractserver/llms/types.StreamObserver` for callers in non-LLM modules (notifications, websockets) that need the contract without importing `llms.types`.
18+
- **Coverage delta** (return-annotation coverage measured by AST walk, excluding `__init__.py`):
19+
- `analyzer/`: 12.5% → **87.5%** (target ≥70%)
20+
- `shared/`: 38.1% → **96.3%** (target ≥70%)
21+
- `agents/`: 34.4% → **71.9%** (target ≥70%)
22+
- `badges/`: 47.1% → **100%** (target ≥70%)
23+
- `worker_uploads/`: 46.4% → **100%** (target ≥70%)
24+
- **Files touched** (annotations only, zero behavior changes): `analyzer/{apps,checks,signals,startup,utils,admin,admin_views}.py`, `analyzer/management/commands/sync_doc_analyzers.py`, `agents/{apps,admin,memory}.py`, `badges/{apps,signals,models}.py`, `worker_uploads/{apps,auth,serializers,views,models,tasks}.py`, `shared/{utils,defaults,fields,mixins,Managers,QuerySets,decorators}.py`, plus the four protocol-consumer wirings above.
25+
- **Bare-generic promotion**: every `dict` / `list` / `set` in newly-touched public signatures was widened to a parametrised form (`dict[str, Any]`, `list[Any]`, `dict[str, int]`, etc.). The `HasEmbeddingMixin` docstring example was tightened to match (`-> dict[str, Any]`).
26+
- **`# type: ignore` audit**: every bare `# type: ignore` comment in the codebase now carries a specific error code. `opencontractserver/llms/tools/core_tools.py:413,416,418,420` (channels/asgiref import + `partial` kwarg-aware wrapper) tightened to `[import-not-found]` / `[call-arg]` with a one-line comment explaining why; `opencontractserver/utils/embeddings.py:171` to `[attr-defined]`; `opencontractserver/tests/test_pipeline_utils.py:395,423,431,438,446` cleaned up the duplicated `# type: ignore; type: ignore` markers and scoped them to `[import-not-found]`; `opencontractserver/tests/test_core_tool_factory.py:34,35,47` widened from bare to `[attr-defined]`. The total count went from 62 → 61, and the ratio of bare-to-scoped dropped to zero.
1227
- **Return-type annotations across `config/graphql/` resolvers and mutations** (Issue #1332, follow-up to #1331): The largest, least-typed subtree in the backend (459 function definitions, ~4.4% return-annotation coverage at baseline) is now at 100% return-annotation coverage. Touched files include every `*_mutations.py`, every `*_queries.py`, every `*_types.py`, plus `filters.py`, `base.py`, `base_types.py`, `security.py`, `optimized_file_resolvers.py`, `permissioning/permission_annotator/{middleware,mixins,utils}.py`, and the small utility modules. No behavioral changes — annotations only.
1328
- **`mutate(...)` on `graphene.Mutation` subclasses**: typed as forward references to the enclosing class (`-> "ClassName"`). Discovered and fixed the latent bug in `config/graphql/analysis_mutations.py:179` (`DeleteAnalysisMutation.mutate`) where the success path had no `return` statement; annotation is `-> "DeleteAnalysisMutation | None"` and an explicit `return None` was added to preserve the original implicit-None behavior on success.
1429
- **`resolve_*` methods**: typed as `-> Any` by default, refined where the GraphQL field type makes the runtime return obvious (e.g. `resolve_in_use -> bool`, `resolve_datacell_count -> int`).
@@ -23,6 +38,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
2338

2439
### Fixed
2540

41+
- **Extraction grounding follow-up** (Issue #1246, follow-up to original #1245 grounding pipeline):
42+
- **Bug — silent `page=1` fallback corrupted multi-page PDF grounding** (`opencontractserver/utils/extraction_grounding.py`, `_create_pdf_annotation`): when PlasmaPDF could not determine a page for a span, the previous code logged a warning and saved the annotation on page 1 anyway. For multi-page PDFs this produced a structurally incorrect annotation pinned to the wrong page (and therefore the wrong bounding box context), so users clicking through to the source landed on a different page than the one containing the extracted text. Fixed: `_create_pdf_annotation` now raises `ValueError` inside its `transaction.atomic()` savepoint, the savepoint rolls back, and the outer per-result `try/except` in `_create_grounding_annotations` logs it as a failed grounding attempt. Best-effort grounding is preserved (other annotations in the batch are unaffected) but no annotation is ever saved with a wrong page.
43+
- **Bug — label-set lookup outside the per-annotation guard caused all-or-nothing failure** (`opencontractserver/utils/extraction_grounding.py`, `_create_grounding_annotations`): `corpus.ensure_label_and_labelset(...)` was invoked once before the per-annotation `try/with transaction.atomic()` loop. A failure to materialise the label-set (e.g. a transient DB error or a pre-existing constraint conflict) propagated out, was caught by the outer `try/except` in `data_extract_tasks.py`, and silently dropped *all* groundings for the datacell. Moved the call inside the savepoint so a label-lookup failure only skips the affected annotation.
44+
- **Bug — duplicate `OC_EXTRACT_SOURCE` annotations on Celery retry** (`opencontractserver/utils/extraction_grounding.py`, `_create_pdf_annotation` & `_create_span_annotation`): nothing prevented the grounding pipeline from creating fresh annotations and re-linking them via `datacell.sources.add(*annotations)` if `ground_extraction_to_annotations` ran twice on the same datacell (Celery retry after partial failure was the realistic trigger). Replaced the construct-then-`save()` flow with `Annotation.objects.get_or_create()` keyed on `(document, annotation_label, annotation_type, raw_text, …)` so retries reuse existing rows. `datacell.sources` is a `ManyToManyField`, so re-linking the same row is already a no-op once the row is shared.
45+
- **Constant — extracted `DOCX_MIME_TYPE`** (`opencontractserver/constants/document_processing.py`): the long `application/vnd.openxmlformats-officedocument.wordprocessingml.document` literal previously lived inline in `_load_document_text_and_layer`. Per the project's no-magic-strings rule it now sits next to `MARKDOWN_MIME_TYPE` and is imported from one place.
46+
- **Type annotations** (`opencontractserver/utils/extraction_grounding.py`): `Document`, `Corpus`, `Datacell`, `Annotation`, and `AnnotationLabel` parameters and return types added via a `TYPE_CHECKING` block on every public and helper function. No runtime change.
47+
- **Documentation** — the `page=1` placeholder for SPAN_LABEL annotations (text/DOCX) is now documented in the function docstring, explaining that the `txt_extract_file` pipeline does not preserve a page-break map and the actual location lives in the character offsets in `json`.
48+
- **Tests**`opencontractserver/tests/test_extraction_grounding.py`:
49+
- `TestGroundingPipelinePDFIntegration` (new class): builds a synthetic two-page PAWLS payload (no real PDF binary needed), runs grounding through `build_translation_layer`, and verifies (a) annotations land on the correct page, (b) re-running grounding is idempotent, and (c) when PlasmaPDF returns `page=None` the annotation is **skipped** instead of being saved on page 1.
50+
- `test_ground_text_document_is_idempotent`: regression for the duplicate-annotation bug on the SPAN_LABEL path.
51+
2652
- **`CreateCorpusActionModal` opened with the wrong default agent instructions for document triggers** (Issue #1385, `frontend/src/components/corpuses/CreateCorpusActionModal.tsx:136-144,168-171`): the `inlineAgentInstructions` state was initialised with `DEFAULT_MODERATOR_INSTRUCTIONS` even though the default trigger is `add_document` (a document trigger). The trigger-change handler at line 611 swaps to `DEFAULT_DOCUMENT_AGENT_INSTRUCTIONS`, but a user who created an inline agent on the default-selected trigger without first re-selecting the trigger would submit the moderator copy as the new agent's system instructions. Initialised both the `useState` default and `resetForm()` to `DEFAULT_DOCUMENT_AGENT_INSTRUCTIONS` so the pre-interaction value matches the default trigger. Updated `frontend/tests/CreateCorpusActionModal.ct.tsx` "inline-agent create: full happy path" mutation mock to expect `DEFAULT_DOCUMENT_AGENT_INSTRUCTIONS` — the previous mock variable masked this bug because `MockedProvider` was matching the stale moderator default rather than the trigger-appropriate one.
2753

2854
### Changed

opencontractserver/agents/admin.py

Lines changed: 11 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,6 @@
11
from django.contrib import admin
2+
from django.db.models import QuerySet
3+
from django.http import HttpRequest
24
from django.utils.html import format_html
35
from guardian.admin import GuardedModelAdmin
46

@@ -146,7 +148,7 @@ class AgentActionResultAdmin(GuardedModelAdmin):
146148
),
147149
)
148150

149-
def status_badge(self, obj):
151+
def status_badge(self, obj: AgentActionResult) -> str:
150152
"""Display status as a colored badge."""
151153
colors = {
152154
"pending": "#6c757d", # gray
@@ -165,7 +167,7 @@ def status_badge(self, obj):
165167
status_badge.short_description = "Status"
166168
status_badge.admin_order_field = "status"
167169

168-
def corpus_action_link(self, obj):
170+
def corpus_action_link(self, obj: AgentActionResult) -> str:
169171
"""Link to the corpus action in admin."""
170172
if obj.corpus_action:
171173
name = (
@@ -182,7 +184,7 @@ def corpus_action_link(self, obj):
182184

183185
corpus_action_link.short_description = "Corpus Action"
184186

185-
def document_link(self, obj):
187+
def document_link(self, obj: AgentActionResult) -> str:
186188
"""Link to the document in admin."""
187189
if obj.document:
188190
title = (
@@ -199,7 +201,7 @@ def document_link(self, obj):
199201

200202
document_link.short_description = "Document"
201203

202-
def tools_count(self, obj):
204+
def tools_count(self, obj: AgentActionResult) -> str:
203205
"""Display count of tools executed."""
204206
if obj.tools_executed:
205207
count = len(obj.tools_executed)
@@ -212,7 +214,7 @@ def tools_count(self, obj):
212214

213215
tools_count.short_description = "Tools"
214216

215-
def duration_display(self, obj):
217+
def duration_display(self, obj: AgentActionResult) -> str:
216218
"""Display execution duration."""
217219
duration = obj.duration_seconds
218220
if duration is not None:
@@ -228,7 +230,7 @@ def duration_display(self, obj):
228230

229231
duration_display.short_description = "Duration"
230232

231-
def agent_response_display(self, obj):
233+
def agent_response_display(self, obj: AgentActionResult) -> str:
232234
"""Display agent response with formatting."""
233235
if obj.agent_response:
234236
# Truncate very long responses in admin
@@ -240,7 +242,7 @@ def agent_response_display(self, obj):
240242

241243
agent_response_display.short_description = "Agent Response"
242244

243-
def tools_executed_display(self, obj):
245+
def tools_executed_display(self, obj: AgentActionResult) -> str:
244246
"""Display tools executed as formatted JSON."""
245247
if obj.tools_executed:
246248
import json
@@ -254,7 +256,7 @@ def tools_executed_display(self, obj):
254256

255257
tools_executed_display.short_description = "Tools Executed"
256258

257-
def execution_metadata_display(self, obj):
259+
def execution_metadata_display(self, obj: AgentActionResult) -> str:
258260
"""Display execution metadata as formatted JSON."""
259261
if obj.execution_metadata:
260262
import json
@@ -265,7 +267,7 @@ def execution_metadata_display(self, obj):
265267

266268
execution_metadata_display.short_description = "Execution Metadata"
267269

268-
def get_queryset(self, request):
270+
def get_queryset(self, request: HttpRequest) -> QuerySet[AgentActionResult]:
269271
"""Optimize queryset with select_related."""
270272
return (
271273
super()

opencontractserver/agents/apps.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -3,6 +3,6 @@
33

44

55
class AgentsConfig(AppConfig):
6-
default_auto_field = "django.db.models.BigAutoField"
7-
name = "opencontractserver.agents"
6+
default_auto_field: str = "django.db.models.BigAutoField"
7+
name: str = "opencontractserver.agents"
88
verbose_name = _("Agents")

opencontractserver/agents/memory.py

Lines changed: 9 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@
1111
import logging
1212
import re
1313
from datetime import datetime, timezone
14-
from typing import TYPE_CHECKING
14+
from typing import TYPE_CHECKING, Any
1515

1616
from channels.db import database_sync_to_async
1717
from django.core.files.base import ContentFile
@@ -74,7 +74,7 @@ def _build_empty_memory(corpus_id: int) -> str:
7474
# ---------------------------------------------------------------------------
7575

7676

77-
async def get_or_create_memory_document(corpus: Corpus, user) -> Document:
77+
async def get_or_create_memory_document(corpus: Corpus, user: Any) -> Document:
7878
"""Get the existing memory document or create a new empty one.
7979
8080
If the corpus already has a valid ``memory_document``, return it.
@@ -95,7 +95,7 @@ async def get_or_create_memory_document(corpus: Corpus, user) -> Document:
9595

9696
from opencontractserver.corpuses.models import Corpus as CorpusModel
9797

98-
def _get_or_create_sync():
98+
def _get_or_create_sync() -> Document:
9999
# Phase 1: Check under lock whether a memory document already exists.
100100
with transaction.atomic():
101101
locked = CorpusModel.objects.select_for_update().get(pk=corpus.pk)
@@ -164,7 +164,7 @@ async def read_memory_content(corpus: Corpus) -> str:
164164
if not corpus.memory_document_id:
165165
return ""
166166

167-
def _read():
167+
def _read() -> str:
168168
doc = corpus.memory_document
169169
if doc is None or not doc.txt_extract_file:
170170
return ""
@@ -187,7 +187,9 @@ def _read():
187187
return await database_sync_to_async(_read)()
188188

189189

190-
async def update_memory_content(corpus: Corpus, new_content: str, user) -> Document:
190+
async def update_memory_content(
191+
corpus: Corpus, new_content: str, user: Any
192+
) -> Document:
191193
"""Update the memory document with new content.
192194
193195
Creates the memory document if it doesn't exist. Overwrites the
@@ -215,7 +217,7 @@ async def update_memory_content(corpus: Corpus, new_content: str, user) -> Docum
215217

216218
doc = await get_or_create_memory_document(corpus, user)
217219

218-
def _update():
220+
def _update() -> Document:
219221
from django.db import transaction
220222

221223
with transaction.atomic():
@@ -431,7 +433,7 @@ def merge_curation_into_memory(
431433
current_content: str,
432434
collection_patterns: list[str],
433435
query_patterns: list[str],
434-
refinements: list[dict],
436+
refinements: list[dict[str, Any]],
435437
) -> str:
436438
"""Merge curation results into the existing memory document.
437439

opencontractserver/analyzer/admin.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
from django.contrib import admin
2-
from django.urls import path
2+
from django.urls import URLPattern, path
33
from guardian.admin import GuardedModelAdmin
44

55
from opencontractserver.analyzer.admin_views import AnalyzerSyncView
@@ -16,7 +16,7 @@ class AnalyzerAdmin(GuardedModelAdmin):
1616
list_display = ["id", "description", "task_name", "host_gremlin"]
1717
change_list_template = "admin/analyzer/analyzer_changelist.html"
1818

19-
def get_urls(self):
19+
def get_urls(self) -> list[URLPattern]:
2020
urls = super().get_urls()
2121
custom_urls = [
2222
path(

opencontractserver/analyzer/admin_views.py

Lines changed: 9 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,8 @@
1+
from typing import Any
2+
13
from django.contrib import messages
24
from django.contrib.admin.views.decorators import staff_member_required
5+
from django.http import HttpRequest, HttpResponse
36
from django.shortcuts import redirect, render
47
from django.urls import reverse
58
from django.utils.decorators import method_decorator
@@ -17,11 +20,11 @@
1720
class AnalyzerSyncView(View):
1821
"""Custom admin view for syncing doc analyzer tasks"""
1922

20-
template_name = "admin/analyzer/analyzer_sync.html"
23+
template_name: str = "admin/analyzer/analyzer_sync.html"
2124

22-
def get_available_analyzers(self):
25+
def get_available_analyzers(self) -> list[dict[str, Any]]:
2326
"""Get info about all available doc analyzer tasks"""
24-
analyzers = []
27+
analyzers: list[dict[str, Any]] = []
2528

2629
for task_name in celery_app.tasks.keys():
2730
analyzer_task = get_doc_analyzer_task_by_name(task_name)
@@ -48,8 +51,8 @@ def get_available_analyzers(self):
4851

4952
return sorted(analyzers, key=lambda x: (x["exists"], x["task_name"]))
5053

51-
def get(self, request):
52-
context = {
54+
def get(self, request: HttpRequest) -> HttpResponse:
55+
context: dict[str, Any] = {
5356
"title": "Sync Doc Analyzer Tasks",
5457
"analyzers": self.get_available_analyzers(),
5558
"opts": Analyzer._meta,
@@ -59,7 +62,7 @@ def get(self, request):
5962
}
6063
return render(request, self.template_name, context)
6164

62-
def post(self, request):
65+
def post(self, request: HttpRequest) -> HttpResponse:
6366
if not request.user.has_perm("analyzer.add_analyzer"):
6467
messages.error(request, "You don't have permission to create analyzers.")
6568
return redirect(reverse("admin:analyzer_sync"))

opencontractserver/analyzer/apps.py

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -3,10 +3,10 @@
33

44

55
class AnnotationsConfig(AppConfig):
6-
default_auto_field = "django.db.models.BigAutoField"
7-
name = "opencontractserver.analyzer"
6+
default_auto_field: str = "django.db.models.BigAutoField"
7+
name: str = "opencontractserver.analyzer"
88

9-
def ready(self):
9+
def ready(self) -> None:
1010
try:
1111
import opencontractserver.analyzer.signals # noqa F401
1212
from opencontractserver.analyzer.models import GremlinEngine

opencontractserver/analyzer/checks.py

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,13 +1,15 @@
1+
from typing import Any
2+
13
from django.core.checks import Warning, register
24

35

46
@register()
5-
def check_unsynced_analyzers(app_configs, **kwargs):
7+
def check_unsynced_analyzers(app_configs: Any, **kwargs: Any) -> list[Warning]:
68
"""
79
Check if there are doc_analyzer_task decorated functions
810
that haven't been synced to the database.
911
"""
10-
warnings = []
12+
warnings: list[Warning] = []
1113

1214
try:
1315
from opencontractserver.analyzer.models import Analyzer

opencontractserver/analyzer/management/commands/sync_doc_analyzers.py

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,7 @@
1+
from typing import Any
2+
13
from django.contrib.auth import get_user_model
2-
from django.core.management.base import BaseCommand
4+
from django.core.management.base import BaseCommand, CommandParser
35

46
from opencontractserver.analyzer.models import Analyzer
57
from opencontractserver.analyzer.utils import auto_create_doc_analyzers
@@ -8,14 +10,14 @@
810
class Command(BaseCommand):
911
help = "Synchronize doc_analyzer_task decorated functions with Analyzer database"
1012

11-
def add_arguments(self, parser):
13+
def add_arguments(self, parser: CommandParser) -> None:
1214
parser.add_argument(
1315
"--dry-run",
1416
action="store_true",
1517
help="Show what would be created without making changes",
1618
)
1719

18-
def handle(self, *args, **options):
20+
def handle(self, *args: Any, **options: Any) -> None:
1921
UserModel = get_user_model()
2022

2123
if options["dry_run"]:

0 commit comments

Comments
 (0)