Open-Source-Legal
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 26 additions & 0 deletions
diff --git a/opencontractserver/agents/admin.py b/opencontractserver/agents/admin.py
Lines changed: 11 additions & 9 deletions
diff --git a/opencontractserver/agents/apps.py b/opencontractserver/agents/apps.py
Lines changed: 2 additions & 2 deletions
diff --git a/opencontractserver/agents/memory.py b/opencontractserver/agents/memory.py
Lines changed: 9 additions & 7 deletions
diff --git a/opencontractserver/analyzer/admin.py b/opencontractserver/analyzer/admin.py
Lines changed: 2 additions & 2 deletions
diff --git a/opencontractserver/analyzer/admin_views.py b/opencontractserver/analyzer/admin_views.py
Lines changed: 9 additions & 6 deletions
diff --git a/opencontractserver/analyzer/apps.py b/opencontractserver/analyzer/apps.py
Lines changed: 3 additions & 3 deletions
diff --git a/opencontractserver/analyzer/checks.py b/opencontractserver/analyzer/checks.py
Lines changed: 4 additions & 2 deletions
diff --git a/opencontractserver/analyzer/management/commands/sync_doc_analyzers.py b/opencontractserver/analyzer/management/commands/sync_doc_analyzers.py
Lines changed: 5 additions & 3 deletions
@@ -9,6 +9,21 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Added
 
+- **Mypy: type analyzer, shared, agents, badges, worker_uploads; introduce shared protocols** (Issue #1335): Brought the five smaller, interface-rich target packages over the ≥70% return-annotation bar called for by the issue and seeded `opencontractserver/types/protocols.py` with five protocols, including the four requested in the scope:
+  - `VectorStoreProtocol` — minimum surface (`search` / `async_search`) implemented by `CoreAnnotationVectorStore` (`opencontractserver/llms/vector_stores/core_vector_stores.py`); imported and re-exported from that module so consumers can annotate against the protocol rather than the concrete dataclass.
+  - `PipelineComponentProtocol` — `title` / `description` / `author` / `dependencies` surface that the pipeline registry duck-types against; imported from `opencontractserver/pipeline/base/base_component.py` so any future parser/embedder/thumbnailer registered outside the inheritance hierarchy still type-checks against the same contract.
+  - `ToolProtocol` — `name` / `description` / `parameters` / `requires_approval` mirror of `CoreTool` (`opencontractserver/llms/tools/tool_factory.py`); framework adapters can accept any object satisfying it, no inheritance required.
+  - `PermissionedQueryManagerProtocol` — `visible_to_user(user) -> QuerySet` contract that `BaseVisibilityManager`, `PermissionManager`, `DocumentManager`, `AnnotationManager`, and `NoteManager` all satisfy (`opencontractserver/shared/Managers.py`); imported there so callers receiving "a permissioned manager" can type against the protocol instead of a concrete class.
+  - `StreamObserverProtocol` — duplicated `__call__(event)` shape from `opencontractserver/llms/types.StreamObserver` for callers in non-LLM modules (notifications, websockets) that need the contract without importing `llms.types`.
+  - **Coverage delta** (return-annotation coverage measured by AST walk, excluding `__init__.py`):
+    - `analyzer/`: 12.5% → **87.5%** (target ≥70%)
+    - `shared/`: 38.1% → **96.3%** (target ≥70%)
+    - `agents/`: 34.4% → **71.9%** (target ≥70%)
+    - `badges/`: 47.1% → **100%** (target ≥70%)
+    - `worker_uploads/`: 46.4% → **100%** (target ≥70%)
+  - **Files touched** (annotations only, zero behavior changes): `analyzer/{apps,checks,signals,startup,utils,admin,admin_views}.py`, `analyzer/management/commands/sync_doc_analyzers.py`, `agents/{apps,admin,memory}.py`, `badges/{apps,signals,models}.py`, `worker_uploads/{apps,auth,serializers,views,models,tasks}.py`, `shared/{utils,defaults,fields,mixins,Managers,QuerySets,decorators}.py`, plus the four protocol-consumer wirings above.
+  - **Bare-generic promotion**: every bare `dict` / `list` / `set` in newly-touched public signatures was narrowed to a parametrised form (`dict[str, Any]`, `list[Any]`, `dict[str, int]`, etc.). The `HasEmbeddingMixin` docstring example was tightened to match (`-> dict[str, Any]`).
+  - **`# type: ignore` audit**: every bare `# type: ignore` comment in the codebase now carries a specific error code. `opencontractserver/llms/tools/core_tools.py:413,416,418,420` (channels/asgiref import + `partial` kwarg-aware wrapper) tightened to `[import-not-found]` / `[call-arg]` with a one-line comment explaining why; `opencontractserver/utils/embeddings.py:171` to `[attr-defined]`; `opencontractserver/tests/test_pipeline_utils.py:395,423,431,438,446` cleaned up the duplicated `# type: ignore; type: ignore` markers and scoped them to `[import-not-found]`; `opencontractserver/tests/test_core_tool_factory.py:34,35,47` narrowed from bare to `[attr-defined]`. The total count went from 62 to 61, and the ratio of bare-to-scoped dropped to zero.
 - **Return-type annotations across `config/graphql/` resolvers and mutations** (Issue #1332, follow-up to #1331): The largest, least-typed subtree in the backend (459 function definitions, ~4.4% return-annotation coverage at baseline) is now at 100% return-annotation coverage. Touched files include every `*_mutations.py`, every `*_queries.py`, every `*_types.py`, plus `filters.py`, `base.py`, `base_types.py`, `security.py`, `optimized_file_resolvers.py`, `permissioning/permission_annotator/{middleware,mixins,utils}.py`, and the small utility modules. No behavioral changes — annotations only.
   - **`mutate(...)` on `graphene.Mutation` subclasses**: typed as forward references to the enclosing class (`-> "ClassName"`). Discovered and fixed a latent bug in `config/graphql/analysis_mutations.py:179` (`DeleteAnalysisMutation.mutate`) where the success path had no `return` statement; the annotation is now `-> "DeleteAnalysisMutation | None"` and an explicit `return None` was added to preserve the original implicit-None behavior on success.
   - **`resolve_*` methods**: typed as `-> Any` by default, refined where the GraphQL field type makes the runtime return obvious (e.g. `resolve_in_use -> bool`, `resolve_datacell_count -> int`).
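
The protocol entries above rely on structural typing: a class satisfies a protocol by having the right members, with no inheritance required. A minimal standalone sketch of that idea follows. It assumes only `typing.Protocol` from the standard library; `SearchableStore`, `InMemoryStore`, and `top_hit` are hypothetical names chosen for illustration, not the contents of `opencontractserver/types/protocols.py`, and the `search` signature is deliberately simplified.

```python
from typing import Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class SearchableStore(Protocol):
    """Hypothetical stand-in for a vector-store protocol: one `search` method."""

    def search(self, query: str, limit: int = 5) -> list[dict[str, Any]]: ...


class InMemoryStore:
    """Satisfies SearchableStore structurally; it does not inherit from it."""

    def __init__(self, rows: list[str]) -> None:
        self._rows = rows

    def search(self, query: str, limit: int = 5) -> list[dict[str, Any]]:
        # Naive substring match, standing in for a real similarity search.
        hits = [{"text": row} for row in self._rows if query.lower() in row.lower()]
        return hits[:limit]


def top_hit(store: SearchableStore, query: str) -> Optional[dict[str, Any]]:
    """Accept any object with a matching `search`, not one concrete class."""
    results = store.search(query, limit=1)
    return results[0] if results else None


store = InMemoryStore(["Governing law clause", "Termination clause"])
first = top_hit(store, "termination")
```

Annotating `top_hit` against the protocol lets a type checker accept any conforming store. Note that `isinstance` against a `runtime_checkable` protocol only checks that the method exists, not its signature, so static checking remains the stronger guarantee.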
@@ -23,6 +38,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Fixed
 
+- **Extraction grounding follow-up** (Issue #1246, follow-up to original #1245 grounding pipeline):
+  - **Bug — silent `page=1` fallback corrupted multi-page PDF grounding** (`opencontractserver/utils/extraction_grounding.py`, `_create_pdf_annotation`): when PlasmaPDF could not determine a page for a span, the previous code logged a warning and saved the annotation on page 1 anyway. For multi-page PDFs this produced a structurally incorrect annotation pinned to the wrong page (and therefore the wrong bounding box context), so users clicking through to the source landed on a different page than the one containing the extracted text. Fixed: `_create_pdf_annotation` now raises `ValueError` inside its `transaction.atomic()` savepoint, the savepoint rolls back, and the outer per-result `try/except` in `_create_grounding_annotations` logs it as a failed grounding attempt. Best-effort grounding is preserved (other annotations in the batch are unaffected) but no annotation is ever saved with a wrong page.
+  - **Bug — label-set lookup outside the per-annotation guard caused all-or-nothing failure** (`opencontractserver/utils/extraction_grounding.py`, `_create_grounding_annotations`): `corpus.ensure_label_and_labelset(...)` was invoked once before the per-annotation `try/with transaction.atomic()` loop. A failure to materialise the label-set (e.g. a transient DB error or a pre-existing constraint conflict) propagated out, was caught by the outer `try/except` in `data_extract_tasks.py`, and silently dropped *all* groundings for the datacell. Moved the call inside the savepoint so a label-lookup failure only skips the affected annotation.
+  - **Bug — duplicate `OC_EXTRACT_SOURCE` annotations on Celery retry** (`opencontractserver/utils/extraction_grounding.py`, `_create_pdf_annotation` & `_create_span_annotation`): nothing prevented the grounding pipeline from creating fresh annotations and re-linking them via `datacell.sources.add(*annotations)` if `ground_extraction_to_annotations` ran twice on the same datacell (Celery retry after partial failure was the realistic trigger). Replaced the construct-then-`save()` flow with `Annotation.objects.get_or_create()` keyed on `(document, annotation_label, annotation_type, raw_text, …)` so retries reuse existing rows. `datacell.sources` is a `ManyToManyField`, so re-linking the same row is already a no-op once the row is shared.
+  - **Constant — extracted `DOCX_MIME_TYPE`** (`opencontractserver/constants/document_processing.py`): the long `application/vnd.openxmlformats-officedocument.wordprocessingml.document` literal previously lived inline in `_load_document_text_and_layer`. Per the project's no-magic-strings rule it now sits next to `MARKDOWN_MIME_TYPE` and is imported from one place.
+  - **Type annotations** (`opencontractserver/utils/extraction_grounding.py`): parameter and return types using `Document`, `Corpus`, `Datacell`, `Annotation`, and `AnnotationLabel` were added to every public and helper function, with the imports guarded by a `TYPE_CHECKING` block. No runtime change.
+  - **Documentation** — the `page=1` placeholder for SPAN_LABEL annotations (text/DOCX) is now documented in the function docstring, explaining that the `txt_extract_file` pipeline does not preserve a page-break map and the actual location lives in the character offsets in `json`.
+  - **Tests** — `opencontractserver/tests/test_extraction_grounding.py`:
+    - `TestGroundingPipelinePDFIntegration` (new class): builds a synthetic two-page PAWLS payload (no real PDF binary needed), runs grounding through `build_translation_layer`, and verifies (a) annotations land on the correct page, (b) re-running grounding is idempotent, and (c) when PlasmaPDF returns `page=None` the annotation is **skipped** instead of being saved on page 1.
+    - `test_ground_text_document_is_idempotent`: regression for the duplicate-annotation bug on the SPAN_LABEL path.
+
 - **`CreateCorpusActionModal` opened with the wrong default agent instructions for document triggers** (Issue #1385, `frontend/src/components/corpuses/CreateCorpusActionModal.tsx:136-144,168-171`): the `inlineAgentInstructions` state was initialised with `DEFAULT_MODERATOR_INSTRUCTIONS` even though the default trigger is `add_document` (a document trigger). The trigger-change handler at line 611 swaps to `DEFAULT_DOCUMENT_AGENT_INSTRUCTIONS`, but a user who created an inline agent on the default-selected trigger without first re-selecting the trigger would submit the moderator copy as the new agent's system instructions. Initialised both the `useState` default and `resetForm()` to `DEFAULT_DOCUMENT_AGENT_INSTRUCTIONS` so the pre-interaction value matches the default trigger. Updated `frontend/tests/CreateCorpusActionModal.ct.tsx` "inline-agent create: full happy path" mutation mock to expect `DEFAULT_DOCUMENT_AGENT_INSTRUCTIONS` — the previous mock variable masked this bug because `MockedProvider` was matching the stale moderator default rather than the trigger-appropriate one.
 
 ### Changed
 
@@ -1,4 +1,6 @@
 from django.contrib import admin
+from django.db.models import QuerySet
+from django.http import HttpRequest
 from django.utils.html import format_html
 from guardian.admin import GuardedModelAdmin
 
@@ -146,7 +148,7 @@ class AgentActionResultAdmin(GuardedModelAdmin):
         ),
     )
 
-    def status_badge(self, obj):
+    def status_badge(self, obj: AgentActionResult) -> str:
         """Display status as a colored badge."""
         colors = {
             "pending": "#6c757d",  # gray
@@ -165,7 +167,7 @@ def status_badge(self, obj):
     status_badge.short_description = "Status"
     status_badge.admin_order_field = "status"
 
-    def corpus_action_link(self, obj):
+    def corpus_action_link(self, obj: AgentActionResult) -> str:
         """Link to the corpus action in admin."""
         if obj.corpus_action:
             name = (
@@ -182,7 +184,7 @@ def corpus_action_link(self, obj):
 
     corpus_action_link.short_description = "Corpus Action"
 
-    def document_link(self, obj):
+    def document_link(self, obj: AgentActionResult) -> str:
         """Link to the document in admin."""
         if obj.document:
             title = (
@@ -199,7 +201,7 @@ def document_link(self, obj):
 
     document_link.short_description = "Document"
 
-    def tools_count(self, obj):
+    def tools_count(self, obj: AgentActionResult) -> str:
         """Display count of tools executed."""
         if obj.tools_executed:
             count = len(obj.tools_executed)
@@ -212,7 +214,7 @@ def tools_count(self, obj):
 
     tools_count.short_description = "Tools"
 
-    def duration_display(self, obj):
+    def duration_display(self, obj: AgentActionResult) -> str:
         """Display execution duration."""
         duration = obj.duration_seconds
         if duration is not None:
@@ -228,7 +230,7 @@ def duration_display(self, obj):
 
     duration_display.short_description = "Duration"
 
-    def agent_response_display(self, obj):
+    def agent_response_display(self, obj: AgentActionResult) -> str:
         """Display agent response with formatting."""
         if obj.agent_response:
             # Truncate very long responses in admin
@@ -240,7 +242,7 @@ def agent_response_display(self, obj):
 
     agent_response_display.short_description = "Agent Response"
 
-    def tools_executed_display(self, obj):
+    def tools_executed_display(self, obj: AgentActionResult) -> str:
         """Display tools executed as formatted JSON."""
         if obj.tools_executed:
             import json
@@ -254,7 +256,7 @@ def tools_executed_display(self, obj):
 
     tools_executed_display.short_description = "Tools Executed"
 
-    def execution_metadata_display(self, obj):
+    def execution_metadata_display(self, obj: AgentActionResult) -> str:
         """Display execution metadata as formatted JSON."""
         if obj.execution_metadata:
             import json
@@ -265,7 +267,7 @@ def execution_metadata_display(self, obj):
 
     execution_metadata_display.short_description = "Execution Metadata"
 
-    def get_queryset(self, request):
+    def get_queryset(self, request: HttpRequest) -> QuerySet[AgentActionResult]:
         """Optimize queryset with select_related."""
         return (
             super()
 
@@ -3,6 +3,6 @@
 
 
 class AgentsConfig(AppConfig):
-    default_auto_field = "django.db.models.BigAutoField"
-    name = "opencontractserver.agents"
+    default_auto_field: str = "django.db.models.BigAutoField"
+    name: str = "opencontractserver.agents"
     verbose_name = _("Agents")
@@ -11,7 +11,7 @@
 import logging
 import re
 from datetime import datetime, timezone
-from typing import TYPE_CHECKING
+from typing import TYPE_CHECKING, Any
 
 from channels.db import database_sync_to_async
 from django.core.files.base import ContentFile
@@ -74,7 +74,7 @@ def _build_empty_memory(corpus_id: int) -> str:
 # ---------------------------------------------------------------------------
 
 
-async def get_or_create_memory_document(corpus: Corpus, user) -> Document:
+async def get_or_create_memory_document(corpus: Corpus, user: Any) -> Document:
     """Get the existing memory document or create a new empty one.
 
     If the corpus already has a valid ``memory_document``, return it.
@@ -95,7 +95,7 @@ async def get_or_create_memory_document(corpus: Corpus, user) -> Document:
 
     from opencontractserver.corpuses.models import Corpus as CorpusModel
 
-    def _get_or_create_sync():
+    def _get_or_create_sync() -> Document:
         # Phase 1: Check under lock whether a memory document already exists.
         with transaction.atomic():
             locked = CorpusModel.objects.select_for_update().get(pk=corpus.pk)
@@ -164,7 +164,7 @@ async def read_memory_content(corpus: Corpus) -> str:
     if not corpus.memory_document_id:
         return ""
 
-    def _read():
+    def _read() -> str:
         doc = corpus.memory_document
         if doc is None or not doc.txt_extract_file:
             return ""
@@ -187,7 +187,9 @@ def _read():
     return await database_sync_to_async(_read)()
 
 
-async def update_memory_content(corpus: Corpus, new_content: str, user) -> Document:
+async def update_memory_content(
+    corpus: Corpus, new_content: str, user: Any
+) -> Document:
     """Update the memory document with new content.
 
     Creates the memory document if it doesn't exist.  Overwrites the
@@ -215,7 +217,7 @@ async def update_memory_content(corpus: Corpus, new_content: str, user) -> Docum
 
     doc = await get_or_create_memory_document(corpus, user)
 
-    def _update():
+    def _update() -> Document:
         from django.db import transaction
 
         with transaction.atomic():
@@ -431,7 +433,7 @@ def merge_curation_into_memory(
     current_content: str,
     collection_patterns: list[str],
     query_patterns: list[str],
-    refinements: list[dict],
+    refinements: list[dict[str, Any]],
 ) -> str:
     """Merge curation results into the existing memory document.
 
 
@@ -1,5 +1,5 @@
 from django.contrib import admin
-from django.urls import path
+from django.urls import URLPattern, path
 from guardian.admin import GuardedModelAdmin
 
 from opencontractserver.analyzer.admin_views import AnalyzerSyncView
@@ -16,7 +16,7 @@ class AnalyzerAdmin(GuardedModelAdmin):
     list_display = ["id", "description", "task_name", "host_gremlin"]
     change_list_template = "admin/analyzer/analyzer_changelist.html"
 
-    def get_urls(self):
+    def get_urls(self) -> list[URLPattern]:
         urls = super().get_urls()
         custom_urls = [
             path(
 
@@ -1,5 +1,8 @@
+from typing import Any
+
 from django.contrib import messages
 from django.contrib.admin.views.decorators import staff_member_required
+from django.http import HttpRequest, HttpResponse
 from django.shortcuts import redirect, render
 from django.urls import reverse
 from django.utils.decorators import method_decorator
@@ -17,11 +20,11 @@
 class AnalyzerSyncView(View):
     """Custom admin view for syncing doc analyzer tasks"""
 
-    template_name = "admin/analyzer/analyzer_sync.html"
+    template_name: str = "admin/analyzer/analyzer_sync.html"
 
-    def get_available_analyzers(self):
+    def get_available_analyzers(self) -> list[dict[str, Any]]:
         """Get info about all available doc analyzer tasks"""
-        analyzers = []
+        analyzers: list[dict[str, Any]] = []
 
         for task_name in celery_app.tasks.keys():
             analyzer_task = get_doc_analyzer_task_by_name(task_name)
@@ -48,8 +51,8 @@ def get_available_analyzers(self):
 
         return sorted(analyzers, key=lambda x: (x["exists"], x["task_name"]))
 
-    def get(self, request):
-        context = {
+    def get(self, request: HttpRequest) -> HttpResponse:
+        context: dict[str, Any] = {
             "title": "Sync Doc Analyzer Tasks",
             "analyzers": self.get_available_analyzers(),
             "opts": Analyzer._meta,
@@ -59,7 +62,7 @@ def get(self, request):
         }
         return render(request, self.template_name, context)
 
-    def post(self, request):
+    def post(self, request: HttpRequest) -> HttpResponse:
         if not request.user.has_perm("analyzer.add_analyzer"):
             messages.error(request, "You don't have permission to create analyzers.")
             return redirect(reverse("admin:analyzer_sync"))
 
@@ -3,10 +3,10 @@
 
 
 class AnnotationsConfig(AppConfig):
-    default_auto_field = "django.db.models.BigAutoField"
-    name = "opencontractserver.analyzer"
+    default_auto_field: str = "django.db.models.BigAutoField"
+    name: str = "opencontractserver.analyzer"
 
-    def ready(self):
+    def ready(self) -> None:
         try:
             import opencontractserver.analyzer.signals  # noqa F401
             from opencontractserver.analyzer.models import GremlinEngine
 
@@ -1,13 +1,15 @@
+from typing import Any
+
 from django.core.checks import Warning, register
 
 
 @register()
-def check_unsynced_analyzers(app_configs, **kwargs):
+def check_unsynced_analyzers(app_configs: Any, **kwargs: Any) -> list[Warning]:
     """
     Check if there are doc_analyzer_task decorated functions
     that haven't been synced to the database.
     """
-    warnings = []
+    warnings: list[Warning] = []
 
     try:
         from opencontractserver.analyzer.models import Analyzer
 
@@ -1,5 +1,7 @@
+from typing import Any
+
 from django.contrib.auth import get_user_model
-from django.core.management.base import BaseCommand
+from django.core.management.base import BaseCommand, CommandParser
 
 from opencontractserver.analyzer.models import Analyzer
 from opencontractserver.analyzer.utils import auto_create_doc_analyzers
@@ -8,14 +10,14 @@
 class Command(BaseCommand):
     help = "Synchronize doc_analyzer_task decorated functions with Analyzer database"
 
-    def add_arguments(self, parser):
+    def add_arguments(self, parser: CommandParser) -> None:
         parser.add_argument(
             "--dry-run",
             action="store_true",
             help="Show what would be created without making changes",
         )
 
-    def handle(self, *args, **options):
+    def handle(self, *args: Any, **options: Any) -> None:
         UserModel = get_user_model()
 
         if options["dry_run"]: