From 4eb358f254b51e2c0f7307587dc91550c343e507 Mon Sep 17 00:00:00 2001
From: Donald Gray
Date: Thu, 21 May 2026 11:40:13 +0100
Subject: [PATCH 1/3] Regenerate claude.md file

The previous version contained information about the initial implementation and its reference implementations, which is no longer relevant.
---
 CLAUDE.md | 315 ++++++++++++++++++++++++++----------------------------
 1 file changed, 153 insertions(+), 162 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 97907b6..e6f122a 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -2,207 +2,198 @@

 ## Project Overview

-A standalone .NET 10 C# solution providing IIIF Text Services — specifically:
-1. A **class library** for building `Text` search index objects from ALTO (and future text-segmentation) files.
-2. A **Builder API** — async job-based HTTP API that accepts IIIF Manifests or simple page sequences and produces stored `Text` + `AutoComplete` artefacts.
-3. A **Search API** — public-facing HTTP API providing IIIF Search v1 (and later v2) endpoints backed by those artefacts.
-
-### Background reading
-- METS-ALTO standard: https://www.loc.gov/standards/alto/
-- IIIF Search API v1: https://iiif.io/api/search/1.0/
-- IIIF Search API v2: https://iiif.io/api/search/2.0/
-- Key IIIF Manifest example (has seeAlso ALTO links): https://iiif.wellcomecollection.org/presentation/b21211024
-
-### Reference implementations to consult
-These are **read-only references** — do not modify them.
-- **Wellcome**: `C:/git/wellcomecollection/iiif-builder/src/Wellcome.Dds/`
-  - Core models: `Wellcome.Dds/WordsAndPictures/` — `Text`, `Word`, `Image`, `ComposedBlock`, `ResultRect`
-  - Builder: `Wellcome.Dds.Repositories/WordsAndPictures/AltoSearchTextProvider.cs`
-  - Search controller: `Wellcome.Dds.Server/Controllers/SearchController.cs`
-- **St Louis Fed**: `C:/git/digirati-co-uk/st-louis-fed/src/IIIFBuilder/`
-  - Models: `IIIFBuilder.Models/WordsAndPictures/`
-  - Builder: `IIIFBuilder.Processor/Alto/Building/TextBuilder.cs`, `AltoRescaler.cs`
-  - Search: `IIIFBuilder.API/Features/Search/`
-
-The goal is a recognisably similar implementation to these two, using the same proven `Text`/`Word`/`Image`/`ComposedBlock` model with Protobuf serialisation, then improving later if needed.
+A .NET 10 C# solution providing IIIF Text Services:
+
+1. **TextServices.Core** — class library for building `Text` search index objects from ALTO, hOCR, VTT, and W3C Annotation files.
+2. **TextServices.Builder.Api** — async job-based HTTP API that accepts IIIF Manifests or inline page sequences and produces stored artefacts.
+3. **TextServices.Search.Api** — public-facing HTTP API providing IIIF Search v1/v2, autocomplete, annotations, plain text, PDF, figures, and text-augmented Manifest endpoints.
+4. **TextServices.Pdf** — iText7-based PDF builder (used by Search API).
+5. **TextServices.Demo** — lightweight static-file web app + config endpoint for local development and demos.

 ---

 ## Solution Structure

 ```
-C:/git/tomcrane/TextServices/
-├── instructions/                  # Project brief and decision log (not code)
+TextServices/
+├── instructions/                     # Project brief and decision log (not code)
 ├── src/
 │   ├── TextServices.sln
-│   ├── TextServices.Core/         # Class library — models, text building, format providers
-│   ├── TextServices.Storage/      # Storage abstraction + filesystem + S3 implementations
-│   ├── TextServices.Infrastructure/ # Shared ASP.NET middleware — Serilog, CorrelationId
-│   ├── TextServices.Builder.Api/  # ASP.NET 10 — async job API for building Text artefacts
-│   ├── TextServices.Search.Api/   # ASP.NET 10 — IIIF Search and Autocomplete API
-│   ├── TextServices.Tests/        # XUnit + FluentAssertions unit/integration tests
-│   └── TextServices.Tests.E2E/    # Playwright end-to-end tests
+│   ├── TextServices.Core/            # Models, text building, format providers (no HTTP/storage)
+│   ├── TextServices.Storage/         # ITextStore abstraction + filesystem + S3 implementations
+│   ├── TextServices.Pdf/             # iText7 PDF builder
+│   ├── TextServices.Infrastructure/  # Shared middleware — Serilog, CorrelationId
+│   ├── TextServices.Builder.Api/     # ASP.NET 10 — job management + Hangfire pipeline
+│   ├── TextServices.Search.Api/      # ASP.NET 10 — all read/search endpoints
+│   ├── TextServices.Demo/            # ASP.NET 10 — static demo site + /demo-config
+│   ├── TextServices.Tests/           # XUnit + FluentAssertions unit tests
+│   └── TextServices.Tests.E2E/       # Playwright / WebApplicationFactory integration tests
 └── CLAUDE.md
 ```

-### TextServices.Core
-The portable class library. No ASP.NET, no storage, no HTTP. Callers supply `XElement` instances per page.
+---

-Key classes (modelled closely on the reference implementations):
-- `Text` — Protobuf-serialised root object: `NormalisedFullText`, `RawFullText`, `Words` (Dictionary), `Images` (Image[]), `ComposedBlocks` (ComposedBlock[])
-- `Word` — Protobuf-serialised word with bounding box and position fields
-- `Image` — Protobuf-serialised page boundary (`StartCharacter`, `ImageIdentifier`)
-- `ComposedBlock` — Protobuf-serialised table/illustration region
-- `ResultRect` — search hit bounding box (not persisted, computed at query time)
-- `AutoComplete` — **separate Protobuf object** (`Buckets: Dictionary>`); stored independently from `Text`
-- `TextBuildResult` — wraps `Text` + `AutoComplete` + `IsEmpty` flag
-- `ITextFormatProvider` — interface for pluggable text-segmentation formats (ALTO, hOCR, etc.)
-- `AltoTextFormatProvider` — METS-ALTO implementation; handles ns-v2 and ns-v3 namespaces, hyphenation (SUBS_TYPE="HypPart1"), composedblocks
-- `TextBuilder` — accumulator that takes a sequence of `(id, width, height, XElement)` entries and returns a `TextBuildResult`
+## TextServices.Core

-**Important design note**: The class library does NOT fetch URIs. The caller (Builder API) fetches ALTO files and passes `XElement` to the library.
+No ASP.NET, no storage, no HTTP. Callers supply parsed content per page.

-### TextServices.Storage
-Storage abstraction and implementations.
+**Models** (Protobuf-serialised):
+- `Text` — root object: `NormalisedFullText`, `RawFullText`, `Words` (`Dictionary`), `Images` (`Image[]`), `ComposedBlocks` (`ComposedBlock[]`)
+- `Word` — word with bounding box and position
+- `Image` — page boundary (`StartCharacter`, `ImageIdentifier`)
+- `ComposedBlock` — table/illustration region
+- `AutoComplete` — separate Protobuf object (`Buckets: Dictionary>`); stored independently from `Text`
+- `ResultRect` — search hit bounding box (computed at query time, not persisted)
+- `TextBuildResult` — wraps `Text` + `AutoComplete` + `IsEmpty`

-- `ITextStore` — interface: `SaveText(key, Text)`, `LoadText(key)`, `SaveAutoComplete(key, AutoComplete)`, `LoadAutoComplete(key)`, `SaveManifest(key, string json)`, `LoadManifest(key)`, `Exists(key)`
-- `FileSystemTextStore` — stores files under a configured root path
-- `S3TextStore` — stores objects in a configured S3 bucket
+**Providers** (implement `ITextFormatProvider`):
+- `AltoTextFormatProvider` — METS-ALTO; handles ns-v2, ns-v3, namespace-free; hyphenation; ComposedBlocks
+- `HocrTextFormatProvider` — hOCR HTML
+- `VttTextFormatProvider` — WebVTT (implements `ITranscriptFormatProvider`)
+- `W3cAnnotationTextFormatProvider` — W3C Web Annotation JSON (implements `IStringFormatProvider`)
+- `AltoTextFormatProvider` is the default when profile/label are absent

-Keys are job IDs (e.g. `"2/books/my-book"`) which may contain `/` characters — implementations must handle this (e.g. as path segments or S3 key prefix).
+**Entry point**: `TextBuilder` — accumulates `(id, width, height, source)` entries and returns a `TextBuildResult`.

-### TextServices.Builder.Api
-ASP.NET 10 minimal API or controller-based.
+---

-**Infrastructure:**
-- PostgreSQL via EntityFramework Core + EF Migrations
-- Hangfire (with `Hangfire.PostgreSql`) for durable background job processing
-- `ITextStore` injected (filesystem for local dev, S3 for cloud)
-- `HttpClient` (via `IHttpClientFactory`) for fetching Manifests and ALTO files
+## TextServices.Storage

-**Endpoints:**
-- `POST /textbuilder` — accepts a job instruction (see data shapes below), enqueues a Hangfire job, returns HTTP 202 with `Location: /textbuilder/{id}`
-- `GET /textbuilder/{**id}` — returns current job state (polling)
-- `DELETE /textbuilder/{**id}` — cancels/removes a job (TBD)
+**`ITextStore`** methods (key = job ID, may contain `/`):
+- `SaveText` / `LoadText`
+- `SaveAutoComplete` / `LoadAutoComplete`
+- `SaveManifest` / `LoadManifest` — raw IIIF Manifest JSON
+- `SaveRawText` / `LoadRawText` — un-normalised full text
+- `SavePdf` / `LoadPdf` — searchable PDF derivative
+- `SaveFigures` / `LoadFigures` — IIIF AnnotationPage JSON for ComposedBlocks
+- `SaveAnnotations` / `LoadAnnotations` — manifest-level line annotations
+- `SaveCapabilities` / `LoadCapabilities` — `JobServices` bitmask (null = all enabled)
+- `SavePageSequence` / `LoadPageSequence` — ordered page list for sourceData jobs
+- `DeleteArtefacts` — removes all artefacts for a key (call before reprocessing)
+- `Exists` — checks whether a Text artefact exists
+
+Implementations: `FileSystemTextStore`, `S3TextStore`.
+
+---
+
+## TextServices.Builder.Api
+
+**Infrastructure**: PostgreSQL via EF Core + Migrations, Hangfire (PostgreSql), MediatR.

-**Job instruction (POST body):**
+**Builder API endpoints:**
+- `POST /textbuilder` — enqueue a job; returns 202 with `Location: /textbuilder/{id}`
+- `GET /textbuilder` — list jobs (paged, filterable by status)
+- `GET /textbuilder/{**id}` — job state
+- `PUT /textbuilder/{**id}` — reprocess (re-enqueue) an existing job
+- `DELETE /textbuilder/{**id}` — delete job record and stored artefacts
+
+**Job instruction (POST/PUT body):**
 ```json
 {
   "id": "2/books/my-book",
   "sourceUri": "https://iiif.wellcomecollection.org/presentation/b21211024"
 }
 ```
-or with inline data:
+or with inline pages:
 ```json
 {
   "id": "2/books/my-book",
   "sourceData": [
-    { "id": "page/1", "width": 4000, "height": 6000, "text": "https://..." },
-    { "id": "page/2", "width": 4004, "height": 6006, "text": "https://..." }
+    { "id": "page/1", "width": 4000, "height": 6000, "text": "https://..." }
   ]
 }
 ```
-Exactly one of `sourceUri` or `sourceData` must be present.

-**Job state (GET response):**
-```json
-{
-  "id": "2/books/my-book",
-  "sourceUri": "https://...",
-  "sourceData": null,
-  "created": "2026-03-28T17:56:52Z",
-  "started": null,
-  "finished": null,
-  "totalPages": 0,
-  "pagesCompleted": 0,
-  "totalWordCount": 0,
-  "totalImageCount": 0,
-  "errors": null,
-  "searchV1": null,
-  "autocompleteV1": null,
-  "searchV2": null,
-  "autocompleteV2": null
-}
-```
-The `searchV1` etc. URLs are computed from configuration (base URL) + job ID, not stored in the DB.
+**`BuilderJob` entity** (PostgreSQL, snake_case naming): `Id`, `SourceUri`, `SourceDataJson`, `Status`, `Created`, `Started`, `Finished`, `TotalPages`, `PagesCompleted`, `TotalWordCount`, `TotalImageCount`, `Errors`, `HangfireJobId`, `Services` (bitmask), `Title`, `CustomTypesJson`.
+
+**`TextBuildJob`** (Hangfire): fetches manifest/resources concurrently (bounded by `MaxConcurrentAltoFetches`), feeds `TextBuilder`, persists all artefacts, records per-page warnings without aborting the job.
+
+**Services that Builder.Api registers**: `IResourceFetcher`, `IManifestFetcher`, `IManifestReducer`, `IManifestSynthesiser`, `IAltoFetcher`, `IVttFetcher`, `IAnnotationPageFetcher`, `ITextStore`.

-**IIIF Manifest handling (when sourceUri points to a Manifest):**
-- Fetch and parse as IIIF Presentation 3 JSON (not older versions — reject if not v3)
-- Reduce to the `sourceData` page sequence: Canvas `id` → `id`, Canvas `width`/`height` → dimensions, `seeAlso` ALTO link → `text` URI
-- Detect ALTO seeAlso links by `profile` containing `alto` (case-insensitive) or `label` containing "ALTO" or "METS-ALTO"
-- Canvases without an ALTO seeAlso are included with `text: null` (sparse sequences are normal, not errors)
-- Store a copy of the original Manifest JSON alongside the Text artefacts
+**`ManifestReducer`**: uses `iiif-net` to parse IIIF Presentation 3 JSON. Reduces to a page sequence (Canvas id, width, height, seeAlso text URI). Detects ALTO by `profile` or `label` containing "alto". Canvases without ALTO are silently included with `text: null`.

-### TextServices.Search.Api
-ASP.NET 10, separate application (separate process/deployment from Builder API).
+**`ManifestSynthesiser`**: builds a synthetic IIIF Manifest from a `sourceData` page sequence (used when no source Manifest URI is available).

-Uses MediatR for request/response pattern.
+---
+
+## TextServices.Search.Api
+
+MediatR for all request/response. `ITextCache` with `AsyncKeyedLock` prevents thundering-herd on cache misses.

 **Endpoints:**
-- `GET /search/v1/{**id}?q={term}` — IIIF Search API v1 response
-- `GET /autocomplete/v1/{**id}?q={term}` — IIIF Search API v1 autocomplete response
-- `GET /text-augmented/v3/{**id}` — stored Manifest JSON decorated with search service links
+- `GET /search/v1/{**id}?q=` — IIIF Search API v1
+- `GET /search/v2/{**id}?q=` — IIIF Search API v2
+- `GET /autocomplete/v1/{**id}?q=` — IIIF Search API v1 autocomplete
+- `GET /autocomplete/v2/{**id}?q=` — IIIF Search API v2 autocomplete
+- `GET /text-augmented/v3/{**id}` — stored Manifest decorated with search/autocomplete service descriptors
+- `GET /plain-text/{**id}` — raw full text
+- `GET /pdf/{**id}` — streaming searchable PDF
+- `GET /figures/{**id}` — IIIF AnnotationPage of ComposedBlock regions
+- `GET /annotations/{**id}` — manifest-level line annotations
+- `GET /annotations/word/{**id}` — word-level annotations
+- `GET /annotations/line/{**id}` — line-level annotations
+- `GET /proxy/image` — optional file-URI image proxy (local dev only)
+- `GET /cache/clear/{**id}` — evict a key from the in-memory cache
+
+CORS: `*` on all responses (IIIF requirement). Response compression: Brotli + Gzip for JSON/ld+json.
+
+---
+
+## TextServices.Pdf
+
+`PdfBuilder` class using **iText7**. Builds a searchable PDF from a Text artefact + image URLs. Registered in Search API as a singleton; images are fetched via a named `HttpClient`.
+
+---

-**text-augmented behaviour:**
-- Load stored Manifest JSON
-- Set `id` to the fully-qualified URL of `/text-augmented/v3/{id}`
-- Append search/autocomplete service descriptors to the `service` array (create if absent)
-- Return as `application/json` (or `application/ld+json`)
-- The application treats the Manifest as plain JSON (no full IIIF parser needed)
+## TextServices.Demo

-**Caching:**
-- Memory-cache loaded `Text` and `AutoComplete` objects (sliding expiration)
-- Use `AsyncKeyedLock` to prevent thundering-herd on cache misses
+Static-file web app (index.html, builder.html, viewer.html, compare.html, annotations.html) plus `GET /demo-config` which injects Builder/Search API URLs and local fixture paths into the browser so JavaScript never hard-codes them.

 ---

 ## Key Technical Decisions

-| Decision | Choice | Rationale |
-|---|---|---|
-| Root namespace | `TextServices` | Simple, unambiguous |
-| Text serialisation | protobuf-net | Proven in production; compact binary; fast |
-| AutoComplete storage | Separate file from Text | St Louis pattern — Search API can load only what it needs |
-| Job queue | Hangfire + PostgreSQL | Durable, survives restarts; works locally, on AWS RDS, Azure PostgreSQL |
-| IIIF JSON building | Plain `System.Text.Json` | No iiif-net dependency for now; may add later |
-| Request/response pattern | MediatR | Testable; decouples HTTP from business logic |
-| ORM | EntityFramework Core + Migrations | Standard; PostgreSQL via Npgsql |
-| IIIF Presentation support | v3 only | Brief specifies v3 only for Builder API |
-| Search API version | v1 first, v2 later | Brief specifies this phasing |
-| Text format extensibility | `ITextFormatProvider` interface | Don't tie to ALTO; hOCR etc. to follow |
-| ALTO namespace support | Both ns-v2 and ns-v3 | Multiple ALTO versions exist in the wild |
-| Builder/Search separation | Two separate ASP.NET apps | No resource contention; builder can be turned off |
+| Decision | Choice |
+|---|---|
+| Text serialisation | protobuf-net |
+| AutoComplete storage | Separate file from Text |
+| Job queue | Hangfire + PostgreSQL |
+| IIIF parsing | iiif-net (Builder.Api) |
+| IIIF JSON manipulation | `JsonNode`/`JsonDocument` (Search.Api) |
+| Request/response pattern | MediatR |
+| ORM | EF Core + Npgsql + snake_case naming |
+| PDF generation | iText7 |
+| IIIF Presentation support | v3 only |
+| Search API versions | v1 + v2 both implemented |

 ---

 ## Development Workflow

-- Work in feature branches; submit well-described PRs to `main`
-- The solution lives in `src/`; instructions in `instructions/`; this file at repo root
-- Build up code incrementally — do not attempt to deliver the whole solution in one PR
-- Tests should be written alongside the code they test, not after
-
-### Suggested PR sequence
-1. Solution scaffold + `TextServices.Core` models (Protobuf contracts)
-2. `TextServices.Core` ALTO parsing + `TextBuilder`
-3. `TextServices.Storage` abstraction + filesystem implementation
-4. `TextServices.Builder.Api` — EF schema, job CRUD, Hangfire wiring
-5. `TextServices.Builder.Api` — IIIF Manifest fetching + reduction to page sequence
-6. `TextServices.Builder.Api` — full job processing pipeline
-7. `TextServices.Search.Api` — search and autocomplete endpoints
-8. `TextServices.Search.Api` — text-augmented Manifest endpoint
-9. `TextServices.Storage` — S3 implementation
-10. `TextServices.Tests.E2E` — integration test suite
+- Feature branches → PRs to `main`
+- Solution in `src/`, instructions in `instructions/`, this file at repo root
+- EF migrations in `TextServices.Builder.Api/Migrations/`; apply via `dotnet ef database update` or the auto-apply on startup in Development

----
+### Running locally
+
+Both APIs need `appsettings.Development.json` with:
+```json
+{
+  "ConnectionStrings": { "BuilderDb": "Host=localhost;Database=textservices;..." },
+  "TextServices": {
+    "Storage": { "RootPath": "C:/some/local/path" },
+    "SearchApiBaseUrl": "https://localhost:7001"
+  }
+}
+```
+
+The Demo app needs `appsettings.Development.json` with Builder and Search API base URLs configured under the `Demo` section.

-## Testing
+### Testing

-- **Unit/integration tests**: XUnit + FluentAssertions in `TextServices.Tests`
-- **Integration tests**: `Microsoft.AspNetCore.Mvc.Testing` in `TextServices.Tests.E2E`
-- **Test fixtures**: Random Wellcome Manifests from `https://iiif.wellcomecollection.org/service/suggest-b-number?q=imfeelinglucky`; use returned b-number in `https://iiif.wellcomecollection.org/presentation/{b-number}`
-- Fixtures cover a good mix: small manifests, large (100s of canvases), manifests with no ALTO links
-- Existing Wellcome search services can be used for comparative validation
+- **Unit tests**: `TextServices.Tests` — XUnit + FluentAssertions; covers Core parsing, Search handlers, Builder services, Infrastructure middleware
+- **E2E / integration tests**: `TextServices.Tests.E2E` — `WebApplicationFactory` + Playwright; fixtures under `TextServices.Tests.E2E/Fixtures/`
+- Test fixtures: Wellcome Collection manifests; `https://iiif.wellcomecollection.org/service/suggest-b-number?q=imfeelinglucky` for random b-numbers

 ---

@@ -210,27 +201,27 @@ Uses MediatR for request/response pattern.

 | Package | Project(s) | Purpose |
 |---|---|---|
-| `protobuf-net` | Core, Storage | Binary serialisation of Text models |
-| `Hangfire.Core` + `Hangfire.AspNetCore` + `Hangfire.PostgreSql` | Builder.Api | Durable background job queue |
+| `protobuf-net` | Core, Storage | Protobuf serialisation |
+| `iiif-net` | Builder.Api | IIIF Presentation 3 parsing |
+| `Hangfire.Core` + `Hangfire.AspNetCore` + `Hangfire.PostgreSql` | Builder.Api | Background job queue |
 | `Npgsql.EntityFrameworkCore.PostgreSQL` | Builder.Api | PostgreSQL EF provider |
+| `EFCore.NamingConventions` | Builder.Api | snake_case column naming |
 | `MediatR` | Builder.Api, Search.Api | Request/response decoupling |
 | `AsyncKeyedLock` | Search.Api | Per-key async locking for cache |
-| `Microsoft.Extensions.Caching.Memory` | Search.Api | In-process memory cache |
+| `itext7` | Pdf | PDF generation |
 | `AWSSDK.S3` | Storage | S3 implementation |
-| `Microsoft.AspNetCore.Mvc.Testing` | Tests.E2E | In-process HTTP integration tests |
-| `FluentAssertions` | Tests | Readable test assertions |
+| `Serilog.AspNetCore` | Builder.Api, Search.Api | Structured logging |

 ---

 ## Important Implementation Notes

-- **Coordinate rescaling**: ALTO records image dimensions at OCR time, which may differ from the IIIF Canvas dimensions. Always rescale word bounding boxes from ALTO dimensions to Canvas dimensions. See `AltoRescaler` in the St Louis reference.
-- **Hyphenation**: Merge words with `SUBS_TYPE="HypPart1"` with the following word, omitting the hyphen from the normalised text.
-- **Text normalisation**: Lowercase, collapse whitespace, strip non-alphanumeric characters. Consistent with both reference implementations.
-- **Autocomplete buckets**: 3-character prefix bucketing. Only words longer than 2 characters. Results ordered by length then alphabetically.
-- **Sparse manifests**: Canvases without ALTO are not errors — skip them silently and record the gap.
-- **Job IDs with slashes**: Job IDs like `"2/books/my-book"` must be supported in URL routing (use catch-all route parameters `{**id}`).
-- **Protobuf field numbers**: Once assigned, never change them. See reference implementations for the established numbering.
-- **ComposedBlocks**: Capture table/illustration/figure bounding boxes from ALTO `` elements (type="Table", "Illustration", "Figure"). See Wellcome reference.
-- **IIIF Manifest as JSON**: The Search API treats stored Manifests as plain JSON objects — no need for a full IIIF object model. Use `JsonNode` or `JsonDocument` for patching.
-- **Database credentials**: Ask the user for PostgreSQL admin credentials before running the first EF migration.
+- **Coordinate rescaling**: ALTO image dimensions may differ from Canvas dimensions — always rescale word bounding boxes.
+- **Hyphenation**: Merge words with `SUBS_TYPE="HypPart1"` with the following word, omitting the hyphen from normalised text.
+- **Text normalisation**: Lowercase, collapse whitespace, strip non-alphanumeric.
+- **Autocomplete buckets**: 3-character prefix. Words longer than 2 characters only. Ordered by length then alphabetically.
+- **Job IDs with slashes**: Handled throughout with catch-all route parameters `{**id}`.
+- **Protobuf field numbers**: Never change once assigned.
+- **`JobServices` bitmask**: Controls which derivatives are built and which endpoints are active. A null capabilities file means all services are enabled.
+- **`DeleteArtefacts`**: Call before reprocessing a job so stale derivatives from a previous run don't persist.
+- **`AllowFileImageProxy`**: Only enable in trusted local-dev environments.
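
Reviewer's illustration, not part of the patch: the "Text normalisation" and "Autocomplete buckets" notes at the end of PATCH 1/3 can be sketched in C# roughly as follows. This is a minimal sketch under stated assumptions, not the project's code. The names `Normalise` and `BuildBuckets` are hypothetical, the bucket shape `Dictionary<string, List<string>>` is assumed (the patch does not show the type parameters), and trimming of leading/trailing whitespace, de-duplication, and ordinal ordering are assumptions too. It is written as top-level statements for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper: lowercase, drop non-alphanumeric characters
// (they are removed, not replaced by spaces), collapse whitespace runs.
static string Normalise(string raw)
{
    var sb = new StringBuilder(raw.Length);
    var pendingSpace = false;
    foreach (var c in raw)
    {
        if (char.IsWhiteSpace(c))
        {
            // Remember the gap; a single space is emitted only if more text follows.
            pendingSpace = sb.Length > 0;
        }
        else if (char.IsLetterOrDigit(c))
        {
            if (pendingSpace) sb.Append(' ');
            pendingSpace = false;
            sb.Append(char.ToLowerInvariant(c));
        }
        // Anything else (punctuation, symbols) is dropped.
    }
    return sb.ToString();
}

// Hypothetical helper: 3-character prefix buckets, words longer than
// 2 characters only, each bucket ordered by length then alphabetically.
static Dictionary<string, List<string>> BuildBuckets(IEnumerable<string> normalisedWords)
{
    return normalisedWords
        .Where(w => w.Length > 2)
        .Distinct()                      // assumption: one entry per distinct word
        .GroupBy(w => w[..3])
        .ToDictionary(
            g => g.Key,
            g => g.OrderBy(w => w.Length)
                  .ThenBy(w => w, StringComparer.Ordinal)
                  .ToList());
}

var normalised = Normalise("Hello, world! It's a foo-bar helper.");
Console.WriteLine(normalised);           // prints "hello world its a foobar helper"

var buckets = BuildBuckets(normalised.Split(' '));
Console.WriteLine(string.Join(", ", buckets["hel"]));   // prints "hello, helper"
```

Dropping punctuation rather than replacing it with a space matches the examples quoted in the decisions log (`"it's"` becomes `"its"`, `"foo-bar"` becomes `"foobar"`). Whether the real implementation trims or de-duplicates in the same way is not stated in the patch.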
From c74c5223b4c3992d6b6bbef033e12fc219619cd6 Mon Sep 17 00:00:00 2001
From: Donald Gray
Date: Thu, 21 May 2026 11:41:52 +0100
Subject: [PATCH 2/3] Cleanup reference terminology

---
 instructions/decisions.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/instructions/decisions.md b/instructions/decisions.md
index eb1e22d..1a60ecb 100644
--- a/instructions/decisions.md
+++ b/instructions/decisions.md
@@ -11,7 +11,7 @@ Code-level decisions are in CLAUDE.md. This log covers the *why*.
 Tom Crane (Digirati) briefed Claude on the requirements via `instructions/text-services.md`.
 Claude read both reference implementations in full:
 - Wellcome Collection: `C:/git/wellcomecollection/iiif-builder/src/Wellcome.Dds/`
-- St Louis Fed: `C:/git/digirati-co-uk/st-louis-fed/src/IIIFBuilder/`
+- SL: `C:/git/digirati-co-uk/st-louis-fed/src/IIIFBuilder/`

 ### Questions asked and answers given

@@ -23,8 +23,8 @@ Claude read both reference implementations in full:

 *Confirmed: Hangfire uses a pluggable storage backend (PostgreSQL, Redis, SQL Server). Works identically with local PostgreSQL, AWS RDS, and Azure Database for PostgreSQL. Just change the connection string.*

 **Q3: AutoComplete storage — embedded in Text object (Wellcome pattern) vs separate file (SL pattern)?**
 > "Let's go for the SL separate file pattern - unless you recommend otherwise"

 *No counter-recommendation — separate storage is the better design: the Search API can load AutoComplete independently from the (larger) Text object when serving autocomplete-only requests.*

@@ -93,7 +93,7 @@ reference's use of the two-argument `Select` overload.

 **PR review question:** Is the normalisation correct re the reference implementations?

-Both Wellcome and St Louis Fed use `ToAlphanumericOrWhitespace` which **drops** non-alphanumeric characters (keeping existing whitespace), rather than replacing them with spaces. Corrected in response to review: +Both Wellcome and SL use `ToAlphanumericOrWhitespace` which **drops** non-alphanumeric characters (keeping existing whitespace), rather than replacing them with spaces. Corrected in response to review: - `"it's"` → `"its"` (not `"it s"`) - `"foo-bar"` → `"foobar"` (not `"foo bar"`) - `"hello, world"` → `"hello world"` (comma dropped; adjacent space preserved) @@ -159,9 +159,9 @@ After both passes, the following are confirmed correct against both references: Intentional extensions beyond the references: - `ComposedBlock` has X, Y, W, H, BlockType (Wellcome has coordinates and type embedded differently; added here for completeness) -- AutoComplete stored as a separate file (St Louis pattern), not inside Text (Wellcome pattern) +- AutoComplete stored as a separate file (SL pattern), not inside Text (Wellcome pattern) - `StringComparison.Ordinal` in `IndexOf` rather than `InvariantCultureIgnoreCase` (equivalent for lowercase-only normalised text; Ordinal is marginally faster) -- Empty-norm words are skipped entirely by TextAccumulator rather than being written to raw but not norm text (St Louis partial-skip behaviour); our approach is cleaner +- Empty-norm words are skipped entirely by TextAccumulator rather than being written to raw but not norm text (SL partial-skip behaviour); our approach is cleaner --- From 69b221e1a22c67c0955050eb20dd9c48af97d627 Mon Sep 17 00:00:00 2001 From: Donald Gray Date: Thu, 21 May 2026 11:46:28 +0100 Subject: [PATCH 3/3] Skip CI when only markdown files change Add paths-ignore for **/*.md to the push trigger (it was already present on pull_request). Tag pushes are unaffected. 
Co-Authored-By: Claude Sonnet 4.6
---
 .github/workflows/ci.yml | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index 2db2684..aaa4b3a 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -4,9 +4,14 @@ on:
   push:
     branches: [ "main", "develop" ]
     tags: [ "v*" ]
+    paths-ignore:
+      - "**/*.md"
+      - "docs/**"
+      - "instructions/**"
   pull_request:
     branches: [ "main", "develop" ]
     paths-ignore:
+      - "**/*.md"
       - "docs/**"
       - "instructions/**"
