Merged
2 changes: 2 additions & 0 deletions Dockerfile.Builder
@@ -14,6 +14,8 @@ RUN dotnet publish TextServices.Builder.Api/TextServices.Builder.Api.csproj \
-c Release -o /app/publish --no-restore

FROM mcr.microsoft.com/dotnet/aspnet:10.0 AS runtime

RUN apt-get update && apt-get install -y libgssapi-krb5-2 libkrb5-3 krb5-user
WORKDIR /app
COPY --from=build /app/publish .
EXPOSE 8080
32 changes: 32 additions & 0 deletions docs/search-api.md
@@ -517,5 +517,37 @@ Search API configuration lives under the `TextServices` key in `appsettings.json
| `StorageRootPath` | `textservices-data` | Root directory of the text artefact store. Must point to the same location as the Builder API's `Storage:RootPath`. |
| `PdfTriggerQueueCapacity` | `50` | Maximum number of PDF trigger requests that can be queued for background generation. Requests beyond this limit receive `503 Service Unavailable`. |
| `PdfTriggerMaxConcurrency` | `2` | Maximum number of PDFs generated concurrently by the background trigger queue. Each in-flight generation buffers the full PDF in memory — keep this low on memory-constrained hosts. |
| `AllowFileImageProxy` | `false` | When `true`, the `/proxy/image` endpoint streams local `file://` images. Only enable in trusted local-dev environments where those files are not access-controlled. |
| `AllowedCustomHosts` | `[]` | Hostnames accepted from the `X-Forwarded-Host` request header (e.g. custom CloudFront distributions). See [Forwarded-header URL rewriting](#forwarded-header-url-rewriting) below. |

---

## Forwarded-header URL rewriting

When the Search API sits behind a reverse proxy that rewrites the public URL (e.g. a CloudFront
distribution with a custom domain), the `id` values in IIIF responses must reflect the
public-facing URL rather than the internal one.

Configure `AllowedCustomHosts` with the public hostnames you trust:

```json
{
  "TextServices": {
    "AllowedCustomHosts": ["custom.example.org"]
  }
}
```

When a request arrives carrying `X-Forwarded-Host: custom.example.org` and that value matches
an entry in `AllowedCustomHosts`:

- The host in all generated IIIF URLs is replaced with the forwarded host.
- If `X-Forwarded-Path` is also present, the Search API extracts the effective job ID from it
(stripping the route prefix), so the `id` values in the response reflect the public path
rather than the internal route. This is useful when the proxy maps a path like
`/iiif/search/my-book` to the internal `/search/v2/my-book`.

Hosts not in `AllowedCustomHosts` are always ignored, regardless of what headers the request
carries. The default empty array means `X-Forwarded-Host` is never honoured.
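
The allowlist rule described above can be sketched as follows. This is a minimal illustration, not the Search API's actual code: `HostAllowlistSketch` and `ResolveHost` are hypothetical names, and the case-insensitive comparison is an assumption.

```csharp
#nullable enable
using System;

// Illustrative sketch only: HostAllowlistSketch and ResolveHost are hypothetical
// names, not part of the Search API codebase.
public static class HostAllowlistSketch
{
    /// <summary>
    /// Returns the forwarded host when it appears in the allowlist,
    /// otherwise the canonical host.
    /// </summary>
    public static string ResolveHost(string? forwardedHost, string[] allowedCustomHosts, string canonicalHost)
    {
        // No header at all: nothing to honour.
        if (string.IsNullOrWhiteSpace(forwardedHost)) return canonicalHost;

        var candidate = forwardedHost.Trim();
        foreach (var allowed in allowedCustomHosts)
        {
            // Assumption: hostnames are compared case-insensitively.
            if (string.Equals(allowed, candidate, StringComparison.OrdinalIgnoreCase))
                return candidate;
        }

        // Unknown hosts are ignored, as is every host when the allowlist is empty.
        return canonicalHost;
    }
}
```

With an empty `allowedCustomHosts` array the loop never runs, so the canonical host is always returned, which mirrors the default behaviour described above.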

All responses include `Access-Control-Allow-Origin: *`. The IIIF specification requires open CORS, and because the Search API is entirely read-only, allowing any origin is safe without restriction.
106 changes: 106 additions & 0 deletions instructions/alternative-paths.md
@@ -0,0 +1,106 @@
## Path Rewrites

By default, the paths generated for the Search API are:

* `/search/v2/{**id}?q={term}`
* `/search/v1/{**id}?q={term}`
* `/autocomplete/v2/{**id}?q={term}`
* `/autocomplete/v1/{**id}?q={term}`
* `/annotations/lines/v1/{n}/{**id}`
* `/annotations/words/v1/{n}/{**id}`
* `/text-augmented/v3/{**id}`
* `/proxy/image?uri={uri}`
* `/text/v1/{**id}`
* `/pdf/v1/{**id}`
* `/identified/figures/{**id}`

These are rendered onto the generated Manifest as `{protocol}://{host}/{above-path}`.

### Canonical Paths

This is the "as is" processing.

Currently we have a `SearchBaseUrl` (say `https://search.default`). When generating a Manifest this is used to construct every `id`: each of the paths above is appended to `SearchBaseUrl`.

The important thing is that `{**id}` is _always_ the job id. It can contain any number of slashes and is replaced in its entirety.

### Requirement

We need some degree of control over which paths are rendered in responses. To do so we will support `X-Forwarded-Host` and `X-Forwarded-Path`.

We need a way to translate incoming requests so that they are reflected in the outgoing response, without adding any sort of rule-based _stuff_.

### Solution

One solution to this is `X-Forwarded-Proto` and `X-Forwarded-Host` (de facto standard HTTP headers) plus `X-Forwarded-Path` (non-standard). These will all be added by the proxy (e.g. CloudFront) if there are rewrite rules in place.

* `X-Forwarded-Proto` - configured via standard middleware. Ensures that the `HttpContext` has the appropriate protocol.
* `X-Forwarded-Host` - will be used for the host if it is part of a known whitelist. Added by proxy.
* `X-Forwarded-Path` - will be used if it is accompanied by a whitelisted `X-Forwarded-Host`. Added by proxy.
* This isn't perfect but adds a degree of safety: you can only rewrite path + host if we expect the host.
* If we want to rewrite a path for the canonical host, that host would need to be whitelisted, which feels like a safe trade-off.

### Examples

The examples below work through the requirements. Assume we're requesting the text-augmented adjunct; each row shows the resulting `id` for the `/autocomplete/v1` path. For all of these examples:
* Canonical hostname is `search.default`
* JobId is `2/cc/123`
* The actual http request that hits the search API is `https://search.default/text-augmented/v3/2/cc/123`

| Incoming (maybe via proxy) | X-Forwarded-Host | X-Forwarded-Path | Autocomplete `id` | Notes |
| ------------------------------------------------- | ---------------- | -------------------------- | ----------------------------------------------- | -------------------------------------------------------------------------------------------------------------- |
| https://search.default/text-augmented/v3/2/cc/123 | | | https://search.default/autocomplete/v1/2/cc/123 | Default, no proxy |
| https://unknown.host/text-augmented/v3/2/cc/123 | unknown.host | | https://search.default/autocomplete/v1/2/cc/123 | x-forwarded-host but unknown |
| https://unknown.host/text-augmented/v3/2/cc/123 | | text-augmented/v3/2/cc/123 | https://search.default/autocomplete/v1/2/cc/123 | x-forwarded-path but no accompanying x-forwarded-host |
| https://unknown.host/text-augmented/v3/2/cc/123 | unknown.host | text-augmented/v3/2/cc/123 | https://search.default/autocomplete/v1/2/cc/123 | x-forwarded-path but accompanying x-forwarded-host is unknown |
| https://known.host/text-augmented/v3/2/cc/123 | known.host | | https://known.host/autocomplete/v1/2/cc/123 | x-forwarded-host is whitelisted |
| https://known.host/text-augmented/v3/cc/123 | known.host | text-augmented/v3/cc/123 | https://known.host/autocomplete/v1/cc/123 | x-forwarded-host is whitelisted and x-forwarded-path is set (crucially it is NOT the `id`) |
| https://known.host/text-augmented/v3/cc/123 | known.host | | https://known.host/autocomplete/v1/2/cc/123 | x-forwarded-host is whitelisted. x-forwarded-path not set so `id` is used. This would be a misconfigured proxy |

> [!NOTE]
> Some points to note from the above:
> * The above outlines how the `id` path is constructed for the autocomplete path on the generated Manifest, but the same process would apply for any generated `id`.
> * The `X-Forwarded-Path` may contain a query parameter (e.g. for search results); this should be removed from ids.

#### Implementation

The rough implementation would be to use the `X-Forwarded-Path` to determine the `{**id}` element to use in generated paths.

To do so (assuming `X-Forwarded-Path` is provided and valid) we remove the current route root (the route minus `{**id}`) from the start of the `X-Forwarded-Path`; what remains is the usable `id` for path generation.
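
A minimal sketch of that prefix-stripping step is shown below. It is not the project's code: `ForwardedPathSketch` and `ExtractEffectiveId` are hypothetical names, and falling back to the route id when the forwarded path is missing or does not start with the route root is an assumption.

```csharp
#nullable enable
using System;

// Hypothetical helper, not the project's actual code: names and fallback
// behaviour are assumptions made for illustration.
public static class ForwardedPathSketch
{
    /// <summary>
    /// Returns the id to use in generated URLs, or the route id when the
    /// forwarded path is missing or does not start with the route prefix.
    /// </summary>
    public static string ExtractEffectiveId(string? forwardedPath, string routePrefix, string routeId)
    {
        if (string.IsNullOrWhiteSpace(forwardedPath)) return routeId;

        // Drop any query string, e.g. "?q=term" on search requests.
        var queryStart = forwardedPath.IndexOf('?');
        var path = queryStart >= 0 ? forwardedPath[..queryStart] : forwardedPath;

        // Compare without leading slashes so "/search/v1/x" and "search/v1/x" behave alike.
        path = path.TrimStart('/');
        var prefix = routePrefix.TrimStart('/');

        if (!path.StartsWith(prefix, StringComparison.OrdinalIgnoreCase)) return routeId;

        // Whatever follows the route root is the id, slashes included.
        var id = path[prefix.Length..];
        return id.Length == 0 ? routeId : id;
    }
}
```

For the sixth row of the table, `ExtractEffectiveId("text-augmented/v3/cc/123", "text-augmented/v3/", "2/cc/123")` yields `cc/123`. The caller is expected to invoke this only after the forwarded host has passed the whitelist check.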

### Implementation

All forwarded-header logic is centralised in `EndpointHelpers.Resolve` (`Features/EndpointHelpers.cs`), which reads `X-Forwarded-Host` and `X-Forwarded-Path` once and returns a `ResolvedRequest` record:

```csharp
internal record ResolvedRequest(string EffectiveId, string SelfUrl, string BaseUrl);
```

* `EffectiveId` — the job id to use in generated URLs (extracted from `X-Forwarded-Path` when the host is whitelisted; otherwise the original route id).
* `SelfUrl` — absolute URL for the current endpoint, already incorporating the effective id and optional query term.
* `BaseUrl` — scheme + authority only; used by `TextAugmentedEndpoints` as the base for all cross-endpoint service URLs.

Every endpoint calls `Resolve` once:

```csharp
var resolved = EndpointHelpers.Resolve(options.Value, ctx, "search/v1/", id, q);
// resolved.SelfUrl → passed to the handler as the response @id
// resolved.BaseUrl → used by TextAugmented to build service descriptor URLs
// resolved.EffectiveId → passed to TextAugmentedRequest as UrlId (see below)
```

`X-Forwarded-Proto` is handled separately by `ForwardedHeadersMiddleware` (configured in `ServiceCollectionExtensions.ConfigureForwardedHeaders`), which sets `Request.Scheme`. Trusted sources are restricted via `KnownNetworks` / `KnownProxies` config keys.
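
As a sketch of the configuration this implies: judging by `ConfigureForwardedHeaders` in this change, both keys are read from the configuration root as comma-separated strings. The addresses below are placeholder values chosen for illustration, not real deployment settings.

```json
{
  "KnownNetworks": "10.0.0.0/8,172.16.0.0/12",
  "KnownProxies": "192.0.2.10"
}
```

If both keys are omitted, forwarded headers are accepted from all sources and a warning is logged at startup.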

#### TextAugmented specifics

`TextAugmentedHandler` builds cross-endpoint URLs using both a storage id (to load artefacts) and a URL id (to generate service descriptors). These differ when `X-Forwarded-Path` rewrites the id. `TextAugmentedRequest` carries both:

```csharp
record TextAugmentedRequest(string Id, string SelfUrl, string SearchBaseUrl, string? UrlId = null)
```

The handler uses `UrlId ?? Id` for URL generation and `Id` for all storage lookups. The endpoint passes `resolved.EffectiveId` as `UrlId`.
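
The split between the two ids can be illustrated with a small sketch. The record below is a stand-in that copies the shape shown above; `TextAugmentedRequestSketch`, `ServiceUrlSketch`, `StorageKey` and `AutocompleteUrl` are hypothetical names, and the real handler may build its URLs differently.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for the request record above; the "Sketch" names
// and the helper methods are illustrative assumptions.
public record TextAugmentedRequestSketch(string Id, string SelfUrl, string SearchBaseUrl, string? UrlId = null);

public static class ServiceUrlSketch
{
    // Storage lookups always use the original job id.
    public static string StorageKey(TextAugmentedRequestSketch request) => request.Id;

    // Generated URLs prefer the URL id and fall back to the job id.
    public static string AutocompleteUrl(TextAugmentedRequestSketch request) =>
        $"{request.SearchBaseUrl.TrimEnd('/')}/autocomplete/v1/{request.UrlId ?? request.Id}";
}
```

With `Id` set to `2/cc/123`, `SearchBaseUrl` set to `https://known.host` and `UrlId` set to `cc/123`, `AutocompleteUrl` produces `https://known.host/autocomplete/v1/cc/123` while `StorageKey` still returns `2/cc/123`. Leaving `UrlId` as `null` gives `https://known.host/autocomplete/v1/2/cc/123`, matching the last row of the examples table.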

#### Allowlist configuration

Permitted custom hosts are configured under `TextServices:AllowedCustomHosts` in `appsettings.json`. An empty array (the default) means both `X-Forwarded-Host` and `X-Forwarded-Path` are always ignored.
8 changes: 8 additions & 0 deletions src/TextServices.Search.Api/Configuration/SearchApiOptions.cs
@@ -57,4 +57,12 @@ public class SearchApiOptions
/// </para>
/// </summary>
public bool AllowFileImageProxy { get; set; } = false;

/// <summary>
/// Hostnames accepted from the <c>X-Forwarded-Host</c> request header (e.g. custom CloudFront distributions).
/// When a request carries <c>X-Forwarded-Host</c> and its value matches an entry here, that host
/// replaces the canonical host in generated IIIF URLs. An empty array (the default) means
/// <c>X-Forwarded-Host</c> is never honoured.
/// </summary>
public string[] AllowedCustomHosts { get; set; } = [];
}
@@ -1,4 +1,7 @@
-using TextServices.Pdf;
+using Microsoft.AspNetCore.HttpOverrides;
+using Serilog;
+using Serilog.Extensions.Logging;
+using TextServices.Pdf;
using TextServices.Search.Api.Features.Pdf;

namespace TextServices.Search.Api.Configuration;
@@ -20,4 +23,56 @@ public static IServiceCollection AddPdfServices(this IServiceCollection services

return services;
}

/// <summary>
/// Configures the host to use <c>X-Forwarded-Proto</c> to set <c>HttpContext.Request.Scheme</c>.
/// The "KnownNetworks" (CIDR ranges) and/or "KnownProxies" (individual IPs) configuration keys restrict which
/// upstream sources are trusted. If neither is present, headers are accepted from all sources (with a warning).
/// </summary>
public static IServiceCollection ConfigureForwardedHeaders(this IServiceCollection services,
IConfiguration configuration)
{
var knownNetworks = configuration.GetValue<string>("KnownNetworks");
var knownProxies = configuration.GetValue<string>("KnownProxies");

var logger = new SerilogLoggerFactory(Log.Logger).CreateLogger("ServiceCollection");

return services.Configure<ForwardedHeadersOptions>(opts =>
{
opts.ForwardedHeaders = ForwardedHeaders.XForwardedProto;

var networks = knownNetworks.SplitSeparatedString(",").ToList();
var proxies = knownProxies.SplitSeparatedString(",").ToList();

if (networks.Count == 0 && proxies.Count == 0)
{
logger.LogWarning("Forwarded header values accepted from all networks and proxies");
opts.KnownIPNetworks.Clear();
opts.KnownProxies.Clear();
}
else
{
if (networks.Count > 0)
{
logger.LogInformation("Forwarded header values accepted from networks: {KnownNetworks}", knownNetworks);
foreach (var network in networks)
{
opts.KnownIPNetworks.Add(System.Net.IPNetwork.Parse(network));
}
}

if (proxies.Count > 0)
{
logger.LogInformation("Forwarded header values accepted from proxies: {KnownProxies}", knownProxies);
foreach (var proxy in proxies)
{
opts.KnownProxies.Add(System.Net.IPAddress.Parse(proxy));
}
}
}
});
}

private static IEnumerable<string> SplitSeparatedString(this string? str, string separator)
=> str?.Trim().Split(separator, StringSplitOptions.RemoveEmptyEntries) ?? Enumerable.Empty<string>();
}
@@ -14,8 +14,8 @@ internal static IEndpointRouteBuilder MapAnnotationEndpoints(this IEndpointRoute
IOptions<SearchApiOptions> options,
HttpContext ctx) =>
{
-var selfUrl = EndpointHelpers.BuildSelfUrl(options.Value, ctx, $"annotations/manifest/v1/{id}", null);
-var result = await sender.Send(new ManifestAnnotationsRequest(id, selfUrl));
+var resolved = EndpointHelpers.Resolve(options.Value, ctx, "annotations/manifest/v1/", id);
+var result = await sender.Send(new ManifestAnnotationsRequest(id, resolved.SelfUrl));
if (result == null) return Results.NotFound();
return Results.Json(result, contentType: "application/ld+json");
});
@@ -26,8 +26,8 @@ internal static IEndpointRouteBuilder MapAnnotationEndpoints(this IEndpointRoute
IOptions<SearchApiOptions> options,
HttpContext ctx) =>
{
-var selfUrl = EndpointHelpers.BuildSelfUrl(options.Value, ctx, $"annotations/lines/v1/{n}/{id}", null);
-var result = await sender.Send(new LineAnnotationsRequest(id, n, selfUrl));
+var resolved = EndpointHelpers.Resolve(options.Value, ctx, $"annotations/lines/v1/{n}/", id);
+var result = await sender.Send(new LineAnnotationsRequest(id, n, resolved.SelfUrl));
if (result == null) return Results.NotFound();
return Results.Json(result, contentType: "application/ld+json");
});
@@ -38,8 +38,8 @@ internal static IEndpointRouteBuilder MapAnnotationEndpoints(this IEndpointRoute
IOptions<SearchApiOptions> options,
HttpContext ctx) =>
{
-var selfUrl = EndpointHelpers.BuildSelfUrl(options.Value, ctx, $"annotations/words/v1/{n}/{id}", null);
-var result = await sender.Send(new WordAnnotationsRequest(id, n, selfUrl));
+var resolved = EndpointHelpers.Resolve(options.Value, ctx, $"annotations/words/v1/{n}/", id);
+var result = await sender.Send(new WordAnnotationsRequest(id, n, resolved.SelfUrl));
if (result == null) return Results.NotFound();
return Results.Json(result, contentType: "application/ld+json");
});
@@ -14,8 +14,8 @@ internal static IEndpointRouteBuilder MapAutocompleteEndpoints(this IEndpointRou
IOptions<SearchApiOptions> options,
HttpContext ctx) =>
{
-var selfUrl = EndpointHelpers.BuildSelfUrl(options.Value, ctx, $"autocomplete/v1/{id}", q);
-var result = await sender.Send(new AutocompleteRequest(id, q ?? string.Empty, selfUrl));
+var resolved = EndpointHelpers.Resolve(options.Value, ctx, "autocomplete/v1/", id, q);
+var result = await sender.Send(new AutocompleteRequest(id, q ?? string.Empty, resolved.SelfUrl));
if (result == null) return Results.NotFound();
return Results.Json(result, contentType: "application/ld+json");
});
@@ -26,8 +26,8 @@ internal static IEndpointRouteBuilder MapAutocompleteEndpoints(this IEndpointRou
IOptions<SearchApiOptions> options,
HttpContext ctx) =>
{
-var selfUrl = EndpointHelpers.BuildSelfUrl(options.Value, ctx, $"autocomplete/v2/{id}", q);
-var result = await sender.Send(new AutocompleteV2Request(id, q ?? string.Empty, selfUrl));
+var resolved = EndpointHelpers.Resolve(options.Value, ctx, "autocomplete/v2/", id, q);
+var result = await sender.Send(new AutocompleteV2Request(id, q ?? string.Empty, resolved.SelfUrl));
if (result == null) return Results.NotFound();
return Results.Json(result, contentType: "application/ld+json");
});