Always allow access to the http body and smarter encoding handling #58

@ahenket

Use case: a CSV from a Dutch national body comes back with a Content-Type header declaring UTF-8, but the actual body is UTF-16 LE (BOM FF FE).

Reproduction (based on eXist-db 6)

declare namespace http              = "http://expath.org/ns/http-client";

http:send-request(
    <http:request method="GET" href="https://publicaties.rvig.nl/media/13307/download">
        <http:header name="Accept" value="text/csv"/>
        <http:header name="Cache-Control" value="no-cache"/>
        <http:header name="Max-Forwards" value="1"/>
    </http:request>
)[2]

The response comes back with a Content-Type header that declares UTF-8, but because the actual content is UTF-16 I now get: "Failed to parse server's response: An invalid XML character (Unicode: 0x0) was found in the element content of the document."
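
The 0x0 in that message is consistent with UTF-16 LE bytes being read as UTF-8: in UTF-16 LE every ASCII character is followed by a 00 byte, which then surfaces as a NUL character. A small Python sketch of the effect, using made-up sample data rather than the actual download, and not claiming to mirror what the HTTP client does internally:

```python
# Sample data is invented for illustration; it is not the actual CSV.
sample = "id,name\n"
raw = b"\xff\xfe" + sample.encode("utf-16-le")  # BOM FF FE + UTF-16 LE payload

# Decode as UTF-8, as the Content-Type header suggests. errors="replace"
# keeps the invalid BOM bytes from raising, so the NULs become visible.
wrong = raw.decode("utf-8", errors="replace")
nul_count = wrong.count("\x00")  # one NUL per ASCII character

# Decode with the encoding the BOM actually announces.
right = raw.decode("utf-16")  # the "utf-16" codec consumes the BOM
```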

I can override the server-provided Content-Type using override-media-type="text/csv; charset=utf-16", but this requires me to know the encoding beforehand. I have reported the mismatched Content-Type to the responsible party, but I doubt whether or when that will have any effect.

I would like to get to a place where I can always access the contents of a send-request() so I can work out some fallback scheme.

Ideally:

  • Always allow me access to the body, as binary if all else fails, to prevent hard, uncatchable errors, e.g. about hex 0
  • Process body based on BOM if present before relying on Content-Type encoding
  • Process body based on Content-Type encoding if no BOM present
  • Process body based on UTF-8 if no BOM or Content-Type encoding present
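
The order above can be sketched in a few lines of Python. This is a minimal illustration of the proposed precedence (BOM, then Content-Type charset, then UTF-8, then raw bytes), not the EXPath HTTP client's or eXist-db's implementation; `decode_body` and `charset_from_content_type` are hypothetical helper names.

```python
import codecs

# Known BOMs. The UTF-32 entries come first because the UTF-32 LE BOM
# (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    for param in content_type.split(";")[1:]:
        name, _, value = param.partition("=")
        value = value.strip().strip('"')
        if name.strip().lower() == "charset" and value:
            return value
    return None


def decode_body(raw, content_type=None):
    """Return (text, encoding), or (raw bytes, "binary") if decoding fails."""
    payload, encoding = raw, None
    # 1. A BOM wins over whatever the header claims.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            payload, encoding = raw[len(bom):], name
            break
    # 2. No BOM: use the Content-Type charset; 3. otherwise assume UTF-8.
    if encoding is None:
        encoding = charset_from_content_type(content_type) or "utf-8"
    try:
        return payload.decode(encoding), encoding
    except (UnicodeDecodeError, LookupError):
        # 4. Last resort: hand back the untouched bytes instead of failing.
        return raw, "binary"
```

With a body like the one in this issue, `decode_body(b"\xff\xfe" + "a;b\n".encode("utf-16-le"), "text/csv; charset=utf-8")` returns the text decoded as UTF-16 LE even though the header says UTF-8, and a body that cannot be decoded at all is still returned as bytes.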
