Skip to content

Add lenient UTF-8 handling for protobuf message decoding#333

Open
tonyalaribe wants to merge 1 commit into
masterfrom
claude/fix-wiretap-logs-FO1dn
Open

Add lenient UTF-8 handling for protobuf message decoding#333
tonyalaribe wants to merge 1 commit into
masterfrom
claude/fix-wiretap-logs-FO1dn

Conversation

@tonyalaribe

Copy link
Copy Markdown
Contributor

Summary

This PR adds lenient UTF-8 decoding for protobuf messages to gracefully handle malformed UTF-8 sequences in incoming data. Instead of failing on invalid UTF-8 bytes, the decoder now sanitizes the wire format by replacing invalid bytes with '?' characters and retries decoding.

Key Changes

  • Added decodeMessageLenient function that attempts standard decoding first, then falls back to sanitized decoding on UTF-8 errors
  • Implemented sanitizeProtobufUtf8 to walk the protobuf wire format and sanitize length-delimited fields
  • Added sanitizeTaggedValue to recursively handle nested protobuf messages while sanitizing string/bytes fields
  • Added sanitizeUtf8Bytes utility to replace invalid UTF-8 sequences with '?' characters
  • Updated processBatchPipeline to use decodeMessageLenient instead of decodeMessage for all incoming messages
  • Added necessary imports from Data.ProtoLens.Encoding modules

Implementation Details

The sanitization process:

  1. Parses the protobuf wire format into tagged values
  2. For length-delimited fields, attempts to parse them as nested messages
  3. If successful, recursively sanitizes the nested fields
  4. If not a message, treats the content as a string/bytes field and replaces invalid UTF-8 bytes with '?'
  5. Rebuilds the wire format with sanitized content

This approach ensures that valid protobuf structures are preserved while gracefully handling encoding issues in string and bytes fields.

https://claude.ai/code/session_01MmF2VFF2zfknCJNUzgJACT

proto-lens uses strict UTF-8 decoding for protobuf string fields, which
causes ~1200 message decode failures per minute when OTLP producers send
non-UTF-8 bytes in string fields (e.g. AnyValue.string_value, Metric.name,
Span.name). The entire message is dropped even though the protobuf structure
is valid.

Add decodeMessageLenient which first tries normal decoding, then on UTF-8
errors walks the protobuf wire format to replace invalid UTF-8 bytes with
'?' in length-delimited fields and retries. This preserves ~77% of
previously-dropped messages.

https://claude.ai/code/session_01MmF2VFF2zfknCJNUzgJACT
@claude

claude Bot commented Mar 8, 2026

Copy link
Copy Markdown

Code Review

Good approach to handling malformed UTF-8 in protobuf ingestion. A few issues worth addressing:


Bug: Overly broad error matching

decodeMessageLenient checks only for "Cannot decode byte", but the existing recordProtoError (line 651) correctly uses both substrings to identify UTF-8 errors:

| "Cannot decode byte" `L.isInfixOf` err && "Invalid UTF-8 stream" `L.isInfixOf` err

The new code will trigger sanitization for non-UTF-8 "Cannot decode byte" errors (e.g., varint/fixed-width field errors), silently corrupting or dropping data that should surface as real parse failures. Fix:

Left err
  | "Cannot decode byte" `L.isInfixOf` err
  , "Invalid UTF-8 stream" `L.isInfixOf` err -> decodeMessage (sanitizeProtobufUtf8 bs)
  | otherwise -> Left err

Correctness concern: ambiguous wire-level field classification

sanitizeTaggedValue treats any Lengthy content that happens to parse as a non-empty FieldSet as a nested protobuf message and recursively walks it. But at the wire level, Lengthy fields are indistinguishable between bytes, string, and message — any arbitrary binary payload can accidentally parse as a valid FieldSet. This means:

  • Raw byte fields (serialised IPs, timestamps, hashes, custom binary blobs) could be incorrectly "sanitized" as nested messages.
  • The sanitized payload is re-serialised from parsed fields, not original bytes, potentially mutating data silently even when no UTF-8 error is present.

Consider only sanitizing at the top level (skipping recursive descent), or documenting this as an explicit best-effort trade-off.


Redundant guard

| BS.null content = TaggedValue tag (Lengthy content)

Unnecessary. When content is empty, runParser parseFieldSet content returns Right [], which fails the (_ : _) pattern, and sanitizeUtf8Bytes "" returns "". Adds noise without effect.


Style: leverage available extensions

LambdaCase and point-free style (both standard in this codebase) can tighten this considerably:

sanitizeProtobufUtf8 :: ByteString -> ByteString
sanitizeProtobufUtf8 bs =
  either (const bs) (runBuilder . buildFieldSet . map sanitizeTaggedValue) (runParser parseFieldSet bs)

sanitizeTaggedValue :: TaggedValue -> TaggedValue
sanitizeTaggedValue = \case
  TaggedValue tag (Lengthy content) ->
    TaggedValue tag . Lengthy $
      case runParser parseFieldSet content of
        Right subFields@(_ : _) -> runBuilder . buildFieldSet $ map sanitizeTaggedValue subFields
        _ -> encodeUtf8 $ decodeUtf8With (\_ _ -> Just '?') content
  other -> other

Inlining sanitizeUtf8Bytes at its one call site removes a top-level definition.


No tests

None of the three new pure functions (decodeMessageLenient, sanitizeProtobufUtf8, sanitizeTaggedValue) have test coverage. Given the correctness sensitivity (silent data mutation vs. surfaced errors), at minimum a round-trip property test with a known invalid-UTF-8 payload and a valid nested-message payload would be valuable.


Summary: The error-string broadening is a real bug to fix before merge. The field-type ambiguity is a fundamental protobuf limitation worth documenting. The style points align with the project's preference for concise code using available extensions.

@blacksmith-sh

blacksmith-sh Bot commented Mar 8, 2026

Copy link
Copy Markdown
Contributor

Found 1 test failure on Blacksmith runners:

Failure

Test View Logs
Pages.ErrorPatterns, Error Pattern Pipeline, 5e. Concurrent spikes: two patterns spikin
g simultaneously both get issues
View Logs

Fix in Cursor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants