Add lenient UTF-8 handling for protobuf message decoding by tonyalaribe · Pull Request #333 · monoscope-tech/monoscope

tonyalaribe · 2026-03-08T22:44:45Z

Summary

This PR adds lenient UTF-8 decoding for protobuf messages to gracefully handle malformed UTF-8 sequences in incoming data. Instead of failing on invalid UTF-8 bytes, the decoder now sanitizes the wire format by replacing invalid bytes with '?' characters and retries decoding.

Key Changes

Added decodeMessageLenient function that attempts standard decoding first, then falls back to sanitized decoding on UTF-8 errors
Implemented sanitizeProtobufUtf8 to walk the protobuf wire format and sanitize length-delimited fields
Added sanitizeTaggedValue to recursively handle nested protobuf messages while sanitizing string/bytes fields
Added sanitizeUtf8Bytes utility to replace invalid UTF-8 sequences with '?' characters
Updated processBatchPipeline to use decodeMessageLenient instead of decodeMessage for all incoming messages
Added necessary imports from Data.ProtoLens.Encoding modules

Implementation Details

The sanitization process:

Parses the protobuf wire format into tagged values
For length-delimited fields, attempts to parse them as nested messages
If successful, recursively sanitizes the nested fields
If not a message, treats the content as a string/bytes field and replaces invalid UTF-8 bytes with '?'
Rebuilds the wire format with sanitized content

This approach ensures that valid protobuf structures are preserved while gracefully handling encoding issues in string and bytes fields.

https://claude.ai/code/session_01MmF2VFF2zfknCJNUzgJACT

proto-lens uses strict UTF-8 decoding for protobuf string fields, which causes ~1200 message decode failures per minute when OTLP producers send non-UTF-8 bytes in string fields (e.g. AnyValue.string_value, Metric.name, Span.name). The entire message is dropped even though the protobuf structure is valid. Add decodeMessageLenient which first tries normal decoding, then on UTF-8 errors walks the protobuf wire format to replace invalid UTF-8 bytes with '?' in length-delimited fields and retries. This preserves ~77% of previously-dropped messages. https://claude.ai/code/session_01MmF2VFF2zfknCJNUzgJACT

claude · 2026-03-08T22:47:03Z

Code Review

Good approach to handling malformed UTF-8 in protobuf ingestion. A few issues worth addressing:

Bug: Overly broad error matching

decodeMessageLenient checks only for "Cannot decode byte", but the existing recordProtoError (line 651) correctly uses both substrings to identify UTF-8 errors:

| "Cannot decode byte" `L.isInfixOf` err && "Invalid UTF-8 stream" `L.isInfixOf` err

The new code will trigger sanitization for non-UTF-8 "Cannot decode byte" errors (e.g., varint/fixed-width field errors), silently corrupting or dropping data that should surface as real parse failures. Fix:

Left err
  | "Cannot decode byte" `L.isInfixOf` err
  , "Invalid UTF-8 stream" `L.isInfixOf` err -> decodeMessage (sanitizeProtobufUtf8 bs)
  | otherwise -> Left err

Correctness concern: ambiguous wire-level field classification

sanitizeTaggedValue treats any Lengthy content that happens to parse as a non-empty FieldSet as a nested protobuf message and recursively walks it. But at the wire level, Lengthy fields are indistinguishable between bytes, string, and message — any arbitrary binary payload can accidentally parse as a valid FieldSet. This means:

Raw byte fields (serialised IPs, timestamps, hashes, custom binary blobs) could be incorrectly "sanitized" as nested messages.
The sanitized payload is re-serialised from parsed fields, not original bytes, potentially mutating data silently even when no UTF-8 error is present.

Consider only sanitizing at the top level (skipping recursive descent), or documenting this as an explicit best-effort trade-off.

Redundant guard

| BS.null content = TaggedValue tag (Lengthy content)

Unnecessary. When content is empty, runParser parseFieldSet content returns Right [], which fails the (_ : _) pattern, and sanitizeUtf8Bytes "" returns "". Adds noise without effect.

Style: leverage available extensions

LambdaCase and point-free style (both standard in this codebase) can tighten this considerably:

sanitizeProtobufUtf8 :: ByteString -> ByteString
sanitizeProtobufUtf8 bs =
  either (const bs) (runBuilder . buildFieldSet . map sanitizeTaggedValue) (runParser parseFieldSet bs)

sanitizeTaggedValue :: TaggedValue -> TaggedValue
sanitizeTaggedValue = \case
  TaggedValue tag (Lengthy content) ->
    TaggedValue tag . Lengthy $
      case runParser parseFieldSet content of
        Right subFields@(_ : _) -> runBuilder . buildFieldSet $ map sanitizeTaggedValue subFields
        _ -> encodeUtf8 $ decodeUtf8With (\_ _ -> Just '?') content
  other -> other

Inlining sanitizeUtf8Bytes at its one call site removes a top-level definition.

No tests

None of the three new pure functions (decodeMessageLenient, sanitizeProtobufUtf8, sanitizeTaggedValue) have test coverage. Given the correctness sensitivity (silent data mutation vs. surfaced errors), at minimum a round-trip property test with a known invalid-UTF-8 payload and a valid nested-message payload would be valuable.

Summary: The error-string broadening is a real bug to fix before merge. The field-type ambiguity is a fundamental protobuf limitation worth documenting. The style points align with the project's preference for concise code using available extensions.

blacksmith-sh · 2026-03-08T22:53:18Z

Found 1 test failure on Blacksmith runners:

Failure

Test	View Logs
`Pages.ErrorPatterns, Error Pattern Pipeline, 5e. Concurrent spikes: two patterns spikin` `g simultaneously both get issues`	View Logs

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add lenient UTF-8 handling for protobuf message decoding#333

Add lenient UTF-8 handling for protobuf message decoding#333
tonyalaribe wants to merge 1 commit into
masterfrom
claude/fix-wiretap-logs-FO1dn

tonyalaribe commented Mar 8, 2026

Uh oh!

claude Bot commented Mar 8, 2026

Uh oh!

blacksmith-sh Bot commented Mar 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tonyalaribe commented Mar 8, 2026

Summary

Key Changes

Implementation Details

Uh oh!

claude Bot commented Mar 8, 2026

Code Review

Bug: Overly broad error matching

Correctness concern: ambiguous wire-level field classification

Redundant guard

Style: leverage available extensions

No tests

Uh oh!

blacksmith-sh Bot commented Mar 8, 2026

Failure

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants