Skip to content

Skip trailing null bytes when searching for EOF marker#586

Open
55728 wants to merge 1 commit into
yob:mainfrom
55728:fix-trailing-null-after-eof
Open

Skip trailing null bytes when searching for EOF marker#586
55728 wants to merge 1 commit into
yob:mainfrom
55728:fix-trailing-null-after-eof

Conversation

@55728
Copy link
Copy Markdown
Contributor

@55728 55728 commented Apr 3, 2026

Problem

PDFs generated by some tools (notably Atos/Fonet) append thousands of null bytes (\x00) after the %%EOF marker. The current find_first_xref_offset implementation in PDF::Reader::Buffer reads the last TRAILING_BYTECOUNT (5000) bytes of the file and looks for %%EOF in that window.

Because null bytes are not line terminators (the PDF spec defines those as \r and \n), they are never trimmed by split(/[\n\r]+/). When the run of trailing nulls exceeds 5000 bytes, the %%EOF marker is pushed entirely outside the search window and parsing fails with:

PDF::Reader::MalformedPDFError: PDF does not contain EOF marker

Fixes #585.

Approach

Before reading the trailing window, scan backwards from the end of the file to locate the effective end — the byte just past the last non-null byte — and read the TRAILING_BYTECOUNT-sized window ending there.

To keep this cheap for files padded with megabytes of nulls, the backward scan happens in TRAILING_BYTECOUNT-sized chunks rather than byte-by-byte or by slurping the whole tail into memory:

@io.seek(0, IO::SEEK_END)
end_pos = @io.pos

while end_pos > 0
  chunk_size = [TRAILING_BYTECOUNT, end_pos].min
  @io.seek(end_pos - chunk_size)
  chunk = @io.read(chunk_size)
  if chunk && (idx = chunk.rindex(/[^\x00]/))
    end_pos = end_pos - chunk_size + idx + 1
    break
  end
  end_pos -= chunk_size
end

start_pos = [end_pos - TRAILING_BYTECOUNT, 0].max
@io.seek(start_pos)
data = @io.read(end_pos - start_pos)

Once the effective end is known, the rest of the method is unchanged: read TRAILING_BYTECOUNT bytes ending there, split on line terminators, search for %%EOF.

Why this shape

  • Chunk-based scan, not a byte-by-byte walk. A naive loop { @io.seek(-1, …); … } would do O(N) seek+read syscalls for N trailing nulls; pathological inputs could easily produce hundreds of thousands of syscalls. Chunking caps it at ceil(N / TRAILING_BYTECOUNT).
  • Chunk-based scan, not "read everything past %%EOF". We can't know upfront how much padding there is; reading the entire tail could be arbitrarily large. Capping chunk size keeps memory resident at ≤5000 bytes.
  • Reuse TRAILING_BYTECOUNT as the chunk size. Already battle-tested as a sensible IO read size in this method; avoids introducing a new tuning knob.
  • @io.read(end_pos - start_pos) instead of always reading TRAILING_BYTECOUNT. When the entire file is shorter than TRAILING_BYTECOUNT (or shrinks to empty after stripping nulls), this naturally falls through to the existing "empty data" branch.

Behaviour for edge cases

Input Result
Compliant PDF, no trailing nulls Identical to before — single chunk read, rindex matches the last byte, loop exits immediately
Trailing nulls fit within TRAILING_BYTECOUNT Identical to before in observable output; one extra rindex call
Trailing nulls > TRAILING_BYTECOUNT Previously failed, now succeeds
File is entirely null bytes Loop exhausts to end_pos = 0, data is empty, raises MalformedPDFError (preserved behaviour)
Empty file Caught earlier by check_size_is_non_zero (unchanged)

Behaviour change summary

The only observable change for compliant PDFs is one extra String#rindex call per parse — negligible. The behavioural change is strictly recovery of files that previously raised MalformedPDFError; no compliant input that previously parsed will start failing.

Tests

Two new specs in spec/reader/buffer_spec.rb:

  1. trailing_null_bytes.pdfextended_eof.pdf with 6000 trailing null bytes appended (deliberately > TRAILING_BYTECOUNT to exercise the multi-chunk path). Asserts find_first_xref_offset still returns the correct offset (145) — i.e. matches what extended_eof.pdf itself returns.
  2. All-nulls file — a 10,000-byte StringIO of \x00. Asserts MalformedPDFError is still raised, confirming we don't silently succeed on garbage input.

The fixture is also registered in spec/integrity.yml with its md5 and byte count, consistent with how other fixtures are tracked.

Full spec suite passes locally.

Notes for reviewers

  • The original code had @io.seek(-TRAILING_BYTECOUNT, IO::SEEK_END) rescue @io.seek(0). The rescue is no longer needed because check_size_is_non_zero (called on the first line) already performs @io.seek(-1, IO::SEEK_END) — any IO that can't seek from the end will fail there first, so we never reach the new code with a non-seekable IO.
  • The empty-data guard was widened from data.nil? to data.nil? || data.empty? because the new code can legitimately compute end_pos == start_pos == 0 (all-nulls case) and ask for a zero-byte read, which returns "" rather than nil on some IO implementations.

@55728
Copy link
Copy Markdown
Contributor Author

55728 commented Apr 24, 2026

Rebased onto latest main. @yob Would appreciate a review when you have a chance!

Some PDF generators (e.g. Atos/Fonet) append thousands of null
bytes after %%EOF. Since null bytes are not line terminators,
they prevented the EOF marker from being found within the
trailing byte scan window.

Scan backwards from the end of the file in TRAILING_BYTECOUNT-sized
chunks, using String#rindex to find the last non-null byte. This
locates the effective end of the PDF without an O(N) per-byte
seek+read loop, which would be a DoS vector on a file padded with
hundreds of MB of nulls.

Treat an all-null file (or any file where the skip walks all the
way to offset 0) as MalformedPDFError.

Fixes yob#585
@55728 55728 force-pushed the fix-trailing-null-after-eof branch from 7292cd0 to 6f8ce8d Compare May 24, 2026 09:19
@55728
Copy link
Copy Markdown
Contributor Author

55728 commented May 24, 2026

Hi @yob 👋 Following up on this one — just wanted to surface this again in case it slipped through.

I've reworked the approach since the original submission: the backward scan now operates in TRAILING_BYTECOUNT-sized chunks instead of byte-by-byte, which caps syscalls at ceil(N / 5000) even for pathologically padded files. I've also rewritten the PR description with an edge case table and design rationale.

The behavioural change is strictly additive — compliant PDFs see only one extra rindex call per parse.

Happy to adjust anything 😊 — there's a "Why this shape" section in the description covering the tradeoffs if you're curious.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PDF with trailing null characters after EOF raises MalformedPDFError

1 participant