Skip trailing null bytes when searching for EOF marker by 55728 · Pull Request #586 · yob/pdf-reader

55728 · 2026-04-03T14:51:20Z

Problem

PDFs generated by some tools (notably Atos/Fonet) append thousands of null bytes (\x00) after the %%EOF marker. The current find_first_xref_offset implementation in PDF::Reader::Buffer reads the last TRAILING_BYTECOUNT (5000) bytes of the file and looks for %%EOF in that window.

Because null bytes are not line terminators (the PDF spec defines those as \r and \n), they are never trimmed by split(/[\n\r]+/). When the run of trailing nulls exceeds 5000 bytes, the %%EOF marker is pushed entirely outside the search window and parsing fails with:

PDF::Reader::MalformedPDFError: PDF does not contain EOF marker

Fixes #585.

Approach

Before reading the trailing window, scan backwards from the end of the file to locate the effective end — the byte just past the last non-null byte — and read the TRAILING_BYTECOUNT-sized window ending there.

To keep this cheap for files padded with megabytes of nulls, the backward scan happens in TRAILING_BYTECOUNT-sized chunks rather than byte-by-byte or by slurping the whole tail into memory:

@io.seek(0, IO::SEEK_END)
end_pos = @io.pos

while end_pos > 0
  chunk_size = [TRAILING_BYTECOUNT, end_pos].min
  @io.seek(end_pos - chunk_size)
  chunk = @io.read(chunk_size)
  if chunk && (idx = chunk.rindex(/[^\x00]/))
    end_pos = end_pos - chunk_size + idx + 1
    break
  end
  end_pos -= chunk_size
end

start_pos = [end_pos - TRAILING_BYTECOUNT, 0].max
@io.seek(start_pos)
data = @io.read(end_pos - start_pos)

Once the effective end is known, the rest of the method is unchanged: read TRAILING_BYTECOUNT bytes ending there, split on line terminators, search for %%EOF.

Why this shape

Chunk-based scan, not a byte-by-byte walk. A naive loop { @io.seek(-1, …); … } would do O(N) seek+read syscalls for N trailing nulls; pathological inputs could easily produce hundreds of thousands of syscalls. Chunking caps it at ceil(N / TRAILING_BYTECOUNT).
Chunk-based scan, not "read everything past %%EOF". We can't know upfront how much padding there is; reading the entire tail could be arbitrarily large. Capping chunk size keeps memory resident at ≤5000 bytes.
Reuse TRAILING_BYTECOUNT as the chunk size. Already battle-tested as a sensible IO read size in this method; avoids introducing a new tuning knob.
@io.read(end_pos - start_pos) instead of always reading TRAILING_BYTECOUNT. When the entire file is shorter than TRAILING_BYTECOUNT (or shrinks to empty after stripping nulls), this naturally falls through to the existing "empty data" branch.

Behaviour for edge cases

Input	Result
Compliant PDF, no trailing nulls	Identical to before — single chunk read, `rindex` matches the last byte, loop exits immediately
Trailing nulls fit within `TRAILING_BYTECOUNT`	Identical to before in observable output; one extra `rindex` call
Trailing nulls > `TRAILING_BYTECOUNT`	Previously failed, now succeeds
File is entirely null bytes	Loop exhausts to `end_pos = 0`, data is empty, raises `MalformedPDFError` (preserved behaviour)
Empty file	Caught earlier by `check_size_is_non_zero` (unchanged)

Behaviour change summary

The only observable change for compliant PDFs is one extra String#rindex call per parse — negligible. The behavioural change is strictly recovery of files that previously raised MalformedPDFError; no compliant input that previously parsed will start failing.

Tests

Two new specs in spec/reader/buffer_spec.rb:

trailing_null_bytes.pdf — extended_eof.pdf with 6000 trailing null bytes appended (deliberately > TRAILING_BYTECOUNT to exercise the multi-chunk path). Asserts find_first_xref_offset still returns the correct offset (145) — i.e. matches what extended_eof.pdf itself returns.
All-nulls file — a 10,000-byte StringIO of \x00. Asserts MalformedPDFError is still raised, confirming we don't silently succeed on garbage input.

The fixture is also registered in spec/integrity.yml with its md5 and byte count, consistent with how other fixtures are tracked.

Full spec suite passes locally.

Notes for reviewers

The original code had @io.seek(-TRAILING_BYTECOUNT, IO::SEEK_END) rescue @io.seek(0). The rescue is no longer needed because check_size_is_non_zero (called on the first line) already performs @io.seek(-1, IO::SEEK_END) — any IO that can't seek from the end will fail there first, so we never reach the new code with a non-seekable IO.
The empty-data guard was widened from data.nil? to data.nil? || data.empty? because the new code can legitimately compute end_pos == start_pos == 0 (all-nulls case) and ask for a zero-byte read, which returns "" rather than nil on some IO implementations.

55728 · 2026-04-24T15:38:14Z

Rebased onto latest main. @yob Would appreciate a review when you have a chance!

Some PDF generators (e.g. Atos/Fonet) append thousands of null bytes after %%EOF. Since null bytes are not line terminators, they prevented the EOF marker from being found within the trailing byte scan window. Scan backwards from the end of the file in TRAILING_BYTECOUNT-sized chunks, using String#rindex to find the last non-null byte. This locates the effective end of the PDF without an O(N) per-byte seek+read loop, which would be a DoS vector on a file padded with hundreds of MB of nulls. Treat an all-null file (or any file where the skip walks all the way to offset 0) as MalformedPDFError. Fixes yob#585

55728 · 2026-05-24T09:41:48Z

Hi @yob 👋 Following up on this one — just wanted to surface this again in case it slipped through.

I've reworked the approach since the original submission: the backward scan now operates in TRAILING_BYTECOUNT-sized chunks instead of byte-by-byte, which caps syscalls at ceil(N / 5000) even for pathologically padded files. I've also rewritten the PR description with an edge case table and design rationale.

The behavioural change is strictly additive — compliant PDFs see only one extra rindex call per parse.

Happy to adjust anything 😊 — there's a "Why this shape" section in the description covering the tradeoffs if you're curious.

55728 mentioned this pull request Apr 3, 2026

PDF with trailing null characters after EOF raises MalformedPDFError #585

Open

55728 force-pushed the fix-trailing-null-after-eof branch from 71e7c77 to 7292cd0 Compare April 24, 2026 15:36

55728 force-pushed the fix-trailing-null-after-eof branch from 7292cd0 to 6f8ce8d Compare May 24, 2026 09:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Skip trailing null bytes when searching for EOF marker#586

Skip trailing null bytes when searching for EOF marker#586
55728 wants to merge 1 commit into
yob:mainfrom
55728:fix-trailing-null-after-eof

55728 commented Apr 3, 2026 •

edited

Loading

Uh oh!

55728 commented Apr 24, 2026

Uh oh!

55728 commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

55728 commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Approach

Why this shape

Behaviour for edge cases

Behaviour change summary

Tests

Notes for reviewers

Uh oh!

55728 commented Apr 24, 2026

Uh oh!

55728 commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

55728 commented Apr 3, 2026 •

edited

Loading