Skip trailing null bytes when searching for EOF marker#586
Open
55728 wants to merge 1 commit into
Open
Conversation
71e7c77 to
7292cd0
Compare
Contributor
Author
|
Rebased onto latest main. @yob Would appreciate a review when you have a chance! |
Some PDF generators (e.g. Atos/Fonet) append thousands of null bytes after %%EOF. Since null bytes are not line terminators, they prevented the EOF marker from being found within the trailing byte scan window. Scan backwards from the end of the file in TRAILING_BYTECOUNT-sized chunks, using String#rindex to find the last non-null byte. This locates the effective end of the PDF without an O(N) per-byte seek+read loop, which would be a DoS vector on a file padded with hundreds of MB of nulls. Treat an all-null file (or any file where the skip walks all the way to offset 0) as MalformedPDFError. Fixes yob#585
7292cd0 to
6f8ce8d
Compare
Contributor
Author
|
Hi @yob 👋 Following up on this one — just wanted to surface this again in case it slipped through. I've reworked the approach since the original submission: the backward scan now operates in TRAILING_BYTECOUNT-sized chunks instead of byte-by-byte, which caps syscalls at ceil(N / 5000) even for pathologically padded files. I've also rewritten the PR description with an edge case table and design rationale. The behavioural change is strictly additive — compliant PDFs see only one extra rindex call per parse. Happy to adjust anything 😊 — there's a "Why this shape" section in the description covering the tradeoffs if you're curious. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
PDFs generated by some tools (notably Atos/Fonet) append thousands of null bytes (
\x00) after the%%EOFmarker. The currentfind_first_xref_offsetimplementation inPDF::Reader::Bufferreads the lastTRAILING_BYTECOUNT(5000) bytes of the file and looks for%%EOFin that window.Because null bytes are not line terminators (the PDF spec defines those as
\rand\n), they are never trimmed bysplit(/[\n\r]+/). When the run of trailing nulls exceeds 5000 bytes, the%%EOFmarker is pushed entirely outside the search window and parsing fails with:Fixes #585.
Approach
Before reading the trailing window, scan backwards from the end of the file to locate the effective end — the byte just past the last non-null byte — and read the
TRAILING_BYTECOUNT-sized window ending there.To keep this cheap for files padded with megabytes of nulls, the backward scan happens in
TRAILING_BYTECOUNT-sized chunks rather than byte-by-byte or by slurping the whole tail into memory:Once the effective end is known, the rest of the method is unchanged: read
TRAILING_BYTECOUNTbytes ending there, split on line terminators, search for%%EOF.Why this shape
{ @io.seek(-1, …); … }would do O(N) seek+read syscalls for N trailing nulls; pathological inputs could easily produce hundreds of thousands of syscalls. Chunking caps it atceil(N / TRAILING_BYTECOUNT).%%EOF". We can't know upfront how much padding there is; reading the entire tail could be arbitrarily large. Capping chunk size keeps memory resident at ≤5000 bytes.TRAILING_BYTECOUNTas the chunk size. Already battle-tested as a sensible IO read size in this method; avoids introducing a new tuning knob.@io.read(end_pos - start_pos)instead of always readingTRAILING_BYTECOUNT. When the entire file is shorter thanTRAILING_BYTECOUNT(or shrinks to empty after stripping nulls), this naturally falls through to the existing "empty data" branch.Behaviour for edge cases
rindexmatches the last byte, loop exits immediatelyTRAILING_BYTECOUNTrindexcallTRAILING_BYTECOUNTend_pos = 0, data is empty, raisesMalformedPDFError(preserved behaviour)check_size_is_non_zero(unchanged)Behaviour change summary
The only observable change for compliant PDFs is one extra
String#rindexcall per parse — negligible. The behavioural change is strictly recovery of files that previously raisedMalformedPDFError; no compliant input that previously parsed will start failing.Tests
Two new specs in
spec/reader/buffer_spec.rb:trailing_null_bytes.pdf—extended_eof.pdfwith 6000 trailing null bytes appended (deliberately >TRAILING_BYTECOUNTto exercise the multi-chunk path). Assertsfind_first_xref_offsetstill returns the correct offset (145) — i.e. matches whatextended_eof.pdfitself returns.StringIOof\x00. AssertsMalformedPDFErroris still raised, confirming we don't silently succeed on garbage input.The fixture is also registered in
spec/integrity.ymlwith its md5 and byte count, consistent with how other fixtures are tracked.Full spec suite passes locally.
Notes for reviewers
@io.seek(-TRAILING_BYTECOUNT, IO::SEEK_END) rescue @io.seek(0). The rescue is no longer needed becausecheck_size_is_non_zero(called on the first line) already performs@io.seek(-1, IO::SEEK_END)— any IO that can't seek from the end will fail there first, so we never reach the new code with a non-seekable IO.data.nil?todata.nil? || data.empty?because the new code can legitimately computeend_pos == start_pos == 0(all-nulls case) and ask for a zero-byte read, which returns""rather thannilon some IO implementations.