
feat: LRU cache eviction for PDF page decoders #282

Draft
dpantaleoni wants to merge 1 commit into docling-project:main from dpantaleoni:lru-cache-page-decoders


Description

This PR adds a Least Recently Used (LRU) cache eviction policy for PDF page decoders, preventing unbounded memory growth when many large PDF documents are processed in succession. It resolves the bug described in this comment on docling-serve issue #366.

Fix

Replacing the unbounded accumulation of page decoders with an LRU-bounded cache of at most 16 live page decoders eliminated the steady RSS growth observed when converting many large PDFs in succession.
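For readers unfamiliar with the pattern, here is a minimal sketch of an LRU-bounded decoder cache. It is not the code in this PR. It reuses the field names from the diff under review (`page_decoders`, `page_access_order`, `max_cached_pages`), while the `PageDecoder` struct, the `PageDecoderCache` class, and its methods are hypothetical stand-ins invented for illustration.

```cpp
#include <cstddef>
#include <list>
#include <map>
#include <memory>

// Hypothetical stand-in for the real page decoder type.
struct PageDecoder {
    explicit PageDecoder(int page) : page_number(page) {}
    int page_number;
};
using page_decoder_ptr = std::shared_ptr<PageDecoder>;

// Illustrative cache: a map for lookup plus a list ordered from
// most recently used (front) to least recently used (back).
class PageDecoderCache {
public:
    explicit PageDecoderCache(std::size_t max_pages)
        : max_cached_pages(max_pages) {}

    // Return the decoder for `page`, creating it on a miss and evicting
    // least recently used entries while the cache is over capacity.
    page_decoder_ptr get(int page) {
        auto it = page_decoders.find(page);
        if (it != page_decoders.end()) {
            touch(page);
            return it->second;
        }
        auto decoder = std::make_shared<PageDecoder>(page);
        page_decoders[page] = decoder;
        page_access_order.push_front(page);
        while (page_decoders.size() > max_cached_pages &&
               !page_access_order.empty()) {
            int victim = page_access_order.back();
            page_access_order.pop_back();
            page_decoders.erase(victim);
        }
        return decoder;
    }

    bool contains(int page) const { return page_decoders.count(page) > 0; }
    std::size_t size() const { return page_decoders.size(); }

private:
    // Move `page` to the front of the access order. std::list::remove is
    // O(n), which is acceptable for a cache of a few dozen entries.
    void touch(int page) {
        page_access_order.remove(page);
        page_access_order.push_front(page);
    }

    std::map<int, page_decoder_ptr> page_decoders;
    std::list<int> page_access_order;
    std::size_t max_cached_pages;
};
```

Because the sketch hands out `std::shared_ptr`, a caller that still holds a decoder keeps it alive after eviction; whether the real decoders have the same ownership semantics is an assumption.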

Before:
[Image: before-res_chronological_rss_lines]

After:
[Image: after-res_chronological_rss_lines]

Testing

The script below was run within a docling dev environment with the locally edited docling-parse dependency installed. A folder of diverse, real-world PDFs ranging from ~1 MB to ~20 MB was used. The test was run on a MacBook Pro (M4, 36 GB).
check_mem.py


github-actions Bot commented Jun 25, 2026


DCO Check Passed

Thanks @dpantaleoni, all your commits are properly signed off. 🎉


mergify Bot commented Jun 25, 2026


Merge Protections

🟢 Merge protection satisfied — ready to merge.


🟢 Enforce conventional commit

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

// New: Persistent page decoders for typed API
std::map<int, page_decoder_ptr> page_decoders;
std::list<int> page_access_order;
size_t max_cached_pages = 16;

Member

this needs to be a configurable parameter for sure.

Author

After making max_cached_pages configurable and setting its default to -1 (no limit, following existing conventions), I believe both docling and docling-serve will need a few changes to let the user set max_cached_pages to a limit. Would this be okay?

Member

@dpantaleoni we need to set sensible default values. We do not necessarily need to propagate these variables to docling/docling-serve.
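The sentinel convention discussed in this thread (a value of -1 meaning "no limit") could be expressed roughly as follows. This is only a sketch: `DecoderCacheConfig` and `over_limit` are hypothetical names, and the default shown is the -1 proposed by the author, not a value the maintainers have agreed on.

```cpp
#include <cstddef>

// Hypothetical configuration struct; the names are illustrative only.
struct DecoderCacheConfig {
    // -1 means "no limit"; any value >= 0 is a hard cap on live decoders.
    int max_cached_pages = -1;
};

// True when a cache currently holding `live_decoders` entries
// exceeds the configured cap and must evict.
inline bool over_limit(const DecoderCacheConfig& config,
                       std::size_t live_decoders) {
    if (config.max_cached_pages < 0) {
        return false;  // unbounded: never evict
    }
    return live_decoders > static_cast<std::size_t>(config.max_cached_pages);
}
```

With a signed sentinel, the comparison has to check for the negative case before casting, since converting -1 to `std::size_t` would otherwise yield a very large cap by accident rather than by intent.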

Signed-off-by: Dominik Pantaleoni <dominikpantaleoniibm@Dominiks-MacBook-Pro.local>
dpantaleoni force-pushed the lru-cache-page-decoders branch from 1dc778f to 42a3031 on June 26, 2026 at 17:04