feat: LRU cache eviction for PDF page decoders #282
// New: Persistent page decoders for typed API
std::map<int, page_decoder_ptr> page_decoders;
std::list<int> page_access_order;
size_t max_cached_pages = 16;
This definitely needs to be a configurable parameter.
After making max_cached_pages configurable and setting its default to -1 (no limit, following the existing conventions), I believe both docling and docling-serve will need a few changes so that users can set max_cached_pages to a limit. Would this be okay?
@dpantaleoni we need to set sensible default values. We do not necessarily need to propagate these variables to docling/docling-serve.
Signed-off-by: Dominik Pantaleoni <dominikpantaleoniibm@Dominiks-MacBook-Pro.local>
Force-pushed from 1dc778f to 42a3031.
Description
This PR adds a least-recently-used (LRU) cache eviction policy for PDF page decoders, preventing unbounded memory growth when processing many large PDF documents in succession. This resolves the bug described in this comment on docling-serve issue #366.
Fix
Replacing the unbounded accumulation of page decoders with an LRU-bounded cache of at most 16 live page decoders eliminated the steady RSS growth observed when converting many large PDFs in succession.
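For readers unfamiliar with the pattern, the following is a minimal sketch of an LRU-bounded cache built from the same two containers that appear in the reviewed diff (a std::map plus a std::list recording access order). It is not the code from this PR: the page_decoder struct, the page_decoder_cache class, and its get, contains, and size methods are hypothetical names introduced only for illustration, and the real decoder type and eviction details may differ.

```cpp
#include <cstddef>
#include <list>
#include <map>
#include <memory>

// Hypothetical stand-in for the real page decoder type (assumption).
struct page_decoder {
  explicit page_decoder(int page) : page_number(page) {}
  int page_number;
};
using page_decoder_ptr = std::shared_ptr<page_decoder>;

// Illustrative LRU cache; class and method names are assumptions.
class page_decoder_cache {
public:
  explicit page_decoder_cache(std::size_t max_pages)
      : max_cached_pages(max_pages) {}

  // Return the decoder for `page`, creating it on a miss.
  // Either way, `page` becomes the most recently used entry.
  page_decoder_ptr get(int page) {
    auto it = page_decoders.find(page);
    if (it != page_decoders.end()) {
      touch(page);
      return it->second;
    }
    auto decoder = std::make_shared<page_decoder>(page);
    page_decoders.emplace(page, decoder);
    page_access_order.push_front(page);
    evict_if_needed();
    return decoder;
  }

  bool contains(int page) const { return page_decoders.count(page) != 0; }
  std::size_t size() const { return page_decoders.size(); }

private:
  // Move `page` to the front of the access order.
  // std::list::remove is O(n), which is acceptable for a small cache.
  void touch(int page) {
    page_access_order.remove(page);
    page_access_order.push_front(page);
  }

  // Drop least recently used entries until the cache fits the limit.
  void evict_if_needed() {
    while (page_decoders.size() > max_cached_pages) {
      int victim = page_access_order.back();
      page_access_order.pop_back();
      page_decoders.erase(victim);
    }
  }

  std::map<int, page_decoder_ptr> page_decoders;
  std::list<int> page_access_order;  // front = most recent, back = least recent
  std::size_t max_cached_pages;
};
```

With a limit of 2, requesting pages 1, 2, 1, 3 in that order would evict page 2, because page 1 was touched more recently. Taking the limit as a constructor argument is one way to address the review request for a configurable parameter; a sentinel meaning "no limit" is deliberately left out of this sketch.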
Before:

After:

Testing
The script below was run in a docling dev environment with the locally edited docling-parse dependency installed. A folder of diverse, real-world PDFs ranging from ~1 MB to ~20 MB was used. Testing was done on a MacBook Pro (M4, 36 GB RAM).
check_mem.py