Reduce segmenter memory usage further. #223

Merged
garretrieger merged 2 commits into w3c:main from garretrieger:frequency_filtering on May 13, 2026

@garretrieger
Contributor

  • Only store pair probabilities for codepoints that are in the font being processed. A pair containing a codepoint outside the input font will never be needed.
  • Add a periodic reset of the bigram -> probability cache to clear out entries that are no longer needed.

The filtering reduces the memory consumed by frequency data by keeping only what's relevant to the segmenting operation.
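To illustrate the filtering idea, here is a minimal Python sketch. It is not the code from this change: the function name `filter_pair_probabilities`, the dict-of-tuples layout, and the sample probabilities are all hypothetical, chosen only to show that a pair is kept when both of its codepoints are in the font.

```python
def filter_pair_probabilities(pair_probabilities, font_codepoints):
    """Keep only the pairs whose two codepoints are both in the font.

    Hypothetical layout: pair_probabilities maps (codepoint, codepoint)
    tuples to a probability.
    """
    font_codepoints = frozenset(font_codepoints)
    return {
        pair: probability
        for pair, probability in pair_probabilities.items()
        if pair[0] in font_codepoints and pair[1] in font_codepoints
    }


# Made-up frequency data, for illustration only.
pairs = {
    (0x41, 0x42): 0.20,      # both codepoints in the font: kept
    (0x41, 0x4E2D): 0.01,    # second codepoint not in the font: dropped
    (0x4E2D, 0x6587): 0.15,  # neither codepoint in the font: dropped
}
font = {0x41, 0x42, 0x43}
filtered = filter_pair_probabilities(pairs, font)
```

In this sketch only the `(0x41, 0x42)` entry survives, so the stored data scales with the font's codepoint set rather than with the full frequency dataset.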
Without a reset, the cache's memory usage grows without bound and can be significant for longer segmenter runs. This change resets the cache after each base segment is finished. The probability calculations are mostly scoped to the current base segment, so the reset has minimal impact on observed cache hit rates.
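The per-base-segment reset can be sketched as follows. This is an illustrative Python model under stated assumptions, not the segmenter's actual implementation: `BigramProbabilityCache`, `process_base_segments`, the placeholder probability function, and the assumption that each pair is looked up twice within its own base segment are all hypothetical.

```python
from itertools import combinations


class BigramProbabilityCache:
    """Hypothetical memoizing cache from a pair (bigram) to a probability."""

    def __init__(self, compute):
        self._compute = compute
        self._entries = {}
        self.hits = 0
        self.misses = 0

    def get(self, a, b):
        key = (a, b)
        if key in self._entries:
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        value = self._compute(a, b)
        self._entries[key] = value
        return value

    def reset(self):
        # Drop every entry; the hit/miss counters are kept for reporting.
        self._entries.clear()

    def __len__(self):
        return len(self._entries)


def process_base_segments(base_segments, cache):
    """Look up every pair within each base segment, then reset the cache.

    Returns the peak number of cached entries seen during the run.
    """
    peak = 0
    for segment in base_segments:
        # Assumption: lookups repeat within a base segment (two passes here),
        # which is where cache hits come from.
        for _ in range(2):
            for a, b in combinations(segment, 2):
                cache.get(a, b)
        peak = max(peak, len(cache))
        cache.reset()  # entries from a finished base segment are not reused
    return peak


cache = BigramProbabilityCache(lambda a, b: 0.5)  # placeholder probability
peak = process_base_segments([[1, 2, 3], [4, 5], [6, 7, 8, 9]], cache)
```

In this model the peak cache size is 6 entries (the largest base segment's pairs) instead of the 10 it would reach without a reset, while the hit count is unchanged because every repeated lookup falls inside a single base segment.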
@garretrieger garretrieger merged commit a058ca5 into w3c:main May 13, 2026
3 checks passed
@garretrieger garretrieger deleted the frequency_filtering branch May 13, 2026 23:21