
Optimize performance for large schema processing#2774

Merged
koxudaxi merged 4 commits into main from perf/optimize-large-schema-processing on Dec 24, 2025

Conversation

@koxudaxi (Owner) commented Dec 23, 2025

Fixes: #2286

Summary by CodeRabbit

  • Refactor
    • Improved internal performance for model generation: faster dependency resolution, reduced work when resolving reference names, and more efficient handling of type combinations and naming collisions. These changes speed up processing for large or complex schemas and reduce latency during code generation.


@coderabbitai (bot) commented Dec 23, 2025

Warning

Rate limit exceeded

@koxudaxi has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 8 minutes and 17 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 4866f76 and a688ee9.

📒 Files selected for processing (2)
  • src/datamodel_code_generator/reference.py
  • src/datamodel_code_generator/types.py
📝 Walkthrough


Replace repeated linear scans with dictionary-based lookups and a small cache across imports, parser, reference resolution, and data-type initialization to reduce O(n) work; observable behavior and public APIs are unchanged.

Changes

Cohort / File(s) — Summary

  • Reference caching & parent lookup (src/datamodel_code_generator/reference.py): Add _reference_names_cache, _get_reference_names(), _invalidate_reference_names_cache(), and _find_parent_reference(); use the cache and parent lookup across add/get/update/delete and naming flows; invalidate the cache when references change to avoid repeated O(n) scans.
  • Parser indexing & list assignment (src/datamodel_code_generator/parser/base.py): Replace membership/list scans in sort_data_models() with a path_to_index dict for O(1) lookups; use path_to_index.keys() for unsorted names; replace insert+remove with direct index assignment in _replace_model_in_list().
  • Imports lookup optimization (src/datamodel_code_generator/imports.py): In remove_unused(), replace the generator-based next(...) scan with a precomputed reverse_lookup dict and O(1) .get() access to find reference_path.
  • DataType single-pass optional detection (src/datamodel_code_generator/types.py): Consolidate Any vs. optional detection in DataType.__init__ using two boolean flags and a deferred post-loop adjustment to is_optional and data_types, avoiding in-loop mutation and early breaks.
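The two-flag approach described for DataType.__init__ can be sketched in isolation. Note that `simplify_optional_any` and its dict-based inputs are hypothetical illustrations, not the project's real code:

```python
from typing import Any

ANY = "Any"  # stand-in for the project's Any data-type marker


def simplify_optional_any(data_types: list[dict[str, Any]]) -> tuple[bool, list[dict[str, Any]]]:
    """Single pass: detect an optional Any alongside non-Any members,
    deferring the mutation until after the loop."""
    has_optional_any = False
    has_non_any = False
    for dt in data_types:
        if dt["type"] == ANY and dt.get("is_optional"):
            has_optional_any = True
        elif dt["type"] != ANY:
            has_non_any = True
        if has_optional_any and has_non_any:
            break  # both conditions found; no need to keep scanning

    if has_optional_any and has_non_any:
        # Promote optionality to the parent and drop the redundant Any.
        return True, [dt for dt in data_types if dt["type"] != ANY]
    return False, data_types
```

The key point is that the loop only sets flags; the list is filtered once, afterwards, instead of being mutated while iterated.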

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

breaking-change-analyzed

Poem

🐇
I hopped through paths and names today,
Cached my carrots so lookups play.
No slow next(...) nibbling by,
Dicts and parent hops let logic fly.
Faster thumps — a tidy, spry display!

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed: the title 'Optimize performance for large schema processing' directly aligns with the main objective of optimizing the datamodel code generator for large schemas.
  • Linked Issues check — ✅ Passed: the PR addresses issue #2286 with multiple performance optimizations, including caching reference names, replacing linear lookups with O(1) dictionary operations, and optimizing dependency resolution, which directly tackle the reported slowness in _get_unique_name.
  • Out of Scope Changes check — ✅ Passed: all changes are performance optimizations across four modules (imports.py, base.py, reference.py, types.py) that accelerate large schema processing without altering observable behavior.
  • Docstring Coverage — ✅ Passed: docstring coverage is 84.21%, above the required threshold of 80.00%.




@coderabbitai (bot) left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/datamodel_code_generator/types.py (1)

386-396: Cache invalidation bug in swap_with() method.

The unresolved_types property depends on both self.reference and self.data_types. While replace_reference() invalidates the cache when reference changes (line 394), the swap_with() method (line 415) modifies parent.data_types without invalidating parent._unresolved_types_cache. This causes stale cache data if unresolved_types was accessed before swap_with() is called on a child.

Add parent._unresolved_types_cache = None after line 415 to ensure cache coherency when parent's data_types is modified.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1966581 and 3fbeaf9.

📒 Files selected for processing (4)
  • src/datamodel_code_generator/imports.py
  • src/datamodel_code_generator/parser/base.py
  • src/datamodel_code_generator/reference.py
  • src/datamodel_code_generator/types.py
🧰 Additional context used
🧬 Code graph analysis (3)
src/datamodel_code_generator/types.py (2)
src/datamodel_code_generator/reference.py (1)
  • reference (76-78)
src/datamodel_code_generator/model/base.py (2)
  • path (811-813)
  • all_data_types (778-782)
src/datamodel_code_generator/reference.py (3)
src/datamodel_code_generator/model/base.py (2)
  • name (730-732)
  • path (811-813)
src/datamodel_code_generator/__main__.py (1)
  • get (117-119)
src/datamodel_code_generator/__init__.py (1)
  • NamingStrategy (286-298)
src/datamodel_code_generator/parser/base.py (3)
src/datamodel_code_generator/model/base.py (3)
  • reference_classes (722-727)
  • path (811-813)
  • all_data_types (778-782)
src/datamodel_code_generator/model/pydantic_v2/root_model.py (1)
  • RootModel (14-45)
src/datamodel_code_generator/types.py (1)
  • all_data_types (433-439)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: 3.12 on Windows
  • GitHub Check: 3.10 on Windows
  • GitHub Check: py312-isort5 on Ubuntu
  • GitHub Check: 3.10 on macOS
  • GitHub Check: 3.11 on Ubuntu
  • GitHub Check: py312-black23 on Ubuntu
  • GitHub Check: 3.14 on Windows
  • GitHub Check: 3.11 on Windows
  • GitHub Check: 3.13 on Windows
  • GitHub Check: benchmarks
  • GitHub Check: Analyze (python)
🔇 Additional comments (13)
src/datamodel_code_generator/imports.py (1)

188-194: LGTM! Good performance optimization.

The reverse lookup dictionary correctly replaces the O(n) linear scan per import with a single O(n) dict construction followed by O(1) lookups. This is a meaningful improvement for remove_unused when processing many imports.
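The pattern this comment describes can be shown with made-up data (`imports`, `find_path_scan`, and `find_path_dict` are illustrative names, not the module's actual API):

```python
imports = {"a/path": "ModelA", "b/path": "ModelB", "c/path": "ModelC"}


# Before: an O(n) generator scan for every name looked up.
def find_path_scan(name: str):
    return next((path for path, alias in imports.items() if alias == name), None)


# After: one O(n) pass builds the reverse map; each lookup is then O(1).
reverse_lookup = {alias: path for path, alias in imports.items()}


def find_path_dict(name: str):
    return reverse_lookup.get(name)
```

When many names are resolved against the same mapping, paying the dict construction cost once beats repeating the scan per lookup.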

src/datamodel_code_generator/parser/base.py (4)

400-400: LGTM! Correct set operation optimization.

Using sorted_data_models.keys() directly in the set difference is efficient since dict views support set operations in Python 3.
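A minimal illustration of the view-based set difference (variable contents are invented):

```python
sorted_data_models = {"A": 0, "B": 1}
all_names = {"A", "B", "C", "D"}

# dict.keys() returns a set-like view, so it participates in set
# operations directly -- no intermediate set(...) copy is needed.
unsorted = all_names - sorted_data_models.keys()
```

`unsorted` here contains the names not yet placed in `sorted_data_models`.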


422-436: LGTM! Significant performance improvement for dependency resolution.

Building path_to_index once and using O(1) dictionary lookups instead of repeated O(n) list.index() calls reduces the complexity from O(n²) to O(n) in this loop. The walrus operator usage is correct and idiomatic.
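The path_to_index idea can be sketched as follows (`models` and `deps` are made-up data; the real loop resolves model dependencies):

```python
models = ["m/a", "m/b", "m/c", "m/d"]

# Before: models.index(path) inside the loop is O(n) per call -> O(n^2) total.
# After: build the index map once, then each lookup is O(1).
path_to_index = {path: i for i, path in enumerate(models)}

deps = ["m/c", "m/x", "m/a"]  # "m/x" is not a known model
resolved = []
for path in deps:
    # Walrus operator: test for membership and bind the index in one step.
    if (idx := path_to_index.get(path)) is not None:
        resolved.append(idx)
```

The `is not None` guard matters: index 0 is falsy, so a bare truthiness check would silently drop the first model.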


454-456: LGTM! Consistent use of path_to_index.

Using path_to_index.keys() to construct unsorted_data_model_names is consistent with the earlier optimization and avoids maintaining a separate data structure.


1086-1088: LGTM! Cleaner in-place replacement.

Direct index assignment avoids the element shifting overhead of insert+remove. The comment's complexity analysis is approximate (both are O(n)), but the new approach is more efficient in practice by avoiding list element shifts.
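A toy illustration of the two strategies (the real _replace_model_in_list already knows the target index, so the `index()` call below is only for the demo):

```python
models = ["A", "B", "C"]

# insert + remove shifts every trailing element twice:
#   models.insert(1, "B2"); models.remove("B")
# Direct index assignment touches only the one slot:
idx = models.index("B")
models[idx] = "B2"
```

Both approaches yield the same list; the assignment just avoids the element shifting that `insert` and `remove` each perform.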

src/datamodel_code_generator/types.py (3)

341-346: LGTM! Cache field setup is correct.

Adding _unresolved_types_cache to _exclude_fields ensures it won't be included in dict/model_dump operations, and the field declaration with None default is appropriate for lazy initialization.


375-384: LGTM! Correct lazy caching implementation.

The cache-check-compute-store pattern is correctly implemented. Using frozenset ensures the cached value is immutable, preventing accidental modifications.
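The cache-check-compute-store pattern, sketched with a hypothetical `Node` class (the real field lives on DataType, and the real "unresolved" test is more involved):

```python
class Node:
    """Minimal sketch of lazy caching with invalidation on mutation."""

    def __init__(self, data_types):
        self.data_types = data_types
        self._unresolved_types_cache = None

    @property
    def unresolved_types(self):
        if self._unresolved_types_cache is None:
            # frozenset makes the cached value immutable, so callers
            # cannot mutate what later reads will observe.
            self._unresolved_types_cache = frozenset(
                t for t in self.data_types if t.startswith("#/")
            )
        return self._unresolved_types_cache

    def replace_data_types(self, data_types):
        self.data_types = data_types
        self._unresolved_types_cache = None  # invalidate on mutation
```

This also illustrates the hazard flagged in the out-of-diff comment above: any code path that mutates `data_types` without resetting the cache would serve stale results.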


529-543: LGTM! Correct single-pass optimization.

The logic correctly identifies when an optional Any type coexists with non-Any types, and appropriately:

  1. Promotes is_optional to the parent type
  2. Filters out the redundant Any entries

The early exit when both conditions are met avoids unnecessary iterations.

src/datamodel_code_generator/reference.py (5)

573-584: LGTM! Clean cache implementation.

The cache getter lazily populates from self.references.values(), and the invalidation method simply sets it to None. This is a straightforward and effective caching pattern.


807-818: LGTM! Clean helper extraction.

The _find_parent_reference method correctly traverses up the path hierarchy and consolidates parent lookup logic that was previously duplicated across _check_parent_scope_option and _apply_full_path_naming.
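A sketch of the upward path traversal (`find_parent_reference` and the flat `references` dict are simplifications of the real ModelResolver state):

```python
def find_parent_reference(path: str, references: dict):
    """Walk up the '/'-separated path until a registered parent is found."""
    while "/" in path:
        path = path.rsplit("/", 1)[0]  # strip the last path segment
        parent = references.get(path)
        if parent is not None:
            return parent
    return None  # reached the root without a match
```

Centralizing this walk in one helper is what lets both naming flows share it instead of duplicating the loop.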


804-804: LGTM! Cache invalidation appears complete.

All methods that modify self.references or reference names properly call _invalidate_reference_names_cache(). The invalidation points cover:

  • add_ref: new reference creation
  • _rename_external_ref_with_same_name: reference renaming
  • add: reference creation/update
  • delete: reference removal

Also applies to: 879-879, 945-945, 955-955, 967-967


820-825: LGTM! Clean refactor using the new helper.

Both _check_parent_scope_option and _apply_full_path_naming now use _find_parent_reference, reducing code duplication while maintaining the same behavior.

Also applies to: 827-841


1013-1032: LGTM! This is the core optimization addressing issue #2286.

Using _get_reference_names() avoids the O(n) set creation on every _get_unique_name call. For large schemas with many references, this significantly reduces the overhead of duplicate-name checks by caching the set of existing reference names.
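The caching scheme can be sketched end to end with a hypothetical `ModelResolverSketch` (the real class carries far more state, and the `_N` suffixing rule here is only illustrative):

```python
class ModelResolverSketch:
    """Sketch: cache the set of reference names so _get_unique_name
    no longer rebuilds it on every call."""

    def __init__(self):
        self.references = {}  # path -> assigned name
        self._reference_names_cache = None

    def _get_reference_names(self):
        if self._reference_names_cache is None:
            self._reference_names_cache = set(self.references.values())
        return self._reference_names_cache

    def _invalidate_reference_names_cache(self):
        self._reference_names_cache = None

    def add(self, path, name):
        unique = self._get_unique_name(name)
        self.references[path] = unique
        self._invalidate_reference_names_cache()  # names changed
        return unique

    def _get_unique_name(self, name):
        names = self._get_reference_names()  # cached between mutations
        if name not in names:
            return name
        i = 1
        while f"{name}_{i}" in names:
            i += 1
        return f"{name}_{i}"
```

Between mutations the name set is built once and reused, which is exactly the O(n)-per-call cost that issue #2286 reported being paid repeatedly.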

@codspeed-hq (bot) commented Dec 23, 2025

CodSpeed Performance Report

Merging #2774 will not alter performance

Comparing perf/optimize-large-schema-processing (a688ee9) with main (1966581)

Summary

✅ 73 untouched
⏩ 10 skipped¹

Footnotes

  1. 10 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.

@codecov (bot) commented Dec 23, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.48%. Comparing base (1966581) to head (a688ee9).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2774   +/-   ##
=======================================
  Coverage   99.47%   99.48%           
=======================================
  Files          87       87           
  Lines       12882    12897   +15     
  Branches     1546     1543    -3     
=======================================
+ Hits        12815    12831   +16     
  Misses         35       35           
+ Partials       32       31    -1     
Flag: unittests — coverage 99.48% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@coderabbitai (bot) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/datamodel_code_generator/types.py (1)

521-536: LGTM—Edge-case optimization is sound.

The two-flag detection logic correctly identifies when Any with Optional can be removed (when non-Any types exist), and the early exit on line 532 avoids unnecessary iteration once both conditions are found.

One minor note: the comment on line 521 says "Single-pass optimization," but the implementation actually uses multiple passes—one to detect (525–532), one to filter (536), and another to set parents (538–541). This doesn't affect correctness, just comment accuracy.

Optional: Clarify the comment
-        # Single-pass optimization: detect ANY+optional and non-ANY types together
+        # Edge-case optimization: detect ANY+optional and non-ANY types together
         # This is a rare edge case optimization - pragma: no cover
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b4fa082 and 4866f76.

📒 Files selected for processing (1)
  • src/datamodel_code_generator/types.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: benchmarks
  • GitHub Check: py312-isort5 on Ubuntu
  • GitHub Check: py312-isort6 on Ubuntu
  • GitHub Check: Analyze (python)
  • GitHub Check: 3.14 on macOS
  • GitHub Check: 3.13 on Windows
  • GitHub Check: 3.12 on macOS
  • GitHub Check: 3.11 on Windows
  • GitHub Check: 3.10 on Windows
  • GitHub Check: 3.11 on macOS
  • GitHub Check: 3.14 on Windows
  • GitHub Check: 3.12 on Windows

@koxudaxi force-pushed the perf/optimize-large-schema-processing branch from 4866f76 to 6b0072f on December 24, 2025 00:09, with the commit message:
The optimization block for optional Any + non-Any types is a rare edge case
that is difficult to trigger via e2e tests. Mark it with pragma: no cover.
@koxudaxi force-pushed the perf/optimize-large-schema-processing branch 2 times, most recently from ff54605 to fe31a45 on December 24, 2025 00:20
@koxudaxi force-pushed the perf/optimize-large-schema-processing branch from fe31a45 to a688ee9 on December 24, 2025 00:35
@koxudaxi merged commit 3ffee15 into main on Dec 24, 2025
37 checks passed
@koxudaxi deleted the perf/optimize-large-schema-processing branch on December 24, 2025 01:19
@github-actions (bot) commented:

Breaking Change Analysis

Result: No breaking changes detected

Reasoning: This PR contains only internal performance optimizations with no breaking changes:

  1. imports.py: Changed O(n) linear scan to O(1) dictionary lookup for reference paths - same behavior, just faster.

  2. parser/base.py: Replaced set(sorted_data_models) with sorted_data_models.keys() (functionally equivalent), replaced list.index() with dictionary lookups, and changed insert+remove to direct index assignment in _replace_model_in_list - all produce identical results.

  3. reference.py: Added private caching mechanism (_reference_names_cache) for reference names to avoid O(n) set creation on every _get_unique_name call. Extracted _find_parent_reference helper method. All changes are internal implementation details with proper cache invalidation on reference mutations.

  4. types.py: Optimized DataType.__init__ to use single-pass detection with early exit instead of a nested loop. The logic for handling optional Any types alongside non-Any types remains functionally identical.

All changes are purely algorithmic optimizations (O(n) → O(1) lookups, reduced iterations) that preserve the exact same external behavior. No public APIs, CLI options, generated code output, template interfaces, or error handling were modified.


This analysis was performed by Claude Code Action


Development

Successfully merging this pull request may close these issues.

_get_unique_name take really long with large swagger definition file

1 participant