
Optimize performance for large schema processing#2774

Merged
koxudaxi merged 4 commits into main from perf/optimize-large-schema-processing on Dec 24, 2025

Conversation

@koxudaxi (Owner) commented Dec 23, 2025

Fixes: #2286

Summary by CodeRabbit

  • Refactor
    • Improved internal performance for model generation: faster dependency resolution, reduced work when resolving reference names, and more efficient handling of type combinations and naming collisions. These changes speed up processing for large or complex schemas and reduce latency during code generation.


@coderabbitai (bot) commented Dec 23, 2025

Warning

Rate limit exceeded

@koxudaxi has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 8 minutes and 17 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 4866f76 and a688ee9.

📒 Files selected for processing (2)
  • src/datamodel_code_generator/reference.py
  • src/datamodel_code_generator/types.py
📝 Walkthrough


Replace repeated linear scans with dictionary-based lookups and a small cache across imports, parser, reference resolution, and data-type initialization to reduce O(n) work; observable behavior and public APIs are unchanged.

Changes

Cohort / File(s) — Summary

  • Reference caching & parent lookup (src/datamodel_code_generator/reference.py): Add _reference_names_cache, _get_reference_names(), _invalidate_reference_names_cache(), and _find_parent_reference(); use the cache and parent lookup across add/get/update/delete and naming flows; invalidate the cache when references change to avoid repeated O(n) scans.
  • Parser indexing & list assignment (src/datamodel_code_generator/parser/base.py): Replace membership/list scans in sort_data_models() with a path_to_index dict for O(1) lookups; use path_to_index.keys() for unsorted names; replace insert+remove with direct index assignment in _replace_model_in_list().
  • Imports lookup optimization (src/datamodel_code_generator/imports.py): In remove_unused(), replace the generator-based next(...) scan with a precomputed reverse_lookup dict and O(1) .get() access to find reference_path.
  • DataType single-pass optional detection (src/datamodel_code_generator/types.py): Consolidate Any vs. optional detection in DataType.__init__ using two boolean flags and a deferred post-loop adjustment to is_optional and data_types, avoiding in-loop mutation and early breaks.
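The two-flag approach described for DataType.__init__ can be sketched in isolation. Note that `simplify_optional_any` and its dict-based inputs are hypothetical illustrations, not the project's real code:

```python
from typing import Any

ANY = "Any"  # stand-in for the project's Any data-type marker


def simplify_optional_any(data_types: list[dict[str, Any]]) -> tuple[bool, list[dict[str, Any]]]:
    """Single pass: detect an optional Any alongside non-Any members,
    deferring the mutation until after the loop."""
    has_optional_any = False
    has_non_any = False
    for dt in data_types:
        if dt["type"] == ANY and dt.get("is_optional"):
            has_optional_any = True
        elif dt["type"] != ANY:
            has_non_any = True
        if has_optional_any and has_non_any:
            break  # both conditions found; no need to keep scanning

    if has_optional_any and has_non_any:
        # Promote optionality to the parent and drop the redundant Any.
        return True, [dt for dt in data_types if dt["type"] != ANY]
    return False, data_types
```

The key point is that the loop only sets flags; the list is filtered once, afterwards, instead of being mutated while iterated.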

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

breaking-change-analyzed

Poem

🐇
I hopped through paths and names today,
Cached my carrots so lookups play.
No slow next(...) nibbling by,
Dicts and parent hops let logic fly.
Faster thumps — a tidy, spry display!

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed: the title 'Optimize performance for large schema processing' directly aligns with the main objective of optimizing the datamodel code generator for large schemas.
  • Linked Issues check — ✅ Passed: the PR addresses issue #2286 with multiple performance optimizations, including caching reference names, replacing linear lookups with O(1) dictionary operations, and optimizing dependency resolution, which directly tackle the reported slowness in _get_unique_name.
  • Out of Scope Changes check — ✅ Passed: all changes are performance optimizations across four modules (imports.py, base.py, reference.py, types.py) that accelerate large schema processing without altering observable behavior.
  • Docstring Coverage — ✅ Passed: docstring coverage is 84.21%, above the required threshold of 80.00%.




@coderabbitai (bot) left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/datamodel_code_generator/types.py (1)

386-396: Cache invalidation bug in swap_with() method.

The unresolved_types property depends on both self.reference and self.data_types. While replace_reference() invalidates the cache when reference changes (line 394), the swap_with() method (line 415) modifies parent.data_types without invalidating parent._unresolved_types_cache. This causes stale cache data if unresolved_types was accessed before swap_with() is called on a child.

Add parent._unresolved_types_cache = None after line 415 to ensure cache coherency when parent's data_types is modified.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1966581 and 3fbeaf9.

📒 Files selected for processing (4)
  • src/datamodel_code_generator/imports.py
  • src/datamodel_code_generator/parser/base.py
  • src/datamodel_code_generator/reference.py
  • src/datamodel_code_generator/types.py
🧰 Additional context used
🧬 Code graph analysis (3)
src/datamodel_code_generator/types.py (2)
src/datamodel_code_generator/reference.py (1)
  • reference (76-78)
src/datamodel_code_generator/model/base.py (2)
  • path (811-813)
  • all_data_types (778-782)
src/datamodel_code_generator/reference.py (3)
src/datamodel_code_generator/model/base.py (2)
  • name (730-732)
  • path (811-813)
src/datamodel_code_generator/__main__.py (1)
  • get (117-119)
src/datamodel_code_generator/__init__.py (1)
  • NamingStrategy (286-298)
src/datamodel_code_generator/parser/base.py (3)
src/datamodel_code_generator/model/base.py (3)
  • reference_classes (722-727)
  • path (811-813)
  • all_data_types (778-782)
src/datamodel_code_generator/model/pydantic_v2/root_model.py (1)
  • RootModel (14-45)
src/datamodel_code_generator/types.py (1)
  • all_data_types (433-439)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: 3.12 on Windows
  • GitHub Check: 3.10 on Windows
  • GitHub Check: py312-isort5 on Ubuntu
  • GitHub Check: 3.10 on macOS
  • GitHub Check: 3.11 on Ubuntu
  • GitHub Check: py312-black23 on Ubuntu
  • GitHub Check: 3.14 on Windows
  • GitHub Check: 3.11 on Windows
  • GitHub Check: 3.13 on Windows
  • GitHub Check: benchmarks
  • GitHub Check: Analyze (python)
🔇 Additional comments (13)
src/datamodel_code_generator/imports.py (1)

188-194: LGTM! Good performance optimization.

The reverse lookup dictionary correctly replaces the O(n) linear scan per import with a single O(n) dict construction followed by O(1) lookups. This is a meaningful improvement for remove_unused when processing many imports.
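The pattern this comment describes can be shown with made-up data (`imports`, `find_path_scan`, and `find_path_dict` are illustrative names, not the module's actual API):

```python
imports = {"a/path": "ModelA", "b/path": "ModelB", "c/path": "ModelC"}


# Before: an O(n) generator scan for every name looked up.
def find_path_scan(name: str):
    return next((path for path, alias in imports.items() if alias == name), None)


# After: one O(n) pass builds the reverse map; each lookup is then O(1).
reverse_lookup = {alias: path for path, alias in imports.items()}


def find_path_dict(name: str):
    return reverse_lookup.get(name)
```

When many names are resolved against the same mapping, paying the dict construction cost once beats repeating the scan per lookup.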

src/datamodel_code_generator/parser/base.py (4)

400-400: LGTM! Correct set operation optimization.

Using sorted_data_models.keys() directly in the set difference is efficient since dict views support set operations in Python 3.
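A minimal illustration of the view-based set difference (variable contents are invented):

```python
sorted_data_models = {"A": 0, "B": 1}
all_names = {"A", "B", "C", "D"}

# dict.keys() returns a set-like view, so it participates in set
# operations directly -- no intermediate set(...) copy is needed.
unsorted = all_names - sorted_data_models.keys()
```

`unsorted` here contains the names not yet placed in `sorted_data_models`.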


422-436: LGTM! Significant performance improvement for dependency resolution.

Building path_to_index once and using O(1) dictionary lookups instead of repeated O(n) list.index() calls reduces the complexity from O(n²) to O(n) in this loop. The walrus operator usage is correct and idiomatic.
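The path_to_index idea can be sketched as follows (`models` and `deps` are made-up data; the real loop resolves model dependencies):

```python
models = ["m/a", "m/b", "m/c", "m/d"]

# Before: models.index(path) inside the loop is O(n) per call -> O(n^2) total.
# After: build the index map once, then each lookup is O(1).
path_to_index = {path: i for i, path in enumerate(models)}

deps = ["m/c", "m/x", "m/a"]  # "m/x" is not a known model
resolved = []
for path in deps:
    # Walrus operator: test for membership and bind the index in one step.
    if (idx := path_to_index.get(path)) is not None:
        resolved.append(idx)
```

The `is not None` guard matters: index 0 is falsy, so a bare truthiness check would silently drop the first model.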


454-456: LGTM! Consistent use of path_to_index.

Using path_to_index.keys() to construct unsorted_data_model_names is consistent with the earlier optimization and avoids maintaining a separate data structure.


1086-1088: LGTM! Cleaner in-place replacement.

Direct index assignment avoids the element shifting overhead of insert+remove. The comment's complexity analysis is approximate (both are O(n)), but the new approach is more efficient in practice by avoiding list element shifts.
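A toy illustration of the two strategies (the real _replace_model_in_list already knows the target index, so the `index()` call below is only for the demo):

```python
models = ["A", "B", "C"]

# insert + remove shifts every trailing element twice:
#   models.insert(1, "B2"); models.remove("B")
# Direct index assignment touches only the one slot:
idx = models.index("B")
models[idx] = "B2"
```

Both approaches yield the same list; the assignment just avoids the element shifting that `insert` and `remove` each perform.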

src/datamodel_code_generator/types.py (3)

341-346: LGTM! Cache field setup is correct.

Adding _unresolved_types_cache to _exclude_fields ensures it won't be included in dict/model_dump operations, and the field declaration with None default is appropriate for lazy initialization.


375-384: LGTM! Correct lazy caching implementation.

The cache-check-compute-store pattern is correctly implemented. Using frozenset ensures the cached value is immutable, preventing accidental modifications.
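The cache-check-compute-store pattern, sketched with a hypothetical `Node` class (the real field lives on DataType, and the real "unresolved" test is more involved):

```python
class Node:
    """Minimal sketch of lazy caching with invalidation on mutation."""

    def __init__(self, data_types):
        self.data_types = data_types
        self._unresolved_types_cache = None

    @property
    def unresolved_types(self):
        if self._unresolved_types_cache is None:
            # frozenset makes the cached value immutable, so callers
            # cannot mutate what later reads will observe.
            self._unresolved_types_cache = frozenset(
                t for t in self.data_types if t.startswith("#/")
            )
        return self._unresolved_types_cache

    def replace_data_types(self, data_types):
        self.data_types = data_types
        self._unresolved_types_cache = None  # invalidate on mutation
```

This also illustrates the hazard flagged in the out-of-diff comment above: any code path that mutates `data_types` without resetting the cache would serve stale results.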


529-543: LGTM! Correct single-pass optimization.

The logic correctly identifies when an optional Any type coexists with non-Any types, and appropriately:

  1. Promotes is_optional to the parent type
  2. Filters out the redundant Any entries

The early exit when both conditions are met avoids unnecessary iterations.

src/datamodel_code_generator/reference.py (5)

573-584: LGTM! Clean cache implementation.

The cache getter lazily populates from self.references.values(), and the invalidation method simply sets it to None. This is a straightforward and effective caching pattern.


807-818: LGTM! Clean helper extraction.

The _find_parent_reference method correctly traverses up the path hierarchy and consolidates parent lookup logic that was previously duplicated across _check_parent_scope_option and _apply_full_path_naming.
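A sketch of the upward path traversal (`find_parent_reference` and the flat `references` dict are simplifications of the real ModelResolver state):

```python
def find_parent_reference(path: str, references: dict):
    """Walk up the '/'-separated path until a registered parent is found."""
    while "/" in path:
        path = path.rsplit("/", 1)[0]  # strip the last path segment
        parent = references.get(path)
        if parent is not None:
            return parent
    return None  # reached the root without a match
```

Centralizing this walk in one helper is what lets both naming flows share it instead of duplicating the loop.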


804-804: LGTM! Cache invalidation appears complete.

All methods that modify self.references or reference names properly call _invalidate_reference_names_cache(). The invalidation points cover:

  • add_ref: new reference creation
  • _rename_external_ref_with_same_name: reference renaming
  • add: reference creation/update
  • delete: reference removal

Also applies to: 879-879, 945-945, 955-955, 967-967


820-825: LGTM! Clean refactor using the new helper.

Both _check_parent_scope_option and _apply_full_path_naming now use _find_parent_reference, reducing code duplication while maintaining the same behavior.

Also applies to: 827-841


1013-1032: LGTM! This is the core optimization addressing issue #2286.

Using _get_reference_names() avoids the O(n) set creation on every _get_unique_name call. For large schemas with many references, this significantly reduces the overhead of duplicate-name checks by caching the set of existing reference names.
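The caching scheme can be sketched end to end with a hypothetical `ModelResolverSketch` (the real class carries far more state, and the `_N` suffixing rule here is only illustrative):

```python
class ModelResolverSketch:
    """Sketch: cache the set of reference names so _get_unique_name
    no longer rebuilds it on every call."""

    def __init__(self):
        self.references = {}  # path -> assigned name
        self._reference_names_cache = None

    def _get_reference_names(self):
        if self._reference_names_cache is None:
            self._reference_names_cache = set(self.references.values())
        return self._reference_names_cache

    def _invalidate_reference_names_cache(self):
        self._reference_names_cache = None

    def add(self, path, name):
        unique = self._get_unique_name(name)
        self.references[path] = unique
        self._invalidate_reference_names_cache()  # names changed
        return unique

    def _get_unique_name(self, name):
        names = self._get_reference_names()  # cached between mutations
        if name not in names:
            return name
        i = 1
        while f"{name}_{i}" in names:
            i += 1
        return f"{name}_{i}"
```

Between mutations the name set is built once and reused, which is exactly the O(n)-per-call cost that issue #2286 reported being paid repeatedly.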

@codspeed-hq (bot) commented Dec 23, 2025

CodSpeed Performance Report

Merging #2774 will not alter performance

Comparing perf/optimize-large-schema-processing (a688ee9) with main (1966581)

Summary

✅ 73 untouched
⏩ 10 skipped¹

Footnotes

  1. 10 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.

@codecov (bot) commented Dec 23, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.48%. Comparing base (1966581) to head (a688ee9).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2774   +/-   ##
=======================================
  Coverage   99.47%   99.48%           
=======================================
  Files          87       87           
  Lines       12882    12897   +15     
  Branches     1546     1543    -3     
=======================================
+ Hits        12815    12831   +16     
  Misses         35       35           
+ Partials       32       31    -1     
Flag: unittests — coverage 99.48% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@coderabbitai (bot) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/datamodel_code_generator/types.py (1)

521-536: LGTM—Edge-case optimization is sound.

The two-flag detection logic correctly identifies when Any with Optional can be removed (when non-Any types exist), and the early exit on line 532 avoids unnecessary iteration once both conditions are found.

One minor note: the comment on line 521 says "Single-pass optimization," but the implementation actually uses multiple passes—one to detect (525–532), one to filter (536), and another to set parents (538–541). This doesn't affect correctness, just comment accuracy.

Optional: Clarify the comment
-        # Single-pass optimization: detect ANY+optional and non-ANY types together
+        # Edge-case optimization: detect ANY+optional and non-ANY types together
         # This is a rare edge case optimization - pragma: no cover
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b4fa082 and 4866f76.

📒 Files selected for processing (1)
  • src/datamodel_code_generator/types.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: benchmarks
  • GitHub Check: py312-isort5 on Ubuntu
  • GitHub Check: py312-isort6 on Ubuntu
  • GitHub Check: Analyze (python)
  • GitHub Check: 3.14 on macOS
  • GitHub Check: 3.13 on Windows
  • GitHub Check: 3.12 on macOS
  • GitHub Check: 3.11 on Windows
  • GitHub Check: 3.10 on Windows
  • GitHub Check: 3.11 on macOS
  • GitHub Check: 3.14 on Windows
  • GitHub Check: 3.12 on Windows

@koxudaxi force-pushed the perf/optimize-large-schema-processing branch from 4866f76 to 6b0072f on December 24, 2025 00:09, with the commit message:
The optimization block for optional Any + non-Any types is a rare edge case
that is difficult to trigger via e2e tests. Mark it with pragma: no cover.
@koxudaxi force-pushed the perf/optimize-large-schema-processing branch 2 times, most recently from ff54605 to fe31a45 on December 24, 2025 00:20
@koxudaxi force-pushed the perf/optimize-large-schema-processing branch from fe31a45 to a688ee9 on December 24, 2025 00:35
@koxudaxi merged commit 3ffee15 into main on Dec 24, 2025
37 checks passed
@koxudaxi deleted the perf/optimize-large-schema-processing branch on December 24, 2025 01:19
@github-actions (bot) commented:

Breaking Change Analysis

Result: No breaking changes detected

Reasoning: This PR contains only internal performance optimizations with no breaking changes:

  1. imports.py: Changed O(n) linear scan to O(1) dictionary lookup for reference paths - same behavior, just faster.

  2. parser/base.py: Replaced set(sorted_data_models) with sorted_data_models.keys() (functionally equivalent), replaced list.index() with dictionary lookups, and changed insert+remove to direct index assignment in _replace_model_in_list - all produce identical results.

  3. reference.py: Added private caching mechanism (_reference_names_cache) for reference names to avoid O(n) set creation on every _get_unique_name call. Extracted _find_parent_reference helper method. All changes are internal implementation details with proper cache invalidation on reference mutations.

  4. types.py: Optimized DataType.__init__ to use single-pass detection with early exit instead of a nested loop. The logic for handling optional Any types alongside non-Any types remains functionally identical.

All changes are purely algorithmic optimizations (O(n) → O(1) lookups, reduced iterations) that preserve the exact same external behavior. No public APIs, CLI options, generated code output, template interfaces, or error handling were modified.


This analysis was performed by Claude Code Action


Development

Successfully merging this pull request may close these issues.

_get_unique_name take really long with large swagger definition file

1 participant