Optimize performance for large schema processing#2774
Conversation
|
Warning Rate limit exceeded@koxudaxi has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 8 minutes and 17 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughReplace repeated linear scans with dictionary-based lookups and a small cache across imports, parser, reference resolution, and data-type initialization to reduce O(n) work; observable behavior and public APIs are unchanged. (28 words) Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Poem
Pre-merge checks and finishing touches✅ Passed checks (5 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/datamodel_code_generator/types.py (1)
386-396: Cache invalidation bug inswap_with()method.The
unresolved_typesproperty depends on bothself.referenceandself.data_types. Whilereplace_reference()invalidates the cache when reference changes (line 394), theswap_with()method (line 415) modifiesparent.data_typeswithout invalidatingparent._unresolved_types_cache. This causes stale cache data ifunresolved_typeswas accessed beforeswap_with()is called on a child.Add
parent._unresolved_types_cache = Noneafter line 415 to ensure cache coherency when parent's data_types is modified.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
src/datamodel_code_generator/imports.pysrc/datamodel_code_generator/parser/base.pysrc/datamodel_code_generator/reference.pysrc/datamodel_code_generator/types.py
🧰 Additional context used
🧬 Code graph analysis (3)
src/datamodel_code_generator/types.py (2)
src/datamodel_code_generator/reference.py (1)
reference(76-78)src/datamodel_code_generator/model/base.py (2)
path(811-813)all_data_types(778-782)
src/datamodel_code_generator/reference.py (3)
src/datamodel_code_generator/model/base.py (2)
name(730-732)path(811-813)src/datamodel_code_generator/__main__.py (1)
get(117-119)src/datamodel_code_generator/__init__.py (1)
NamingStrategy(286-298)
src/datamodel_code_generator/parser/base.py (3)
src/datamodel_code_generator/model/base.py (3)
reference_classes(722-727)path(811-813)all_data_types(778-782)src/datamodel_code_generator/model/pydantic_v2/root_model.py (1)
RootModel(14-45)src/datamodel_code_generator/types.py (1)
all_data_types(433-439)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: 3.12 on Windows
- GitHub Check: 3.10 on Windows
- GitHub Check: py312-isort5 on Ubuntu
- GitHub Check: 3.10 on macOS
- GitHub Check: 3.11 on Ubuntu
- GitHub Check: py312-black23 on Ubuntu
- GitHub Check: 3.14 on Windows
- GitHub Check: 3.11 on Windows
- GitHub Check: 3.13 on Windows
- GitHub Check: benchmarks
- GitHub Check: Analyze (python)
🔇 Additional comments (13)
src/datamodel_code_generator/imports.py (1)
188-194: LGTM! Good performance optimization.The reverse lookup dictionary correctly replaces the O(n) linear scan per import with a single O(n) dict construction followed by O(1) lookups. This is a meaningful improvement for
remove_unusedwhen processing many imports.src/datamodel_code_generator/parser/base.py (4)
400-400: LGTM! Correct set operation optimization.Using
sorted_data_models.keys()directly in the set difference is efficient since dict views support set operations in Python 3.
422-436: LGTM! Significant performance improvement for dependency resolution.Building
path_to_indexonce and using O(1) dictionary lookups instead of repeated O(n)list.index()calls reduces the complexity from O(n²) to O(n) in this loop. The walrus operator usage is correct and idiomatic.
454-456: LGTM! Consistent use of path_to_index.Using
path_to_index.keys()to constructunsorted_data_model_namesis consistent with the earlier optimization and avoids maintaining a separate data structure.
1086-1088: LGTM! Cleaner in-place replacement.Direct index assignment avoids the element shifting overhead of insert+remove. The comment's complexity analysis is approximate (both are O(n)), but the new approach is more efficient in practice by avoiding list element shifts.
src/datamodel_code_generator/types.py (3)
341-346: LGTM! Cache field setup is correct.Adding
_unresolved_types_cacheto_exclude_fieldsensures it won't be included in dict/model_dump operations, and the field declaration withNonedefault is appropriate for lazy initialization.
375-384: LGTM! Correct lazy caching implementation.The cache-check-compute-store pattern is correctly implemented. Using
frozensetensures the cached value is immutable, preventing accidental modifications.
529-543: LGTM! Correct single-pass optimization.The logic correctly identifies when an optional
Anytype coexists with non-Anytypes, and appropriately:
- Promotes
is_optionalto the parent type- Filters out the redundant
AnyentriesThe early exit when both conditions are met avoids unnecessary iterations.
src/datamodel_code_generator/reference.py (5)
573-584: LGTM! Clean cache implementation.The cache getter lazily populates from
self.references.values(), and the invalidation method simply sets it toNone. This is a straightforward and effective caching pattern.
807-818: LGTM! Clean helper extraction.The
_find_parent_referencemethod correctly traverses up the path hierarchy and consolidates parent lookup logic that was previously duplicated across_check_parent_scope_optionand_apply_full_path_naming.
804-804: LGTM! Cache invalidation appears complete.All methods that modify
self.referencesor reference names properly call_invalidate_reference_names_cache(). The invalidation points cover:
add_ref: new reference creation_rename_external_ref_with_same_name: reference renamingadd: reference creation/updatedelete: reference removalAlso applies to: 879-879, 945-945, 955-955, 967-967
820-825: LGTM! Clean refactor using the new helper.Both
_check_parent_scope_optionand_apply_full_path_namingnow use_find_parent_reference, reducing code duplication while maintaining the same behavior.Also applies to: 827-841
1013-1032: LGTM! This is the core optimization addressing issue #2286.Using
_get_reference_names()avoids the O(n) set creation on every_get_unique_namecall. For large schemas with many references, this significantly reduces the overhead of duplicate-name checks by caching the set of existing reference names.
CodSpeed Performance ReportMerging #2774 will not alter performanceComparing Summary
Footnotes
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2774 +/- ##
=======================================
Coverage 99.47% 99.48%
=======================================
Files 87 87
Lines 12882 12897 +15
Branches 1546 1543 -3
=======================================
+ Hits 12815 12831 +16
Misses 35 35
+ Partials 32 31 -1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/datamodel_code_generator/types.py (1)
521-536: LGTM—Edge-case optimization is sound.The two-flag detection logic correctly identifies when
AnywithOptionalcan be removed (when non-Anytypes exist), and the early exit on line 532 avoids unnecessary iteration once both conditions are found.One minor note: the comment on line 521 says "Single-pass optimization," but the implementation actually uses multiple passes—one to detect (525–532), one to filter (536), and another to set parents (538–541). This doesn't affect correctness, just comment accuracy.
Optional: Clarify the comment
- # Single-pass optimization: detect ANY+optional and non-ANY types together + # Edge-case optimization: detect ANY+optional and non-ANY types together # This is a rare edge case optimization - pragma: no cover
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/datamodel_code_generator/types.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: benchmarks
- GitHub Check: py312-isort5 on Ubuntu
- GitHub Check: py312-isort6 on Ubuntu
- GitHub Check: Analyze (python)
- GitHub Check: 3.14 on macOS
- GitHub Check: 3.13 on Windows
- GitHub Check: 3.12 on macOS
- GitHub Check: 3.11 on Windows
- GitHub Check: 3.10 on Windows
- GitHub Check: 3.11 on macOS
- GitHub Check: 3.14 on Windows
- GitHub Check: 3.12 on Windows
4866f76 to
6b0072f
Compare
The optimization block for optional Any + non-Any types is a rare edge case that is difficult to trigger via e2e tests. Mark it with pragma: no cover.
ff54605 to
fe31a45
Compare
fe31a45 to
a688ee9
Compare
Breaking Change AnalysisResult: No breaking changes detected Reasoning: This PR contains only internal performance optimizations with no breaking changes:
All changes are purely algorithmic optimizations (O(n) → O(1) lookups, reduced iterations) that preserve the exact same external behavior. No public APIs, CLI options, generated code output, template interfaces, or error handling were modified. This analysis was performed by Claude Code Action |
Fixes: #2286
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.