Selective tydb reads for imported modules#2916
Closed
plajjan wants to merge 19 commits into
Closed
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3f377968ad
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
3f37796 to
b729c65
Compare
added 19 commits
June 11, 2026 20:10
Every .tydb access opened its own LMDB environment, ran one transaction, and closed it again, with a per-path mutex keeping LMDB's one-environment-per-process rule satisfied. That makes each read pay an environment open and an mmap, and it leaves no room for a reader that outlives a single call. Cache one read-only environment per path in a process-wide registry instead. Readers claim the cached environment, run their short read transaction on it, and release it; the registry counts active readers. Exclusive users - writers and environment copies - keep the per-path lock and retire the cached environment before opening their own, waiting for active readers to drain, so the process still never holds two environments for one path. In-process readers therefore block during a write exactly as before, while readers in other processes are protected by LMDB's reader table, which every environment participates in. A transient lock-table race, or a map grown by another process, retires the cached environment and retries on a fresh one. The binding adds MDB_NOTLS implicitly, so reader slots follow transactions rather than the transient bound threads they run on.
A module that imports a large dependency decodes the dependency's whole .tydb interface even when it uses a handful of names. The on-disk layout already keys every NameInfo entry individually, but there is no read API for exact lookups, and no stored answers for the solver's module-wide questions: which constructors carry an attribute, which types sit below a constructor, which extensions implement a protocol or extend a type. Store narrow per-module query indexes at write time, derived from the public interface. "public-names", "constructors", and "actors" record name lists in TEnv order, while "con-attr/<name>", "proto-attr/<name>", "descendants/<qname>", "ext-proto/<qname>", and "ext-type/<qname>" map one query key to the public names that answer it. QName keys hash a source-location-free encoding so the same semantic name always maps to the same key. The descendants index covers classes and protocols; readers filter by NameInfo kind. On the read side, add InterfaceDB: a handle whose path is validated once and whose lookups each run a short read transaction on the shared per-path environment. readInterfaceDBNameInfoMaybe decodes exactly one name-info row, and the readInterfaceDB* index readers resolve one index row plus the name-info entries it points at. Nothing uses the new API yet; the .tydb version is bumped for the new keys.
Imported modules live in the env as NModule TEnv trees plus a hashed mirror, so every per-module view is derived from a fully decoded interface. To read interfaces selectively, the env needs a module representation that answers questions through functions instead of exposing a materialized TEnv. Introduce ModuleInfo as that per-module lookup view: recorded imports and docstring, an exact name-lookup function, public names, constructors, actors, attribute owners, descendants, and extension witnesses keyed by protocol and by type. mkModuleInfo backs the view with an in-memory interface TEnv, building lazy maps for the index queries; mkTyFileModuleInfo backs it with keyed InterfaceDB reads, hidden behind pure-looking functions and memoized per handle through memoLookup, so a demanded name or index row is read at most once. Store one ModuleInfo per module in a new map in EnvF, populated alongside the existing modules/hmodules TEnv storage, which stays the active lookup path for now. addMod fills both representations, mapModules rebuilds the ModuleInfo of every converted module, and findModuleInfo/lookupModuleInfo mirror the alias-aware and direct module lookups. While compiling __builtin__ itself the module is its own environment, so its view resolves names through the active hnames index instead of a TEnv copy.
Qualified-name resolution walks the NModule tree through findHMod and lookupHMod, which presumes a fully materialized per-module TEnv for every loaded module. Switch the point lookups over to the ModuleInfo map: tryQName and QName unaliasing resolve one (module, name) pair through moduleLookupHName, and transitiveImports follows each module's recorded import list instead of decoding NModule nodes. Both representations are built from the same interface data, so behavior is unchanged; the old tree lookups remain for their remaining callers until those are converted. One thing the tree provided implicitly was parent packages: importing a.b created an "a" node, so module-path detection could treat "a" as a module prefix. The flat map needs that spelled out, so isMod and ModName unaliasing consult modulePrefix, which checks whether any loaded module name extends the queried path.
A from-import resolves its names by searching the imported module's public TEnv, which forces the whole interface even for a single imported name. Route imports through ModuleInfo instead. importSome looks up exactly the requested names and aliases them; importAll enumerates the recorded public-name list and performs one exact lookup per importable entry, since import * is the one importer that genuinely asks for everything. doImp returns the imported module's ModuleInfo, and a module already loaded in the env restores its transitive import closure from the recorded import list without touching the .tydb. Completion's shallow import fallback follows the same path. Because the lookup functions only expose public names, a selected import of a private name now fails with NoItem at the lookup itself, matching how qualified access to private names already fails.
allDescendants answers "which types sit below this constructor" by scanning every constructor of every transitive import and testing its ancestry, forcing all imported interfaces for one solver query. Read each imported module's descendants index for the queried constructor instead: the writer recorded, for every public class, the class names sitting below each of its ancestors, so the reader resolves one index row per module plus the name-info entries it names. Local definitions keep the ancestry scan over localCons, since they are not covered by any module index. importedModuleInfos centralizes the per-transitive-import ModuleInfo enumeration for this and the following query conversions.
allConAttr and allPConAttr collect the constructors that expose an attribute by testing every imported constructor with hasAttr, an ancestry-aware scan that forces all imported interfaces whenever the solver ranks a selection constraint. Read the per-module attribute indexes instead. The indexes record attributes where they are declared, so declared owners alone would miss imported types that only inherit the attribute. Complete the result with the descendants of every declaring constructor — having the attribute means having a declaring ancestor — using class descendants for allConAttr and protocol descendants for allPConAttr. This preserves the ancestry-aware semantics of the old scan while reading only rows keyed by the queried attribute name. Local definitions keep the hasAttr scan.
allActors filters every imported constructor for actors when the solver searches Identity-protocol candidates, which forces all imported interfaces for a query whose answer is usually tiny. Concatenate each imported module's compact actor record instead, so the cost follows the actor count rather than the module size. allTypes stays a deliberately broad enumeration — it remains for callers that really ask for every imported type — but it reads constructor records through ModuleInfo so it no longer depends on the materialized module TEnvs.
Every type check starts by folding every transitive import's full interface into the type-check env: setupWits turns each imported extension into a closed witness and setupCons registers each imported constructor in the tyids/tyinfos lattice. This preload is the largest forced materialization of imported interfaces and scales with dependency size rather than with what the module being checked actually uses. Drop the preload and merge imported witnesses per query instead. witsByPName and witsByTName concatenate the local witness tables with each imported module's extension-index reads, keyed by the queried protocol or type name. queryQNames tries both the queried and the unaliased name form, since interfaces record references in either, and uniqueWits collapses duplicates the way the insert-time check used to. Solver behavior is otherwise unchanged because findWitness, allBelowProto, and findProtoByAttr already funnel through these two functions. The tyids/tyinfos lattice now covers local definitions only, so constructor and type-variable registration tolerates ancestors that are not registered: addconinfo keeps only known ancestor ids and addvarinfo falls back to an empty entry for an imported bound. The lattice's attribute and subtype sets feed nothing but the debug printer today, so omitting imported ancestors changes no behavior.
A few consumers outside the type checker still reach into the materialized module TEnvs: code generation's provider-object scan, doc output, root-actor eligibility checks, and completion's qualified-name lookup. Convert them to ModuleInfo. classQBinds and completion resolve one exact name; doc output and root-actor checks reconstruct a public TEnv through modulePublicTEnv only for the single module they describe. allClasses enumerates public names through each module's lookup function rather than the raw constructor records, which matters once later passes rewrite imported entries on demand: the lookup function is where converted entries will appear.
Each back pass rewrites every loaded module's interface TEnv through mapModules: Normalizer, Deactorizer, CPS, LambdaLifter, and the protocol converter all walk and rebuild the full NModule tree once per compiled module, no matter how few imported names the module touches. Replace mapModules with convertModules, which wraps each ModuleInfo's lookup function with the pass's conversion: a demanded name is fetched from the underlying view, converted, and memoized, so conversion work follows actual lookups. Passes that fabricate new names from existing entries — the deactorizer's newact definitions, the converter's derived protocol classes — provide a source-name function mapping a generated name back to the entry it derives from, so a lookup of the generated name converts its source entry and picks the result out of the produced bindings. The modules tree is no longer rewritten by the passes, so forceTypeResult stops forcing the hmodules mirror.
Importing a module that is not yet in the env decodes its whole .tydb through readFile, and reusing a fresh cached module in the build graph does the same through readIfaceFromTy. Either way a dependent module pays for the full size of its dependencies before type checking starts. Make both paths install lazy shells instead. doImp opens the imported module's .tydb with openInterfaceDB, reads only the import list and docstring, and registers a ModuleInfo backed by keyed reads; the recursive import-closure walk follows the recorded imports as before. readIfaceFromTy reads the stored module hashes plus the same metadata, and FrontResult carries the shell so the scheduler's accumulated environment registers it directly. Freshly compiled modules keep registering eager TEnv-backed views built from their front results. A missing, corrupt, or version-mismatched .tydb is treated as a cache miss when opening the handle (openInterfaceDBMaybe), matching the old full-read behavior of falling back to recompilation.
The front pass's hashing phase resolves recorded hashes for every external name a module depends on. It did so by loading a complete per-module name-hash map for each referenced dependency — a whole name-hash section decode per dependency, proportional to dependency size rather than to the handful of names actually referenced. Resolve each external dependency with one targeted read instead. resolveDepHashes becomes IO and asks the per-name resolver for exactly the (module, name) pairs in the dependency sets. In-graph providers still answer from the scheduler's in-memory name map; only out-of-graph modules such as __builtin__ and search-path dependencies fall through to keyed .tydb reads. The per-module map loading and its error plumbing disappear.
readHeader decodes every per-name hash row, and it backs all the cheap-looking paths: cached-module discovery, root-stub expectation, test listing, import discovery, and DBP preparation. A fresh cached dependency therefore still costs a whole name-hash section decode on every build. Add readHeaderSummary, which reads the same metadata plus the stored name count but no per-name rows, and switch those paths over. TyTask carries the name count — enough for the DBP threshold and doc-output decisions — instead of the hash list, so cached tasks hold no name hashes at all; a dependent that needs one of a cached provider's hashes falls back to a targeted read (providerIsCachedTy). DBP preparation reads only the roots row and expands its selection closure through exact name-hash and extension-index reads (dbpNameHashClosure), so selection cost follows the seed closure rather than the module size. The verbose hash-delta report falls back to a full map read only when the previous hashes are not already in memory.
Each targeted name-hash read opens the dependency's .tydb, runs one transaction, and closes the environment again. During hashing of a large module the same few builtin hashes are resolved once per local name, which multiplies into tens of thousands of redundant reads. Memoize per-name results in a process-wide map, kept separate from the existing per-module map cache because that cache's entries must stay complete. Cached modules now report empty hash lists, so updateNameHashCache also skips empty results to keep them from shadowing targeted reads behind an apparently-complete empty module map.
Every reader has moved over to ModuleInfo, so the duplicated module storage can go. EnvF drops the NModule TEnv tree and its hashed mirror, and the ModuleInfo map becomes the modules field. addMod reduces to registering a TEnv-backed view, getImps no longer rebuilds a hash mirror after loading imports, and the tree walkers — lookupModule, lookupMod, findHMod, lookupHMod, lookupHModule — are deleted. Completion drops its manual mirror refresh, and the benches force the module map through its lookup functions instead of the old HTEnv, which exercises the same thunks the compiler demands during a real build.
Lock down the selective-read behavior with traced tests, using ACTON_TYDB_TRACE_READS to observe exactly which .tydb operations a build performs against a generated class-heavy dependency. The incremental cases assert that a small importer performs keyed lookups only — exact name-info reads, attribute-index reads, and keyed extension reads — and never decodes a whole section (all, public-names, constructors, name-hash-all). An unused import must not read any names at all, and adding an unrelated name to the dependency must keep the importer fresh while reading only the name it actually uses. The module_lookup compiler project covers imported classes, extension witnesses, and actors end to end through codegen.
Three costs crept in with per-query witness merging, all visible on large-module builds. Witness dedup unaliased both sides of every comparison, on a function the solver calls for every protocol constraint; comparing the recorded names directly matches the old insert-time check, so drop the unaliasing. While compiling __builtin__ itself the env has no imported modules, but importedModuleInfos rebuilt the builtin self-view on every query; return nothing in that case so builtin compilation stays on its local tables. And the dependency index deduplicated per-name user lists with quadratic nub, which dominated .tydb writes for large modules; collect users in sets instead.
Update the dev guide for the ModuleInfo world. imports_and_envs.md describes the module shells, the selective lookup path, the on-demand back-pass conversion, and how solver queries map onto the narrow per-module indexes, including the read-time completion of inherited attribute owners. interface_caches.md documents the new key layout — public-names/constructors/actors and the per-query index rows — along with the InterfaceDB handle, its read API, and the header-summary read used for cache checks. The debugging advice now points at broad enumeration as the thing to look for when a full read shows up during normal import lookup.
b729c65 to
5021b6f
Compare
Contributor
Author
|
@nordlander btw, this is going to conflict with the changes in the other branches. I think #2924 should take priority, so please review that one first, and we will have to rebase and resolve the conflicts in this branch. |
Contributor
Author
|
Closing in preference of #2930 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This makes the compiler read imported modules' .tydb interfaces selectively instead of decoding them in full. Until now, compiling a module forced the complete interface of every transitive import: doImp decoded the whole .tydb, initTypeEnv folded every imported constructor and extension witness into the type-check environment, the solver's candidate queries scanned every imported constructor, each back pass rewrote every loaded module's interface, and cache checks decoded whole per-name hash sections. A small module importing a large one therefore paid for the dependency's size on every rebuild.
The series introduces ModuleInfo as the only environment representation of dependency modules: a per-module view that answers questions through lookup functions rather than exposing a materialized TEnv. A locally compiled module backs it with its in-memory interface; an imported module backs it with a long-lived read-only LMDB handle (InterfaceDB) doing keyed reads that are memoized per handle. To support this, the .tydb format gains narrow per-module query indexes written next to the existing keyed name entries: public names, constructors, actors, attribute owners, descendants, and extensions keyed by protocol and by type. The version is bumped to [0,25].
The conversion is staged for review, so the commits are best read in order, oldest first. First the plumbing lands (the indexes and read API, then ModuleInfo alongside the old modules/hmodules storage), then consumers are rerouted one group per commit while the two representations are kept in sync and behavior is preserved: qualified-name lookup, imports, the solver-facing queries, the type-check environment, tools, and the back passes (which now convert imported entries on demand instead of rewriting whole module trees). Only then does the lazy flip happen: doImp and cached-module reuse install .tydb-backed shells, dependency hashes resolve per name, and cache checks read a header summary with a stored name count instead of every hash row. The final step deletes the NModule tree and its hashed mirror, leaving the ModuleInfo map as the only module storage.
A deliberate design constraint throughout: no change to inference behavior, and Solver.hs is untouched. The attribute indexes record attributes where they are declared, and readers complete inherited owners through the descendants of each declaring constructor, preserving the ancestry-aware semantics of the old hasAttr scans. Witness queries merge keyed extension-index reads over all transitive imports, so cross-module extensions keep working and query cost follows the number of imported modules, never their size.
The benchmark this targets is a project with a generated class-heavy module (test/compiler/concurrent_typecheck_class_heavy/generate_heavy.py) and a small module importing it. Rebuilding only the small module takes 0.09s whether the dependency has 3,000, 10,000 or 30,000 classes; on main the same rebuild takes 0.55s, 1.6s and 5.8s respectively. Full builds also got faster, mostly from the dependency-index writer deduplicating user lists with sets instead of quadratic nub: the 30,000-class project drops from 345s to 96s.
New traced regression tests pin the behavior down with ACTON_TYDB_TRACE_READS: a small importer must perform keyed lookups only (exact name reads, attribute-index reads, keyed extension reads) with no whole-section decodes, an unused import must not read names at all, and an unrelated added name in the dependency must keep the importer fresh. A module_lookup compiler test project covers imported classes, extension witnesses and actors end to end. The lib, incremental and compiler test suites all pass.
Known follow-up work, intentionally out of scope here: when DBP has to regenerate a large dependency's code (its selection changed or the module itself changed), prepareDeferredBackJob still does one full typed-module read before pruning; storing per-name statement indexes so only the selected statements are read is the natural next step. Test discovery in acton test still decodes whole name-hash sections, initEnv still reads builtin in full once per process, and InterfaceDB handles rely on a GC finalizer backstop, which is fine for CLI compiles but worth revisiting for the long-lived LSP process.
This PR is arranged as a, rather long, series of commit. Each make sense on its own, introducing one conceptual change. It is expressly written to be simple to review.