Adding training functionalities to Toolkit #108
Merged
Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>
…t-loading Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>
Rename EnergyLoss -> EnergyMSELoss, ForceLoss -> ForceMSELoss, StressLoss -> StressMSELoss for naming consistency with EnergyMAELoss and ForceL2NormLoss. Replace ignore_nan with ignore_nonfinite in all three MSE losses, switching masking from isnan() to torch.isfinite() to also exclude inf targets, matching the convention in the MAE/L2 terms. Add missing EnergyMAELoss and ForceL2NormLoss to API docs.
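To illustrate why switching the mask from a NaN check to a finiteness check matters, here is a minimal pure-Python sketch. It is not the toolkit's implementation (which the commit says uses torch.isfinite()); the function names `finite_masked_mse` and `nan_masked_mse` are hypothetical and exist only for this comparison.

```python
import math


def finite_masked_mse(predictions, targets):
    """Mean squared error over entries whose target is finite (no NaN, no inf)."""
    kept = [(p, t) for p, t in zip(predictions, targets) if math.isfinite(t)]
    if not kept:
        return 0.0  # assumption: an all-masked batch contributes zero loss
    return sum((p - t) ** 2 for p, t in kept) / len(kept)


def nan_masked_mse(predictions, targets):
    """Same reduction, but only NaN targets are dropped (the older behaviour)."""
    kept = [(p, t) for p, t in zip(predictions, targets) if not math.isnan(t)]
    if not kept:
        return 0.0
    return sum((p - t) ** 2 for p, t in kept) / len(kept)


predictions = [1.0, 2.0, 3.0]
targets = [1.5, math.nan, math.inf]  # one valid label, one NaN, one inf

finite_loss = finite_masked_mse(predictions, targets)
nan_only_loss = nan_masked_mse(predictions, targets)
```

With a NaN-only mask the inf target survives and the loss itself becomes infinite; with the finiteness mask only the single valid label contributes.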
… weighting EnergyMAELoss(per_atom=True) now uses atom-count-weighted reduction, matching EnergyMSELoss semantics: larger graphs contribute in proportion to their atom count. Previously it used a simple mean over graphs. Also fixes SyntaxWarning from unescaped LaTeX in the class docstring.
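The difference between a simple mean over graphs and an atom-count-weighted reduction can be sketched in plain Python. This is an illustration under assumptions: the helper names are hypothetical, and the exact normalization used by EnergyMAELoss is not shown in this PR text, so the weighted form below (sum of weight times residual, divided by the total atom count) is one plausible reading of "contribute in proportion to their atom count".

```python
def per_atom_abs_errors(pred_energies, true_energies, n_atoms):
    """Per-graph absolute energy error divided by that graph's atom count."""
    return [abs(p - t) / n for p, t, n in zip(pred_energies, true_energies, n_atoms)]


def graph_mean(residuals):
    """Simple mean: every graph counts equally, regardless of size."""
    return sum(residuals) / len(residuals)


def atom_weighted_mean(residuals, n_atoms):
    """Atom-count-weighted mean: larger graphs contribute proportionally more."""
    return sum(r * n for r, n in zip(residuals, n_atoms)) / sum(n_atoms)


# Two graphs: a 2-atom graph and an 8-atom graph (made-up numbers).
n_atoms = [2, 8]
residuals = per_atom_abs_errors([10.0, 104.0], [12.0, 100.0], n_atoms)

simple = graph_mean(residuals)
weighted = atom_weighted_mean(residuals, n_atoms)
```

Here the small graph has the larger per-atom error, so the simple mean is pulled toward it while the weighted mean is dominated by the 8-atom graph.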
…tion BaseLossFunction.forward() is now a concrete template orchestrating five overridable hooks: validate, normalize, mask, compute_residual, reduce. Subclasses override only what they need — at minimum compute_residual(). Add ReductionContext (dict subclass) for passing reduction metadata such as atom-count weights between hooks. Dynamo-safe (no TypedDict). All five leaf losses (EnergyMSELoss, EnergyMAELoss, ForceMSELoss, ForceL2NormLoss, StressMSELoss) refactored to use the template hooks. No behavioral changes — all 186 existing tests pass unchanged.
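The template-method structure described above can be sketched as follows. This is a pure-Python stand-in, not the toolkit's code: the hook names and their order come from the commit message, but the hook signatures, the default behaviours, and the class names `SketchBaseLoss` and `SquaredErrorLoss` are assumptions made for illustration.

```python
import math


class ReductionContext(dict):
    """Plain dict subclass for passing reduction metadata between hooks."""


class SketchBaseLoss:
    """Hypothetical base class: forward() orchestrates five overridable hooks."""

    def forward(self, pred, target):
        self.validate(pred, target)
        ctx = ReductionContext()
        pred, target = self.normalize(pred, target, ctx)
        keep = self.mask(pred, target, ctx)
        residual = self.compute_residual(pred, target, ctx)
        return self.reduce(residual, keep, ctx)

    def validate(self, pred, target):
        if len(pred) != len(target):
            raise ValueError("prediction and target lengths differ")

    def normalize(self, pred, target, ctx):
        return pred, target  # default: no normalization

    def mask(self, pred, target, ctx):
        return [math.isfinite(t) for t in target]  # default: keep finite targets

    def compute_residual(self, pred, target, ctx):
        raise NotImplementedError  # the one hook every subclass must supply

    def reduce(self, residual, keep, ctx):
        kept = [r for r, k in zip(residual, keep) if k]
        return sum(kept) / len(kept) if kept else 0.0  # default: masked mean


class SquaredErrorLoss(SketchBaseLoss):
    """Minimal subclass: overrides only compute_residual()."""

    def compute_residual(self, pred, target, ctx):
        return [(p - t) ** 2 for p, t in zip(pred, target)]


loss_value = SquaredErrorLoss().forward([1.0, 2.0, 5.0], [2.0, math.nan, 3.0])
```

The subclass inherits validation, masking, and reduction unchanged, which is the point of the pattern: a new loss term only has to say how a residual is computed.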
Add three new sections to the losses user guide:
- Example 3: custom mask override (isfinite, padded layouts)
- Example 4: custom reduce override (graph-balanced reduction)
- Layout dispatch with plum (reference to ForceMSELoss/ForceL2NormLoss)
Covers built-in loss terms, the BaseLossFunction template-method pattern, and how to implement custom losses with normalize, mask, reduce overrides and plum dispatch for multi-layout forces.
Add training strategy checkpoint restarts
New test files from training-epic (conftest, test_strategy, test_checkpoint, test_mixed_precision, test_training_update_orchestrator, test_losses spec tests) referenced old names EnergyLoss/ForceLoss/ignore_nan. Updated to EnergyMSELoss/ForceMSELoss/ignore_nonfinite.
…support Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com> # Conflicts: # README.md # docs/userguide/about/install.md # pyproject.toml # uv.lock
Add MAE energy and L2-norm force loss terms
Restructure the CLI from a single Click command to a Click group with two subcommands:
- roundtrip: existing generate+write+read benchmark (no behavior change)
- read: benchmark read performance against a pre-existing Zarr store
The read subcommand accepts a store path and the same read-tuning options (--read-mode, --read-order, --read-batch-size, etc.) and reports samples/s throughput via a Rich table. Add _run_read_benchmark and _print_read_results helpers, plus three tests for the new functionality.
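The command-group layout can be sketched with the standard library. Note the substitution: the commit uses Click, while this sketch uses argparse subparsers so that it runs without third-party packages. The program name `bench`, the default batch size, and the `throughput` helper are hypothetical; only the subcommand names and the --read-batch-size option come from the commit message.

```python
import argparse


def build_parser():
    """Top-level parser with two subcommands, mirroring a command group."""
    parser = argparse.ArgumentParser(prog="bench")  # hypothetical program name
    sub = parser.add_subparsers(dest="command", required=True)

    # roundtrip: generate, write, then read back.
    sub.add_parser("roundtrip", help="generate+write+read benchmark")

    # read: benchmark reads against a store that already exists.
    read = sub.add_parser("read", help="benchmark reads of an existing store")
    read.add_argument("store", help="path to the pre-existing store")
    read.add_argument("--read-batch-size", type=int, default=32)  # assumed default
    return parser


def throughput(n_samples, seconds):
    """Samples per second, the figure the read benchmark reports."""
    return n_samples / seconds


args = build_parser().parse_args(["read", "data.zarr", "--read-batch-size", "64"])
```

Splitting the benchmark this way lets the read path be timed in isolation, without paying for data generation and writing on every run.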
Comment on lines +260 to +269

    Use :class:`~nvalchemi.dynamics.hooks.StageTimingHook` for lightweight stage
    timing and optional NVTX ranges.

    .. code-block:: python

        from nvalchemi.dynamics.hooks import ProfilerHook
        from nvalchemi.dynamics.hooks import StageTimingHook

        hook = ProfilerHook(enable_nvtx=True, enable_timer=True, frequency=10)
        hook = StageTimingHook("step", frequency=10, log_path="stage_timing.csv")
        dynamics = DemoDynamics(model=model, n_steps=1_000, dt=0.5, hooks=[hook])
        dynamics.run(batch)

Collaborator
This example doesn't really tell me what the heck this hook is doing, what "step" refers to, what "frequency" means. We don't need full api doc here but I would expect just another sentence with sufficient exposition explaining what is going on here.
…eloper focus
Restructure the fine-tuning guide to follow a progressive disclosure model:
- Rewrite the intro to frame the CLI vs. API split clearly for both audiences
- Add a Fine-tuning API parent section with a simple-first ordering: simple full-model → modifications overview → inspect names → freeze → freeze mode → module patches → multi-model → checkpoints → hooks
- Demote all API subsections to ### so the heading hierarchy reflects depth
- Rewrite each section opening to flow naturally from the previous one, replacing terse reference-style prose with motivated narrative
- Add developer extension points throughout: programmatic pattern generation, progressive unfreezing via from_pretrained_checkpoint, freeze_mode as a gradient-hook seam, create_model_spec with custom nn.Module subclasses, per-model optimizer_configs for differential learning rates
- Add a multi-model fine-tuning section covering the dict-key naming requirement and the partial-coverage validation gap
- Collapse operational notes into targeted inline callouts; defer hook mechanics entirely to the hooks guide
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…er focus
Restructure the loss computation guide to match the style of the fine-tuning documentation: motivating context before API, progressive disclosure from simple to complex, and explicit developer extension points.
Key changes:
- Rewrite intro to motivate the two-layer design (leaf + composition) as the natural answer to multi-task MLIP training objectives
- Add how-to-choose framing before the built-in losses table (MSE vs Huber vs MAE/L2-norm trade-offs)
- Reframe LossWeightSchedule as a named extension seam with a minimum viable implementation (per_epoch + __call__), then add to_spec() as the serialization requirement
- Add bridge sentence from schedule section to writing-your-own-loss
- Restructure "Writing your own loss" section: lead with a decision tree (compute_residual → normalize → mask → reduce), then show the minimum viable override before adding each additional hook
- Flag plum dispatch section as advanced, explicitly skippable
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
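As a rough picture of what a "minimum viable" weight schedule with per_epoch, __call__, and to_spec() might look like, here is a hedged sketch. Everything about the shape is an assumption: the class name `LinearRampSchedule`, the single-integer call signature, the meaning of per_epoch, and the spec dictionary layout are all hypothetical and are not taken from the toolkit's LossWeightSchedule protocol.

```python
from dataclasses import asdict, dataclass


@dataclass
class LinearRampSchedule:
    """Hypothetical schedule: ramps a loss weight linearly from start to end."""

    start: float
    end: float
    n_steps: int
    # Assumed meaning: True if the counter passed to __call__ is an epoch index,
    # False if it is an optimizer-step index.
    per_epoch: bool = False

    def __call__(self, count: int) -> float:
        # Clamp the counter to [0, n_steps] so the weight saturates at `end`.
        frac = min(max(count, 0), self.n_steps) / self.n_steps
        return self.start + (self.end - self.start) * frac

    def to_spec(self) -> dict:
        # Plain-dict description, so a configuration can be stored without pickle.
        return {"type": type(self).__name__, **asdict(self)}


schedule = LinearRampSchedule(start=0.0, end=1.0, n_steps=10)
halfway = schedule(5)
saturated = schedule(20)
spec = schedule.to_spec()
```

Keeping the schedule a small callable with a dictionary form is one way to make it both easy to swap and easy to serialize alongside the rest of a training recipe.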
…sion-seam focus
Restructure the training guide to match the style of the fine-tuning documentation: lead with conclusions, explicit extension seams, and section bridges that guide the reader through the lifecycle.
Key changes:
- Rewrite intro to be concrete: TrainingStrategy as a workflow engine with named lifecycle stages, not an abstract "flexibility" statement
- Add explicit framing of training_fn and loss_target_assembler as the two primary forward-pass extension seams before their detailed explanation
- Name TrainingUpdateHook as the named extension seam for gradient and optimizer customization at the start of the Optimizer Orchestration section, removing the redundant second introduction
- Add section bridges: setup → counters, optimizer orchestration → validation
- Add tip block for ValidationConfig per-batch callback as the developer extension point for custom evaluation metrics
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Further restructure the losses guide to consistently address developers and ML engineers building on top of the API:
- Restructure "Writing your own loss" from numbered examples into one subsection per hook (compute_residual, normalize, mask, reduce, validate), each opening with what the base provides and when you'd override it, woven as prose rather than mechanical bold labels
- Consolidate "Composition weights and schedules" as nested subsections (### Weights, ### Weight schedules) under ## Composition, eliminating the duplicate operator-sugar intro and merging weight normalization and operator constraints into one coherent Weights subsection
- Motivate "The call signature" and "The return type" with why the design choices matter (keyed-mapping routing, per-component fields for debugging schedule behavior) before presenting the API
- Motivate "Per-sample loss diagnostics" with the use case (hard-sample identification, curriculum strategies) before the table
- Add an opening to "Routing errors" explaining why eager validation matters for debugging training_fn / loss mismatches
- Add "Bring your own schedule" as flowing prose (not bold-label format) with the minimum protocol implementation shown inline
- Add extension-pointer closing sentences to "Ignoring missing labels" and "MAE and force-L2 reductions" pointing to the relevant hook override sections
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com> # Conflicts: # docs/userguide/models.md # uv.lock
Collaborator
Author
/ok to test e3a04cb
Collaborator
Author
/ok to test e3a04cb
@laserkelvin, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/
Collaborator
Author
/ok to test 4ee2131
Collaborator
Author
/ok to test 18399f3
dallasfoster
approved these changes
Jun 25, 2026
physicsnemo 2.1.0 conflicts with fairchem Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>
Collaborator
Author
/ok to test 85a93f5
Collaborator
Author
/ok to test 2b25deb
ALCHEMI Toolkit Pull Request
Description
This PR introduces the core functionalities required to support training and fine-tuning of models in nvalchemi-toolkit.
Type of Change
Related Issues
Changes Made
- create_model_spec methods and dynamic pydantic model creation for pickle-less serialization of configuration
- TrainingStrategy pydantic model as a recipe validation and loop executor. The execution is highly modular and extendible, allowing for (hopefully) arbitrarily complex training workflows to be built, and not limited to MLIPs
- FineTuningStrategy that specializes TrainingStrategy for...fine-tuning workflows by making pre-existing checkpoints and layer addition/modification integral to the workflow
Testing
- (make pytest)
- (make lint)
Checklist
Additional Notes
Tip
This repository uses Greptile, an AI code review service, to help conduct
pull request reviews. We encourage contributors to read and consider suggestions
made by Greptile, but note that human maintainers will provide the necessary
reviews for merging: Greptile's comments are not a qualitative judgement
of your code, nor is it an indication that the PR will be accepted/rejected.
We encourage the use of emoji reactions to Greptile comments, depending on
their usefulness and accuracy.