Model Training Example with MACE by ys-teh · Pull Request #109 · NVIDIA/nvalchemi-toolkit

ys-teh · 2026-06-09T17:01:07Z

ALCHEMI Toolkit Pull Request

Description

This PR adds an advanced training example for a charged MACE model and the supporting code modifications needed to train it with available ALCHEMI tools.

Note: This can only be merged after #108 is merged.

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Performance improvement
Documentation update
Refactoring (no functional changes)
CI/CD or infrastructure change

Related Issues

Changes Made

Adds a MACE training example, examples/advanced/10_mace_training.py (with its config, examples/advanced/10_vanilla_mace.yaml), that uses the nvalchemi model training pipeline.
Adds examples/advanced/_mace_training_helpers.py with additional training utilities, including stress unit conversion, training loss logging, validation, parameter counting, and a gradient clipping hook.
Adds examples/advanced/_mace_models.py with builders for the vanilla MACE model, including cuEquivariance config support.
Adds a MACE training user guide, docs/userguide/mace_training_example.md.
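
The stress unit conversion mentioned above can be illustrated with a minimal sketch. The PR text does not state which units the helper converts between, so the eV/Å^3 to GPa direction, the function names, and the nested-list tensor layout below are all assumptions made for illustration, not the contents of _mace_training_helpers.py.

```python
# Hypothetical sketch: convert stress between eV/Å^3 and GPa.
# The units and names here are assumptions; the PR does not specify them.

# Exact by definition of the SI electron-volt: 1 eV = 1.602176634e-19 J.
EV_IN_JOULES = 1.602176634e-19
ANGSTROM_IN_METERS = 1e-10

# 1 eV/Å^3 expressed in GPa (1 GPa = 1e9 Pa = 1e9 J/m^3), about 160.2177.
EV_PER_A3_TO_GPA = EV_IN_JOULES / ANGSTROM_IN_METERS**3 / 1e9


def stress_ev_per_a3_to_gpa(stress):
    """Convert a 3x3 stress tensor (nested lists) from eV/Å^3 to GPa."""
    return [[c * EV_PER_A3_TO_GPA for c in row] for row in stress]


def stress_gpa_to_ev_per_a3(stress):
    """Convert a 3x3 stress tensor (nested lists) from GPa to eV/Å^3."""
    return [[c / EV_PER_A3_TO_GPA for c in row] for row in stress]


if __name__ == "__main__":
    # A hydrostatic stress of 0.01 eV/Å^3 on the diagonal.
    sigma = [[0.01, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 0.01]]
    print(stress_ev_per_a3_to_gpa(sigma)[0][0])
```

Keeping the factor as a named constant derived from SI definitions makes the conversion auditable, which matters when reference labels and model outputs use different unit conventions.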

Testing

Unit tests pass locally (make pytest)
Linting passes (make lint)
New tests added for new functionality meet coverage expectations

Run training

Checklist

I have read and understand the Contributing Guidelines
I have updated the CHANGELOG.md
I have performed a self-review of my code
I have added docstrings to new functions/classes
I have updated the documentation (if applicable)

Additional Notes

Below are NVT stability results for the trained MACE model on a 324-atom MgVF4 3x3x3 MatPES-r2SCAN test structure. The 10 ps, 300 K Langevin run completed all 20,000 steps without numerical instability or force/temperature warnings.

Tip

This repository uses Greptile, an AI code review service, to help conduct
pull request reviews. We encourage contributors to read and consider suggestions
made by Greptile, but note that human maintainers will provide the necessary
reviews for merging: Greptile's comments are not a qualitative judgement
of your code, nor is it an indication that the PR will be accepted/rejected.
We encourage the use of emoji reactions to Greptile comments, depending on
their usefulness and accuracy.

copy-pr-bot · 2026-06-09T17:01:13Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Ying Shi Teh <yteh@nvidia.com>

greptile-apps · 2026-06-26T01:54:32Z

Greptile Summary

This PR adds a complete end-to-end MACE model training example (10_mace_training.py) with its Hydra config, supporting helpers for loss construction, metrics logging, and gradient clipping, and a user-guide doc page.

_mace_models.py introduces build_vanilla_mace_model / build_training_mace_model with cuEquivariance support and a checkpoint reconstruction spec attached to the model.
_mace_training_helpers.py provides the two-stage cosine LR schedule, step-function Huber loss weighting, stress unit conversion, a JsonLinesLogger, and a GradientClipHook.
10_mace_training.py wires all components into a TrainingStrategy with distributed (DDP), EMA, validation, and checkpointing hooks, including graceful dataset cleanup in a finally block.
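
As a rough illustration of the "step-function Huber loss weighting" named in the summary, the sketch below combines a Huber loss with weights that switch at a given training step. The function names, the weight values, the delta, and the choice to switch at step >= start_step are all assumptions for illustration; the actual build_mace_step_huber_loss may differ.

```python
# Hypothetical sketch of Huber loss with piecewise-constant (step-function) weights.
# Names and numeric values are illustrative assumptions, not the PR's implementation.


def huber(residual, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    abs_r = abs(residual)
    if abs_r <= delta:
        return 0.5 * residual * residual
    return delta * (abs_r - 0.5 * delta)


def step_weight(step, start_step, stage_one, stage_two):
    """Return stage_one before start_step and stage_two from start_step onward."""
    return stage_two if step >= start_step else stage_one


def total_loss(step, energy_residual, force_residuals, start_step):
    """Weighted sum of energy and mean force Huber terms.

    Assumed schedule: forces dominate in stage one, energy in stage two.
    """
    w_energy = step_weight(step, start_step, 1.0, 10.0)
    w_force = step_weight(step, start_step, 10.0, 1.0)
    force_term = sum(huber(r) for r in force_residuals) / len(force_residuals)
    return w_energy * huber(energy_residual) + w_force * force_term


if __name__ == "__main__":
    for step in (0, 100):
        print(step, total_loss(step, 0.5, [0.5, 3.0], start_step=100))
```

The same residuals yield different totals before and after start_step, which is the intended effect of a two-stage weighting.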

Important Files Changed

| Filename | Overview |
| --- | --- |
| examples/advanced/10_mace_training.py | Main training entrypoint; well-structured with step-based validation, distributed support, EMA, gradient clipping, and graceful cleanup in a `finally` block. |
| examples/advanced/_mace_models.py | MACE model builders; contains a module-level `torch.serialization.add_safe_globals([slice])` call that mutates process-wide state on import. |
| examples/advanced/_mace_training_helpers.py | Training utilities including loss building, metrics logging, LR scheduling, and gradient clipping; `build_mace_step_huber_loss` can crash with a confusing `TypeError` when `stage_two.start_step` is absent, and `ToDType` accesses private `Batch` internals. |
| examples/advanced/10_vanilla_mace.yaml | Default Hydra config with well-documented hyperparameters; dataset-derived metadata (E0s, avg_num_neighbors, scale/shift) is precomputed and embedded. |
| docs/userguide/mace_training_example.md | Clear walkthrough of the full training lifecycle; code snippets align with the runnable example and config. |
| CHANGELOG.md | Single changelog entry added for the MACE training example under the correct `Added` section. |

Comments Outside Diff (3)

examples/advanced/_mace_models.py, line 913

Module-level global side effect on torch.serialization

torch.serialization.add_safe_globals([slice]) runs every time this module is imported, mutating process-wide serialization state before any user code runs. In a multi-process or multi-import context this is invisible and can be surprising. The call is presumably needed to safely load checkpoints that contain slice objects in their metadata; scoping it to the checkpoint save/load site (or wrapping it in an explicit initialization function) keeps the side effect intentional and visible.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

examples/advanced/_mace_training_helpers.py, lines 1358-1361

Confusing TypeError when stage_two.start_step is absent

_cfg_get(stage_two, "start_step") returns None when stage_two is an empty dict (the default), and int(None) then raises TypeError: int() argument must be a string … not 'NoneType' — which gives no hint about the missing config key. Adding a guard with an explicit ValueError (e.g. "loss.stage_two.start_step is required") would make misconfiguration failures much easier to diagnose.
examples/advanced/_mace_training_helpers.py, lines 1194-1199

ToDType reaches into private Batch internals

batch._storage.groups and group._data are private implementation attributes. Any refactor or extension of Batch's internal storage layout will silently break ToDType at runtime, while the public API (attribute access on Batch) continues to work. Consider using the public Batch interface or a dedicated cast utility if one is available on nvalchemi.data.Batch.

Reviews (1): Last reviewed commit: "add cuequivariance in doc"

ys-teh force-pushed the feature/mace-training-ex branch 2 times, most recently from 83363c2 to 4addb20, June 25, 2026 16:27

ys-teh marked this pull request as ready for review June 25, 2026 23:37

ys-teh added 8 commits June 26, 2026 01:34

adds mace training example

890c654

Signed-off-by: Ying Shi Teh <yteh@nvidia.com>

rename for consistency

2130371

Signed-off-by: Ying Shi Teh <yteh@nvidia.com>

add descriptions to config

270f4fd

Signed-off-by: Ying Shi Teh <yteh@nvidia.com>

removed charge model

648c1bc

Signed-off-by: Ying Shi Teh <yteh@nvidia.com>

removed unused import

4605ac6

Signed-off-by: Ying Shi Teh <yteh@nvidia.com>

fixed bug

cbe9b2c

Signed-off-by: Ying Shi Teh <yteh@nvidia.com>

update changelog

95029b1

Signed-off-by: Ying Shi Teh <yteh@nvidia.com>

add cuequivariance in doc

bb9efcf

Signed-off-by: Ying Shi Teh <yteh@nvidia.com>

ys-teh force-pushed the feature/mace-training-ex branch from 837a527 to bb9efcf, June 26, 2026 01:51

ys-teh requested a review from laserkelvin June 26, 2026 01:53

ys-teh wants to merge 8 commits into NVIDIA:main from ys-teh:feature/mace-training-ex
