Model Training Example with MACE #109

Open
ys-teh wants to merge 8 commits into NVIDIA:main from ys-teh:feature/mace-training-ex

@ys-teh ys-teh commented Jun 9, 2026

Collaborator

ALCHEMI Toolkit Pull Request

Description

This PR adds an advanced training example for a charged MACE model and the supporting code modifications needed to train it with available ALCHEMI tools.

Note: This can only be merged after #108 is merged.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Performance improvement
  • Documentation update
  • Refactoring (no functional changes)
  • CI/CD or infrastructure change

Related Issues

Changes Made

  • Adds a MACE training example, examples/advanced/10_mace_training.py (with its config, examples/advanced/10_vanilla_mace.yaml), that uses the nvalchemi model training pipeline.
  • Adds examples/advanced/_mace_training_helpers.py with additional training utilities: stress unit conversion, training loss logging, validation, parameter counting, and a gradient clipping hook.
  • Adds examples/advanced/_mace_models.py with builders for the vanilla MACE model, including cuEquivariance config support.
  • Adds a MACE training user guide, docs/userguide/mace_training_example.md.
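
For readers unfamiliar with the stress unit conversion mentioned above, here is a minimal sketch of what such a helper can look like. The function names and the choice of source units (GPa and kbar) are assumptions made for illustration; the actual helper in _mace_training_helpers.py may use different names, units, or direction. The only fixed fact is the constant: 1 eV/Å^3 equals 160.2176634 GPa.

```python
# Illustrative sketch only: names and source units are assumptions,
# not the helpers added in this PR.

# 1 eV/Å^3 expressed in GPa (follows from 1 eV = 1.602176634e-19 J
# and 1 Å^3 = 1e-30 m^3).
EV_PER_A3_IN_GPA = 160.2176634


def gpa_to_ev_per_a3(stress_gpa: float) -> float:
    """Convert a stress value from GPa to eV/Å^3."""
    return stress_gpa / EV_PER_A3_IN_GPA


def kbar_to_ev_per_a3(stress_kbar: float) -> float:
    """Convert a stress value from kbar to eV/Å^3 (1 kbar = 0.1 GPa)."""
    return gpa_to_ev_per_a3(stress_kbar * 0.1)
```

Keeping the constant in one place makes it easy to check that dataset labels and model outputs use the same stress units before the loss is computed.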

Testing

  • Unit tests pass locally (make pytest)
  • Linting passes (make lint)
  • New tests added for new functionality meet coverage expectations

Run training

Checklist

  • I have read and understand the Contributing Guidelines
  • I have updated the CHANGELOG.md
  • I have performed a self-review of my code
  • I have added docstrings to new functions/classes
  • I have updated the documentation (if applicable)

Additional Notes

Below are NVT stability results for the trained MACE model on a 324-atom MgVF4 3x3x3 MatPES-r2SCAN test structure. The 10 ps, 300 K Langevin run completed all 20,000 steps without numerical instability or force/temperature warnings.

[Image: NVT stability results for the trained MACE model]

Tip

This repository uses Greptile, an AI code review service, to help conduct
pull request reviews. We encourage contributors to read and consider suggestions
made by Greptile, but note that human maintainers will provide the necessary
reviews for merging: Greptile's comments are not a qualitative judgement
of your code, nor is it an indication that the PR will be accepted/rejected.
We encourage the use of emoji reactions to Greptile comments, depending on
their usefulness and accuracy.

copy-pr-bot Bot commented Jun 9, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

ys-teh force-pushed the feature/mace-training-ex branch 2 times, most recently from 83363c2 to 4addb20 (June 25, 2026 16:27)
ys-teh marked this pull request as ready for review (June 25, 2026 23:37)
ys-teh added 8 commits (June 26, 2026 01:34), each signed off:
Signed-off-by: Ying Shi Teh <yteh@nvidia.com>
ys-teh force-pushed the feature/mace-training-ex branch from 837a527 to bb9efcf (June 26, 2026 01:51)
ys-teh requested a review from laserkelvin (June 26, 2026 01:53)
greptile-apps Bot commented Jun 26, 2026

Contributor

Greptile Summary

This PR adds a complete end-to-end MACE model training example (10_mace_training.py) with its Hydra config, supporting helpers for loss construction, metrics logging, and gradient clipping, and a user-guide doc page.

  • _mace_models.py introduces build_vanilla_mace_model / build_training_mace_model with cuEquivariance support and a checkpoint reconstruction spec attached to the model.
  • _mace_training_helpers.py provides the two-stage cosine LR schedule, step-function Huber loss weighting, stress unit conversion, a JsonLinesLogger, and a GradientClipHook.
  • 10_mace_training.py wires all components into a TrainingStrategy with distributed (DDP), EMA, validation, and checkpointing hooks, including graceful dataset cleanup in a finally block.
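
As an illustration of the JSON Lines logging pattern named in the summary, the sketch below appends one JSON object per line so that metrics can be read back incrementally. The class name `MetricsJsonLinesLogger`, its `log` method, and `read_records` are hypothetical and are not the JsonLinesLogger from this PR, whose interface is not shown here.

```python
# Illustrative sketch only: this is not the JsonLinesLogger from the PR.
import json
from pathlib import Path


class MetricsJsonLinesLogger:
    """Append training metrics to a file, one JSON object per line."""

    def __init__(self, path):
        self.path = Path(path)
        # Create the parent directory so the first write does not fail.
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def log(self, step, **metrics):
        # Cast values to plain floats so tensor-like scalars serialize cleanly.
        record = {"step": int(step)}
        record.update({name: float(value) for name, value in metrics.items()})
        with self.path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")


def read_records(path):
    """Read every record back as a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Opening the file in append mode for each record keeps the log valid even if training is interrupted, at the cost of one file open per call.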

Important Files Changed

| Filename | Overview |
| --- | --- |
| examples/advanced/10_mace_training.py | Main training entrypoint; well-structured with step-based validation, distributed support, EMA, gradient clipping, and graceful cleanup in a finally block. |
| examples/advanced/_mace_models.py | MACE model builders; contains a module-level torch.serialization.add_safe_globals([slice]) call that mutates process-wide state on import. |
| examples/advanced/_mace_training_helpers.py | Training utilities including loss building, metrics logging, LR scheduling, and gradient clipping; build_mace_step_huber_loss can crash with a confusing TypeError when stage_two.start_step is absent, and ToDType accesses private Batch internals. |
| examples/advanced/10_vanilla_mace.yaml | Default Hydra config with well-documented hyperparameters; dataset-derived metadata (E0s, avg_num_neighbors, scale/shift) is precomputed and embedded. |
| docs/userguide/mace_training_example.md | Clear walkthrough of the full training lifecycle; code snippets align with the runnable example and config. |
| CHANGELOG.md | Single changelog entry added for the MACE training example under the correct Added section. |

Comments Outside Diff (3)

  1. examples/advanced/_mace_models.py, line 913

    P2 Module-level global side effect on torch.serialization

    torch.serialization.add_safe_globals([slice]) runs every time this module is imported, mutating process-wide serialization state before any user code runs. In a multi-process or multi-import context this is invisible and can be surprising. The call is presumably needed to safely load checkpoints that contain slice objects in their metadata; scoping it to the checkpoint save/load site (or wrapping it in an explicit initialization function) keeps the side effect intentional and visible.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  2. examples/advanced/_mace_training_helpers.py, lines 1358-1361

    P2 Confusing TypeError when stage_two.start_step is absent

    _cfg_get(stage_two, "start_step") returns None when stage_two is an empty dict (the default), and int(None) then raises TypeError: int() argument must be a string … not 'NoneType' — which gives no hint about the missing config key. Adding a guard with an explicit ValueError (e.g. "loss.stage_two.start_step is required") would make misconfiguration failures much easier to diagnose.
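
The guard described in this comment could look roughly like the sketch below. `cfg_get` is a hypothetical stand-in for the PR's `_cfg_get`, whose real implementation is not shown here, and `require_stage_two_start_step` is an illustrative name.

```python
# Illustrative sketch only: cfg_get stands in for the PR's _cfg_get helper.
from collections.abc import Mapping
from typing import Any


def cfg_get(cfg: Any, key: str, default: Any = None) -> Any:
    """Read a key from a mapping or an attribute-style config object."""
    if isinstance(cfg, Mapping):
        return cfg.get(key, default)
    return getattr(cfg, key, default)


def require_stage_two_start_step(stage_two: Any) -> int:
    """Return stage_two.start_step as an int, or fail with a clear message."""
    start_step = cfg_get(stage_two, "start_step")
    if start_step is None:
        # Name the missing key instead of letting int(None) raise a TypeError.
        raise ValueError("loss.stage_two.start_step is required")
    return int(start_step)
```

Raising a ValueError that names the config key points the user at the YAML entry to fix, rather than at the int() call inside the helper.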

  3. examples/advanced/_mace_training_helpers.py, lines 1194-1199

    P2 ToDType reaches into private Batch internals

    batch._storage.groups and group._data are private implementation attributes. Any refactor or extension of Batch's internal storage layout will silently break ToDType at runtime, while the public API (attribute access on Batch) continues to work. Consider using the public Batch interface or a dedicated cast utility if one is available on nvalchemi.data.Batch.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Reviews (1): Last reviewed commit: "add cuequivariance in doc"
