Add proteinbox workflow #15

Open
MSiggel wants to merge 14 commits into main from feature/protein-box

MSiggel commented May 31, 2026

Collaborator

Summary

  • add the proteinbox simulation type, models, workflow dispatch, and lysozyme example
  • add protein preparation through GROMACS pdb2gmx, including local force-field checks, exact disulfide prompts, and protonation overrides
  • add solvation, ionization, topology update, restrained OpenMM relaxation, and GROMACS run schedules for proteinbox builds
  • align GROMACS/pdb2gmx configuration with the existing CGenFF setup pattern through settings, config init, and config template updates
  • add focused tests for proteinbox models, metadata, topology parsing, disulfide/protonation behavior, force-field checks, and config handling

Tests

  • pytest mdfactory/tests/test_settings.py mdfactory/tests/test_sync_config_local_paths.py mdfactory/tests/test_proteinbox.py

Closes #13

MSiggel and others added 12 commits May 31, 2026 11:18
…Config

New models for the proteinbox simulation type:
- ProteinSpecies: PDB-path-based species with disulfide/protonation annotations
- ProteinBoxComposition: protein + box_padding + ionization config
- Pdb2gmxConfig: force field and water model config for pdb2gmx
- GromacsProteinParameterSet: output paths from pdb2gmx
- BuildInput extended with proteinbox type and pdb2gmx parametrization

Relates to #13

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
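
A rough sketch of how the models listed in the commit above might fit together. It uses plain dataclasses rather than whatever base classes mdfactory actually uses. The field name `disulfide_bonds` and all default values are assumptions taken from the lysozyme example values quoted in the review below, not from the PR's code.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ProteinSpecies:
    """A protein species identified by a PDB file path."""

    pdb_path: Path
    # Residue-number pairs to bond, e.g. (64, 80). Field name is hypothetical.
    disulfide_bonds: list[tuple[int, int]] = field(default_factory=list)
    # Residue key -> target residue name, e.g. {"HIS15": "HIE"}.
    protonation_states: dict[str, str] = field(default_factory=dict)
    # Defaults set at field level, as described for ProteinSpecies in this PR.
    count: int = 1
    fraction: float = 1.0


@dataclass
class Pdb2gmxConfig:
    """Force field and water model passed to gmx pdb2gmx."""

    forcefield: str = "charmm36m"
    water_model: str = "tip3p"
    # Corresponds to `gmx pdb2gmx -ignh`: ignore hydrogens in the input file.
    ignh: bool = True


@dataclass
class ProteinBoxComposition:
    """One protein plus the padding used to size the solvent box."""

    protein: ProteinSpecies
    box_padding: float = 12.0  # assumed default, copied from the example

    @property
    def total_count(self) -> int:
        return self.protein.count
```

Keeping `count` and `fraction` as plain field defaults means a single protein can be declared with only a path and optional annotations.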
Functions for the proteinbox build pipeline:
- check_gmx_available: verify gmx binary is on PATH
- clean_pdb: PDBFixer-based PDB standardization
- run_pdb2gmx: subprocess wrapper for gmx pdb2gmx
- extract_charge_from_topology: parse net charge from .top/.itp
- update_topology_molecules: append water/ion entries to [ molecules ]
- validate_with_grompp: dry-run topology validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
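
To illustrate the topology update step, here is a minimal sketch of appending water and ion rows to `[ molecules ]`. It reuses the name `update_topology_molecules` from the list above, but the signature (text in, text out, a name-to-count mapping) is an assumption, and it assumes `[ molecules ]` is the last section of the file.

```python
def update_topology_molecules(top_text: str, additions: dict[str, int]) -> str:
    """Append `name count` rows to the end of the [ molecules ] section.

    Assumes [ molecules ] is the final section of the topology; raises
    ValueError if the section is missing or is followed by another section.
    """
    lines = top_text.rstrip("\n").split("\n")

    header = None
    for i, line in enumerate(lines):
        # Drop comments and inner spaces so "[ molecules ]" and "[molecules]" match.
        bare = line.split(";", 1)[0].strip().lower().replace(" ", "")
        if bare == "[molecules]":
            header = i
    if header is None:
        raise ValueError("topology has no [ molecules ] section")

    for line in lines[header + 1:]:
        if line.strip().startswith("["):
            raise ValueError("[ molecules ] is not the last section")

    # Skip species with a zero count (for example, no counter-ion of one sign).
    new_rows = [
        f"{name:<15} {count:>8d}" for name, count in additions.items() if count > 0
    ]
    return "\n".join(lines + new_rows) + "\n"
```

The row order follows the mapping's insertion order, which matters because GROMACS expects `[ molecules ]` to list species in the same order as the coordinate file.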
Brief NPT equilibration with protein heavy atoms position-restrained
via CustomExternalForce. Allows water/ions to relax around the fixed
protein structure before GROMACS production runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
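
The restraint described above can be sketched with OpenMM's `CustomExternalForce` as follows. The function names, the plain-tuple position format (nm), and the force constant of 1000 kJ/mol/nm^2 are assumptions for illustration; the PR's `relax_with_protein_restraints` may differ. OpenMM is imported lazily, so the atom-selection helper works without it.

```python
def heavy_atom_indices(elements: list[str]) -> list[int]:
    """Indices of atoms whose element symbol is not hydrogen."""
    return [i for i, el in enumerate(elements) if el.upper() != "H"]


def add_position_restraints(system, positions_nm, indices, k=1000.0):
    """Add a harmonic position restraint for each atom index.

    system       -- an openmm.System
    positions_nm -- sequence of (x, y, z) reference positions in nm
    indices      -- atom indices to restrain
    k            -- force constant in kJ/mol/nm^2 (assumed value)
    """
    from openmm import CustomExternalForce  # requires OpenMM to be installed

    # periodicdistance keeps the restraint well defined if atoms wrap
    # across periodic boundaries during NPT.
    force = CustomExternalForce("0.5*k*periodicdistance(x, y, z, x0, y0, z0)^2")
    force.addGlobalParameter("k", k)
    for name in ("x0", "y0", "z0"):
        force.addPerParticleParameter(name)
    for i in indices:
        x0, y0, z0 = positions_nm[i]
        force.addParticle(i, [x0, y0, z0])
    system.addForce(force)
    return force
```

Restraining only heavy atoms leaves hydrogens free to adjust while the backbone and side-chain geometry stay close to the input structure.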
Build pipeline: clean PDB -> pdb2gmx -> center in cubic box -> solvate
(reuses existing solvate()) -> ionize (reuses ionize_solvated_system())
-> update topology -> OpenMM relax with protein restraints -> validate.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
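
As an illustration of the "center in cubic box" step, the sketch below sizes a cubic box from the protein's largest extent plus padding on both sides, then shifts the coordinates to the box center. This sizing rule and the function name are assumptions; the PR does not state how `box_padding` is applied. Units are whatever the coordinates and padding share.

```python
def center_in_cubic_box(coords, padding):
    """Return (shifted_coords, edge) for a cubic box around the coordinates.

    coords  -- non-empty list of (x, y, z) tuples
    padding -- distance added on each side of the largest extent
    """
    if not coords:
        raise ValueError("no coordinates given")

    mins = [min(c[d] for c in coords) for d in range(3)]
    maxs = [max(c[d] for c in coords) for d in range(3)]

    # Cubic box: the largest extent along any axis sets the edge length.
    edge = max(hi - lo for lo, hi in zip(mins, maxs)) + 2.0 * padding

    # Move the bounding-box midpoint to the box center (edge / 2 on each axis).
    shift = [edge / 2.0 - (lo + hi) / 2.0 for lo, hi in zip(mins, maxs)]
    shifted = [tuple(c[d] + shift[d] for d in range(3)) for c in coords]
    return shifted, edge
```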
MDP files for protein-in-waterbox equilibration and production:
- em.mdp: steepest descent minimization
- nvt.mdp: NVT with -DPOSRES and Protein/Non-Protein tc-grps
- npt.mdp: NPT with -DPOSRES and Berendsen barostat
- md.mdp: production with Parrinello-Rahman, no position restraints

All use CHARMM36m-specific nonbonded settings (1.2 nm cutoffs,
force-switch vdW modifier, no dispersion correction).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- 13 unit tests covering ProteinSpecies, ProteinBoxComposition,
  Pdb2gmxConfig, BuildInput integration, and topology parsing
- Fix ProteinSpecies to default count=1/fraction=1.0 at field level
  (avoids parent Species validator ordering issue)
- Example YAML for lysozyme with CHARMM36m force field

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Addresses 4 P2 review comments:
- Add missing `from pathlib import Path` to build.py module scope
- Resolve all paths to absolute in run_pdb2gmx before subprocess call
  (avoids cwd confusion with output_dir)
- Add species/total_count/charge properties to ProteinBoxComposition
  and proteinbox case in BuildInput.metadata for analysis compatibility
- Implement _apply_protonation_states: renames residues in PDB before
  pdb2gmx so protonation overrides (HIS->HIE, GLU->GLH) are applied

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
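
The residue renaming described for `_apply_protonation_states` could look roughly like the sketch below. It relies on the fixed-column PDB layout (residue name in columns 18-20, residue number in columns 23-26). The helper name `rename_residue` is hypothetical, and the sketch ignores chain IDs, so it assumes residue numbers are unique across the file.

```python
def rename_residue(pdb_lines, resseq, new_name):
    """Return PDB lines with residue `resseq` renamed to `new_name`.

    Only ATOM/HETATM records are touched. Chain IDs are ignored, so this
    assumes residue numbers are unique within the file.
    """
    if len(new_name) != 3:
        raise ValueError("residue names must be exactly 3 characters (PDB columns 18-20)")

    out = []
    for line in pdb_lines:
        is_atom = line.startswith(("ATOM", "HETATM"))
        # Columns 23-26 (0-based slice 22:26) hold the residue sequence number.
        if is_atom and line[22:26].strip() == str(resseq):
            # Columns 18-20 (0-based slice 17:20) hold the residue name.
            line = line[:17] + new_name + line[20:]
        out.append(line)
    return out
```

Comparing the stripped column text against the full number avoids renaming residue 115 when residue 15 is requested.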
- Add check_forcefield_available() that verifies the force field directory
  exists in GROMACS share, GMXLIB, or cwd before invoking pdb2gmx. Lists
  available force fields on failure.
- Clean up mdout.mdp artifact in validate_with_grompp alongside check.tpr.
- Fix bare open() file handles in relax_with_protein_restraints (use with-blocks).
- Add CHARMM HIS alias translation (HIE→HSE, HID→HSD, HIP→HSP for charmm FFs).
- Use re.fullmatch for protonation state key parsing (reject trailing chars).
- Validate 3-character residue names for PDB column safety.
- Add disulfide bond prompt generation (_build_disulfide_prompt_input) for
  deterministic pdb2gmx -ss interaction.
- Resolve paths before working_directory context switch to avoid cwd breakage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
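
A small sketch of the key parsing and CHARMM alias translation mentioned above. The key format (three uppercase letters followed by a residue number, as in `HIS15`) is inferred from the example YAML, and the function names are hypothetical. The alias table is the one listed in the commit message.

```python
import re

# Assumed key format: three uppercase letters, then the residue number.
_KEY = re.compile(r"([A-Z]{3})(\d+)")

# AMBER-style histidine names mapped to their CHARMM equivalents.
_CHARMM_HIS = {"HIE": "HSE", "HID": "HSD", "HIP": "HSP"}


def parse_protonation_key(key: str) -> tuple[str, int]:
    """Split a key such as "HIS15" into ("HIS", 15).

    fullmatch rejects trailing characters ("HIS15x"), which re.match
    would silently accept because it only anchors at the start.
    """
    m = _KEY.fullmatch(key)
    if m is None:
        raise ValueError(f"invalid protonation key: {key!r}")
    return m.group(1), int(m.group(2))


def translate_residue_name(name: str, forcefield: str) -> str:
    """Map HIE/HID/HIP to HSE/HSD/HSP when the force field is a CHARMM one."""
    if len(name) != 3:
        raise ValueError("residue names must be exactly 3 characters")
    if forcefield.lower().startswith("charmm"):
        return _CHARMM_HIS.get(name, name)
    return name
```

With this, a user can write `HIS15: HIE` regardless of force field and still get a residue name that a CHARMM residue database recognizes.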
- Add FORCEFIELD_REGISTRY mapping friendly names (charmm36m, charmm36m-ljpme)
  to MacKerell lab download URLs and extracted directory names.
- Auto-download missing force fields on first use from the registry.
  Stored in the platformdirs user_data_dir (on macOS: ~/Library/Application Support/mdfactory/forcefields/).
- resolve_forcefield() translates friendly names to actual directory stems
  so users write "charmm36m" in YAML and pdb2gmx receives the correct
  directory name.
- Inject GMXLIB in subprocess env so gmx finds downloaded force fields.
- Fix extract_charge_from_topology: skip .ff/ library includes (ions.itp,
  tip3p.itp) that contain atom type templates, not system charges. Was
  causing +159 charge for lysozyme with charmm36m instead of +8.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
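
The charge fix above comes down to which included files get summed. A minimal sketch, with hypothetical helper names: collect `#include` paths, drop anything inside a `*.ff/` force-field directory, and sum the charge column (the seventh field) of each `[ atoms ]` section in the remaining files. This is not the PR's `extract_charge_from_topology`, and multiplying each molecule's charge by its count from `[ molecules ]` is left out.

```python
import re

_INCLUDE = re.compile(r'#include\s+"([^"]+)"')


def local_includes(top_text: str) -> list[str]:
    """Include paths that do not point into a force-field (*.ff/) directory."""
    paths = _INCLUDE.findall(top_text)
    return [p for p in paths if ".ff/" not in p.replace("\\", "/")]


def sum_atom_charges(itp_text: str) -> float:
    """Sum the charge column over all [ atoms ] rows in a topology text."""
    total = 0.0
    section = None
    for raw in itp_text.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop comments
        if not line or line.startswith("#"):
            continue  # blank lines and preprocessor directives
        if line.startswith("["):
            section = line.strip("[] \t").lower()
            continue
        if section == "atoms":
            fields = line.split()
            # Columns: nr type resnr residue atom cgnr charge [mass]
            if len(fields) >= 7:
                total += float(fields[6])
    return total
```

Library files such as `ions.itp` define every ion the force field knows about, so summing their `[ atoms ]` rows adds charges for molecules that are not in the system.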
- Add [gromacs] section to settings.py with GMX_PATH and FORCEFIELD_DIR,
  following the same pattern as [cgenff] SILCSBIODIR.
- Settings.__init__ auto-prepends FORCEFIELD_DIR to GMXLIB env var on
  startup, matching how SILCSBIODIR is auto-set for CGenFF.
- check_gmx_available() checks configured GMX_PATH first, falls back
  to PATH lookup.
- All gmx subprocess calls (pdb2gmx, grompp) use the configured binary
  and inject GMXLIB for force field resolution.
- Add GROMACS setup to config wizard (sync_config.py): prompts for gmx
  path, forcefield dir, and offers to download CHARMM36m on setup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
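
The lookup order and environment injection described above might be sketched like this. The function names are hypothetical, and treating `GMXLIB` as an `os.pathsep`-separated list is an assumption based on the "auto-prepends" wording in the commit message.

```python
import os
import shutil
from pathlib import Path


def find_gmx(configured=None):
    """Return a path to the gmx binary, or None if none is found.

    The configured path wins when it points at an executable file;
    otherwise fall back to a PATH lookup.
    """
    if configured:
        candidate = Path(configured).expanduser()
        if candidate.is_file() and os.access(candidate, os.X_OK):
            return str(candidate)
    return shutil.which("gmx")


def env_with_gmxlib(forcefield_dir, base_env=None):
    """Copy of the environment with forcefield_dir first in GMXLIB."""
    env = dict(os.environ if base_env is None else base_env)
    ff = str(forcefield_dir)
    existing = env.get("GMXLIB", "")
    # Keep existing entries, but avoid listing the same directory twice.
    rest = [p for p in existing.split(os.pathsep) if p and p != ff]
    env["GMXLIB"] = os.pathsep.join([ff] + rest)
    return env
```

A call site would then pass the result to the subprocess, for example `subprocess.run([gmx, "pdb2gmx", ...], env=env_with_gmxlib(ff_dir))`, so the parent process environment stays unchanged.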
- [76, 94]
protonation_states:
HIS15: HIE
box_padding: 12.0

Collaborator

Consistent naming, as in other system types?

- [64, 80]
- [76, 94]
protonation_states:
HIS15: HIE

Collaborator

Should be a list of key-value pairs, right?

type: pdb2gmx
forcefield: charmm36m
water_model: tip3p
ignh: true

Collaborator

Bad key name; I have no idea what that means.

Development

Successfully merging this pull request may close these issues.

Add proteinbox simulation type with pdb2gmx backend

2 participants