Skip to content

osmordred: fix issues #9–#11 (MolecularId/EState/RingCount) + correct HEState counting#37

Open
guillaume-osmo wants to merge 2 commits into
bp-kelley:osmordredfrom
guillaume-osmo:osmordredv3-port
Open

osmordred: fix issues #9–#11 (MolecularId/EState/RingCount) + correct HEState counting#37
guillaume-osmo wants to merge 2 commits into
bp-kelley:osmordredfrom
guillaume-osmo:osmordredv3-port

Conversation

@guillaume-osmo

Copy link
Copy Markdown

Follow-up to #33 (merged). Resolves the open osmordred issues osmoai/osmordred#9#12 and corrects two stale reference values found during verification. Runtime-confirmed by rebuilding the split and running the issue reproducers.

Code fixes

  • Fixes a leak with nonPOD types #9 MolecularIdcomputeAtomicIds skipped single-heavy-atom molecules (if (natoms > 1) guard) and returned 0; methane/ammonia/HF should be 1.0. Guard removed.
  • update swig wrappers #10 EState — the sCH3/dCH2/ssCH2/tCH SMARTS were multi-atom, so with uniquify=true two bonded same-type atoms collapsed to one match (ethane NsCH3=1 not 2; cyclopropane NssCH2=1 not 3). Switched to single-atom patterns. For dCH2/tCH the multiple-bond constraint is preserved via a recursive SMARTS ([CD1H2;$(*=*)], [CD1H;$(*#*)]) so single-bonded radicals/anions are not mis-typed.
  • Add better autodetection of labels, add dummyatom label, don't fall b… #11 RingCount — the ring_size > 12 branch updated the sub-category counters but never the nG12Ring total (stuck at 0). Added the increment. (nG12FRing was fixed earlier; nG12ARing already matches Mordred.)
  • Fix a problem with missing r groups #12 atomic mass — by design, no code change: osmordred uses RDKit's current IUPAC atomic weights, not Mordred's frozen CRC-94 table (S 32.067 vs 32.06, etc.).
  • HEState counting (osmordred extension, not in Mordred) — calcHEStateDescs counted unique atom-sets, which both collapsed bonded same-type atoms and over-counted an anchor bonded to several matching neighbours. Now counts distinct anchor atoms (one loop change covering all H-types). Note: this changes most molecules' HEState values to atom-count semantics — any model trained on the old HEState should be refit.

Verification (built split + reproducers)

methane MID=1 · ethane NsCH3=2/SsCH3=4/NHCsats=2 · ethylene NdCH2=2/NHdCH2=2 · cyclopropane NssCH2=3 · acetylene NtCH=2/NHtCH=2 · cyclotridecane nG12Ring=1 · peroxide NHsOH=2 · hydrazine NHsNH2=2.

Regression over a 20k set (Mordred-parity descriptors): 19,710/20,000 molecules byte-identical across all 3588 descriptors; the ~1.45% that change all move toward Mordred (0 away).

Stale reference corrections

  • EState.yaml Histidine: stored values were for a different imidazole tautomer; regenerated from current Mordred (osmordred matches to 16 digits).
  • Autocorrelation …m (mass-weighted) references: regenerated to current IUPAC atomic weights (consistent with Fix a problem with missing r groups #12).

#9 MolecularId: drop natoms>1 guard -> single-heavy-atom mols (methane/
   ammonia/HF) get base ID 1.0 not 0 (matches Mordred).
#10 EState: single-atom atom-type SMARTS so SubstructMatch(uniquify) no
   longer collapses bonded same-type atoms (ethane sCH3=2, cyclopropane
   ssCH2=3, acetylene tCH=2); dCH2/tCH keep the multiple-bond constraint
   via recursive SMARTS ([CD1H2;$(*=*)], [CD1H;$(*#*)]) so radicals/anions
   are not mis-typed.
#11 RingCount: add the missing nG12Ring total increment in ring_size>12.
#2 HEState (osmordred extension, not in Mordred): count DISTINCT anchor
   atoms (uniquify=false + dedup on match[0]) -> fixes both collapse and
   over-count across all H-types.
#12 atomic masses intentionally kept as current RDKit/IUPAC weights.

Rebuilt + ran reproducers (all pass); 20k regression = only intended
corrections toward Mordred.

Assisted by Claude
EState.yaml Histidine: SaaN/SaaNH/SaasC/SsssCH were generated from a
different imidazole tautomer; regenerated from current Mordred
(osmordred == Mordred == RDKit to 16 digits).
Autocorrelation/{ATS,ATSC,AATS,AATSC}.yaml: mass-weighted (...m) refs for
the sulfur molecules (Allicin, Glutathione) were vanilla-Mordred frozen
2013 masses; regenerated to current IUPAC weights (what osmordred uses
via RDKit's PeriodicTable).

Assisted by Claude
@guillaume-osmo

Copy link
Copy Markdown
Author

@bp-kelley those are the last corrections that was discover by a contributor !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant