osmordred: fix issues #9–#11 (MolecularId/EState/RingCount) + correct HEState counting#37
Open
guillaume-osmo wants to merge 2 commits into
Open
osmordred: fix issues #9–#11 (MolecularId/EState/RingCount) + correct HEState counting#37guillaume-osmo wants to merge 2 commits into
guillaume-osmo wants to merge 2 commits into
Conversation
#9 MolecularId: drop natoms>1 guard -> single-heavy-atom mols (methane/ ammonia/HF) get base ID 1.0 not 0 (matches Mordred). #10 EState: single-atom atom-type SMARTS so SubstructMatch(uniquify) no longer collapses bonded same-type atoms (ethane sCH3=2, cyclopropane ssCH2=3, acetylene tCH=2); dCH2/tCH keep the multiple-bond constraint via recursive SMARTS ([CD1H2;$(*=*)], [CD1H;$(*#*)]) so radicals/anions are not mis-typed. #11 RingCount: add the missing nG12Ring total increment in ring_size>12. #2 HEState (osmordred extension, not in Mordred): count DISTINCT anchor atoms (uniquify=false + dedup on match[0]) -> fixes both collapse and over-count across all H-types. #12 atomic masses intentionally kept as current RDKit/IUPAC weights. Rebuilt + ran reproducers (all pass); 20k regression = only intended corrections toward Mordred. Assisted by Claude
EState.yaml Histidine: SaaN/SaaNH/SaasC/SsssCH were generated from a
different imidazole tautomer; regenerated from current Mordred
(osmordred == Mordred == RDKit to 16 digits).
Autocorrelation/{ATS,ATSC,AATS,AATSC}.yaml: mass-weighted (...m) refs for
the sulfur molecules (Allicin, Glutathione) were vanilla-Mordred frozen
2013 masses; regenerated to current IUPAC weights (what osmordred uses
via RDKit's PeriodicTable).
Assisted by Claude
Author
|
@bp-kelley those are the last corrections that was discover by a contributor ! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Follow-up to #33 (merged). Resolves the open osmordred issues osmoai/osmordred#9–#12 and corrects two stale reference values found during verification. Runtime-confirmed by rebuilding the split and running the issue reproducers.
Code fixes
computeAtomicIdsskipped single-heavy-atom molecules (if (natoms > 1)guard) and returned 0; methane/ammonia/HF should be 1.0. Guard removed.sCH3/dCH2/ssCH2/tCHSMARTS were multi-atom, so withuniquify=truetwo bonded same-type atoms collapsed to one match (ethaneNsCH3=1 not 2; cyclopropaneNssCH2=1 not 3). Switched to single-atom patterns. FordCH2/tCHthe multiple-bond constraint is preserved via a recursive SMARTS ([CD1H2;$(*=*)],[CD1H;$(*#*)]) so single-bonded radicals/anions are not mis-typed.ring_size > 12branch updated the sub-category counters but never thenG12Ringtotal (stuck at 0). Added the increment. (nG12FRingwas fixed earlier;nG12ARingalready matches Mordred.)calcHEStateDescscounted unique atom-sets, which both collapsed bonded same-type atoms and over-counted an anchor bonded to several matching neighbours. Now counts distinct anchor atoms (one loop change covering all H-types). Note: this changes most molecules' HEState values to atom-count semantics — any model trained on the old HEState should be refit.Verification (built split + reproducers)
methane MID=1 · ethane NsCH3=2/SsCH3=4/NHCsats=2 · ethylene NdCH2=2/NHdCH2=2 · cyclopropane NssCH2=3 · acetylene NtCH=2/NHtCH=2 · cyclotridecane nG12Ring=1 · peroxide NHsOH=2 · hydrazine NHsNH2=2.
Regression over a 20k set (Mordred-parity descriptors): 19,710/20,000 molecules byte-identical across all 3588 descriptors; the ~1.45% that change all move toward Mordred (0 away).
Stale reference corrections
EState.yamlHistidine: stored values were for a different imidazole tautomer; regenerated from current Mordred (osmordred matches to 16 digits).…m(mass-weighted) references: regenerated to current IUPAC atomic weights (consistent with Fix a problem with missing r groups #12).