Skip to content

Update HPO term parsing for new OBO release format#288

Merged
EddieLF merged 6 commits into
stagingfrom
fix_hpo_parsing
Jun 15, 2026
Merged

Update HPO term parsing for new OBO release format#288
EddieLF merged 6 commits into
stagingfrom
fix_hpo_parsing

Conversation

@EddieLF

@EddieLF EddieLF commented Jun 15, 2026

Copy link
Copy Markdown

Fix for HPO.obo file parsing causing the update-hpo-terms Cloudrun jobs to fail

Based on the changes drafted for new seqr here: https://github.com/populationgenomics/cardinal-seqr/pull/10

Parsing OBO trailing modifier blocks in HPO terms

Problem

New HPO releases now emit lines like:

is_a: HP:0032162 {xref="PMID:31677808"}

The parser strips OBO's ! human-readable label suffix, but not the OBO trailing annotation block {xref="..."}.
(or in combination: is_a: HP:0032162 ! Childhood onset {xref="PMID:31677808"})

After .split(' ! ')[0] we get the literal string HP:0032162 {xref="PMID:31677808"} and store that as parent_id. Later, _get_category_id walks the parent chain and looks up that string in parent_id_map. It's not a real key (only the clean HP:0032162 is), so line 236 raises:

ValueError('Strange id: %s' % hpo_id)

Fix

In get_file_iterator, strip the trailing {...} annotation block in addition to ! label.

  • Regex-tidy: re.sub(r'\s*\{[^}]*\}\s*$', '', value.split(' ! ')[0]).strip()

Also add a regression test in update_hpo_tests.py using fixture lines with {xref=...} trailers (both alone and combined with ! label).

The OBO 1.4 spec allows trailing modifiers on basically any clause that takes a target ID (is_a, intersection_of, union_of, relationship, etc.). We currently only parse is_a, so that's the only line to fix.

Testing

>>> import re

>>> # Test with expected ' ! ' and unexpected suffix term '{xref="..."}'
>>> line = 'is_a: HP:0000118 ! Phenotypic abnormality {xref="PMID:31677808"}\n'
>>> value = " ".join(line.split(" ")[1:])
>>> is_a = re.sub(r'\s*\{[^}]*\}\s*$', '', value).split(' ! ')[0].strip()
>>> print(is_a)
HP:0000118

>>> # Test with just unexpected suffix term '{xref="..."}'
>>> line = 'is_a: HP:0000118 {xref="PMID:31677808"}\n'
>>> value = " ".join(line.split(" ")[1:])
>>> is_a = re.sub(r'\s*\{[^}]*\}\s*$', '', value).split(' ! ')[0].strip()
>>> print(is_a)
HP:0000118

@EddieLF EddieLF requested review from MattWellie and cassimons June 15, 2026 03:16
@MattWellie

Copy link
Copy Markdown

I think the regression test file is missing, can you add that in?

This will work for this additional failure mode, and matches the current Seqr code (remove/split on substring, what's left should be a HPO term), but I wonder if it's more future proof to switch the logic to a detection which would automatically adjust for any future format changes.

hpo = re.find(r'HP:\d{7}', value)

@EddieLF

EddieLF commented Jun 15, 2026

Copy link
Copy Markdown
Author

@MattWellie I've added more test coverage for this new change. As for using better regex, we could implement that here, or we could make it part of the other changes in the linked PR from the cardinal-seqr repo

@MattWellie

Copy link
Copy Markdown

Nah I'm sure this is fine, I don't expect the format to change that frequently.

Also I think you need to change a line in the unit test

call('Deleting HumanPhenotypeOntology table with 12 records and creating new table with 7 records'),

(instead of 5 records)

@MattWellie MattWellie left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTMatt

@EddieLF EddieLF merged commit d6b9df6 into staging Jun 15, 2026
6 checks passed
@EddieLF EddieLF deleted the fix_hpo_parsing branch June 15, 2026 05:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants