Update HPO term parsing for new OBO release format by EddieLF · Pull Request #288 · populationgenomics/seqr

EddieLF · 2026-06-15T03:14:03Z

Fix for HPO.obo file parsing causing the `update-hpo-terms` Cloudrun jobs to fail

Based on the changes drafted for new seqr here: https://github.com/populationgenomics/cardinal-seqr/pull/10

Parsing OBO trailing modifier blocks in HPO terms

Problem

New HPO releases now emit lines like:

is_a: HP:0032162 {xref="PMID:31677808"}

The parser strips OBO's ! human-readable label suffix, but not the OBO trailing annotation block {xref="..."}.
(or in combination: is_a: HP:0032162 ! Childhood onset {xref="PMID:31677808"})

After .split(' ! ')[0] we get the literal string HP:0032162 {xref="PMID:31677808"} and store that as parent_id. Later, _get_category_id walks the parent chain and looks up that string in parent_id_map. It's not a real key (only the clean HP:0032162 is), so line 236 raises:

ValueError('Strange id: %s' % hpo_id)

Fix

In get_file_iterator, strip the trailing {...} annotation block in addition to ! label.

Regex-tidy: re.sub(r'\s*\{[^}]*\}\s*$', '', value.split(' ! ')[0]).strip()

Also add a regression test in update_hpo_tests.py using fixture lines with {xref=...} trailers (both alone and combined with ! label).

The OBO 1.4 spec allows trailing modifiers on basically any clause that takes a target ID (is_a, intersection_of, union_of, relationship, etc.). We currently only parse is_a, so that's the only line to fix.

Testing

>>> import re

>>> # Test with expected ' ! ' and unexpected suffix term '{xref="..."}'
>>> line = 'is_a: HP:0000118 ! Phenotypic abnormality {xref="PMID:31677808"}\n'
>>> value = " ".join(line.split(" ")[1:])
>>> is_a = re.sub(r'\s*\{[^}]*\}\s*$', '', value).split(' ! ')[0].strip()
>>> print(is_a)
HP:0000118

>>> # Test with just unexpected suffix term '{xref="..."}'
>>> line = 'is_a: HP:0000118 {xref="PMID:31677808"}\n'
>>> value = " ".join(line.split(" ")[1:])
>>> is_a = re.sub(r'\s*\{[^}]*\}\s*$', '', value).split(' ! ')[0].strip()
>>> print(is_a)
HP:0000118

MattWellie · 2026-06-15T03:26:26Z

I think the regression test file is missing, can you add that in?

This will work for this additional failure mode, and matches the current Seqr code (remove/split on substring, what's left should be a HPO term), but I wonder if it's more future proof to switch the logic to a detection which would automatically adjust for any future format changes.

hpo = re.find(r'HP:\d{7}', value)

EddieLF · 2026-06-15T03:49:29Z

@MattWellie I've added more test coverage for this new change. As for using better regex, we could implement that here, or we could make it part of the other changes in the linked PR from the cardinal-seqr repo

MattWellie · 2026-06-15T04:00:29Z

Nah I'm sure this is fine, I don't expect the format to change that frequently.

Also I think you need to change a line in the unit test

call('Deleting HumanPhenotypeOntology table with 12 records and creating new table with 7 records'),

(instead of 5 records)

MattWellie

LGTMatt

Update HPO term parsing for new OBO release format

6f9586e

EddieLF requested review from MattWellie and cassimons June 15, 2026 03:16

EddieLF added 2 commits June 15, 2026 13:43

Update HPO term parsing test with new OBO formatting

b2f8f08

Update HPO term parsing test with new OBO formatting again

7908fee

EddieLF added 3 commits June 15, 2026 14:05

Fix HPO test

6e6c501

Fix HPO test 2

24206ae

Fix HPO test 3

eb0c57c

MattWellie approved these changes Jun 15, 2026

View reviewed changes

EddieLF merged commit d6b9df6 into staging Jun 15, 2026
6 checks passed

EddieLF deleted the fix_hpo_parsing branch June 15, 2026 05:12

This was referenced Jun 15, 2026

Fix HPO parsing for OBO 1.4 - round 2 #289

Merged

Hotfix HPO terms .obo file parsing #290

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update HPO term parsing for new OBO release format#288

Update HPO term parsing for new OBO release format#288
EddieLF merged 6 commits into
stagingfrom
fix_hpo_parsing

EddieLF commented Jun 15, 2026 •

edited

Loading

Uh oh!

MattWellie commented Jun 15, 2026

Uh oh!

EddieLF commented Jun 15, 2026

Uh oh!

MattWellie commented Jun 15, 2026

Uh oh!

MattWellie left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

EddieLF commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fix for HPO.obo file parsing causing the update-hpo-terms Cloudrun jobs to fail

Parsing OBO trailing modifier blocks in HPO terms

Problem

Fix

Testing

Uh oh!

MattWellie commented Jun 15, 2026

Uh oh!

EddieLF commented Jun 15, 2026

Uh oh!

MattWellie commented Jun 15, 2026

Uh oh!

MattWellie left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

EddieLF commented Jun 15, 2026 •

edited

Loading

Fix for HPO.obo file parsing causing the `update-hpo-terms` Cloudrun jobs to fail