Update HPO term parsing for new OBO release format#288
Conversation
|
I think the regression test file is missing, can you add that in? This will work for this additional failure mode, and matches the current Seqr code (remove/split on substring, what's left should be a HPO term), but I wonder if it's more future proof to switch the logic to a detection which would automatically adjust for any future format changes.
|
|
@MattWellie I've added more test coverage for this new change. As for using better regex, we could implement that here, or we could make it part of the other changes in the linked PR from the cardinal-seqr repo |
|
Nah I'm sure this is fine, I don't expect the format to change that frequently. Also I think you need to change a line in the unit test
(instead of 5 records) |
Fix for HPO.obo file parsing causing the
update-hpo-termsCloudrun jobs to failBased on the changes drafted for new seqr here: https://github.com/populationgenomics/cardinal-seqr/pull/10
Parsing OBO trailing modifier blocks in HPO terms
Problem
New HPO releases now emit lines like:
The parser strips OBO's
! human-readable labelsuffix, but not the OBO trailing annotation block{xref="..."}.(or in combination:
is_a: HP:0032162 ! Childhood onset {xref="PMID:31677808"})After
.split(' ! ')[0]we get the literal stringHP:0032162 {xref="PMID:31677808"}and store that asparent_id. Later,_get_category_idwalks the parent chain and looks up that string inparent_id_map. It's not a real key (only the cleanHP:0032162 is), so line 236 raises:Fix
In
get_file_iterator, strip the trailing{...}annotation block in addition to! label.re.sub(r'\s*\{[^}]*\}\s*$', '', value.split(' ! ')[0]).strip()Also add a regression test in
update_hpo_tests.pyusing fixture lines with{xref=...}trailers (both alone and combined with! label).The OBO 1.4 spec allows trailing modifiers on basically any clause that takes a target ID (
is_a,intersection_of,union_of,relationship, etc.). We currently only parseis_a, so that's the only line to fix.Testing