Right now, NCBI Datasets downloads for viruses often include sequences of non-natural origin, predominantly patent-related sequences.
It would be great if the metadata included a field that makes it easy to filter those sequences out. For example, you could include the GenBank division the sequence appears in. There is a dedicated patent division, PAT: https://www.ncbi.nlm.nih.gov/education/patent_and_ip_faqs/
Unfortunately, it seems this information is lost in the Datasets input pipeline. It would be great if it could be kept and surfaced.
The feature would be immediately and immensely useful to Pathoplexus, as it would let us skip ingesting patent sequences. Those are out of scope for Pathoplexus because they are not useful in pathogen genomic analyses. See loculus-project/loculus#6450
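
To illustrate the downstream use, here is a minimal sketch of the filter we would like to be able to write. It assumes the division were exposed in the JSON Lines metadata under a field named `genbankDivision`; that field name is hypothetical and does not exist in the current output, and the function name is ours.

```python
import json
from typing import Iterable, Iterator

# "PAT" is the GenBank division code for patent sequences.
PATENT_DIVISION = "PAT"


def drop_patent_records(
    lines: Iterable[str],
    field: str = "genbankDivision",  # hypothetical field name, not in the current metadata
) -> Iterator[dict]:
    """Yield metadata records whose division is not PAT.

    `lines` is an iterable of JSON Lines strings, one record per line.
    Records that lack the field are kept, since nothing marks them as patent sequences.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get(field) == PATENT_DIVISION:
            continue  # drop patent-division sequences
        yield record
```

With such a field, excluding patent sequences at ingest would be a single pass over the metadata, for example `kept = list(drop_patent_records(open("data_report.jsonl")))`, where the file name is again only a placeholder.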