
Rework the pathways serialization format #26

@MaybeJustJames


The "current" pathways serialization format (what I've been calling "abomination") is a custom format that requires serializer/deserializer modification to extend.

Issue #25 perhaps requires extension of the format. Rather than extending the format, this issue proposes replacing the custom format with a standard serialization format.

Some options to consider (also a paper):

Serialization format | Human readable | Appendable | Multi-language impls | Standardised | Extensible | Compact
Abomination
YAML
JSON
SQLite
JSONlines
CBOR
Protobuffers
Flatbuffers
Avro
Thrift
Cap'n'proto
Twine
Preserves
UBJSON
Postcard
Human readable: Plain text encoding.
Appendable: I don't need to know about more than the single row I'm appending to the file in order to append (e.g. JSON is not appendable because of array delimiters).
Multi-language impls: There are off-the-shelf serializers/deserializers for the format in Python and at least 1 other language.
Standardised: The format is documented in an internet standard from IEEE, W3C, etc.
Extensible: When a field is added, old software can still work with data serialized with the new field.
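
To make "appendable" and "extensible" concrete, here is a rough sketch using JSON Lines and only the Python standard library. The record fields (`id`, `name`, `citation`), the values, and the file name are made up for illustration and are not the real pathways schema.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one record as a single JSON line without reading the existing file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_known_fields(path, fields=("id", "name")):
    """Stand-in for 'old software': keep only the fields it knows, ignore the rest."""
    with open(path, encoding="utf-8") as f:
        return [{k: rec.get(k) for k in fields} for rec in map(json.loads, f)]


# Hypothetical file and records, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "pathways.jsonl")
append_record(path, {"id": 1, "name": "glycolysis"})
# A newer writer adds an optional "citation" key to the records that have one.
append_record(path, {"id": 2, "name": "TCA cycle", "citation": "hypothetical-ref"})

old_view = read_known_fields(path)
print(old_view)
```

Each append only touches the end of the file, and the old reader keeps working after the new key shows up because it never looks at keys it doesn't know.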

Why?

I think there are a few good reasons to consider this change.

  1. The current format requires specialised knowledge to understand the data format itself (not just domain knowledge of metabolomics). Using a more common format means that someone receiving the data can use an off-the-shelf parser and be confident that it works.
  2. eval()-ing Python is slow and dangerous. literal_eval() is safer but still much slower than parsing, say, YAML. A common format also opens up the possibility of using the data in non-Python languages. I'm already doing this a little in the command-line client app I'll give you, which is written in Rust. Parsing literal Python values is painful but possible outside of Python, and using a more common format makes this much easier.
  3. Extending the data you want to store becomes much easier. You don't have to make fundamental adjustments to the format to add a citation field. In YAML you would just add an optional key to each object in the list that has a citation. In SQLite you'd add an extra nullable column. Both options take less space than a 3-byte empty list for each entry in your custom format (~35 kB for a single 12,000-entry file).
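
As a sketch of the SQLite option from point 3, this is roughly what adding a nullable citation column could look like with Python's built-in `sqlite3` module. The table name `pathways`, its columns, and the sample values are assumptions for illustration, not the actual data model.

```python
import sqlite3

# In-memory database with a hypothetical minimal schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pathways (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO pathways (name) VALUES (?)", ("glycolysis",))

# Extend the schema: a column added without NOT NULL is nullable,
# so existing rows simply get NULL for it.
conn.execute("ALTER TABLE pathways ADD COLUMN citation TEXT")
conn.execute(
    "INSERT INTO pathways (name, citation) VALUES (?, ?)",
    ("TCA cycle", "hypothetical-ref"),
)

# A query written before the change still works unchanged.
old_rows = conn.execute("SELECT id, name FROM pathways ORDER BY id").fetchall()
# A newer query sees None for entries that have no citation.
new_rows = conn.execute("SELECT name, citation FROM pathways ORDER BY id").fetchall()
print(old_rows)
print(new_rows)
```

Queries that name their columns explicitly are unaffected by the new column; only code relying on `SELECT *` and positional access would need a look.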

Metadata

Labels

help wanted
