Structured data format (sdata)

Design goals

- open data format for open science projects
- self-describing data
- flexible data structure layout
  - hierarchical data structure (nesting groups, dictionaries)
  - (posix path syntax support?)
- extensible data structure
  - data format versions
- platform independent
- simple object model
- support of standard metadata formats (key/value, ...)
- support of standard dataset formats (hdf5, netcdf, csv, ...)
- support of standard dataset types (datacubes, tables, series, ...)
- support of physical units (conversion of units)
- transparent, optional data compression (zlib, blosc, ...)
- support of (de-)serialization of every dataset type (group, data, metadata)
- easily definable (project) standards, e.g. for a uniaxial tension test (UT)
- (optional data encryption (gpg, ...))
- change management support?
- transparent use of data structures from existing tensor libraries
- (single writer / multiple reader (swmr) support)
- (nested data support)
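
To make the "hierarchical data structure" and "posix path syntax" goals concrete, here is a minimal sketch in plain Python. It is not part of the sdata API: `get_by_path` and the `project` layout are hypothetical names chosen for illustration, and nested dictionaries stand in for groups.

```python
from typing import Any


def get_by_path(tree: dict, path: str) -> Any:
    """Resolve a POSIX-style path such as '/a/b/c' in nested dictionaries."""
    node: Any = tree
    for part in path.strip("/").split("/"):
        if part == "":
            continue  # "/" alone addresses the root
        if not isinstance(node, dict) or part not in node:
            raise KeyError(path)
        node = node[part]
    return node


# Hypothetical project layout: group -> specimen -> metadata / data
project = {
    "UT": {
        "specimen_01": {
            "metadata": {"max_force": {"value": 12.5, "unit": "kN"}},
            "data": {"force": [1.0, 2.0, 3.0]},
        }
    }
}

max_force = get_by_path(project, "/UT/specimen_01/metadata/max_force")
print(max_force["value"], max_force["unit"])
```

Addressing nested groups by path keeps the object model simple: a reader only needs dictionaries and strings to locate any dataset or attribute.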

Quickstart

import pandas as pd
from sdata.sclass.dataframe import DataFrame

df = pd.DataFrame({"force": [1.0, 2.0, 3.0]})
data = DataFrame(df=df, name="specimen_01", description="a tension test")
data.metadata.add("max_force", 12.5, unit="kN", dtype="float",
                  description="max force", ontology="bfo:Quality")

print(data.metadata.df[["value", "unit", "dtype", "ontology"]])

                                          value unit  dtype     ontology
key
_sdata_name                         specimen_01    -    str
_sdata_sname    DataFrame__specimen_01__3003...    -    str
...
max_force                                  12.5   kN  float  bfo:Quality

Every object has a deterministic, content-addressable identity (SUUID), a set of reserved _sdata_* attributes (name, sname, suuid, class, ctime, parent, project, topology class) and freely extensible, fully-described user attributes.
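
The idea of a deterministic identity can be sketched with the standard library alone. The snippet below is an assumption-laden illustration, not the SUUID algorithm: which inputs sdata actually hashes is not stated here, and `SKETCH_NAMESPACE`, `deterministic_id` and `make_sname` are hypothetical names. Only the `Class__name__hex` shape of the sname is taken from the output above.

```python
import uuid

# Hypothetical namespace for this sketch only; sdata/suuid may derive IDs differently.
SKETCH_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def deterministic_id(class_name: str, name: str) -> str:
    """Derive a reproducible 32-character hex identifier from class and object name."""
    return uuid.uuid5(SKETCH_NAMESPACE, f"{class_name}/{name}").hex


def make_sname(class_name: str, name: str) -> str:
    """Build a string shaped like the _sdata_sname value: Class__name__hex."""
    return f"{class_name}__{name}__{deterministic_id(class_name, name)}"


first = make_sname("DataFrame", "specimen_01")
second = make_sname("DataFrame", "specimen_01")
other = make_sname("DataFrame", "specimen_02")
print(first == second, first == other)
```

Because the identifier is a pure function of its inputs, two processes that describe the same object arrive at the same identity without coordinating.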

Machine-readable metadata (JSON-LD / Linked Data)

The metadata is the backbone of the data description: it lives next to every data blob and fully qualifies the data — units (QUDT/UCUM), ontology classes (BFO), provenance (PROV/DCAT), tabular columns (CSVW) and identity (DID).

# self-describing JSON-LD (qudt:Quantity units, BFO @type, did:suuid @id, csvw columns)
doc = data.to_jsonld()

# RDF/Turtle (uses rdflib if installed, otherwise returns the JSON-LD)
print(data.to_turtle())

# write the data + a sidecar <sname>.meta.jsonld right next to it
data.to_json("specimen_01.sjson", sidecar=True)

# validate / auto-complete against a schema template
from sdata.schema import MetadataSchema, AttrSpec
schema = MetadataSchema("TensileTest", [
    AttrSpec("max_force", dtype="float", unit="kN", required=True, ontology="bfo:Quality"),
])
report = data.metadata.validate(schema)      # ValidationReport (truthy if ok)

# sign the metadata as a W3C Verifiable Credential (pure-Python Ed25519)
from sdata.did import keys, pub_from_priv_jwk
priv = keys.gen_ed25519_jwk()
vc = data.metadata.to_verifiable_credential("did:example:issuer", priv)
subject = data.metadata.from_verifiable_credential(vc, pub_from_priv_jwk(priv))

# interactive: attribute autocomplete + rich Jupyter display
data.metadata.a.max_force        # -> Attribute; tab-completion in Jupyter
data.metadata                    # -> _repr_html_ table

Resulting JSON-LD for max_force (excerpt):

{
  "@id": "did:suuid:DataFrame__specimen_01__3003...:sdata",
  "@type": ["sdata:DataFrame", "bfo:BFO_0000004"],
  "name": "specimen_01",
  "sdata:max_force": {
    "@type": ["qudt:Quantity", "bfo:Quality"],
    "value": {"@value": 12.5, "@type": "xsd:double"},
    "unitRef": "unit:KiloN", "symbol": "kN"
  },
  "columns": [{"name": "force", "datatype": "xsd:double"}]
}
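
To show how an attribute could map onto such a node, here is a small sketch that rebuilds the `sdata:max_force` entry with the standard library. It is not the code behind `to_jsonld()`: `quantity_node` and the two lookup tables are hypothetical, and the tables contain only the mappings visible in the excerpt above.

```python
import json

# Lookup tables limited to the values that appear in the excerpt above.
XSD_TYPES = {"float": "xsd:double"}
QUDT_UNITS = {"kN": "unit:KiloN"}


def quantity_node(value, unit, dtype, ontology):
    """Build a JSON-LD node for one attribute (illustrative, not the sdata API)."""
    node = {
        "@type": ["qudt:Quantity", ontology],
        "value": {"@value": value, "@type": XSD_TYPES[dtype]},
        "symbol": unit,
    }
    # Only add a unit reference when the symbol is known to this sketch.
    if unit in QUDT_UNITS:
        node["unitRef"] = QUDT_UNITS[unit]
    return node


node = quantity_node(12.5, unit="kN", dtype="float", ontology="bfo:Quality")
print(json.dumps({"sdata:max_force": node}, indent=2))
```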

The optional semantic backends degrade gracefully to pure Python (no hard dependency). Install them as extras: pip install "sdata[rdf]" (rdflib), "sdata[units]" (pint), "sdata[schema]" (jsonschema). The core install needs only numpy, pandas and suuid.
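
The "degrade gracefully" behaviour can be pictured as an optional import with a fallback. The sketch below is an assumption about the general pattern, not sdata's implementation: `load_optional` and `to_text` are hypothetical helpers, and the rdflib branch assumes rdflib 6 or newer (built-in JSON-LD parser, `serialize` returning a string).

```python
import importlib
import json


def load_optional(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def to_text(doc, rdf_module_name="rdflib"):
    """Serialize a JSON-LD dict as Turtle if the RDF backend exists, else as JSON."""
    rdflib = load_optional(rdf_module_name)
    if rdflib is None:
        # Pure-Python fallback: hand back the JSON-LD document itself.
        return json.dumps(doc, indent=2, sort_keys=True)
    graph = rdflib.Graph()
    graph.parse(data=json.dumps(doc), format="json-ld")
    return graph.serialize(format="turtle")


doc = {"@id": "http://example.org/specimen_01", "http://example.org/name": "specimen_01"}
fallback = to_text(doc, rdf_module_name="module_that_is_not_installed")
print(fallback)
```

The caller gets a usable string either way, so downstream code does not need to know which extras are installed.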

Tabular data (`DataFrame`)

DataFrame wraps a pandas frame plus per-column metadata and serializes to many formats — Parquet, Arrow/Feather, CSV, dict/JSON, JSON-LD, a Frictionless Data Package and HDF5 — with the qualifying metadata embedded or written as a sidecar.

import pandas as pd
from sdata.sclass.dataframe import DataFrame

sdf = DataFrame(df=pd.DataFrame({"weight": [10, 20], "height": [1.5, 1.6]}),
                name="specimen_01", description="a tension test")

# per-column annotations (only the fields you pass are changed)
sdf.set_column("weight", unit="kg", label="Gewicht", ontology="bfo:Quality")
sdf.col["height"].unit = "m"          # mutate the column Attribute in place
sdf.column_units                       # {'weight': 'kg', 'height': 'm'}

# serialize (optional <sname>.meta.jsonld sidecar; bytes/str without a path)
sdf.to_parquet(path="out", sidecar=True)      # out/<sname>.spq + sidecar
sdf.to_csv(path="out")                         # data-only CSV (pure pandas)
sdf.to_feather(path="out")                     # Arrow IPC + native per-column field metadata
sdf.to_datapackage(path="out")                 # Frictionless Data Package (.zip)
sdf.to_hdf(path="out")                         # HDF5 (needs sdata[hdf])
DataFrame.from_parquet("out/specimen_01.spq")

# validate the table against a schema (missing/dtype/unit/extra columns)
from sdata.schema import TableSchema, AttrSpec
schema = TableSchema("TensileTable", [
    AttrSpec("weight", dtype="int", unit="kg", required=True),
])
report = sdf.validate_table(schema)            # ValidationReport (truthy if ok)

Arrow/Feather/Parquet need pip install "sdata[parquet]" and HDF5 needs "sdata[hdf]"; CSV, dict, JSON-LD and the (CSV) Data Package work with the core install. See the Tabular data docs (docs/usage/dataframe.md) for the full reference.
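
The four table checks named above (missing, dtype, unit and extra columns) can be sketched without any dependency. This is an illustration of the idea only: `ColumnSpec`, `Report` and `check_table` are hypothetical stand-ins, not `TableSchema`, `AttrSpec` or `ValidationReport`, and treating an extra column as an error is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnSpec:  # hypothetical stand-in for a column specification
    name: str
    dtype: str
    unit: str = "-"
    required: bool = False


@dataclass
class Report:  # truthy when no errors were collected
    errors: list = field(default_factory=list)

    def __bool__(self):
        return not self.errors


def check_table(columns: dict, specs: list) -> Report:
    """columns maps a column name to {'dtype': ..., 'unit': ...}."""
    report = Report()
    known = {spec.name for spec in specs}
    for spec in specs:
        col = columns.get(spec.name)
        if col is None:
            if spec.required:
                report.errors.append(f"missing column: {spec.name}")
            continue
        if col.get("dtype") != spec.dtype:
            report.errors.append(f"dtype mismatch: {spec.name}")
        if col.get("unit") != spec.unit:
            report.errors.append(f"unit mismatch: {spec.name}")
    for name in columns:
        if name not in known:
            report.errors.append(f"extra column: {name}")
    return report


columns = {"weight": {"dtype": "int", "unit": "kg"},
           "height": {"dtype": "float", "unit": "m"}}
partial = check_table(columns, [ColumnSpec("weight", "int", "kg", required=True)])
full = check_table(columns, [ColumnSpec("weight", "int", "kg", required=True),
                             ColumnSpec("height", "float", "m")])
print(bool(partial), partial.errors, bool(full))
```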

Howto

Demo App

Test the demo app with the editor.

Try pasting some Excel data into the forms.

Metadata

Attributes

name
value
dtype
unit
description
label
required
ontology (CURIE/IRI of the value's class, e.g. bfo:Quality)
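
As a reading aid, the eight fields can be pictured as one record per attribute. The dataclass below is a sketch, not sdata's `Attribute` class: `AttributeSketch` is a hypothetical name, and the defaults are assumptions, except that `"-"` for a missing unit and `"str"` as dtype mirror the Quickstart output.

```python
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class AttributeSketch:  # hypothetical; mirrors the field list above
    name: str
    value: Any = None
    dtype: str = "str"
    unit: str = "-"
    description: str = ""
    label: str = ""
    required: bool = False
    ontology: str = ""


attr = AttributeSketch("max_force", 12.5, dtype="float", unit="kN",
                       description="max force", ontology="bfo:Quality")
record = asdict(attr)  # plain dict, ready for JSON serialization
print(record)
```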

dtypes for attributes

int
float
str
bool
list (list of strings)
timestamp (ISO-8601 with timezone, stdlib zoneinfo)
bytes (base64 in JSON)
json (dict/list)
uri

Set strict=True (e.g. metadata.add(..., strict=True)) to raise on invalid values instead of the lenient default coercion.
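
The difference between lenient and strict handling can be sketched as follows. This is an illustration under assumptions, not sdata's coercion code: `coerce` and its helpers are hypothetical, only a few dtypes are covered, and returning the original value unchanged in lenient mode is a guess at what "lenient" means.

```python
from datetime import datetime


def _to_bool(value):
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise ValueError(value)


def _to_timestamp(value):
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        raise ValueError("timestamp needs a timezone")
    return parsed


CONVERTERS = {"int": int, "float": float, "str": str,
              "bool": _to_bool, "timestamp": _to_timestamp}


def coerce(value, dtype, strict=False):
    """Convert value to dtype; raise in strict mode, keep the value otherwise."""
    try:
        return CONVERTERS[dtype](value)
    except (KeyError, ValueError, TypeError):
        if strict:
            raise ValueError(f"{value!r} is not a valid {dtype}")
        return value  # assumption: lenient mode leaves the value untouched


print(coerce("3", "int"), coerce("abc", "float"))
```

With `strict=True`, the same invalid input raises `ValueError` instead of passing through, which is useful when metadata must conform to a project standard.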

paper

Das sdata-Format
- Ingolf Lepenies. (2020). Das sdata-Format (Version 0.8.4). http://doi.org/10.5281/zenodo.4311323
- slides, html, pdf temperaturmessung-001.json temperaturmessung-001.xlsx
sdata
- Ingolf Lepenies. (2020, December 8). sdata - a structured data format (Version 0.8.4). Zenodo. http://doi.org/10.5281/zenodo.4311397
