lepy/sdata

Structured data format (sdata)

Design goals

  • open data format for open science projects
  • self-describing data
  • flexible data structure layout
    • hierarchical data structure (nesting groups, dictionaries)
    • (posix path syntax support?)
  • extensible data structure
    • data format versions
  • platform-independent
  • simple object model
  • support of standard metadata formats (key/value, ...)
  • support of standard dataset formats (hdf5, netcdf, csv, ...)
  • support of standard dataset types (datacubes, tables, series, ...)
  • support of physical units (conversion of units)
  • transparent, optional data compression (zlib, blosc, ...)
  • support of (de-)serialization of every dataset type (group, data, metadata)
  • easily definable (project) standards, e.g. for a uniaxial tension test (UT)
  • (optional data encryption (gpg, ...))
  • change management support?
  • transparent use of data structures from existing tensor libraries
  • (single writer / multiple reader (SWMR) support)
  • (nested data support)

Quickstart

import pandas as pd
from sdata.sclass.dataframe import DataFrame

df = pd.DataFrame({"force": [1.0, 2.0, 3.0]})
data = DataFrame(df=df, name="specimen_01", description="a tension test")
data.metadata.add("max_force", 12.5, unit="kN", dtype="float",
                  description="max force", ontology="bfo:Quality")

print(data.metadata.df[["value", "unit", "dtype", "ontology"]])
                                          value unit  dtype     ontology
key
_sdata_name                         specimen_01    -    str
_sdata_sname    DataFrame__specimen_01__3003...    -    str
...
max_force                                  12.5   kN  float  bfo:Quality

Every object has a deterministic, content-addressable identity (SUUID), a set of reserved _sdata_* attributes (name, sname, suuid, class, ctime, parent, project, topology class) and freely extensible, fully-described user attributes.
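
The idea of a deterministic identity can be illustrated with the standard library alone. The sketch below is not the SUUID algorithm (that lives in the suuid package). It only assumes, for illustration, that an identifier is derived from the class name and the object name through a name-based UUID. The function make_sname and the namespace DEMO_NAMESPACE are hypothetical.

```python
import uuid

# Hypothetical namespace, for illustration only; not the one used by suuid.
DEMO_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def make_sname(class_name: str, name: str) -> str:
    """Return '<class>__<name>__<hex>'; the hex part depends only on the inputs."""
    digest = uuid.uuid5(DEMO_NAMESPACE, f"{class_name}:{name}").hex
    return f"{class_name}__{name}__{digest}"


a = make_sname("DataFrame", "specimen_01")
b = make_sname("DataFrame", "specimen_01")
c = make_sname("DataFrame", "specimen_02")
print(a == b)  # True: same inputs give the same identifier
print(a == c)  # False: a different name gives a different identifier
```

Because nothing random or time-dependent enters the hash, the same object description maps to the same identifier on every machine, which is what makes such an ID usable as a stable reference.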

Machine-readable metadata (JSON-LD / Linked Data)

The metadata is the backbone of the data description: it lives next to every data blob and fully qualifies the data with units (QUDT/UCUM), ontology classes (BFO), provenance (PROV/DCAT), tabular columns (CSVW) and identity (DID).

# self-describing JSON-LD (qudt:Quantity units, BFO @type, did:suuid @id, csvw columns)
doc = data.to_jsonld()

# RDF/Turtle (uses rdflib if installed, otherwise returns the JSON-LD)
print(data.to_turtle())

# write the data + a sidecar <sname>.meta.jsonld right next to it
data.to_json("specimen_01.sjson", sidecar=True)

# validate / auto-complete against a schema template
from sdata.schema import MetadataSchema, AttrSpec
schema = MetadataSchema("TensileTest", [
    AttrSpec("max_force", dtype="float", unit="kN", required=True, ontology="bfo:Quality"),
])
report = data.metadata.validate(schema)      # ValidationReport (truthy if ok)

# sign the metadata as a W3C Verifiable Credential (pure-Python Ed25519)
from sdata.did import keys, pub_from_priv_jwk
priv = keys.gen_ed25519_jwk()
vc = data.metadata.to_verifiable_credential("did:example:issuer", priv)
subject = data.metadata.from_verifiable_credential(vc, pub_from_priv_jwk(priv))

# interactive: attribute autocomplete + rich Jupyter display
data.metadata.a.max_force        # -> Attribute; tab-completion in Jupyter
data.metadata                    # -> _repr_html_ table

Resulting JSON-LD for max_force (excerpt):

{
  "@id": "did:suuid:DataFrame__specimen_01__3003...:sdata",
  "@type": ["sdata:DataFrame", "bfo:BFO_0000004"],
  "name": "specimen_01",
  "sdata:max_force": {
    "@type": ["qudt:Quantity", "bfo:Quality"],
    "value": {"@value": 12.5, "@type": "xsd:double"},
    "unitRef": "unit:KiloN", "symbol": "kN"
  },
  "columns": [{"name": "force", "datatype": "xsd:double"}]
}
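
To make the shape of this excerpt concrete, the following sketch assembles a similar node from plain dictionaries. It is not how sdata generates its JSON-LD; the helper quantity_node and the one-entry unit lookup table UNIT_REFS are assumptions made for illustration.

```python
import json

# Hypothetical lookup from unit symbol to a unit CURIE (only the entry from the excerpt).
UNIT_REFS = {"kN": "unit:KiloN"}


def quantity_node(value: float, symbol: str, ontology: str) -> dict:
    """Build a quantity node shaped like the max_force entry in the excerpt."""
    node = {
        "@type": ["qudt:Quantity", ontology],
        "value": {"@value": value, "@type": "xsd:double"},
        "symbol": symbol,
    }
    # Only add a unit reference when the symbol is known to the lookup table.
    if symbol in UNIT_REFS:
        node["unitRef"] = UNIT_REFS[symbol]
    return node


doc = {
    "name": "specimen_01",
    "sdata:max_force": quantity_node(12.5, "kN", "bfo:Quality"),
}
print(json.dumps(doc, indent=2))
```

Keeping the numeric value, its XSD datatype, the unit reference and the ontology class together in one node is what lets a consumer interpret the number without consulting the producing code.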

The optional semantic backends are not hard dependencies; without them sdata falls back to pure Python. Install them as extras: pip install "sdata[rdf]" (rdflib), "sdata[units]" (pint), "sdata[schema]" (jsonschema). The core install needs only numpy, pandas and suuid.
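
A common way to implement this kind of graceful fallback is to probe for the optional package before using it. The sketch below shows that general pattern under stated assumptions: has_module and export are hypothetical names, the backend branch is only a placeholder, and none of this is taken from the sdata source.

```python
import importlib.util
import json


def has_module(name: str) -> bool:
    """Return True if a top-level package is importable, without importing it."""
    return importlib.util.find_spec(name) is not None


def export(doc: dict, backend: str = "rdflib") -> tuple:
    """Return (format, text); fall back to plain JSON when the backend is missing."""
    if has_module(backend):
        # Placeholder: a real exporter would hand `doc` to the backend here.
        return backend, f"<serialized by {backend}>"
    return "json", json.dumps(doc, sort_keys=True)


fmt, text = export({"name": "specimen_01"}, backend="no_such_backend_xyz")
print(fmt, text)  # json {"name": "specimen_01"}
```

Probing with find_spec keeps the import cost out of the common path and makes the fallback explicit instead of hiding it in a broad except clause.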

Tabular data (DataFrame)

DataFrame wraps a pandas frame plus per-column metadata and serializes to many formats (Parquet, Arrow/Feather, CSV, dict/JSON, JSON-LD, a Frictionless Data Package and HDF5), with the qualifying metadata embedded or written as a sidecar.

import pandas as pd
from sdata.sclass.dataframe import DataFrame

sdf = DataFrame(df=pd.DataFrame({"weight": [10, 20], "height": [1.5, 1.6]}),
                name="specimen_01", description="a tension test")

# per-column annotations (only the fields you pass are changed)
sdf.set_column("weight", unit="kg", label="Gewicht", ontology="bfo:Quality")
sdf.col["height"].unit = "m"          # mutate the column Attribute in place
sdf.column_units                       # {'weight': 'kg', 'height': 'm'}

# serialize (optional <sname>.meta.jsonld sidecar; bytes/str without a path)
sdf.to_parquet(path="out", sidecar=True)      # out/<sname>.spq + sidecar
sdf.to_csv(path="out")                         # data-only CSV (pure pandas)
sdf.to_feather(path="out")                     # Arrow IPC + native per-column field metadata
sdf.to_datapackage(path="out")                 # Frictionless Data Package (.zip)
sdf.to_hdf(path="out")                         # HDF5 (needs sdata[hdf])
DataFrame.from_parquet("out/specimen_01.spq")

# validate the table against a schema (missing/dtype/unit/extra columns)
from sdata.schema import TableSchema, AttrSpec
schema = TableSchema("TensileTable", [
    AttrSpec("weight", dtype="int", unit="kg", required=True),
])
report = sdf.validate_table(schema)            # ValidationReport (truthy if ok)
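
The four kinds of finding named in the comment above (missing, dtype, unit and extra columns) can be sketched in plain Python. This is an illustration of the checks only, not the sdata implementation: ColumnSpec, Report and check_table are hypothetical names, the table is a simple dict of lists, and treating an extra column as an error is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnSpec:
    name: str
    dtype: type
    unit: str = "-"
    required: bool = False


@dataclass
class Report:
    errors: list = field(default_factory=list)

    def __bool__(self) -> bool:
        # Truthy when no errors were collected.
        return not self.errors


def check_table(columns: dict, units: dict, specs: list) -> Report:
    """Compare a dict of column lists and their units against the specs."""
    report = Report()
    known = {spec.name for spec in specs}
    for spec in specs:
        if spec.name not in columns:
            if spec.required:
                report.errors.append(f"missing column: {spec.name}")
            continue
        if not all(isinstance(v, spec.dtype) for v in columns[spec.name]):
            report.errors.append(f"wrong dtype: {spec.name}")
        if units.get(spec.name, "-") != spec.unit:
            report.errors.append(f"wrong unit: {spec.name}")
    # Assumption: columns not covered by any spec are reported as errors.
    report.errors.extend(f"extra column: {n}" for n in columns if n not in known)
    return report


specs = [ColumnSpec("weight", int, unit="kg", required=True)]
ok = check_table({"weight": [10, 20]}, {"weight": "kg"}, specs)
bad = check_table({"height": [1.5, 1.6]}, {"height": "m"}, specs)
print(bool(ok), bad.errors)  # True ['missing column: weight', 'extra column: height']
```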

Arrow/Feather/Parquet need pip install "sdata[parquet]" and HDF5 needs "sdata[hdf]"; CSV, dict, JSON-LD and the (CSV) Data Package work with the core install. See the Tabular data docs (docs/usage/dataframe.md) for the full reference.
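
The sidecar idea itself needs no optional dependency. As a minimal sketch, assuming a simplified layout (a CSV file plus a plain JSON file named <stem>.meta.json, rather than the <sname>.meta.jsonld file that sdata writes), the data and its qualifying metadata can be written side by side with the standard library. The function write_with_sidecar is hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_with_sidecar(folder: Path, stem: str, rows: list, column_units: dict) -> tuple:
    """Write <stem>.csv and a <stem>.meta.json sidecar into the same folder."""
    data_path = folder / f"{stem}.csv"
    meta_path = folder / f"{stem}.meta.json"
    with data_path.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(column_units))
        writer.writeheader()
        writer.writerows(rows)
    meta = {"name": stem, "column_units": column_units}
    meta_path.write_text(json.dumps(meta, indent=2))
    return data_path, meta_path


with tempfile.TemporaryDirectory() as tmp:
    rows = [{"weight": 10, "height": 1.5}, {"weight": 20, "height": 1.6}]
    units = {"weight": "kg", "height": "m"}
    data_path, meta_path = write_with_sidecar(Path(tmp), "specimen_01", rows, units)
    header = data_path.read_text().splitlines()[0]
    meta = json.loads(meta_path.read_text())
    print(header, meta["column_units"])
```

Sharing the file stem is what ties the two files together: a reader that finds the data file can locate its description without any registry.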

Howto

Demo App

Try pasting some Excel data into the forms ...

Metadata

Attributes

  • name
  • value
  • dtype
  • unit
  • description
  • label
  • required
  • ontology (CURIE/IRI of the value's class, e.g. bfo:Quality)

dtypes for attributes

  • int
  • float
  • str
  • bool
  • list (list of strings)
  • timestamp (ISO-8601 with timezone, stdlib zoneinfo)
  • bytes (base64 in JSON)
  • json (dict/list)
  • uri
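
Two of these dtypes need an encoding step before they fit into JSON. The sketch below shows one plausible mapping using only the standard library: bytes become base64 text and timezone-aware timestamps become ISO-8601 strings. The helper to_json_value is hypothetical, and the example uses datetime.timezone.utc instead of zoneinfo so that it does not depend on an installed time zone database.

```python
import base64
import json
from datetime import datetime, timezone


def to_json_value(value):
    """Map bytes and datetime values to JSON-friendly strings; pass others through."""
    if isinstance(value, bytes):
        return base64.b64encode(value).decode("ascii")
    if isinstance(value, datetime):
        return value.isoformat()
    return value


stamp = datetime(2024, 1, 15, 12, 30, tzinfo=timezone.utc)
record = {
    "raw": to_json_value(b"\x00\x01"),
    "measured_at": to_json_value(stamp),
    "count": to_json_value(3),
}
print(json.dumps(record))

# Both encodings are reversible.
assert base64.b64decode(record["raw"]) == b"\x00\x01"
assert datetime.fromisoformat(record["measured_at"]) == stamp
```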

Pass strict=True (e.g. metadata.add(..., strict=True)) to raise an error on invalid values; by default, values are coerced leniently.
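
The difference between the two modes can be sketched as follows. This is an illustration of the concept, not sdata's coercion code: the function coerce is hypothetical, and returning None for values that cannot be converted in lenient mode is an assumption.

```python
def coerce(value, dtype: str, strict: bool = False):
    """Convert value to the named dtype; raise in strict mode, else return None."""
    casters = {"int": int, "float": float, "str": str}
    try:
        return casters[dtype](value)
    except (TypeError, ValueError):
        if strict:
            raise ValueError(f"cannot interpret {value!r} as {dtype}") from None
        # Assumption: lenient mode substitutes None for unconvertible values.
        return None


print(coerce("12.5", "float"))  # 12.5
print(coerce("abc", "float"))   # None
try:
    coerce("abc", "float", strict=True)
except ValueError as exc:
    print(exc)  # cannot interpret 'abc' as float
```

Strict mode suits ingestion pipelines where a silently altered value would be worse than a failed run; the lenient default suits interactive use.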

paper
