diff --git a/docs/api.md b/docs/api.md
index 4b9c766..bac8cf2 100644
--- a/docs/api.md
+++ b/docs/api.md
@@ -16,6 +16,10 @@ omitted.
 
 ::: sdata.sclass.dataframe
 
+## sdata.sclass.blob
+
+::: sdata.sclass.blob
+
 ## sdata.imagemeta
 
 ::: sdata.imagemeta
diff --git a/docs/index.md b/docs/index.md
index d14d1ac..6c5310a 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -41,6 +41,8 @@ print(data.metadata.df[["value", "unit", "dtype", "ontology"]])
 - **[Tabular data (DataFrame)](usage/dataframe.md)** — the self-describing table container: column metadata, Parquet/CSV/Arrow/Feather I/O, table-schema validation.
+- **[Image metadata](usage/image-metadata.md)** — embed sdata metadata natively into
+  PNG/JPEG/JP2/GIF/WebP with one API (pure Python, no Pillow needed to read/write it).
 - **[Machine-readable metadata](usage/metadata-jsonld.md)** — JSON-LD / RDF, units (QUDT/UCUM), ontology classes (BFO), provenance, schema validation and signing.
 - **[API reference](api.md)** — the full Python API.
diff --git a/docs/usage/image-metadata.md b/docs/usage/image-metadata.md
new file mode 100644
index 0000000..e288fc7
--- /dev/null
+++ b/docs/usage/image-metadata.md
@@ -0,0 +1,118 @@
+# Images: embedding sdata metadata
+
+[`sdata.sclass.image.Image`][sdata.sclass.image.Image] is a
+[`Blob`][sdata.sclass.blob.Blob] over image content. sdata can write its metadata
+**natively into the image file** — and read it back — across **five containers with
+one API**: PNG, JPEG, JPEG 2000 (`jp2`), GIF and WebP.
+
+The embedding layer [`sdata.imagemeta`][sdata.imagemeta] is **pure Python**
+(standard library only): it needs no third-party tool (no `exiftool`) and — crucially
+— **no Pillow** to read or write the metadata. Pillow is only used to *decode* pixels
+(`img.pil` / `img.to_numpy`) or to *transcode* between formats on `save`.
+
+| Format | Native carrier of the sdata payload        | Marker          |
+| ------ | ------------------------------------------ | --------------- |
+| PNG    | `iTXt` chunk before `IEND`                 | keyword `sdata` |
+| JPEG   | `APP1` segment right after SOI             | `sdata\0` prefix|
+| JP2    | `uuid` box (ISO BMFF) before `jp2c`        | fixed sdata UUID|
+| GIF    | comment extension after the header         | `sdata\0` prefix|
+| WebP   | dedicated RIFF chunk `sdAT`                | FourCC `sdAT`   |
+
+```bash
+pip install pillow  # optional: only needed to decode/transcode pixels
+```
+
+## Round-trip through `Image`
+
+The same three calls work for every supported format — the container is chosen from
+the file suffix on `save`:
+
+```python
+import io
+import PIL.Image
+from sdata.sclass.image import Image
+
+# some image bytes (here a freshly encoded JPEG)
+buf = io.BytesIO()
+PIL.Image.new("RGB", (640, 480), (30, 60, 90)).save(buf, "JPEG")
+
+img = Image.from_bytes("specimen.jpg", buf.getvalue())
+img.metadata.add("operator", "ada", description="who acquired the image")
+img.metadata.add("exposure", 1.5, unit="s", dtype="float")
+
+img.save("specimen.jpg")  # sdata metadata embedded in the APP1 segment
+
+reloaded = Image.from_file("specimen.jpg")
+reloaded.metadata.get("operator").value  # 'ada'
+reloaded.metadata.get("exposure").value  # 1.5
+```
+
+`save` is lossless when the stored bytes already use the target container: the
+metadata is embedded **without re-encoding** the pixels (and without Pillow). Only a
+*format change* (e.g. a PNG saved as `.webp`) transcodes via Pillow:
+
+```python
+png = Image.from_bytes("a.png", png_bytes)
+png.metadata.add("note", "converted")
+png.save("a.webp")  # transcodes to WebP, then embeds the metadata
+```
+
+Reading the embedded metadata never needs Pillow:
+
+```python
+md = Image.from_file("specimen.jpg").embedded_metadata()  # a Metadata, or None
+```
+
+## Inherited `Blob` capabilities
+
+Because `Image` is a `Blob`, every image is also a content-addressable asset
+(see [RFC 0003](../rfc/0003-blob-as-data-foundation.md)):
+
+```python
+img.size                                # content size in bytes
+img.sha256                              # SHA-256 of the content
+img.update_checksum()                   # store the checksum in metadata
+img.verify()                            # check the content against the stored checksum
+img.write("s3://bucket/specimen.jpg")   # fsspec target (needs sdata[blob])
+```
+
+!!! note "Checksum vs. embedded metadata"
+    Embedding metadata **changes the file bytes** (and therefore its hash). If you
+    need a stable content hash, compute it **before** embedding, or hash the decoded
+    pixels — analogous to the data-vs-metadata hash split for `DataFrame`
+    ([RFC 0004](../rfc/0004-dataframe-and-blob.md)).
+
+## Low-level API (`sdata.imagemeta`)
+
+To embed an arbitrary text payload directly into image bytes — independent of
+`Image` and of Pillow — use the façade:
+
+```python
+from sdata import imagemeta
+
+imagemeta.detect_format(data)            # 'png' | 'jpeg' | 'jp2' | 'gif' | 'webp' | None
+out = imagemeta.embed(data, '{"k": 1}')  # format auto-detected; replace semantics
+imagemeta.extract(out)                   # '{"k": 1}' (None if absent/unknown format)
+imagemeta.supported_formats()            # ('png', 'jpeg', 'jp2', 'gif', 'webp')
+```
+
+* **Replace semantics:** embedding again **replaces** the previous sdata payload
+  rather than appending a second one (idempotent).
+* **Lenient reads:** `extract` returns `None` for an unknown format or an image
+  without an sdata payload; `embed` raises
+  [`UnsupportedImageFormatError`][sdata.imagemeta.UnsupportedImageFormatError] for an
+  unsupported format and
+  [`PayloadTooLargeError`][sdata.imagemeta.PayloadTooLargeError] when a JPEG payload
+  exceeds the single-`APP1` limit (~64 KiB).
+* **Extensible registry:** further containers (e.g. TIFF) plug in as two small
+  functions plus one registry entry.
+
+## When to use a sidecar instead
+
+For containers without a native handler, or when metadata must stay external (e.g.
+read-only originals), the JSON-LD **sidecar** remains the complement — see
+[Machine-readable metadata](metadata-jsonld.md). Both approaches share the same
+metadata model; embedding and a sidecar are not mutually exclusive.
+
+The design and the per-format details are specified in
+[RFC 0005 — Native image metadata](../rfc/0005-native-image-metadata.md).
diff --git a/mkdocs.yml b/mkdocs.yml
index 0c82ba7..fd46930 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -63,6 +63,7 @@ nav:
   - Home: index.md
   - Usage:
     - Tabular data (DataFrame): usage/dataframe.md
+    - Image metadata: usage/image-metadata.md
     - Machine-readable metadata: usage/metadata-jsonld.md
   - API reference: api.md
   - RFCs: