Merged
4 changes: 4 additions & 0 deletions docs/api.md
@@ -16,6 +16,10 @@ omitted.

::: sdata.sclass.dataframe

## sdata.sclass.blob

::: sdata.sclass.blob

## sdata.imagemeta

::: sdata.imagemeta
2 changes: 2 additions & 0 deletions docs/index.md
@@ -41,6 +41,8 @@ print(data.metadata.df[["value", "unit", "dtype", "ontology"]])

- **[Tabular data (DataFrame)](usage/dataframe.md)** — the self-describing table
container: column metadata, Parquet/CSV/Arrow/Feather I/O, table-schema validation.
- **[Image metadata](usage/image-metadata.md)** — embed sdata metadata natively into
PNG/JPEG/JP2/GIF/WebP with one API (pure Python, no Pillow needed to read/write it).
- **[Machine-readable metadata](usage/metadata-jsonld.md)** — JSON-LD / RDF, units
(QUDT/UCUM), ontology classes (BFO), provenance, schema validation and signing.
- **[API reference](api.md)** — the full Python API.
118 changes: 118 additions & 0 deletions docs/usage/image-metadata.md
@@ -0,0 +1,118 @@
# Images: embedding sdata metadata

[`sdata.sclass.image.Image`][sdata.sclass.image.Image] is a
[`Blob`][sdata.sclass.blob.Blob] over image content. sdata can write its metadata
**natively into the image file** — and read it back — across **five containers with
one API**: PNG, JPEG, JPEG 2000 (`jp2`), GIF and WebP.

The embedding layer [`sdata.imagemeta`][sdata.imagemeta] is **pure Python**
(standard library only): it needs no third-party tool (no `exiftool`) and — crucially
— **no Pillow** to read or write the metadata. Pillow is only used to *decode* pixels
(`img.pil` / `img.to_numpy`) or to *transcode* between formats on `save`.

| Format | Native carrier of the sdata payload | Marker |
| ------ | ------------------------------------------ | --------------- |
| PNG | `iTXt` chunk before `IEND` | keyword `sdata` |
| JPEG | `APP1` segment right after SOI | `sdata\0` prefix|
| JP2 | `uuid` box (ISO BMFF) before `jp2c` | fixed sdata UUID|
| GIF | comment extension after the header | `sdata\0` prefix|
| WebP | dedicated RIFF chunk `sdAT` | FourCC `sdAT` |

```bash
pip install pillow # optional: only needed to decode/transcode pixels
```
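
To make the PNG row of the table above concrete, here is a minimal standard-library sketch of how an `iTXt` chunk with the keyword `sdata` can be located by walking the chunk list. It is not sdata's implementation: the helper names (`png_chunk`, `iter_chunks`, `read_itxt`) are hypothetical, and the sketch assumes the payload is stored uncompressed.

```python
import struct
import zlib
from typing import Iterator, Optional, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_chunk(ctype: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(ctype + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + ctype + data + struct.pack(">I", crc)


def iter_chunks(png: bytes) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (type, data) for every chunk after the 8-byte signature."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG stream")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        yield png[pos + 4:pos + 8], png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def read_itxt(png: bytes, keyword: str) -> Optional[str]:
    """Return the text of the first uncompressed iTXt chunk with this keyword."""
    for ctype, data in iter_chunks(png):
        if ctype != b"iTXt":
            continue
        key, _, rest = data.partition(b"\x00")
        if key.decode("latin-1") != keyword or len(rest) < 2:
            continue
        if rest[0] != 0:  # compression flag set: out of scope for this sketch
            continue
        rest = rest[2:]  # skip compression flag and compression method
        _language, _, rest = rest.partition(b"\x00")
        _translated, _, text = rest.partition(b"\x00")
        return text.decode("utf-8")
    return None


# Hand-built 1x1 grayscale PNG with an iTXt chunk placed before IEND.
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
idat = zlib.compress(b"\x00\x7f")  # filter byte 0 + one gray pixel
itxt_body = b"sdata" + b"\x00" + b"\x00\x00" + b"\x00" + b"\x00" + b'{"k": 1}'
png = (
    PNG_SIGNATURE
    + png_chunk(b"IHDR", ihdr)
    + png_chunk(b"IDAT", idat)
    + png_chunk(b"iTXt", itxt_body)
    + png_chunk(b"IEND", b"")
)
payload = read_itxt(png, "sdata")
```

The other containers follow the same idea with their own framing (segments, boxes, extension blocks, RIFF chunks); for real files, `sdata.imagemeta` is the supported entry point.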

## Round-trip through `Image`

The same calls work for every supported format (`from_bytes` or `from_file` to load,
`metadata.add` to annotate, `save` to embed); the container is chosen from the file
suffix on `save`:

```python
import io
import PIL.Image
from sdata.sclass.image import Image

# some image bytes (here a freshly encoded JPEG)
buf = io.BytesIO()
PIL.Image.new("RGB", (640, 480), (30, 60, 90)).save(buf, "JPEG")

img = Image.from_bytes("specimen.jpg", buf.getvalue())
img.metadata.add("operator", "ada", description="who acquired the image")
img.metadata.add("exposure", 1.5, unit="s", dtype="float")

img.save("specimen.jpg") # sdata metadata embedded in the APP1 segment

reloaded = Image.from_file("specimen.jpg")
reloaded.metadata.get("operator").value # 'ada'
reloaded.metadata.get("exposure").value # 1.5
```

`save` is lossless when the stored bytes already use the target container: the
metadata is embedded **without re-encoding** the pixels (and without Pillow). Only a
*format change* (e.g. a PNG saved as `.webp`) transcodes via Pillow:

```python
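from pathlib import Path
from sdata.sclass.image import Image

png_bytes = Path("a.png").read_bytes()  # raw bytes of an existing PNG file
png = Image.from_bytes("a.png", png_bytes)
png.metadata.add("note", "converted")
png.save("a.webp")  # format change: transcodes to WebP via Pillow, then embeds the metadata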
```

Reading the embedded metadata never needs Pillow:

```python
md = Image.from_file("specimen.jpg").embedded_metadata() # a Metadata, or None
```

## Inherited `Blob` capabilities

Because `Image` is a `Blob`, every image is also a content-addressable asset
(see [RFC 0003](../rfc/0003-blob-as-data-foundation.md)):

```python
img.size # content size in bytes
img.sha256 # SHA-256 of the content
img.update_checksum() # store the checksum in metadata
img.verify() # check the content against the stored checksum
img.write("s3://bucket/specimen.jpg") # fsspec target (needs sdata[blob])
```

!!! note "Checksum vs. embedded metadata"
    Embedding metadata **changes the file bytes** (and therefore its hash). If you
    need a stable content hash, compute it **before** embedding, or hash the decoded
    pixels, analogous to the data-vs-metadata hash split for `DataFrame`
    ([RFC 0004](../rfc/0004-dataframe-and-blob.md)).
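
The effect described in the note can be reproduced with the standard library alone. The sketch below is an illustration, not part of sdata: `png_chunk` and `pixel_stream_sha256` are hypothetical helpers, adding a text chunk by hand stands in for embedding, and hashing the `IHDR` and `IDAT` chunk data is used as a simple stand-in for hashing decoded pixels (it stays stable only as long as the pixel stream is not re-encoded).

```python
import hashlib
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_chunk(ctype: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(ctype + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + ctype + data + struct.pack(">I", crc)


def pixel_stream_sha256(png: bytes) -> str:
    """Hash only the IHDR and IDAT chunk data, ignoring metadata chunks."""
    digest = hashlib.sha256()
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        ctype = png[pos + 4:pos + 8]
        if ctype in (b"IHDR", b"IDAT"):
            digest.update(png[pos + 8:pos + 8 + length])
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
    return digest.hexdigest()


# A valid 1x1 grayscale PNG built by hand.
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
idat = zlib.compress(b"\x00\x7f")  # filter byte 0 + one gray pixel
plain = (
    PNG_SIGNATURE
    + png_chunk(b"IHDR", ihdr)
    + png_chunk(b"IDAT", idat)
    + png_chunk(b"IEND", b"")
)

# Stand-in for embedding: add a text chunk just before the 12-byte IEND chunk.
text_chunk = png_chunk(b"iTXt", b"sdata\x00\x00\x00\x00\x00" + b'{"k": 1}')
tagged = plain[:-12] + text_chunk + plain[-12:]

# The whole-file hash differs, the pixel-stream hash does not.
file_hash_changed = hashlib.sha256(plain).digest() != hashlib.sha256(tagged).digest()
pixel_hash_stable = pixel_stream_sha256(plain) == pixel_stream_sha256(tagged)
```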

## Low-level API (`sdata.imagemeta`)

To embed an arbitrary text payload directly into image bytes — independent of
`Image` and of Pillow — use the façade:

```python
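from pathlib import Path
from sdata import imagemeta

data = Path("specimen.jpg").read_bytes()  # raw image bytes in any supported format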

imagemeta.detect_format(data) # 'png' | 'jpeg' | 'jp2' | 'gif' | 'webp' | None
out = imagemeta.embed(data, '{"k": 1}') # format auto-detected; replace semantics
imagemeta.extract(out) # '{"k": 1}' (None if absent/unknown format)
imagemeta.supported_formats() # ('png', 'jpeg', 'jp2', 'gif', 'webp')
```

* **Replace semantics:** embedding again **replaces** the previous sdata payload
rather than appending a second one (idempotent).
* **Lenient reads:** `extract` returns `None` for an unknown format or an image
without an sdata payload; `embed` raises
[`UnsupportedImageFormatError`][sdata.imagemeta.UnsupportedImageFormatError] for an
unsupported format and
[`PayloadTooLargeError`][sdata.imagemeta.PayloadTooLargeError] when a JPEG payload
exceeds the single-`APP1` limit (~64 KiB).
* **Extensible registry:** further containers (e.g. TIFF) plug in as two small
functions plus one registry entry.
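
The single-`APP1` limit mentioned above follows from the JPEG segment layout: each segment carries a 16-bit big-endian length that counts its own two bytes, so at most 65533 bytes of data fit, including the `sdata\0` prefix. The following standard-library sketch shows that arithmetic. It is not sdata's implementation: the helper names are hypothetical, a plain `ValueError` stands in for `PayloadTooLargeError`, and replace semantics are left out.

```python
import struct
from typing import Optional

SOI = b"\xff\xd8"              # start-of-image marker
APP1_MARKER = 0xE1
SOS_MARKER = 0xDA              # start of scan: entropy-coded data follows
PREFIX = b"sdata\x00"          # marker prefix from the format table
MAX_SEGMENT_DATA = 0xFFFF - 2  # the 16-bit length field counts its own 2 bytes


def build_app1(payload: bytes) -> bytes:
    """Wrap the payload in one APP1 segment, or fail if it cannot fit."""
    data = PREFIX + payload
    if len(data) > MAX_SEGMENT_DATA:
        raise ValueError(f"payload needs {len(data)} bytes, limit is {MAX_SEGMENT_DATA}")
    return bytes([0xFF, APP1_MARKER]) + struct.pack(">H", len(data) + 2) + data


def insert_after_soi(jpeg: bytes, payload: bytes) -> bytes:
    """Place the APP1 segment directly after the SOI marker."""
    if not jpeg.startswith(SOI):
        raise ValueError("not a JPEG stream")
    return SOI + build_app1(payload) + jpeg[len(SOI):]


def find_payload(jpeg: bytes) -> Optional[bytes]:
    """Scan the header segments for an APP1 segment carrying the prefix."""
    pos = len(SOI)
    while pos + 4 <= len(jpeg) and jpeg[pos] == 0xFF:
        marker = jpeg[pos + 1]
        if marker == SOS_MARKER:
            break  # no length-prefixed header segments beyond this point
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        data = jpeg[pos + 4:pos + 2 + length]
        if marker == APP1_MARKER and data.startswith(PREFIX):
            return data[len(PREFIX):]
        pos += 2 + length
    return None


tiny = SOI + b"\xff\xd9"  # SOI followed by EOI; enough to exercise the framing
tagged = insert_after_soi(tiny, b'{"k": 1}')
largest = MAX_SEGMENT_DATA - len(PREFIX)  # 65527 payload bytes still fit
```

Payloads above that size would have to be split across several segments or moved to a sidecar; in this sketch they are simply rejected.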

## When to use a sidecar instead

For containers without a native handler, or when metadata must stay external (e.g.
read-only originals), the JSON-LD **sidecar** remains the complement — see
[Machine-readable metadata](metadata-jsonld.md). Both approaches share the same
metadata model; embedding and a sidecar are not mutually exclusive.

The design and the per-format details are specified in
[RFC 0005 — Native image metadata](../rfc/0005-native-image-metadata.md).
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -63,6 +63,7 @@ nav:
  - Home: index.md
  - Usage:
      - Tabular data (DataFrame): usage/dataframe.md
      - Image metadata: usage/image-metadata.md
      - Machine-readable metadata: usage/metadata-jsonld.md
  - API reference: api.md
  - RFCs: