feat(sources/itan): add ITAN Global Publishing import adapter #447

Open
mekarpeles wants to merge 2 commits into master from feat/sources-itan

@mekarpeles
Member

Summary

Adds sources/itan/ — the first concrete adapter using the new DataProvider/DataProviderRecord primitives in openlibrary-client.

Closes internetarchive/openlibrary#12091

What this does

ITANProvider streams the ITAN catalog JSONL (67 records of African-authored works) and yields validated OLImportRecord instances. The class body is effectively a one-liner because traversal is fully inherited from JSONLProvider.
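
For readers who have not seen the primitives, here is a rough, self-contained sketch of the pattern. The class names ending in `Sketch`, the `iter_lines` and `iter_records` methods, and the placeholder URL are all hypothetical stand-ins; this is not the olclient.imports API, only an illustration of why the subclass can stay tiny.

```python
import json
from typing import Iterable, Iterator


class JSONLProviderSketch:
    """Hypothetical stand-in for JSONLProvider; not the real olclient API."""

    SOURCE_URL: str = ""
    RECORD_CLASS: type = dict

    def iter_lines(self) -> Iterable[str]:
        # A real provider would stream SOURCE_URL over HTTP here.
        raise NotImplementedError

    def iter_records(self) -> Iterator[object]:
        for line in self.iter_lines():
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            # One JSON object per line, handed to the record class as kwargs.
            yield self.RECORD_CLASS(**json.loads(line))


class ITANProviderSketch(JSONLProviderSketch):
    # The whole adapter: a URL plus a record class (placeholder values).
    SOURCE_URL = "https://example.invalid/itan_catalog.jsonl"
    RECORD_CLASS = dict


class InMemoryITANProvider(ITANProviderSketch):
    """Feeds canned lines so the sketch runs without network access."""

    def iter_lines(self) -> Iterable[str]:
        return [
            '{"title": "Example Title", "isbn_13": ["0"]}',
            "",
            '{"title": "Another Title"}',
        ]


records = list(InMemoryITANProvider().iter_records())
```

In this sketch the traversal logic lives entirely in the base class, so a new source only declares where the data is and how a row is typed.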

ITANRecord handles three cleanup steps found by inspecting the real data:

| Issue | Fix |
| --- | --- |
| ebook_access key not in OL schema | Absorbed silently via extra='allow'; stripped from output |
| isbn_13: ["0"], an ITAN placeholder (35/67 records) | Filtered by ISBN-13 format regex |
| isbn_13: ["978"], a truncated placeholder (11/67 records) | Filtered by same regex |
| " Contemporary Fiction" (leading whitespace in subjects) | Stripped in to_ol_import() |

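The cleanup steps listed above could look roughly like the following on plain dicts. This is a minimal sketch, not the code in record.py: `clean_itan_row` is a hypothetical helper, and the 13-digit regex is an assumed stand-in for whatever pattern import.schema.json actually specifies.

```python
import re
from typing import Any, Dict

# Assumption: a plain 13-digit check stands in for the schema's isbn_13 pattern.
ISBN_13_RE = re.compile(r"\d{13}")


def clean_itan_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper mirroring the three cleanup steps on a raw row."""
    # 1. Drop ebook_access, which is not part of the OL import schema.
    cleaned = {key: value for key, value in row.items() if key != "ebook_access"}

    # 2. Keep only well-formed ISBN-13 values; "0" and "978" fail the check.
    isbns = [i for i in (cleaned.pop("isbn_13", None) or []) if ISBN_13_RE.fullmatch(i)]
    if isbns:
        cleaned["isbn_13"] = isbns

    # 3. Strip surrounding whitespace from subjects and discard empty ones.
    subjects = [s.strip() for s in (cleaned.pop("subjects", None) or []) if s.strip()]
    if subjects:
        cleaned["subjects"] = subjects

    return cleaned


example = clean_itan_row(
    {
        "title": "Example Title",
        "ebook_access": "public",
        "isbn_13": ["0", "9781234567897"],
        "subjects": [" Contemporary Fiction"],
    }
)
```

Note that this sketch omits a field entirely when nothing valid remains; the actual record model may represent that case differently (for example as None).
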
Architecture

sources/
  itan/
    record.py    ITANRecord(DataProviderRecord)  — typed source schema + cleanup
    provider.py  ITANProvider(JSONLProvider)     — one-liner, URL + RECORD_CLASS
  tests/
    test_itan.py  22 tests (unit + live end-to-end against real ITAN URL)

Tests

22 passed in 0.19s

Includes a live end-to-end fixture that fetches all 67 records from the real ITAN URL and cross-validates every output record against import.schema.json. Marked pytest.skip if network is unavailable (CI-safe).
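
The skip-on-network-failure idea can be sketched with the standard library alone. This is an analogue rather than the fixture in test_itan.py: it raises `unittest.SkipTest` instead of calling pytest.skip so the example has no third-party dependency, and `fetch_lines` and `fetch_or_skip` are hypothetical names.

```python
import unittest
import urllib.request
from typing import Callable, List


def fetch_lines(url: str, timeout: float = 10.0) -> List[str]:
    """Download a text resource and return its lines."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8").splitlines()


def fetch_or_skip(fetch: Callable[[], List[str]]) -> List[str]:
    """Run a fetch callable; convert network failures into a test skip."""
    try:
        return fetch()
    except OSError as exc:
        # URLError, DNS failures, and socket timeouts all derive from OSError.
        raise unittest.SkipTest(f"network unavailable: {exc}") from exc
```

A live test would call something like `fetch_or_skip(lambda: fetch_lines(url))` once per module and reuse the result, so an offline CI runner reports a skip rather than a failure.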

Dependency

Requires internetarchive/openlibrary-client#435 to be merged first (olclient.imports primitives).

First concrete DataProvider/DataProviderRecord implementation using the
primitives from openlibrary-client#435.

Sources the ITAN catalog JSONL from:
  https://github.com/ITANigp/itan-ebook-backend (feature/open-library branch)

ITANRecord (DataProviderRecord subclass):
- Fields map ITAN's pre-OL-formatted schema (they already use OL field names)
- extra='allow' absorbs ebook_access and any future ITAN-specific keys
- to_ol_import() applies three cleanup steps:
  - Strips leading/trailing whitespace from subjects (" Contemporary Fiction")
  - Filters malformed isbn_13 values using the schema pattern (ITAN uses "0"
    and "978" as placeholders — 35 and 11 occurrences respectively in 67 records)
  - Drops ebook_access (not in OL import schema; would be silently ignored by API)

ITANProvider (JSONLProvider subclass):
- One-liner: declares SOURCE_URL and RECORD_CLASS; traversal is inherited

Tests (22):
- Unit tests for all cleanup/skip logic with synthetic fixtures
- Live end-to-end test fetching all 67 records from real ITAN URL
- Every output record cross-validated against import.schema.json
- pytest.skip if network unavailable (CI-safe)

Closes internetarchive/openlibrary#12091
Depends on internetarchive/openlibrary-client#435
Copilot AI review requested due to automatic review settings May 7, 2026 00:32

Copilot AI left a comment

Pull request overview

Adds a new sources/itan/ adapter that uses olclient.imports DataProvider primitives to stream ITAN Global Publishing’s JSONL catalog and emit cleaned/validated OLImportRecord objects for Open Library imports.

Changes:

  • Introduces ITANRecord to validate ITAN source rows and normalize fields (subjects trimming, placeholder ISBN filtering, dropping non-schema fields).
  • Adds ITANProvider as a JSONLProvider subclass targeting the ITAN catalog JSONL URL.
  • Adds a comprehensive pytest suite covering record cleanup and a live end-to-end run with JSON Schema validation.

Reviewed changes

Copilot reviewed 3 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| sources/itan/record.py | Defines the ITAN source record model and to_ol_import() cleanup/transformation into OLImportRecord. |
| sources/itan/provider.py | Defines the JSONL streaming provider configuration for the ITAN catalog. |
| sources/tests/test_itan.py | Adds unit tests for record cleanup plus a live end-to-end provider test with JSON Schema validation. |
| sources/__init__.py | Package marker file. |
| sources/itan/__init__.py | Package marker file. |
| sources/tests/__init__.py | Package marker file. |

Comment on lines +18 to +20
import jsonschema
import pytest

Comment on lines +191 to +193
"""Fetch all ITAN records once for the module; skip if network unavailable."""
try:
    return list(ITANProvider().iter_ol_records())
Comment thread sources/itan/provider.py
TITLE = "ITAN Global Publishing"
SOURCE_URL = (
    "https://raw.githubusercontent.com/ITANigp/itan-ebook-backend"
    "/refs/heads/feature/open-library/data/itan_catalog.jsonl"
Comment on lines +181 to +187
assert result.subjects is None


# ---------------------------------------------------------------------------
# ITANProvider — live end-to-end over real data
# ---------------------------------------------------------------------------

Development

Successfully merging this pull request may close these issues.

Inquiry: Importing ITAN Global Publishing Catalog to Open Library

2 participants