feat(sources/itan): add ITAN Global Publishing import adapter #447
Open
mekarpeles wants to merge 2 commits into
First concrete DataProvider/DataProviderRecord implementation using the primitives from openlibrary-client#435.

Sources the ITAN catalog JSONL from: https://github.com/ITANigp/itan-ebook-backend (feature/open-library branch)

ITANRecord (DataProviderRecord subclass):
- Fields map ITAN's pre-OL-formatted schema (they already use OL field names)
- extra='allow' absorbs ebook_access and any future ITAN-specific keys
- to_ol_import() applies three cleanup steps:
  - Strips leading/trailing whitespace from subjects (" Contemporary Fiction")
  - Filters malformed isbn_13 values using the schema pattern (ITAN uses "0" and "978" as placeholders: 35 and 11 occurrences respectively in 67 records)
  - Drops ebook_access (not in OL import schema; would be silently ignored by API)

ITANProvider (JSONLProvider subclass):
- One-liner: declares SOURCE_URL and RECORD_CLASS; traversal is inherited

Tests (22):
- Unit tests for all cleanup/skip logic with synthetic fixtures
- Live end-to-end test fetching all 67 records from real ITAN URL
- Every output record cross-validated against import.schema.json
- pytest.skip if network unavailable (CI-safe)

Closes internetarchive/openlibrary#12091
Depends on internetarchive/openlibrary-client#435
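For readers who want to see the three cleanup steps in one place, here is a minimal standalone sketch. It is not the PR's `ITANRecord.to_ol_import()`: the helper name `clean_itan_row` is hypothetical, it works on plain dicts rather than the `olclient.imports` models, and the 13-digit ISBN pattern is an assumption (the real pattern is whatever `import.schema.json` defines).

```python
import re

# Assumption: a well-formed ISBN-13 is exactly 13 digits. The actual pattern
# comes from import.schema.json and may be stricter or looser than this.
ISBN_13_PATTERN = re.compile(r"\d{13}")


def clean_itan_row(row: dict) -> dict:
    """Hypothetical helper: apply the three cleanup steps to one ITAN row."""
    cleaned = dict(row)  # shallow copy so the caller's row is not mutated

    # 1. Strip leading/trailing whitespace from subjects; drop empty strings.
    subjects = [s.strip() for s in cleaned.get("subjects") or []]
    subjects = [s for s in subjects if s]
    if subjects:
        cleaned["subjects"] = subjects
    else:
        cleaned.pop("subjects", None)

    # 2. Keep only well-formed ISBN-13 values; placeholders such as
    #    "0" and "978" fail the pattern and are filtered out.
    isbns = [i for i in cleaned.get("isbn_13") or [] if ISBN_13_PATTERN.fullmatch(i)]
    if isbns:
        cleaned["isbn_13"] = isbns
    else:
        cleaned.pop("isbn_13", None)

    # 3. Drop ebook_access, which is not part of the OL import schema.
    cleaned.pop("ebook_access", None)
    return cleaned
```

In this sketch an emptied field is removed from the dict entirely; the PR's model may represent the same case differently (for example as `None`).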
Pull request overview
Adds a new sources/itan/ adapter that uses olclient.imports DataProvider primitives to stream ITAN Global Publishing’s JSONL catalog and emit cleaned/validated OLImportRecord objects for Open Library imports.
Changes:
- Introduces `ITANRecord` to validate ITAN source rows and normalize fields (subjects trimming, placeholder ISBN filtering, dropping non-schema fields).
- Adds `ITANProvider` as a `JSONLProvider` subclass targeting the ITAN catalog JSONL URL.
- Adds a comprehensive pytest suite covering record cleanup and a live end-to-end run with JSON Schema validation.
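The "skip when the network is down" behaviour of the live end-to-end run can be sketched with the standard library alone. This is an illustration, not the PR's fixture: `fetch_or_skip` is a hypothetical helper, and it raises `unittest.SkipTest` where the PR's suite uses `pytest.skip`.

```python
import unittest
from typing import Callable, Iterable, List


def fetch_or_skip(fetch: Callable[[], Iterable[dict]]) -> List[dict]:
    """Hypothetical helper: materialize fetch() or skip the calling test.

    The list() call sits inside the try block so that a lazily streaming
    fetch (a generator) also has its network errors caught here.
    """
    try:
        return list(fetch())
    except OSError as exc:
        # urllib.error.URLError and socket-level failures subclass OSError.
        raise unittest.SkipTest(f"network unavailable: {exc}") from exc
```

A module-scoped fixture could wrap the provider call in such a helper so the records are fetched once and every dependent test is skipped, rather than failed, when the URL is unreachable.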
Reviewed changes
Copilot reviewed 3 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| sources/itan/record.py | Defines the ITAN source record model and to_ol_import() cleanup/transformation into OLImportRecord. |
| sources/itan/provider.py | Defines the JSONL streaming provider configuration for the ITAN catalog. |
| sources/tests/test_itan.py | Adds unit tests for record cleanup plus a live end-to-end provider test with JSON Schema validation. |
| sources/__init__.py | Package marker file. |
| sources/itan/__init__.py | Package marker file. |
| sources/tests/__init__.py | Package marker file. |
Comment on lines +18 to +20

    import jsonschema
    import pytest

Comment on lines +191 to +193

    """Fetch all ITAN records once for the module; skip if network unavailable."""
    try:
        return list(ITANProvider().iter_ol_records())

    TITLE = "ITAN Global Publishing"
    SOURCE_URL = (
        "https://raw.githubusercontent.com/ITANigp/itan-ebook-backend"
        "/refs/heads/feature/open-library/data/itan_catalog.jsonl"

Comment on lines +181 to +187

    assert result.subjects is None


    # ---------------------------------------------------------------------------
    # ITANProvider — live end-to-end over real data
    # ---------------------------------------------------------------------------

Summary
Adds `sources/itan/`, the first concrete adapter using the new DataProvider/DataProviderRecord primitives in `openlibrary-client`.

Closes internetarchive/openlibrary#12091

What this does

`ITANProvider` streams the ITAN catalog JSONL (67 records of African-authored works) and yields validated `OLImportRecord` instances. It's a one-liner: traversal is fully inherited from `JSONLProvider`.

`ITANRecord` handles three cleanup steps found by inspecting the real data:

- `ebook_access` key not in OL schema: absorbed by `extra='allow'`; stripped from output
- `isbn_13: ["0"]`: ITAN placeholder (35/67 records)
- `isbn_13: ["978"]`: truncated placeholder (11/67 records)
- `" Contemporary Fiction"`: leading whitespace in subjects, handled in `to_ol_import()`

Architecture
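As a rough illustration of the provider/record split described above, the sketch below uses hypothetical stand-in classes (`SketchRecord`, `SketchJSONLProvider`, and their ITAN subclasses). These are not the real `olclient.imports` classes; only the names `SOURCE_URL`, `RECORD_CLASS`, `to_ol_import()` and `iter_ol_records()` come from this PR. The in-memory `lines` argument and the placeholder URL are assumptions made so the example runs without network access.

```python
import json
from typing import Iterable, Iterator, Optional, Type


class SketchRecord:
    """Hypothetical stand-in for DataProviderRecord: wraps one source row."""

    def __init__(self, row: dict) -> None:
        self.row = row

    def to_ol_import(self) -> Optional[dict]:
        # Subclasses override this to clean the row; None means "skip it".
        return self.row


class SketchJSONLProvider:
    """Hypothetical stand-in for JSONLProvider: owns all traversal logic."""

    SOURCE_URL: str = ""
    RECORD_CLASS: Type[SketchRecord] = SketchRecord

    def __init__(self, lines: Iterable[str]) -> None:
        # Assumption: lines are injected here; a real provider would
        # stream them from SOURCE_URL instead.
        self._lines = lines

    def iter_ol_records(self) -> Iterator[dict]:
        for line in self._lines:
            if not line.strip():
                continue  # tolerate blank lines in the JSONL file
            record = self.RECORD_CLASS(json.loads(line)).to_ol_import()
            if record is not None:
                yield record


class SketchITANRecord(SketchRecord):
    def to_ol_import(self) -> Optional[dict]:
        out = dict(self.row)
        out.pop("ebook_access", None)  # one of the cleanup steps, for brevity
        return out


class SketchITANProvider(SketchJSONLProvider):
    # The subclass only declares configuration; traversal is inherited.
    SOURCE_URL = "https://example.invalid/itan_catalog.jsonl"  # placeholder URL
    RECORD_CLASS = SketchITANRecord
```

Keeping traversal in the base class means a new source mostly needs a record class with its own cleanup rules plus a two-attribute provider.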
Tests
Includes a live end-to-end fixture that fetches all 67 records from the real ITAN URL and cross-validates every output record against `import.schema.json`. Marked `pytest.skip` if network is unavailable (CI-safe).

Dependency

Requires `internetarchive/openlibrary-client#435` to be merged first (`olclient.imports` primitives).