
6 corrupt PDFs in /drugsatfda_docs/ referenced by drugsfda bulk JSON (broken XRef tables) #220

Description

@billdenney

Summary

The drug/drugsfda bulk JSON download (https://download.open.fda.gov/drug/drugsfda/drug-drugsfda-0001-of-0001.json.zip) references 6 PDFs under accessdata.fda.gov/drugsatfda_docs/ that are served with a corrupt cross-reference table. The corruption appears to date from upload time: the Internet Archive's Wayback Machine holds byte-identical snapshots showing the same damage. None of the PDF tools I tried could recover any pages.

Affected files

| # | Application | Sponsor | Drug (brand) | Doc type | Size | URL |
|---|---|---|---|---|---|---|
| 1 | NDA019044 | GE Healthcare | Indium In-111 Oxyquinoline (radiopharmaceutical) | Review | 21 MB | https://www.accessdata.fda.gov/drugsatfda_docs/nda/pre96/019044_s010_s011_s013_indium%20in-111%20oxyquinoline.pdf |
| 2 | ANDA079075 | Watson Labs | Fentanyl Citrate tablets | Review | 30 MB | https://www.accessdata.fda.gov/drugsatfda_docs/anda/2011/079075Orig1s000.pdf |
| 3 | NDA020944 | Haleon US Holdings | Children's Advil / Junior Strength Advil (ibuprofen, OTC) | Review | 2.1 MB | https://www.accessdata.fda.gov/drugsatfda_docs/nda/2004/020944Orig1s002_s003.pdf |
| 4 | NDA021119 | Bausch & Lomb Ireland | Visudyne (verteporfin, ophthalmic photodynamic therapy) | Label | 70 KB | https://www.accessdata.fda.gov/drugsatfda_docs/label/2001/21119s1lbl.pdf |
| 5 | ANDA075548 | Dr Reddy's Labs | Microgestin / Microgestin Fe (oral contraceptive) | Review | 4.0 MB | https://www.accessdata.fda.gov/drugsatfda_docs/anda/2001/075548Orig1s000.pdf |
| 6 | ANDA076184 | Teva Pharmaceuticals | Alendronate Sodium (generic Fosamax) | Review | 110 MB | https://www.accessdata.fda.gov/drugsatfda_docs/nda/2009/076184_original_approval_pkg.pdf |

Reproduction

# Pick any of the URLs above; same failure pattern on all 6.
curl -O https://www.accessdata.fda.gov/drugsatfda_docs/nda/2009/076184_original_approval_pkg.pdf
qpdf --check 076184_original_approval_pkg.pdf

Output (representative — same shape on every file):

WARNING: 076184_original_approval_pkg.pdf: file is damaged
WARNING: 076184_original_approval_pkg.pdf: can't find startxref
WARNING: 076184_original_approval_pkg.pdf: Attempting to reconstruct cross-reference table
ERROR: 076184_original_approval_pkg.pdf: unable to find /Root dictionary

Tools I tried (all failed to recover any pages)

| Tool | Version | Result |
|---|---|---|
| qpdf --object-streams=generate | 11.9.0 | Either exit-code 2/3 with no output, or a 400-600 byte stub with no content |
| mutool clean -gggg | 1.23.10 | exit-code 0 with a 234-487 byte stub; qpdf --check of the output also fails |
| gs -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress | 10.02.1 | Looks superficially successful (1-page PDF) but stderr says "Couldn't initialise file. Output may be incorrect. No pages will be processed (FirstPage > LastPage)" and pdftotext extracts zero characters |
| pikepdf.open(...).save(...) | 10.5.1 | exit-code 1, no output |
| Chromium PDFium (via Docling 2.92, used by browsers' built-in viewer) | 2026 | PdfiumError: Failed to load document (PDFium: Data format error) |

Source-side persistence

I confirmed these are not transient or CDN-related:

  • Re-fetched on 2026-05-06 with a fresh request (browser User-Agent, fresh TCP connection, no conditional headers, polite rate of ~1 req/s). The response is byte-identical to the previously stored copy: same SHA-256, same byte length, same qpdf --check failure.
  • The Internet Archive Wayback Machine has snapshots of every one of these URLs. Each archived copy is byte-identical to today's served version, so the corruption was already present when each file was first archived and, most likely, when it was first uploaded.
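
For anyone who wants to repeat the byte-identity comparison, here is a minimal sketch using only the Python standard library. The function names and the file paths in the usage note are hypothetical, not part of any tool mentioned above; the sketch assumes both copies (for example, today's download and a Wayback Machine snapshot) have already been saved to disk.

```python
import hashlib


def fingerprint(path: str, chunk_size: int = 1 << 20) -> tuple[str, int]:
    """Return (SHA-256 hex digest, byte length) of the file at `path`.

    The file is read in chunks so that a 110 MB PDF is never held in
    memory all at once.
    """
    digest = hashlib.sha256()
    length = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            length += len(chunk)
    return digest.hexdigest(), length


def byte_identical(path_a: str, path_b: str) -> bool:
    """True when both files have the same SHA-256 digest and byte length."""
    return fingerprint(path_a) == fingerprint(path_b)
```

Calling `byte_identical("served.pdf", "wayback.pdf")` (hypothetical file names) returns `True` only when the two downloads match in both digest and length, which is the check described in the bullets above.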

Suggested resolution

If the original CDER submission packages are still on file with the review divisions, re-uploading them from the source would replace the broken public copies. The corruption does not appear to be in the underlying document content: the page streams may still be physically present in the file. What is missing or damaged is specifically the startxref pointer, the trailer dictionary, and the /Root catalog, which prevents any PDF reader from locating the page tree.
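
A rough way to see which of these structural pieces survive is to scan the raw bytes for the corresponding keywords. The sketch below is a heuristic of my own, not what qpdf or any other tool does internally: the function name and the 2048-byte tail window are assumptions, and matches inside binary stream data can produce false positives, so treat the counts as hints only.

```python
import re


def scan_pdf_markers(data: bytes, tail_size: int = 2048) -> dict:
    """Report which basic PDF structural markers appear in raw file bytes."""
    tail = data[-tail_size:]
    # A valid file ends with "startxref <offset> %%EOF"; with incremental
    # updates there can be several, and the last one is the active one.
    offsets = re.findall(rb"startxref\s+(\d+)", tail)
    offset = int(offsets[-1]) if offsets else None
    return {
        "has_header": data.startswith(b"%PDF-"),
        "has_eof_marker": b"%%EOF" in tail,
        "startxref_offset": offset,
        "startxref_in_range": offset is not None and offset < len(data),
        "has_trailer_keyword": b"trailer" in data,
        "has_root_key": b"/Root" in data,
        # "N G obj" headers and "endstream" keywords hint that object and
        # page-stream data may still be physically present.
        "object_headers": len(re.findall(rb"\b\d+\s+\d+\s+obj\b", data)),
        "stream_ends": data.count(b"endstream"),
    }
```

On a file matching the failure pattern above, one would expect `startxref_offset` to be `None` and `has_root_key` to be `False`, while non-zero `object_headers` and `stream_ends` counts would be consistent with the page streams still being in the file.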

Why this matters

These 6 files are referenced from the drug/drugsfda bulk JSON dataset's application_docs[].url field, so any consumer of the openFDA bulk data that follows those URLs (mirroring tools, RAG pipelines, FOIA archivists) hits the same wall. The Visudyne label (#4 above, NDA021119) in particular is the only public version of that label revision; for the others, alternate review documents from the same applications appear to be intact and parseable.
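
Until the files are replaced, consumers of the bulk data may want to skip the affected URLs. The following is a minimal sketch of one way to do that. It assumes only that document entries live in lists stored under an `application_docs` key and carry a `url` string, as the field path above suggests; it walks the whole JSON tree rather than assuming where that key is nested. The function names are hypothetical.

```python
from typing import Any, Iterable, Iterator


def iter_doc_urls(node: Any) -> Iterator[str]:
    """Yield every application_docs[].url found anywhere in parsed JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "application_docs" and isinstance(value, list):
                for doc in value:
                    if isinstance(doc, dict) and isinstance(doc.get("url"), str):
                        yield doc["url"]
            else:
                yield from iter_doc_urls(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_doc_urls(item)


def usable_doc_urls(dataset: Any, known_bad: Iterable[str]) -> list[str]:
    """Return document URLs from the dataset, minus the known-corrupt ones."""
    skip = set(known_bad)
    return [url for url in iter_doc_urls(dataset) if url not in skip]
```

In practice `dataset` would be the result of `json.load` on the extracted bulk file, and `known_bad` would hold the six URLs from the table of affected files.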

Happy to provide the full per-file diagnostic logs (qpdf, mutool, ghostscript, pikepdf invocations + their stderr) if useful.
