Summary
The drug/drugsfda bulk JSON download (https://download.open.fda.gov/drug/drugsfda/drug-drugsfda-0001-of-0001.json.zip) references 6 PDFs in accessdata.fda.gov/drugsatfda_docs/ that are served with a corrupt cross-reference table. They were broken at upload time — the Internet Archive's Wayback Machine has byte-identical snapshots showing the same corruption — and no PDF tooling I'm aware of can recover them.
Affected files
Reproduction
# Pick any of the URLs above; same failure pattern on all 6.
curl -O https://www.accessdata.fda.gov/drugsatfda_docs/nda/2009/076184_original_approval_pkg.pdf
qpdf --check 076184_original_approval_pkg.pdf
Output (representative — same shape on every file):
WARNING: 076184_original_approval_pkg.pdf: file is damaged
WARNING: 076184_original_approval_pkg.pdf: can't find startxref
WARNING: 076184_original_approval_pkg.pdf: Attempting to reconstruct cross-reference table
ERROR: 076184_original_approval_pkg.pdf: unable to find /Root dictionary
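For readers without qpdf installed, a minimal Python sketch of the same first-pass check follows. It only looks for the `%PDF-` header near the start and the `startxref` / `%%EOF` markers near the end, which is where a reader begins locating the cross-reference table. The function name `trailer_markers` and the 2048-byte tail window are illustrative assumptions, not part of any tool named in this report.

```python
import sys


def trailer_markers(data: bytes, tail_size: int = 2048) -> dict:
    """Report which structural markers are present in a PDF's raw bytes.

    Assumption: looking at the last `tail_size` bytes is enough to find
    `startxref` and `%%EOF` in a well-formed file.
    """
    tail = data[-tail_size:]
    return {
        # The header is expected at (or very near) the start of the file.
        "has_header": b"%PDF-" in data[:1024],
        # Readers start from the end: `startxref` gives the xref offset.
        "has_startxref": b"startxref" in tail,
        "has_eof": b"%%EOF" in tail,
    }


if __name__ == "__main__":
    # Usage: python check_tail.py file1.pdf [file2.pdf ...]
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            print(path, trailer_markers(fh.read()))
```

A file that reports `has_startxref: False` would be consistent with the "can't find startxref" warning above, though this sketch is far less thorough than `qpdf --check`.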
Tools I tried (all failed to recover any pages)
| Tool | Version | Result |
| --- | --- | --- |
| qpdf --object-streams=generate | 11.9.0 | Either exit-code 2/3 with no output, or a 400-600 byte stub with no content |
| mutool clean -gggg | 1.23.10 | exit-code 0 with a 234-487 byte stub; qpdf --check of the output also fails |
| gs -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress | 10.02.1 | Looks superficially successful (1-page PDF) but stderr says "Couldn't initialise file. Output may be incorrect. No pages will be processed (FirstPage > LastPage)" and pdftotext extracts zero characters |
| pikepdf.open(...).save(...) | 10.5.1 | exit-code 1, no output |
| Chromium PDFium (via Docling 2.92, used by browsers' built-in viewer) | 2026 | PdfiumError: Failed to load document (PDFium: Data format error) |
Source-side persistence
I confirmed these are not transient or CDN-related:
- Re-fetched on 2026-05-06 with a fresh request (browser User-Agent, fresh TCP connection, no conditional headers, ~1 req/s polite rate). The response is byte-identical to the previously stored copy: same SHA-256, same byte length, same qpdf --check failure.
- Internet Archive Wayback Machine has snapshots of every one of these URLs. Each archived copy is byte-identical to today's served version — i.e., the corruption was already present when the file was first uploaded.
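The byte-identity comparison described in the two points above can be reproduced with a short sketch like the following. It fingerprints each copy by length and SHA-256 and compares the results. The helper names `fingerprint` and `same_bytes` are hypothetical, and the script assumes both copies (for example, a fresh download and a Wayback Machine download) have already been saved to disk.

```python
import hashlib
import sys
from typing import Tuple


def fingerprint(data: bytes) -> Tuple[int, str]:
    """Return (byte length, SHA-256 hex digest) for a blob of bytes."""
    return len(data), hashlib.sha256(data).hexdigest()


def same_bytes(a: bytes, b: bytes) -> bool:
    """True when two blobs have the same length and the same SHA-256."""
    return fingerprint(a) == fingerprint(b)


if __name__ == "__main__":
    # Usage: python compare.py live_copy.pdf archived_copy.pdf
    if len(sys.argv) == 3:
        with open(sys.argv[1], "rb") as f1, open(sys.argv[2], "rb") as f2:
            live, archived = f1.read(), f2.read()
        print("live:    ", fingerprint(live))
        print("archived:", fingerprint(archived))
        print("identical:", same_bytes(live, archived))
```

Matching fingerprints across the live and archived copies would point away from a transient serving problem, as argued above.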
Suggested resolution
If the original CDER submission packages are still on file with the review divisions, re-uploading them from the source would replace the broken public copies. The corruption is not in the underlying document content — the page streams may still be physically present in the file — it's specifically the startxref pointer + trailer dictionary + /Root catalog that are missing or damaged, which prevents any PDF reader from locating the page tree.
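Whether page content is still physically present can be estimated with a rough byte-level survey, sketched below. It counts `N G obj` headers, `endstream` keywords, and `/Type /Page` and `/Type /Catalog` dictionaries in the raw file. This is a heuristic under stated assumptions: binary stream data can produce false matches, and objects stored inside compressed object streams will not be seen at all. The name `survey` and the regular expressions are illustrative only.

```python
import re
import sys

# "12 0 obj" style object headers (object number, generation, keyword).
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")
# Page dictionaries; the lookahead excludes "/Type /Pages" (page tree nodes).
PAGE_RE = re.compile(rb"/Type\s*/Page(?![A-Za-z])")
# The document catalog, which the trailer's /Root entry should point to.
CATALOG_RE = re.compile(rb"/Type\s*/Catalog\b")


def survey(data: bytes) -> dict:
    """Count structural landmarks in raw PDF bytes (heuristic, not a parser)."""
    return {
        "objects": len(OBJ_RE.findall(data)),
        "streams": data.count(b"endstream"),
        "page_dicts": len(PAGE_RE.findall(data)),
        "catalogs": len(CATALOG_RE.findall(data)),
    }


if __name__ == "__main__":
    # Usage: python survey.py file1.pdf [file2.pdf ...]
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            print(path, survey(fh.read()))
```

A non-zero `streams` count alongside zero `catalogs` would be consistent with the diagnosis above (content present, document root missing), but it would not by itself prove the pages are recoverable.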
Why this matters
These 6 files are referenced from the drug/drugsfda bulk JSON dataset's application_docs[].url field, so any consumer of the openFDA bulk data that follows those URLs (mirroring tools, RAG pipelines, FOIA archivists) hits the same wall. The Visudyne label (#4 above, NDA021119) in particular is the only public version of that label revision; for the others, alternate review documents from the same applications appear to be intact and parseable.
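To illustrate how a downstream consumer reaches these URLs, here is a minimal sketch that walks an unzipped copy of the bulk JSON and yields every `url` found under an `application_docs` list. Because the exact nesting above `application_docs` is not spelled out in this report, the walker searches recursively rather than assuming a fixed path. The function name `iter_doc_urls` is hypothetical.

```python
import json
import sys
from typing import Any, Iterator


def iter_doc_urls(node: Any) -> Iterator[str]:
    """Yield every application_docs[].url string found anywhere in `node`."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "application_docs" and isinstance(value, list):
                for doc in value:
                    # Skip malformed entries instead of raising.
                    if isinstance(doc, dict) and isinstance(doc.get("url"), str):
                        yield doc["url"]
            else:
                yield from iter_doc_urls(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_doc_urls(item)


if __name__ == "__main__":
    # Usage: python list_docs.py drug-drugsfda-0001-of-0001.json
    if len(sys.argv) == 2:
        with open(sys.argv[1], "r", encoding="utf-8") as fh:
            dataset = json.load(fh)
        for url in iter_doc_urls(dataset):
            if url.lower().endswith(".pdf"):
                print(url)
```

Any pipeline built along these lines would fetch the affected files as a matter of course, which is why a source-side fix helps more than per-consumer workarounds.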
Happy to provide the full per-file diagnostic logs (qpdf, mutool, ghostscript, pikepdf invocations + their stderr) if useful.