6 corrupt PDFs in /drugsatfda_docs/ referenced by drugsfda bulk JSON (broken XRef tables) #220

## Summary

The `drug/drugsfda` bulk JSON download (`https://download.open.fda.gov/drug/drugsfda/drug-drugsfda-0001-of-0001.json.zip`) references 6 PDFs under `accessdata.fda.gov/drugsatfda_docs/` that are served with a corrupt cross-reference table. They appear to have been broken at upload time: the Internet Archive's Wayback Machine holds byte-identical snapshots showing the same corruption. None of the PDF tools I tried (listed below) could recover any pages.

## Affected files

| # | Application | Sponsor | Drug (brand) | Doc type | Size | URL |
|---|---|---|---|---|---|---|
| 1 | NDA019044 | GE Healthcare | Indium In-111 Oxyquinoline (radiopharmaceutical) | Review | 21 MB | https://www.accessdata.fda.gov/drugsatfda_docs/nda/pre96/019044_s010_s011_s013_indium%20in-111%20oxyquinoline.pdf |
| 2 | ANDA079075 | Watson Labs | Fentanyl Citrate tablets | Review | 30 MB | https://www.accessdata.fda.gov/drugsatfda_docs/anda/2011/079075Orig1s000.pdf |
| 3 | NDA020944 | Haleon US Holdings | Children's Advil / Junior Strength Advil (ibuprofen, OTC) | Review | 2.1 MB | https://www.accessdata.fda.gov/drugsatfda_docs/nda/2004/020944Orig1s002_s003.pdf |
| 4 | NDA021119 | Bausch & Lomb Ireland | Visudyne (verteporfin, ophthalmic photodynamic therapy) | Label | 70 KB | https://www.accessdata.fda.gov/drugsatfda_docs/label/2001/21119s1lbl.pdf |
| 5 | ANDA075548 | Dr Reddy's Labs | Microgestin / Microgestin Fe (oral contraceptive) | Review | 4.0 MB | https://www.accessdata.fda.gov/drugsatfda_docs/anda/2001/075548Orig1s000.pdf |
| 6 | ANDA076184 | Teva Pharmaceuticals | Alendronate Sodium (generic Fosamax) | Review | 110 MB | https://www.accessdata.fda.gov/drugsatfda_docs/nda/2009/076184_original_approval_pkg.pdf |

## Reproduction

```bash
# Pick any of the URLs above; same failure pattern on all 6.
curl -O https://www.accessdata.fda.gov/drugsatfda_docs/nda/2009/076184_original_approval_pkg.pdf
qpdf --check 076184_original_approval_pkg.pdf
```

Output (representative; the other five files fail in the same way):

```
WARNING: 076184_original_approval_pkg.pdf: file is damaged
WARNING: 076184_original_approval_pkg.pdf: can't find startxref
WARNING: 076184_original_approval_pkg.pdf: Attempting to reconstruct cross-reference table
ERROR: 076184_original_approval_pkg.pdf: unable to find /Root dictionary
```

## Tools I tried (all failed to recover any pages)

| Tool | Version | Result |
|---|---|---|
| `qpdf --object-streams=generate` | 11.9.0 | Either exits with code 2 or 3 and writes no output, or writes a 400-600 byte stub with no content |
| `mutool clean -gggg` | 1.23.10 | Exits with code 0 and writes a 234-487 byte stub; `qpdf --check` on that output also fails |
| `gs -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress` | 10.02.1 | Looks superficially successful (writes a 1-page PDF), but stderr says `"Couldn't initialise file. Output may be incorrect. No pages will be processed (FirstPage > LastPage)"` and `pdftotext` extracts zero characters from the result |
| `pikepdf.open(...).save(...)` | 10.5.1 | Exit code 1, no output file |
| PDFium, the engine behind Chromium's built-in PDF viewer (invoked via Docling 2.92) | 2026 | `PdfiumError: Failed to load document (PDFium: Data format error)` |

## Source-side persistence

I confirmed these are not transient or CDN-related:

- **Re-fetched on 2026-05-06** with a fresh request (browser User-Agent, fresh TCP connection, no conditional headers, polite rate of ~1 req/s). Each response is byte-identical to the copy I had previously stored: same SHA-256, same byte length, same `qpdf --check` failure.
- **Internet Archive Wayback Machine** has snapshots of every one of these URLs. Each archived copy is byte-identical to the version served today, so the corruption was already present when the earliest snapshot was taken, and most likely when the file was first uploaded.
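
For anyone who wants to repeat the byte-identity check, here is a minimal sketch of the comparison using only the Python standard library. The function names `fingerprint` and `same_bytes` are my own, and the file paths in the note below the code are hypothetical; this is not the exact script I ran.

```python
import hashlib


def fingerprint(path, chunk_size=1 << 20):
    """Return (sha256_hex, byte_length) for a file, reading it in chunks."""
    digest = hashlib.sha256()
    length = 0
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so the 110 MB file does not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
            length += len(chunk)
    return digest.hexdigest(), length


def same_bytes(path_a, path_b):
    """True when both files have the same byte length and SHA-256 digest."""
    return fingerprint(path_a) == fingerprint(path_b)
```

Calling `same_bytes("live/21119s1lbl.pdf", "wayback/21119s1lbl.pdf")` (hypothetical local paths for the live and archived downloads) returns `True` when the two copies match. When saving the archived copy, make sure it is the raw capture rather than a Wayback page with the toolbar injected, or the digests will differ for unrelated reasons.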

## Suggested resolution

If the original CDER submission packages are still on file with the review divisions, re-uploading them from the source would replace the broken public copies. The damage does not appear to be in the document content itself, since the page streams may still be physically present in each file. What is missing or damaged is the `startxref` pointer, the trailer dictionary, and the `/Root` catalog, and without those a PDF reader cannot locate the page tree.
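
To make that diagnosis easier to check independently, here is a rough sketch that scans the raw bytes of a file for the structures named above. It is a heuristic byte search, not a PDF parser: the function name `trailer_markers` and the 2048-byte tail window are my own choices, the object count can be inflated by matches inside binary streams, and files that use cross-reference streams legitimately have no `trailer` keyword.

```python
import re


def trailer_markers(data, tail_size=2048):
    """Report which end-of-file structures appear in raw PDF bytes."""
    tail = data[-tail_size:]
    # A well-formed file ends with: startxref <byte offset> %%EOF
    match = re.search(rb"startxref\s+(\d+)\s+%%EOF", tail)
    return {
        # The header is normally at offset 0; allow some leading junk.
        "has_header": b"%PDF-" in data[:1024],
        "startxref_offset": int(match.group(1)) if match else None,
        "has_eof_marker": b"%%EOF" in tail,
        "has_trailer_keyword": b"trailer" in data,
        "has_root_key": b"/Root" in data,
        # Approximate count of "N G obj" object headers anywhere in the file.
        "object_count": len(re.findall(rb"\d+\s+\d+\s+obj\b", data)),
    }
```

A result with a nonzero `object_count` but no `startxref_offset` and no `/Root` key would be consistent with the qpdf output above: objects are present, but nothing tells a reader where the catalog is. The function takes the whole file as `bytes`, which is fine for these sizes but does load the 110 MB file into memory.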

## Why this matters

These 6 files are referenced from the `drug/drugsfda` bulk JSON dataset's `application_docs[].url` field, so any consumer of the openFDA bulk data that follows those URLs (mirroring tools, RAG pipelines, FOIA archivists) hits the same wall. The Visudyne label (#4 above, NDA021119) in particular is the only public version of that label revision; for the others, alternate review documents from the same applications appear to be intact and parseable.
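
For consumers who want to skip these files until they are fixed, the following is a sketch of how the affected references could be located in the unzipped bulk JSON. It assumes only that each document entry is a dict with a `url` key inside a list named `application_docs`; it walks the structure recursively so it does not depend on where that list is nested. The function names are my own, and the match is an exact string comparison, so it will miss URLs that differ in scheme or percent-encoding.

```python
import json


def load_bulk(path):
    """Load the unzipped bulk JSON file from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def iter_doc_urls(node):
    """Yield every url found under any application_docs list, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "application_docs" and isinstance(value, list):
                for doc in value:
                    if isinstance(doc, dict) and "url" in doc:
                        yield doc["url"]
            else:
                yield from iter_doc_urls(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_doc_urls(item)


def flag_known_bad(bulk, known_bad):
    """Return the referenced URLs that appear in a set of known-corrupt URLs."""
    return sorted(set(iter_doc_urls(bulk)) & set(known_bad))
```

Passing the loaded dataset and the six URLs from the table above to `flag_known_bad` gives the subset a pipeline could exclude or retry later.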

Happy to provide the full per-file diagnostic logs (qpdf, mutool, ghostscript, pikepdf invocations + their stderr) if useful.
