GitHub - world-wide-dev/chatgpt-export: Hydration-aware semantic ChatGPT exporter focused on deterministic archival snapshots.

ChatGPT Export

“this shouldn’t need to exist… but reality currently requires it”

A local-first Chrome extension for extracting and exporting ChatGPT conversations into stable canonical snapshots.

Built from the simple desire to “just save my chats quickly,” then progressively forced into becoming a hydration-aware semantic extraction system after repeated encounters with a shape-shifting frontend runtime.

Rather than cloning the ChatGPT UI directly, the extension extracts and normalizes semantic conversation content into durable canonical representations that can later be exported deterministically as Markdown, HTML, JSON, and future formats.

Version

Current version: 1.0.2

Motivation

Most existing ChatGPT export approaches fall into one of two categories:

raw DOM dumping
screenshot/PDF capture

Both approaches tend to break over time, produce unstable output, or tightly couple exports to transient frontend implementation details.

This project instead focuses on:

semantic preservation
deterministic exports
incremental persistence
local-first storage
resilience against frontend mutation
reproducible archival output

The goal is not to reproduce the ChatGPT interface pixel-for-pixel.

The goal is to preserve conversations as durable structured documents.

Features

Incremental conversation extraction
Hydration-aware extraction pipeline
Stable canonical local snapshots
Deterministic HTML export
Deterministic Markdown export
JSON export
Full database dump export
Canonical image and gallery reconstruction
Incremental IndexedDB persistence
Safe re-extraction and update handling
Local-first architecture

Installation

Option 1 — Clone repository

git clone https://github.com/world-wide-dev/chatgpt-export.git

Option 2 — Download ZIP

Download and extract the project ZIP locally.

Load extension in Chrome

Open chrome://extensions
Enable Developer mode
Click Load unpacked
Select the chatgpt-export directory

Usage

Open a ChatGPT conversation
Click Extract Conversation
Wait for extraction to complete
Export using one of the available formats:

Markdown
HTML
JSON
Full database dump

Exports operate exclusively on the latest successfully extracted local snapshot of the conversation.

Re-running extraction incrementally updates the locally stored canonical snapshot.

How It Works

Extraction progressively traverses a hydrated DOM snapshot and converts the conversation into stable canonical objects stored locally in the extension database.

Exporters operate exclusively on the latest successfully extracted snapshot, making exports resilient against frontend runtime and hydration changes.

The architecture intentionally separates:

Extraction → Persistence → Export

This separation allows exports to remain deterministic and reproducible without depending directly on the live ChatGPT runtime during export generation.
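
The separation can be pictured with the minimal sketch below. It is not the extension's actual code: `extract`, `persist`, `exportAs`, the `exporters` table, and the in-memory `Map` standing in for IndexedDB are all hypothetical names chosen for illustration.

```javascript
// Hypothetical sketch: three stages that only communicate through stored snapshots.
// A Map stands in for IndexedDB so the example runs outside a browser.
const store = new Map();

// Extraction: turn raw runtime input into a canonical object (no export logic here).
function extract(raw) {
  return { id: raw.id, role: raw.role, text: raw.text.trim() };
}

// Persistence: the only stage that writes to the store.
function persist(message) {
  store.set(message.id, message);
}

// Export: reads stored snapshots only and never touches the raw runtime input.
const exporters = {
  json: (messages) => JSON.stringify(messages),
  markdown: (messages) =>
    messages.map((m) => `**${m.role}**\n\n${m.text}`).join("\n\n"),
};

function exportAs(format) {
  const render = exporters[format];
  if (!render) throw new Error(`Unknown export format: ${format}`);
  return render([...store.values()]);
}

persist(extract({ id: "m1", role: "user", text: "  Hello  " }));
persist(extract({ id: "m2", role: "assistant", text: "Hi there" }));
```

Because `exportAs` depends only on what is in the store, calling it twice without a new extraction yields the same string, which is the property the architecture is after.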

High-Level Architecture

ChatGPT Runtime
        ↓
Hydration-Aware Extraction
        ↓
Semantic Normalization
        ↓
IndexedDB Persistence
        ↓
Deterministic Export Layer
        ↓
Markdown / HTML / JSON Output

Core Architecture

| Layer | Responsibility |
| --- | --- |
| Extraction | Read and normalize semantic content from the runtime DOM |
| Persistence | Store canonical conversation/message representations |
| Transformation | Generate deterministic export formats |
| UI | Browse and manage archived conversations |
| Export | Produce Markdown, HTML, JSON, and DB snapshots |

Design Philosophy

Local-First

No backend. No server dependency. No external storage.

All extraction, persistence, and export generation happen locally inside the browser.

Semantic Preservation over DOM Mirroring

The ChatGPT UI is treated as a transient rendering layer.

The extension extracts semantic content:

paragraphs
lists
code blocks
tables
images
blockquotes
headings

while intentionally removing:

layout wrappers
action bars
runtime-specific UI containers
styling artifacts
interaction controls

The resulting output is stable, portable, and independent of the original frontend structure.

Canonical Snapshots as Source of Truth

The system stores normalized canonical conversation snapshots locally in IndexedDB.

Exporters never scrape the live DOM directly.

Instead:

extract → normalize → persist → export

This architecture makes exports:

reproducible
deterministic
resilient against runtime mutation
independent of hydration timing
safe to regenerate later

Markdown, HTML, and future export formats are therefore treated as derived representations generated from canonical stored data.

Idempotent Extraction

Extraction is intentionally safe to re-run.

Messages are persisted individually using stable platform-native identifiers extracted directly from the ChatGPT runtime.

This enables:

incremental updates
partial recovery
safe refresh handling
deterministic ordering
duplicate prevention

The system favors correctness and resilience over aggressive optimization.
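
A minimal sketch of what idempotent persistence could look like, assuming messages carry the `id` and `index` fields shown in the Message Model. The `upsertMessages` and `orderedMessages` helpers are hypothetical, and a `Map` stands in for the IndexedDB message store.

```javascript
// Hypothetical sketch: upsert messages by their stable id so re-extraction is idempotent.
function upsertMessages(store, extracted) {
  for (const message of extracted) {
    // The same id overwrites the previous record instead of appending a duplicate.
    store.set(message.id, { ...store.get(message.id), ...message });
  }
  return store;
}

// Ordering comes from the stored index, not from the order in which records arrived.
function orderedMessages(store) {
  return [...store.values()].sort((a, b) => a.index - b.index);
}

const messages = new Map();
const firstPass = [
  { id: "b", index: 1, role: "assistant", content_html: "<p>Hi</p>" },
  { id: "a", index: 0, role: "user", content_html: "<p>Hello</p>" },
];
upsertMessages(messages, firstPass);
upsertMessages(messages, firstPass); // re-running the same extraction changes nothing
upsertMessages(messages, [
  { id: "c", index: 2, role: "user", content_html: "<p>More</p>" },
]); // a later pass only adds what is new
```

Keying on the platform-native identifier is what makes duplicate prevention and incremental updates fall out of the same write path.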

Hydration-Aware Extraction

Modern frontend runtimes increasingly virtualize long conversations, typically keeping only the content near the viewport mounted.

This means a message that was visible a moment ago is not guaranteed to remain in the DOM once it scrolls out of view.

To remain reliable under virtualization, the exporter performs localized hydration-aware extraction by:

progressively traversing conversation regions
waiting for runtime hydration
extracting semantic content immediately
persisting normalized output incrementally

The resulting system behaves more like a streaming archival pipeline than a static DOM scraper.
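
The four steps above can be sketched as a loop. This is an illustration under stated assumptions, not the extension's implementation: `waitFor`, `extractProgressively`, and every injected callback (`reveal`, `isHydrated`, `extract`, `persist`) are hypothetical, and the timeout values are arbitrary.

```javascript
// Hypothetical sketch of hydration-aware traversal. All callbacks are injected
// assumptions; in a real extension they would touch the DOM and IndexedDB.

// Poll until `isReady()` returns true or the timeout elapses.
async function waitFor(isReady, { timeoutMs = 2000, intervalMs = 50 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (!isReady()) {
    if (Date.now() >= deadline) return false;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return true;
}

async function extractProgressively(regions, deps) {
  const { reveal, isHydrated, extract, persist, waitOptions } = deps;
  const skipped = [];
  for (const region of regions) {
    reveal(region); // e.g. scroll the region into view
    const ready = await waitFor(() => isHydrated(region), waitOptions);
    if (!ready) {
      skipped.push(region); // left for a later re-run
      continue;
    }
    // Extract and persist immediately, before the runtime can unmount the region.
    await persist(extract(region));
  }
  return skipped;
}
```

Returning the skipped regions rather than throwing keeps a single slow region from discarding work that was already persisted.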

Extraction Strategy

Preserved Elements

p
ul, ol, li
pre, code
table
blockquote
strong, em
a
img
h1–h4
hr

Removed Elements

runtime wrappers
action bars
copy/share controls
layout containers
transient UI elements
styling-specific classes
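
The preserved and removed lists above amount to an allowlist. The sketch below shows the idea on a simplified node tree (plain objects instead of DOM nodes, so it runs outside a browser). The `normalize` function, the `DROPPED` set, the kept attributes, and the extra table descendants are assumptions made for illustration and may differ from what the extension does.

```javascript
// Hypothetical sketch: allowlist normalization over a simplified node tree.
// Nodes are plain objects ({ tag, attrs, children }) or text strings.
const PRESERVED = new Set([
  "p", "ul", "ol", "li", "pre", "code", "table", "blockquote",
  "strong", "em", "a", "img", "h1", "h2", "h3", "h4", "hr",
  // Assumption: table descendants must survive for "table" to be useful.
  "thead", "tbody", "tr", "th", "td",
]);
// Assumption: interactive controls are dropped together with their contents.
const DROPPED = new Set(["button", "svg"]);
// Assumption: only these attributes carry meaning; class and style are discarded.
const KEPT_ATTRS = { a: ["href"], img: ["src", "alt"] };

function normalize(node) {
  if (typeof node === "string") return [node];
  if (DROPPED.has(node.tag)) return [];
  const children = (node.children ?? []).flatMap(normalize);
  // Unknown wrappers (div, span, ...) are unwrapped: children kept, wrapper removed.
  if (!PRESERVED.has(node.tag)) return children;
  const attrs = {};
  for (const name of KEPT_ATTRS[node.tag] ?? []) {
    if (node.attrs?.[name] !== undefined) attrs[name] = node.attrs[name];
  }
  return [{ tag: node.tag, attrs, children }];
}
```

Unwrapping unknown elements instead of deleting them means a new layout wrapper introduced by the frontend does not silently drop the text inside it.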

Code Block Canonicalization

All code blocks are normalized into deterministic semantic structures.

The language label of each code block is preserved whenever it is available.

The export layer then derives:

```language
code
```

from canonical semantic representations.
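
As a rough sketch of that derivation, assume a canonical code block is an object with `language` and `code` fields (a hypothetical shape; the extension's stored form may differ). One detail worth handling is code that itself contains backtick runs, since a Markdown fence must be longer than any run inside it.

```javascript
// Hypothetical sketch: render a canonical code block ({ language, code }) as fenced Markdown.
function renderCodeBlock({ language = "", code }) {
  // Find the longest run of backticks in the code so the fence can be made longer.
  const runs = code.match(/`+/g) ?? [];
  const longest = runs.reduce((max, run) => Math.max(max, run.length), 0);
  const fence = "`".repeat(Math.max(3, longest + 1));
  // Trailing newlines are dropped so the same code always renders the same way.
  const body = code.replace(/\n+$/, "");
  return `${fence}${language}\n${body}\n${fence}`;
}
```

A block without a detected language simply gets a bare fence, so the output stays valid either way.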

Image Preservation

Images are extracted independently from runtime wrappers and reinjected into canonical semantic structures during normalization.

This avoids runtime-specific layout instability while preserving:

image ordering
gallery grouping
deterministic rendering
portable export structure

The extraction layer intentionally avoids relying on transient frontend layout wrappers whenever possible.

Persistence Model

The extension uses IndexedDB with dedicated stores for:

conversations
messages
images

Messages are stored individually rather than as conversation-sized blobs.

This enables:

incremental persistence
partial recovery
deterministic updates
scalable handling of long conversations
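
A possible schema for these three stores is sketched below. The store names come from the list above; the key paths, the `by_conversation` index, the database name, and the `createSchema` helper are assumptions made for illustration. Only standard IndexedDB calls (`createObjectStore`, `createIndex`) are used.

```javascript
// Hypothetical sketch of an upgrade handler that creates the three stores.
// Key paths and index names are assumptions, not the extension's actual schema.
function createSchema(db) {
  db.createObjectStore("conversations", { keyPath: "id" });

  const messages = db.createObjectStore("messages", { keyPath: "id" });
  // Compound index so one conversation's messages can be read back in order.
  messages.createIndex("by_conversation", ["conversation_id", "index"]);

  db.createObjectStore("images", { keyPath: "id" });
}

// In a browser this would be wired up roughly as:
//   const request = indexedDB.open("chatgpt-export", 1); // database name is an assumption
//   request.onupgradeneeded = () => createSchema(request.result);
```

Keeping each message as its own record means an interrupted extraction leaves earlier messages intact, and a later run only writes the records that changed.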

Conversation Model

{
  "id": "69f79886-ac48-8328-9d3a-98fa285bce9f",
  "title": "conversation-title",
  "model": "gpt-5-3",
  "first_seen_at": 1778791573306,
  "updated_at": 1778791575269,
  "last_message_id": "eac90e6a-6def-4251-9697-aef09507cd3a",
  "extracting": false
}

Message Model

{
  "id": "eac90e6a-6def-4251-9697-aef09507cd3a",
  "conversation_id": "69f79886-ac48-8328-9d3a-98fa285bce9f",
  "index": 10,
  "role": "assistant",
  "model": "gpt-5-3",
  "content_html": "<p>...</p>",
  "image_ids": []
}

Export Model

Exports are generated on demand from canonical stored representations.

The export process:

User triggers extraction
        ↓
Conversation hydrates progressively
        ↓
Semantic content is normalized
        ↓
IndexedDB updates incrementally
        ↓
Deterministic export generated
        ↓
Markdown / HTML / JSON snapshot downloaded

The resulting exports include:

metadata manifests
semantic code blocks
preserved image galleries
canonical ordering
stable formatting
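
To make "canonical ordering" and "stable formatting" concrete, here is a minimal sketch of a JSON export that produces identical output for identical stored data. The `sortKeys` and `exportJson` helpers and the manifest fields are hypothetical; the record shapes follow the Conversation Model and Message Model above.

```javascript
// Hypothetical sketch: a JSON export that is stable for the same stored data.

// Recursively sort object keys so serialization does not depend on insertion order.
function sortKeys(value) {
  if (Array.isArray(value)) return value.map(sortKeys);
  if (value === null || typeof value !== "object") return value;
  return Object.fromEntries(
    Object.keys(value).sort().map((key) => [key, sortKeys(value[key])])
  );
}

function exportJson(conversation, messages) {
  const ordered = messages
    .filter((m) => m.conversation_id === conversation.id)
    .sort((a, b) => a.index - b.index); // canonical ordering by stored index
  const manifest = {
    // Assumption: only stored values go in. A wall-clock "exported at" stamp
    // would make two exports of the same snapshot differ.
    id: conversation.id,
    title: conversation.title,
    updated_at: conversation.updated_at,
    message_count: ordered.length,
  };
  return JSON.stringify(sortKeys({ manifest, messages: ordered }), null, 2);
}
```

With this shape, two exports of the same snapshot can be compared with a plain diff or a checksum.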

HTML Export Philosophy

The HTML export intentionally behaves like a portable archival document rather than a UI replay.

Features include:

deterministic structure
bounded image rendering
semantic typography
portable CSS
export metadata
stable printability

The export prioritizes readability and durability over visual fidelity to the live ChatGPT interface.

Design Tradeoffs

Chosen

semantic HTML over DOM snapshots
local-first persistence over backend infrastructure
deterministic exports over runtime replay
explicit extraction flow over hidden background automation
incremental persistence over full-conversation rewrites
canonical representation over duplicated transformation layers

Avoided

PDF-first architecture
screenshot-based capture
visual DOM cloning
backend synchronization complexity
state-heavy frontend orchestration
fragile UI-coupled rendering assumptions

Current Features

semantic conversation extraction
hydration-aware traversal
incremental IndexedDB persistence
deterministic Markdown export
deterministic HTML export
JSON export
full database dump export
code block normalization
language-aware fenced code blocks
image preservation and gallery reconstruction
conversation metadata manifests
idempotent extraction pipeline

Future Considerations

Potential future additions:

syntax highlighting during HTML rendering
full-text search
bulk export
optional external sync
additional export renderers
extraction interruption / resume support
per-conversation extraction status indicators

The project intentionally avoids uncontrolled feature growth in favor of maintaining a stable archival core.

Notes

Built while actively co-engineering extraction logic with ChatGPT itself, which in retrospect feels appropriately recursive.

Special thanks to the OpenAI frontend team for repeatedly evolving the ChatGPT runtime during development, transforming a straightforward exporter into a hydration-aware semantic archival system.

Disclaimer

This project is not affiliated with OpenAI.

Frontend runtime structures may change over time, requiring extractor updates.

Summary

This project focuses on long-term conversation durability rather than short-term UI mirroring.

The resulting system behaves less like a browser scraper and more like a semantic archival pipeline:

hydrate
→ normalize
→ persist
→ regenerate deterministically

The architecture intentionally prioritizes:

resilience
reproducibility
semantic clarity
portability
maintainability

over visual cloning or frontend-specific assumptions.

Ultimately, the project exists because reliable conversation preservation should not depend on transient frontend state.

License

MIT License – see LICENSE file for details.
