feat: add article parsing and handling to timeline and scraper modules by fibonacci998 · Pull Request #195 · the-convocation/twitter-scraper

fibonacci998 · 2026-05-25T02:52:58Z

feat: add article parsing and handling to timeline and scraper modules
test: add article parsing coverage to tweets.test.ts

Ports the article-extraction work from PR the-convocation#146 (LiamVDB1) onto current main. The previous PR drifted behind main since 2025-07-11 and never got its requested tests; this commit applies the same diff cleanly and follow-up commits add the tests + drop the unrelated `prepare` script change that broke CI for downstream consumers.

Adds support for X "Articles" (long-form posts) inside the timeline data structure:

* `ArticleRaw`, `ArticleResultRaw`, `ArticleContentStateRaw` interfaces in src/timeline-v1.ts representing the raw article payload, including metadata, media, and content state.
* `parseArticleToMarkdown` and `parseArticle` in src/timeline-v2.ts that walk `content_state.blocks` and produce markdown (handling text, links, bold/italic, headers, lists, and inline media).
* `parseResult` now detects `result.article.article_results.result` and, when present, sets `tweet.isArticle = true`, populates `tweet.article`, and overwrites `tweet.text` with the rendered markdown (since `legacy.full_text` for an Article tweet is just the t.co URL stub).
* `Tweet` interface gains optional `isArticle` and `article` fields.

Co-authored-by: LiamVDB1 <liam.van.den.berge@hotmail.com>
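The detection step described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the trimmed-down `ArticleSketch`, `TweetResultSketch`, and `TweetSketch` shapes, the `applyArticle` helper, and the `renderMarkdown` callback (standing in for `parseArticleToMarkdown`, whose real signature is not shown here) are all assumptions.

```typescript
// Hypothetical, trimmed-down shapes; the PR's real interfaces carry more fields.
interface ArticleSketch {
  rest_id: string;
  title: string;
}

interface TweetResultSketch {
  rest_id: string;
  legacy: { full_text: string };
  article?: { article_results?: { result?: ArticleSketch } };
}

interface TweetSketch {
  id: string;
  text: string;
  isArticle?: boolean;
  article?: ArticleSketch;
}

// `renderMarkdown` is an assumed stand-in for the PR's markdown renderer.
function applyArticle(
  result: TweetResultSketch,
  renderMarkdown: (article: ArticleSketch) => string,
): TweetSketch {
  // Start from the ordinary tweet: text is legacy.full_text.
  const tweet: TweetSketch = {
    id: result.rest_id,
    text: result.legacy.full_text,
  };

  // Optional chaining yields undefined for non-article tweets.
  const article = result.article?.article_results?.result;
  if (article) {
    tweet.isArticle = true;
    tweet.article = article;
    // For an Article tweet, full_text is only the t.co stub, so replace it.
    tweet.text = renderMarkdown(article);
  }
  return tweet;
}
```

Non-article tweets pass through unchanged in this sketch, so `isArticle` and `article` stay undefined, matching the optional fields described above.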

Addresses karashiiro's review request on PR the-convocation#146. Two tests against the public article tweet 2053808119709659225 (subnetamplify):

* isArticle flag is set, article.id matches the article rest_id (not the tweet id; they are distinct), and content_state is populated.
* tweet.text is replaced with the rendered markdown body, far larger than the t.co URL stub and starting with an H1 of the article title.

Co-authored-by: LiamVDB1 <liam.van.den.berge@hotmail.com>
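The properties those two tests check can be sketched as a plain validation helper. This is an illustration under assumptions, not the contents of tweets.test.ts: the `ParsedTweetSketch` and `ParsedArticleSketch` shapes and the `articleTweetProblems` function are hypothetical, and the default stub length of 23 is taken from the approximate figure quoted later in this thread.

```typescript
// Hypothetical shapes for a parsed Article tweet.
interface ParsedArticleSketch {
  id: string;
  title: string;
  content_state?: { blocks: unknown[] };
}

interface ParsedTweetSketch {
  id: string;
  text: string;
  isArticle?: boolean;
  article?: ParsedArticleSketch;
}

// Returns a list of human-readable problems; an empty list means all checks pass.
function articleTweetProblems(
  tweet: ParsedTweetSketch,
  stubLength = 23,
): string[] {
  const problems: string[] = [];
  if (tweet.isArticle !== true) problems.push("isArticle is not set");
  if (!tweet.article) {
    problems.push("article is missing");
    return problems;
  }
  // The article rest_id and the tweet id are distinct identifiers.
  if (tweet.article.id === tweet.id) {
    problems.push("article.id equals the tweet id");
  }
  if (!tweet.article.content_state?.blocks.length) {
    problems.push("content_state is empty");
  }
  if (tweet.text.length <= stubLength) {
    problems.push("text is no longer than the t.co stub");
  }
  if (!tweet.text.startsWith(`# ${tweet.article.title}`)) {
    problems.push("text does not start with an H1 of the title");
  }
  return problems;
}
```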

mohamedorigami-jpg

Clean types and solid test coverage. The reverse-offset entity range sorting is the right approach for in-place text replacement.

One question: when isArticle is true, the tweet text field becomes the full article markdown body. If a consumer just wants the article title or metadata without rendering the full body, they'd reach for tweet.article.title directly. That seems fine - the markdown body is the natural representation of the article text. But are there cases where the original t.co URL stub in legacy.full_text would be more useful than the article body (e.g. for link previews or excerpt generation)?
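To illustrate why reverse-offset ordering suits in-place replacement, here is a minimal sketch. The `LinkRangeSketch` shape and `linkify` helper are hypothetical and not taken from the PR; the sketch assumes non-overlapping ranges whose offsets count UTF-16 code units, as JavaScript string indices do.

```typescript
// Hypothetical range shape: a span of text that should become a markdown link.
interface LinkRangeSketch {
  offset: number;
  length: number;
  url: string;
}

function linkify(text: string, ranges: LinkRangeSketch[]): string {
  // Process the rightmost range first. Each replacement lengthens the string
  // only at or after its own offset, so earlier offsets remain valid.
  const ordered = [...ranges].sort((a, b) => b.offset - a.offset);

  let out = text;
  for (const { offset, length, url } of ordered) {
    const label = out.slice(offset, offset + length);
    out =
      out.slice(0, offset) + `[${label}](${url})` + out.slice(offset + length);
  }
  return out;
}

const rendered = linkify("see docs and repo", [
  { offset: 4, length: 4, url: "https://a.example" },
  { offset: 13, length: 4, url: "https://b.example" },
]);
// rendered === "see [docs](https://a.example) and [repo](https://b.example)"
```

Processing the same ranges in ascending order would require shifting every later offset by the number of characters each replacement added.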

fibonacci998 · 2026-06-02T01:46:51Z

Good question — and you're right that we overwrite text with the markdown body when isArticle is true, so the original legacy.full_text stub isn't preserved on the parsed Tweet.

The reason I went that way: for an Article tweet, legacy.full_text is just the t.co self-link stub (~23 chars in the test fixture) — it carries no actual content, and it resolves back to the tweet's own permalink. So preserving it as text was actively lossy: consumers got a bare URL and the entire article body disappeared. Rendering the body into text makes Article tweets behave like every other tweet (text = the readable content), which is what most consumers expect.

For your two cases specifically:

* Link previews: the canonical link isn't lost. tweet.permanentUrl and tweet.urls[] are still populated as usual, so anything building a preview/unfurl should reach for those rather than the t.co stub (which only pointed back at the same tweet anyway).
* Excerpts: the structured article data is richer than the stub for this: tweet.article.title for the headline, tweet.article.cover for the hero image, and tweet.article.content_state.blocks if you want to derive a first-paragraph excerpt without parsing the rendered markdown.
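A first-paragraph excerpt along those lines might be derived as sketched below. The `BlockSketch` shape, the `firstParagraphExcerpt` helper, and the `"unstyled"` type name for plain paragraphs are assumptions (the name follows the Draft.js convention); the actual block types in the article payload may differ.

```typescript
// Hypothetical block shape; the real content_state blocks carry more fields.
interface BlockSketch {
  type: string;
  text: string;
}

function firstParagraphExcerpt(
  blocks: BlockSketch[],
  maxLength = 200,
): string | undefined {
  // Skip headers, list items, and whitespace-only paragraphs.
  const paragraph = blocks.find(
    (b) => b.type === "unstyled" && b.text.trim().length > 0,
  );
  if (!paragraph) return undefined;

  const text = paragraph.text.trim();
  if (text.length <= maxLength) return text;
  // Truncate and mark the cut; the result may exceed maxLength by the marker.
  return text.slice(0, maxLength).trimEnd() + "...";
}
```

Working from the blocks avoids stripping markdown syntax back out of `tweet.text` just to build a preview.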

So I don't think the stub is more useful than the body in either case — the link is preserved elsewhere, and the article fields give a cleaner excerpt source. That said, if there's a concrete consumer that needs the raw pre-render full_text, I'm happy to stash it on a dedicated field (e.g. keep text as the t.co stub and expose markdown as tweet.article.markdown instead) rather than overwriting. Let me know if you'd prefer that shape.

mohamedorigami-jpg · 2026-06-02T13:21:58Z

That makes sense. Keeping text as the readable article body is better than preserving the self-link stub, since permanentUrl and urls cover the canonical link, and the article fields are better for excerpts. I do not see a reason to change the shape from this.

fibonacci998 and others added 2 commits May 25, 2026 09:48

fibonacci998 mentioned this pull request May 25, 2026

feat: add article parsing and handling to timeline and scraper modules #146

Open

mohamedorigami-jpg approved these changes Jun 1, 2026

fibonacci998 wants to merge 2 commits into the-convocation:main from fibonacci998:feat/article-parsing
