feat: add article parsing and handling to timeline and scraper modules #195

Open
fibonacci998 wants to merge 2 commits into the-convocation:main from fibonacci998:feat/article-parsing

Conversation

@fibonacci998

  • feat: add article parsing and handling to timeline and scraper modules
  • test: add article parsing coverage to tweets.test.ts

fibonacci998 and others added 2 commits May 25, 2026 09:48
Ports the article-extraction work from PR the-convocation#146 (LiamVDB1) onto current
main. The previous PR drifted behind main since 2025-07-11 and never got
its requested tests; this commit applies the same diff cleanly and
follow-up commits add the tests + drop the unrelated `prepare` script
change that broke CI for downstream consumers.

Adds support for X "Articles" (long-form posts) inside the timeline
data structure:

* `ArticleRaw`, `ArticleResultRaw`, `ArticleContentStateRaw` interfaces
  in src/timeline-v1.ts representing the raw article payload, including
  metadata, media, and content state.
* `parseArticleToMarkdown` and `parseArticle` in src/timeline-v2.ts that
  walk `content_state.blocks` and produce markdown (handling text,
  links, bold/italic, headers, lists, and inline media).
* `parseResult` now detects `result.article.article_results.result` and,
  when present, sets `tweet.isArticle = true`, populates `tweet.article`,
  and overwrites `tweet.text` with the rendered markdown (since
  `legacy.full_text` for an Article tweet is just the t.co URL stub).
* `Tweet` interface gains optional `isArticle` and `article` fields.
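As a rough illustration of the flow described in the list above, here is a minimal sketch. It is not the code from this PR: the `Sketch*` types, both function names, and the block type strings (`header-one`, `ordered-list-item`, and so on) are assumptions made for the example, and inline styles, links, and media are left out.

```typescript
// Hypothetical, simplified shapes; the real ArticleRaw / ArticleContentStateRaw
// interfaces in this PR carry more fields and may use different names.
interface SketchBlock {
  type: string; // assumed values: "header-one", "header-two", "unordered-list-item", ...
  text: string;
}

interface SketchContentState {
  blocks: SketchBlock[];
}

interface SketchArticle {
  title: string;
  content_state: SketchContentState;
}

interface SketchResult {
  legacy: { full_text: string };
  article?: { article_results?: { result?: SketchArticle } };
}

interface SketchTweet {
  text: string;
  isArticle?: boolean;
  article?: SketchArticle;
}

// Walk the blocks in order and emit one markdown paragraph per block.
function sketchBlocksToMarkdown(state: SketchContentState): string {
  const lines: string[] = [];
  let orderedIndex = 0;
  for (const block of state.blocks) {
    // Restart numbering whenever a run of ordered list items ends.
    if (block.type !== "ordered-list-item") orderedIndex = 0;
    switch (block.type) {
      case "header-one":
        lines.push(`# ${block.text}`);
        break;
      case "header-two":
        lines.push(`## ${block.text}`);
        break;
      case "unordered-list-item":
        lines.push(`- ${block.text}`);
        break;
      case "ordered-list-item":
        orderedIndex += 1;
        lines.push(`${orderedIndex}. ${block.text}`);
        break;
      default:
        // Unknown or unstyled blocks fall through as plain paragraphs.
        lines.push(block.text);
    }
  }
  return lines.join("\n\n");
}

// Detect the nested article payload and, if present, replace the stub text.
function sketchParseResult(result: SketchResult): SketchTweet {
  const tweet: SketchTweet = { text: result.legacy.full_text };
  const article = result.article?.article_results?.result;
  if (article) {
    tweet.isArticle = true;
    tweet.article = article;
    tweet.text = sketchBlocksToMarkdown(article.content_state);
  }
  return tweet;
}
```

A result without an `article` payload passes through unchanged in this sketch, so non-article tweets keep `legacy.full_text` as their text.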

Co-authored-by: LiamVDB1 <liam.van.den.berge@hotmail.com>
Addresses karashiiro's review request on PR the-convocation#146. Two tests against the
public article tweet 2053808119709659225 (subnetamplify):

* isArticle flag is set, article.id matches the article rest_id (not the
  tweet id — they are distinct), and content_state is populated.
* tweet.text is replaced with the rendered markdown body, far larger
  than the t.co URL stub and starting with an H1 of the article title.

Co-authored-by: LiamVDB1 <liam.van.den.berge@hotmail.com>

@mohamedorigami-jpg mohamedorigami-jpg left a comment

Clean types and solid test coverage. The reverse-offset entity range sorting is the right approach for in-place text replacement.

One question: when `isArticle` is true, the tweet `text` field becomes the full article markdown body. If a consumer just wants the article title or metadata without rendering the full body, they'd reach for `tweet.article.title` directly. That seems fine - the markdown body is the natural representation of the article text. But are there cases where the original t.co URL stub in `legacy.full_text` would be more useful than the article body (e.g. for link previews or excerpt generation)?
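For readers unfamiliar with the reverse-offset approach mentioned in this review, a minimal sketch follows. It is not the PR's implementation: `SketchRange` and `sketchApplyLinks` are hypothetical names, ranges are assumed not to overlap, and offsets are assumed to be in UTF-16 code units, which is what `String.prototype.slice` uses.

```typescript
// Hypothetical range shape: a span of the block text plus the URL it links to.
interface SketchRange {
  offset: number;
  length: number;
  url: string;
}

function sketchApplyLinks(text: string, ranges: SketchRange[]): string {
  // Process the right-most range first, so each replacement only shifts
  // characters that come after the ranges still waiting to be applied.
  const sorted = [...ranges].sort((a, b) => b.offset - a.offset);
  let out = text;
  for (const range of sorted) {
    const label = out.slice(range.offset, range.offset + range.length);
    out =
      out.slice(0, range.offset) +
      `[${label}](${range.url})` +
      out.slice(range.offset + range.length);
  }
  return out;
}
```

Applying the same ranges in ascending order would require adjusting every later offset by the number of characters each replacement adds; sorting in descending order avoids that bookkeeping.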

@fibonacci998
Author

Good question — and you're right that we overwrite text with the markdown body when isArticle is true, so the original legacy.full_text stub isn't preserved on the parsed Tweet.

The reason I went that way: for an Article tweet, legacy.full_text is just the t.co self-link stub (~23 chars in the test fixture) — it carries no actual content, and it resolves back to the tweet's own permalink. So preserving it as text was actively lossy: consumers got a bare URL and the entire article body disappeared. Rendering the body into text makes Article tweets behave like every other tweet (text = the readable content), which is what most consumers expect.

For your two cases specifically:

  • Link previews — the canonical link isn't lost. tweet.permanentUrl and tweet.urls[] are still populated as usual, so anything building a preview/unfurl should reach for those rather than the t.co stub (which only pointed back at the same tweet anyway).
  • Excerpts — the structured article data is richer than the stub for this: tweet.article.title for the headline, tweet.article.cover for the hero image, and tweet.article.content_state.blocks if you want to derive a first-paragraph excerpt without parsing the rendered markdown.
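To make the excerpt case above concrete, here is a small sketch of deriving a first-paragraph excerpt from the blocks. The `ExcerptBlock` shape, the `"unstyled"` type string for plain paragraphs, and the `sketchExcerpt` helper are assumptions for illustration, not part of the library.

```typescript
// Hypothetical minimal block shape; the real content_state blocks may differ.
interface ExcerptBlock {
  type: string; // "unstyled" is assumed here to mark a plain paragraph
  text: string;
}

function sketchExcerpt(blocks: ExcerptBlock[], maxLength = 200): string | undefined {
  // Skip headers, list items, and empty paragraphs; take the first real paragraph.
  const first = blocks.find(
    (block) => block.type === "unstyled" && block.text.trim().length > 0,
  );
  if (!first) return undefined;
  const text = first.text.trim();
  if (text.length <= maxLength) return text;
  // Truncate and mark the cut with an ellipsis, keeping within maxLength.
  return text.slice(0, maxLength - 1).trimEnd() + "…";
}
```

A consumer could call this as `sketchExcerpt(tweet.article.content_state.blocks)` when `tweet.isArticle` is true, assuming the blocks match the shape sketched here.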

So I don't think the stub is more useful than the body in either case — the link is preserved elsewhere, and the article fields give a cleaner excerpt source. That said, if there's a concrete consumer that needs the raw pre-render full_text, I'm happy to stash it on a dedicated field (e.g. keep text as the t.co stub and expose markdown as tweet.article.markdown instead) rather than overwriting. Let me know if you'd prefer that shape.

@mohamedorigami-jpg

That makes sense. Keeping text as the readable article body is better than preserving the self-link stub, since permanentUrl and urls cover the canonical link, and the article fields are better for excerpts. I do not see a reason to change the shape from this.
