feat: add article parsing and handling to timeline and scraper modules #195
fibonacci998 wants to merge 2 commits into
fibonacci998
commented
May 25, 2026
- feat: add article parsing and handling to timeline and scraper modules
- test: add article parsing coverage to tweets.test.ts
Ports the article-extraction work from PR the-convocation#146 (LiamVDB1) onto current main. The previous PR has drifted behind main since 2025-07-11 and never got its requested tests; this commit applies the same diff cleanly, and follow-up commits add the tests and drop the unrelated `prepare` script change that broke CI for downstream consumers.

Adds support for X "Articles" (long-form posts) inside the timeline data structure:

* `ArticleRaw`, `ArticleResultRaw`, `ArticleContentStateRaw` interfaces in src/timeline-v1.ts representing the raw article payload, including metadata, media, and content state.
* `parseArticleToMarkdown` and `parseArticle` in src/timeline-v2.ts that walk `content_state.blocks` and produce markdown (handling text, links, bold/italic, headers, lists, and inline media).
* `parseResult` now detects `result.article.article_results.result` and, when present, sets `tweet.isArticle = true`, populates `tweet.article`, and overwrites `tweet.text` with the rendered markdown (since `legacy.full_text` for an Article tweet is just the t.co URL stub).
* `Tweet` interface gains optional `isArticle` and `article` fields.

Co-authored-by: LiamVDB1 <liam.van.den.berge@hotmail.com>
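As a rough illustration of the block walk described above, here is a minimal sketch. It is not the PR's `parseArticleToMarkdown`: the `SketchBlock` shape, the Draft.js-style type names (`header-one`, `ordered-list-item`, and so on), and the function names are assumptions made for the example, and it covers only headers, lists, and plain paragraphs (no links, bold/italic, or media).

```typescript
// Hypothetical, simplified block shape; the real ArticleContentStateRaw may differ.
interface SketchBlock {
  type: string; // assumed Draft.js-style names such as 'header-one'
  text: string;
}

interface SketchContentState {
  blocks: SketchBlock[];
}

// Map one block to a markdown line based on its (assumed) type name.
function blockToMarkdown(block: SketchBlock, orderedIndex: number): string {
  switch (block.type) {
    case 'header-one':
      return `# ${block.text}`;
    case 'header-two':
      return `## ${block.text}`;
    case 'unordered-list-item':
      return `- ${block.text}`;
    case 'ordered-list-item':
      return `${orderedIndex}. ${block.text}`;
    default:
      // Anything unrecognised is emitted as a plain paragraph.
      return block.text;
  }
}

// Walk the blocks in order and join the rendered lines with blank lines.
function sketchArticleToMarkdown(state: SketchContentState): string {
  const lines: string[] = [];
  let orderedIndex = 0;
  for (const block of state.blocks) {
    // Number consecutive ordered-list items; reset when the run ends.
    orderedIndex = block.type === 'ordered-list-item' ? orderedIndex + 1 : 0;
    lines.push(blockToMarkdown(block, orderedIndex));
  }
  return lines.join('\n\n');
}
```

Joining every block with a blank line keeps the sketch short but renders lists as "loose" markdown lists; a fuller version would likely group adjacent list items.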
Addresses karashiiro's review request on PR the-convocation#146. Two tests against the public article tweet 2053808119709659225 (subnetamplify):

* isArticle flag is set, article.id matches the article rest_id (not the tweet id; they are distinct), and content_state is populated.
* tweet.text is replaced with the rendered markdown body, far larger than the t.co URL stub and starting with an H1 of the article title.

Co-authored-by: LiamVDB1 <liam.van.den.berge@hotmail.com>
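The behaviour these tests check can be sketched offline as follows. This is an illustration only, not the PR's `parseResult`: the `Sketch*` types are trimmed-down, hypothetical stand-ins for the raw payload and `Tweet` shapes, and the markdown renderer is passed in as a parameter so the example stays self-contained.

```typescript
// Hypothetical, minimal stand-ins for the raw result and Tweet shapes.
interface SketchArticleResult {
  rest_id: string;
  title: string;
}

interface SketchResult {
  rest_id: string;
  legacy: { full_text: string };
  article?: { article_results?: { result?: SketchArticleResult } };
}

interface SketchTweet {
  id: string;
  text: string;
  isArticle?: boolean;
  article?: { id: string; title: string };
}

function sketchParseResult(
  result: SketchResult,
  renderMarkdown: (article: SketchArticleResult) => string,
): SketchTweet {
  // Start from the legacy text, which for an Article tweet is only the t.co stub.
  const tweet: SketchTweet = { id: result.rest_id, text: result.legacy.full_text };

  // Optional chaining keeps non-article tweets on the unchanged path.
  const article = result.article?.article_results?.result;
  if (article) {
    tweet.isArticle = true;
    // The article id is the article's own rest_id, distinct from the tweet id.
    tweet.article = { id: article.rest_id, title: article.title };
    tweet.text = renderMarkdown(article);
  }
  return tweet;
}
```

A test in the spirit of the two described above would then assert that `isArticle` is set, that `article.id` differs from the tweet id, and that `text` starts with an H1 of the title.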
mohamedorigami-jpg
left a comment
Clean types and solid test coverage. The reverse-offset entity range sorting is the right approach for in-place text replacement.
One question: when isArticle is true, the tweet text field becomes the full article markdown body. If a consumer just wants the article title or metadata without rendering the full body, they'd reach for tweet.article.title directly. That seems fine - the markdown body is the natural representation of the article text. But are there cases where the original t.co URL stub in legacy.full_text would be more useful than the article body (e.g. for link previews or excerpt generation)?
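For readers unfamiliar with the reverse-offset approach mentioned in this review, here is a minimal sketch. It is not the PR's code: `SketchRange` and `applyRanges` are hypothetical names, and the sketch assumes the ranges do not overlap and that offsets are UTF-16 code-unit indices into the string.

```typescript
// Hypothetical range shape: replace `length` characters at `offset` with `replacement`.
interface SketchRange {
  offset: number;
  length: number;
  replacement: string;
}

function applyRanges(text: string, ranges: SketchRange[]): string {
  // Sort a copy from the highest offset to the lowest.
  const sorted = [...ranges].sort((a, b) => b.offset - a.offset);

  let out = text;
  for (const range of sorted) {
    // Editing from the end backwards means earlier offsets are still valid:
    // nothing to the left of `range.offset` has been shifted yet.
    out =
      out.slice(0, range.offset) +
      range.replacement +
      out.slice(range.offset + range.length);
  }
  return out;
}
```

Applying the same ranges in ascending order would require adjusting every later offset by the length difference of each earlier replacement, which is what the descending sort avoids.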
Good question, and you're right that we overwrite `tweet.text`. The reason I went that way: for an Article tweet, `legacy.full_text` is just the t.co URL stub. For your two cases specifically:
So I don't think the stub is more useful than the body in either case: the link is preserved elsewhere, and the article fields give a cleaner excerpt source. That said, if there's a concrete consumer that needs the raw pre-render
That makes sense. Keeping text as the readable article body is better than preserving the self-link stub, since permanentUrl and urls cover the canonical link, and the article fields are better for excerpts. I do not see a reason to change the shape from this.
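A consumer that wants a link preview or excerpt under this shape might look roughly like the sketch below. It is an assumption-laden illustration: `SketchTweetView` is a hypothetical subset of the `Tweet` fields named in this thread (`text`, `permanentUrl`, `isArticle`, `article.title`), and `buildPreview` with its `maxLen` parameter is invented for the example.

```typescript
// Hypothetical subset of the Tweet fields discussed in this thread.
interface SketchTweetView {
  text?: string;
  permanentUrl?: string;
  isArticle?: boolean;
  article?: { title?: string };
}

interface SketchPreview {
  title: string;
  link: string | undefined;
  excerpt: string;
}

function buildPreview(tweet: SketchTweetView, maxLen = 80): SketchPreview {
  // The body is the rendered markdown for Articles, or the normal tweet text otherwise.
  const body = tweet.text ?? '';
  const excerpt = body.length > maxLen ? `${body.slice(0, maxLen)}...` : body;

  // Only Article tweets carry a title; other tweets get an empty one.
  const title = tweet.isArticle ? (tweet.article?.title ?? '') : '';

  // The canonical link comes from permanentUrl rather than the t.co stub.
  return { title, link: tweet.permanentUrl, excerpt };
}
```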