Clarify Text CSA tokenization behavior and add type selection guidance by austonli · Pull Request #4595 · temporalio/documentation

austonli · 2026-05-19T20:16:47Z

Summary

- Adds a "Choose a string type" section to the Search Attributes page with a comparison table for Keyword vs KeywordList vs Text, including indexing behavior, supported operators, and best-use guidance
- Expands the Text matching section in the List Filter page with concrete tokenization examples showing how common value shapes (dotted class names, hyphenated IDs, file paths) are split into tokens
- Adds caution admonitions on both pages warning that Text type splits structured identifiers at punctuation, with cross-links between the two pages
- Adds a forward-looking note that broader tokenizer improvements are being evaluated as part of future Visibility enhancements

Context

Customers frequently create Text CSAs (custom Search Attributes) for structured identifiers (workflow type names, UUIDs, hyphenated IDs) expecting exact-match or substring behavior, then hit confusing search failures because the standard tokenizer silently splits at punctuation. The existing docs mention the distinction briefly but don't explain the implications clearly enough to prevent this mistake.

Test plan

- npm run build succeeds with no errors
- Built HTML verified: tables, admonitions, and cross-links render correctly on both pages
- Visual review of /search-attribute#choose-a-string-type
- Visual review of /list-filter#text

🤖 Generated with Claude Code

┆Attachments: EDU-6394 Clarify Text CSA tokenization behavior and add type selection guidance

vercel · 2026-05-19T20:16:54Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
temporal-documentation	Ready	Preview, Comment	May 20, 2026 10:26pm

CLAassistant · 2026-05-19T20:16:56Z

All committers have signed the CLA.

github-actions · 2026-05-19T20:17:21Z

📖 Docs PR preview links

jsundai · 2026-05-20T20:53:44Z

+##### Tokenization {#text-tokenization}
+
+The standard tokenizer applies [Unicode word boundary rules](https://unicode.org/reports/tr29/) to split values into tokens.
+All tokens are lowercased.


This section is going to save so many people from headaches. One tiny thing on this line: the lowercase behavior seems to come from the standard analyzer's lowercase token filter rather than the tokenizer itself (the tokenizer alone would leave QUICK as QUICK). Since Temporal uses the standard analyzer for Text fields, everything written is true in practice but it might just be worth swapping "standard tokenizer" for "standard analyzer" so it holds up if someone goes digging in the ES docs. Super minor, great work on this!
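
To make the tokenizer-versus-analyzer distinction in this comment concrete, here is a minimal Python sketch. It is not Elasticsearch code: `toy_tokenize`, `lowercase_filter`, and `toy_analyze` are hypothetical names, and the tokenizer is a crude regex stand-in rather than the real Unicode word-boundary algorithm. It only illustrates that lowercasing happens in a separate filter step applied after tokenization.

```python
import re


def toy_tokenize(text: str) -> list[str]:
    """Toy stand-in for a word tokenizer (NOT the real UAX #29 rules).

    Splits on anything that is not an ASCII letter, digit, or underscore
    and leaves the case of each token untouched.
    """
    return re.findall(r"[A-Za-z0-9_]+", text)


def lowercase_filter(tokens: list[str]) -> list[str]:
    """Token filter that lowercases every token."""
    return [token.lower() for token in tokens]


def toy_analyze(text: str) -> list[str]:
    """Toy 'analyzer': a tokenizer followed by a lowercase filter."""
    return lowercase_filter(toy_tokenize(text))


print(toy_tokenize("The QUICK fox"))  # ['The', 'QUICK', 'fox']
print(toy_analyze("The QUICK fox"))   # ['the', 'quick', 'fox']
```

In this sketch the tokenizer alone leaves `QUICK` as `QUICK`, which is the behavior the comment describes; only the combined pipeline produces lowercased tokens.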

jsundai · 2026-05-20T21:04:42Z

+All tokens are lowercased.
+The splitting behavior depends on the delimiter:
+
+| Delimiter | Splits? | Example |


Love the tokenization table! Heads up on an edge case that might be worth double-checking: a dot between a digit and a letter may split, so something like v1.ProcessOrder might tokenize into v1 and ProcessOrder rather than staying as one token. Since that pattern shows up a lot in workflow names, it could be worth a small note such as "Note: A dot between a digit and a letter (for example, v1.ProcessOrder) may split into separate tokens." Either way this is a great addition!

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

UAX #29 classifies underscore as ExtendNumLet (connector punctuation), so the standard tokenizer keeps underscore-joined words as single tokens. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Dots between words do not split tokens (contrary to initial assumption). Replace token-output table with query-based examples showing what actually matches vs. doesn't. Reframe guidance around unpredictability of standard tokenizer for structured identifiers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
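
The delimiter behavior described in the review comment and commit messages above can be approximated with a short Python sketch. This is an assumption-laden simplification, not the standard analyzer itself: it covers only ASCII letters, digits, underscores, hyphens, and dots, and the name `approx_text_tokens` is hypothetical. Real values should still be checked against a live namespace.

```python
import re

# A "word" run: ASCII letters, digits, and underscores stay together
# (underscore is treated as a connector, as the commit message describes).
_WORD = r"[A-Za-z0-9_]+"
# A dot joins two runs only when it sits between two letters or two digits.
_JOINING_DOT = r"(?:(?<=[A-Za-z])\.(?=[A-Za-z])|(?<=[0-9])\.(?=[0-9]))"
_TOKEN = re.compile(rf"{_WORD}(?:{_JOINING_DOT}{_WORD})*")


def approx_text_tokens(value: str) -> list[str]:
    """Approximate the tokens a Text value is indexed as (simplified)."""
    return [token.lower() for token in _TOKEN.findall(value)]


for value in [
    "order-processing-v2",        # hyphens split
    "order_processing_v2",        # underscores do not split
    "com.example.OrderWorkflow",  # dots between letters do not split
    "v1.2.3",                     # dots between digits do not split
    "v1.ProcessOrder",            # dot between a digit and a letter splits
]:
    print(value, "->", approx_text_tokens(value))
```

Under these simplified rules, `v1.ProcessOrder` yields two tokens while `v1.2.3` stays whole, which matches the edge case raised in review.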

Verified against aul-tokenizer-demo namespace:
- Query strings are tokenized too, so = "order-processing" DOES match order-processing-v2 (both query tokens present in index)
- = operator uses OR matching: = "order-processing-v2" matches any workflow with order OR processing OR v2 as a token
- Dots between digits stay together: v1.2.3 is one token
- Added OR matching warning and expanded example table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
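
The OR-matching behavior reported in this commit message can be simulated with a small Python sketch. It is an illustration under stated assumptions, not Temporal's or Elasticsearch's implementation: the tokenizer is a simplified ASCII-only approximation, and `approx_text_tokens`, `text_equals_matches`, and `keyword_equals_matches` are hypothetical helper names. Keyword matching is modeled as plain string equality.

```python
import re

# Simplified tokenizer: hyphens split, underscores join, and a dot joins
# only letter-letter or digit-digit neighbors. An approximation only.
_WORD = r"[A-Za-z0-9_]+"
_JOINING_DOT = r"(?:(?<=[A-Za-z])\.(?=[A-Za-z])|(?<=[0-9])\.(?=[0-9]))"
_TOKEN = re.compile(rf"{_WORD}(?:{_JOINING_DOT}{_WORD})*")


def approx_text_tokens(value: str) -> set[str]:
    """Approximate the lowercased token set for a Text value or query."""
    return {token.lower() for token in _TOKEN.findall(value)}


def text_equals_matches(query: str, stored: str) -> bool:
    """Model `=` on a Text attribute: the query is tokenized too, and a
    match needs ANY query token to appear among the stored tokens (OR)."""
    return not approx_text_tokens(query).isdisjoint(approx_text_tokens(stored))


def keyword_equals_matches(query: str, stored: str) -> bool:
    """Model `=` on a Keyword attribute as exact string equality."""
    return query == stored


# Both query tokens are present in the stored value, so this matches.
print(text_equals_matches("order-processing", "order-processing-v2"))     # True
# False positive: only the shared token "processing" is needed.
print(text_equals_matches("order-processing-v2", "payment-processing"))   # True
# Underscore keeps the query as one token, which is not in the index.
print(text_equals_matches("order_processing", "order-processing"))        # False
# Keyword compares the whole value, so a partial identifier does not match.
print(keyword_equals_matches("order-processing", "order-processing-v2"))  # False
```

The second case shows why structured identifiers stored as Text can return unrelated workflows, while the Keyword model only matches the full value.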

Split the explanation into two clear sections:
1. Tokenization: how values are split into tokens (with delimiter table)
2. Search matching: = operator uses OR across tokens

The caution now explains both problems separately:
- Inconsistent tokenization (hyphens split, underscores/dots don't)
- OR matching causes false positives (processing-v2 matches anything with processing OR v2)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

search-attributes.mdx: Replace comparison table with goal-oriented guidance (use Keyword for X, KeywordList for Y, Text only for Z). Add tip that Keyword is the right default for most use cases.

list-filter.mdx: Reframe Text section as factual behavior explanation rather than warnings. Lead with redirect to Keyword for exact matching. Move Visibility 2.0 note to list-filter.mdx alongside the Text details.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Use "standard analyzer" instead of "standard tokenizer" since the lowercase behavior comes from the analyzer's lowercase filter
- Link to standard analyzer docs instead of tokenizer docs
- Add note about dot-digit-letter edge case (v1.ProcessOrder may split)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Tested all 33 char_group delimiters against a live Cloud namespace. Reframed table to highlight what does NOT split (the exceptions), with a single splitting example for context. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

austonli requested a review from a team as a code owner May 19, 2026 20:16

vercel Bot deployed to Preview May 19, 2026 20:28 View deployment

vercel Bot deployed to Preview May 19, 2026 20:49 View deployment

vercel Bot deployed to Preview May 19, 2026 21:02 View deployment

vercel Bot deployed to Preview May 19, 2026 21:19 View deployment

vercel Bot deployed to Preview May 19, 2026 21:33 View deployment

vercel Bot deployed to Preview May 19, 2026 22:22 View deployment

jsundai approved these changes May 20, 2026

jsundai reviewed May 20, 2026

Comment thread docs/encyclopedia/visibility/list-filter.mdx Outdated

jsundai reviewed May 20, 2026

Comment thread docs/encyclopedia/visibility/list-filter.mdx Outdated

jsundai reviewed May 20, 2026

vercel Bot deployed to Preview May 20, 2026 21:54 View deployment

austonli and others added 6 commits May 20, 2026 14:58

Clarify Text CSA tokenization behavior and add type selection guidance

aa7a9d7

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix tokenizer delimiter list: underscore does not split tokens

5252765

UAX #29 classifies underscore as ExtendNumLet (connector punctuation), so the standard tokenizer keeps underscore-joined words as single tokens. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

austonli force-pushed the docs/clarify-text-csa-tokenization branch from e673be6 to 3306ee0 Compare May 20, 2026 22:00

vercel Bot deployed to Preview May 20, 2026 22:02 View deployment

vercel Bot deployed to Preview May 20, 2026 22:04 View deployment

vercel Bot deployed to Preview May 20, 2026 22:26 View deployment

austonli wants to merge 8 commits into main from docs/clarify-text-csa-tokenization

3 participants
