Clarify Text CSA tokenization behavior and add type selection guidance #4595

Open
austonli wants to merge 8 commits into main from docs/clarify-text-csa-tokenization

@austonli austonli commented May 19, 2026

Summary

  • Adds a "Choose a string type" section to the Search Attributes page with a comparison table for Keyword vs KeywordList vs Text, including indexing behavior, supported operators, and best-use guidance
  • Expands the Text matching section in the List Filter page with concrete tokenization examples showing how common value shapes (dotted class names, hyphenated IDs, file paths) are split into tokens
  • Adds caution admonitions on both pages warning that Text type splits structured identifiers at punctuation, with cross-links between the two pages
  • Adds a forward-looking note that broader tokenizer improvements are being evaluated as part of future Visibility enhancements

Context

Customers frequently create Text CSAs for structured identifiers (workflow type names, UUIDs, hyphenated IDs) expecting exact-match or substring behavior, then hit confusing search failures because the standard tokenizer silently splits at punctuation. The existing docs mention the distinction briefly but don't explain the implications clearly enough to prevent this mistake.

Test plan

  • npm run build succeeds with no errors
  • Built HTML verified: tables, admonitions, and cross-links render correctly on both pages
  • Visual review of /search-attribute#choose-a-string-type
  • Visual review of /list-filter#text

🤖 Generated with Claude Code

┆Attachments: EDU-6394 Clarify Text CSA tokenization behavior and add type selection guidance

@austonli austonli requested a review from a team as a code owner May 19, 2026 20:16

vercel Bot commented May 19, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| temporal-documentation | Ready | Preview, Comment | May 20, 2026 10:26pm |

CLAassistant commented May 19, 2026

CLA assistant check
All committers have signed the CLA.

github-actions Bot commented May 19, 2026

📖 Docs PR preview links

##### Tokenization {#text-tokenization}

The standard tokenizer applies [Unicode word boundary rules](https://unicode.org/reports/tr29/) to split values into tokens.
All tokens are lowercased.

This section is going to save so many people from headaches. One tiny thing on this line: the lowercase behavior seems to come from the standard analyzer's lowercase token filter rather than from the tokenizer itself (the tokenizer alone would leave QUICK as QUICK). Since Temporal uses the standard analyzer for Text fields, everything written here is true in practice, but it might be worth swapping "standard tokenizer" for "standard analyzer" so it holds up if someone goes digging in the ES docs. Super minor, great work on this!
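The tokenizer-versus-analyzer distinction raised in this comment can be sketched in a few lines of Python. This is only an illustration, not Elasticsearch code: the function names are hypothetical, and the tokenizer stage is replaced by a plain whitespace split, which is enough to show where lowercasing happens.

```python
def standard_tokenize(value: str) -> list:
    # Hypothetical stand-in for the tokenizer stage: split on whitespace
    # and leave the case of each token untouched.
    return value.split()


def lowercase_filter(tokens: list) -> list:
    # Stand-in for the analyzer's lowercase token filter.
    return [token.lower() for token in tokens]


def analyze(value: str) -> list:
    # An analyzer is the tokenizer followed by its token filters.
    return lowercase_filter(standard_tokenize(value))


if __name__ == "__main__":
    print(standard_tokenize("The QUICK fox"))  # ['The', 'QUICK', 'fox']
    print(analyze("The QUICK fox"))            # ['the', 'quick', 'fox']
```

The tokenizer stage alone keeps `QUICK` as written; only the full pipeline lowercases it, which is why "standard analyzer" is the safer wording.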

Comment thread docs/encyclopedia/visibility/list-filter.mdx Outdated
Comment thread docs/encyclopedia/visibility/list-filter.mdx Outdated
All tokens are lowercased.
The splitting behavior depends on the delimiter:

| Delimiter | Splits? | Example |

@jsundai jsundai May 20, 2026


Love the tokenization table! Heads up on an edge case that might be worth double-checking: a dot between a digit and a letter may split, so something like v1.ProcessOrder might tokenize into v1 and ProcessOrder rather than staying as one token. It could be worth a small note, since that pattern shows up a lot in workflow names. For example: "Note: A dot between a digit and a letter (for example, v1.ProcessOrder) may split into separate tokens." Either way, this is a great addition!
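The edge case described here can be approximated with a small Python sketch. It is a deliberately simplified model, not the real standard analyzer: it assumes that letters, digits, and underscores always stay inside a token, that a dot stays inside a token only when it sits between two letters or between two digits, and that every other character splits. The real Unicode word boundary rules cover more characters than this, and lowercasing is left out, so treat `approx_tokenize` as a hypothetical helper for reasoning about the examples in this thread.

```python
def _same_class(left: str, right: str) -> bool:
    # A dot joins its neighbours only when both are letters or both are digits.
    return (left.isalpha() and right.isalpha()) or (left.isdigit() and right.isdigit())


def approx_tokenize(value: str) -> list:
    # Simplified approximation of the splitting behaviour discussed in this PR.
    tokens = []
    current = ""
    for i, ch in enumerate(value):
        if ch.isalnum() or ch == "_":
            current += ch  # word characters and underscores never split
        elif ch == "." and 0 < i < len(value) - 1 and _same_class(value[i - 1], value[i + 1]):
            current += ch  # dot between two letters or between two digits
        else:
            if current:
                tokens.append(current)  # any other character ends the token
            current = ""
    if current:
        tokens.append(current)
    return tokens


if __name__ == "__main__":
    print(approx_tokenize("v1.ProcessOrder"))            # ['v1', 'ProcessOrder']
    print(approx_tokenize("v1.2.3"))                     # ['v1.2.3']
    print(approx_tokenize("com.example.OrderWorkflow"))  # ['com.example.OrderWorkflow']
    print(approx_tokenize("order-processing-v2"))        # ['order', 'processing', 'v2']
```

Under these assumptions the digit-dot-letter case splits while the letter-dot-letter and digit-dot-digit cases do not, which matches the behaviour reported in the review comment and the commit messages.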

austonli and others added 6 commits May 20, 2026 14:58
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
UAX #29 classifies underscore as ExtendNumLet (connector punctuation),
so the standard tokenizer keeps underscore-joined words as single tokens.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dots between words do not split tokens (contrary to initial assumption).
Replace token-output table with query-based examples showing what
actually matches vs. doesn't. Reframe guidance around unpredictability
of standard tokenizer for structured identifiers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verified against aul-tokenizer-demo namespace:
- Query strings are tokenized too, so = "order-processing" DOES match
  order-processing-v2 (both query tokens present in index)
- = operator uses OR matching: = "order-processing-v2" matches any
  workflow with order OR processing OR v2 as a token
- Dots between digits stay together: v1.2.3 is one token
- Added OR matching warning and expanded example table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
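The OR matching described in the commit message above can be modeled with token sets. The Python below is a minimal sketch under stated assumptions: `text_tokens` is a hypothetical stand-in that splits only on hyphens and whitespace and lowercases the result, so it is valid only for hyphenated values like the ones in that commit, and `text_equals` models the `=` operator as "any query token appears among the indexed tokens".

```python
import re


def text_tokens(value: str) -> set:
    # Hypothetical stand-in tokenizer: hyphens and whitespace split, tokens are lowercased.
    return {token.lower() for token in re.split(r"[-\s]+", value) if token}


def text_equals(query: str, indexed_value: str) -> bool:
    # OR semantics: the query string is tokenized too, and a single shared
    # token is enough for a match.
    return bool(text_tokens(query) & text_tokens(indexed_value))


if __name__ == "__main__":
    # Both query tokens are present in the indexed value.
    print(text_equals("order-processing", "order-processing-v2"))   # True
    # False positive: only "processing" is shared.
    print(text_equals("processing-v2", "payment-processing"))       # True
    # No shared token at all.
    print(text_equals("billing", "order-processing-v2"))            # False
```

The second call is the false-positive pattern the commit warns about: a query that looks like an exact identifier still matches any value that shares one token with it.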
Split the explanation into two clear sections:
1. Tokenization: how values are split into tokens (with delimiter table)
2. Search matching: = operator uses OR across tokens

The caution now explains both problems separately:
- Inconsistent tokenization (hyphens split, underscores/dots don't)
- OR matching causes false positives (processing-v2 matches anything
  with processing OR v2)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
search-attributes.mdx: Replace comparison table with goal-oriented
guidance (use Keyword for X, KeywordList for Y, Text only for Z).
Add tip that Keyword is the right default for most use cases.

list-filter.mdx: Reframe Text section as factual behavior explanation
rather than warnings. Lead with redirect to Keyword for exact matching.
Move Visibility 2.0 note to list-filter.mdx alongside the Text details.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use "standard analyzer" instead of "standard tokenizer" since the
  lowercase behavior comes from the analyzer's lowercase filter
- Link to standard analyzer docs instead of tokenizer docs
- Add note about dot-digit-letter edge case (v1.ProcessOrder may split)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tested all 33 char_group delimiters against a live Cloud namespace.
Reframed table to highlight what does NOT split (the exceptions),
with a single splitting example for context.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>