Clarify Text CSA tokenization behavior and add type selection guidance #4595
austonli wants to merge 8 commits into
##### Tokenization {#text-tokenization}

The standard tokenizer applies [Unicode word boundary rules](https://unicode.org/reports/tr29/) to split values into tokens.
All tokens are lowercased.
This section is going to save so many people from headaches. One tiny thing on this line: the lowercase behavior seems to come from the standard analyzer's lowercase token filter rather than the tokenizer itself (the tokenizer alone would leave QUICK as QUICK). Since Temporal uses the standard analyzer for Text fields, everything written is true in practice, but it might be worth swapping "standard tokenizer" for "standard analyzer" so it holds up if someone goes digging in the ES docs. Super minor, great work on this!
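To illustrate the tokenizer-versus-analyzer distinction raised here, a minimal Python sketch follows. It is a toy model, not Elasticsearch code: the `\w+` regex is a simplification that does not implement the Unicode word boundary rules, and the names `toy_tokenize`, `toy_lowercase_filter`, and `toy_analyze` are hypothetical. It only shows that lowercasing happens in a filter stage applied after tokenization.

```python
import re


def toy_tokenize(text):
    """Toy stand-in for a tokenizer: split on non-word characters, keep case.

    Assumption: this regex is NOT the UAX #29 algorithm, only an illustration.
    """
    return re.findall(r"\w+", text)


def toy_lowercase_filter(tokens):
    """Toy stand-in for a lowercase token filter."""
    return [token.lower() for token in tokens]


def toy_analyze(text):
    """An analyzer is a tokenizer followed by token filters."""
    return toy_lowercase_filter(toy_tokenize(text))


tokenizer_only = toy_tokenize("The QUICK brown-fox")
analyzed = toy_analyze("The QUICK brown-fox")

print(tokenizer_only)  # ['The', 'QUICK', 'brown', 'fox']
print(analyzed)        # ['the', 'quick', 'brown', 'fox']
```

The tokenizer stage alone leaves QUICK as QUICK. Only the combined pipeline produces lowercase tokens, which is why "standard analyzer" is the more precise term for the documented behavior.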
All tokens are lowercased.
The splitting behavior depends on the delimiter:

| Delimiter | Splits? | Example |
Love the tokenization table! Heads up on an edge case that might be worth double-checking: a dot between a digit and a letter may split, so something like v1.ProcessOrder might tokenize into v1 and ProcessOrder rather than staying as one token. Could be worth a small note, since that pattern shows up a lot in workflow names. For example: "Note: A dot between a digit and a letter (for example, v1.ProcessOrder) may split into separate tokens." Either way this is a great addition!
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
UAX #29 classifies underscore as ExtendNumLet (connector punctuation), so the standard tokenizer keeps underscore-joined words as single tokens.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dots between words do not split tokens (contrary to initial assumption). Replace token-output table with query-based examples showing what actually matches vs. doesn't. Reframe guidance around unpredictability of standard tokenizer for structured identifiers.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verified against aul-tokenizer-demo namespace:
- Query strings are tokenized too, so = "order-processing" DOES match order-processing-v2 (both query tokens present in index)
- = operator uses OR matching: = "order-processing-v2" matches any workflow with order OR processing OR v2 as a token
- Dots between digits stay together: v1.2.3 is one token
- Added OR matching warning and expanded example table
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Split the explanation into two clear sections:
1. Tokenization: how values are split into tokens (with delimiter table)
2. Search matching: = operator uses OR across tokens
The caution now explains both problems separately:
- Inconsistent tokenization (hyphens split, underscores/dots don't)
- OR matching causes false positives (processing-v2 matches anything with processing OR v2)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
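The two problems described in these commit messages (hyphens split while underscores do not, and = matches when any query token is present) can be sketched with a toy model in Python. This is an illustration under stated assumptions, not Temporal or Elasticsearch code. The hypothetical `toy_tokens` splits only on hyphens and whitespace and lowercases, and the hypothetical `toy_equals_match` treats = as "any shared token".

```python
import re


def toy_tokens(value):
    """Assumption: split only on hyphens and whitespace, then lowercase.

    Underscores and dots are left inside tokens in this toy model.
    """
    return {token for token in re.split(r"[-\s]+", value.lower()) if token}


def toy_equals_match(query, indexed_value):
    """Model '=' as OR matching: true if any query token is an indexed token."""
    return bool(toy_tokens(query) & toy_tokens(indexed_value))


# The query is tokenized too, so a shorter hyphenated query matches.
print(toy_equals_match("order-processing", "order-processing-v2"))      # True

# OR matching: sharing only 'processing' is enough for a match.
print(toy_equals_match("order-processing-v2", "payment-processing"))    # True

# Underscore-joined values stay one token, so no query token lines up.
print(toy_equals_match("order-processing-v2", "order_processing_v2"))   # False
```

The second case is the false positive the caution describes; the third shows why two values that look equivalent to a person can behave differently in search.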
search-attributes.mdx: Replace comparison table with goal-oriented guidance (use Keyword for X, KeywordList for Y, Text only for Z). Add tip that Keyword is the right default for most use cases.
list-filter.mdx: Reframe Text section as factual behavior explanation rather than warnings. Lead with redirect to Keyword for exact matching. Move Visibility 2.0 note to list-filter.mdx alongside the Text details.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
e673be6 to 3306ee0
- Use "standard analyzer" instead of "standard tokenizer" since the lowercase behavior comes from the analyzer's lowercase filter
- Link to standard analyzer docs instead of tokenizer docs
- Add note about dot-digit-letter edge case (v1.ProcessOrder may split)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tested all 33 char_group delimiters against a live Cloud namespace. Reframed table to highlight what does NOT split (the exceptions), with a single splitting example for context.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Context
Customers frequently create Text CSAs for structured identifiers (workflow type names, UUIDs, hyphenated IDs) expecting exact-match or substring behavior, then hit confusing search failures because the standard tokenizer silently splits values at some punctuation (hyphens, for example) but not others. The existing docs mention the distinction between the string types briefly but don't explain the implications clearly enough to prevent this mistake.
Test plan
- npm run build succeeds with no errors
- /search-attribute#choose-a-string-type
- /list-filter#text

🤖 Generated with Claude Code
┆Attachments: EDU-6394 Clarify Text CSA tokenization behavior and add type selection guidance