expression: add full SQL auto embedding support for starter mode#69014
expression: add full SQL auto embedding support for starter mode#69014ChangRui-Ryan wants to merge 4 commits into
Conversation
|
@ChangRui-Ryan I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details. ⏳ This process typically takes 10-30 minutes depending on the complexity of the changes. ℹ️ Learn more details on Pantheon AI. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Hi @ChangRui-Ryan. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughAdds hosted-embedding config and sysvars, an EmbedFn runtime with batching and provider clients, EMBED_TEXT builtin and AST helpers, DDL validation for auto-embedding, executor pre-evaluation for generated embedding columns, planner rewrites, and extensive provider/unit/integration tests. ChangesEmbedding feature merge
Sequence Diagram(s)sequenceDiagram
participant Client
participant Planner
participant Executor
participant Domain
participant EmbedFn
participant Batcher
participant Provider
Client->>Planner: SQL parse & rewrite (VEC_EMBED_* -> EMBED_TEXT)
Planner->>Executor: execute plan with generated columns
Executor->>Domain: request embed function (via domain.GetEmbedFn)
Domain->>EmbedFn: GetEmbedFn()
Executor->>EmbedFn: Embed(shouldCancel, model/provider, text, opts)
EmbedFn->>EmbedFn: check cache (get)
EmbedFn->>Batcher: CreateEmbeddings(model/provider, texts, opts)
Batcher->>Provider: CreateEmbeddings(batch request)
Provider-->>Batcher: embeddings or error
Batcher-->>EmbedFn: batched results
EmbedFn->>EmbedFn: cache set
EmbedFn-->>Executor: embedding vector
Executor-->>Client: DML/Query result
Estimated code review effort🎯 5 (Critical) | ⏱️ ~90+ minutes Suggested labels
Suggested reviewers
Poem
✨ Finishing Touches🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 19
🧹 Nitpick comments (12)
pkg/planner/core/expression_rewriter.go (1)
3027-3031: ⚡ Quick winImprove error message robustness by handling empty column name.
Line 3029 uses
vecArg.OrigNamein the error message, which may be empty. Consider falling back tovecArg.String()or a placeholder to ensure the error message is always informative.♻️ Proposed improvement
+ colName := vecArg.OrigName + if colName == "" { + colName = vecArg.String() + } vecArg, ok := args[0].(*expression.Column) if !ok || vecArg.GeneratedExprString == "" || !vecArg.RetType.EvalType().IsVectorKind() { - er.err = fmt.Errorf("%s() first argument must be a vector embedding column generated by EMBED_TEXT()", v.FnName.O) + er.err = fmt.Errorf("%s() first argument must be a vector embedding column generated by EMBED_TEXT(), got %s", v.FnName.O, colName) return true }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/planner/core/expression_rewriter.go` around lines 3027 - 3031, The error message for the vector argument in expression_rewriter.go can be empty because vecArg.OrigName may be empty; update the check around the vecArg handling in the block that sets er.err (where vecArg is derived from args[0] and v.FnName.O is used) to build a robust name: if vecArg.OrigName is empty, use vecArg.String() or a constant placeholder like "<unknown column>" when formatting the error text so the message is always informative and never blank.pkg/inference/embedding/base/BUILD.bazel (1)
15-15: 💤 Low valueReconsider the flaky marking for deterministic tests.
The
base_testis marked asflaky = True, but reviewingbase_test.goshows it only performs deterministic base64 decoding and validation tests without concurrency or timing dependencies. Unless there's a known flakiness issue, consider removing this marking to ensure test failures are investigated rather than automatically retried.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/inference/embedding/base/BUILD.bazel` at line 15, The test target "base_test" in the BUILD rule is marked flaky = True even though base_test only runs deterministic base64 decode/validation; remove the flaky = True attribute (or set flaky = False) from the base_test rule so failures surface for investigation, then run the test suite locally to confirm no flakiness before committing.pkg/inference/embedding/batcher/batcher.go (1)
244-260: ⚡ Quick winDocument the batch-level context cancellation semantics.
The batch request context (
reqCtx) only cancels when all individual request contexts are done. This means if requests with different timeouts are batched together, a request with a short timeout may wait longer than expected (bounded bybatchWindow). While this is correct batching behavior, consider adding a comment explaining this trade-off for future maintainers.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/inference/embedding/batcher/batcher.go` around lines 244 - 260, Add a clear comment above the reqCtx/cancelReq creation and goroutine explaining that reqCtx only cancels after all individual request contexts in thisBatch.calls are done (the goroutine loops over call.ctx and only calls cancelReq after every call is done), so batching can cause a request with a short timeout to wait longer (up to batchWindow) before its batch-level context cancels; state this is intentional for batching and note the trade-off for maintainers and any expectations around embedder.CreateEmbeddings using reqCtx.pkg/inference/embedding/huggingface/BUILD.bazel (2)
23-23: ⚡ Quick winInvestigate why tests are marked flaky.
The
flaky = Trueattribute suggests these tests have non-deterministic failures. Consider investigating and fixing the root cause rather than marking tests as flaky, as this can mask real issues and reduce test signal quality.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/inference/embedding/huggingface/BUILD.bazel` at line 23, The BUILD.bazel target currently sets flaky = True, which masks nondeterministic test failures; remove or revert the flaky = True attribute on the target and instead reproduce the intermittent failures by running the corresponding Bazel test target(s) repeatedly (e.g., bazel test --runs_per_test=50), add deterministic fixes (stabilize timers, seed randomness, isolate shared state, add proper setup/teardown) in the test code, and only reintroduce flaky metadata after root causes are identified; look for references to the BUILD target that contains flaky = True to locate the tests to run and stabilize.
18-26: 💤 Low valueInconsistent shard_count across provider tests.
This test uses
shard_count = 10, while the Gemini tests useshard_count = 7. Consider establishing a consistent sharding strategy across similar test suites, or document why different providers need different shard counts.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/inference/embedding/huggingface/BUILD.bazel` around lines 18 - 26, The shard_count for the Huggingface test target "huggingface_test" is set to 10 which is inconsistent with the Gemini provider tests (which use shard_count = 7); update the "huggingface_test" target in BUILD.bazel to use the same shard_count value as other provider tests (e.g., change shard_count to 7) or add a brief comment in the BUILD file next to the "huggingface_test" rule explaining why it must remain different so the discrepancy is documented and intentional.pkg/inference/embedding/gemini/BUILD.bazel (1)
18-26: ⚡ Quick winInvestigate why tests are marked flaky.
The
flaky = Trueattribute suggests these tests have non-deterministic failures. Consider investigating and fixing the root cause rather than marking tests as flaky, as this can mask real issues and reduce test signal quality.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/inference/embedding/gemini/BUILD.bazel` around lines 18 - 26, The test target "gemini_test" in the BUILD.bazel currently sets flaky = True which masks nondeterministic failures; investigate and fix the root causes in gemini_test.go (race conditions, timing, shared state, or reliance on external services) by reproducing failures locally (run the "gemini_test" target without flaky and with shard_count reduced), add deterministic seeds or explicit synchronization in the test code, increase the timeout only if legitimately slow, and then remove the flaky = True attribute from the "gemini_test" rule once the tests are stable.pkg/inference/embedding/jina/jina.go (2)
55-55: ⚖️ Poor tradeoffConfigure HTTP client timeout to prevent indefinite hangs.
The
http.Clientis initialized with default settings (no timeout), which can cause requests to hang indefinitely if the JinaAI API is slow or unresponsive. While the context provides cancellation, the client itself should have a reasonable timeout.⏱️ Suggested fix to add timeout configuration
func NewJinaEmbedder(cfg EmbedderConfig) *Embedder { return &Embedder{ - client: http.Client{}, + client: http.Client{ + Timeout: 30 * time.Second, + }, cfg: cfg, }Consider making the timeout configurable via
EmbedderConfigif different deployments need different values.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/inference/embedding/jina/jina.go` at line 55, The http.Client is created with no timeout (client: http.Client{}), risking indefinite hangs; update the embedder initialization (where the client is constructed in jina.go) to set a sensible Timeout on the http.Client (e.g., from a new EmbedderConfig field) and wire that timeout through the constructor/path that creates the client so the timeout becomes configurable via EmbedderConfig and used when initializing the client.
116-119: 💤 Low valueAPI key may be logged in error responses.
When logging failed requests at line 118, the
bodymay contain echoed authorization information. While theAuthorizationheader isn't logged directly, verify whether JinaAI's error responses could leak sensitive data.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/inference/embedding/jina/jina.go` around lines 116 - 119, The error log in pkg/inference/embedding/jina/jina.go currently logs the full response body in the logutil.BgLogger().Error call (the line calling logutil.BgLogger().Error with zap.String("body", string(body))) which could leak echoed API keys; before calling that logger (or replace that call), sanitize the response body by parsing/masking any sensitive fields (e.g., "authorization", "api_key", "token", "access_token", "credentials") or redact Authorization-like patterns (e.g., Bearer <token>) and then log the sanitized string; update the code around the resp handling in the same function to produce a safeBody string and pass zap.String("body", safeBody) to the logger.pkg/inference/embedding/huggingface/huggingface.go (2)
121-124: 💤 Low valueAPI key may be logged in error responses.
When logging failed requests at line 123, the
bodymay contain echoed authorization information or API keys from the error response. While theAuthorizationheader itself isn't logged here, some APIs echo request details in error responses. Consider whether this is a concern for HuggingFace's API.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/inference/embedding/huggingface/huggingface.go` around lines 121 - 124, The error log at the HuggingFace request failure uses the raw response body (zap.String("body", string(body))) which may contain echoed API keys or authorization data; update the failure handling in the function that makes the request (the code that calls resp and builds the log with logutil.BgLogger()) to avoid logging sensitive contents by either removing the full body from the log or sanitizing it first (e.g., detect and redact common keys like "authorization", "api_key", "token" or apply a safe snippet/length limit and replace sensitive substrings with "[REDACTED]"), and keep the existing metadata (resp.StatusCode) in the log. Ensure you change the zap.String("body", ...) usage to use the sanitized string instead of the raw body.
54-54: ⚖️ Poor tradeoffConfigure HTTP client timeout to prevent indefinite hangs.
The
http.Clientis initialized with default settings, which means no timeout. This can cause requests to hang indefinitely if the HuggingFace API is slow or unresponsive. While the context passed tohttp.NewRequestWithContextprovides cancellation, the client itself should have a reasonable timeout for connection and overall request duration.⏱️ Suggested fix to add timeout configuration
func NewHuggingFaceEmbedder(cfg EmbedderConfig) *Embedder { return &Embedder{ - client: http.Client{}, + client: http.Client{ + Timeout: 30 * time.Second, + }, cfg: cfg, }Alternatively, make the timeout configurable via
EmbedderConfigif different deployments need different values.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/inference/embedding/huggingface/huggingface.go` at line 54, The http.Client is created with zero timeout (client: http.Client{}) which can hang; update the HuggingFace embedder initialization to set a sensible Timeout (e.g., 15–30s) on the http.Client instance used by the embedder, or wire a timeout value from EmbedderConfig into the client construction so the Timeout field is configurable; change the client creation referenced by the client: http.Client{} initializer in the huggingface embedder to use http.Client{Timeout: ...} (or build the client with a config-supplied duration).pkg/inference/embedding/jina/jina_test.go (1)
1-227: ⚡ Quick winConsider adding missing API key test coverage.
The HuggingFace tests include
TestHuggingFaceEmbedder_NoAPIKey,TestHuggingFaceEmbedder_CustomErrorHandling, andTestHuggingFaceEmbedder_CustomUnauthorizedErrorto verify API key validation and custom error handling. The Jina tests lack equivalent coverage for these scenarios. Consider adding similar tests for consistency and completeness.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/inference/embedding/jina/jina_test.go` around lines 1 - 227, Add unit tests to cover missing API key and custom error handling for the Jina embedder: create tests similar to HuggingFace ones that exercise NewJinaEmbedder and CreateEmbeddings for (1) no API key supplied (e.g., TestJinaEmbedder_NoAPIKey) verifying it returns an error and nil embeddings, (2) custom unauthorized response handling (e.g., TestJinaEmbedder_CustomUnauthorizedError) mocking a 401 response and asserting the error contains expected message, and (3) custom error payload handling (e.g., TestJinaEmbedder_CustomErrorHandling) mocking non-2xx responses with body and asserting CreateEmbeddings surfaces the body details; place these tests alongside existing TestJinaEmbedder_* functions and reuse httptest.NewServer, EmbedderConfig GetAPIKey/GetBaseURL, and assertions on returned embeddings/errors to mirror the HuggingFace tests.pkg/inference/embedding/nvidia/protocol.go (1)
17-22: 💤 Low valueRemove unused
Requeststruct or use it in the implementation.The
Requeststruct is defined but never used. Innvidia.golines 89-93, the request body is built using a map literal passed tobase.MarshalJSONWithOptionsinstead of marshalling this struct. This is dead code that adds maintenance burden.🧹 Option 1: Remove the unused struct
-// Request is the model for Nvidia NIM embeddings API request. -type Request struct { - Input []string `json:"input"` - Model string `json:"model"` - EncodingFormat string `json:"encoding_format"` -} - // Response is the model for Nvidia NIM embeddings API response.Alternatively, if the struct is intended for future use or documentation purposes, add a comment explaining why it's not currently used in the implementation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/inference/embedding/nvidia/protocol.go` around lines 17 - 22, The Request struct (type Request) in protocol.go is dead code because the Nvidia embeddings request body is constructed directly via a map passed to base.MarshalJSONWithOptions in nvidia.go; either remove the unused Request type or explicitly use it when marshalling (replace the map literal with an instance of Request and marshal that), or if you keep the type for future/documentation purposes add a clear comment above type Request explaining why it is intentionally unused today and referencing the code path in nvidia.go that builds the payload via base.MarshalJSONWithOptions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/executor/insert_common.go`:
- Around line 688-699: The code captures a single shared evalCtx (evalCtx :=
e.Ctx().GetExprCtx().GetEvalCtx()) and reuses it from every goroutine, which can
race statement-scoped state when calling task.expr.Eval; instead create a
distinct EvalCtx per worker before calling task.expr.Eval (e.g., call
e.Ctx().GetExprCtx().GetEvalCtx() or use a provided Clone/New method inside the
eg.Go closure) so each goroutine has its own evalCtx; update the goroutine body
around task.expr.Eval and chunk.MutRowFromDatums(task.row).ToRow() to use that
per-goroutine evalCtx and ensure any cleanup if the EvalCtx type requires it.
- Around line 647-650: The code snapshots a single uint64 rowCntInLoadData from
e.rowCount for the whole batch when
e.Ctx().GetSessionVars().StmtCtx.InLoadDataStmt is true, then passes that same
value into HandleBadNull for every row causing LOAD DATA errors to be reported
against the batch's last row; change the logic to use the per-row LOAD DATA
index instead of the single snapshot: inside the per-row loop that calls
HandleBadNull, compute the correct per-row LOAD DATA index (using the per-row
counter/variable used to iterate the batch or the LOAD DATA line index) and pass
that per-row value to HandleBadNull rather than rowCntInLoadData; update any use
sites referencing rowCntInLoadData (e.g., where HandleBadNull is invoked) to use
the per-row index so failures are reported against the failing row.
In `@pkg/expression/builtin_inference.go`:
- Around line 1-13: Update the license header year in
pkg/expression/builtin_inference.go to match the corresponding test file
(builtin_inference_test.go) so the copyright years are consistent across related
files; open builtin_inference.go, locate the top-of-file license comment block
and change the "Copyright 2025" year to the same year used in
builtin_inference_test.go (2026) or vice versa if you prefer to standardize on
2025—ensure both files have the identical year in their header comment.
In `@pkg/expression/inference_helper.go`:
- Around line 83-85: ExtractAutoEmbedInfoFromAST currently only rejects fewer
than 2 args but silently accepts extra args; change the argument validation to
explicitly allow only the supported arities (e.g., exactly 2 or exactly 3
arguments) and return an error for any other arity. Update the len(fnCall.Args)
check(s) that currently read "if len(fnCall.Args) < 2" to instead validate
membership in the allowed set (len == 2 || len == 3) and return
fmt.Errorf("invalid EMBED_TEXT() usage: unexpected arity") when not allowed;
apply this same change to the second arg-count validation block in the file that
also inspects fnCall.Args (the later branch inside ExtractAutoEmbedInfoFromAST).
In `@pkg/expression/integration_test/integration_test.go`:
- Around line 4507-4513: The test mutates the global sysvar
tidb_exp_embed_jina_ai_api_key and currently only resets it at the end of the
happy path, risking leakage if an assertion fails; register an unconditional
cleanup before the first mutation (e.g., call t.Cleanup(func(){ tk.MustExec("set
@@global.tidb_exp_embed_jina_ai_api_key = ''") })) so the global is always
reset, and then proceed with the existing tk.MustExec/tk.MustQuery assertions
that set and check tidb_exp_embed_jina_ai_api_key.
In `@pkg/inference/embedding/cohere/cohere.go`:
- Around line 52-56: NewCohereEmbedder constructs an http.Client without a
timeout causing potential indefinite hangs; update NewCohereEmbedder to
initialize client: http.Client{Timeout: time.Second * <reasonable_value>}
(choose e.g. 10s or configurable), and add the "time" import; ensure the
Embedder struct's client field still uses that configured http.Client so all
Cohere requests get the client-level timeout as defense-in-depth.
In `@pkg/inference/embedding/gemini/gemini.go`:
- Around line 98-100: The model string is interpolated directly into the path
when building fullURL, which can break URLs or allow unintended endpoints;
before the fmt.Sprintf call that sets fullURL, validate that model is non-empty
and does not contain path traversal or illegal chars (e.g., reject segments with
"/" or "..") or alternatively escape it with a path-encoding function (e.g.,
url.PathEscape) and use the escaped value (escapedModel) in place of model;
update the code around variables fullURL, model and baseURL to perform this
validation/escaping and return an error if validation fails.
- Around line 51-56: The NewGeminiEmbedder function constructs an Embedder with
an http.Client that lacks a timeout; update NewGeminiEmbedder to set a sensible
client timeout (e.g., via http.Client{Timeout: time.Second * <N>}) so requests
cannot hang indefinitely, and add the "time" import; modify the Embedder
construction (function NewGeminiEmbedder, type Embedder and its client field) to
use the configured timeout value.
In `@pkg/inference/embedding/huggingface/protocol.go`:
- Around line 22-24: The comment above the Response type is inconsistent with
the actual type; update the docstring for Response (type Response [][]float32 in
protocol.go) to state that HuggingFace returns a 2D array of float32 values, or
if you confirm the API returns float64 instead, change the type to [][]float64
and update all usages accordingly; ensure the comment and the type (Response)
match exactly.
In `@pkg/inference/embedding/jina/BUILD.bazel`:
- Around line 23-24: The BUILD target currently sets flaky = True and
shard_count = 6; remove these patch-work flags and fix the underlying test
instability instead: locate the failing tests under the
pkg/inference/embedding/jina package, reproduce the flakes, and address root
causes such as race conditions, timing-dependent assertions, shared state, or
unmocked external dependencies by adding proper synchronization, deterministic
fixtures, explicit mocks, or adjusting timeouts; once tests are deterministic,
delete flaky = True and shard_count = 6 from the BUILD.bazel target so CI runs a
single stable shard.
In `@pkg/inference/embedding/mock/BUILD.bazel`:
- Around line 16-17: The BUILD.bazel target in pkg/inference/embedding/mock
currently sets flaky = True and shard_count = 3—remove those attributes and make
the tests deterministic instead of masking failures; locate the mock test target
in BUILD.bazel and (1) remove flaky and shard_count, (2) audit tests in
pkg/inference/embedding/mock for shared global state, non-deterministic
timing/race conditions, or reliance on test ordering, and (3) fix by adding
proper setup/teardown (reset globals, use fresh temp dirs, inject deterministic
mocks/fakes, or serialize accesses with locks) so the tests pass reliably in
parallel without sharding. Ensure the target no longer needs flaky/shard_count
after these fixes and run the test suite to confirm stability.
In `@pkg/inference/embedding/mock/mock.go`:
- Around line 72-84: The delay handling sleeps unconditionally and ignores the
provided context, causing hangs if the context is cancelled; modify the block
that parses opts["delay"] so that instead of time.Sleep(dur) you wait with a
select on time.After(dur) and ctx.Done(), returning early with ctx.Err() when
the context is cancelled. Keep the existing parsing and error messages (the opts
map and delay parsing code) and only replace the unconditional sleep with a
context-aware select that returns (nil, ctx.Err()) on cancellation.
In `@pkg/inference/embedding/nvidia/nvidia.go`:
- Around line 147-161: The current loop assigns embeddings by the sorted
position (row) which can misplace vectors if indices are duplicated or
non-sequential; instead validate that respObj.Data contains exactly one entry
for each index 0..len(texts)-1 and then assign using the actual Index.
Specifically, after sort.Slice(respObj.Data, ...) iterate respObj.Data and for
each item check that 0 <= item.Index < len(texts) and that
embeddings[item.Index] is nil (to detect duplicates); on any invalid or
duplicate index return an error indicating the problematic item.Index; otherwise
decode with base.DecodeFloat32ArrayBytes and set embeddings[item.Index] = e (not
embeddings[row]).
- Around line 54-56: The HTTP client for Embedder is created with zero timeout
(http.Client{}), which can cause indefinite hangs; set a sensible Timeout on the
client (e.g., http.Client{Timeout: ...}) and make that timeout configurable via
EmbedderConfig (add a Timeout field and use it when constructing the Embedder's
client); update the constructor that returns &Embedder{client: http.Client{},
cfg: cfg, ...} to initialize client with the configured Timeout and provide a
reasonable default if cfg.Timeout is unset.
In `@pkg/inference/embedding/openai/openai_test.go`:
- Line 28: Replace the misleading comment "Mock successful response from real
Jina API" in openai_test.go with a correct description for the OpenAI embedder
(e.g., "Mock successful response from real OpenAI API" or "Mock successful
response for OpenAI embedder") so the comment accurately matches the test;
locate the comment by searching for that exact string in the test file and
update it in the test function that sets up the mocked OpenAI response.
In `@pkg/inference/embedding/openai/openai.go`:
- Around line 142-157: After sorting respObj.Data by Index, validate that each
item's Index is within [0, len(texts)-1] and that indices form a unique sequence
(no duplicates) before assigning into embeddings; for each item in respObj.Data
call base.DecodeFloat32ArrayBytes(item.Embedding) as before but place the
decoded vector into embeddings[item.Index] (not embeddings[row]), and return a
clear error if any item.Index is out of range or a duplicate/missing so we never
misassign embeddings; reference respObj.Data, item.Index, embeddings and
base.DecodeFloat32ArrayBytes to locate and implement the checks and assignment
change.
- Around line 54-58: The http.Client in NewOpenAIEmbedder is created with
default settings (no timeout); update NewOpenAIEmbedder to set a sensible
timeout on Embedder.client by using a configured value from EmbedderConfig
(e.g., add a Timeout field to EmbedderConfig) and fall back to a reasonable
default if not provided; ensure the Embedder struct uses http.Client{Timeout:
cfg.Timeout} (or default) so embedding requests cannot hang indefinitely.
In `@pkg/planner/core/expression_rewriter.go`:
- Around line 3038-3042: The error wrapping currently replaces the original
error with a new one via fmt.Errorf in the block that calls
expression.ExtractAutoEmbedInfoFromAST(astExpr), losing the original
stack/context; change the assignment to preserve the original error by using
errors.Annotatef (or the project's equivalent) to annotate err and assign that
to er.err (refer to expression.ExtractAutoEmbedInfoFromAST, er.err, and
fmt.Errorf) so the original error context is preserved while adding the "column
%s cannot be used in %s" message.
- Around line 3057-3060: The inline initializer fnInit currently does an
unchecked type assertion to *expression.BuiltinEmbedTextSig which can panic;
update fnInit (used by er.newFunctionWithInit) to perform a safe type assertion
(e.g., using the "v, ok := sf.Function.(*expression.BuiltinEmbedTextSig)"
pattern or a type switch), set v.IsFromVecSearch = true only when ok, and return
a descriptive error (instead of panicking) when the function is not the expected
BuiltinEmbedTextSig type so callers can handle the failure.
---
Nitpick comments:
In `@pkg/inference/embedding/base/BUILD.bazel`:
- Line 15: The test target "base_test" in the BUILD rule is marked flaky = True
even though base_test only runs deterministic base64 decode/validation; remove
the flaky = True attribute (or set flaky = False) from the base_test rule so
failures surface for investigation, then run the test suite locally to confirm
no flakiness before committing.
In `@pkg/inference/embedding/batcher/batcher.go`:
- Around line 244-260: Add a clear comment above the reqCtx/cancelReq creation
and goroutine explaining that reqCtx only cancels after all individual request
contexts in thisBatch.calls are done (the goroutine loops over call.ctx and only
calls cancelReq after every call is done), so batching can cause a request with
a short timeout to wait longer (up to batchWindow) before its batch-level
context cancels; state this is intentional for batching and note the trade-off
for maintainers and any expectations around embedder.CreateEmbeddings using
reqCtx.
In `@pkg/inference/embedding/gemini/BUILD.bazel`:
- Around line 18-26: The test target "gemini_test" in the BUILD.bazel currently
sets flaky = True which masks nondeterministic failures; investigate and fix the
root causes in gemini_test.go (race conditions, timing, shared state, or
reliance on external services) by reproducing failures locally (run the
"gemini_test" target without flaky and with shard_count reduced), add
deterministic seeds or explicit synchronization in the test code, increase the
timeout only if legitimately slow, and then remove the flaky = True attribute
from the "gemini_test" rule once the tests are stable.
In `@pkg/inference/embedding/huggingface/BUILD.bazel`:
- Line 23: The BUILD.bazel target currently sets flaky = True, which masks
nondeterministic test failures; remove or revert the flaky = True attribute on
the target and instead reproduce the intermittent failures by running the
corresponding Bazel test target(s) repeatedly (e.g., bazel test
--runs_per_test=50), add deterministic fixes (stabilize timers, seed randomness,
isolate shared state, add proper setup/teardown) in the test code, and only
reintroduce flaky metadata after root causes are identified; look for references
to the BUILD target that contains flaky = True to locate the tests to run and
stabilize.
- Around line 18-26: The shard_count for the Huggingface test target
"huggingface_test" is set to 10 which is inconsistent with the Gemini provider
tests (which use shard_count = 7); update the "huggingface_test" target in
BUILD.bazel to use the same shard_count value as other provider tests (e.g.,
change shard_count to 7) or add a brief comment in the BUILD file next to the
"huggingface_test" rule explaining why it must remain different so the
discrepancy is documented and intentional.
In `@pkg/inference/embedding/huggingface/huggingface.go`:
- Around line 121-124: The error log at the HuggingFace request failure uses the
raw response body (zap.String("body", string(body))) which may contain echoed
API keys or authorization data; update the failure handling in the function that
makes the request (the code that calls resp and builds the log with
logutil.BgLogger()) to avoid logging sensitive contents by either removing the
full body from the log or sanitizing it first (e.g., detect and redact common
keys like "authorization", "api_key", "token" or apply a safe snippet/length
limit and replace sensitive substrings with "[REDACTED]"), and keep the existing
metadata (resp.StatusCode) in the log. Ensure you change the zap.String("body",
...) usage to use the sanitized string instead of the raw body.
- Line 54: The http.Client is created with zero timeout (client: http.Client{})
which can hang; update the HuggingFace embedder initialization to set a sensible
Timeout (e.g., 15–30s) on the http.Client instance used by the embedder, or wire
a timeout value from EmbedderConfig into the client construction so the Timeout
field is configurable; change the client creation referenced by the client:
http.Client{} initializer in the huggingface embedder to use
http.Client{Timeout: ...} (or build the client with a config-supplied duration).
In `@pkg/inference/embedding/jina/jina_test.go`:
- Around line 1-227: Add unit tests to cover missing API key and custom error
handling for the Jina embedder: create tests similar to HuggingFace ones that
exercise NewJinaEmbedder and CreateEmbeddings for (1) no API key supplied (e.g.,
TestJinaEmbedder_NoAPIKey) verifying it returns an error and nil embeddings, (2)
custom unauthorized response handling (e.g.,
TestJinaEmbedder_CustomUnauthorizedError) mocking a 401 response and asserting
the error contains expected message, and (3) custom error payload handling
(e.g., TestJinaEmbedder_CustomErrorHandling) mocking non-2xx responses with body
and asserting CreateEmbeddings surfaces the body details; place these tests
alongside existing TestJinaEmbedder_* functions and reuse httptest.NewServer,
EmbedderConfig GetAPIKey/GetBaseURL, and assertions on returned
embeddings/errors to mirror the HuggingFace tests.
In `@pkg/inference/embedding/jina/jina.go`:
- Line 55: The http.Client is created with no timeout (client: http.Client{}),
risking indefinite hangs; update the embedder initialization (where the client
is constructed in jina.go) to set a sensible Timeout on the http.Client (e.g.,
from a new EmbedderConfig field) and wire that timeout through the
constructor/path that creates the client so the timeout becomes configurable via
EmbedderConfig and used when initializing the client.
- Around line 116-119: The error log in pkg/inference/embedding/jina/jina.go
currently logs the full response body in the logutil.BgLogger().Error call (the
line calling logutil.BgLogger().Error with zap.String("body", string(body)))
which could leak echoed API keys; before calling that logger (or replace that
call), sanitize the response body by parsing/masking any sensitive fields (e.g.,
"authorization", "api_key", "token", "access_token", "credentials") or redact
Authorization-like patterns (e.g., Bearer <token>) and then log the sanitized
string; update the code around the resp handling in the same function to produce
a safeBody string and pass zap.String("body", safeBody) to the logger.
In `@pkg/inference/embedding/nvidia/protocol.go`:
- Around line 17-22: The Request struct (type Request) in protocol.go is dead
code because the Nvidia embeddings request body is constructed directly via a
map passed to base.MarshalJSONWithOptions in nvidia.go; either remove the unused
Request type or explicitly use it when marshalling (replace the map literal with
an instance of Request and marshal that), or if you keep the type for
future/documentation purposes add a clear comment above type Request explaining
why it is intentionally unused today and referencing the code path in nvidia.go
that builds the payload via base.MarshalJSONWithOptions.
In `@pkg/planner/core/expression_rewriter.go`:
- Around line 3027-3031: The error message for the vector argument in
expression_rewriter.go can be empty because vecArg.OrigName may be empty; update
the check around the vecArg handling in the block that sets er.err (where vecArg
is derived from args[0] and v.FnName.O is used) to build a robust name: if
vecArg.OrigName is empty, use vecArg.String() or a constant placeholder like
"<unknown column>" when formatting the error text so the message is always
informative and never blank.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 6b57fe1c-2b9b-4cb6-9768-bcc6e68c1971
📒 Files selected for processing (75)
pkg/config/config.gopkg/config/config.toml.examplepkg/config/config.toml.nextgen.examplepkg/ddl/add_column.gopkg/ddl/create_table.gopkg/ddl/generated_column.gopkg/domain/BUILD.bazelpkg/domain/domain.gopkg/domain/inference.gopkg/executor/insert_common.gopkg/expression/BUILD.bazelpkg/expression/builtin.gopkg/expression/builtin_inference.gopkg/expression/builtin_inference_test.gopkg/expression/builtin_threadunsafe_generated.gopkg/expression/column.gopkg/expression/expression.gopkg/expression/function_traits.gopkg/expression/function_traits_test.gopkg/expression/generator/builtin_threadsafe.gopkg/expression/inference_helper.gopkg/expression/inference_helper_test.gopkg/expression/integration_test/BUILD.bazelpkg/expression/integration_test/integration_test.gopkg/inference/BUILD.bazelpkg/inference/domainadaptor/BUILD.bazelpkg/inference/domainadaptor/adaptor.gopkg/inference/embedding/base/BUILD.bazelpkg/inference/embedding/base/base.gopkg/inference/embedding/base/base_test.gopkg/inference/embedding/batcher/BUILD.bazelpkg/inference/embedding/batcher/batcher.gopkg/inference/embedding/batcher/batcher_test.gopkg/inference/embedding/cohere/BUILD.bazelpkg/inference/embedding/cohere/cohere.gopkg/inference/embedding/cohere/cohere_test.gopkg/inference/embedding/cohere/protocol.gopkg/inference/embedding/gemini/BUILD.bazelpkg/inference/embedding/gemini/gemini.gopkg/inference/embedding/gemini/gemini_test.gopkg/inference/embedding/gemini/protocol.gopkg/inference/embedding/huggingface/BUILD.bazelpkg/inference/embedding/huggingface/huggingface.gopkg/inference/embedding/huggingface/huggingface_test.gopkg/inference/embedding/huggingface/protocol.gopkg/inference/embedding/jina/BUILD.bazelpkg/inference/embedding/jina/jina.gopkg/inference/embedding/jina/jina_test.gopkg/inference/embedding/jina/protocol.gopkg/inference/embedding/mock/BUILD.bazelpkg/inference/embedding/mock/mock.gopkg/inference/embedding/mock/mock_test.gopkg/inference/embedding/nvidia/BUILD.bazelpkg/inference/embedding/nvidia/nvidia.gopkg/inference/embedding/nvidia/nvidia_test.gopkg/inference/embedding/nvidia/protocol.gopkg/inference/embedding/openai/BUILD.bazelpkg/inference/embedding/openai/openai.gopkg/inference/embedding/openai/openai_test.gopkg/inference/embedding/openai/protocol.gopkg/inference/embedding/tidbcloud/BUILD.bazelpkg/inference/embedding/tidbcloud/protocol.gopkg/inference/embedding/tidbcloud/tidbcloud_free.gopkg/inference/embedding/tidbcloud/tidbcloud_free_test.gopkg/inference/manager_test.gopkg/inference/sqlembed.gopkg/parser/ast/functions.gopkg/planner/core/BUILD.bazelpkg/planner/core/expression_rewriter.gopkg/planner/core/logical_plan_builder.gopkg/sessionctx/vardef/tidb_vars.gopkg/sessionctx/variable/BUILD.bazelpkg/sessionctx/variable/embedding_vars.gopkg/sessionctx/variable/embedding_vars_test.gopkg/sessionctx/variable/sysvar.go
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #69014 +/- ##
================================================
+ Coverage 76.3213% 76.8714% +0.5500%
================================================
Files 2041 2051 +10
Lines 562689 567130 +4441
================================================
+ Hits 429452 435961 +6509
+ Misses 132324 129438 -2886
- Partials 913 1731 +818
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/expression/builtin_inference.go (1)
147-158: ⚡ Quick winConsider adding a comment to explain the
@searchsuffix behavior.The
@searchsuffix processing logic is subtle: whenisFromVecSearchis true, options like"key@search": valueare renamed to"key": value; when false, they're dropped entirely. This allows callers to specify search-time-only options that are ignored during direct EMBED_TEXT() calls but honored when the function is rewritten from VEC_EMBED_*_DISTANCE(). A brief comment would help future maintainers understand this intent.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/expression/builtin_inference.go` around lines 147 - 158, Add a short clarifying comment above the block that collects and processes keys with the "`@search`" suffix explaining that entries like "key@search" are treated as search-time-only options: if isFromVecSearch is true they are promoted to "key" (opts[strings.TrimSuffix(k, "`@search`")] = opts[k]) to be honored when the call was rewritten from a VEC_EMBED_*_DISTANCE path, otherwise they are removed (delete(opts, k)) so direct EMBED_TEXT() calls ignore them; reference the variables searchOpts, opts and isFromVecSearch in the comment so future readers understand the intent of this rename-or-drop behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/expression/builtin_inference.go`:
- Around line 147-158: Add a short clarifying comment above the block that
collects and processes keys with the "`@search`" suffix explaining that entries
like "key@search" are treated as search-time-only options: if isFromVecSearch is
true they are promoted to "key" (opts[strings.TrimSuffix(k, "`@search`")] =
opts[k]) to be honored when the call was rewritten from a VEC_EMBED_*_DISTANCE
path, otherwise they are removed (delete(opts, k)) so direct EMBED_TEXT() calls
ignore them; reference the variables searchOpts, opts and isFromVecSearch in the
comment so future readers understand the intent of this rename-or-drop behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 32cbf36e-e183-4934-addb-0aaa6faec878
📒 Files selected for processing (28)
pkg/executor/BUILD.bazelpkg/executor/insert_common.gopkg/expression/builtin_inference.gopkg/expression/inference_helper.gopkg/expression/inference_helper_test.gopkg/expression/integration_test/integration_test.gopkg/inference/embedding/base/BUILD.bazelpkg/inference/embedding/base/base.gopkg/inference/embedding/base/base_test.gopkg/inference/embedding/cohere/cohere.gopkg/inference/embedding/gemini/BUILD.bazelpkg/inference/embedding/gemini/gemini.gopkg/inference/embedding/gemini/gemini_test.gopkg/inference/embedding/huggingface/huggingface.gopkg/inference/embedding/huggingface/protocol.gopkg/inference/embedding/jina/BUILD.bazelpkg/inference/embedding/jina/jina.gopkg/inference/embedding/jina/jina_test.gopkg/inference/embedding/mock/mock.gopkg/inference/embedding/mock/mock_test.gopkg/inference/embedding/nvidia/BUILD.bazelpkg/inference/embedding/nvidia/nvidia.gopkg/inference/embedding/nvidia/nvidia_test.gopkg/inference/embedding/openai/BUILD.bazelpkg/inference/embedding/openai/openai.gopkg/inference/embedding/openai/openai_test.gopkg/inference/embedding/tidbcloud/tidbcloud_free.gopkg/planner/core/expression_rewriter.go
🚧 Files skipped from review as they are similar to previous changes (20)
- pkg/inference/embedding/base/BUILD.bazel
- pkg/inference/embedding/openai/BUILD.bazel
- pkg/inference/embedding/huggingface/protocol.go
- pkg/inference/embedding/mock/mock.go
- pkg/inference/embedding/jina/BUILD.bazel
- pkg/inference/embedding/mock/mock_test.go
- pkg/inference/embedding/cohere/cohere.go
- pkg/inference/embedding/nvidia/nvidia.go
- pkg/inference/embedding/openai/openai.go
- pkg/inference/embedding/gemini/BUILD.bazel
- pkg/inference/embedding/jina/jina_test.go
- pkg/inference/embedding/nvidia/BUILD.bazel
- pkg/inference/embedding/openai/openai_test.go
- pkg/inference/embedding/huggingface/huggingface.go
- pkg/planner/core/expression_rewriter.go
- pkg/expression/inference_helper_test.go
- pkg/inference/embedding/tidbcloud/tidbcloud_free.go
- pkg/inference/embedding/gemini/gemini_test.go
- pkg/expression/integration_test/integration_test.go
- pkg/executor/insert_common.go
What problem does this PR solve?
Issue Number: ref #67765
Problem Summary:
TiDB Cloud Starter needs SQL Auto Embedding support in the upstream TiDB kernel so that Starter can use an upstream-built TiDB binary while still providing the documented auto-embedding experience.
Before this PR, upstream TiDB did not have a complete SQL-side embedding path. Users could not call
EMBED_TEXT()from SQL to obtain vector embeddings, could not define stored generated vector columns backed by embedding models, and could not use text-query vector distance helper functions such asVEC_EMBED_L2_DISTANCE()to automatically embed query text during vector search.The feature also needs to be deployment-aware. Auto Embedding is intended to be available only in Starter deployment mode. Non-Starter deployments should not accidentally invoke external embedding services or hosted TiDB Cloud embedding infrastructure through
EMBED_TEXT().What changed and how does it work?
This PR adds full SQL Auto Embedding support to the upstream TiDB kernel.
At the SQL layer, it introduces the
EMBED_TEXT(model, text[, options])builtin, which returns aVECTOR<FLOAT32>embedding. The function supports provider-qualified model names and JSON options, and it is treated as a non-foldable, mutable-effect function because it may call external inference services and depends on runtime configuration.At the inference layer, this PR adds a shared embedding framework under
pkg/inference, including provider implementations for OpenAI-compatible endpoints, Jina AI, Cohere, Gemini, Hugging Face, NVIDIA, and TiDB Cloud hosted embedding. It also adds batching, request cancellation, per-session/global configuration integration, API key masking, endpoint validation, and test mock providers.For auto-embedding generated columns, this PR allows
EMBED_TEXT()to be used as the top-level expression of a stored generated vector column. The DDL validation enforces the supported shape: the generated column must be stored, the model/options arguments must be constants, and dependent generated columns on auto-embedding columns are rejected. During DML, TiDB evaluates the embedding expression and materializes the generated vector value on insert/update.For vector search, this PR adds planner rewrite support for
VEC_EMBED_*_DISTANCE()helper functions. When the vector column is backed by an auto-embedding generated column, TiDB rewrites text-query distance functions into the corresponding vector distance function with anEMBED_TEXT()call on the query text. This lets users query by text without manually embedding the query first.The embedding function is managed through the TiDB domain so all sessions can share the same embedding manager and provider configuration. A small domain adaptor is added to avoid package dependency cycles between expression evaluation and domain initialization.
Finally, this PR gates runtime use of
EMBED_TEXT()by deployment mode.EMBED_TEXT()now checksdeploymode.IsStarter()before evaluating arguments or calling any provider. If the current TiDB instance is not running in Starter deployment mode, the function returns an error instead of invoking an embedding provider.Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Summary by CodeRabbit
New Features
Configuration
Tests