feat(query): add multi_match query support#110
Conversation
Add MultiMatchQuery with BestFields, MostFields, Phrase, and PhrasePrefix types. Searches a query string across multiple fields with optional field weights. Tie-breaker formula implemented for best_fields type. - Add MultiMatchType enum and MultiMatchQuery struct to cloudsearch-common - Add SearchQuery::MultiMatch variant - Implement score_multi_match_query with best_fields/most_fields/phrase types - Add HTTP API parsing for multi_match clause with fields as object - Add test coverage for best_fields and most_fields scoring behavior Closes: #84
|
Warning Review limit reached
Your plan currently allows 1 review/hour. Refill in 39 minutes and 20 seconds. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more review capacity refills, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than trial, open-source, and free plans. In all cases, review capacity refills continuously over time. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR extends the search engine to support Elasticsearch-style ChangesMulti-Match Query Implementation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (1)
rust/crates/cloudsearch-index/tests/coverage.rs (1)
669-709: ⚡ Quick winMake these tests assert the scoring contract more directly.
Both cases are permissive enough that a cross-field leakage bug still passes: the first test never proves
best_fieldsequals the best single-field score, and the second only provesmost_fields > best_fields. Please compare against equivalent single-field queries and add a negative case like{"title":"foo","content":"bar"}with query"foo bar"so each field is forced to contribute only its own tokens. Atype: phrasecase that rejects non-consecutive tokens would also be valuable here.Also applies to: 712-769
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rust/crates/cloudsearch-index/tests/coverage.rs` around lines 669 - 709, Update multi_match_best_fields_returns_max_score to assert scoring against equivalent single-field queries: after the MultiMatch search (MultiMatchQuery with MultiMatchType::BestFields and tie_breaker 0.0), run two single-field SearchRequest queries (one targeting only "title" and one only "content") and capture their hit scores, then assert that the multi_match score equals the max of those single-field scores (use the same doc id and ensure score.is_some()). Also add a negative/phrase case: index an additional document with both tokens in the same field and/or run a MultiMatch with type=Phrase to assert that non-consecutive token matches do not inflate BestFields (i.e., phrase query should reject non-consecutive tokens and the score should differ), and ensure to reference functions/types from the test (multi_match_best_fields_returns_max_score, SearchRequest, SearchQuery::MultiMatch, MultiMatchQuery, MultiMatchType::BestFields) when locating where to add these assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@rust/crates/cloudsearch-api/src/lib.rs`:
- Around line 1008-1057: The parse_multi_match_query function is failing rustfmt
checks; run rustfmt (or `cargo fmt --all`) and reformat this file so CI
passes—ensure the block around parse_multi_match_query (including variable
declarations like parsed_fields, the for loop over fields, the multi_match_type
match, and the tie_breaker mapping) is reformatted to follow rustfmt rules and
then commit the formatted file.
- Around line 1029-1032: The code currently silently defaults non-numeric field
weights to 1.0 (using weight_val.as_f64().map_or(1.0,...)), which should instead
fail fast; locate the loop iterating over fields (symbols: fields, weight_val,
parsed_fields) and change the logic to validate weight_val is numeric and
convertible to f32, returning an error (or propagating one) when
weight_val.as_f64() is None or conversion is invalid, rather than inserting a
default 1.0 into parsed_fields.
- Around line 1035-1043: The parsing currently silently maps unknown
multi_match.type strings to MultiMatchType::BestFields; change the
object.get("type") parsing (the multi_match_type assignment) to return an error
instead of defaulting — i.e., make the match arm produce a
Result<MultiMatchType, Error> (or Option with validation) and if the string is
not one of "most_fields", "phrase", "phrase_prefix", or "best_fields" return a
descriptive parsing error (propagate via ? or convert to the function's Result),
update the surrounding function signature to return Result if necessary, and
ensure callers of the function handle the error; reference the multi_match_type
variable and the MultiMatchType enum when implementing this validation so
unknown types are rejected rather than defaulted.
In `@rust/crates/cloudsearch-common/src/lib.rs`:
- Line 308: The MultiMatchQuery struct currently uses the field multi_match_type
which serializes to "multi_match_type" but the API expects the key "type";
update the struct to rename that field for Serde JSON to "type" (e.g. add
#[serde(rename = "type")] to the multi_match_type field) so MultiMatchQuery
payloads match the single-search JSON shape; ensure the field name remains
multi_match_type and that the MultiMatchType type is unchanged so both serialize
and deserialize correctly for MultiSearchRequest and single-search payloads.
In `@rust/crates/cloudsearch-index/src/lib.rs`:
- Around line 2072-2115: score_match_query_for_field currently counts term
frequency from postings across all fields (via positions_readers), inflating TF
for fields that don't actually contain the term; fix by scoping TF to the
current field: when computing tf for a token, first try to get the term
frequency from the positions reader that corresponds to the given field (or, if
positions_readers are not field-scoped, skip using them and derive tf solely
from field_tokens by counting occurrences in field_tokens), then fall back to
the existing field_tokens.count() only if the field-scoped postings are absent;
update the tf accumulation logic in score_match_query_for_field (refer to
variables positions_readers, field_tokens, tf and the call to
bm25_ctx.bm25_term_score) so TF reflects occurrences in this specific field
before computing term_score.
- Around line 2049-2066: The MultiMatchType::Phrase branch currently falls back
to best_fields when score_phrase_for_fields returns None; change it so that if
score_phrase_for_fields(document, &query_tokens, &query.fields, doc_id,
positions_readers, bm25_ctx) returns None the branch returns None immediately
(do not compute or return field_scores, max, or tie_breaker-based sums). In
short, in the MultiMatchType::Phrase arm, only return Some(ps) when
score_phrase_for_fields yields Some, otherwise return None to ensure phrase-only
matching.
- Around line 2134-2136: The loop over (field, weight) uses
document.source.get(field)?.as_str()? which returns None and aborts phrase
scoring when any field is missing or non-string; change this to skip such fields
by checking get(field) and as_str() with if let (or match) so only fields that
yield a &str are tokenized and scored (i.e., replace the `?` chaining with a
safe conditional that continues the loop on missing/non-string values), keeping
the rest of the logic that calls tokenize(field_str) and accumulates scores for
valid fields.
---
Nitpick comments:
In `@rust/crates/cloudsearch-index/tests/coverage.rs`:
- Around line 669-709: Update multi_match_best_fields_returns_max_score to
assert scoring against equivalent single-field queries: after the MultiMatch
search (MultiMatchQuery with MultiMatchType::BestFields and tie_breaker 0.0),
run two single-field SearchRequest queries (one targeting only "title" and one
only "content") and capture their hit scores, then assert that the multi_match
score equals the max of those single-field scores (use the same doc id and
ensure score.is_some()). Also add a negative/phrase case: index an additional
document with both tokens in the same field and/or run a MultiMatch with
type=Phrase to assert that non-consecutive token matches do not inflate
BestFields (i.e., phrase query should reject non-consecutive tokens and the
score should differ), and ensure to reference functions/types from the test
(multi_match_best_fields_returns_max_score, SearchRequest,
SearchQuery::MultiMatch, MultiMatchQuery, MultiMatchType::BestFields) when
locating where to add these assertions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0b922f19-9361-4eaa-8f4b-e163f189ddea
📒 Files selected for processing (4)
rust/crates/cloudsearch-api/src/lib.rsrust/crates/cloudsearch-common/src/lib.rsrust/crates/cloudsearch-index/src/lib.rsrust/crates/cloudsearch-index/tests/coverage.rs
| let multi_match_type = object | ||
| .get("type") | ||
| .and_then(Value::as_str) | ||
| .map_or(MultiMatchType::BestFields, |s| match s { | ||
| "most_fields" => MultiMatchType::MostFields, | ||
| "phrase" => MultiMatchType::Phrase, | ||
| "phrase_prefix" => MultiMatchType::PhrasePrefix, | ||
| _ => MultiMatchType::BestFields, | ||
| }); |
There was a problem hiding this comment.
Reject unknown multi_match.type instead of silently falling back
Line 1038 currently maps any unknown type to BestFields, which hides user mistakes and changes query semantics without an error.
Proposed fix
- let multi_match_type = object
- .get("type")
- .and_then(Value::as_str)
- .map_or(MultiMatchType::BestFields, |s| match s {
- "most_fields" => MultiMatchType::MostFields,
- "phrase" => MultiMatchType::Phrase,
- "phrase_prefix" => MultiMatchType::PhrasePrefix,
- _ => MultiMatchType::BestFields,
- });
+ let multi_match_type = match object.get("type").and_then(Value::as_str) {
+ None => MultiMatchType::BestFields,
+ Some("best_fields") => MultiMatchType::BestFields,
+ Some("most_fields") => MultiMatchType::MostFields,
+ Some("phrase") => MultiMatchType::Phrase,
+ Some("phrase_prefix") => MultiMatchType::PhrasePrefix,
+ Some(other) => {
+ return Err(ApiError(CloudSearchError::InvalidSearchRequest(format!(
+ "unsupported multi_match type '{other}'"
+ ))));
+ }
+ };📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| let multi_match_type = object | |
| .get("type") | |
| .and_then(Value::as_str) | |
| .map_or(MultiMatchType::BestFields, |s| match s { | |
| "most_fields" => MultiMatchType::MostFields, | |
| "phrase" => MultiMatchType::Phrase, | |
| "phrase_prefix" => MultiMatchType::PhrasePrefix, | |
| _ => MultiMatchType::BestFields, | |
| }); | |
| let multi_match_type = match object.get("type").and_then(Value::as_str) { | |
| None => MultiMatchType::BestFields, | |
| Some("best_fields") => MultiMatchType::BestFields, | |
| Some("most_fields") => MultiMatchType::MostFields, | |
| Some("phrase") => MultiMatchType::Phrase, | |
| Some("phrase_prefix") => MultiMatchType::PhrasePrefix, | |
| Some(other) => { | |
| return Err(ApiError(CloudSearchError::InvalidSearchRequest(format!( | |
| "unsupported multi_match type '{other}'" | |
| )))); | |
| } | |
| }; |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@rust/crates/cloudsearch-api/src/lib.rs` around lines 1035 - 1043, The parsing
currently silently maps unknown multi_match.type strings to
MultiMatchType::BestFields; change the object.get("type") parsing (the
multi_match_type assignment) to return an error instead of defaulting — i.e.,
make the match arm produce a Result<MultiMatchType, Error> (or Option with
validation) and if the string is not one of "most_fields", "phrase",
"phrase_prefix", or "best_fields" return a descriptive parsing error (propagate
via ? or convert to the function's Result), update the surrounding function
signature to return Result if necessary, and ensure callers of the function
handle the error; reference the multi_match_type variable and the MultiMatchType
enum when implementing this validation so unknown types are rejected rather than
defaulted.
| #[serde(default)] | ||
| pub fields: BTreeMap<String, f32>, | ||
| #[serde(default)] | ||
| pub multi_match_type: MultiMatchType, |
There was a problem hiding this comment.
MultiMatchQuery uses a different wire key than the HTTP parser expects
Line 308 serializes/deserializes as multi_match_type, but the API parser accepts "type" for multi_match. This makes typed MultiSearchRequest payloads inconsistent with single-search JSON shape.
Proposed fix
pub struct MultiMatchQuery {
pub query: String,
#[serde(default)]
pub fields: BTreeMap<String, f32>,
- #[serde(default)]
+ #[serde(default, rename = "type", alias = "multi_match_type")]
pub multi_match_type: MultiMatchType,
#[serde(default)]
pub tie_breaker: f32,
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| pub multi_match_type: MultiMatchType, | |
| pub struct MultiMatchQuery { | |
| pub query: String, | |
| #[serde(default)] | |
| pub fields: BTreeMap<String, f32>, | |
| #[serde(default, rename = "type", alias = "multi_match_type")] | |
| pub multi_match_type: MultiMatchType, | |
| #[serde(default)] | |
| pub tie_breaker: f32, | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@rust/crates/cloudsearch-common/src/lib.rs` at line 308, The MultiMatchQuery
struct currently uses the field multi_match_type which serializes to
"multi_match_type" but the API expects the key "type"; update the struct to
rename that field for Serde JSON to "type" (e.g. add #[serde(rename = "type")]
to the multi_match_type field) so MultiMatchQuery payloads match the
single-search JSON shape; ensure the field name remains multi_match_type and
that the MultiMatchType type is unchanged so both serialize and deserialize
correctly for MultiSearchRequest and single-search payloads.
948d402 to
cb2552b
Compare
…king, add strict phrase semantics - Add score_phrase_prefix_for_fields for phrase_prefix type with prefix matching on last token while requiring consecutive tokens for the rest - Fix score_phrase_for_fields: positions are byte offsets, not token indices, so consecutive check must use prev_pos + len(prev_token) algorithm - Add multi_match_phrase_requires_consecutive_tokens test - Add multi_match_phrase_prefix_with_prefix_last_token test - Add multi_match_tie_breaker_formula test - Add empty fields validation in validate_query - Strict phrase semantics: no fallback to best_fields when phrase doesn't match - Add #[allow(clippy::too_many_lines)] to score_phrase_prefix_for_fields
cb2552b to
bbbdf1e
Compare
Extract tokens_are_consecutive and score_phrase_in_field helpers to reduce nesting and eliminate spaghetti control flow in phrase matching. Also fix clippy issues: explicit_iter_loop, question_mark, and add #[allow(clippy::cast_possible_truncation)] where needed.
Summary
MultiMatchTypeenum (BestFields | MostFields | Phrase | PhrasePrefix) andMultiMatchQuerystruct to cloudsearch-commonSearchQuery::MultiMatch(MultiMatchQuery)variantscore_multi_match_querywithbest_fields(max + tie_breaker × sum),most_fields(sum), andphrase(consecutive token match) scoringmulti_matchclause withfieldsas BTreeMap<String, f32>multi_match_best_fields_returns_max_scoreandmulti_match_most_fields_sums_scoresTest plan
cargo test --workspace— all pass (402 tests)cargo clippy --workspace -- -D warnings -D clippy::pedantic— cleanJSON API usage
{ "multi_match": { "query": "brown fox", "fields": {"title": 2.0, "content": 1.0}, "type": "best_fields", "tie_breaker": 0.3 } }Summary by CodeRabbit
Release Notes
New Features
Tests