Skip to content

feat(query): add multi_match query support#110

Merged
poyrazK merged 5 commits into
mainfrom
feature/multi-match-query
May 24, 2026
Merged

feat(query): add multi_match query support#110
poyrazK merged 5 commits into
mainfrom
feature/multi-match-query

Conversation

@poyrazK
Copy link
Copy Markdown
Owner

@poyrazK poyrazK commented May 22, 2026

Summary

  • Add MultiMatchType enum (BestFields | MostFields | Phrase | PhrasePrefix) and MultiMatchQuery struct to cloudsearch-common
  • Add SearchQuery::MultiMatch(MultiMatchQuery) variant
  • Implement score_multi_match_query with best_fields (max + tie_breaker × sum), most_fields (sum), and phrase (consecutive token match) scoring
  • Add HTTP API parsing for multi_match clause with fields as BTreeMap<String, f32>
  • Add 2 test cases: multi_match_best_fields_returns_max_score and multi_match_most_fields_sums_scores

Test plan

  • cargo test --workspace — all pass (402 tests)
  • cargo clippy --workspace -- -D warnings -D clippy::pedantic — clean
  • New coverage tests pass

JSON API usage

{
  "multi_match": {
    "query": "brown fox",
    "fields": {"title": 2.0, "content": 1.0},
    "type": "best_fields",
    "tie_breaker": 0.3
  }
}

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for multi-match queries, enabling simultaneous searches across multiple fields with configurable per-field weights.
    • Multi-match queries support different matching strategies (BestFields, MostFields, phrase, phrase_prefix) and optional tie-breaker configuration for results scoring.
  • Tests

    • Added comprehensive test coverage for multi-match query scoring behavior across different field combinations and match types.

Review Change Stack

Add MultiMatchQuery with BestFields, MostFields, Phrase, and PhrasePrefix
types. Searches a query string across multiple fields with optional field
weights. Tie-breaker formula implemented for best_fields type.

- Add MultiMatchType enum and MultiMatchQuery struct to cloudsearch-common
- Add SearchQuery::MultiMatch variant
- Implement score_multi_match_query with best_fields/most_fields/phrase types
- Add HTTP API parsing for multi_match clause with fields as object
- Add test coverage for best_fields and most_fields scoring behavior

Closes: #84
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

Warning

Review limit reached

@poyrazK, we couldn't start this review because you've used your available PR reviews for now.

Your plan currently allows 1 review/hour. Refill in 39 minutes and 20 seconds.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more review capacity refills, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than trial, open-source, and free plans. In all cases, review capacity refills continuously over time.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e23d5fa8-26cc-4dac-baee-f056797bfe31

📥 Commits

Reviewing files that changed from the base of the PR and between 2dd2e48 and c7fda27.

📒 Files selected for processing (3)
  • rust/crates/cloudsearch-api/src/lib.rs
  • rust/crates/cloudsearch-index/src/lib.rs
  • rust/crates/cloudsearch-index/tests/coverage.rs
📝 Walkthrough

Walkthrough

This PR extends the search engine to support Elasticsearch-style multi_match queries. It adds type definitions for multi-match configuration, JSON parsing logic in the API layer, validation and term extraction in the index layer, and scoring algorithms that compute weighted BM25 scores across multiple fields with support for BestFields and MostFields modes.

Changes

Multi-Match Query Implementation

Layer / File(s) Summary
Schema types and trait updates
rust/crates/cloudsearch-common/src/lib.rs
Introduces MultiMatchType enum (BestFields/MostFields/Phrase/PhrasePrefix) and MultiMatchQuery struct with per-field weights and configuration. Updates derives to remove Eq from SearchRequest, SearchQuery, BoolQuery, and MultiSearchRequest.
API parsing for multi-match clauses
rust/crates/cloudsearch-api/src/lib.rs
Extends search query parser to recognize multi_match JSON clauses, validate required query and fields properties, convert per-field weights to BTreeMap, apply optional type and tie_breaker defaults, and construct SearchQuery::MultiMatch.
Request validation and query-term extraction
rust/crates/cloudsearch-index/src/lib.rs
Extends validate_query to check that all multi-match target fields are text fields; extends extract_query_terms and get_query_terms to tokenize the multi-match query string for scoring and highlighting.
Multi-match BM25 scoring
rust/crates/cloudsearch-index/src/lib.rs
Adds score_multi_match_query dispatcher and helpers to compute per-field BM25 match scores with optional phrase verification, apply multi-match type strategies (BestFields selects maximum, MostFields sums weighted scores), and route phrase modes through phrase matching.
Multi-match coverage tests
rust/crates/cloudsearch-index/tests/coverage.rs
Adds two test cases verifying BestFields mode returns the maximum single-field score and MostFields mode produces higher scores than BestFields when matching across multiple fields.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • poyrazK/cloudSearch#95: Adds the foundational BM25-based scoring pipeline that this PR extends with multi-field score aggregation and phrase matching support.

Poem

🐰 A bunny bounces through the fields,
Each weighted query now yields,
Best or most fields, phrase or match,
Elasticsearch queries hatch!
Multi-search dreams come true at last, 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely describes the main change: adding multi_match query support, which aligns with the primary objective and all major file changes in the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/multi-match-query

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (1)
rust/crates/cloudsearch-index/tests/coverage.rs (1)

669-709: ⚡ Quick win

Make these tests assert the scoring contract more directly.

Both cases are permissive enough that a cross-field leakage bug still passes: the first test never proves best_fields equals the best single-field score, and the second only proves most_fields > best_fields. Please compare against equivalent single-field queries and add a negative case like {"title":"foo","content":"bar"} with query "foo bar" so each field is forced to contribute only its own tokens. A type: phrase case that rejects non-consecutive tokens would also be valuable here.

Also applies to: 712-769

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/cloudsearch-index/tests/coverage.rs` around lines 669 - 709,
Update multi_match_best_fields_returns_max_score to assert scoring against
equivalent single-field queries: after the MultiMatch search (MultiMatchQuery
with MultiMatchType::BestFields and tie_breaker 0.0), run two single-field
SearchRequest queries (one targeting only "title" and one only "content") and
capture their hit scores, then assert that the multi_match score equals the max
of those single-field scores (use the same doc id and ensure score.is_some()).
Also add a negative/phrase case: index an additional document with both tokens
in the same field and/or run a MultiMatch with type=Phrase to assert that
non-consecutive token matches do not inflate BestFields (i.e., phrase query
should reject non-consecutive tokens and the score should differ), and ensure to
reference functions/types from the test
(multi_match_best_fields_returns_max_score, SearchRequest,
SearchQuery::MultiMatch, MultiMatchQuery, MultiMatchType::BestFields) when
locating where to add these assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rust/crates/cloudsearch-api/src/lib.rs`:
- Around line 1008-1057: The parse_multi_match_query function is failing rustfmt
checks; run rustfmt (or `cargo fmt --all`) and reformat this file so CI
passes—ensure the block around parse_multi_match_query (including variable
declarations like parsed_fields, the for loop over fields, the multi_match_type
match, and the tie_breaker mapping) is reformatted to follow rustfmt rules and
then commit the formatted file.
- Around line 1029-1032: The code currently silently defaults non-numeric field
weights to 1.0 (using weight_val.as_f64().map_or(1.0,...)), which should instead
fail fast; locate the loop iterating over fields (symbols: fields, weight_val,
parsed_fields) and change the logic to validate weight_val is numeric and
convertible to f32, returning an error (or propagating one) when
weight_val.as_f64() is None or conversion is invalid, rather than inserting a
default 1.0 into parsed_fields.
- Around line 1035-1043: The parsing currently silently maps unknown
multi_match.type strings to MultiMatchType::BestFields; change the
object.get("type") parsing (the multi_match_type assignment) to return an error
instead of defaulting — i.e., make the match arm produce a
Result<MultiMatchType, Error> (or Option with validation) and if the string is
not one of "most_fields", "phrase", "phrase_prefix", or "best_fields" return a
descriptive parsing error (propagate via ? or convert to the function's Result),
update the surrounding function signature to return Result if necessary, and
ensure callers of the function handle the error; reference the multi_match_type
variable and the MultiMatchType enum when implementing this validation so
unknown types are rejected rather than defaulted.

In `@rust/crates/cloudsearch-common/src/lib.rs`:
- Line 308: The MultiMatchQuery struct currently uses the field multi_match_type
which serializes to "multi_match_type" but the API expects the key "type";
update the struct to rename that field for Serde JSON to "type" (e.g. add
#[serde(rename = "type")] to the multi_match_type field) so MultiMatchQuery
payloads match the single-search JSON shape; ensure the field name remains
multi_match_type and that the MultiMatchType type is unchanged so both serialize
and deserialize correctly for MultiSearchRequest and single-search payloads.

In `@rust/crates/cloudsearch-index/src/lib.rs`:
- Around line 2072-2115: score_match_query_for_field currently counts term
frequency from postings across all fields (via positions_readers), inflating TF
for fields that don't actually contain the term; fix by scoping TF to the
current field: when computing tf for a token, first try to get the term
frequency from the positions reader that corresponds to the given field (or, if
positions_readers are not field-scoped, skip using them and derive tf solely
from field_tokens by counting occurrences in field_tokens), then fall back to
the existing field_tokens.count() only if the field-scoped postings are absent;
update the tf accumulation logic in score_match_query_for_field (refer to
variables positions_readers, field_tokens, tf and the call to
bm25_ctx.bm25_term_score) so TF reflects occurrences in this specific field
before computing term_score.
- Around line 2049-2066: The MultiMatchType::Phrase branch currently falls back
to best_fields when score_phrase_for_fields returns None; change it so that if
score_phrase_for_fields(document, &query_tokens, &query.fields, doc_id,
positions_readers, bm25_ctx) returns None the branch returns None immediately
(do not compute or return field_scores, max, or tie_breaker-based sums). In
short, in the MultiMatchType::Phrase arm, only return Some(ps) when
score_phrase_for_fields yields Some, otherwise return None to ensure phrase-only
matching.
- Around line 2134-2136: The loop over (field, weight) uses
document.source.get(field)?.as_str()? which returns None and aborts phrase
scoring when any field is missing or non-string; change this to skip such fields
by checking get(field) and as_str() with if let (or match) so only fields that
yield a &str are tokenized and scored (i.e., replace the `?` chaining with a
safe conditional that continues the loop on missing/non-string values), keeping
the rest of the logic that calls tokenize(field_str) and accumulates scores for
valid fields.

---

Nitpick comments:
In `@rust/crates/cloudsearch-index/tests/coverage.rs`:
- Around line 669-709: Update multi_match_best_fields_returns_max_score to
assert scoring against equivalent single-field queries: after the MultiMatch
search (MultiMatchQuery with MultiMatchType::BestFields and tie_breaker 0.0),
run two single-field SearchRequest queries (one targeting only "title" and one
only "content") and capture their hit scores, then assert that the multi_match
score equals the max of those single-field scores (use the same doc id and
ensure score.is_some()). Also add a negative/phrase case: index an additional
document with both tokens in the same field and/or run a MultiMatch with
type=Phrase to assert that non-consecutive token matches do not inflate
BestFields (i.e., phrase query should reject non-consecutive tokens and the
score should differ), and ensure to reference functions/types from the test
(multi_match_best_fields_returns_max_score, SearchRequest,
SearchQuery::MultiMatch, MultiMatchQuery, MultiMatchType::BestFields) when
locating where to add these assertions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0b922f19-9361-4eaa-8f4b-e163f189ddea

📥 Commits

Reviewing files that changed from the base of the PR and between 1c31a0f and 2dd2e48.

📒 Files selected for processing (4)
  • rust/crates/cloudsearch-api/src/lib.rs
  • rust/crates/cloudsearch-common/src/lib.rs
  • rust/crates/cloudsearch-index/src/lib.rs
  • rust/crates/cloudsearch-index/tests/coverage.rs

Comment thread rust/crates/cloudsearch-api/src/lib.rs
Comment thread rust/crates/cloudsearch-api/src/lib.rs
Comment thread rust/crates/cloudsearch-api/src/lib.rs Outdated
Comment on lines +1035 to +1043
let multi_match_type = object
.get("type")
.and_then(Value::as_str)
.map_or(MultiMatchType::BestFields, |s| match s {
"most_fields" => MultiMatchType::MostFields,
"phrase" => MultiMatchType::Phrase,
"phrase_prefix" => MultiMatchType::PhrasePrefix,
_ => MultiMatchType::BestFields,
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject unknown multi_match.type instead of silently falling back

Line 1038 currently maps any unknown type to BestFields, which hides user mistakes and changes query semantics without an error.

Proposed fix
-    let multi_match_type = object
-        .get("type")
-        .and_then(Value::as_str)
-        .map_or(MultiMatchType::BestFields, |s| match s {
-            "most_fields" => MultiMatchType::MostFields,
-            "phrase" => MultiMatchType::Phrase,
-            "phrase_prefix" => MultiMatchType::PhrasePrefix,
-            _ => MultiMatchType::BestFields,
-        });
+    let multi_match_type = match object.get("type").and_then(Value::as_str) {
+        None => MultiMatchType::BestFields,
+        Some("best_fields") => MultiMatchType::BestFields,
+        Some("most_fields") => MultiMatchType::MostFields,
+        Some("phrase") => MultiMatchType::Phrase,
+        Some("phrase_prefix") => MultiMatchType::PhrasePrefix,
+        Some(other) => {
+            return Err(ApiError(CloudSearchError::InvalidSearchRequest(format!(
+                "unsupported multi_match type '{other}'"
+            ))));
+        }
+    };
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let multi_match_type = object
.get("type")
.and_then(Value::as_str)
.map_or(MultiMatchType::BestFields, |s| match s {
"most_fields" => MultiMatchType::MostFields,
"phrase" => MultiMatchType::Phrase,
"phrase_prefix" => MultiMatchType::PhrasePrefix,
_ => MultiMatchType::BestFields,
});
let multi_match_type = match object.get("type").and_then(Value::as_str) {
None => MultiMatchType::BestFields,
Some("best_fields") => MultiMatchType::BestFields,
Some("most_fields") => MultiMatchType::MostFields,
Some("phrase") => MultiMatchType::Phrase,
Some("phrase_prefix") => MultiMatchType::PhrasePrefix,
Some(other) => {
return Err(ApiError(CloudSearchError::InvalidSearchRequest(format!(
"unsupported multi_match type '{other}'"
))));
}
};
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/cloudsearch-api/src/lib.rs` around lines 1035 - 1043, The parsing
currently silently maps unknown multi_match.type strings to
MultiMatchType::BestFields; change the object.get("type") parsing (the
multi_match_type assignment) to return an error instead of defaulting — i.e.,
make the match arm produce a Result<MultiMatchType, Error> (or Option with
validation) and if the string is not one of "most_fields", "phrase",
"phrase_prefix", or "best_fields" return a descriptive parsing error (propagate
via ? or convert to the function's Result), update the surrounding function
signature to return Result if necessary, and ensure callers of the function
handle the error; reference the multi_match_type variable and the MultiMatchType
enum when implementing this validation so unknown types are rejected rather than
defaulted.

#[serde(default)]
pub fields: BTreeMap<String, f32>,
#[serde(default)]
pub multi_match_type: MultiMatchType,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

MultiMatchQuery uses a different wire key than the HTTP parser expects

Line 308 serializes/deserializes as multi_match_type, but the API parser accepts "type" for multi_match. This makes typed MultiSearchRequest payloads inconsistent with single-search JSON shape.

Proposed fix
 pub struct MultiMatchQuery {
     pub query: String,
     #[serde(default)]
     pub fields: BTreeMap<String, f32>,
-    #[serde(default)]
+    #[serde(default, rename = "type", alias = "multi_match_type")]
     pub multi_match_type: MultiMatchType,
     #[serde(default)]
     pub tie_breaker: f32,
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
pub multi_match_type: MultiMatchType,
pub struct MultiMatchQuery {
pub query: String,
#[serde(default)]
pub fields: BTreeMap<String, f32>,
#[serde(default, rename = "type", alias = "multi_match_type")]
pub multi_match_type: MultiMatchType,
#[serde(default)]
pub tie_breaker: f32,
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/cloudsearch-common/src/lib.rs` at line 308, The MultiMatchQuery
struct currently uses the field multi_match_type which serializes to
"multi_match_type" but the API expects the key "type"; update the struct to
rename that field for Serde JSON to "type" (e.g. add #[serde(rename = "type")]
to the multi_match_type field) so MultiMatchQuery payloads match the
single-search JSON shape; ensure the field name remains multi_match_type and
that the MultiMatchType type is unchanged so both serialize and deserialize
correctly for MultiSearchRequest and single-search payloads.

Comment thread rust/crates/cloudsearch-index/src/lib.rs Outdated
Comment thread rust/crates/cloudsearch-index/src/lib.rs
Comment thread rust/crates/cloudsearch-index/src/lib.rs
@poyrazK poyrazK force-pushed the feature/multi-match-query branch from 948d402 to cb2552b Compare May 24, 2026 12:22
…king, add strict phrase semantics

- Add score_phrase_prefix_for_fields for phrase_prefix type with prefix
  matching on last token while requiring consecutive tokens for the rest
- Fix score_phrase_for_fields: positions are byte offsets, not token indices,
  so consecutive check must use prev_pos + len(prev_token) algorithm
- Add multi_match_phrase_requires_consecutive_tokens test
- Add multi_match_phrase_prefix_with_prefix_last_token test
- Add multi_match_tie_breaker_formula test
- Add empty fields validation in validate_query
- Strict phrase semantics: no fallback to best_fields when phrase doesn't match
- Add #[allow(clippy::too_many_lines)] to score_phrase_prefix_for_fields
@poyrazK poyrazK force-pushed the feature/multi-match-query branch from cb2552b to bbbdf1e Compare May 24, 2026 12:29
poyrazK added 3 commits May 24, 2026 15:37
Extract tokens_are_consecutive and score_phrase_in_field helpers to
reduce nesting and eliminate spaghetti control flow in phrase matching.

Also fix clippy issues: explicit_iter_loop, question_mark, and add
#[allow(clippy::cast_possible_truncation)] where needed.
Copy link
Copy Markdown
Owner Author

@poyrazK poyrazK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's okay to merge

@poyrazK poyrazK merged commit 9f7775b into main May 24, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant