OpenConceptLab/ocl_online#116 | Using Infinity for embedding and reranking with fallback to inline load of model in api/indexing services by snyaggarwal · Pull Request #878 · OpenConceptLab/oclapi2

snyaggarwal · 2026-06-12T01:45:50Z

Linked Issue

Closes OpenConceptLab/ocl_online#116

…nking with fallback to inline load of model in api/indexing services

paynejd

Reviewed end-to-end. Strong PR — using off-the-shelf Infinity instead of a hand-rolled /$embed/+/$rerank/ service is the right call, the NO_LM removal is clean (no lingering settings.LM/settings.ENCODER refs), and the VectorEmbed/Reranker unit tests are good. A few items to address before merge (infra items are on oclinfrastructure#6):

Should-fix

#1 + #3 (inline suggestion below): the local fallback rebuilds SentenceTransformer on every call (was a preloaded singleton) and returns np.float32 instead of .tolist() native floats. Both fixed in one suggestion.
#2 (infra): the memory right-sizing on #6 assumes models never load in-process, but the fallback loads them (the api fallback also loads the ~1GB+ CrossEncoder). A service outage — the exact case the fallback exists for — is when the slimmed container is most likely to OOM. Keep headroom for a fallback load, or make the fallback explicitly disable-able so a slimmed container fails fast instead of OOM-thrashing.

Verify

#5 — re-index + normalization parity: Infinity may L2-normalize all-MiniLM-L6-v2 output by default; SentenceTransformer.encode does not. Index-time and query-time are consistent going forward (both via the service), but (a) a full re-index is required at cutover since existing prod vectors were built in-process, and (b) a local fallback query against a service-built index could silently degrade kNN if normalization differs. Please confirm cosine(same text) ≈ 1.0 between Infinity and the local model before relying on the fallback.

Nits

(a) .env.example:12 and docker-compose.override.yml.bak:40 still set NO_LM=TRUE — stale, replace with EMBEDDING_SERVICE_URL / INFINITY_API_KEY. (Couldn't inline-suggest — those lines aren't in the diff.)
(b) inline below — hoist import requests to module top.
(f) leaving api without a depends_on on ocl-embeddings-api is correct — the existing "do not depend on other services" comment is intentional and the fallback handles startup ordering gracefully. No change.

Testing: the fallback tests mock _get_embedding_locally, so they never exercise the real per-call reload (#1) or the np.float32 return (#3), and nothing covers the indexing path (documents.py) going through the service. Worth one test that lets the real local path run on a tiny input and asserts the returned vector is a list of plain floats.

paynejd · 2026-06-17T14:24:39Z

Re-reviewed at 77370a6d. Good progress — NO_LM removal is clean, import requests hoisted, VectorEmbed now caches the model class-level and returns .tolist(), and the new unit tests are solid. Remaining:

B3 (must-fix) — Reranker reloads the CrossEncoder on every search. VectorEmbed got a class-level _LOCAL_MODELS cache; Reranker did not. Reranker(encoder_model) is constructed per search request (core/common/search.py:222) with self.encoder = None, and _get_rerank_scores_locally builds a fresh CrossEncoder(...) (~1GB, bge-reranker-v2-m3) per instance. So the moment the service is unreachable — the exact case the fallback is for — every reranked search reloads a 1GB model from disk. Give Reranker the same class-level {model_name: encoder} cache.

B4 (must-fix) — Is the fallback actually viable in prod? Three preconditions, none currently met: (a) the Dockerfile has no model preload, and api/celery mount a volume at /tmp, not the HF cache dir — so a fallback would try to download from huggingface.co at request time (fails if egress is locked, download-storm if not); (b) infra trimmed the containers' RAM (see OpenConceptLab/oclinfrastructure#6) so a fallback load risks OOM; (c) per B3 the reranker reloads every call. As written the fallback can turn a clean "service down" into OOM-thrash or a download storm. Either make it genuinely viable (bake models / mount an HF-cache volume + keep RAM headroom) or make it disable-able per env so a trimmed container fails fast.

Nit — asymmetric error handling. _get_rerank_scores_locally catches and returns MISSING_SCORE, but VectorEmbed._get_embedding_locally has no try/except — a local embed failure raises through documents.py prepare() and fails the whole index task. Make them consistent.

Verify:

CI / toggle: the deleted get_embeddings carried if not Toggle.get('SEMANTIC_SEARCH_TOGGLE') or settings.ENV == 'ci': return None. The toggle check now lives in core/concepts/views.py:885, but the indexing path (core/concepts/documents.py:247) guards only on has_semantic_match_algorithm with no ci short-circuit. Confirm CI never hits a semantic-enabled source (else it tries to load/download a model mid-test).
Normalization (walking back my round-one Bump django from 3.0.9 to 3.1.6 #5): the dense_vector mapping sets neither similarity nor dims, so ES 8.15 defaults to cosine, which is magnitude-invariant — Infinity's L2-normalization vs SentenceTransformer's raw output washes out, and a full re-index is probably not required at cutover. Please just confirm the mapping is cosine (not dot_product) and that Infinity's MiniLM pooling matches sentence-transformers (one cosine(same_text) spot-check settles it).
qa behavior change: ENCODER_MODEL_NAME is now set unconditionally (old if ENV not in ['qa'] guard gone) and qa points at the prod service — qa previously did no reranking and now will. Intended?
Indexing brownout: embed timeout=10 per call; a slow (not down) service makes every embed wait 10s then fall back — indexing crawls. Consider a shorter embed timeout + circuit-breaker.

OpenConceptLab/ocl_online#116 | Using Infinity for embedding and rera…

5601a89

…nking with fallback to inline load of model in api/indexing services

snyaggarwal requested a review from paynejd June 12, 2026 01:45

snyaggarwal self-assigned this Jun 12, 2026

OpenConceptLab/ocl_online#116 | suppressing tracking

ab846d9

paynejd reviewed Jun 12, 2026

View reviewed changes

Comment thread core/common/search.py Outdated

Comment thread core/common/search.py Outdated

snyaggarwal added 3 commits June 15, 2026 10:13

OpenConceptLab/ocl_online#116 | extracting common import

3c4328d

OpenConceptLab/ocl_online#116 | keeping the model

5f66a44

OpenConceptLab/ocl_online#116 | removed NO_LM env var

77370a6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OpenConceptLab/ocl_online#116 | Using Infinity for embedding and reranking with fallback to inline load of model in api/indexing services#878

OpenConceptLab/ocl_online#116 | Using Infinity for embedding and reranking with fallback to inline load of model in api/indexing services#878
snyaggarwal wants to merge 5 commits into
masterfrom
ocl_online/issues#116

snyaggarwal commented Jun 12, 2026

Uh oh!

paynejd left a comment

Uh oh!

Uh oh!

Uh oh!

paynejd commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

snyaggarwal commented Jun 12, 2026

Linked Issue

Uh oh!

paynejd left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

paynejd commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants