Skip to content

OpenConceptLab/ocl_online#116 | Using Infinity for embedding and reranking with fallback to inline load of model in api/indexing services#878

Open
snyaggarwal wants to merge 5 commits into
masterfrom
ocl_online/issues#116
Open

OpenConceptLab/ocl_online#116 | Using Infinity for embedding and reranking with fallback to inline load of model in api/indexing services#878
snyaggarwal wants to merge 5 commits into
masterfrom
ocl_online/issues#116

Conversation

@snyaggarwal

Copy link
Copy Markdown
Contributor

Linked Issue

Closes OpenConceptLab/ocl_online#116

…nking with fallback to inline load of model in api/indexing services
@snyaggarwal snyaggarwal requested a review from paynejd June 12, 2026 01:45
@snyaggarwal snyaggarwal self-assigned this Jun 12, 2026

@paynejd paynejd left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed end-to-end. Strong PR — using off-the-shelf Infinity instead of a hand-rolled /$embed/+/$rerank/ service is the right call, the NO_LM removal is clean (no lingering settings.LM/settings.ENCODER refs), and the VectorEmbed/Reranker unit tests are good. A few items to address before merge (infra items are on oclinfrastructure#6):

Should-fix

  • #1 + #3 (inline suggestion below): the local fallback rebuilds SentenceTransformer on every call (was a preloaded singleton) and returns np.float32 instead of .tolist() native floats. Both fixed in one suggestion.
  • #2 (infra): the memory right-sizing on #6 assumes models never load in-process, but the fallback loads them (the api fallback also loads the ~1GB+ CrossEncoder). A service outage — the exact case the fallback exists for — is when the slimmed container is most likely to OOM. Keep headroom for a fallback load, or make the fallback explicitly disable-able so a slimmed container fails fast instead of OOM-thrashing.

Verify

  • #5 — re-index + normalization parity: Infinity may L2-normalize all-MiniLM-L6-v2 output by default; SentenceTransformer.encode does not. Index-time and query-time are consistent going forward (both via the service), but (a) a full re-index is required at cutover since existing prod vectors were built in-process, and (b) a local fallback query against a service-built index could silently degrade kNN if normalization differs. Please confirm cosine(same text) ≈ 1.0 between Infinity and the local model before relying on the fallback.

Nits

  • (a) .env.example:12 and docker-compose.override.yml.bak:40 still set NO_LM=TRUE — stale, replace with EMBEDDING_SERVICE_URL / INFINITY_API_KEY. (Couldn't inline-suggest — those lines aren't in the diff.)
  • (b) inline below — hoist import requests to module top.
  • (f) leaving api without a depends_on on ocl-embeddings-api is correct — the existing "do not depend on other services" comment is intentional and the fallback handles startup ordering gracefully. No change.

Testing: the fallback tests mock _get_embedding_locally, so they never exercise the real per-call reload (#1) or the np.float32 return (#3), and nothing covers the indexing path (documents.py) going through the service. Worth one test that lets the real local path run on a tiny input and asserts the returned vector is a list of plain floats.

Comment thread core/common/search.py Outdated
Comment thread core/common/search.py Outdated
@paynejd

paynejd commented Jun 17, 2026

Copy link
Copy Markdown
Member

Re-reviewed at 77370a6d. Good progress — NO_LM removal is clean, import requests hoisted, VectorEmbed now caches the model class-level and returns .tolist(), and the new unit tests are solid. Remaining:

B3 (must-fix) — Reranker reloads the CrossEncoder on every search. VectorEmbed got a class-level _LOCAL_MODELS cache; Reranker did not. Reranker(encoder_model) is constructed per search request (core/common/search.py:222) with self.encoder = None, and _get_rerank_scores_locally builds a fresh CrossEncoder(...) (~1GB, bge-reranker-v2-m3) per instance. So the moment the service is unreachable — the exact case the fallback is for — every reranked search reloads a 1GB model from disk. Give Reranker the same class-level {model_name: encoder} cache.

B4 (must-fix) — Is the fallback actually viable in prod? Three preconditions, none currently met: (a) the Dockerfile has no model preload, and api/celery mount a volume at /tmp, not the HF cache dir — so a fallback would try to download from huggingface.co at request time (fails if egress is locked, download-storm if not); (b) infra trimmed the containers' RAM (see OpenConceptLab/oclinfrastructure#6) so a fallback load risks OOM; (c) per B3 the reranker reloads every call. As written the fallback can turn a clean "service down" into OOM-thrash or a download storm. Either make it genuinely viable (bake models / mount an HF-cache volume + keep RAM headroom) or make it disable-able per env so a trimmed container fails fast.

Nit — asymmetric error handling. _get_rerank_scores_locally catches and returns MISSING_SCORE, but VectorEmbed._get_embedding_locally has no try/except — a local embed failure raises through documents.py prepare() and fails the whole index task. Make them consistent.

Verify:

  • CI / toggle: the deleted get_embeddings carried if not Toggle.get('SEMANTIC_SEARCH_TOGGLE') or settings.ENV == 'ci': return None. The toggle check now lives in core/concepts/views.py:885, but the indexing path (core/concepts/documents.py:247) guards only on has_semantic_match_algorithm with no ci short-circuit. Confirm CI never hits a semantic-enabled source (else it tries to load/download a model mid-test).
  • Normalization (walking back my round-one Bump django from 3.0.9 to 3.1.6 #5): the dense_vector mapping sets neither similarity nor dims, so ES 8.15 defaults to cosine, which is magnitude-invariant — Infinity's L2-normalization vs SentenceTransformer's raw output washes out, and a full re-index is probably not required at cutover. Please just confirm the mapping is cosine (not dot_product) and that Infinity's MiniLM pooling matches sentence-transformers (one cosine(same_text) spot-check settles it).
  • qa behavior change: ENCODER_MODEL_NAME is now set unconditionally (old if ENV not in ['qa'] guard gone) and qa points at the prod service — qa previously did no reranking and now will. Intended?
  • Indexing brownout: embed timeout=10 per call; a slow (not down) service makes every embed wait 10s then fall back — indexing crawls. Consider a shorter embed timeout + circuit-breaker.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants