
[Feat] Support concurrent multi-session inference #31

Description

@dlwlzzero

Background / Problem

The handle-based API is structurally multi-session: each loadModelHandle() allocates an independent CausalLmModel with its own mutex and state (api/quick_dot_ai_api.cpp:87-97, allocated at line 2389), and a header comment claims different handles can run in parallel from different threads (api/quick_dot_ai_api.cpp:158-161).

However, significant process-global / singleton state sits underneath the handles and likely prevents correct concurrent multi-session use:

  • g_chat_template, g_use_chat_template, g_verbose, g_last_output, g_registry_mutex (quick_dot_ai_api.cpp:101-107). Notably, g_chat_template is a single shared template, which is already incorrect when two handles load different models.
  • g_model_path_map (115-125), g_model_registry, g_arch_config_map (130-135).
  • Legacy default-handle singleton get_default_handle() (110-113).
  • Per-model (not per-session) tokenizer singleton (nntrainer/.../transformer.h:143).
  • QNN backend may itself impose single-context constraints (RPC/ION memory, backend-ext-config) — see qnn/ and src/models/qnn/.
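
To make the g_chat_template hazard concrete, here is a minimal sketch of the failure mode. It is not the real API: Handle, load_model, and apply_template are hypothetical stand-ins, and it assumes (unverified) that loading a model overwrites the process-global template. Under that assumption the last loader wins, and a handle for the first model is rendered with the second model's template.

```cpp
#include <cassert>
#include <string>

// Process-global template, mirroring the role of g_chat_template in the issue.
std::string g_chat_template;

// Hypothetical handle; the real one wraps a CausalLmModel.
struct Handle {
  std::string model_name;
};

// Hypothetical loader. ASSUMPTION: loading a model sets the global template.
Handle load_model(const std::string &model_name, const std::string &tmpl) {
  g_chat_template = tmpl; // last loader wins
  return Handle{model_name};
}

// Renders a prompt with whatever template is currently global,
// regardless of which handle is asking.
std::string apply_template(const Handle & /*handle*/, const std::string &prompt) {
  std::string out = g_chat_template;
  const std::string key = "{prompt}";
  const auto pos = out.find(key);
  if (pos != std::string::npos) {
    out.replace(pos, key.size(), prompt);
  }
  return out;
}
```

Loading "model-a" with template "<A>{prompt}</A>" and then "model-b" with "[B]{prompt}" makes apply_template(a, "hi") return "[B]hi". This happens even with a single thread, so it is a correctness bug before it is a concurrency bug.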

Goal

Make N concurrent inference sessions safe and correct — independent prompts/templates/sampling/KV state, runnable in parallel without cross-talk.

Proposed scope

  • Audit all g_* globals in api/quick_dot_ai_api.cpp:99-135; move per-session state (chat template, verbosity, sampling config, last output) into CausalLmModel/handle.
  • Fix the shared-g_chat_template correctness bug for multi-model/multi-handle.
  • Verify the QNN backend supports multiple live contexts concurrently; if not, document the constraint and define serialization/queueing semantics.
  • Confirm nntrainer-side per-model state (tokenizer, rng, KV cache) is genuinely per-instance, not static.
  • Add a concurrency stress test (multiple handles, different models/prompts, parallel threads) asserting no output cross-contamination.
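
A rough sketch of the shape the first and last bullets could take: per-session state owned by the session object, plus a stress loop that drives one thread per session and counts mismatches. Everything here is hypothetical (Session, run, stress_sessions); run() is a string-formatting stand-in for inference, so the sketch only shows the test structure, not real model behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical per-session state: template and last output live in the
// session, not in process globals.
class Session {
public:
  Session(std::string model_tag, std::string chat_template)
      : model_tag_(std::move(model_tag)),
        chat_template_(std::move(chat_template)) {}

  // Stand-in for inference: renders the prompt with this session's template.
  std::string run(const std::string &prompt) {
    std::lock_guard<std::mutex> lock(mutex_);
    std::string text = chat_template_;
    const std::string key = "{prompt}";
    const auto pos = text.find(key);
    if (pos != std::string::npos) {
      text.replace(pos, key.size(), prompt);
    }
    last_output_ = model_tag_ + ":" + text;
    return last_output_;
  }

  std::string last_output() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return last_output_;
  }

private:
  mutable std::mutex mutex_; // per-session lock, one per handle
  std::string model_tag_;
  std::string chat_template_;
  std::string last_output_;
};

// Runs one thread per session and returns the number of outputs that did
// not match what that session alone should have produced.
std::size_t stress_sessions(std::size_t session_count, std::size_t iterations) {
  std::vector<std::unique_ptr<Session>> sessions;
  for (std::size_t i = 0; i < session_count; ++i) {
    const std::string id = std::to_string(i);
    sessions.push_back(
        std::make_unique<Session>("model-" + id, "<t" + id + ">{prompt}"));
  }

  std::vector<std::size_t> mismatches(session_count, 0);
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < session_count; ++i) {
    workers.emplace_back([&, i] {
      const std::string id = std::to_string(i);
      const std::string prompt = "prompt-" + id;
      const std::string expected = "model-" + id + ":<t" + id + ">" + prompt;
      for (std::size_t k = 0; k < iterations; ++k) {
        if (sessions[i]->run(prompt) != expected ||
            sessions[i]->last_output() != expected) {
          ++mismatches[i]; // each thread writes only its own slot
        }
      }
    });
  }
  for (auto &w : workers) {
    w.join();
  }

  std::size_t total = 0;
  for (std::size_t m : mismatches) {
    total += m;
  }
  return total;
}
```

Because this toy has no shared mutable state, stress_sessions(4, 1000) is expected to return 0. The real test would swap in actual handles and models, and is most useful when also run under a thread sanitizer.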

Acceptance criteria

  • Two or more handles with different models/prompts run concurrently with deterministic, independent outputs.
  • No shared mutable global affects per-session results.
  • QNN concurrency behavior is either supported or explicitly documented and enforced.
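
If QNN turns out not to support multiple live contexts, "enforced" could be as simple as a process-wide gate around backend calls. The sketch below is an assumption-laden illustration: BackendGate and max_overlap_when_serialized are hypothetical names, no QNN API is involved, and a plain std::mutex gives mutual exclusion only, not a fair FIFO queue.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical gate: serializes backend work unless the backend is known
// to tolerate concurrent contexts.
class BackendGate {
public:
  explicit BackendGate(bool concurrent_ok) : concurrent_ok_(concurrent_ok) {}

  template <typename Fn> void run(Fn &&fn) {
    if (concurrent_ok_) {
      fn(); // backend handles concurrency itself
      return;
    }
    std::lock_guard<std::mutex> lock(mutex_); // one caller at a time
    fn();
  }

private:
  const bool concurrent_ok_;
  std::mutex mutex_;
};

// Measures the highest number of callers observed inside the gate at once
// when the gate is in serialized mode.
int max_overlap_when_serialized(int threads, int iterations) {
  BackendGate gate(false);
  std::atomic<int> active{0};
  std::atomic<int> max_active{0};

  std::vector<std::thread> workers;
  for (int t = 0; t < threads; ++t) {
    workers.emplace_back([&] {
      for (int i = 0; i < iterations; ++i) {
        gate.run([&] {
          const int now = ++active;
          int prev = max_active.load();
          // Record a new maximum if this call saw more overlap than before.
          while (now > prev && !max_active.compare_exchange_weak(prev, now)) {
          }
          --active;
        });
      }
    });
  }
  for (auto &w : workers) {
    w.join();
  }
  return max_active.load();
}
```

In serialized mode the observed overlap never exceeds one caller. If ordering or fairness between sessions matters, an explicit request queue would be needed instead of a bare mutex.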

Risks / notes

  • This likely intersects Issue 4 (KV-cache ownership) and Issue 5 (per-session context tracking).
  • QNN RPC/ION memory and backend-ext-config init order (cf. recent commits 556f3c4, c5d7c79) may be the hard blocker for true parallelism.
