[Feat] Support concurrent multi-session inference #31

 ## Background / Problem

  The handle-based API is structurally multi-session: each loadModelHandle() call allocates an independent CausalLmModel with its own mutex and state (api/quick_dot_ai_api.cpp:87-97, allocation at line 2389), and a header comment claims that different handles can run in parallel from different threads (api/quick_dot_ai_api.cpp:158-161).

  However, significant process-global / singleton state sits underneath the handles and likely prevents correct concurrent multi-session use:

  - g_chat_template, g_use_chat_template, g_verbose, g_last_output, g_registry_mutex (quick_dot_ai_api.cpp:101-107). Notably, g_chat_template is a single shared template, which is already incorrect when two handles load different models.
  - g_model_path_map (115-125), g_model_registry, g_arch_config_map (130-135).
  - Legacy default-handle singleton get_default_handle() (110-113).
  - Per-model (not per-session) tokenizer singleton (nntrainer/.../transformer.h:143).
  - The QNN backend may itself impose single-context constraints (RPC/ION memory, backend-ext-config); see qnn/ and src/models/qnn/.
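
The shared-template hazard listed above can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in (the `sketch` namespace, `GlobalStyleSession`, `PerSessionStyle`, and `render()` are not names from quick_dot_ai_api.cpp); it only mirrors the shape of a process-global template versus a template owned by each session.

```cpp
#include <string>
#include <utility>

namespace sketch {

// Hypothetical process-global template, mirroring the shape of g_chat_template.
std::string g_template;

// Session that stores its template in the global: the last loader wins.
struct GlobalStyleSession {
  explicit GlobalStyleSession(const std::string &tmpl) { g_template = tmpl; }
  std::string render(const std::string &prompt) const {
    return g_template + prompt;
  }
};

// Session that owns its template: each instance keeps what it was given.
class PerSessionStyle {
public:
  explicit PerSessionStyle(std::string tmpl) : template_(std::move(tmpl)) {}
  std::string render(const std::string &prompt) const {
    return template_ + prompt;
  }

private:
  std::string template_;
};

} // namespace sketch
```

With two `GlobalStyleSession` objects created in sequence, the first one renders with the second one's template, even on a single thread. The per-session variant does not have this problem, which is the direction this issue proposes for the chat template and the other per-session globals.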

 ## Goal

  Make N concurrent inference sessions safe and correct: each session has independent prompts, templates, sampling, and KV state, and sessions can run in parallel without cross-talk.

 ## Proposed scope

  - [ ] Audit all g_* globals in api/quick_dot_ai_api.cpp:99-135; move per-session state (chat template, verbosity, sampling config, last output) into CausalLmModel/handle.
  - [ ] Fix the shared-g_chat_template correctness bug for multi-model/multi-handle.
  - [ ] Verify the QNN backend supports multiple live contexts concurrently; if not, document the constraint and define serialization/queueing semantics.
  - [ ] Confirm nntrainer-side per-model state (tokenizer, rng, KV cache) is genuinely per-instance, not static.
  - [ ] Add a concurrency stress test (multiple handles, different models/prompts, parallel threads) asserting no output cross-contamination.
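
A possible shape for the stress test in the last item is sketched below. `FakeSession` and `count_cross_talk()` are hypothetical; a real test would replace `FakeSession` with handles obtained from loadModelHandle() and compare each handle's output against a single-threaded reference run. The sketch only shows the harness: one thread per session, each checking that every output carries its own session's tag.

```cpp
#include <atomic>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a model handle with its own mutex and state.
class FakeSession {
public:
  explicit FakeSession(std::string tag) : tag_(std::move(tag)) {}

  // Tags the prompt with this session's identity and records it as last output.
  std::string generate(const std::string &prompt) {
    std::lock_guard<std::mutex> lock(mutex_);
    last_output_ = tag_ + ":" + prompt;
    return last_output_;
  }

private:
  std::string tag_;
  std::string last_output_;
  std::mutex mutex_;
};

// Runs one thread per session and returns how many outputs did not match
// the tag of the session that produced them.
int count_cross_talk(int n_sessions, int iterations) {
  std::vector<std::unique_ptr<FakeSession>> sessions;
  for (int i = 0; i < n_sessions; ++i) {
    sessions.push_back(std::make_unique<FakeSession>("s" + std::to_string(i)));
  }

  std::atomic<int> mismatches{0};
  std::vector<std::thread> workers;
  for (int i = 0; i < n_sessions; ++i) {
    workers.emplace_back([&, i] {
      const std::string tag = "s" + std::to_string(i);
      for (int k = 0; k < iterations; ++k) {
        const std::string prompt = "p" + std::to_string(k);
        if (sessions[i]->generate(prompt) != tag + ":" + prompt) {
          ++mismatches;
        }
      }
    });
  }
  for (auto &w : workers) {
    w.join();
  }
  return mismatches.load();
}
```

A call such as `count_cross_talk(4, 1000)` should return 0 for sessions that share no mutable state. Running the real version under ThreadSanitizer would additionally surface data races that do not happen to corrupt output.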

 ## Acceptance criteria

  - Two or more handles with different models/prompts run concurrently with deterministic, independent outputs.
  - No shared mutable global affects per-session results.
  - QNN concurrency behavior is either supported or explicitly documented and enforced.
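
If the QNN backend turns out not to support multiple live contexts, "enforced" could be as simple as a process-wide gate around backend calls. The sketch below is an assumption about one way to do that, not the project's design: `BackendGate` and `max_in_flight()` are hypothetical names, and the gated work is a placeholder rather than a real QNN call. Note that `std::mutex` gives mutual exclusion but no FIFO ordering, so a real queueing policy would need more than this.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical process-wide gate: at most one caller executes inside run().
class BackendGate {
public:
  template <typename Fn>
  auto run(Fn &&fn) -> decltype(fn()) {
    std::lock_guard<std::mutex> lock(mutex_);
    return fn();
  }

private:
  std::mutex mutex_;
};

// Drives the gate from several threads and reports the peak number of
// callers observed inside the gated region at the same time.
int max_in_flight(int n_threads, int iterations) {
  BackendGate gate;
  std::atomic<int> in_flight{0};
  std::atomic<int> peak{0};

  std::vector<std::thread> workers;
  for (int t = 0; t < n_threads; ++t) {
    workers.emplace_back([&] {
      for (int k = 0; k < iterations; ++k) {
        gate.run([&] {
          const int now = ++in_flight; // placeholder for a backend call
          if (now > peak.load()) {
            peak.store(now);
          }
          --in_flight;
        });
      }
    });
  }
  for (auto &w : workers) {
    w.join();
  }
  return peak.load();
}
```

With the gate in place, `max_in_flight(4, 1000)` reports a peak of 1, which is the property the acceptance criterion would need a test to assert when QNN concurrency is serialized rather than supported.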

 ## Risks / notes

  - This likely intersects Issue 4 (KV-cache ownership) and Issue 5 (per-session context tracking).
  - QNN RPC/ION memory and backend-ext-config init order (cf. recent commits 556f3c4, c5d7c79) may be the hard blocker for true parallelism.

