Background / Problem
The handle-based API is structurally multi-session: each loadModelHandle() call allocates an independent CausalLmModel with its own mutex and state (api/quick_dot_ai_api.cpp:87-97, allocation at line 2389), and a header comment claims that different handles can run in parallel from different threads (api/quick_dot_ai_api.cpp:158-161).
However, significant process-global / singleton state sits underneath the handles and likely prevents correct concurrent multi-session use:
- g_chat_template, g_use_chat_template, g_verbose, g_last_output, g_registry_mutex (quick_dot_ai_api.cpp:101-107). Notably, g_chat_template is a single shared template, which is already incorrect when two handles load different models.
- g_model_path_map (lines 115-125), g_model_registry, g_arch_config_map (lines 130-135), all in the same file.
- Legacy default-handle singleton get_default_handle() (lines 110-113 of the same file).
- Per-model (not per-session) tokenizer singleton (nntrainer/.../transformer.h:143).
- QNN backend may itself impose single-context constraints (RPC/ION memory, backend-ext-config) — see qnn/ and src/models/qnn/.
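As an illustration of the direction implied by the list above, the template-related globals could move into a per-handle struct. This is a minimal sketch only: SessionState, render_prompt, and the "{prompt}" placeholder syntax are hypothetical names invented here, not the project's actual types or template format.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>

// Hypothetical per-session state: one instance per handle, standing in for the
// process-wide g_chat_template / g_use_chat_template / g_verbose / g_last_output.
struct SessionState {
  std::string chat_template;      // template belonging to this handle's model
  bool use_chat_template = false;
  bool verbose = false;
  std::string last_output;
  std::mutex mutex;               // guards this session only
};

// Hypothetical helper: renders a prompt with the session's own template.
// The "{prompt}" placeholder syntax is an assumption made for this sketch.
std::string render_prompt(SessionState &session, const std::string &prompt) {
  std::lock_guard<std::mutex> lock(session.mutex);
  if (!session.use_chat_template) {
    return prompt;
  }
  const std::string placeholder = "{prompt}";
  std::string rendered = session.chat_template;
  const std::size_t pos = rendered.find(placeholder);
  assert(pos != std::string::npos && "template must contain {prompt}");
  rendered.replace(pos, placeholder.size(), prompt);
  return rendered;
}
```

With state held this way, two handles that load different models each render prompts with their own template, and neither can overwrite the other's.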
Goal
Make N concurrent inference sessions safe and correct — independent prompts/templates/sampling/KV state, runnable in parallel without cross-talk.
Proposed scope
Acceptance criteria
- Two or more handles with different models/prompts run concurrently with deterministic, independent outputs.
- No shared mutable global affects per-session results.
- QNN concurrency behavior is either supported or explicitly documented + enforced.
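The first two criteria could be checked with a small harness along the following lines. It is a sketch under stated assumptions: FakeSession is a hypothetical stand-in for a real model handle, and outputs_independent is an illustrative helper, not an existing test utility in the repository. A real test would replace FakeSession::run with a call through the handle API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a model handle; a real test would wrap the
// handle-based API here instead of concatenating strings.
struct FakeSession {
  std::string tag;
  std::string run(const std::string &prompt) const { return tag + ":" + prompt; }
};

// Runs every session on its own thread and returns true only if each threaded
// output equals that session's single-threaded reference output.
bool outputs_independent(const std::vector<FakeSession> &sessions,
                         const std::string &prompt, int iterations) {
  assert(iterations > 0);
  // char instead of bool: std::vector<bool> packs bits, so writes to
  // neighbouring elements from different threads would race.
  std::vector<char> ok(sessions.size(), 1);
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < sessions.size(); ++i) {
    // Reference result computed before the worker thread starts.
    const std::string expected = sessions[i].run(prompt);
    workers.emplace_back([&sessions, &ok, &prompt, expected, i, iterations] {
      for (int n = 0; n < iterations; ++n) {
        if (sessions[i].run(prompt) != expected) {
          ok[i] = 0;
        }
      }
    });
  }
  for (std::thread &worker : workers) {
    worker.join();
  }
  for (char flag : ok) {
    if (!flag) {
      return false;
    }
  }
  return true;
}
```

Against the real API, a failure here would point to shared mutable state leaking between sessions, although a pass alone would not prove its absence.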
Risks / notes
- This likely intersects Issue 4 (KV-cache ownership) and Issue 5 (per-session context tracking).
- QNN RPC/ION memory and backend-ext-config init order (cf. recent commits 556f3c4, c5d7c79) may be the hard blocker for true parallelism.
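If QNN does turn out to allow only one active context, the "documented + enforced" option could take the form of an explicit process-wide gate that rejects a second session instead of letting it corrupt shared backend state. The sketch below is an assumption about one possible shape: SingleContextGate is a hypothetical class and does not call any QNN API.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical process-wide gate for a backend that supports only one active
// context at a time. It does not touch QNN; it only tracks ownership.
class SingleContextGate {
 public:
  // Returns true if the caller now owns the backend, false if another
  // session already holds it.
  bool try_acquire() {
    bool expected = false;
    return busy_.compare_exchange_strong(expected, true);
  }

  // Gives the backend back; must pair with a successful try_acquire().
  void release() {
    const bool was_busy = busy_.exchange(false);
    assert(was_busy && "release() without a matching try_acquire()");
    (void)was_busy;  // silences the unused warning when NDEBUG is defined
  }

 private:
  std::atomic<bool> busy_{false};
};
```

A load path could call try_acquire() before creating a QNN-backed model and return a clear error on failure, which makes the single-context limit visible to callers instead of implicit.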