Eval bug: Crash on version 0.3.0 with two or three cards (Gemma 31B, DFlash) #38

@nikitabalakin

Name and Version

./llama-server -m /home/kotokin/model/google_gemma-4-31B-it-Q6_K.gguf -ngl 999 -sm tensor -c 65536 --jinja -np 1 --ctx-checkpoints 4 --spec-type dflash --spec-draft-model /home/kotokin/model/gemma4-31b-it-dflash-Q8_0.gguf

Operating systems

Linux

GGML backends

CUDA

Hardware

3x NVIDIA GeForce RTX 3090

Models

No response

Problem description & steps to reproduce

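llama-server crashes at startup on version 0.3.0 when two or three cards are used (Gemma 31B with DFlash speculative decoding, -sm tensor). With the command above, the process aborts while the DFlash draft-model context is being created, with the message: pre-allocated tensor (token_embd.weight) in a buffer (Meta()) that cannot run the operation (NONE). The full log and backtrace are below.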

First Bad Commit

No response

Relevant log output

kotokin@kotokin-x570ud:~/beellama.cpp/build/bin$ export CUDA_VISIBLE_DEVICES=0,1
kotokin@kotokin-x570ud:~/beellama.cpp/build/bin$ ./llama-server -m /home/kotokin/model/google_gemma-4-31B-it-Q6_K.gguf -ngl 999 -sm tensor -c 65536 --jinja -np 1 --ctx-checkpoints 4 --spec-type dflash --spec-draft-model /home/kotokin/model/gemma4-31b-it-dflash-Q8_0.gguf
0.00.378.605 I dflash: setting -cd to 256 (drafter doesn't need the full main ctx; pass -cd N to override)
0.00.378.653 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.378.653 I device_info:
0.00.529.705 I   - CUDA0   : NVIDIA GeForce RTX 3090 (24126 MiB, 23860 MiB free)
0.00.681.359 I   - CUDA1   : NVIDIA GeForce RTX 3090 (24126 MiB, 23860 MiB free)
0.00.681.375 I   - CPU     : AMD Ryzen 9 3900XT 12-Core Processor (32011 MiB, 32011 MiB free)
0.00.681.497 I system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
0.00.682.377 I srv          init: using 23 threads for HTTP server
0.00.682.714 I srv         start: binding port with default address family
0.00.684.135 I srv  llama_server: loading model
0.00.684.477 I srv    load_model: loading model '/home/kotokin/model/google_gemma-4-31B-it-Q6_K.gguf'
0.00.684.520 I srv    load_model: auto-enabled kv-unified: spec decode backup doesn't need separate KV stream
0.00.684.541 I common_init_result: fitting params to device memory ...
0.00.684.542 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.685.502 W common_fit_params: failed to fit params to free device memory: llama_params_fit is not implemented for SPLIT_MODE_TENSOR, abort
0.01.303.462 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.304.022 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.335.388 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.27.767.619 W llama_context: n_ctx_seq (65536) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.28.264.345 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
TCQ decode: context-adaptive V alpha enabled
0.28.578.203 I srv    load_model: loading draft model '/home/kotokin/model/gemma4-31b-it-dflash-Q8_0.gguf'
0.28.578.253 I srv    load_model: DFlash draft model will use a single device by default; pass --spec-draft-device to override
0.29.218.559 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.29.219.126 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.29.252.736 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.30.144.739 I srv    load_model: initializing slots, n_slots = 1
0.30.191.681 I srv    load_model: DFlash enabled for all 1 slots
0.30.191.730 W llama_context: n_ctx_seq (256) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
/home/kotokin/beellama.cpp/ggml/src/ggml-backend.cpp:898: pre-allocated tensor (token_embd.weight) in a buffer (Meta()) that cannot run the operation (NONE)
[New LWP 114828]
[New LWP 114827]
[New LWP 114826]
[New LWP 114825]
[New LWP 114824]
[New LWP 114689]
[New LWP 114688]
[New LWP 114687]
[New LWP 114686]
[New LWP 114685]
[New LWP 114684]
[New LWP 114683]
[New LWP 114682]
[New LWP 114681]
[New LWP 114680]
[New LWP 114679]
[New LWP 114678]
[New LWP 114677]
[New LWP 114676]
[New LWP 114675]
[New LWP 114674]
[New LWP 114673]
[New LWP 114672]
[New LWP 114671]
[New LWP 114670]
[New LWP 114669]
[New LWP 114668]
[New LWP 114667]
[New LWP 114666]
[New LWP 114665]
[New LWP 114664]
[New LWP 114663]
[New LWP 114662]
[New LWP 114661]
[New LWP 114654]
[New LWP 114653]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
⚠ warning: 56  ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56      in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1  0x00007e8e4dea067c in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=0, a6=0, nr=61) at ./nptl/cancellation.c:49
⚠ warning: 49  ./nptl/cancellation.c: No such file or directory
#2  __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75      in ./nptl/cancellation.c
#3  0x00007e8e4df1cdcf in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
⚠ warning: 30  ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4  0x00007e8e4e52ce73 in ggml_print_backtrace () from /home/kotokin/beellama.cpp/build/bin/libggml-base.so.0
#5  0x00007e8e4e52d026 in ggml_abort () from /home/kotokin/beellama.cpp/build/bin/libggml-base.so.0
#6  0x00007e8e4e546c4c in ggml_backend_sched_backend_id_from_cur(ggml_backend_sched*, ggml_tensor*) () from /home/kotokin/beellama.cpp/build/bin/libggml-base.so.0
#7  0x00007e8e4e548c0f in ggml_backend_sched_split_graph () from /home/kotokin/beellama.cpp/build/bin/libggml-base.so.0
#8  0x00007e8e4d4e7bc6 in llama_context::graph_reserve(unsigned int, unsigned int, unsigned int, llama_memory_context_i const*, bool, unsigned long*) () from /home/kotokin/beellama.cpp/build/bin/libllama.so.0
#9  0x00007e8e4d4e9a58 in llama_context::sched_reserve() () from /home/kotokin/beellama.cpp/build/bin/libllama.so.0
#10 0x00007e8e4d4f8c0b in llama_context::llama_context(llama_model const&, llama_context_params) () from /home/kotokin/beellama.cpp/build/bin/libllama.so.0
#11 0x00007e8e4d4f9970 in llama_init_from_model () from /home/kotokin/beellama.cpp/build/bin/libllama.so.0
#12 0x00007e8e4dadcefc in common_speculative_create_ctx_dft(common_params_speculative const&, int) () from /home/kotokin/beellama.cpp/build/bin/libllama-common.so.0
#13 0x00007e8e4e7a545d in server_context_impl::load_model(common_params&) () from /home/kotokin/beellama.cpp/build/bin/libllama-server-impl.so
#14 0x00007e8e4e6d1cda in llama_server(int, char**) () from /home/kotokin/beellama.cpp/build/bin/libllama-server-impl.so
#15 0x00007e8e4de2a601 in __libc_start_call_main (main=main@entry=0x59609a5b3340 <main>, argc=argc@entry=18, argv=argv@entry=0x7ffc754aae28) at ../sysdeps/nptl/libc_start_call_main.h:59
⚠ warning: 59  ../sysdeps/nptl/libc_start_call_main.h: No such file or directory
#16 0x00007e8e4de2a718 in __libc_start_main_impl (main=0x59609a5b3340 <main>, argc=18, argv=0x7ffc754aae28, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffc754aae18) at ../csu/libc-start.c:360
⚠ warning: 360 ../csu/libc-start.c: No such file or directory
#17 0x000059609a5b3375 in _start ()
[Inferior 1 (process 114651) detached]
Aborted                    ./llama-server -m /home/kotokin/model/google_gemma-4-31B-it-Q6_K.gguf -ngl 999 -sm tensor -c 65536 --jinja -np 1 --ctx-checkpoints 4 --spec-type dflash --spec-draft-model /home/kotokin/model/gemma4-31b-it-dflash-Q8_0.gguf
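
For triage, the abort message and the call chain can be pulled out of a log like the one above with a short script. This is only a sketch: the function name summarize_crash and both regular expressions are hypothetical helpers written for this log format (a "path:line: message" abort line followed by gdb "#N" frames). They are not part of llama.cpp or ggml.

```python
import re

# Abort line as it appears in the log above: "<source path>:<line>: <message>".
# The set of file extensions is an assumption, chosen to cover ggml sources.
ABORT_RE = re.compile(r"^(?P<file>/\S+\.(?:c|cpp|cu|h)):(?P<line>\d+): (?P<msg>.+)$")

# gdb frame: "#N  0xADDR in symbol(...)" or "#N  symbol (...)".
# Only the frame index and the symbol name (up to the first '(' or space) are kept.
FRAME_RE = re.compile(r"^#(?P<idx>\d+)\s+(?:0x[0-9a-f]+ in )?(?P<sym>[^\s(]+)")


def summarize_crash(log_text):
    """Return (abort, frames) extracted from a llama-server crash log.

    abort  -> (file, line, message) of the first abort line, or None
    frames -> list of (index, symbol) for every gdb backtrace frame
    """
    abort = None
    frames = []
    for raw in log_text.splitlines():
        line = raw.strip()
        if abort is None:
            m = ABORT_RE.match(line)
            if m:
                abort = (m["file"], int(m["line"]), m["msg"])
                continue
        m = FRAME_RE.match(line)
        if m:
            frames.append((int(m["idx"]), m["sym"]))
    return abort, frames


if __name__ == "__main__":
    import sys

    # Usage: python summarize_crash.py < server.log
    abort, frames = summarize_crash(sys.stdin.read())
    print("abort:", abort)
    for idx, sym in frames:
        print(f"  #{idx} {sym}")
```

Run against the log in this report, the frames between ggml_abort and llama_server show that the abort is raised from ggml_backend_sched_split_graph while common_speculative_create_ctx_dft is constructing the draft llama_context.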
