Subject of the issue
wamr_llvm_jit is much slower than peer runtimes on a small i8x16.bitmask loop when the input vector is built with i8x16.replace_lane and the replaced lane is loop-varying in the negative-byte range 0x80..0xff.
Test case
The clearest minimized reproducer is:
(module
  (type (func (param i32)))
  (type (func))
  (import "wasi_snapshot_preview1" "proc_exit" (func (type 0)))
  (func (type 1)
    (local $i i64)
    (local $acc i32)
    (local.set $i (i64.const 1073741824))
    (local.set $acc (i32.const 1311768464))
    (loop $body
      (local.get $acc)
      v128.const i8x16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      local.get $i
      i32.wrap_i64
      i32.const 0x80
      i32.or
      i8x16.replace_lane 7
      i8x16.bitmask
      i32.xor
      (local.set $acc)
      (local.set $i (i64.sub (local.get $i) (i64.const 1)))
      (br_if $body (i64.ne (local.get $i) (i64.const 0)))
    )
    (i32.const 0)
    (local.get $acc)
    (i32.store)
    (call 0 (i32.const 0))
  )
  (memory 1)
  (export "_start" (func 1))
  (export "memory" (memory 0))
)
I also checked closely matched controls:
- `multilane_all_neg_splat`: all lanes vary through `i8x16.splat`
- `sweep_const_80`: replaced lane is constant negative
- `obs_extract_lane_s7_negvary_xor`: keep `i8x16.replace_lane` and the negative varying lane, but observe with `extract_lane_s` instead of `bitmask`
- `cross_i16x8_bitmask_negvary`: same high-level pattern translated to `i16x8.bitmask`
Your environment
- wasmer: 6.1.0
- WAMR: iwasm 2.4.4
- wasmedge: 0.16.1-18-gc457fe30
- wasmtime: 41.0.0 (4898322a4 2025-12-18)
- wabt: 1.0.39
- llvm: 21.1.5
- Host OS: Ubuntu 22.04.5 LTS x64
- CPU: 12th Gen Intel® Core™ i7-12700 × 20
Steps to reproduce
- Compile the test case: `wat2wasm reproducer.wat -o reproducer.wasm`.
- Run the wasm file with `wamr_llvm_jit` and compare its wall-clock or task-clock time against the other runtimes.
- Compare against the controls listed above.
Representative commands in my setup:
wat2wasm reproducer.wat -o reproducer.wasm
# WAMR LLVM JIT
/path/to/iwasm reproducer.wasm
# peer runtimes
/path/to/wasmer run --llvm reproducer.wasm
/path/to/wasmedge --enable-jit reproducer.wasm
/path/to/wasmer run reproducer.wasm
/path/to/wasmtime reproducer.wasm
Expected and actual behavior
Expected behavior
For this small tight SIMD loop, I would expect wamr_llvm_jit to stay in the same rough range as the other major runtimes, or at least not to become a strong outlier only for this narrow i8x16.bitmask shape.
Actual behavior
wamr_llvm_jit is a large slowdown outlier on the reduced triggers: about 2.0 s versus 0.03–0.63 s for every peer runtime on `multilane_one_neg_lane7` and `sweep_loop_or_80`.
Representative timings (seconds):

| variant | wasmer_llvm | wasmedge_jit | wamr_llvm_jit | wasmer_cranelift | wasmtime |
|---|---|---|---|---|---|
| multilane_one_neg_lane7 | 0.31224 | 0.03303 | 2.00588 | 0.61651 | 0.63321 |
| sweep_loop_or_80 | 0.31067 | 0.03142 | 1.98314 | 0.62097 | 0.61535 |
| multilane_all_neg_splat | 0.31384 | 0.03224 | 0.31538 | 0.61624 | 0.62004 |
| sweep_const_80 | 0.30556 | 0.03335 | 0.32205 | 0.30358 | 0.33636 |
| obs_extract_lane_s7_negvary_xor | ~0.38 | ~0.27 | ~0.35 | ~0.69 | ~0.68 |
| cross_i16x8_bitmask_negvary | 0.30917 | 0.03175 | 0.31183 | 0.69188 | 0.69385 |
Important observations:
- A single loop-varying negative byte inserted by `i8x16.replace_lane` is already sufficient to trigger the slowdown.
- The slowdown does not reproduce when the same negative pattern is produced by `i8x16.splat`.
- The slowdown does not reproduce when the replaced lane is constant negative instead of loop-varying.
- The slowdown does not reproduce on the matched `i16x8.bitmask` version.
- Replacing `bitmask` with `extract_lane_s` keeps some work alive but does not reproduce the large WAMR outlier.

So the strongest observed trigger condition is: `i8x16.bitmask` consuming a vector built by `i8x16.replace_lane`, where at least one replaced lane is loop-varying and its effective byte stays in `0x80..0xff`.
Extra Info
I also exported WAMR low-level artifacts with wamrc --format=llvmir-unopt, --format=llvmir-opt, and --format=object.
For the slow i8x16 bitmask-shaped cases, optimized LLVM IR consistently contains a pattern like:
%new_vector = insertelement <16 x i8> ..., i8 %lane, i64 7
%isneg = icmp slt <16 x i8> %new_vector, zeroinitializer
%mask_bits = select <16 x i1> %isneg, <16 x i64> <1, 2, 4, ..., 32768>, <16 x i64> zeroinitializer
%call = tail call i64 @llvm.vector.reduce.or.v16i64(<16 x i64> %mask_bits)
At the object level, the hot loop becomes a long synthesized sequence with operations such as:
- `vpinsrb`
- `vpcmpgtb`
- `vpmovsxbw`
- `vpmovzxbq` / `vpmovzxwq`
- `vextracti128`
- repeated `vpand` / `vpor`

I did not observe a compact `pmovmskb` / `vpmovmskb`-style extraction sequence in these WAMR-generated object files.
By contrast:
- the constant-negative control (`sweep_const_80`) is optimized down so the expensive bitmask path no longer stays alive in the hot loop;
- the matched `i16x8.bitmask` control does not preserve a comparable heavy object-level sequence.
So the report above is based on both runtime measurements and WAMR low-level evidence.