wamr_llvm_jit is a large slow outlier on i8x16.bitmask when a loop-varying negative byte is inserted via i8x16.replace_lane #4931

@gaaraw

Description

Subject of the issue

wamr_llvm_jit is much slower than peer runtimes on a small i8x16.bitmask loop when the input vector is built with i8x16.replace_lane and the replaced lane is loop-varying in the negative-byte range 0x80..0xff.

Test case

The clearest minimized reproducer is:

(module
  (type (func (param i32)))
  (type (func))
  (import "wasi_snapshot_preview1" "proc_exit" (func (type 0)))
  (func (type 1)
    (local $i i64)
    (local $acc i32)
    (local.set $i (i64.const 1073741824))    ;; 2^30 iterations
    (local.set $acc (i32.const 1311768464))
    (loop $body
      (local.get $acc)
      ;; all-zero vector except lane 7, which receives a loop-varying
      ;; byte forced into the negative range 0x80..0xff
      v128.const i8x16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      local.get $i
      i32.wrap_i64
      i32.const 0x80
      i32.or
      i8x16.replace_lane 7
      i8x16.bitmask
      i32.xor
      (local.set $acc)
      (local.set $i (i64.sub (local.get $i) (i64.const 1)))
      (br_if $body (i64.ne (local.get $i) (i64.const 0)))
    )
    ;; store the accumulator so the loop result stays observable
    (i32.store (i32.const 0) (local.get $acc))
    (call 0 (i32.const 0))
  )
  (memory 1)
  (export "_start" (func 1))
  (export "memory" (memory 0))
)

I also checked closely matched controls:

  • multilane_all_neg_splat: all lanes vary through i8x16.splat
  • sweep_const_80: replaced lane is constant negative
  • obs_extract_lane_s7_negvary_xor: keep replace_lane and negative varying lane, but observe with extract_lane_s instead of bitmask
  • cross_i16x8_bitmask_negvary: same high-level pattern translated to i16x8.bitmask

Your environment

  • wasmer: 6.1.0
  • WAMR: iwasm 2.4.4
  • wasmedge: 0.16.1-18-gc457fe30
  • wasmtime: 41.0.0 (4898322a4 2025-12-18)
  • wabt: 1.0.39
  • llvm: 21.1.5
  • Host OS: Ubuntu 22.04.5 LTS x64
  • CPU: 12th Gen Intel® Core™ i7-12700 × 20

Steps to reproduce

  1. Compile the testcase with wat2wasm reproducer.wat -o reproducer.wasm.
  2. Run the wasm file with wamr_llvm_jit and compare its wall-clock or task-clock time with other runtimes.
  3. Compare against the controls listed above.

Representative commands in my setup:

wat2wasm reproducer.wat -o reproducer.wasm

# WAMR LLVM JIT
/path/to/iwasm reproducer.wasm

# peer runtimes
/path/to/wasmer run --llvm reproducer.wasm
/path/to/wasmedge --enable-jit reproducer.wasm
/path/to/wasmer run reproducer.wasm
/path/to/wasmtime reproducer.wasm

Expected and actual behavior

Expected behavior

For this small tight SIMD loop, I would expect wamr_llvm_jit to stay in the same rough range as the other major runtimes, or at least not to become a strong outlier only for this narrow i8x16.bitmask shape.

Actual behavior

wamr_llvm_jit is a large slowdown outlier on the reduced trigger.

Representative timings (seconds):

variant                          wasmer_llvm  wasmedge_jit  wamr_llvm_jit  wasmer_cranelift  wasmtime
multilane_one_neg_lane7          0.31224      0.03303       2.00588        0.61651           0.63321
sweep_loop_or_80                 0.31067      0.03142       1.98314        0.62097           0.61535
multilane_all_neg_splat          0.31384      0.03224       0.31538        0.61624           0.62004
sweep_const_80                   0.30556      0.03335       0.32205        0.30358           0.33636
obs_extract_lane_s7_negvary_xor  ~0.38        ~0.27         ~0.35          ~0.69             ~0.68
cross_i16x8_bitmask_negvary      0.30917      0.03175       0.31183        0.69188           0.69385

Important observations:

  • A single loop-varying negative byte inserted by i8x16.replace_lane is already sufficient to trigger the slowdown.
  • The slowdown does not reproduce when the same negative pattern is produced by i8x16.splat.
  • The slowdown does not reproduce when the replaced lane is constant negative instead of loop-varying.
  • The slowdown does not reproduce on the matched i16x8.bitmask version.
  • Replacing bitmask with extract_lane_s keeps some work alive but does not reproduce the large WAMR outlier.

So the strongest observed trigger condition is:

i8x16.bitmask consuming a vector built by i8x16.replace_lane, where at least one replaced lane is loop-varying and its effective byte stays in 0x80..0xff.

Extra Info

I also exported WAMR low-level artifacts with wamrc --format=llvmir-unopt, --format=llvmir-opt, and --format=object.

For the slow i8x16 bitmask-shaped cases, optimized LLVM IR consistently contains a pattern like:

%new_vector = insertelement <16 x i8> ..., i8 %lane, i64 7
%isneg      = icmp slt <16 x i8> %new_vector, zeroinitializer
%mask_bits  = select <16 x i1> %isneg,
                     <16 x i64> <i64 1, i64 2, i64 4, ..., i64 32768>,
                     <16 x i64> zeroinitializer
%call       = tail call i64 @llvm.vector.reduce.or.v16i64(<16 x i64> %mask_bits)

At object level, the hot loop becomes a long synthesized sequence with operations such as:

  • vpinsrb
  • vpcmpgtb
  • vpmovsxbw
  • vpmovzxbq / vpmovzxwq
  • vextracti128
  • repeated vpand / vpor

I did not observe a compact pmovmskb / vpmovmskb-style extraction sequence in these WAMR-generated object files.

By contrast:

  • the constant-negative control (sweep_const_80) is optimized down so the expensive bitmask path no longer stays alive in the hot loop;
  • the matched i16x8.bitmask control does not preserve a comparable heavy object-level sequence.

So the report above is based on both runtime measurements and WAMR low-level evidence.
