wamr_llvm_jit is a large slow outlier on i8x16.bitmask when a loop-varying negative byte is inserted via i8x16.replace_lane #4931

@gaaraw

Description

Subject of the issue

wamr_llvm_jit is much slower than peer runtimes on a small i8x16.bitmask loop when the input vector is built with i8x16.replace_lane and the replaced lane is loop-varying in the negative-byte range 0x80..0xff.

Test case

The clearest minimized reproducer is:

(module
  (type (func (param i32)))
  (type (func))
  (import "wasi_snapshot_preview1" "proc_exit" (func (type 0)))
  (func (type 1)
    (local $i i64)
    (local $acc i32)
    (local.set $i (i64.const 1073741824))    ;; 2^30 iterations
    (local.set $acc (i32.const 1311768464))
    (loop $body
      (local.get $acc)
      ;; all-zero vector except lane 7, which receives a loop-varying
      ;; byte forced into the negative range 0x80..0xff
      v128.const i8x16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      local.get $i
      i32.wrap_i64
      i32.const 0x80
      i32.or
      i8x16.replace_lane 7
      i8x16.bitmask
      i32.xor
      (local.set $acc)
      (local.set $i (i64.sub (local.get $i) (i64.const 1)))
      (br_if $body (i64.ne (local.get $i) (i64.const 0)))
    )
    ;; store the accumulator so the loop result stays observable
    (i32.store (i32.const 0) (local.get $acc))
    (call 0 (i32.const 0))
  )
  (memory 1)
  (export "_start" (func 1))
  (export "memory" (memory 0))
)

I also checked closely matched controls:

  • multilane_all_neg_splat: all lanes vary through i8x16.splat
  • sweep_const_80: replaced lane is constant negative
  • obs_extract_lane_s7_negvary_xor: keep replace_lane and negative varying lane, but observe with extract_lane_s instead of bitmask
  • cross_i16x8_bitmask_negvary: same high-level pattern translated to i16x8.bitmask

Your environment

  • wasmer: 6.1.0
  • WAMR: iwasm 2.4.4
  • wasmedge: 0.16.1-18-gc457fe30
  • wasmtime: 41.0.0 (4898322a4 2025-12-18)
  • wabt: 1.0.39
  • llvm: 21.1.5
  • Host OS: Ubuntu 22.04.5 LTS x64
  • CPU: 12th Gen Intel® Core™ i7-12700 × 20

Steps to reproduce

  1. Compile the testcase with wat2wasm reproducer.wat -o reproducer.wasm.
  2. Run the wasm file with wamr_llvm_jit and compare its wall-clock or task-clock time with other runtimes.
  3. Compare against the controls listed above.

Representative commands in my setup:

wat2wasm reproducer.wat -o reproducer.wasm

# WAMR LLVM JIT
/path/to/iwasm reproducer.wasm

# peer runtimes
/path/to/wasmer run --llvm reproducer.wasm
/path/to/wasmedge --enable-jit reproducer.wasm
/path/to/wasmer run reproducer.wasm
/path/to/wasmtime reproducer.wasm

Expected and actual behavior

Expected behavior

For this small tight SIMD loop, I would expect wamr_llvm_jit to stay in the same rough range as the other major runtimes, or at least not to become a strong outlier only for this narrow i8x16.bitmask shape.

Actual behavior

wamr_llvm_jit is a large slowdown outlier on the reduced trigger.

Representative timings (seconds):

variant                          wasmer_llvm  wasmedge_jit  wamr_llvm_jit  wasmer_cranelift  wasmtime
multilane_one_neg_lane7          0.31224      0.03303       2.00588        0.61651           0.63321
sweep_loop_or_80                 0.31067      0.03142       1.98314        0.62097           0.61535
multilane_all_neg_splat          0.31384      0.03224       0.31538        0.61624           0.62004
sweep_const_80                   0.30556      0.03335       0.32205        0.30358           0.33636
obs_extract_lane_s7_negvary_xor  ~0.38        ~0.27         ~0.35          ~0.69             ~0.68
cross_i16x8_bitmask_negvary      0.30917      0.03175       0.31183        0.69188           0.69385

Important observations:

  • A single loop-varying negative byte inserted by i8x16.replace_lane is already sufficient to trigger the slowdown.
  • The slowdown does not reproduce when the same negative pattern is produced by i8x16.splat.
  • The slowdown does not reproduce when the replaced lane is constant negative instead of loop-varying.
  • The slowdown does not reproduce on the matched i16x8.bitmask version.
  • Replacing bitmask with extract_lane_s keeps some work alive but does not reproduce the large WAMR outlier.

So the strongest observed trigger condition is:

i8x16.bitmask consuming a vector built by i8x16.replace_lane, where at least one replaced lane is loop-varying and its effective byte stays in 0x80..0xff.

Extra Info

I also exported WAMR low-level artifacts with wamrc --format=llvmir-unopt, --format=llvmir-opt, and --format=object.

For the slow i8x16 bitmask-shaped cases, optimized LLVM IR consistently contains a pattern like:

%new_vector = insertelement <16 x i8> ..., i8 %lane, i64 7
%isneg      = icmp slt <16 x i8> %new_vector, zeroinitializer
%mask_bits  = select <16 x i1> %isneg,
                     <16 x i64> <i64 1, i64 2, i64 4, ..., i64 32768>,
                     <16 x i64> zeroinitializer
%call       = tail call i64 @llvm.vector.reduce.or.v16i64(<16 x i64> %mask_bits)

At object level, the hot loop becomes a long synthesized sequence with operations such as:

  • vpinsrb
  • vpcmpgtb
  • vpmovsxbw
  • vpmovzxbq / vpmovzxwq
  • vextracti128
  • repeated vpand / vpor

I did not observe a compact pmovmskb / vpmovmskb-style extraction sequence in these WAMR-generated object files.

By contrast:

  • the constant-negative control (sweep_const_80) is optimized down so the expensive bitmask path no longer stays alive in the hot loop;
  • the matched i16x8.bitmask control does not preserve a comparable heavy object-level sequence.

So the report above is based on both runtime measurements and WAMR low-level evidence.
