stream_sender: ENOMEM backoff oscillates + starves control packets under weak uplink RSSI (follow-up to #1135/#1159)

# `stream_sender` ENOMEM backoff oscillates and starves control packets under a weak uplink — follow-up to #1135 / #1159

## Summary

The exponential-backoff fix from #1135 (merged via #1159, current `main` HEAD `7831f294`) **stops the permanent wedge / crash** — frames do now flow (`yield=41–45 pps`). But on a node with a **weak uplink to the AP (RSSI ≈ −78 dBm)** the firmware lands in a *steady oscillation* of ENOMEM backoff, and — more importantly — the global backoff **starves the low-rate control packets** (`feature_state`, mesh `HEALTH`) that are not themselves the cause of the buffer pressure.

Net effect on a live ESP32-S3: a large fraction of CSI frames are dropped, and `feature_state` / `HEALTH` telemetry fails on essentially **every cycle**. This is distinct from #1135 (which wedged *permanently* at a *strong* −39 dBm uplink); the backoff machinery there is working as designed here, but two gaps remain.

## Environment

| | |
|---|---|
| Board | ESP32-S3 (8 MB), USB-serial on COM19 |
| Firmware | `main` @ `7831f294` (post-#1159 — the exponential "streak" backoff to 2000 ms is the #1159 code path) |
| Uplink | STA→AP RSSI **−76 to −79 dBm** (weak), 2.4 GHz ch 6 |
| CSI RX | received-frame RSSI −42 to −63 dBm, `len` 128/256/384 B |
| Role | leader (`c6_espnow` time-sync healthy: `tx fail=0`, 100 % match) |

## Observed log (COM19, steady state)

```
I (61586) adaptive_ctrl: medium tick: state=6 yield=41pps motion=1.00 presence=5.29 rssi=-78
W (61586) adaptive_ctrl: feature_state emit failed
I (61636) stream_sender: ENOMEM backoff expired, resuming sends (38 were suppressed)
W (61636) stream_sender: sendto ENOMEM — backing off for 1600 ms (streak 5)
W (62656) stream_sender: sendto suppressed (ENOMEM backoff, 51 dropped)
I (63236) stream_sender: ENOMEM backoff expired, resuming sends (79 were suppressed)
W (63236) stream_sender: sendto ENOMEM — backing off for 2000 ms (streak 6)
W (64586) rv_mesh: rv_mesh_send: stream_sender failed (len=48)
I (64586) adaptive_ctrl: slow tick (...) HEALTH sent          ← logged "sent" but rv_mesh_send actually failed
W (64596) adaptive_ctrl: feature_state emit failed
W (65236) stream_sender: sendto suppressed (ENOMEM backoff, 101 dropped)
```

`feature_state emit failed` appears once per `medium tick` (~1 Hz). The `HEALTH sent` log on the slow tick is also misleading — the preceding `rv_mesh_send: stream_sender failed (len=48)` shows the 48-byte HEALTH frame was actually dropped.

## Analysis

**Root cause — WiFi TX airtime saturation, not buffer size.** `sendto` returns `ENOMEM` because lwIP cannot allocate a pbuf: the STA TX queue is not draining. The reason it won't drain is the uplink RSSI. `rssi_median_dbm` is sourced from `esp_wifi_sta_get_ap_info()` (`rv_radio_ops_esp32.c:153`), i.e. the **link to the associated AP** — at −78 dBm the PHY drops to a low MCS, so each CSI frame consumes a lot of airtime, while the node is simultaneously in promiscuous RX. Bigger pools don't help — the existing `sdkconfig.defaults` comment block (lines 45–54) already documents that `WIFI_DYNAMIC_TX=128 + PBUF_POOL=32` produced an *identical* failure count, concluding "the bottleneck was radio TX time, not pool size."

Offered load that drives this: CSI sends at **50 Hz** (`csi_collector.c:72`, `CSI_MIN_SEND_INTERVAL_US = 20 ms`), plus sync packets (every 20 frames), `feature_state` (1 Hz), and `HEALTH` — **all sharing one UDP socket**.

**Secondary flaw — the backoff is global, so bulk CSI starves the tiny control packets.** `s_backoff_until_us` (`stream_sender.c:28`, gate at `:82`) is process-wide. The 50 Hz CSI flood *triggers* the backoff, and the backoff then suppresses **everything**, including the ~40 B `feature_state` and 48 B `HEALTH`/mesh packets that are negligible airtime and would very likely succeed on their own. The low-rate telemetry is collateral damage of the high-rate bulk stream — exactly the `feature_state emit failed` / `rv_mesh_send failed (len=48)` we see every cycle.

### Why #1135's fix doesn't cover this
#1159 added exponential backoff (anti-crash) and dropped `feature_state` to 1 Hz. That correctly fixes the *permanent-wedge / crash* failure mode at strong RSSI. It does **not** address (a) sustained ENOMEM oscillation when the uplink itself is the bottleneck, nor (b) the global backoff starving control packets — both of which only surface under a weak uplink and were not in the #1135 repro (−39 dBm).

## Proposed fixes (prioritized)

1. **Give control packets a priority send path** so CSI backoff can't starve them. Add `stream_sender_send_priority()` (or a `priority` flag) that skips the suppression gate and, if it *does* hit ENOMEM, returns −1 quietly **without** extending the global streak. Route `feature_state` / `HEALTH` / sync through it. These are ≤48 B at ≤1 Hz — trivial pbuf pressure, won't re-trigger the crash cascade (which was 50 Hz × large CSI). Directly eliminates the `feature_state emit failed` / `rv_mesh_send failed` spam.
2. **Reduce / adapt the CSI offered load** (the only lever that drains the queue). Quick: `CSI_MIN_SEND_INTERVAL_US` 20 ms → 40–50 ms (50 → 20–25 Hz; pipeline floor is 20 Hz). Better: throttle CSI rate when `s_enomem_streak` is high or uplink RSSI is weak — cut load at the source instead of relying on a fixed cooldown.
3. **Check backoff before serializing** (`csi_collector.c:280`): `csi_serialize_frame()` currently runs a multi-hundred-byte memcpy and *then* `stream_sender_send` returns −1 during backoff. Gate on backoff first to stop burning CPU in callback context on frames that will be dropped.
4. **Operational note:** the −78 dBm uplink is the physical root — improving RSSI (placement / antenna / 20 MHz / less-congested channel) shrinks the whole problem. Worth surfacing as a health warning when `ap.rssi` is below a threshold.
5. **Fix the misleading `HEALTH sent` log** — it prints unconditionally even when `rv_mesh_send` returned `ESP_FAIL`.

Suggested first pass: #1 + #2-quick + #3 + #5 — a contained change across `stream_sender.c`, `csi_collector.c`, `adaptive_controller.c`, `rv_mesh.c`, no buffer-config changes (already proven not to help).

## Validation
Re-confirmed on `main` @ `7831f294`: firmware tree clean, `CSI_MIN_SEND_INTERVAL_US = 20 ms`, global `s_backoff_until_us`, `rssi_median_dbm = ap.rssi` (STA uplink) all present as cited.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

stream_sender: ENOMEM backoff oscillates + starves control packets under weak uplink RSSI (follow-up to #1135/#1159) #1183

`stream_sender` ENOMEM backoff oscillates and starves control packets under a weak uplink — follow-up to #1135 / #1159

Summary

Environment

Observed log (COM19, steady state)

Analysis

Why #1135's fix doesn't cover this

Proposed fixes (prioritized)

Validation

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development


Board	ESP32-S3 (8 MB), USB-serial on COM19
Firmware	`main` @ `7831f294` (post-#1159 — the exponential "streak" backoff to 2000 ms is the #1159 code path)
Uplink	STA→AP RSSI −76 to −79 dBm (weak), 2.4 GHz ch 6
CSI RX	received-frame RSSI −42 to −63 dBm, `len` 128/256/384 B
Role	leader (`c6_espnow` time-sync healthy: `tx fail=0`, 100 % match)

stream_sender: ENOMEM backoff oscillates + starves control packets under weak uplink RSSI (follow-up to #1135/#1159) #1183

Description

stream_sender ENOMEM backoff oscillates and starves control packets under a weak uplink — follow-up to #1135 / #1159

Summary

Environment

Observed log (COM19, steady state)

Analysis

Why #1135's fix doesn't cover this

Proposed fixes (prioritized)

Validation

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

`stream_sender` ENOMEM backoff oscillates and starves control packets under a weak uplink — follow-up to #1135 / #1159