Skip to content

stream_sender: ENOMEM backoff oscillates + starves control packets under weak uplink RSSI (follow-up to #1135/#1159) #1183

Description

@Tovli

stream_sender ENOMEM backoff oscillates and starves control packets under a weak uplink — follow-up to #1135 / #1159

Summary

The exponential-backoff fix from #1135 (merged via #1159, current main HEAD 7831f294) stops the permanent wedge / crash — frames do now flow (yield=41–45 pps). But on a node with a weak uplink to the AP (RSSI ≈ −78 dBm) the firmware lands in a steady oscillation of ENOMEM backoff, and — more importantly — the global backoff starves the low-rate control packets (feature_state, mesh HEALTH) that are not themselves the cause of the buffer pressure.

Net effect on a live ESP32-S3: a large fraction of CSI frames are dropped, and feature_state / HEALTH telemetry fails on essentially every cycle. This is distinct from #1135 (which wedged permanently at a strong −39 dBm uplink); the backoff machinery there is working as designed here, but two gaps remain.

Environment

Board ESP32-S3 (8 MB), USB-serial on COM19
Firmware main @ 7831f294 (post-#1159 — the exponential "streak" backoff to 2000 ms is the #1159 code path)
Uplink STA→AP RSSI −76 to −79 dBm (weak), 2.4 GHz ch 6
CSI RX received-frame RSSI −42 to −63 dBm, len 128/256/384 B
Role leader (c6_espnow time-sync healthy: tx fail=0, 100 % match)

Observed log (COM19, steady state)

I (61586) adaptive_ctrl: medium tick: state=6 yield=41pps motion=1.00 presence=5.29 rssi=-78
W (61586) adaptive_ctrl: feature_state emit failed
I (61636) stream_sender: ENOMEM backoff expired, resuming sends (38 were suppressed)
W (61636) stream_sender: sendto ENOMEM — backing off for 1600 ms (streak 5)
W (62656) stream_sender: sendto suppressed (ENOMEM backoff, 51 dropped)
I (63236) stream_sender: ENOMEM backoff expired, resuming sends (79 were suppressed)
W (63236) stream_sender: sendto ENOMEM — backing off for 2000 ms (streak 6)
W (64586) rv_mesh: rv_mesh_send: stream_sender failed (len=48)
I (64586) adaptive_ctrl: slow tick (...) HEALTH sent          ← logged "sent" but rv_mesh_send actually failed
W (64596) adaptive_ctrl: feature_state emit failed
W (65236) stream_sender: sendto suppressed (ENOMEM backoff, 101 dropped)

feature_state emit failed appears once per medium tick (~1 Hz). The HEALTH sent log on the slow tick is also misleading — the preceding rv_mesh_send: stream_sender failed (len=48) shows the 48-byte HEALTH frame was actually dropped.

Analysis

Root cause — WiFi TX airtime saturation, not buffer size. sendto returns ENOMEM because lwIP cannot allocate a pbuf: the STA TX queue is not draining. The reason it won't drain is the uplink RSSI. rssi_median_dbm is sourced from esp_wifi_sta_get_ap_info() (rv_radio_ops_esp32.c:153), i.e. the link to the associated AP — at −78 dBm the PHY drops to a low MCS, so each CSI frame consumes a lot of airtime, while the node is simultaneously in promiscuous RX. Bigger pools don't help — the existing sdkconfig.defaults comment block (lines 45–54) already documents that WIFI_DYNAMIC_TX=128 + PBUF_POOL=32 produced an identical failure count, concluding "the bottleneck was radio TX time, not pool size."

Offered load that drives this: CSI sends at 50 Hz (csi_collector.c:72, CSI_MIN_SEND_INTERVAL_US = 20 ms), plus sync packets (every 20 frames), feature_state (1 Hz), and HEALTHall sharing one UDP socket.

Secondary flaw — the backoff is global, so bulk CSI starves the tiny control packets. s_backoff_until_us (stream_sender.c:28, gate at :82) is process-wide. The 50 Hz CSI flood triggers the backoff, and the backoff then suppresses everything, including the ~40 B feature_state and 48 B HEALTH/mesh packets that are negligible airtime and would very likely succeed on their own. The low-rate telemetry is collateral damage of the high-rate bulk stream — exactly the feature_state emit failed / rv_mesh_send failed (len=48) we see every cycle.

Why #1135's fix doesn't cover this

#1159 added exponential backoff (anti-crash) and dropped feature_state to 1 Hz. That correctly fixes the permanent-wedge / crash failure mode at strong RSSI. It does not address (a) sustained ENOMEM oscillation when the uplink itself is the bottleneck, nor (b) the global backoff starving control packets — both of which only surface under a weak uplink and were not in the #1135 repro (−39 dBm).

Proposed fixes (prioritized)

  1. Give control packets a priority send path so CSI backoff can't starve them. Add stream_sender_send_priority() (or a priority flag) that skips the suppression gate and, if it does hit ENOMEM, returns −1 quietly without extending the global streak. Route feature_state / HEALTH / sync through it. These are ≤48 B at ≤1 Hz — trivial pbuf pressure, won't re-trigger the crash cascade (which was 50 Hz × large CSI). Directly eliminates the feature_state emit failed / rv_mesh_send failed spam.
  2. Reduce / adapt the CSI offered load (the only lever that drains the queue). Quick: CSI_MIN_SEND_INTERVAL_US 20 ms → 40–50 ms (50 → 20–25 Hz; pipeline floor is 20 Hz). Better: throttle CSI rate when s_enomem_streak is high or uplink RSSI is weak — cut load at the source instead of relying on a fixed cooldown.
  3. Check backoff before serializing (csi_collector.c:280): csi_serialize_frame() currently runs a multi-hundred-byte memcpy and then stream_sender_send returns −1 during backoff. Gate on backoff first to stop burning CPU in callback context on frames that will be dropped.
  4. Operational note: the −78 dBm uplink is the physical root — improving RSSI (placement / antenna / 20 MHz / less-congested channel) shrinks the whole problem. Worth surfacing as a health warning when ap.rssi is below a threshold.
  5. Fix the misleading HEALTH sent log — it prints unconditionally even when rv_mesh_send returned ESP_FAIL.

Suggested first pass: #1 + #2-quick + #3 + #5 — a contained change across stream_sender.c, csi_collector.c, adaptive_controller.c, rv_mesh.c, no buffer-config changes (already proven not to help).

Validation

Re-confirmed on main @ 7831f294: firmware tree clean, CSI_MIN_SEND_INTERVAL_US = 20 ms, global s_backoff_until_us, rssi_median_dbm = ap.rssi (STA uplink) all present as cited.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingfirmwareESP32 firmware

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions