stream_sender ENOMEM backoff oscillates and starves control packets under a weak uplink — follow-up to #1135 / #1159
Summary
The exponential-backoff fix from #1135 (merged via #1159, current main HEAD 7831f294) stops the permanent wedge / crash — frames do now flow (yield=41–45 pps). But on a node with a weak uplink to the AP (RSSI ≈ −78 dBm) the firmware lands in a steady oscillation of ENOMEM backoff, and — more importantly — the global backoff starves the low-rate control packets (feature_state, mesh HEALTH) that are not themselves the cause of the buffer pressure.
Net effect on a live ESP32-S3: a large fraction of CSI frames are dropped, and feature_state / HEALTH telemetry fails on essentially every cycle. This is distinct from #1135 (which wedged permanently at a strong −39 dBm uplink); the backoff machinery there is working as designed here, but two gaps remain.
Environment
|
|
| Board |
ESP32-S3 (8 MB), USB-serial on COM19 |
| Firmware |
main @ 7831f294 (post-#1159 — the exponential "streak" backoff to 2000 ms is the #1159 code path) |
| Uplink |
STA→AP RSSI −76 to −79 dBm (weak), 2.4 GHz ch 6 |
| CSI RX |
received-frame RSSI −42 to −63 dBm, len 128/256/384 B |
| Role |
leader (c6_espnow time-sync healthy: tx fail=0, 100 % match) |
Observed log (COM19, steady state)
I (61586) adaptive_ctrl: medium tick: state=6 yield=41pps motion=1.00 presence=5.29 rssi=-78
W (61586) adaptive_ctrl: feature_state emit failed
I (61636) stream_sender: ENOMEM backoff expired, resuming sends (38 were suppressed)
W (61636) stream_sender: sendto ENOMEM — backing off for 1600 ms (streak 5)
W (62656) stream_sender: sendto suppressed (ENOMEM backoff, 51 dropped)
I (63236) stream_sender: ENOMEM backoff expired, resuming sends (79 were suppressed)
W (63236) stream_sender: sendto ENOMEM — backing off for 2000 ms (streak 6)
W (64586) rv_mesh: rv_mesh_send: stream_sender failed (len=48)
I (64586) adaptive_ctrl: slow tick (...) HEALTH sent ← logged "sent" but rv_mesh_send actually failed
W (64596) adaptive_ctrl: feature_state emit failed
W (65236) stream_sender: sendto suppressed (ENOMEM backoff, 101 dropped)
feature_state emit failed appears once per medium tick (~1 Hz). The HEALTH sent log on the slow tick is also misleading — the preceding rv_mesh_send: stream_sender failed (len=48) shows the 48-byte HEALTH frame was actually dropped.
Analysis
Root cause — WiFi TX airtime saturation, not buffer size. sendto returns ENOMEM because lwIP cannot allocate a pbuf: the STA TX queue is not draining. The reason it won't drain is the uplink RSSI. rssi_median_dbm is sourced from esp_wifi_sta_get_ap_info() (rv_radio_ops_esp32.c:153), i.e. the link to the associated AP — at −78 dBm the PHY drops to a low MCS, so each CSI frame consumes a lot of airtime, while the node is simultaneously in promiscuous RX. Bigger pools don't help — the existing sdkconfig.defaults comment block (lines 45–54) already documents that WIFI_DYNAMIC_TX=128 + PBUF_POOL=32 produced an identical failure count, concluding "the bottleneck was radio TX time, not pool size."
Offered load that drives this: CSI sends at 50 Hz (csi_collector.c:72, CSI_MIN_SEND_INTERVAL_US = 20 ms), plus sync packets (every 20 frames), feature_state (1 Hz), and HEALTH — all sharing one UDP socket.
Secondary flaw — the backoff is global, so bulk CSI starves the tiny control packets. s_backoff_until_us (stream_sender.c:28, gate at :82) is process-wide. The 50 Hz CSI flood triggers the backoff, and the backoff then suppresses everything, including the ~40 B feature_state and 48 B HEALTH/mesh packets that are negligible airtime and would very likely succeed on their own. The low-rate telemetry is collateral damage of the high-rate bulk stream — exactly the feature_state emit failed / rv_mesh_send failed (len=48) we see every cycle.
Why #1135's fix doesn't cover this
#1159 added exponential backoff (anti-crash) and dropped feature_state to 1 Hz. That correctly fixes the permanent-wedge / crash failure mode at strong RSSI. It does not address (a) sustained ENOMEM oscillation when the uplink itself is the bottleneck, nor (b) the global backoff starving control packets — both of which only surface under a weak uplink and were not in the #1135 repro (−39 dBm).
Proposed fixes (prioritized)
- Give control packets a priority send path so CSI backoff can't starve them. Add
stream_sender_send_priority() (or a priority flag) that skips the suppression gate and, if it does hit ENOMEM, returns −1 quietly without extending the global streak. Route feature_state / HEALTH / sync through it. These are ≤48 B at ≤1 Hz — trivial pbuf pressure, won't re-trigger the crash cascade (which was 50 Hz × large CSI). Directly eliminates the feature_state emit failed / rv_mesh_send failed spam.
- Reduce / adapt the CSI offered load (the only lever that drains the queue). Quick:
CSI_MIN_SEND_INTERVAL_US 20 ms → 40–50 ms (50 → 20–25 Hz; pipeline floor is 20 Hz). Better: throttle CSI rate when s_enomem_streak is high or uplink RSSI is weak — cut load at the source instead of relying on a fixed cooldown.
- Check backoff before serializing (
csi_collector.c:280): csi_serialize_frame() currently runs a multi-hundred-byte memcpy and then stream_sender_send returns −1 during backoff. Gate on backoff first to stop burning CPU in callback context on frames that will be dropped.
- Operational note: the −78 dBm uplink is the physical root — improving RSSI (placement / antenna / 20 MHz / less-congested channel) shrinks the whole problem. Worth surfacing as a health warning when
ap.rssi is below a threshold.
- Fix the misleading
HEALTH sent log — it prints unconditionally even when rv_mesh_send returned ESP_FAIL.
Suggested first pass: #1 + #2-quick + #3 + #5 — a contained change across stream_sender.c, csi_collector.c, adaptive_controller.c, rv_mesh.c, no buffer-config changes (already proven not to help).
Validation
Re-confirmed on main @ 7831f294: firmware tree clean, CSI_MIN_SEND_INTERVAL_US = 20 ms, global s_backoff_until_us, rssi_median_dbm = ap.rssi (STA uplink) all present as cited.
stream_senderENOMEM backoff oscillates and starves control packets under a weak uplink — follow-up to #1135 / #1159Summary
The exponential-backoff fix from #1135 (merged via #1159, current
mainHEAD7831f294) stops the permanent wedge / crash — frames do now flow (yield=41–45 pps). But on a node with a weak uplink to the AP (RSSI ≈ −78 dBm) the firmware lands in a steady oscillation of ENOMEM backoff, and — more importantly — the global backoff starves the low-rate control packets (feature_state, meshHEALTH) that are not themselves the cause of the buffer pressure.Net effect on a live ESP32-S3: a large fraction of CSI frames are dropped, and
feature_state/HEALTHtelemetry fails on essentially every cycle. This is distinct from #1135 (which wedged permanently at a strong −39 dBm uplink); the backoff machinery there is working as designed here, but two gaps remain.Environment
main@7831f294(post-#1159 — the exponential "streak" backoff to 2000 ms is the #1159 code path)len128/256/384 Bc6_espnowtime-sync healthy:tx fail=0, 100 % match)Observed log (COM19, steady state)
feature_state emit failedappears once permedium tick(~1 Hz). TheHEALTH sentlog on the slow tick is also misleading — the precedingrv_mesh_send: stream_sender failed (len=48)shows the 48-byte HEALTH frame was actually dropped.Analysis
Root cause — WiFi TX airtime saturation, not buffer size.
sendtoreturnsENOMEMbecause lwIP cannot allocate a pbuf: the STA TX queue is not draining. The reason it won't drain is the uplink RSSI.rssi_median_dbmis sourced fromesp_wifi_sta_get_ap_info()(rv_radio_ops_esp32.c:153), i.e. the link to the associated AP — at −78 dBm the PHY drops to a low MCS, so each CSI frame consumes a lot of airtime, while the node is simultaneously in promiscuous RX. Bigger pools don't help — the existingsdkconfig.defaultscomment block (lines 45–54) already documents thatWIFI_DYNAMIC_TX=128 + PBUF_POOL=32produced an identical failure count, concluding "the bottleneck was radio TX time, not pool size."Offered load that drives this: CSI sends at 50 Hz (
csi_collector.c:72,CSI_MIN_SEND_INTERVAL_US = 20 ms), plus sync packets (every 20 frames),feature_state(1 Hz), andHEALTH— all sharing one UDP socket.Secondary flaw — the backoff is global, so bulk CSI starves the tiny control packets.
s_backoff_until_us(stream_sender.c:28, gate at:82) is process-wide. The 50 Hz CSI flood triggers the backoff, and the backoff then suppresses everything, including the ~40 Bfeature_stateand 48 BHEALTH/mesh packets that are negligible airtime and would very likely succeed on their own. The low-rate telemetry is collateral damage of the high-rate bulk stream — exactly thefeature_state emit failed/rv_mesh_send failed (len=48)we see every cycle.Why #1135's fix doesn't cover this
#1159 added exponential backoff (anti-crash) and dropped
feature_stateto 1 Hz. That correctly fixes the permanent-wedge / crash failure mode at strong RSSI. It does not address (a) sustained ENOMEM oscillation when the uplink itself is the bottleneck, nor (b) the global backoff starving control packets — both of which only surface under a weak uplink and were not in the #1135 repro (−39 dBm).Proposed fixes (prioritized)
stream_sender_send_priority()(or apriorityflag) that skips the suppression gate and, if it does hit ENOMEM, returns −1 quietly without extending the global streak. Routefeature_state/HEALTH/ sync through it. These are ≤48 B at ≤1 Hz — trivial pbuf pressure, won't re-trigger the crash cascade (which was 50 Hz × large CSI). Directly eliminates thefeature_state emit failed/rv_mesh_send failedspam.CSI_MIN_SEND_INTERVAL_US20 ms → 40–50 ms (50 → 20–25 Hz; pipeline floor is 20 Hz). Better: throttle CSI rate whens_enomem_streakis high or uplink RSSI is weak — cut load at the source instead of relying on a fixed cooldown.csi_collector.c:280):csi_serialize_frame()currently runs a multi-hundred-byte memcpy and thenstream_sender_sendreturns −1 during backoff. Gate on backoff first to stop burning CPU in callback context on frames that will be dropped.ap.rssiis below a threshold.HEALTH sentlog — it prints unconditionally even whenrv_mesh_sendreturnedESP_FAIL.Suggested first pass: #1 + #2-quick + #3 + #5 — a contained change across
stream_sender.c,csi_collector.c,adaptive_controller.c,rv_mesh.c, no buffer-config changes (already proven not to help).Validation
Re-confirmed on
main@7831f294: firmware tree clean,CSI_MIN_SEND_INTERVAL_US = 20 ms, globals_backoff_until_us,rssi_median_dbm = ap.rssi(STA uplink) all present as cited.