Lilygo RTL433 runs fine, then reboots every couple of minutes then hangs... #2092

puterboy · 2024-10-21T16:30:10Z

I am running 2 different Lilygo rtl433 ESP32 devices.
They had been running fine for the last couple of weeks.
Then recently, the one downstairs started showing the following behavior:

Run fine for 1 or more days
Start cycle of rebooting every 1-10 minutes (for about 10-20 times)
Freeze/hang (requiring physical reboot)

The upstairs device seems completely stable.
Neither device ever seems to drop below about 2K of free rtl_433 stack space, and neither runs out of memory.

It's unlikely to be a power issue as I have a LiIon battery pack backup on the Lilygo device.
More generally, I don't think it's the hardware or firmware per se, as swapping the devices caused the swapped device to start exhibiting the above behavior.

Any ideas on what could be causing this?

Why does instability arise only after a couple of days?
Why does it suddenly start repeatedly rebooting every couple of minutes?
Why does it suddenly freeze and not even reboot?
Why does it affect only the device in one location?

The only thing I can think of is that a new device or device reading comes online that is only received at the downstairs location -- and somehow that triggers a bug or even a stack overflow (before it gets reported to the MQTT broker)...

@1technophile @NorthernMan54 have you seen anything like this?

NorthernMan54 · 2024-10-21T19:12:31Z

Need to see the serial output

puterboy · 2024-10-27T22:55:24Z

OK...

N: [ OMG->MQTT ] topic: home-crap/OMG_lilygo_rtl_433_ESP-2/RTL_433toMQTT/TFA-303221/1/219 msg: {"model":"TFA-303221","id":219,"channel":1,"battery_ok":1,"temperature_C":16.4,"humidity":65,"sendmode":0,"mic":"CRC","protocol":"TFA Dostmann 30.3221.02 T/H Outdoor Sensor","rssi":-94,"duration":206998}
T: isAdupl?
T: Enqueue JSON
T: Queue length: 1
T: Min ind: 6
T: store code : 21 / 381081797
T: Col: val/timestamp
T: mem code : 194 / 381053903
T: mem code : 191 / 381056961
T: mem code : 103 / 381057973
T: mem code : 30 / 381076050
T: mem code : 218 / 381080207
T: mem code : 235 / 381081571
T: mem code : 21 / 381081797
T: mem code : 8 / 381036885
T: mem code : 83 / 381036930
T: mem code : 196 / 381045816
T: mem code : 30 / 381049598
T: mem code : 183 / 381052358
T: isAdupl?
T: no pub. dupl
T: Dequeue JSON
CORRUPT HEAP: Bad head at 0x3ffd9cf8. Expected 0xabba1234 got 0x00000020

assert failed: multi_heap_free multi_heap_poisoning.c:253 (head != NULL)

Backtrace: 0x40083d4d:0x3ffb2430 0x4008de3d:0x3ffb2450 0x40093d71:0x3ffb2470 0x400939c7:0x3ffb25a0 0x40084211:0x3ffb25c0 0x40093da1:0x3ffb25e0 0x40136115:0x3ffb2600 0x40136125:0x3ffb2620 0x400dbd49:0x3ffb2640 0x400dc4bf:0x3ffb26d0 0x400dc63e:0x3ffb26f0 0x400eac39:0x3ffb27d0 0x40139445:0x3ffb2810

Then a few minutes later: (note log level reverted to 'Notice')

N: [ OMG->MQTT ] topic: homeassistant-crap/sensor/LaCrosse-TX141THBv2-0-216-temperature_C/config msg: {"stat_t":"+/+/RTL_433toMQTT/LaCrosse-TX141THBv2/0/216","dev_cla":"temperature","unit_of_meas":"°C","name":"Temperature","uniq_id":"LaCrosse-TX141THBv2-0-216-temperature_C","val_tpl":"{{ value_json.temperature_C | is_defined }}","stat_cla":"measurement","device":{"ids":["LaCrosse-TX141THBv2-0-216"],"cns":[["mac","LaCrosse-TX141THBv2-0-216"]],"mdl":"LaCrosse-TX141THBv2","name":"LaCrosse-TX141THBv2-0-216","via_device":"OMG_lilygo_rtl_433_ESP-2"}}
CORRUPT HEAP: Bad head at 0x3ffba1f8. Expected 0xabba1234 got 0x0000001c

assert failed: multi_heap_free multi_heap_poisoning.c:253 (head != NULL)


Backtrace: 0x40083d4d:0x3ffb2430 0x4008de3d:0x3ffb2450 0x40093d71:0x3ffb2470 0x400939c7:0x3ffb25a0 0x40084211:0x3ffb25c0 0x40093da1:0x3ffb25e0 0x40136115:0x3ffb2600 0x40136125:0x3ffb2620 0x400dbd49:0x3ffb2640 0x400dc4bf:0x3ffb26d0 0x400dc63e:0x3ffb26f0 0x400eac39:0x3ffb27d0 0x40139445:0x3ffb2810

Then again:

N: [ OMG->MQTT ] topic: homeassistant-crap/sensor/LaCrosse-TX141THBv2-0-216-temperature_C/config msg: {"stat_t":"+/+/RTL_433toMQTT/LaCrosse-TX141THBv2/0/216","dev_cla":"temperature","unit_of_meas":"°C","name":"Temperature","uniq_id":"LaCrosse-TX141THBv2-0-216-temperature_C","val_tpl":"{{ value_json.temperature_C | is_defined }}","stat_cla":"measurement","device":{"ids":["LaCrosse-TX141THBv2-0-216"],"cns":[["mac","LaCrosse-TX141THBv2-0-216"]],"mdl":"LaCrosse-TX141THBv2","name":"LaCrosse-TX141THBv2-0-216","via_device":"OMG_lilygo_rtl_433_ESP-2"}}
CORRUPT HEAP: Bad head at 0x3ffba1f8. Expected 0xabba1234 got 0x00000024

assert failed: multi_heap_free multi_heap_poisoning.c:253 (head != NULL)


Backtrace: 0x40083d4d:0x3ffb2430 0x4008de3d:0x3ffb2450 0x40093d71:0x3ffb2470 0x400939c7:0x3ffb25a0 0x40084211:0x3ffb25c0 0x40093da1:0x3ffb25e0 0x40136115:0x3ffb2600 0x40136125:0x3ffb2620 0x400dbd49:0x3ffb2640 0x400dc4bf:0x3ffb26d0 0x400dc63e:0x3ffb26f0 0x400eac39:0x3ffb27d0 0x40139445:0x3ffb2810

etc.

So somehow there is a heap corruption...
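For context on what the "Bad head ... Expected 0xabba1234" line means: with heap poisoning enabled, the allocator places a known word in front of each allocation and verifies it when the block is freed. Below is a simplified toy model of that check that runs on a desktop compiler. It is not the ESP-IDF implementation; the `ToyHeap` type, its arena layout, and its method names are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Toy model of head-canary heap poisoning (NOT the real ESP-IDF code).
// Two blocks sit back to back in one arena, so overflowing the first one
// lands on the header of the second without any undefined behaviour.
constexpr uint32_t kHeadCanary = 0xABBA1234u;  // value shown in the crash log

struct ToyHeap {
  static constexpr size_t kBlock = 16;  // payload bytes per block
  static constexpr size_t kStride = sizeof(uint32_t) + kBlock;
  uint8_t arena[2 * kStride] = {};

  // "Allocate" block 0 or 1: write the canary, return the payload offset.
  size_t alloc(size_t index) {
    assert(index < 2);
    const size_t header = index * kStride;
    std::memcpy(&arena[header], &kHeadCanary, sizeof(kHeadCanary));
    return header + sizeof(uint32_t);
  }

  // What a poisoning-aware free() does first: verify the word that
  // precedes the payload.
  bool checkHead(size_t payloadOffset) const {
    uint32_t head = 0;
    std::memcpy(&head, &arena[payloadOffset - sizeof(uint32_t)], sizeof(head));
    return head == kHeadCanary;
  }

  // Simulates application code writing len bytes into a payload. Only the
  // arena bounds are enforced, so a write can run past the block's end.
  void write(size_t payloadOffset, uint8_t value, size_t len) {
    assert(payloadOffset + len <= sizeof(arena));
    std::memset(&arena[payloadOffset], value, len);
  }
};
```

In this model, writing `kBlock + 1` bytes into the first block leaves its own header intact but breaks the header of the second block. The check only runs when the damaged block is freed, so the backtrace printed at the assert tends to identify the victim of the corruption rather than the code that did the bad write.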

Note that the last reported stack low water mark was 936 bytes before the first crash and then about 3600 bytes before the subsequent crashes.

So, if it is exhausting stack, it must be doing that while trying to decode the current message so it doesn't get reported via the RFtoMQTT routine.

Again, it can run fine for hours or even days (in this case it ran fine for 4 days (!) before I started seeing the heap corruption). But once it occurs, it seems to recur, rebooting typically every few minutes for 5-10 or more times before resuming stability. In this case it rebooted about 6 times over the course of about 2.5 hours.

Any ideas what may be causing the heap corruption and/or how to troubleshoot?

The only thing I can think of is that there is some sensor that reports intermittently (maybe it's only turned on every once in a while) and which consumes excessive stack causing the heap corruption.
Interestingly, though I can't say for sure, the crashes seem to occur typically between 6 and 10 AM. I don't think it's one of my known sensors, because all of mine seem to be working fine (and have been for quite a while).
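One possible way to narrow down where the heap gets damaged is to verify heap integrity at a few labelled points and note the last point at which it was still healthy. The sketch below assumes an ESP-IDF based build, where `heap_caps_check_integrity_all()` from `esp_heap_caps.h` is available, and assumes `ESP_PLATFORM` is defined for such builds. The `HeapCheckpoints` class and the checkpoint labels are hypothetical, not part of OpenMQTTGateway.

```cpp
#include <cassert>
#include <cstdio>
#include <functional>
#include <string>
#include <utility>

#ifdef ESP_PLATFORM
#include "esp_heap_caps.h"
// On-target: ask ESP-IDF to walk all heaps and print any errors it finds.
static bool defaultHeapCheck() { return heap_caps_check_integrity_all(true); }
#else
// Host stand-in so the sketch compiles off-target; always reports healthy.
static bool defaultHeapCheck() { return true; }
#endif

// Remembers the last checkpoint with a healthy heap and the first one
// where the integrity check failed.
class HeapCheckpoints {
 public:
  explicit HeapCheckpoints(std::function<bool()> check = defaultHeapCheck)
      : check_(std::move(check)) {
    assert(check_);
  }

  // Call at points of interest; returns false once the heap is corrupt.
  bool at(const std::string& label) {
    if (check_()) {
      lastGood_ = label;
      return true;
    }
    if (firstBad_.empty()) {
      firstBad_ = label;
      std::printf("heap corrupt at '%s' (last good: '%s')\n",
                  label.c_str(), lastGood_.c_str());
    }
    return false;
  }

  const std::string& lastGood() const { return lastGood_; }
  const std::string& firstBad() const { return firstBad_; }

 private:
  std::function<bool()> check_;
  std::string lastGood_;
  std::string firstBad_;
};
```

Calls such as `checkpoints.at("after decode")` and `checkpoints.at("after publish")` would then bracket the suspect code. Walking every heap takes time, so this is probably only suitable for a debug build.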

1technophile · 2024-10-27T23:13:27Z

Maybe try increasing the stack size again. If the problem persists, we should check whether there is a way to limit the decoders' stack consumption and avoid such crashes.

puterboy · 2024-10-27T23:16:42Z

That seems a bit "brute force", though.
Is there any way to log which decoder is being called and how much stack it uses?
That way I could see what is causing the problem...

puterboy · 2024-10-27T23:18:17Z

I like your idea of limiting decoder consumption and perhaps logging any time a decoder tries to use more than that amount so that we can know the name of the decoder and how much stack it sought to consume...
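A rough sketch of what such per-decoder logging could look like, under stated assumptions: FreeRTOS exposes `uxTaskGetStackHighWaterMark()`, which reports the minimum free stack a task has ever had, so comparing that value before and after a decoder runs shows whether that decoder set a new low. The `StackWatch` class is hypothetical, and where it would hook into the rtl_433_ESP decoder dispatch is not something this thread establishes. The probe is injected so the sketch also compiles off-target.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Attributes new stack low-water marks to the decoder that was running.
// On ESP32 the probe would wrap uxTaskGetStackHighWaterMark(nullptr).
class StackWatch {
 public:
  using Probe = std::function<uint32_t()>;

  StackWatch(Probe probe, uint32_t warnBelow)
      : probe_(std::move(probe)), warnBelow_(warnBelow) {
    assert(probe_);
  }

  // Run one decoder and note whether it pushed the low-water mark down.
  template <typename Fn>
  void run(const std::string& name, Fn&& decoder) {
    const uint32_t before = probe_();
    decoder();
    const uint32_t after = probe_();
    if (after >= before) return;  // no new record during this call
    records_[name] = after;       // lowest free stack seen with this decoder
    std::printf("decoder %s: min free stack %u -> %u\n", name.c_str(),
                static_cast<unsigned>(before), static_cast<unsigned>(after));
    if (after < warnBelow_) {
      ++warnings_;
      std::printf("WARNING: decoder %s left only %u free\n", name.c_str(),
                  static_cast<unsigned>(after));
    }
  }

  const std::map<std::string, uint32_t>& records() const { return records_; }
  unsigned warnings() const { return warnings_; }

 private:
  Probe probe_;
  uint32_t warnBelow_;
  std::map<std::string, uint32_t> records_;
  unsigned warnings_ = 0;
};
```

On the device, the probe could be `[] { return static_cast<uint32_t>(uxTaskGetStackHighWaterMark(nullptr)); }`, called from the decoder task itself. One limitation: a decoder that overruns the stack completely will crash before the "after" reading is logged, so printing the decoder name before the call may also be needed to catch that case.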

NorthernMan54 · 2024-10-27T23:18:46Z

Add monitor_filters = esp32_exception_decoder to your configuration in platformio.ini

It should show exactly where the error occurred
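As a config fragment, this might look like the following. The environment name is a placeholder for whichever environment is actually being built, and `build_type = debug` is an addition not mentioned above: PlatformIO recommends it alongside this filter so that the firmware carries the symbols needed to resolve addresses.

```ini
; Placeholder section name: add the two keys to your existing environment.
[env:your-environment]
build_type = debug
monitor_filters = esp32_exception_decoder
```

The filter resolves addresses against the firmware image from the most recent build of the project, so the monitor should be started from the same project and build that was flashed to the board.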

puterboy · 2024-10-27T23:23:23Z

> Add monitor_filters = esp32_exception_decoder to your configuration in platformio.ini
>
> It should show exactly where the error occurred

I did this btw last time I had crashes and it pointed to increasing the stack size but happy to enable it again to see if I get the same error.

puterboy · 2024-10-28T17:18:13Z

I received another half dozen reboots this morning with monitor_filters = esp32_exception_decoder but the serial log error messages seem to be the same as before.

N: [ OMG->MQTT ] topic: home-crap/OMG_lilygo_rtl_433_ESP-2/RTL_433toMQTT/Ambientweather-F007TH/2/178 msg: {"model":"Ambientweather-F007TH","id":178,"channel":2,"battery_ok":1,"temperature_C":16.61111,"humidity":41,"mic":"CRC","protocol":"Ambient Weather F007TH, TFA 30.3208.02, SwitchDocLabs F016TH temperature sensor","rssi":-53,"duration":189001}
CORRUPT HEAP: Bad head at 0x3ffba2f8. Expected 0xabba1234 got 0xffffffff

assert failed: multi_heap_free multi_heap_poisoning.c:253 (head != NULL)


Backtrace: 0x40083d4d:0x3ffb2430 0x4008de3d:0x3ffb2450 0x40093d71:0x3ffb2470 0x400939c7:0x3ffb25a0 0x40084211:0x3ffb25c0 0x40093da1:0x3ffb25e0 0x40136115:0x3ffb2600 0x40136125:0x3ffb2620 0x400dbd49:0x3ffb2640 0x400dc4bf:0x3ffb26d0 0x400dc63e:0x3ffb26f0 0x400eac39:0x3ffb27d0 0x40139445:0x3ffb2810




ELF file SHA256: cc2ead56a1306f41

AND

CORRUPT HEAP: Bad head at 0x3ffba1f8. Expected 0xabba1234 got 0x00000024

assert failed: multi_heap_free multi_heap_poisoning.c:253 (head != NULL)


Backtrace: 0x40083d4d:0x3ffb2430 0x4008de3d:0x3ffb2450 0x40093d71:0x3ffb2470 0x400939c7:0x3ffb25a0 0x40084211:0x3ffb25c0 0x40093da1:0x3ffb25e0 0x40136115:0x3ffb2600 0x40136125:0x3ffb2620 0x400dbd49:0x3ffb2640 0x400dc4bf:0x3ffb26d0 0x400dc63e:0x3ffb26f0 0x400eac39:0x3ffb27d0 0x40139445:0x3ffb2810




ELF file SHA256: cc2ead56a1306f41

etc.

So maybe it's not a decoder issue???? Or at least not like the ones causing #2043 where I got an error saying Debug exception reason: Stack canary watchpoint triggered (rtl_433_Decoder) which was fixed by increasing stack size.

Any ideas on how to debug this further?
(of course I could just try increasing stack size further to see if that helps but I would like to get to the root of the issue)

NorthernMan54 · 2024-10-29T02:34:14Z

Oh, if you run a build and upload it to the board, then run the monitor, it will convert the backtrace addresses to actual lines of code, so we can pinpoint the issue.

puterboy · 2024-10-29T02:46:55Z

I did rebuild and re-upload (using platformio) with monitor_filters = esp32_exception_decoder

NorthernMan54 · 2024-10-29T02:56:50Z

That's weird, as it should have given a longer backtrace. If you build a different project afterwards, it does break the feature.

puterboy · 2024-10-29T03:05:55Z

I will try to rebuild again...
