Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zigbee ota updates #192

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open

Zigbee ota updates #192

wants to merge 10 commits into from

Conversation

oleo65
Copy link
Contributor

@oleo65 oleo65 commented Apr 28, 2024

This is a first attempt at activating the zigbee ota functionality. All changes are toogled via Kconfig parameter.

Vastly inspired by the zigbee light switch sample from the nrf-toolkit.

This works so far in adding an additional OTA cluster to the device and publishing the flashed firmware version.

I was not able to get a firmware update on the device, but this is most probably due to the quite complicated process of providing the updates to the zigbee controller.

This addresses #128

All relevant sections are behind a build toggle which can be activated in Kconfig.
Parameters are based on the zigbee light switch example from nrf-connect toolkit.
clang-format with vs code.
@rbaron
Copy link
Owner

rbaron commented Apr 29, 2024

Thanks @oleo65, this looks great. A couple of minor things:

  1. How essential is it having an extra proj.conf file? I'm asking because every split in config creates extra burden in maintenance, future improvements and specially manual testing. It adds up if we multiply by the number of variants and coordinators. I would vote for keeping it in a single config. We can set it to CONFIG_ZIGBEE_FOTA=n until we're confident that it works as intended. I assume all of our variants are beefy enough, and hopefully if will become the default soon
  2. Could you add a few words to the sample docs?
  3. It would be better to run at least one manual end-to-end test before merging this. I will try some on my own as soon as I can, but please try and also do an independent one

code/nrf-connect/samples/zigbee/src/main.c Outdated Show resolved Hide resolved
ZB_AF_SET_IDENTIFY_NOTIFICATION_HANDLER(PRST_ZIGBEE_ENDPOINT, identify_cb);

#ifdef CONFIG_ZIGBEE_FOTA
ZB_AF_SET_IDENTIFY_NOTIFICATION_HANDLER(CONFIG_ZIGBEE_FOTA_ENDPOINT, identify_cb);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, what's the expected behavior here? We have two identify handlers in different endpoints. Where is the FOTA one used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CONFIG_ZIGBEE_FOTA_ENDPOINT comes from the library which is activated by the CONFIG_ZIGBEE_FOTA toggle. Basically an additional zigbee application (?) is registered on the device if OTA functionality is active. This is what I interpreted from the docs and the code.

The lines from above I simply took from the nordic sample. If they are really needed I cannot tell, but I wanted to rule out all possible errors during development by missing something out.

I would suggest, we leave it like it is now, and once we can confirm that OTA is working we could try if it is really needed. 👼

image

@oleo65
Copy link
Contributor Author

oleo65 commented May 1, 2024

I am perfectly happy with only one prj.conf file. The settings handling is already a hassle in my opinion, since I usually customize some settings for my builds and this will often lead to merge conflicts. Do you know if there is some sort of inheritance possible for *.conf files, e.g. I could create my own conf based on the prf.conf and only customize the settings I need?

I will come up with some changes to the docs, once we confirmed, that ota is working. Well at least I think we can provide the info that OTA is possible via zigbee and can be activated with the build toggle. But I would not give individual advice on how to provide the OTA files to the zigbee network, because in my opinion this is out of scope of this project and highly individual based on the setup.

Regarding E2E testing I am still trying to figure out how to get the zigbee coordinator to be aware of the update file. My setup is Home Assistant + ZHA + Conbee II USB Stick. In theory this should be possible and I successfully updated other zigbee devices in the past. OTA is undergoing a heavy refactoring within ZHA / zipgy at the moment and it might be broken with HA 2024.4.X so I am waiting on the next update to see if it resolves by itself. On my end I think I provided everything as I saw in various places in the HA forums. 🤷

@rbaron
Copy link
Owner

rbaron commented May 2, 2024

Awesome, thanks @oleo65. Code looks good to me and it's a great addition. I am away from my computer for the next couple of days, but I will help run some e2e tests when I'm back.

I have a similar setup as yours, with ZHA + SkyConnect. I also need to figure out how to make OTA work there.

@rbaron
Copy link
Owner

rbaron commented May 11, 2024

@oleo65 I started unraveling the mysteries of Zigbee OTA. Pushed a minor commit to this PR defining the sw_build_id characteristic on the basic attrs cluster.

Made some notes, but still more questions than answers 😬:

zigpy OTA

Examples

Questions

  • How to generate a zigbee OTA file?
    • A: it's generated in the build/zephyr/.*zigbee if CONFIG_ZIGBEE_FOTA=y
  • How to provide the ota file to ZHA?
  • How to generate a local OTA index.json:
  • How is the Zigbee firmware version specified?
    • A: sw_build_id? This is in the basic_attr clusters. We currently set some attrs there but not the sw_build_id AFAICT. Apparently in ZBOSS's zb_zcl_basic_attrs_ext_t this is the char sw_ver[17] struct field. There's OTA-specific software version in the OTA cluster. In nRF Connect SDK this is specified with the CONFIG_MCUBOOT_IMGTOOL_SIGN_VERSION string value. In the OTA file, the File version is derived from it (source).
    • Is this info also available in the .zigbee file?
      • A: Yes, see answer above. It's in the "file version" OTA header
    • Can we downgrade the firmware?
      - A: Kinda, not as straightforward as upgrading

@rbaron
Copy link
Owner

rbaron commented May 11, 2024

@oleo65 did you manage to flash and run the firmware from this PR? I tried a few times but I was getting a hard crash a few seconds after boot.

I compared the changes in the Kconfig with the defines in the prj_fota.conf light_switch example and AFAICT they look the same. Maybe there's an issue with overriding the defaults like we did there. If I apply the same settings in our prj.conf, things work.

So if you excuse me rolling back on my own suggestion, I took the liberty of creating a separate prj_fota.conf again, like you did in the first place and like the examples do 😜

@oleo65
Copy link
Contributor Author

oleo65 commented May 11, 2024

Yes, I have exactly the same issue and was not able to get the parasites running with the latest revision.

With my initial commit I flashed two parasite successfully and after some resetting they work flawless so I can far (except from the OTA functionality).

I did some research a couple of days ago but was not able to get a real lead on this. The stack trace seems to point to a wrong pointer somewhere in the zigbee library. But I suspect that this is not the real problem but rather a symptom.

What I discovered was that we might need to increase the stack size further. But I was not able to figure this out yet and I think it is only a weak lead.

Have you been able to successfully flash and run the firmware by splitting the config up again? I believe it should be possible to run it from a merged config file however. 🤔

@rbaron
Copy link
Owner

rbaron commented May 11, 2024

Yeah! It works with the split config. A inheritance config model would be cool, but I'm also fine with split + some duplication while we test this. If the goal is to enable OTA by default, we can then get rid of the old, non-OTA one anyway.

Now I'm trying to see if I can get Home Assistant + zigpy to figure out about our .zigbee file. Let's see 🤞

@rbaron
Copy link
Owner

rbaron commented May 11, 2024

Aha. New way to specify OTA updates via ZHA, introduced in home-assistant/core#111159, onboarding changes from zigpy zigpy/zigpy#1340. Directly pointing to .zigbee files considered harmful. Let's do it!

@oleo65
Copy link
Contributor Author

oleo65 commented May 11, 2024

Great find @rbaron . This looks promising. 😊

@rbaron
Copy link
Owner

rbaron commented May 12, 2024

I made some progress but no OTA yet on Home Assistant 2024.5.3 + Sky Connect. What I did:

Compiled and flashed a b-parasite with CONFIG_MCUBOOT_IMGTOOL_SIGN_VERSION="1.0.0".
Compiled a separate OTA file with CONFIG_MCUBOOT_IMGTOOL_SIGN_VERSION="1.2.0"

I went for the recommended way to tell ZHA about OTA firmwares, using the z2m_local_index:

zha:
  zigpy_config:
    ota:
      z2m_local_index: /config/zigpy_ota/index.json

To generate the index.json, I used https://github.com/Koenkk/zigbee-OTA:

$ node scripts/add.js ~/dev/parasite/b-parasite/code/nrf-connect/samples/zigbee/build/zephyr/DB15-0141-01020000-b-parasite.zigbee

This gives me the index.json, that I put in /config/zigpy_ota/index.json:

[
  {
    "fileVersion": 16908288,
    "fileSize": 375832,
    "manufacturerCode": 56085,
    "imageType": 321,
    "sha512": "bbce367fca8c9bc853e0d67c4e05a5479f1443997e6a946677f465df2c1963370c6d2c9f9a2444ee95e011af01253629caf27643c1afe4fdc27735a106f705a9",
    "url": "https://github.com/Koenkk/zigbee-OTA/raw/master/images/DIY/DB15-0141-01020000-b-parasite.zigbee",
    "path": "/config/zigpy_ota/DB15-0141-01020000-b-parasite.zigbee"
  }
]

The url above seems to be required, even when we specify the local path.

I then put the compiled OTA file from this PR in /config/zigpy_ota/DB15-0141-01020000-b-parasite.zigbee.

I now can trigger an image_notify command from Home Assistant:

Screenshot 2024-05-12 at 09 07 22

And I see some promising messages in the log:

$ tail -f /config/home-assistant.log | egrep 'zigpy.ota|zigpy.util|zigpy.*OTA'

2024-05-12 08:43:13.165 DEBUG (MainThread) [zigpy.ota] Loaded 1 images from provider: <zigpy.ota.providers.LocalZ2MProvider object at 0x7ff56c52d280>
2024-05-12 08:43:13.168 DEBUG (MainThread) [zigpy.ota] Caching image OtaImageWithMetadata(metadata=LocalOtaImageMetadata(file_version=16908288, manufacturer_id=56085, image_type=321, checksum='sha512:bbce367fca8c9bc853e0d67c4e05a5479f1443997e6a946677f465df2c1963370c6d2c9f9a2444ee95e011af01253629caf27643c1afe4fdc27735a106f705a9', file_size=375832, manufacturer_names=(), model_names=(), changelog=None, release_notes=None, min_hardware_version=None, max_hardware_version=None, min_current_file_version=None, max_current_file_version=None, specificity=None, source='Local Z2M provider (/config/zigpy_ota/index.json)', path=PosixPath('/config/zigpy_ota/DB15-0141-01020000-b-parasite.zigbee')), firmware=OTAImage(header=OTAImageHeader(upgrade_file_id=200208670, header_version=256, header_length=56, field_control=<FieldControl: 0>, manufacturer_id=56085, image_type=321, file_version=16908288, stack_version=2, header_string=<'\nb-parasite'>, image_size=375832, *device_specific_file=False, *hardware_versions_present=False, *security_credential_version_present=False), subelements=[<SubElement(tag_id=<ElementTagId.UPGRADE_IMAGE: 0>, data=[375770:1500a163696d6781a2626964006473697a651a0005bbc33db8...649268ed52716e])>]))
2024-05-12 08:43:13.169 DEBUG (MainThread) [zigpy.ota] Picking firmware OtaImageWithMetadata(metadata=LocalOtaImageMetadata(file_version=16908288, manufacturer_id=56085, image_type=321, checksum='sha512:bbce367fca8c9bc853e0d67c4e05a5479f1443997e6a946677f465df2c1963370c6d2c9f9a2444ee95e011af01253629caf27643c1afe4fdc27735a106f705a9', file_size=375832, manufacturer_names=(), model_names=(), changelog=None, release_notes=None, min_hardware_version=None, max_hardware_version=None, min_current_file_version=None, max_current_file_version=None, specificity=None, source='Local Z2M provider (/config/zigpy_ota/index.json)', path=PosixPath('/config/zigpy_ota/DB15-0141-01020000-b-parasite.zigbee')), firmware=OTAImage(header=OTAImageHeader(upgrade_file_id=200208670, header_version=256, header_length=56, field_control=<FieldControl: 0>, manufacturer_id=56085, image_type=321, file_version=16908288, stack_version=2, header_string=<'\nb-parasite'>, image_size=375832, *device_specific_file=False, *hardware_versions_present=False, *security_credential_version_present=False), subelements=[<SubElement(tag_id=<ElementTagId.UPGRADE_IMAGE: 0>, data=[375770:1500a163696d6781a2626964006473697a651a0005bbc33db8...649268ed52716e])>]))

The "Picking firmware [...]" log line comes from this line in zigpy. This should mean that our OTA file was deemed compatible. I would expect for the OTA procedure to start at this point.

But then nothing happens, neither on HA nor b-parasite AFAICT. The full HA log also doesn't tell me more. it's possible I'm missing some details in the OTA file header.

The mystery remains.

@rbaron
Copy link
Owner

rbaron commented May 12, 2024

Update: goddamn, it kinda works!!

After the steps from the previous comment, I thought the OTA process would automatically kick in. But from the zigpy code I realize that the outcome of the process is just a "firmware update available" event trigger.

It makes this available in the device page in Home Assistant:

Screenshot 2024-05-12 at 10 12 07

Tapping the Firmware button brings up the pop up. Tapping the Install button starts the OTA (after an initial failure, second tap works).

Home Assistant logs:

2024-05-12 10:12:25.910 DEBUG (MainThread) [zigpy.ota] Picking firmware OtaImageWithMetadata(metadata=LocalOtaImageMetadata(file_version=16908288, manufacturer_id=56085, image_type=321, checksum='sha512:1355b4562923676aedae5d6eb2da611620a3515861cdd502c30824b6d1198eda12077bf25e5b743e0e1ce8c102c73986886a7ec776e61e12bade186621382ccf', file_size=375835, manufacturer_names=(), model_names=(), changelog=None, release_notes=None, min_hardware_version=None, max_hardware_version=None, min_current_file_version=None, max_current_file_version=None, specificity=None, source='Local Z2M provider (/config/zigpy_ota/index.json)', path=PosixPath('/config/zigpy_ota/DB15-0141-01020000-b-parasite.zigbee')), firmware=OTAImage(header=OTAImageHeader(upgrade_file_id=200208670, header_version=256, header_length=60, field_control=<FieldControl.HARDWARE_VERSIONS_PRESENT: 4>, manufacturer_id=56085, image_type=321, file_version=16908288, stack_version=2, header_string=<'\nb-parasite'>, image_size=375835, minimum_hardware_version=<HWVersion version=0 revision=0>, maximum_hardware_version=<HWVersion version=0 revision=255>, *device_specific_file=False, *hardware_versions_present=True, *security_credential_version_present=False), subelements=[<SubElement(tag_id=<ElementTagId.UPGRADE_IMAGE: 0>, data=[375769:1500a163696d6781a2626964006473697a651a0005bbc23db8...797026ae065868])>]))
2024-05-12 10:12:36.879 INFO (MainThread) [zigpy.device] [0xd35a] OTA upgrade progress: (40 / 375835): 0.0106%
2024-05-12 10:12:37.709 INFO (MainThread) [zigpy.device] [0xd35a] OTA upgrade progress: (80 / 375835): 0.0213%
2024-05-12 10:12:38.521 INFO (MainThread) [zigpy.device] [0xd35a] OTA upgrade progress: (120 / 375835): 0.0319%
2024-05-12 10:12:39.491 INFO (MainThread) [zigpy.device] [0xd35a] OTA upgrade progress: (160 / 375835): 0.0426%
2024-05-12 10:12:40.450 INFO (MainThread) [zigpy.device] [0xd35a] OTA upgrade progress: (200 / 375835): 0.0532%

... 13 minutes later ...

2024-05-12 10:25:40.300 INFO (MainThread) [zigpy.device] [0xd35a] OTA upgrade progress: (37080 / 375835): 9.8660%

b-parasite logs:

...
[00:22:37.103,027] <dbg> zigbee_fota: ota_process_firmware: Bytes left: 336795 copy: 40
[00:22:37.921,722] <dbg> zigbee_fota: ota_process_firmware: Process firmware.
[00:22:37.921,752] <dbg> zigbee_fota: ota_process_firmware: Bytes left: 336755 copy: 40
[00:22:38.739,471] <dbg> zigbee_fota: ota_process_firmware: Process firmware.
[00:22:38.739,501] <dbg> zigbee_fota: ota_process_firmware: Bytes left: 336715 copy: 40
[00:22:39.560,699] <dbg> zigbee_fota: ota_process_firmware: Process firmware.
[00:22:39.560,729] <dbg> zigbee_fota: ota_process_firmware: Bytes left: 336675 copy: 40
[00:22:40.398,590] <dbg> zigbee_fota: ota_process_firmware: Process firmware.
...

The b-parasite LED is also flashing, due to the prst_led_flash in the ota_evt_handler.

The only thing is that this is very slow. At this rate the full transfer should take... 2 hours! I remember years ago I tried doing an OTA upgrade for an IKEA shortcut button with zigbee2mqtt and was also surprised about this, so maybe it's working as intended. I will try to measure the battery consumption during OTA, but I feel like it's not pretty. The flashing LED may be a little too much too, but let's see. Worst case we just do a few flashes and stop it, instead of continuously flashing it for 2h.

I will report back once the transfer finishes. There's still room for things to go wrong after the transfer.

@rbaron
Copy link
Owner

rbaron commented May 12, 2024

Two hours later, indeed it works.

b-parasite logs as it finishes applying the upgrade and reboots (note the app: Version: 1.2.0):

[02:24:36.752,166] <dbg> zigbee_fota: ota_process_firmware: Bytes left: 235 copy: 40
[02:24:37.575,378] <dbg> zigbee_fota: ota_process_firmware: Process firmware.
[02:24:37.575,408] <dbg> zigbee_fota: ota_process_firmware: Bytes left: 195 copy: 40
[02:24:38.434,753] <dbg> zigbee_fota: ota_process_firmware: Process firmware.
[02:24:38.434,783] <dbg> zigbee_fota: ota_process_firmware: Bytes left: 155 copy: 40
[02:24:39.284,027] <dbg> zigbee_fota: ota_process_firmware: Process firmware.
[02:24:39.284,057] <dbg> zigbee_fota: ota_process_firmware: Bytes left: 115 copy: 40
[02:24:40.251,922] <dbg> zigbee_fota: ota_process_firmware: Process firmware.
[02:24:40.251,953] <dbg> zigbee_fota: ota_process_firmware: Bytes left: 75 copy: 40
[02:24:41.070,037] <dbg> zigbee_fota: ota_process_firmware: Process firmware.
[02:24:41.070,068] <dbg> zigbee_fota: ota_process_firmware: Bytes left: 35 copy: 35
[02:24:41.071,319] <inf> zigbee_fota: Close image 0 for writing success: 1
[02:24:41.156,402] <inf> zigbee_fota: Firmware downloaded.
[02:24:41.556,579] <inf> zigbee_fota: Full image downloaded.
[02:24:41.556,671] <inf> zigbee_fota: New OTA image downloaded.
[02:24:41.879,058] <inf> zigbee_fota: Mark OTA image as downloaded.
[00:00:00.000,457] <inf> ieee802154_nrf5: nRF5 802154 radio initialized
*** Booting nRF Connect SDK v2.5.0 ***
[00:00:00.003,845] <inf> littlefs: LittleFS version 2.5, disk version 2.0
[00:00:00.004,058] <inf> littlefs: FS at flash-controller@4001e000:0xfa000 is 6 0x1000-byte blocks with 512 cycle
[00:00:00.004,058] <inf> littlefs: sizes: rd 16 ; pr 16 ; ca 64 ; la 32
[00:00:00.081,634] <wrn> debug_counters: fs_read returned 0, expected 4, assuming first access
[00:00:00.082,519] <inf> app: Dumping debug counters:
[00:00:00.090,209] <inf> app: - boot: 3
[00:00:00.097,076] <inf> app: - main_finish: 2
[00:00:00.103,973] <inf> app: - reset_cause_pin: 2
[00:00:00.107,482] <inf> app: - reset_cause_software: 1
[00:00:00.111,968] <inf> app: - sensors_read_after: 150
[00:00:00.116,516] <inf> app: - sensors_read_before: 150
[00:00:00.116,607] <wrn> app: Version: 1.2.0
[00:00:01.016,998] <inf> zigbee_app_utils: Enabled sleepy end device behavior.
[00:00:01.017,639] <inf> app: Marked image as OK
[00:00:01.021,942] <inf> zigbee_app_utils: Production configuration is not present or invalid (status: -1)
[00:00:01.022,399] <inf> zigbee_app_utils: Zigbee stack initialized
[00:00:09.723,754] <inf> zigbee_app_utils: Joined network successfully on reboot signal (Extended PAN ID: f645dd6563c51a88, PAN ID: 0x80af)

HA also shows the device as up-to-date now, with firmware version 0x01020000 (derived from the 1.0.2 more or less as as described here):

Screenshot 2024-05-12 at 13 33 07

I still want to see if changing the polling interval speeds up the OTA process.

@rbaron
Copy link
Owner

rbaron commented May 12, 2024

Something to keep in mind: in the latest nRF Connect SDK (2.6.99 as of writing) there's a known issue NCSIDB-1213: Subsequent Zigbee FOTA updates fail -- Once a Zigbee FOTA update is interrupted for any reason, the subsequent updates will fail until a device reboot. source

I tried enabling the "turbo poll" extension (which I think is enabled by default anyway), but there was no difference in upload time. I see roughly two messages per second, which makes me think we're hitting the ZB_PIM_DEFAULT_FAST_POLL_INTERVAL of 500 ms. These seem related. Still 2h seem way too much.

@oleo65
Copy link
Contributor Author

oleo65 commented May 12, 2024

I believe I read somewhere else that long update times are very common with zigbee end devices. It is not something which looks too concerning to me.

Very glad to see that you were able to get it working. I might be able to check this with my setup tonight, so we have at least a second E2E Test. 😊

@rbaron
Copy link
Owner

rbaron commented May 12, 2024

I swapped the LED flashing with a LOG_INF and measured the power consumption during the OTA process:

image

It averages at 1.78 mA. For 2h, that's 3.6 mAh, or around 3.6 mA / 200 mAh = 2% of the total capacity of a CR2032. It means we can do 50 OTAs if we don't do anything else. Fun fact.

@oleo65
Copy link
Contributor Author

oleo65 commented May 12, 2024

@rbaron Have you been able to successfully flash a parasite with a firmware build from the latest commit using the prj_fota.conf?

The firmware builds fine but the parasite errors out before zephyr os is loaded. Flashing an older already built firmware I had lying around worked fine on the same parasite. I had a similar error a while back but cannot remember how I fixed it in the end. (I am building for a parasite v1.2.0, but this seems like a minor detail to me.)

0006061 I Target type is nrf52840 [board]
0006097 I DP IDR = 0x2ba01477 (v1 rev2) [dap]
0006202 I AHB-AP#0 IDR = 0x24770011 (AHB-AP var1 rev2) [discovery]
0006249 I AP#1 IDR = 0x02880000 (AP var0 rev0) [discovery]
0006297 I NRF52840 not in secure state [target_nRF52]
0006308 I AHB-AP#0 Class 0x1 ROM table #0 @ 0xe00ff000 (designer=244 part=008) [rom_table]
0006315 I [0]<e000e000:SCS v7-M class=14 designer=43b:Arm part=00c> [rom_table]
0006319 I [1]<e0001000:DWT v7-M class=14 designer=43b:Arm part=002> [rom_table]
0006321 I [2]<e0002000:FPB v7-M class=14 designer=43b:Arm part=003> [rom_table]
0006324 I [3]<e0000000:ITM v7-M class=14 designer=43b:Arm part=001> [rom_table]
0006331 I [4]<e0040000:TPIU M4 class=9 designer=43b:Arm part=9a1 devtype=11 archid=0000 devid=ca1:0:0> [rom_table]
0006338 I [5]<e0041000:ETM M4 class=9 designer=43b:Arm part=925 devtype=13 archid=0000 devid=0:0:0> [rom_table]
0006347 I CPU core #0: Cortex-M4 r0p1, v7.0-M architecture [cortex_m]
0006348 I   Extensions: [DSP, FPU, FPU_V4, MPU] [cortex_m]
0006348 I   FPU present: FPv4-SP-D16-M [cortex_m]
0006357 I This appears to be an nRF52840 QIAA D0 [target_nRF52]
0006362 I 4 hardware watchpoints [dwt]
0006366 I 6 hardware breakpoints, 4 literal comparators [fpb]
0015929 C Memory transfer fault @ 0x20018000-0x2001801f [__main__]

@rbaron
Copy link
Owner

rbaron commented May 13, 2024

@oleo65 that's odd. I just tried it again and it works for me at the current commit:

On a nRF52840 2.0.0 board:

[00:00:00.000,488] <inf> ieee802154_nrf5: nRF5 802154 radio initialized
*** Booting nRF Connect SDK v2.5.0 ***
[00:00:00.004,760] <inf> littlefs: LittleFS version 2.5, disk version 2.0
[00:00:00.004,974] <inf> littlefs: FS at flash-controller@4001e000:0xfa000 is 6 0x1000-byte blocks with 512 cycle
[00:00:00.004,974] <inf> littlefs: sizes: rd 16 ; pr 16 ; ca 64 ; la 32
[00:00:00.075,134] <wrn> debug_counters: fs_read returned 0, expected 4, assuming first access
[00:00:00.076,019] <inf> app: Dumping debug counters:
[00:00:00.079,315] <inf> app: - boot: 2
[00:00:00.081,451] <inf> app: - main_finish: 1
[00:00:00.083,740] <inf> app: - reset_cause_pin: 1
[00:00:00.085,388] <inf> app: - reset_cause_unknown: 1
[00:00:00.087,249] <inf> app: - sensors_read_after: 1
[00:00:00.089,233] <inf> app: - sensors_read_before: 1
[00:00:00.989,837] <inf> zigbee_app_utils: Enabled sleepy end device behavior.
[00:00:00.989,868] <inf> app: Firmware version: 0.0.0
[00:00:00.993,041] <inf> zigbee_app_utils: Production configuration is not present or invalid (status: -1)
[00:00:00.993,530] <inf> zigbee_app_utils: Zigbee stack initialized
[00:00:00.999,267] <inf> zigbee_app_utils: Device started for the first time
[00:00:00.999,267] <inf> zigbee_app_utils: Start network steering
[00:00:00.999,328] <inf> zigbee_app_utils: Started network rejoin procedure.

On a nRF52840 1.2.0 board:

SEGGER J-Link V7.80c - Real time terminal output
J-Link EDU Mini V1 compiled Mar 28 2023 17:01:59 V1.0, SN=801036382
Process: JLinkExe
[00:00:00.000,457] <inf> ieee802154_nrf5: nRF5 802154 radio initialized
*** Booting nRF Connect SDK v2.5.0 ***
[00:00:00.003,814] <inf> littlefs: LittleFS version 2.5, disk version 2.0
[00:00:00.004,028] <inf> littlefs: FS at flash-controller@4001e000:0xfa000 is 6 0x1000-byte blocks with 512 cycle
[00:00:00.004,058] <inf> littlefs: sizes: rd 16 ; pr 16 ; ca 64 ; la 32
[00:00:00.004,180] <err> littlefs: WEST_TOPDIR/modules/fs/littlefs/lfs.c:1234: Corrupted dir pair at {0x0, 0x1}
[00:00:00.004,180] <wrn> littlefs: can't mount (LFS -84); formatting
[00:00:00.176,818] <inf> littlefs: /lfs mounted
[00:00:00.329,315] <wrn> debug_counters: fs_read returned 0, expected 4, assuming first access
[00:00:00.332,061] <wrn> debug_counters: fs_read returned 0, expected 4, assuming first access
[00:00:00.332,946] <inf> app: Dumping debug counters:
[00:00:00.334,991] <inf> app: - boot: 1
[00:00:00.336,029] <inf> app: - reset_cause_pin: 1
[00:00:01.236,511] <inf> zigbee_app_utils: Enabled sleepy end device behavior.
[00:00:01.236,572] <inf> app: Firmware version: 0.0.0
[00:00:01.576,263] <wrn> debug_counters: fs_read returned 0, expected 4, assuming first access
[00:00:01.920,227] <inf> zigbee_app_utils: Production configuration is not present or invalid (status: -1)
[00:00:01.920,623] <inf> zigbee_app_utils: Zigbee stack initialized
[00:00:01.925,994] <inf> zigbee_app_utils: Device started for the first time
[00:00:01.926,025] <inf> zigbee_app_utils: Start network steering
[00:00:01.926,055] <inf> zigbee_app_utils: Started network rejoin procedure.
[00:00:02.941,986] <wrn> debug_counters: fs_read returned 0, expected 4, assuming first access
[00:00:03.008,544] <wrn> debug_counters: fs_read returned 0, expected 4, assuming first access

I see the log format are a bit different (they look like the CONFIG_LOG_MODE_MINIMAL), are you using the untouched prj_fota.yaml? Flashing with the VSCode extension?

I would try in this order, and see if it works:

  • Checking out an earlier commit from this PR
  • Double check the build (I just realize this PR won't build for the nRF52833 variant... Will fix later)
  • Do a full erase while flashing
  • Try a different board

I'm curious. Let me know how it goes.

@oleo65
Copy link
Contributor Author

oleo65 commented May 16, 2024

I tried a ton of things now and are equally more confused and learned some things. 👼

In the end I did a clean checkout of the zigbee-ota-updates branch and recompiled with the unchanged .conf files. For the regular prj.conf file, it is working as expected, it compiles fine and the logging after flashing from the sensor is working as well. I compiled for a brand new v2.0.0 parasite, and it is running flawlessly for a couple of days now.

For building the prj_ota.conf, it also compiles fine and I can flash it to the parasite. Strangely the parasite also appears on my zigbee network if I open it for registration. But the logging is broken and it resets with the memory transfer fault. What I don't understand yet is, why it worked before on my initial build a couple of commits before. I was not able to reproduce the behaviour yet, which makes it even more confusing.

I did not have any more time for debugging, since I am away for the week.

For some more context, I build against the latest v2.5.3 release of the nordic SDK, and flashed with my CMSIS-DAP probe via python. This might not be the same as with the J-LINK, but it worked over the course of the last 2 years for me. Logging I usually do via the SEGGER RTT protocol which should be essentially the same as with the J-LINK, I believe. 🤔

Thanks for your ideas already, I believe we are getting close to the solution.

@rbaron
Copy link
Owner

rbaron commented May 16, 2024

Tks @oleo65. The plot thickens...

I just tried the v2.5.3 SDK & toolchain, and it still works fine for me when building and flashing from VSCode with a J-Link.

I also tried:

Building with docker:

$ docker run --rm -v ${PWD}:/repo nordicplayground/nrfconnect-sdk:v2.5-branch west build --pristine --board [email protected] --build-dir /repo/code/nrf-connect/samples/zigbee/build /repo/code/nrf-connect/samples/zigbee -- -DCONF_FILE=/repo/code/nrf-connect/samples/zigbee/prj_fota.conf

... lots of logs ...

[405/405] Generating zephyr/merged.hex

And flashing with nrfjprog:

$ nrfjprog -f nrf52 --program code/nrf-connect/samples/zigbee/build/zephyr/merged.hex --sectorerase --verify --reset
[ #################### ]   9.768s | Erase file - Done erasing
[ #################### ]   2.776s | Program file - Done programming
[ #################### ]   2.940s | Verify file - Done verifying
Applying system reset.
Run.

Getting logs directly with JLinkExe:

$ echo "\n\n\n\n0\n/dev/stdout" | JLinkRTTLogger
SEGGER J-Link RTT Logger
Compiled Sep 27 2022 16:03:22
(c) 2016-2017 SEGGER Microcontroller GmbH, www.segger.com
         Solutions for real time microcontroller applications

Default logfile path: /Users/rbaron/Library/Application Support/SEGGER

------------------------------------------------------------

Device name. Default: NRF52840_XXAA Target interface. Interface speed [kHz]. Default: 4000 kHz RTT Control Block address. Default: auto-detection RTT Channel name or index. Default: channel 1 Output file. Default: RTT_<ChannelName>_<Time>.log
------------------------------------------------------------
... some logs ...

[00:00:00.000,488] <inf> ieee802154_nrf5: nRF5 802154 radio initialized
*** Booting nRF Connect SDK v2.5.0-2-g2b8768c4235a ***
[00:00:00.004,730] <inf> littlefs: LittleFS version 2.5, disk version 2.0
[00:00:00.004,974] <inf> littlefs: FS at flash-controller@4001e000:0xfa000 is 6 0x1000-byte blocks with 512 cycle
[00:00:00.004,974] <inf> littlefs: sizes: rd 16 ; pr 16 ; ca 64 ; la 32
[00:00:00.081,115] <inf> app: Dumping debug counters:
[00:00:00.088,714] <inf> app: - boot: 7
[00:00:00.093,139] <inf> app: - main_finish: 6
[00:00:00.098,937] <inf> app: - reset_cause_cpu_lockup: 1
[00:00:00.103,271] <inf> app: - reset_cause_pin: 3
[00:00:00.107,818] <inf> app: - reset_cause_software: 3
[00:00:00.112,091] <inf> app: - sensors_read_after: 20
[00:00:00.116,455] <inf> app: - sensors_read_before: 20
[00:00:01.016,906] <inf> zigbee_app_utils: Enabled sleepy end device behavior.
[00:00:01.016,967] <inf> app: Firmware version: 0.0.0
[00:00:01.020,172] <inf> zigbee_app_utils: Production configuration is not present or invalid (status: -1)
[00:00:01.020,660] <inf> zigbee_app_utils: Zigbee stack initialized
[00:00:01.026,367] <inf> zigbee_app_utils: Device started for the first time
[00:00:01.026,397] <inf> zigbee_app_utils: Start network steering
[00:00:01.026,458] <inf> zigbee_app_utils: Started network rejoin procedure.

With the introduction of the OTA stuff, there's also significant changes in the apps memory address layout (we now have MCUBoot plus 2 app slots, as opposed to a single, bootloaderless app). I wonder if this is confusing your RTT logger. What's the full command you're using for checking out the logger? Maybe I can reproduce with that.

Did you double check if your logger works with your early commit in this PR (e.g.: git checkout 7dcd4bf)? That would help narrow things down.

I'm also gonna be away form my computer for the next couple of days, but I'm looking forward to us nailing this down 👌

@oleo65
Copy link
Contributor Author

oleo65 commented May 26, 2024

Well I am a little bit smarter now and believe that the failing logging might be related to my debugging probe.

Using the latest commit from this branch, I am able to build both variants with and without OTA and flash them to a v2.0.0 parasite.

Interestingly the parasite works with both variants, and gets paired with HA. The logging with my probe still fails for the OTA firmware variant but works for the other. Maybe I need to provide a custom memory address to the rtt command, but wasn't able to figure out how to do this, and what to provide.

Furthermore I was also able to get a successful OTA notification in HA, using the (sadly quite tedious) method described above to provide the firmware via ZHA custom json file. I ended up not flashing, because I simply did not change anything but are confident, that it should have worked. I might be able to test this sometime this week. Then we have another full humanoid E2E test. 😉

@rbaron
Copy link
Owner

rbaron commented May 27, 2024

Tks @oleo65. What's the actual command/software/probe you're using to check the logs? I will try to use OpenOCD/pyOCD too. If you're feeling brave, you could try seeing if you can run a different OTA application (like the light switch), and checking if logging works with that.

@oleo65
Copy link
Contributor Author

oleo65 commented May 30, 2024

I am using pyocd with pyocd rtt --target nrf52840 to get a rtt session running.

For a probe, I am using a nanoDAP probe from MuseLab, in SWD programming mode.

@rbaron
Copy link
Owner

rbaron commented Jun 1, 2024

Okay @oleo65, I managed to reproduce (the same? a different?) crash with pyocd:

With prj_fota.conf & pyocd it really hard faults every time. The parasite reboots and seem to work normally, but logging does not reconnect:

$ pyocd --version
0.34.3
$ pyocd rtt -t nrf52840
0001224 I Target type is nrf52840 [board]
0001271 I DP IDR = 0x2ba01477 (v1 rev2) [dap]
0001299 I AHB-AP#0 IDR = 0x24770011 (AHB-AP var1 rev2) [ap]
0001340 I AP#1 IDR = 0x02880000 (AP var0 rev0) [ap]
0001341 I NRF52840 not in secure state [target_nRF52]
0001347 I AHB-AP#0 Class 0x1 ROM table #0 @ 0xe00ff000 (designer=244 part=008) [rom_table]
0001354 I [0]<e000e000:SCS v7-M class=14 designer=43b:Arm part=00c> [rom_table]
0001358 I [1]<e0001000:DWT v7-M class=14 designer=43b:Arm part=002> [rom_table]
0001362 I [2]<e0002000:FPB v7-M class=14 designer=43b:Arm part=003> [rom_table]
0001366 I [3]<e0000000:ITM v7-M class=14 designer=43b:Arm part=001> [rom_table]
0001372 I [4]<e0040000:TPIU M4 class=9 designer=43b:Arm part=9a1 devtype=11 archid=0000 devid=ca1:0:0> [rom_table]
0001378 I [5]<e0041000:ETM M4 class=9 designer=43b:Arm part=925 devtype=13 archid=0000 devid=0:0:0> [rom_table]
0001380 I CPU core #0 is Cortex-M4 r0p1 [cortex_m]
0001381 I FPU present: FPv4-SP-D16-M [cortex_m]
0001383 I 4 hardware watchpoints [dwt]
0001386 I 6 hardware breakpoints, 4 literal comparators [fpb]
0001394 I RTT control block search range [0x20000000, 0x040000] [rtt_cmd]
0002707 I _SEGGER_RTT @ 0x200042bc with 3 aUp and 3 aDown [rtt_cmd]
[00:00:00.000,518] <inf> ieee802154_nrf5: nRF5 802154 radio initialized
*** Booting nRF Connect SDK v2.5.3 ***
[00:00:00.004,821] <inf> littlefs: LittleFS version 2.5, disk version 2.0
[00:00:00.005,065] <inf> littlefs: FS at flash-controller@4001e000:0xfa000 is 6 0x1000-byte blocks with 512 cycle
[00:00:00.005,065] <inf> littlefs: sizes: rd 16 ; pr 16 ; ca 64 ; la 32
[00:00:00.079,437] <wrn> debug_counters: fs_read returned 0, expected 4, assuming first access
[00:00:00.080,322] <inf> app: Dumping debug counters:
[00:00:00.086,578] <inf> app: - boot: 8
[00:00:00.090,362] <inf> app: - main_finish: 7
[00:00:00.094,238] <inf> app: - reset_cause_pin: 2
[00:00:00.098,144] <inf> app: - reset_cause_software: 5
[00:00:00.101,104] <inf> app: - reset_cause_unknown: 1
[00:00:00.104,675] <inf> app: - sensors_read_after: 9
[00:00:00.108,337] <inf> app: - sensors_read_before: 9
[00:00:01.008,941] <inf> zigbee_app_utils: Enabled sleepy end device behavior.
[00:00:01.009,002] <inf> app: Firmware version: 0.0.0
[00:00:01.012,207] <inf> zigbee_app_utils: Production configuration is not present or invalid (status: -1)
[00:00:01.012,725] <inf> zigbee_app_utils: Zigbee stack initialized
[00:00:01.018,371] <inf> zigbee_app_utils: Device started for the first time
[00:00:01.018,402] <inf> zigbee_app_utils: Start network steering
[00:00:01.018,432] <inf> zigbee_app_utils: Started network rejoin procedure.
[00:00:06.810,607] <err> mpsl_init: MPSL ASSERT: 107, 302
[00:00:06.810,638] <err> os: ***** HARD FAULT *****
[00:00:06.810,668] <err> os:   Fault escalation (see below)
[00:00:06.810,668] <err> os: ARCH_EXCEPT with reason 3

[00:00:06.810,699] <err> os: r0/a1:  0x00000003  r1/a2:  0x20002538  r2/a3:  0x20002538
[00:00:06.810,729] <err> os: r3/a4:  0x200043c8 r12/ip:  0x20002610 r14/lr:  0x0003408f
[00:00:06.810,729] <err> os:  xpsr:  0x4100001b
[00:00:06.810,760] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x0000000a  s[ 2]:  0x00010000  s[ 3]:  0x00010185
[00:00:06.810,791] <err> os: s[ 4]:  0x00373031  s[ 5]:  0x00ff58d5  s[ 6]:  0x4000b000  s[ 7]:  0x000107cf
[00:00:06.810,791] <err> os: s[ 8]:  0x00000001  s[ 9]:  0x4000b000  s[10]:  0x00000000  s[11]:  0x0000f7b7
[00:00:06.810,821] <err> os: s[12]:  0x20000e38  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x0000002d
[00:00:06.810,852] <err> os: fpscr:  0x200006e4
[00:00:06.810,852] <err> os: Faulting instruction address (r15/pc): 0x0001d00c
[00:00:06.810,913] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:00:06.810,913] <err> os: Fault during interrupt handling

[00:00:06.810,943] <err> os: Current thread: 0x20002538 (unknown)
[00:00:07.477,233] <err> fatal_error: Resetting system

(No more logs, it hangs here)

With prj_fota.conf & VSCode extension, logs work fine:

image

With prj_fota & JLinkRTTLogger, logging also works fine every time:

$ echo "\n\n\n\n0\n/dev/stdout" | JLinkRTTLogger
SEGGER J-Link RTT Logger
Compiled Sep 27 2022 16:03:22
(c) 2016-2017 SEGGER Microcontroller GmbH, www.segger.com
         Solutions for real time microcontroller applications

Default logfile path: /Users/rbaron/Library/Application Support/SEGGER

------------------------------------------------------------

Device name. Default: NRF52 Target interface. Interface speed [kHz]. Default: 4000 kHz RTT Control Block address. Default: auto-detection RTT Channel name or index. Default: channel 1 Output file. Default: RTT_<ChannelName>_<Time>.log
------------------------------------------------------------
Connected to:
  J-Link EDU Mini V1 compiled Mar 28 2023 17:01:59
  S/N: 801036382

Searching for RTT Control Block...OK.
3 up-channels found:
0: Terminal
1:
2:
Selected RTT Channel description:
  Index: 0
  Name:  Terminal
  Size:  4096 bytes.

Output file: /dev/stdout

Getting RTT data from target. Press any key to quit.
------------------------------------------------------------

[00:00:00.000,488] <inf> ieee802154_nrf5: nRF5 802154 radio initialized
*** Booting nRF Connect SDK v2.5.3 ***
[00:00:00.004,791] <inf> littlefs: LittleFS version 2.5, disk version 2.0
[00:00:00.005,004] <inf> littlefs: FS at flash-controller@4001e000:0xfa000 is 6 0x1000-byte blocks with 512 cycle
[00:00:00.005,035] <inf> littlefs: sizes: rd 16 ; pr 16 ; ca 64 ; la 32
Transfer rate: 814 Bytes/s Data written: 408 Bytes [00:00:00.078,002] <inf> app: Dumping debug counters:
[00:00:00.083,312] <inf> app: - boot: 7
[00:00:00.086,639] <inf> app: - main_finish: 4
[00:00:00.089,660] <inf> app: - reset_cause_pin: 5
[00:00:00.093,444] <inf> app: - reset_cause_software: 1
[00:00:00.096,099] <inf> app: - reset_cause_unknown: 1
[00:00:00.099,304] <inf> app: - sensors_read_after: 6
[00:00:00.102,600] <inf> app: - sensors_read_before: 6
[00:00:01.003,051] <inf> zigbee_app_utils: Enabled sleepy end device behavior.
[00:00:01.003,112] <inf> app: Firmware version: 0.0.0
[00:00:01.006,286] <inf> zigbee_app_utils: Production configuration is not present or invalid (status: -1)
[00:00:01.006,774] <inf> zigbee_app_utils: Zigbee stack initialized
[00:00:01.012,451] <inf> zigbee_app_utils: Device started for the first time
[00:00:01.012,451] <inf> zigbee_app_utils: Start network steering
[00:00:01.012,512] <inf> zigbee_app_utils: Started network rejoin procedure.
[00:00:42.598,175] <inf> zigbee_app_utils: Network steering was not successful (status: -1)

With the default prj.conf & pyocd, it breaks sometimes, but reboots and works on the second time when it happens:

$ pyocd rtt -t nrf52840
0001250 I Target type is nrf52840 [board]
0001292 I DP IDR = 0x2ba01477 (v1 rev2) [dap]
0001324 I AHB-AP#0 IDR = 0x24770011 (AHB-AP var1 rev2) [ap]
0001363 I AP#1 IDR = 0x02880000 (AP var0 rev0) [ap]
0001364 I NRF52840 not in secure state [target_nRF52]
0001370 I AHB-AP#0 Class 0x1 ROM table #0 @ 0xe00ff000 (designer=244 part=008) [rom_table]
0001377 I [0]<e000e000:SCS v7-M class=14 designer=43b:Arm part=00c> [rom_table]
0001381 I [1]<e0001000:DWT v7-M class=14 designer=43b:Arm part=002> [rom_table]
0001385 I [2]<e0002000:FPB v7-M class=14 designer=43b:Arm part=003> [rom_table]
0001389 I [3]<e0000000:ITM v7-M class=14 designer=43b:Arm part=001> [rom_table]
0001395 I [4]<e0040000:TPIU M4 class=9 designer=43b:Arm part=9a1 devtype=11 archid=0000 devid=ca1:0:0> [rom_table]
0001401 I [5]<e0041000:ETM M4 class=9 designer=43b:Arm part=925 devtype=13 archid=0000 devid=0:0:0> [rom_table]
0001403 I CPU core #0 is Cortex-M4 r0p1 [cortex_m]
0001404 I FPU present: FPv4-SP-D16-M [cortex_m]
0001405 I 4 hardware watchpoints [dwt]
0001408 I 6 hardware breakpoints, 4 literal comparators [fpb]
0001415 I RTT control block search range [0x20000000, 0x040000] [rtt_cmd]
0002580 I _SEGGER_RTT @ 0x20003ad0 with 3 aUp and 3 aDown [rtt_cmd]
[00:00:10.439,331] <err> mpsl_init: MPSL ASSERT: 107, 302
[00:00:10.439,331] <err> os: ***** HARD FAULT *****
[00:00:10.439,361] <err> os:   Fault escalation (see below)
[00:00:10.439,392] <err> os: ARCH_EXCEPT with reason 3

[00:00:10.439,392] <err> os: r0/a1:  0x00000003  r1/a2:  0x20002100  r2/a3:  0x20002100
[00:00:10.439,422] <err> os: r3/a4:  0x20003bdc r12/ip:  0x200021d8 r14/lr:  0x0002545b
[00:00:10.439,422] <err> os:  xpsr:  0x4100001b
[00:00:10.439,453] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x0000000a  s[ 2]:  0x00010000  s[ 3]:  0x00003f85
[00:00:10.439,483] <err> os: s[ 4]:  0x00373031  s[ 5]:  0x00ff6c03  s[ 6]:  0x4000b000  s[ 7]:  0x000045cf
[00:00:10.439,483] <err> os: s[ 8]:  0x00000001  s[ 9]:  0x4000b000  s[10]:  0x00000000  s[11]:  0x000035b7
[00:00:10.439,514] <err> os: s[12]:  0x20000d08  s[13]:  0x20000d08  s[14]:  0x00000040  s[15]:  0x0000002f
[00:00:10.439,544] <err> os: fpscr:  0x200006ac
[00:00:10.439,575] <err> os: Faulting instruction address (r15/pc): 0x000101ec
[00:00:10.439,605] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:00:10.439,605] <err> os: Fault during interrupt handling

[00:00:10.439,636] <err> os: Current thread: 0x20002100 (unknown)
[00:00:11.330,902] <err> fatal_error: Resetting system
[00:00:00.000,488] <inf> ieee802154_nrf5: nRF5 802154 radio initialized
*** Booting nRF Connect SDK v2.5.3 ***
[00:00:00.004,394] <inf> littlefs: LittleFS version 2.5, disk version 2.0
[00:00:00.004,608] <inf> littlefs: FS at flash-controller@4001e000:0xfa000 is 6 0x1000-byte blocks with 512 cycle
[00:00:00.004,638] <inf> littlefs: sizes: rd 16 ; pr 16 ; ca 64 ; la 32
[00:00:00.075,195] <inf> app: Dumping debug counters:
[00:00:00.078,826] <inf> app: - boot: 3
[00:00:00.080,932] <inf> app: - main_finish: 2
[00:00:00.083,587] <inf> app: - reset_cause_pin: 1
[00:00:00.085,479] <inf> app: - reset_cause_software: 2
[00:00:00.087,432] <inf> app: - sensors_read_after: 2
[00:00:00.089,447] <inf> app: - sensors_read_before: 2
[00:00:00.989,898] <inf> zigbee_app_utils: Enabled sleepy end device behavior.
[00:00:00.992,980] <inf> zigbee_app_utils: Production configuration is not present or invalid (status: -1)
[00:00:00.993,469] <inf> zigbee_app_utils: Zigbee stack initialized
[00:00:00.999,237] <inf> zigbee_app_utils: Device started for the first time
[00:00:00.999,237] <inf> zigbee_app_utils: Start network steering
[00:00:00.999,298] <inf> zigbee_app_utils: Started network rejoin procedure.
[00:00:42.582,214] <inf> zigbee_app_utils: Network steering was not successful (status: -1)
[00:01:25.132,232] <inf> zigbee_app_utils: Network steering was not successful (status: -1)
[00:02:09.716,217] <inf> zigbee_app_utils: Network steering was not successful (status: -1)

With prj.conf & the VSCode extension, it still works as before.

It's possible that we're doing something wrong, but pyocd looks very sus here, even breaking sometimes with the stock prj.conf for me...

@rbaron
Copy link
Owner

rbaron commented Jun 3, 2024

@oleo65 I managed to also connect to RTT directly with openocd too, and apparently it works for me. BTW funny feeling when I google a question and it shows the answer on this repo.... Maybe you can try with your current setup to see if openocd works instead of pyocd?

In terminal 1:

$ openocd -f interface/jlink.cfg -c 'transport select swd' -f target/nrf52.cfg -c 'init'

In terminal 2:

$ nc localhost 4444
> rtt setup 0x20000000 80000 "SEGGER RTT"
rtt setup 0x20000000 80000 "SEGGER RTT"
> rtt start
rtt start
rtt: Searching for control block 'SEGGER RTT'
rtt: Control block found at 0x200042bc
> rtt server start 9090 0
rtt server start 9090 0
Listening on port 9090 for rtt connections
accepting 'rtt' connection on tcp/9090

In terminal 3:

$ nc localhost 9090
[00:00:00.000,518] <inf> ieee802154_nrf5: nRF5 802154 radio initialized
*** Booting nRF Connect SDK v2.5.3 ***
[00:00:00.004,821] <inf> littlefs: LittleFS version 2.5, disk version 2.0
[00:00:00.005,065] <inf> littlefs: FS at flash-controller@4001e000:0xfa000 is 6 0x1000-byte blocks with 512 cycle
[00:00:00.005,065] <inf> littlefs: sizes: rd 16 ; pr 16 ; ca 64 ; la 32
[00:00:00.081,085] <inf> app: Dumping debug counters:
[00:00:00.088,439] <inf> app: - boot: 18
[00:00:00.092,742] <inf> app: - main_finish: 15
[00:00:00.099,060] <inf> app: - reset_cause_pin: 6
[00:00:00.105,346] <inf> app: - reset_cause_software: 1
[00:00:00.109,527] <inf> app: - reset_cause_unknown: 11
[00:00:00.113,739] <inf> app: - sensors_read_after: 2175
[00:00:00.118,011] <inf> app: - sensors_read_before: 2175
[00:00:00.124,114] <inf> app: - steering_watchdog_restart: 1
[00:00:01.024,719] <inf> zigbee_app_utils: Enabled sleepy end device behavior.
[00:00:01.024,780] <inf> app: Firmware version: 0.0.0
[00:00:01.029,388] <inf> zigbee_app_utils: Production configuration is not present or invalid (status: -1)
[00:00:01.029,876] <inf> zigbee_app_utils: Zigbee stack initialized
[00:00:09.197,631] <inf> zigbee_app_utils: Unable to join the network, start network steering
[00:00:09.197,723] <inf> zigbee_app_utils: Started network rejoin procedure.
[00:00:18.361,694] <inf> zigbee_app_utils: Unable to join the network, start network steering
[00:00:29.063,568] <inf> zigbee_app_utils: Joined network successfully on reboot signal (Extended PAN ID: 8a5e76c12f8d9ce6, PAN ID: 0xd15b)
[00:00:29.063,629] <inf> zigbee_app_utils: Network rejoin procedure stopped as NOT scheduled.

@oleo65
Copy link
Contributor Author

oleo65 commented Jun 3, 2024

I guess this repo is famous now. 😎

I hope I will have the time this week to give it a try. Thanks for sorting this out.

@oleo65
Copy link
Contributor Author

oleo65 commented Jul 7, 2024

Some updates from my E2E tests.

I was finally able to get the OTA notification for the boards within HA. 🥳

Screenshot_20240707-234704.png

Over the course of about one week 5 of my devices popped up with updates available. They are all "old" v1.2.0 devices.

However I was not able to successfully flash any of them. The update process starts successfully but keeps on canceling randomly after 20-40 minutes at varying percentages. Still, I consider this a big step forward.

I increased the parent poll interval to 300 seconds for my parasites to try to conserve more battery power. Spoiler: does not seem to have any effect. This might add more instability to the OTA process, but is only a wild guess. 😊

I suspect that maybe my network is too weak or busy and the OTA connection not stable enough. I plan to do some more testing as soon as there is more time to spare.

As a side note: Currently I don't exactly know how to provide different firmwares for the different hardware revisions at the same time to the zigbee coordinator. This might be challenging.

@SchmidL
Copy link

SchmidL commented Jul 26, 2024

Hi - love all your work! Thank you a lot!

While starting the project, I still have the "old" v1.2.0 devices. I finally switched many of my battery-powered devices to Zigbee and enabled HA to grab all the data.
Is there already a consensus about using the plain Zigbee firmware or already the version with the OTA update?
I would also provide a write-up (including calibration) for others if I manage to pull it off!

@oleo65
Copy link
Contributor Author

oleo65 commented Jul 27, 2024

I'd say, it depends. 😉

On how easy your deviced could be updated via cable in case you are stuck and on how adventurous you are. From a feature perspective, the main branch and the OTA branch firmware are identical.

Having said that, I am running both v1.2.0 and v2.0.0 parasites with the OTA variant from this branch just fine for a few weeks now.

A successful OTA update is still pending for me, but mostly due to lack of time for experiments and somewhat flaky implementation from the Home Assisant / ZHA side.

I would love to see someone else validate the approach, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants