Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fw_update: implement retry with backoff #703

Merged
merged 4 commits into from
Dec 13, 2024

Conversation

szczys
Copy link
Contributor

@szczys szczys commented Dec 10, 2024

Copy link

github-actions bot commented Dec 10, 2024

Visit the preview URL for this PR (updated for commit f18f360):

https://golioth-firmware-sdk-doxygen-dev--pr703-szczys-fw-upda-r3e0ovzc.web.app

(expires Fri, 20 Dec 2024 21:36:49 GMT)

🔥 via Firebase Hosting GitHub Action 🌎

Sign: a9993e61697a3983f3479e468bcb0b616f9a0578

@szczys szczys force-pushed the szczys/fw-update-retry-with-backoff branch from 0e88b3b to 051546f Compare December 10, 2024 22:34
@szczys szczys force-pushed the szczys/fw-report-stability branch from 1d412b1 to 16fbd8c Compare December 10, 2024 22:43
@szczys szczys force-pushed the szczys/fw-update-retry-with-backoff branch from 051546f to 8507b7b Compare December 10, 2024 22:44
@szczys szczys force-pushed the szczys/fw-report-stability branch from 16fbd8c to 5144bca Compare December 10, 2024 22:56
@szczys szczys force-pushed the szczys/fw-update-retry-with-backoff branch from 8507b7b to 31d5b66 Compare December 10, 2024 22:57
Copy link

github-actions bot commented Dec 10, 2024

Code Coverage

Code Coverage

Package Line Rate Branch Rate Health
include.golioth 75% 50%
port.linux 57% 31%
port.utils 58% 46%
port.zephyr 54% 24%
src 70% 31%
Summary 68% (2654 / 3901) 31% (1114 / 3611)

Copy link

codecov bot commented Dec 10, 2024

Codecov Report

Attention: Patch coverage is 0% with 46 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/fw_update.c 0.00% 46 Missing ⚠️
Files with missing lines Coverage Δ
src/fw_update.c 0.00% <0.00%> (ø)

... and 2 files with indirect coverage changes

@szczys szczys force-pushed the szczys/fw-update-retry-with-backoff branch from 31d5b66 to 505a7d0 Compare December 11, 2024 02:30
@szczys szczys force-pushed the szczys/fw-report-stability branch 2 times, most recently from be31140 to c60c269 Compare December 11, 2024 18:12
@szczys szczys force-pushed the szczys/fw-update-retry-with-backoff branch 2 times, most recently from ec86659 to 288f72d Compare December 11, 2024 19:39
@szczys szczys changed the base branch from szczys/fw-report-stability to main December 11, 2024 19:42
@szczys szczys changed the base branch from main to szczys/fw-report-stability December 11, 2024 19:42
@szczys szczys force-pushed the szczys/fw-report-stability branch 2 times, most recently from c82b93e to 867e299 Compare December 12, 2024 15:18
@szczys szczys force-pushed the szczys/fw-update-retry-with-backoff branch from 288f72d to 83faa32 Compare December 12, 2024 15:19
Copy link
Contributor

@sam-golioth sam-golioth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @szczys! I left one comment about rollover. We should think about it holistically across all the places where we're making comparisons or performing math using the result of golioth_sys_now_ms() so I don't think that should block merging this PR.

src/fw_update.c Outdated

static int32_t backoff_ms_before_expiration(struct fw_update_component_context *ctx)
{
uint64_t actual_duration = golioth_sys_now_ms() - ctx->last_fail_ts;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need to consider overflow on FreeRTOS. The API for golioth_sys_now_ms() returns a uint64_t, but it's calculated from the tick count which is a uint32_t (usually, technically it can be configured). Depending on the tick period this could rollover in as little as 49 days.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote this with rollover in mind. Though I did not realize we don't have access to the full bit-width of the type being return. Since I'm returning an int32 for the duration, why don't I just mask the uint64 tick count to uint32 before calculation? I think it will perform the same way.

@szczys szczys force-pushed the szczys/fw-report-stability branch from 867e299 to 7ec533d Compare December 12, 2024 22:26
@szczys szczys force-pushed the szczys/fw-update-retry-with-backoff branch from 83faa32 to 5888e01 Compare December 12, 2024 22:42
Base automatically changed from szczys/fw-report-stability to main December 13, 2024 19:02
Copy link
Contributor

@sam-golioth sam-golioth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @szcysy! Appreciate the extra attention to detail around rollovers 🙏

Record the timestamp at which a download fails and the desired backoff
duration in the component context. Add functions to set these variables
after each failure, check them to see if the backoff duration has expired,
and to reset them at the beginning and end of a component download
operation.

Stored and calculated values for backoff timing use uint32_t. While
golioth_sys_now_ms() returns uint64_t, not all platforms (ESP-IDF) support
that full bit-width.

Signed-off-by: Mike Szczys <[email protected]>
Move the mutex lock/unlock functions inside of
received_new_target_component() as it should never be called without
guarding the manifest pointer which is also updated on another thread by
the on_ota_manifest callback.

Check stored target component against what was received with manifest and
use along with the backoff_duration_ms time value in the context to detect
when a component is already being downloaded.

Reset the backoff values in the context when a new component is received
from the manifest.

Signed-off-by: Mike Szczys <[email protected]>
When awaiting a retry, use the timeout value of the manifest semaphore to
wait until the appropriate backoff duration has passed before retrying a
download. This allows a new manifest to be received and acted upon while
awaiting a retry.

Set up the component for a retry as soon as the download step has finished
(whether it is a success or failure). The retry continues to be in effect
until all validation and other steps of the fw_update process have been
confirmed. Finally, reset the backoff values to indicate a retry
is not necessary.

Signed-off-by: Mike Szczys <[email protected]>
Use the new GOLIOTH_OTA_REASON_AWAIT_RETRY reason when reporting IDLE state
during retry backoff period.

Signed-off-by: Mike Szczys <[email protected]>
@szczys szczys force-pushed the szczys/fw-update-retry-with-backoff branch from 5888e01 to f18f360 Compare December 13, 2024 21:36
@szczys szczys merged commit a7c5dbf into main Dec 13, 2024
70 of 76 checks passed
@szczys szczys deleted the szczys/fw-update-retry-with-backoff branch December 13, 2024 23:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants