
Fix race condition in RedisCache when setting key expiration #59212

Open
wants to merge 1 commit into main

Conversation

CaringDev commented Nov 28, 2024

Fix race condition in RedisCache when setting key expiration

Ensure the key is set before setting expiration.

Description

Currently, writing the key and (attempting to) setting its expiration are issued concurrently.
Often, the key has not yet been written by the time the expiration is set, so the key ends up non-expiring. This can exhaust Redis storage and cause logical errors in applications that rely on key expiry.
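
For illustration, a minimal sketch of the intended ordering, assuming StackExchange.Redis with a placeholder endpoint, key, value and TTL (this is not the actual RedisCache source, which stores its entries differently):

```csharp
using System;
using StackExchange.Redis;

// Minimal sketch, not the actual RedisCache code: the point is only that the
// write must have completed before the expiration is applied to the key.
var mux = await ConnectionMultiplexer.ConnectAsync("localhost:6379"); // placeholder endpoint
IDatabase db = mux.GetDatabase();

await db.StringSetAsync("cache:item", "payload");                     // 1. the key exists...
await db.KeyExpireAsync("cache:item", TimeSpan.FromMinutes(20));      // 2. ...then gets a TTL
```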

💡 I don't know whether the change from scripts in 8.x to the two separate calls in 9.x has any other (unwanted) side effects.
Edit: e.g. storing the key but then failing to set the expiration - over time we'd "leak" keys.

Fixes #59211

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions label Nov 28, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Nov 28, 2024
tiakun commented Nov 29, 2024

I think it should be wrapped in CreateBatch instead, or even CreateTransaction if we want to prevent race conditions across multiple clients.
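
For illustration only, a rough sketch of that idea, assuming an existing StackExchange.Redis IDatabase named `db` and placeholder key/value/TTL (this is not the PR's code):

```csharp
// Rough sketch, not the PR's code; `db` is an existing StackExchange.Redis IDatabase.

// CreateBatch keeps the two commands together on the wire and sends them as one unit:
IBatch batch = db.CreateBatch();
Task setTask = batch.StringSetAsync("cache:item", "payload");
Task expireTask = batch.KeyExpireAsync("cache:item", TimeSpan.FromMinutes(20));
batch.Execute();                                // releases the queued commands
await Task.WhenAll(setTask, expireTask);

// CreateTransaction additionally wraps them in MULTI/EXEC, so other clients cannot
// observe the key between the write and the expiry:
ITransaction tran = db.CreateTransaction();
_ = tran.StringSetAsync("cache:item", "payload");
_ = tran.KeyExpireAsync("cache:item", TimeSpan.FromMinutes(20));
bool committed = await tran.ExecuteAsync();     // false if the transaction was aborted
```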

mgravell (Member) commented Nov 29, 2024 via email

CaringDev (Author)

> So: concurrent is not (in this case) unordered

Ah, sorry, I must have missed something here.

> Are you actually seeing scenarios where this is causing a problem

Yes, I crashed our production Redis (pre-prod not having enough data / monitoring) by filling it with non-expiring data (we do set an expiration on all keys).

> Ironically, if you are seeing a problem, I had a conversation earlier this week where I posited a potential way this could happen, which is a SE.Redis fix, not an aspnetcore fix (I wear both hats).

With M.E.C.SER 8.0.11 and SE.R "standard" or pinned to latest this does not happen; with 9.0.0 it does with both. So I assumed M.E.C to be at fault.

mgravell (Member) commented Nov 29, 2024 via email

CaringDev (Author) commented Nov 29, 2024

> I will investigate and fix the offending article.

I'll gladly accept any solution, whatever the root cause is 😀
Let me know if I can help with more details / testing.
Interestingly enough, I can only reproduce the behavior on our final environment (Suse CloudFoundry), not locally on Windows, in WSL, or in aspnetcore-runtime-deps images.

mgravell (Member)

Interesting. Can I ask, since I'm not familiar with that service: is that vanilla Redis, or is it a Redis-alike? Any idea whether it has anything exotic going on like additional load balancing? I already planned to investigate this potential ordering issue (because: order matters), but I'm curious as to whether that product / service might also be a wildcard here.

CaringDev (Author)

@mgravell I added more details to #59211.
In particular, Redis (consumed by me "as a service") is deployed using the Bitnami charts. I could probably get the exact configuration, if that helps. Interestingly, the bug is only triggered when the client runs on said Suse CloudFoundry environment, and only with M.E.C.R @ 9.0.0, not with 8.0.11 (both regardless of SE.R version, i.e. transitive or pinned to latest).

mgravell (Member) commented Nov 30, 2024

M.E.C.R. V8 uses a fundamentally different approach that indeed won't be dependent on pipelined ordering. However, pipelined ordering is meant to be part of the SE.Redis contract. As I say, I think I know exactly what is wrong, and will look ASAP.

If I can confirm and resolve in SE.Redis, then the "fix" over here will probably just be a package-ref update. That way, we fix many more possible call paths.

@mgravell mgravell self-assigned this Nov 30, 2024
@mgravell mgravell added bug This issue describes a behavior which is not expected - a bug. feature-caching Includes: StackExchangeRedis and SqlServer distributed caches labels Nov 30, 2024
mgravell (Member)

/cc @NickCraver @philon-msft it might just be confirmation bias, but this ^^^ sounds like the ordering thing I mentioned in our call last week!

mgravell (Member) commented Dec 1, 2024

My first priority is making sure we understand the problem and root cause, so we can have confidence both in any mitigation and in any relevant misbehaving tools.

An initial check doesn't show a pathological problem - that isn't conclusive, however, and I'm using the "current" release (2.8.22) rather than the 2.7.27 that M.E.C.SER uses; I'll re-run in a moment with that version:

Results with 2.8.22:

Connected
Worker 0 completed 10000 passes of 1000 ops each; failures: 0
Worker 4 completed 10000 passes of 1000 ops each; failures: 0
Worker 1 completed 10000 passes of 1000 ops each; failures: 0
Worker 2 completed 10000 passes of 1000 ops each; failures: 0
Worker 8 completed 10000 passes of 1000 ops each; failures: 0
Worker 7 completed 10000 passes of 1000 ops each; failures: 0
Worker 5 completed 10000 passes of 1000 ops each; failures: 0
Worker 6 completed 10000 passes of 1000 ops each; failures: 0
Worker 3 completed 10000 passes of 1000 ops each; failures: 0
Worker 12 completed 10000 passes of 1000 ops each; failures: 0
Worker 10 completed 10000 passes of 1000 ops each; failures: 0
Worker 11 completed 10000 passes of 1000 ops each; failures: 0
Worker 9 completed 10000 passes of 1000 ops each; failures: 0
Worker 14 completed 10000 passes of 1000 ops each; failures: 0
Worker 16 completed 10000 passes of 1000 ops each; failures: 0
Worker 19 completed 10000 passes of 1000 ops each; failures: 0
Worker 15 completed 10000 passes of 1000 ops each; failures: 0
Worker 17 completed 10000 passes of 1000 ops each; failures: 0
Worker 18 completed 10000 passes of 1000 ops each; failures: 0
Worker 13 completed 10000 passes of 1000 ops each; failures: 0
Total ops: 200000000

Results with 2.7.27:

Connected
Worker 1 completed 10000 passes of 1000 ops each; failures: 0
Worker 5 completed 10000 passes of 1000 ops each; failures: 0
Worker 0 completed 10000 passes of 1000 ops each; failures: 0
Worker 9 completed 10000 passes of 1000 ops each; failures: 0
Worker 7 completed 10000 passes of 1000 ops each; failures: 0
Worker 3 completed 10000 passes of 1000 ops each; failures: 0
Worker 13 completed 10000 passes of 1000 ops each; failures: 0
Worker 6 completed 10000 passes of 1000 ops each; failures: 0
Worker 11 completed 10000 passes of 1000 ops each; failures: 0
Worker 4 completed 10000 passes of 1000 ops each; failures: 0
Worker 14 completed 10000 passes of 1000 ops each; failures: 0
Worker 8 completed 10000 passes of 1000 ops each; failures: 0
Worker 18 completed 10000 passes of 1000 ops each; failures: 0
Worker 12 completed 10000 passes of 1000 ops each; failures: 0
Worker 15 completed 10000 passes of 1000 ops each; failures: 0
Worker 17 completed 10000 passes of 1000 ops each; failures: 0
Worker 10 completed 10000 passes of 1000 ops each; failures: 0
Worker 2 completed 10000 passes of 1000 ops each; failures: 0
Worker 16 completed 10000 passes of 1000 ops each; failures: 0
Worker 19 completed 10000 passes of 1000 ops each; failures: 0
Total ops: 200000000

I'm going to have to dig deeper.

mgravell (Member) commented Dec 2, 2024

My test rig added over here; at the current time I believe I was wrong about SE.Redis having a potential out-of-order glitch; neither aggressive testing nor code inspection have supported that idea. Right now, I am unable to reproduce this symptom.

mgravell (Member) left a comment

For completeness: my hesitation here is that this (the proposed change) introduces an extra round trip of latency in all scenarios with a TTL, which should typically be just about all of them. For local servers that isn't a huge problem, but for remote servers it can be significant. The existing implementation was designed specifically to avoid this extra latency by using the pipelined API scenario explicitly exposed by SE.Redis - the short version being: this should already be correctly ordered.

Before we jump in and look at fixes here, I'm very keen to prove the actual problem, to make sure we're fixing the right thing in the most appropriate way.
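
For context, a sketch of the two shapes being weighed here, given an existing StackExchange.Redis IDatabase named `db` and placeholder key/value/TTL (not the actual implementation):

```csharp
// Pipelined shape: both commands are issued before either is awaited, so they
// share a single round trip, and SE.Redis is expected to preserve issue order.
Task setPending = db.StringSetAsync("cache:item", "payload");
Task expirePending = db.KeyExpireAsync("cache:item", TimeSpan.FromMinutes(20));
await Task.WhenAll(setPending, expirePending);

// Sequential shape (roughly what the proposed change does): the expiry is not
// sent until the write has completed - ordering by construction, at the cost of
// an extra round trip.
await db.StringSetAsync("cache:item", "payload");
await db.KeyExpireAsync("cache:item", TimeSpan.FromMinutes(20));
```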

tiakun commented Dec 4, 2024

> My test rig added over here; at the current time I believe I was wrong about SE.Redis having a potential out-of-order glitch; neither aggressive testing nor code inspection have supported that idea. Right now, I am unable to reproduce this symptom.

I inspected WriteMessageTakingWriteLockAsync for some hours and came to the same conclusion. A message either goes into the queue or the lock must be taken, which prevents a message from being sent before the queue has been completely processed. This enforces ordering when the WriteMessageTakingWriteLockAsync calls come from the same thread, because queueing and sending are synchronous, and the FlushAsync call is already handled carefully so that the lock is not released before the flush has really finished.

I guess you are already very familiar with the above, but I hope it helps.

I also ran your test on .NET 8 with 2.8.22 and the main branch of SE.Redis against a local Redis server on WSL2, and it produces the same result as yours.

mgravell (Member) commented Dec 4, 2024

That was indeed my analysis (and yes, I'm very familiar with this code). The only reason I was suspicious in the first place is that in some related code (I'm playing with a rewrite of the IO core for SE.Redis) I approached it differently - basically "take the queue lock, try to take the write lock; if we take the write lock, release the queue lock and write - otherwise enqueue" - but from the perspective of a single caller thread, I'm satisfied that the result is the same and no queue jumping can occur.

Hence my questions about the underlying server; maybe a non-compliant server, or a RESP-aware proxy that splits traffic without considering pipelines.
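
Purely as an illustration of that single-writer ordering argument, a simplified, hypothetical sketch (names like `OrderedWriter`, `queueGate` and `writeGate` are invented; this is not SE.Redis source, and error handling is omitted):

```csharp
using System.Collections.Generic;
using System.Threading;

// Hypothetical, simplified sketch - not SE.Redis internals. A writer either takes
// the write gate and sends directly, or enqueues under the queue gate; the write
// gate is only released once the backlog is empty, so from the point of view of a
// single caller thread a later message cannot overtake an earlier one.
class OrderedWriter
{
    private readonly object queueGate = new();      // guards the backlog
    private readonly object writeGate = new();      // guards the outbound connection
    private readonly Queue<string> backlog = new(); // pending messages, FIFO

    public void Write(string message)
    {
        lock (queueGate)
        {
            if (!Monitor.TryEnter(writeGate))
            {
                backlog.Enqueue(message);           // current holder drains this in order
                return;
            }
        }                                            // queue gate released before the write

        Send(message);

        while (true)                                 // drain, then release the write gate
        {
            string? next;
            lock (queueGate)
            {
                if (!backlog.TryDequeue(out next))
                {
                    Monitor.Exit(writeGate);         // released only once the backlog is empty
                    return;
                }
            }
            Send(next!);
        }
    }

    private void Send(string message)
    {
        // write to the socket (omitted in this sketch)
    }
}
```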

Contributor

Looks like this PR hasn't been active for some time and the codebase could have been changed in the meantime.
To make sure no conflicting changes have occurred, please rerun validation before merging. You can do this by leaving an /azp run comment here (requires commit rights), or by simply closing and reopening.

@dotnet-policy-service dotnet-policy-service bot added the pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun label Dec 11, 2024

Successfully merging this pull request may close these issues.

Items with configured expiration stored in IDistributedCache (using RedisCache) end up non-expiring in Redis