
Fix race condition in RedisCache when setting key expiration #59212

Open
wants to merge 1 commit into main

Conversation

CaringDev commented Nov 28, 2024

Fix race condition in RedisCache when setting key expiration

Ensure the key is set before setting expiration.

Description

Currently, writing the key and (attempting to) setting its expiration are issued concurrently.
Often, the key has not yet been written by the time the expiration is set, so the key ends up non-expiring. This can exhaust Redis storage and cause logical errors in applications that rely on key expiry.
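
For illustration, a minimal sketch of the intended ordering, assuming StackExchange.Redis with a placeholder endpoint, key, value and TTL (this is not the actual RedisCache source, which stores its entries differently):

```csharp
using System;
using StackExchange.Redis;

// Minimal sketch, not the actual RedisCache code: the point is only that the
// write must have completed before the expiration is applied to the key.
var mux = await ConnectionMultiplexer.ConnectAsync("localhost:6379"); // placeholder endpoint
IDatabase db = mux.GetDatabase();

await db.StringSetAsync("cache:item", "payload");                     // 1. the key exists...
await db.KeyExpireAsync("cache:item", TimeSpan.FromMinutes(20));      // 2. ...then gets a TTL
```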

💡 I don't know whether the change from scripts in 8.x to the two separate calls in 9.x has any other (unwanted) side effects.
Edit: e.g. storing the key but then failing to set the expiration - over time we'd "leak" keys.

Fixes #59211

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions label Nov 28, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Nov 28, 2024
tiakun commented Nov 29, 2024

I think it should be wrapped in CreateBatch instead, or even CreateTransaction if we want to prevent race conditions across multiple clients.
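
For illustration only, a rough sketch of that idea, assuming an existing StackExchange.Redis IDatabase named `db` and placeholder key/value/TTL (this is not the PR's code):

```csharp
// Rough sketch, not the PR's code; `db` is an existing StackExchange.Redis IDatabase.

// CreateBatch keeps the two commands together on the wire and sends them as one unit:
IBatch batch = db.CreateBatch();
Task setTask = batch.StringSetAsync("cache:item", "payload");
Task expireTask = batch.KeyExpireAsync("cache:item", TimeSpan.FromMinutes(20));
batch.Execute();                                // releases the queued commands
await Task.WhenAll(setTask, expireTask);

// CreateTransaction additionally wraps them in MULTI/EXEC, so other clients cannot
// observe the key between the write and the expiry:
ITransaction tran = db.CreateTransaction();
_ = tran.StringSetAsync("cache:item", "payload");
_ = tran.KeyExpireAsync("cache:item", TimeSpan.FromMinutes(20));
bool committed = await tran.ExecuteAsync();     // false if the transaction was aborted
```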

mgravell (Member) commented Nov 29, 2024 via email

CaringDev (Author)

> So: concurrent is not (in this case) unordered

Ah, sorry, I must have missed something here.

> Are you actually seeing scenarios where this is causing a problem

Yes, I crashed our production Redis (pre-prod not having enough data / monitoring) by filling it with non-expiring data (we do set an expiration on all keys).

> Ironically, if you are seeing a problem, I had a conversation earlier this week where I posited a potential way this could happen, which is a SE.Redis fix, not an aspnetcore fix (I wear both hats).

With M.E.C.SER 8.0.11 and SE.R "standard" or pinned to latest this does not happen; with 9.0.0 it does with both. So I assumed M.E.C to be at fault.

mgravell (Member) commented Nov 29, 2024 via email

CaringDev (Author) commented Nov 29, 2024

> I will investigate and fix the offending article.

I'll gladly accept any solution, whatever the root cause is 😀
Let me know if I can help with more details / testing.
Interestingly enough, I can only reproduce the behavior on our final environment (Suse CloudFoundry), not locally on Windows, in WSL, or in aspnetcore-runtime-deps images.

mgravell (Member)

Interesting. Can I ask, since I'm not familiar with that service: is that vanilla Redis, or is it a Redis-alike? Any idea whether it has anything exotic going on like additional load balancing? I already planned to investigate this potential ordering issue (because: order matters), but I'm curious as to whether that product / service might also be a wildcard here.

CaringDev (Author)

@mgravell I added more details to #59211.
In particular, Redis (consumed by me "as a service") is deployed using the Bitnami charts. I could probably get the exact configuration, if that helps. Interestingly, the bug is only triggered when the client runs on said Suse CloudFoundry environment, and only with M.E.C.R @ 9.0.0, not with 8.0.11 (both regardless of SE.R version, i.e. transitive or pinned to latest).

mgravell (Member) commented Nov 30, 2024

M.E.C.R. V8 uses a fundamentally different approach that indeed won't be dependent on pipelined ordering. However, pipelined ordering is meant to be part of the SE.Redis contract. As I say, I think I know exactly what is wrong, and will look ASAP.

If I can confirm and resolve in SE.Redis, then the "fix" over here will probably just be a package-ref update. That way, we fix many more possible call paths.

@mgravell mgravell self-assigned this Nov 30, 2024
@mgravell mgravell added bug This issue describes a behavior which is not expected - a bug. feature-caching Includes: StackExchangeRedis and SqlServer distributed caches labels Nov 30, 2024
mgravell (Member)

/cc @NickCraver @philon-msft it might just be confirmation bias, but this ^^^ sounds like the ordering thing I mentioned in our call last week!

mgravell (Member) commented Dec 1, 2024

My first priority is making sure we understand the problem and root cause, so we can have confidence both in any mitigation and in any relevant misbehaving tools.

An initial check doesn't show a pathological problem - that isn't conclusive, however, and I'm using the "current" release (2.8.22) rather than the 2.7.27 that M.E.C.SER uses; I'll re-run in a moment with that version:

Results with 2.8.22:

Connected
Worker 0 completed 10000 passes of 1000 ops each; failures: 0
Worker 4 completed 10000 passes of 1000 ops each; failures: 0
Worker 1 completed 10000 passes of 1000 ops each; failures: 0
Worker 2 completed 10000 passes of 1000 ops each; failures: 0
Worker 8 completed 10000 passes of 1000 ops each; failures: 0
Worker 7 completed 10000 passes of 1000 ops each; failures: 0
Worker 5 completed 10000 passes of 1000 ops each; failures: 0
Worker 6 completed 10000 passes of 1000 ops each; failures: 0
Worker 3 completed 10000 passes of 1000 ops each; failures: 0
Worker 12 completed 10000 passes of 1000 ops each; failures: 0
Worker 10 completed 10000 passes of 1000 ops each; failures: 0
Worker 11 completed 10000 passes of 1000 ops each; failures: 0
Worker 9 completed 10000 passes of 1000 ops each; failures: 0
Worker 14 completed 10000 passes of 1000 ops each; failures: 0
Worker 16 completed 10000 passes of 1000 ops each; failures: 0
Worker 19 completed 10000 passes of 1000 ops each; failures: 0
Worker 15 completed 10000 passes of 1000 ops each; failures: 0
Worker 17 completed 10000 passes of 1000 ops each; failures: 0
Worker 18 completed 10000 passes of 1000 ops each; failures: 0
Worker 13 completed 10000 passes of 1000 ops each; failures: 0
Total ops: 200000000

Results with 2.7.27:

Connected
Worker 1 completed 10000 passes of 1000 ops each; failures: 0
Worker 5 completed 10000 passes of 1000 ops each; failures: 0
Worker 0 completed 10000 passes of 1000 ops each; failures: 0
Worker 9 completed 10000 passes of 1000 ops each; failures: 0
Worker 7 completed 10000 passes of 1000 ops each; failures: 0
Worker 3 completed 10000 passes of 1000 ops each; failures: 0
Worker 13 completed 10000 passes of 1000 ops each; failures: 0
Worker 6 completed 10000 passes of 1000 ops each; failures: 0
Worker 11 completed 10000 passes of 1000 ops each; failures: 0
Worker 4 completed 10000 passes of 1000 ops each; failures: 0
Worker 14 completed 10000 passes of 1000 ops each; failures: 0
Worker 8 completed 10000 passes of 1000 ops each; failures: 0
Worker 18 completed 10000 passes of 1000 ops each; failures: 0
Worker 12 completed 10000 passes of 1000 ops each; failures: 0
Worker 15 completed 10000 passes of 1000 ops each; failures: 0
Worker 17 completed 10000 passes of 1000 ops each; failures: 0
Worker 10 completed 10000 passes of 1000 ops each; failures: 0
Worker 2 completed 10000 passes of 1000 ops each; failures: 0
Worker 16 completed 10000 passes of 1000 ops each; failures: 0
Worker 19 completed 10000 passes of 1000 ops each; failures: 0
Total ops: 200000000

I'm going to have to dig deeper.

mgravell (Member) commented Dec 2, 2024

My test rig added over here; at the current time I believe I was wrong about SE.Redis having a potential out-of-order glitch; neither aggressive testing nor code inspection have supported that idea. Right now, I am unable to reproduce this symptom.

mgravell (Member) left a comment

For completeness: my hesitation here is that this (the proposed change) introduces an extra round trip of latency in all scenarios with a TTL, which should typically be just about all of them. For local servers that isn't a huge problem, but for remote servers it can be significant. The existing implementation was designed specifically to avoid this extra latency by using the pipelined API scenario explicitly exposed by SE.Redis - the short version being: this should already be correctly ordered.

Before we jump in and look at fixes here, I'm very keen to prove the actual problem, to make sure we're fixing the right thing in the most appropriate way.
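
For context, a sketch of the two shapes being weighed here, given an existing StackExchange.Redis IDatabase named `db` and placeholder key/value/TTL (not the actual implementation):

```csharp
// Pipelined shape: both commands are issued before either is awaited, so they
// share a single round trip, and SE.Redis is expected to preserve issue order.
Task setPending = db.StringSetAsync("cache:item", "payload");
Task expirePending = db.KeyExpireAsync("cache:item", TimeSpan.FromMinutes(20));
await Task.WhenAll(setPending, expirePending);

// Sequential shape (roughly what the proposed change does): the expiry is not
// sent until the write has completed - ordering by construction, at the cost of
// an extra round trip.
await db.StringSetAsync("cache:item", "payload");
await db.KeyExpireAsync("cache:item", TimeSpan.FromMinutes(20));
```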

tiakun commented Dec 4, 2024

> My test rig added over here; at the current time I believe I was wrong about SE.Redis having a potential out-of-order glitch; neither aggressive testing nor code inspection have supported that idea. Right now, I am unable to reproduce this symptom.

I inspected WriteMessageTakingWriteLockAsync for some hours and came to the same conclusion. A message either goes into the queue or the lock must be taken, which prevents a message from being sent before the queue has been completely processed. This enforces ordering when the WriteMessageTakingWriteLockAsync calls come from the same thread, because queueing and sending are synchronous, and the FlushAsync call is already handled carefully so that the lock is not released before the flush has really finished.

I guess you are already very familiar with the above, but I hope it helps.

I also ran your test on .NET 8 with 2.8.22 and the main branch of SE.Redis against a local Redis server on WSL2, and it produces the same result as yours.

mgravell (Member) commented Dec 4, 2024

That was indeed my analysis (and yes, I'm very familiar with this code). The only reason I was suspicious in the first place is that in some related code (I'm playing with a rewrite of the IO core for SE.Redis) I approached it differently - basically "take the queue lock, try to take the write lock; if we take the write lock, release the queue lock and write - otherwise enqueue" - but from the perspective of a single caller thread, I'm satisfied that the result is the same and no queue jumping can occur.

Hence my questions about the underlying server; maybe a non-compliant server, or a RESP-aware proxy that splits traffic without considering pipelines.
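
Purely as an illustration of that single-writer ordering argument, a simplified, hypothetical sketch (names like `OrderedWriter`, `queueGate` and `writeGate` are invented; this is not SE.Redis source, and error handling is omitted):

```csharp
using System.Collections.Generic;
using System.Threading;

// Hypothetical, simplified sketch - not SE.Redis internals. A writer either takes
// the write gate and sends directly, or enqueues under the queue gate; the write
// gate is only released once the backlog is empty, so from the point of view of a
// single caller thread a later message cannot overtake an earlier one.
class OrderedWriter
{
    private readonly object queueGate = new();      // guards the backlog
    private readonly object writeGate = new();      // guards the outbound connection
    private readonly Queue<string> backlog = new(); // pending messages, FIFO

    public void Write(string message)
    {
        lock (queueGate)
        {
            if (!Monitor.TryEnter(writeGate))
            {
                backlog.Enqueue(message);           // current holder drains this in order
                return;
            }
        }                                            // queue gate released before the write

        Send(message);

        while (true)                                 // drain, then release the write gate
        {
            string? next;
            lock (queueGate)
            {
                if (!backlog.TryDequeue(out next))
                {
                    Monitor.Exit(writeGate);         // released only once the backlog is empty
                    return;
                }
            }
            Send(next!);
        }
    }

    private void Send(string message)
    {
        // write to the socket (omitted in this sketch)
    }
}
```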

Contributor

Looks like this PR hasn't been active for some time and the codebase could have been changed in the meantime.
To make sure no conflicting changes have occurred, please rerun validation before merging. You can do this by leaving an /azp run comment here (requires commit rights), or by simply closing and reopening.

@dotnet-policy-service dotnet-policy-service bot added the pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun label Dec 11, 2024

Successfully merging this pull request may close these issues.

Items with configured expiration stored in IDistributedCache (using RedisCache) end up non-expiring in Redis