Fix race condition in RedisCache when setting key expiration #59212
base: main
Conversation
I think it should be wrapped in CreateBatch instead, or even CreateTransaction if we want to prevent race conditions from multiple clients.
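For illustration, a minimal sketch of what that suggestion could look like with StackExchange.Redis (the helper names and the key/value/TTL parameters are placeholders, not the actual RedisCache code):

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

// Sketch only -- illustrates the two suggested shapes, not the real implementation.
static class BatchOrTransactionSketch
{
    // CreateBatch: both commands are buffered and flushed together, preserving
    // their order on the connection and costing roughly one round trip.
    public static async Task SetViaBatchAsync(IDatabase db, string key, string value, TimeSpan ttl)
    {
        IBatch batch = db.CreateBatch();
        Task set = batch.StringSetAsync(key, value);
        Task expire = batch.KeyExpireAsync(key, ttl);
        batch.Execute();
        await Task.WhenAll(set, expire);
    }

    // CreateTransaction: MULTI/EXEC, so commands from other clients cannot
    // interleave between the SET and the EXPIRE.
    public static async Task SetViaTransactionAsync(IDatabase db, string key, string value, TimeSpan ttl)
    {
        ITransaction tran = db.CreateTransaction();
        _ = tran.StringSetAsync(key, value);   // completes only after ExecuteAsync
        _ = tran.KeyExpireAsync(key, ttl);
        await tran.ExecuteAsync();
    }
}
```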
So: concurrent is not (in this case) unordered. Are you actually seeing scenarios where this is causing a problem, or is this speculative?
Ironically, if you *are* seeing a problem, I had a conversation earlier this week where I posited a potential way this could happen, which is a SE.Redis fix, not an aspnetcore fix (I wear both hats).
Re the CreateBatch: that shouldn't really matter here. Re CreateTransaction: we'd really rather not if we can help it, for max compatibility with different servers.
Ah, sorry, I must have missed something here.
Yes, I crashed our production Redis (pre-prod not having enough data / monitoring) by filling it with non-expiring data (we do set an expiration on all keys).
With M.E.C.SER 8.0.11 and SE.R either "standard" or pinned to latest, this does not happen; with 9.0.0 it does with both. So I assumed M.E.C to be at fault.
Based on way too much knowledge of both, I disagree; I suspect M.E.C.R. is using the API as advertised, and SE.Redis is failing in a niche case. It is moot, though. I will investigate and fix the offending article.
I'll gladly accept any solution, whatever the root cause turns out to be 😀
Interesting. Can I ask, since I'm not familiar with that service: is that vanilla Redis, or is it a Redis-alike? Any idea whether it has anything exotic going on like additional load balancing? I already planned to investigate this potential ordering issue (because: order matters), but I'm curious as to whether that product / service might also be a wildcard here.
@mgravell I added more details to #59211
M.E.C.R. V8 uses a fundamentally different approach that indeed won't be dependent on pipelined ordering. However, pipelined ordering is meant to be part of the SE.Redis contract. As I say, I think I know exactly what is wrong, and will look ASAP. If I can confirm and resolve in SE.Redis, then the "fix" over here will probably just be a package-ref update. That way, we fix many more possible call paths.
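For context, a rough sketch of that script-based shape (the actual 8.x implementation uses a different script and a hash layout; this only illustrates why ordering between two separate commands never comes into play there):

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

// Rough sketch only: the whole set-plus-expire happens in one server-side
// script evaluation, i.e. one round trip and no interleaving to worry about.
static class ScriptedCacheSketch
{
    private const string SetWithExpiry = @"
redis.call('SET', KEYS[1], ARGV[1])
redis.call('EXPIRE', KEYS[1], ARGV[2])
return 1";

    public static Task SetAsync(IDatabase db, string key, string value, TimeSpan ttl)
        => db.ScriptEvaluateAsync(
            SetWithExpiry,
            new RedisKey[] { key },
            new RedisValue[] { value, (long)ttl.TotalSeconds });
}
```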
/cc @NickCraver @philon-msft it might just be confirmation bias, but this ^^^ sounds like the ordering thing I mentioned in our call last week! |
My first priority is making sure we understand the problem and root cause, so we can have confidence both in any mitigation and in any relevant misbehaving tools. An initial check doesn't show a pathological problem - that isn't conclusive, however, plus I'm using "current" (2.8.22) rather than the 2.7.27 which M.E.C.SER uses; I will re-run in a moment with that version.
Results with 2.8.22
Results with 2.7.27
I'm going to have to dig deeper.
My test rig is added over here; at the current time I believe I was wrong about SE.Redis having a potential out-of-order glitch; neither aggressive testing nor code inspection has supported that idea. Right now, I am unable to reproduce this symptom.
For completeness: my hesitation here is that this (the proposed change) introduces an extra round-trip latency in all scenarios with TTL, which should typically be just about all of them. For local servers, that isn't a huge problem, but for remote servers: this can be significant. The existing implementation was designed specifically to avoid this extra latency, by using the pipelined API scenario explicitly exposed by SE.Redis - short version being: this should already be correctly ordered.
Before we jump in and look at fixes here, I'm very keen to prove the actual problem, to make sure we're fixing the right thing in the most appropriate way.
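To make the latency point concrete, a simplified sketch of the two shapes being compared (placeholder names; this is not the actual M.E.C.SER code):

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

// Simplified sketch only, to illustrate the round-trip difference.
static class LatencySketch
{
    // Pipelined: both commands are issued without waiting for the first reply,
    // so the pair costs roughly one round trip; SE.Redis is expected to keep
    // them in order on a single connection.
    public static async Task PipelinedAsync(IDatabase db, string key, string value, TimeSpan ttl)
    {
        Task set = db.StringSetAsync(key, value);
        Task expire = db.KeyExpireAsync(key, ttl);
        await Task.WhenAll(set, expire);
    }

    // Sequential (roughly the shape proposed in this PR): the EXPIRE is not
    // sent until the SET reply has come back, i.e. about two round trips.
    public static async Task SequentialAsync(IDatabase db, string key, string value, TimeSpan ttl)
    {
        await db.StringSetAsync(key, value);
        await db.KeyExpireAsync(key, ttl);
    }
}
```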
I inspected WriteMessageTakingWriteLockAsync for some hours and reached the same conclusion. A message either goes into the queue or the lock must be taken, which prevents a message from being sent before the queue has been completely processed. This enforces ordering when the WriteMessageTakingWriteLockAsync calls come from the same thread, because queueing and sending are synchronous, and the FlushAsync call is already handled so that the lock is not released before the flush has really finished. I suspect you know all of the above already, but I hope it helps. I also ran your test on .NET 8 with 2.8.22 and the main branch of SE.Redis against a local Redis server on WSL2, and it produced the same result as yours.
That was indeed my analysis (and yes, I'm very familiar with this code). The only reason I was suspicious in the first place is that in some related code (I'm playing with a rewrite of the IO core for SE.Redis) I approached it differently, basically "take the queue lock, try to take the write lock; if we take the write lock, release the queue lock and write - otherwise enqueue" - but from the perspective of a single caller thread, I'm satisfied that the result is the same and no queue jumping can occur. Hence my questions about the underlying server; maybe a non-compliant server, or a RESP-aware proxy that splits traffic without considering pipelines.
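A loose sketch of that "enqueue or take the write lock" pattern (illustrative only, nowhere near the real SE.Redis internals), just to show why a single caller's writes cannot jump the queue:

```csharp
using System.Collections.Generic;
using System.Threading;

// Illustrative only: a stripped-down version of the pattern described above,
// not the real WriteMessageTakingWriteLockAsync.
sealed class OrderedWriterSketch
{
    private readonly object _queueLock = new();
    private readonly Queue<string> _backlog = new();
    private readonly SemaphoreSlim _writeLock = new(1, 1);

    public void Write(string message)
    {
        lock (_queueLock)
        {
            // If another caller holds the write lock, queue behind it; the
            // holder drains the backlog before releasing, so order is kept.
            if (!_writeLock.Wait(0))
            {
                _backlog.Enqueue(message);
                return;
            }
        }

        WriteToConnection(message);
        DrainAndRelease();
    }

    private void DrainAndRelease()
    {
        while (true)
        {
            string next;
            lock (_queueLock)
            {
                if (_backlog.Count == 0)
                {
                    // Release under the queue lock, so a concurrent Write cannot
                    // enqueue after we have decided to stop draining.
                    _writeLock.Release();
                    return;
                }
                next = _backlog.Dequeue();
            }
            WriteToConnection(next);
        }
    }

    private static void WriteToConnection(string message)
    {
        // Placeholder for the actual socket write / flush.
    }
}
```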
Looks like this PR hasn't been active for some time, and the codebase may have changed in the meantime.
Fix race condition in RedisCache when setting key expiration
Ensure the key is set before setting expiration.
Description
Currently, the key is written and its expiration is set (or rather, attempted) concurrently.
Oftentimes, the key has not yet been written by the time the attempt to set the expiration is made. The key is then left non-expiring, which can exhaust Redis storage and cause logical errors in applications that rely on key expiry.
💡 I don't know whether the change from scripts in 8.x to the two separate calls in 9.x has any other (unwanted) side effects.
Edit: e.g. storing the key but then failing to set the expiration; over time we'd "leak" keys.
Fixes #59211