Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delta xds #152

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from
Draft

Delta xds #152

wants to merge 3 commits into from

Conversation

sschepens
Copy link
Contributor

@sschepens sschepens commented Oct 23, 2020

Working implementation of Incremental XDS.
Has been running in production for several months primarily using DeltaXDS over ADS.
This has no tests, but current tests should be running

Couple of implementation comments:

  • This implementation does not support resource aliases, only names.
  • Snapshot API has been broken to force clients to wrap their resources in a SnapshotResource which contains the given resources and its version
  • Snapshot global version is still maintained and no changes will be processed if it doesn't change, even though internal resources may have changed.
  • DeltaDiscoveryRequestStreamObservers keeps track of the clients known resource versions as well as pending resources. This resource knowledge is also handed to DeltaWatches as copies.
  • Client known Resource versions are updated whenever a client acknowledges a response sent by the control plane considering only the resources versions sent in that response.
  • Delta responses do not provide the client global version, so global snapshot versions are inferred when clients acknowledge a response for a version.
  • Delta requires some type urls to be handled in a "Wildcard Mode", currently this is hardcoded to CDS and LDS. If a Stream is in wildcard mode, it will receive all new resources even without needing to be subscribed.
  • When setting a new snapshot, to prevent full snapshot diff with every watch which can be very expensive, a diff is calculated with the previous snapshot. Only resources that have changed their version or been deleted are then used to notify watches.
  • A watch is responded only if a new resource is added, a resource it knows has been updated been removed.
  • Special conditions have been added to prevent having more than one in flight response for a given TypeURL, even though this is supported by envoy and the protocol, it introduces more complexity and can lead to subtle issues.
  • Even though Snapshot version may have changed, it may be that internal resources have not changed and so, no notifications will occur.

I don't have time to add enough tests to get this merged, but i can tell you this is battle tested.
Most of the design decisions specified above were taken to solve actual issues while handling delta in production, we have huge configs with 9k+ clusters and constant changes and we did find edge cases to fix.
Delta is significantly different than SOTW and thoroughly testing it can be tricky.

I'm leaving this open in a Draft state, maybe someone can pick this up and get it merged.

@codecov-io
Copy link

codecov-io commented Oct 23, 2020

Codecov Report

Merging #152 into master will decrease coverage by 27.00%.
The diff coverage is 10.98%.

Impacted file tree graph

@@              Coverage Diff              @@
##             master     #152       +/-   ##
=============================================
- Coverage     88.02%   61.02%   -27.01%     
- Complexity      296      303        +7     
=============================================
  Files            31       39        +8     
  Lines           977     1465      +488     
  Branches         78      119       +41     
=============================================
+ Hits            860      894       +34     
- Misses           85      538      +453     
- Partials         32       33        +1     
Impacted Files Coverage Δ Complexity Δ
...o/envoyproxy/controlplane/cache/DeltaResponse.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...a/io/envoyproxy/controlplane/cache/DeltaWatch.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...envoyproxy/controlplane/cache/DeltaXdsRequest.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...proxy/controlplane/cache/GroupCacheStatusInfo.java 50.00% <0.00%> (-16.67%) 2.00 <0.00> (ø)
...ava/io/envoyproxy/controlplane/cache/Snapshot.java 100.00% <ø> (ø) 5.00 <0.00> (ø)
...server/AdsDeltaDiscoveryRequestStreamObserver.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...lane/server/AdsDiscoveryRequestStreamObserver.java 96.77% <ø> (-0.11%) 16.00 <0.00> (ø)
...ne/server/DeltaDiscoveryRequestStreamObserver.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
.../controlplane/server/DiscoveryServerCallbacks.java 14.28% <0.00%> (-5.72%) 1.00 <0.00> (ø)
...trolplane/server/LatestDeltaDiscoveryResponse.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
... and 22 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d71b9a2...f8bf24d. Read the comment docs.

Signed-off-by: Sebastian Schepens <[email protected]>
Signed-off-by: Sebastian Schepens <[email protected]>
Signed-off-by: Sebastian Schepens <[email protected]>
@jakubdyszkiewicz
Copy link
Contributor

Nice! Do you have any insights on the performance improvements on SOTW vs Delta?

@sschepens
Copy link
Contributor Author

sschepens commented Oct 26, 2020

@jakubdyszkiewicz It really depends on the use case. For example:

  • If using ADS, it would see a huge jump in performance since only changed resources would get propagated.
  • If not using ADS, CDS, LDS and RDS updates would get huge performance benefits, since only real changes are propagated. EDS is a different story, when not using ADS EDS generates a new stream for every Cluster and this is still true in Delta, so you wouldn't get much difference there, but still, only updated cluters would get updated so there would still be a lot of benefit unless you were using per cluster versioning of EDS in which case you could probably see little to no difference.

The performance benefit is really hard to measure, in our use case we were using non ADS SOTW with per cluster versioning which is probably the best performing case you can make of SOTW and we managed to reduce about 50% the load on our Control Plane, the amount of updates pushed from our Control Plane went from 1.5M changes per minute to 25k per minute.

But not only Control Plane performance was improved, also envoy was using less cpu when CDS was updated for example, which in our use case is huge since we have 9k+ clusters.

The key take away is:

  • If you're pushing lots of updates constantly, then Delta will be a huge performance gain
  • If you're not using ADS, with Delta you probably should, at least for EDS.

Base automatically changed from master to main January 16, 2021 21:52
@mikegajda
Copy link

What's the state of this and are there plans to get this merged in? @sschepens how has this been working out in production for you?

I'm working on a project that could stand to benefit from this, and could potentially offer to write the tests here, depending on what the plan is for this.

@sschepens
Copy link
Contributor Author

What's the state of this and are there plans to get this merged in? @sschepens how has this been working out in production for you?

@mikegajda this has been running in production for 6+ months and is working like a charm.

I'm working on a project that could stand to benefit from this, and could potentially offer to write the tests here, depending on what the plan is for this.

That would be great! I think what we would need is Design input from other people, since this breaks existing APIs, it would be great to see if this change is the best or if some other ideas come about.

@slonka
Copy link
Member

slonka commented Feb 22, 2021

We would love to see this implemented. @mikegajda Allegro's OSS Control Plane can act as a second level of tests (we depend on JCP and have integration and resiliency tests). If you need help getting started reach out to me directly on Envoy's slack.

@mikegajda
Copy link

Thanks for the comment @slonka and pointing me at the allegro control plane stuff, I'll reach out on slack if I run into anything

@mikegajda
Copy link

👋 I have been working here on a version of this that has some testing (working on more) and changes how versions are generated, would love to hear some thoughts/comments about this approach: HubSpot#3

@slonka slonka mentioned this pull request May 19, 2021
@inssein
Copy link

inssein commented Feb 6, 2024

Unless I am missing something, the work from this PR is already merged and we should close this.

@mikegajda
Copy link

Yes, this can be closely now, thanks @inssein for noticing that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants