
[Design] Semantic Caching #30

Open · 4 tasks
missBerg opened this issue Dec 5, 2024 · 1 comment
Labels: api (Control Plane API design), enhancement (New feature or request)

Comments

missBerg (Contributor) commented Dec 5, 2024

Produce a design proposal for the feature Semantic Caching.
Include:

  • Motivation
  • Feature Definition
  • Control Plane API
  • Technical Implementation Proposal

This issue was created from a conversation during the Dec 5th Community meeting:
https://docs.google.com/document/d/10e1sfsF-3G3Du5nBHGmLjXw5GVMqqCvFDqp_O65B0_w/edit?tab=t.0#bookmark=id.l5miyf5qkodx

missBerg added the enhancement (New feature or request) and api (Control Plane API design) labels on Dec 5, 2024
Krishanx92 commented Dec 12, 2024

Motivation

Semantic caching in an LLM Gateway reduces API costs by reusing responses for semantically similar requests, minimizing expensive LLM calls. Traditional exact-match caching fails when queries differ even slightly in wording; semantic caching instead matches on intent using vector embeddings. This significantly improves response times, since cache hits are served in milliseconds rather than waiting on LLM processing. By eliminating redundant LLM requests, semantic caching improves cost efficiency and token usage, yielding higher cache hit rates and making the LLM Gateway more efficient overall.

Key Benefits

  • Latency Reduction: Quicker response times, since LLM inference is bypassed on cache hits.
  • Cost Efficiency: Reduced computation costs for API-based LLMs.
  • Improved Scalability: Handles higher throughput by avoiding redundant computations.
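To make the lookup path concrete, here is a minimal sketch in Go (not part of any proposal in this issue). It assumes request embeddings are already produced by an upstream embedding model; the `SemanticCache` type, the linear scan, and the 0.9 similarity threshold are all illustrative placeholders, and a real gateway would likely use an approximate nearest-neighbor index and a tuned threshold.

```go
package main

import (
	"fmt"
	"math"
)

// entry pairs a request's embedding with its cached LLM response.
type entry struct {
	embedding []float64
	response  string
}

// SemanticCache returns a cached response when a new request's
// embedding is within a similarity threshold of a stored one.
type SemanticCache struct {
	threshold float64
	entries   []entry
}

// cosine computes cosine similarity between two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// Lookup scans for the most similar cached embedding. A production
// cache would use an ANN index (e.g. HNSW) instead of a linear scan.
func (c *SemanticCache) Lookup(embedding []float64) (string, bool) {
	best, bestScore := "", -1.0
	for _, e := range c.entries {
		if s := cosine(embedding, e.embedding); s > bestScore {
			best, bestScore = e.response, s
		}
	}
	if bestScore >= c.threshold {
		return best, true // cache hit: skip the LLM call
	}
	return "", false // cache miss: forward to the LLM, then Store
}

// Store records a response under the request's embedding.
func (c *SemanticCache) Store(embedding []float64, response string) {
	c.entries = append(c.entries, entry{embedding, response})
}

func main() {
	cache := &SemanticCache{threshold: 0.9}
	cache.Store([]float64{0.1, 0.9, 0.2}, "Paris is the capital of France.")

	// A semantically similar query yields a nearby embedding,
	// so it hits the cache even though the wording differs.
	if resp, ok := cache.Lookup([]float64{0.12, 0.88, 0.21}); ok {
		fmt.Println("cache hit:", resp)
	} else {
		fmt.Println("cache miss: call the LLM")
	}
}
```

The design choice that matters here is the similarity threshold: set too low, semantically different queries share answers incorrectly; set too high, the cache degenerates into exact matching and the hit rate drops.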

[Diagram attachment: sementic-cache.drawio]

Will provide detailed documentation for this.
