AI API Caching Strategies: Reduce LLM Costs by 60%+
Caching is the highest-ROI cost optimization technique for AI APIs. A well-implemented cache can eliminate 30-70% of your API calls entirely: zero cost, zero latency penalty on cache hits. This guide covers three caching strategies with real implementation examples and cost breakdowns.
Why Caching Works So Well for LLMs
Most AI API workloads have significant request overlap. A customer support bot sees the same questions repeatedly. A content generator processes similar prompts. A classification pipeline handles recurring patterns. Every duplicate request that hits your cache instead of the API is pure savings.
A SaaS company processing 15,000 chat requests/day implemented exact-match caching and immediately reduced their API bill from $450/month to $210/month, a 53% reduction with zero quality loss.
The key insight: LLM APIs charge per token. If you can serve a response from cache, you pay nothing for that request. Even a 30% cache hit rate removes roughly 30% of your API spend, assuming cached and uncached requests cost about the same on average.
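To make the arithmetic concrete, here is a small sketch. It assumes every request costs about the same, so the hit rate maps directly onto the bill; the function names are made up for this illustration.

```python
def monthly_cost_with_cache(baseline_cost: float, hit_rate: float) -> float:
    """Estimate monthly API spend when a fraction of requests is served from cache.

    Assumption: cached and uncached requests cost the same on average.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return baseline_cost * (1.0 - hit_rate)


def implied_hit_rate(cost_before: float, cost_after: float) -> float:
    """Back out the hit rate implied by a before/after bill, under the same assumption."""
    if cost_before <= 0:
        raise ValueError("cost_before must be positive")
    return 1.0 - cost_after / cost_before


# A $450/month bill at a 30% hit rate
print(monthly_cost_with_cache(450.0, 0.30))

# The $450 -> $210 example above implies a hit rate a little over one half
print(round(implied_hit_rate(450.0, 210.0), 3))
```

If your expensive requests (long prompts, long outputs) are also the ones that repeat, real savings will be higher than the hit rate alone suggests, and lower if they are not.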
Strategy 1: Exact-Match Caching
The simplest and most reliable caching approach. Store the full prompt + response. If the exact same request comes in again, return the cached result without calling the API.
How it works
- Hash the request (system prompt + user message + model + parameters)
- Check if hash exists in your cache store (Redis, SQLite, or even a dictionary)
- On hit: return cached response immediately
- On miss: call the API, store the response with the hash, return it
import hashlib
import json

import redis
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
r = redis.Redis()

def cached_completion(messages, model="gpt-4o-mini", **kwargs):
    # Create a cache key from the full request
    cache_input = json.dumps({"messages": messages, "model": model, **kwargs}, sort_keys=True)
    cache_key = f"llm:{hashlib.sha256(cache_input.encode()).hexdigest()}"

    # Check cache
    cached = r.get(cache_key)
    if cached is not None:
        return json.loads(cached)

    # Cache miss: call the API
    response = client.chat.completions.create(messages=messages, model=model, **kwargs)
    result = response.model_dump()

    # Store in cache (expire after 24 hours)
    r.setex(cache_key, 86400, json.dumps(result))
    return result
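For local testing, or when Redis is more than you need, the same steps can be sketched with a plain dictionary. InMemoryExactCache, get_or_call, and the call_api argument are names made up for this illustration; call_api stands in for whatever function actually calls your provider.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, Tuple


class InMemoryExactCache:
    """Exact-match cache backed by a plain dict, with per-entry expiry."""

    def __init__(self, ttl_seconds: float = 86400,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._store: Dict[str, Tuple[float, Any]] = {}

    @staticmethod
    def make_key(messages: list, model: str, **params: Any) -> str:
        # sort_keys makes the hash independent of dict ordering
        payload = json.dumps(
            {"messages": messages, "model": model, "params": params},
            sort_keys=True,
        )
        return "llm:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, call_api: Callable[..., Any], messages: list,
                    model: str, **params: Any) -> Any:
        key = self.make_key(messages, model, **params)
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # hit: entry exists and has not expired
        # Miss (or expired): call the provider and store the result
        result = call_api(messages=messages, model=model, **params)
        self._store[key] = (now + self._ttl, result)
        return result
```

A dictionary cache lives in one process and disappears on restart, so treat it as a starting point rather than a replacement for a shared store.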
When exact-match caching works best
- FAQ chatbots: same questions asked repeatedly (40-60% hit rates common)
- Template-based generation: same inputs produce same outputs
- Classification pipelines: identical documents reclassified
- Code completion: same context triggers same suggestions
When it falls short
Exact-match requires identical inputs. If users phrase the same question differently ("What's the price?" vs "How much does it cost?"), they won't match. That's where semantic caching comes in.
Strategy 2: Semantic Caching
Semantic caching matches requests by meaning, not exact text. "How do I reset my password?" and "I forgot my password, how do I fix it?" would both hit the same cache entry because they're semantically equivalent.
How it works
- Generate an embedding vector for each request using a cheap embedding model
- Store the embedding + response in a vector database
- On new requests, search for the nearest embedding (cosine similarity above a threshold)
- On hit: return cached response; On miss: call API, store embedding + response
from openai import OpenAI

client = OpenAI()

# vector_db is a placeholder for your vector store client (Pinecone, Weaviate,
# pgvector, ...). Its search() and insert() methods are illustrative, not a real
# library API; adapt them to the store you use.

def get_embedding(text):
    response = client.embeddings.create(input=text, model="text-embedding-3-small")
    return response.data[0].embedding

def semantic_cached_completion(messages, model="gpt-4o-mini",
                               similarity_threshold=0.92, **kwargs):
    # Combine messages into a single query string for embedding
    query_text = " ".join(m["content"] for m in messages)
    query_embedding = get_embedding(query_text)

    # Search vector DB for similar cached queries
    similar = vector_db.search(query_embedding, top_k=1,
                               threshold=similarity_threshold)
    if similar:
        return similar[0]["response"]

    # Cache miss
    response = client.chat.completions.create(
        messages=messages, model=model, **kwargs
    )
    result = response.model_dump()

    # Store embedding + response
    vector_db.insert(query_embedding, {
        "response": result,
        "query": query_text,
        "model": model
    })
    return result
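The vector_db object above is a placeholder. As a rough stand-in for experiments, a brute-force in-memory store with the same search and insert shape might look like the sketch below. InMemoryVectorStore is a hypothetical name, not a library class, and a linear scan like this only suits small caches.

```python
import math
from typing import Any, Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Linear-scan vector store for prototyping a semantic cache."""

    def __init__(self) -> None:
        self._entries: List[Tuple[List[float], Dict[str, Any]]] = []

    def insert(self, embedding: Sequence[float], payload: Dict[str, Any]) -> None:
        self._entries.append((list(embedding), payload))

    def search(self, embedding: Sequence[float], top_k: int = 1,
               threshold: float = 0.92) -> List[Dict[str, Any]]:
        # Score every stored entry, keep those at or above the threshold,
        # and return the best top_k with the score attached.
        scored = [
            (cosine_similarity(embedding, vector), payload)
            for vector, payload in self._entries
        ]
        matches = [item for item in scored if item[0] >= threshold]
        matches.sort(key=lambda item: item[0], reverse=True)
        return [{**payload, "score": score} for score, payload in matches[:top_k]]


vector_db = InMemoryVectorStore()
```

Once the cache grows past a few thousand entries, an indexed store such as the ones listed in the table below is the more realistic choice.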
Semantic caching trade-offs
| Factor | Exact-Match | Semantic |
|---|---|---|
| Hit rate | 20-40% | 40-65% |
| Quality risk | None (exact response) | Low (similar but not identical query) |
| Infrastructure | Redis or in-memory | Vector DB (Pinecone, Weaviate, pgvector) |
| Added latency | <1ms (hash lookup) | 5-20ms (embedding + vector search) |
| Cost to run | Near zero | Embedding cost (~$0.02/1M tokens) |
| Best for | FAQ bots, templates | Conversational, varied phrasing |
Tuning the similarity threshold
The threshold controls how "similar" queries need to be to count as a cache hit. Too low and you'll return wrong answers; too high and you'll miss real matches.
- 0.95+: Very conservative. Almost exact meaning. Low risk of wrong answers.
- 0.90-0.95: Balanced. Good hit rates with minimal quality risk.
- 0.85-0.90: Aggressive. Higher hit rates but risk of semantically different queries matching.
Start at 0.92 and adjust based on your quality metrics. For support bots where accuracy matters, stay above 0.93.
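One way to pick a threshold is to label a sample of query pairs by hand (same intent or not), compute their similarity scores, and see how each candidate threshold would have behaved. The sketch below assumes you already have such labeled pairs; sweep_thresholds and the sample scores are illustrative, not measurements.

```python
from typing import Dict, Iterable, List, Tuple


def sweep_thresholds(labeled_pairs: Iterable[Tuple[float, bool]],
                     thresholds: Iterable[float]) -> List[Dict[str, float]]:
    """For each threshold, report how many pairs would hit and how many hits are wrong.

    labeled_pairs: (similarity score, True if the two queries share an intent)
    """
    pairs = list(labeled_pairs)
    rows = []
    for threshold in thresholds:
        hits = [same_intent for score, same_intent in pairs if score >= threshold]
        wrong = sum(1 for same_intent in hits if not same_intent)
        rows.append({
            "threshold": threshold,
            "hit_rate": len(hits) / len(pairs) if pairs else 0.0,
            "wrong_hit_rate": wrong / len(hits) if hits else 0.0,
        })
    return rows


# Made-up sample: (similarity, same intent?)
sample = [(0.97, True), (0.94, True), (0.91, False), (0.88, True), (0.86, False)]
for row in sweep_thresholds(sample, [0.85, 0.90, 0.92, 0.95]):
    print(row)
```

A rising wrong_hit_rate as the threshold drops is the signal that you have gone too far.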
Strategy 3: Provider Prompt Caching
Both OpenAI and Anthropic now offer built-in prompt caching at the API level. The provider caches the prefix of your prompts and gives you a discount on subsequent requests that share the same prefix.
OpenAI Prompt Caching
- Automatically caches prompts of 1,024 tokens or more on supported models, including GPT-4o and GPT-4o mini
- Cached input tokens are billed at a discount that depends on the model (50% for the GPT-4o family at launch, larger on some newer models), so check the current pricing page
- Cache entries typically expire after 5-10 minutes of inactivity
- No code changes needed: it's automatic for supported models
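Because the caching is automatic, the usage block of each response is the practical way to confirm it is happening. Chat Completions responses report cached prompt tokens under usage.prompt_tokens_details.cached_tokens. The helper below is a sketch that works on the usage dictionary (for example from response.model_dump()["usage"]); cached_token_share is a made-up name and the sample numbers are illustrative.

```python
def cached_token_share(usage: dict) -> float:
    """Fraction of prompt tokens that the provider served from its prompt cache."""
    prompt_tokens = usage.get("prompt_tokens", 0) or 0
    details = usage.get("prompt_tokens_details") or {}
    cached_tokens = details.get("cached_tokens", 0) or 0
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0


# Illustrative usage block, not real API output
sample_usage = {
    "prompt_tokens": 2048,
    "completion_tokens": 120,
    "prompt_tokens_details": {"cached_tokens": 1920},
}
print(cached_token_share(sample_usage))
```

A share that stays at zero usually means the prompt is below the minimum length or the prefix changes between requests.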
Anthropic Prompt Caching
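- Caches prefixes that you mark with explicit cache breakpoints; the prefix must meet a model-dependent minimum length (1,024 tokens on many models, more on others)
- 90% discount on input tokens read from the cache; writing a prefix to the cache is billed at a premium over the base input price
- Cache TTL is 5 minutes (extended on each hit)
- Requires adding a cache_control parameter to mark the cache breakpoint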
# Anthropic prompt caching example
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6-20250514",  # substitute a model ID available to your account
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant for Acme Corp. [long system prompt...]",
            "cache_control": {"type": "ephemeral"}  # Cache this prefix
        }
    ],
    messages=[{"role": "user", "content": "What's the return policy?"}]
)
When provider caching helps most
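- Long system prompts: if your system prompt meets the provider's minimum cacheable length (1,024 tokens on many models), requests after the first pay the discounted cached rate on that prefix while the cache stays warm
- Multi-turn conversations: conversation history is the prefix; only new messages are full price
- RAG applications: large context documents are cached when they sit in a stable prefix ahead of the part of the prompt that changes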
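A rough way to estimate what prefix caching is worth for one of these workloads is to price the shared prefix separately from the part that changes. The sketch below assumes the cache stays warm after the first request; the function name, the $1 per million token price, and the discount and premium values in the example are placeholders, so plug in your provider's current numbers.

```python
def input_cost_with_prefix_cache(prefix_tokens: int, suffix_tokens: int,
                                 requests: int, price_per_million: float, *,
                                 cached_discount: float,
                                 write_premium: float = 0.0) -> float:
    """Input-token cost for `requests` calls that share one cached prefix.

    cached_discount: fraction taken off cached prefix tokens (0.9 means 90% off)
    write_premium:   extra fraction charged on the first, cache-writing request
    Assumption: every request after the first hits a warm cache.
    """
    if requests <= 0:
        return 0.0
    per_token = price_per_million / 1_000_000
    first = (prefix_tokens * (1 + write_premium) + suffix_tokens) * per_token
    later = (prefix_tokens * (1 - cached_discount) + suffix_tokens) * per_token
    return first + (requests - 1) * later


# Hypothetical workload: 2,000-token prefix, 100 new tokens per request, 1,000 requests
without_cache = input_cost_with_prefix_cache(2000, 100, 1000, 1.0, cached_discount=0.0)
with_cache = input_cost_with_prefix_cache(2000, 100, 1000, 1.0,
                                          cached_discount=0.9, write_premium=0.25)
print(round(without_cache, 4), round(with_cache, 4))
```

Output tokens are unaffected by prompt caching, so this only covers the input side of the bill.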
Combining Strategies for Maximum Savings
The best results come from layering multiple caching approaches. Here's a production architecture that combines all three:
- Layer 1: Exact-match cache (Redis). Fast, zero-cost lookups. Catches duplicate requests.
- Layer 2: Semantic cache (Vector DB). Catches paraphrased versions of cached queries.
- Layer 3: Provider prompt caching. Even on cache misses, the shared prompt prefix is billed at the provider's discounted cached-token rate.
On a cache miss at Layer 1, you check Layer 2. If both miss, the API call still benefits from provider caching on the prompt prefix. The result: you rarely pay full price for any request.
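The lookup order can be sketched as a single function that takes each layer as a callable. Everything here (layered_completion and its four arguments) is a hypothetical shape for illustration; in practice the lookups would wrap your Redis and vector store clients.

```python
from typing import Any, Callable, Optional, Tuple


def layered_completion(
    request: Any,
    exact_lookup: Callable[[Any], Optional[Any]],
    semantic_lookup: Callable[[Any], Optional[Any]],
    call_provider: Callable[[Any], Any],
    store: Callable[[Any, Any], None],
) -> Tuple[Any, str]:
    """Try each cache layer in order; fall through to the provider on a full miss.

    Returns the response and the name of the layer that produced it.
    """
    cached = exact_lookup(request)
    if cached is not None:
        return cached, "exact"

    cached = semantic_lookup(request)
    if cached is not None:
        return cached, "semantic"

    # Full miss: the provider call still benefits from prompt-prefix caching
    response = call_provider(request)
    store(request, response)
    return response, "provider"
```

Returning the layer name alongside the response makes it easy to log which layer is doing the work.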
Cache Invalidation: The Hard Part
Caching is easy. Invalidating caches correctly is where most teams struggle. Strategies:
- Time-based expiration (TTL): simplest approach. Set 1-24 hour TTLs based on how stale your data can be. Pricing data? 24 hours is fine. Real-time support? 5-15 minutes.
- Event-driven invalidation: when your data changes (new pricing, updated docs), invalidate affected cache entries. More complex but more precise.
- Version-prefixed keys: include a version number in cache keys, for example v2:hash(prompt). Bump the version to invalidate everything at once.
- Write-through caching: update cache and DB simultaneously on writes. Ensures consistency but adds write latency.
Rule of thumb: Use TTL for most cases. Only build event-driven invalidation if stale responses would cause user-facing issues.
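Version-prefixed keys and per-category TTLs are simple enough to sketch together. The names below (CACHE_VERSION, versioned_key, ttl_for) and the one-hour default are assumptions for illustration; the 24-hour and 10-minute values follow the ranges suggested above.

```python
import hashlib

CACHE_VERSION = "v2"  # hypothetical; bump to invalidate every existing entry at once

# Illustrative TTLs in seconds, keyed by the kind of data behind the answer
TTL_SECONDS = {
    "pricing": 24 * 60 * 60,  # slow-moving data: 24 hours
    "support": 10 * 60,       # real-time support: within the 5-15 minute range
}
DEFAULT_TTL_SECONDS = 60 * 60  # assumed fallback of one hour


def versioned_key(prompt: str, version: str = CACHE_VERSION) -> str:
    """Build a cache key of the form '<version>:<sha256 of prompt>'."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return f"{version}:{digest}"


def ttl_for(category: str) -> int:
    """Look up the TTL for a category, falling back to the default."""
    return TTL_SECONDS.get(category, DEFAULT_TTL_SECONDS)
```

Old-version entries are never read again after a bump, so they simply age out through their TTLs rather than needing an explicit delete.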
Measuring Cache Performance
Track these metrics to know if your caching is working:
| Metric | What It Tells You | Target |
|---|---|---|
| Hit rate | % of requests served from cache | 30%+ (exact), 50%+ (semantic) |
| Cost per request | Average API cost divided by total requests | Decreasing over time |
| Cache latency | Time to serve a cached response | <5ms (exact), <25ms (semantic) |
| Stale response rate | % of cached responses that were outdated | <1% |
| Cache size | Storage used by cache entries | Monitor growth, set eviction policies |
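Most of these metrics fall out of a handful of counters. A minimal sketch, assuming you can tell at serving time whether a cached response was stale; CacheStats and its method names are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Running counters for cache hit rate, cost per request, and staleness."""

    hits: int = 0
    misses: int = 0
    stale_hits: int = 0
    api_cost: float = 0.0

    def record_hit(self, stale: bool = False) -> None:
        self.hits += 1
        if stale:
            self.stale_hits += 1

    def record_miss(self, cost: float) -> None:
        # Only misses reach the API, so only misses add cost
        self.misses += 1
        self.api_cost += cost

    @property
    def total_requests(self) -> int:
        return self.hits + self.misses

    @property
    def hit_rate(self) -> float:
        return self.hits / self.total_requests if self.total_requests else 0.0

    @property
    def cost_per_request(self) -> float:
        return self.api_cost / self.total_requests if self.total_requests else 0.0

    @property
    def stale_response_rate(self) -> float:
        # Share of cached responses that turned out to be outdated
        return self.stale_hits / self.hits if self.hits else 0.0
```

Cache latency and cache size are better read from your cache store's own monitoring than tracked by hand.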
Implementation Checklist
- Identify which request types have high repetition (FAQ, templates, classification)
- Start with exact-match caching: it's the easiest win
- Measure your natural hit rate before optimizing
- Add semantic caching if exact-match hit rate is below 30%
- Enable provider prompt caching (automatic on OpenAI; add cache_control markers on Anthropic)
- Set appropriate TTLs based on data freshness requirements
- Monitor hit rate, cost per request, and stale response rate
- Implement cache invalidation for data-sensitive workloads
- Consider combining all three layers for maximum savings
- Use APIpulse to track your cost-per-request trends