How much can AI API caching save?

Caching can reduce AI API costs by 40-70%. Exact-match caching saves the most for repeated queries. Semantic caching works well for chatbots with similar questions.

What are the different types of AI API caching?

Three main types: exact-match caching (identical prompts return cached results), semantic caching (similar prompts return cached results), and prompt caching (provider-managed caching of prompt prefixes).

Which providers support prompt caching?

OpenAI, Anthropic, and Google all support prompt caching. Anthropic's prompt caching cuts the price of cached input tokens by 90% for repeated prompt prefixes. OpenAI applies caching automatically once a prompt is long enough to qualify.

AI API Caching Strategies: Reduce LLM Costs by 60%+

Strategy 2: Semantic Caching

Semantic caching matches requests by meaning, not exact text. "How do I reset my password?" and "I forgot my password, how do I fix it?" would both hit the same cache entry because they're semantically equivalent.

How it works

Generate an embedding vector for each request using a cheap embedding model
Store the embedding + response in a vector database
On new requests, search for the nearest embedding (cosine similarity above a threshold)
On hit: return cached response; On miss: call API, store embedding + response

from openai import OpenAI

client = OpenAI()

# vector_db is assumed to be a vector-store client you provide (Pinecone,
# Weaviate, pgvector, ...) that exposes search() and insert() as used below.

def get_embedding(text):
    response = client.embeddings.create(input=text, model="text-embedding-3-small")
    return response.data[0].embedding

def semantic_cached_completion(messages, model="gpt-4o-mini",
                                similarity_threshold=0.92, **kwargs):
    # Combine messages into a single query string for embedding
    query_text = " ".join(m["content"] for m in messages)
    query_embedding = get_embedding(query_text)

    # Search vector DB for similar cached queries
    similar = vector_db.search(query_embedding, top_k=1,
                                threshold=similarity_threshold)

    # Cache hit: only reuse a response that was produced by the same model
    if similar and similar[0]["model"] == model:
        return similar[0]["response"]

    # Cache miss: call the API and convert the response to a plain dict
    response = client.chat.completions.create(
        messages=messages, model=model, **kwargs
    )
    result = response.model_dump()

    # Store embedding + response so later paraphrases can hit this entry
    vector_db.insert(query_embedding, {
        "response": result,
        "query": query_text,
        "model": model
    })
    return result
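
The `vector_db` object used above is not defined in the snippet. For local experiments, a minimal in-memory stand-in might look like the sketch below. The class name `InMemoryVectorDB` and its brute-force search are assumptions made for illustration; a production setup would use one of the vector databases listed in the trade-offs table.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorDB:
    """Hypothetical toy vector store: linear scan, no persistence, no eviction."""

    def __init__(self):
        self._entries = []  # list of (embedding, payload) pairs

    def insert(self, embedding, payload):
        self._entries.append((list(embedding), payload))

    def search(self, embedding, top_k=1, threshold=0.0):
        # Score every stored embedding against the query, best first.
        scored = [(cosine_similarity(embedding, vec), payload)
                  for vec, payload in self._entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Keep only the top_k results that clear the similarity threshold.
        return [payload for score, payload in scored[:top_k] if score >= threshold]


vector_db = InMemoryVectorDB()
```

A linear scan is fine for a few thousand entries; beyond that, an indexed vector database is the more realistic choice.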

Semantic caching trade-offs

Factor	Exact-Match	Semantic
Hit rate	20-40%	40-65%
Quality risk	None (exact response)	Low (similar but not identical query)
Infrastructure	Redis or in-memory	Vector DB (Pinecone, Weaviate, pgvector)
Added latency	<1ms (hash lookup)	5-20ms (embedding + vector search)
Cost to run	Near zero	Embedding cost (~$0.02/1M tokens)
Best for	FAQ bots, templates	Conversational, varied phrasing

Tuning the similarity threshold

The threshold controls how "similar" queries need to be to count as a cache hit. Too low and you'll return wrong answers; too high and you'll miss real matches.

0.95+ — Very conservative. Almost exact meaning. Low risk of wrong answers.
0.90-0.95 — Balanced. Good hit rates with minimal quality risk.
0.85-0.90 — Aggressive. Higher hit rates but risk of semantically different queries matching.

Start at 0.92 and adjust based on your quality metrics. For support bots where accuracy matters, stay above 0.93.
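
One way to ground that adjustment is to score a small labeled sample of query pairs and see how hit rate and wrong matches move with the threshold. The sketch below assumes you already have similarity scores plus a human label for whether each pair shares the same intent; the function name `sweep_thresholds` and the sample data are hypothetical.

```python
def sweep_thresholds(scored_pairs, thresholds=(0.85, 0.90, 0.92, 0.95)):
    """scored_pairs: list of (similarity, same_intent) tuples from a labeled sample."""
    rows = []
    for t in thresholds:
        # Labels of every pair that would count as a cache hit at this threshold.
        hits = [same for sim, same in scored_pairs if sim >= t]
        wrong = sum(1 for same in hits if not same)
        rows.append({
            "threshold": t,
            "hit_rate": len(hits) / len(scored_pairs),
            "false_match_rate": wrong / len(hits) if hits else 0.0,
        })
    return rows


# Made-up sample purely for illustration.
sample = [(0.97, True), (0.93, True), (0.91, False),
          (0.88, True), (0.86, False), (0.80, False)]
for row in sweep_thresholds(sample):
    print(row)
```

In this toy sample, lowering the threshold from 0.92 to 0.90 raises the hit rate but lets in a pair with a different intent, which is the trade-off described above.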

Semantic caching impact (15,000 requests/day)

Without caching (GPT-4o) $675/mo

Exact-match only (35% hit rate) $439/mo

Semantic caching (55% hit rate) $304/mo

Total savings with semantic caching $371/mo (55% reduction)
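
The figures above follow from multiplying the uncached bill by the miss rate. The sketch below reproduces that arithmetic and adds the embedding overhead from the trade-offs table. The 100-tokens-per-query figure and the 30-day month are assumptions for illustration, not numbers from this article.

```python
def semantic_cache_monthly_cost(uncached_cost, hit_rate, requests_per_day,
                                tokens_per_query, embed_price_per_million=0.02,
                                days=30):
    """Return (api_cost, embed_cost), assuming cache hits cost nothing to serve."""
    api_cost = uncached_cost * (1 - hit_rate)
    # Every request is embedded, whether it ends up a hit or a miss.
    embed_tokens = requests_per_day * days * tokens_per_query
    embed_cost = embed_tokens / 1_000_000 * embed_price_per_million
    return api_cost, embed_cost


api_cost, embed_cost = semantic_cache_monthly_cost(
    uncached_cost=675.0, hit_rate=0.55,
    requests_per_day=15_000, tokens_per_query=100,  # tokens_per_query is assumed
)
print(round(api_cost, 2), round(embed_cost, 2))  # 303.75 0.9
```

Under these assumptions the embedding bill is under a dollar a month, so it barely moves the savings figure.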

Strategy 3: Provider Prompt Caching

Both OpenAI and Anthropic now offer built-in prompt caching at the API level. This automatically caches the prefix of your prompts and gives you a discount on subsequent requests that share the same prefix.

OpenAI Prompt Caching

Automatically caches prompts of 1,024 tokens or more on supported models, including GPT-4o and GPT-4o mini
Cached input tokens are billed at a discount that varies by model: 50% off on GPT-4o and GPT-4o mini, with larger discounts on some newer models
Cache entries typically expire after 5-10 minutes of inactivity
No code changes needed; it's automatic for supported models
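
Because the caching is automatic, the practical way to confirm it is working is to inspect the usage data on each response. The sketch below assumes you have the usage block as a plain dict (for example from `response.usage.model_dump()`) and reads the `cached_tokens` count reported under `prompt_tokens_details`; the helper name `cached_input_fraction` is hypothetical.

```python
def cached_input_fraction(usage):
    """Share of prompt tokens served from the provider cache (0.0 to 1.0).

    usage: dict form of a chat completion's usage block.
    """
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0) or 0
    return cached / prompt_tokens if prompt_tokens else 0.0


# Illustrative usage block, not real API output.
example_usage = {
    "prompt_tokens": 2000,
    "completion_tokens": 50,
    "prompt_tokens_details": {"cached_tokens": 1536},
}
print(cached_input_fraction(example_usage))  # 0.768
```

A fraction that stays at zero on repeated requests usually means the shared prefix is too short or changes between calls.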

Anthropic Prompt Caching

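Mark up to 4 cache breakpoints per request with explicit caching markers; each model has a minimum cacheable prompt length (1,024 tokens or more, depending on the model)
90% discount on input tokens read from the cache; writing a prefix to the cache is billed at a premium over the base input price
Cache TTL is 5 minutes by default (refreshed on each hit)
Requires adding a cache_control parameter to mark the cache breakpoint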

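# Anthropic prompt caching example
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    # Model ID as given in this article; confirm it against Anthropic's current model list
    model="claude-sonnet-4-6-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant for Acme Corp. [long system prompt...]",
            "cache_control": {"type": "ephemeral"}  # Cache everything up to this block
        }
    ],
    messages=[{"role": "user", "content": "What's the return policy?"}]
)

# usage reports cache_creation_input_tokens (prefix written to the cache)
# and cache_read_input_tokens (prefix served from the cache)
print(response.usage)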

When provider caching helps most

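Long system prompts: once the shared prefix reaches the provider's minimum cacheable length (1,024 tokens on OpenAI), requests that hit the cache pay the discounted rate for that prefix instead of full price
Multi-turn conversations: conversation history is the prefix; only new messages are full price
RAG applications: large context documents are cached as part of the prefix, provided they appear before the parts of the prompt that change between requests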

Provider prompt caching impact (1,000 requests/day with 800-token system prompt)

Without caching (GPT-4o, $2.50/M input) $600/mo

With provider caching (90% discount on prefix) $108/mo

Monthly savings $492/mo (82% reduction)

Combining Strategies for Maximum Savings

The best results come from layering multiple caching approaches. Here's a production architecture that combines all three:

Layer 1 — Exact-match cache (Redis). Fast, zero-cost lookups. Catches duplicate requests.
Layer 2 — Semantic cache (Vector DB). Catches paraphrased versions of cached queries.
Layer 3 — Provider prompt caching (Automatic). Even on cache misses, the system prompt prefix is cached by the provider at 90% off.

On a cache miss at Layer 1, you check Layer 2. If both miss, the API call still benefits from provider caching on the prompt prefix. The result: you rarely pay full price for any request.
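
The lookup order described above can be sketched as a single function. This is a minimal illustration, not a production design: the exact-match layer is a plain dict standing in for Redis, and `semantic_lookup`, `semantic_store`, and `call_api` are hypothetical callables you would supply. Provider prompt caching (Layer 3) needs no code here because it applies inside the API call itself.

```python
import hashlib
import json


def make_key(messages, model):
    """Deterministic exact-match key built from the model name and full message list."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def layered_completion(messages, model, exact_cache,
                       semantic_lookup, semantic_store, call_api):
    """Return (response, source), where source is 'exact', 'semantic', or 'api'."""
    key = make_key(messages, model)

    # Layer 1: exact-match lookup.
    if key in exact_cache:
        return exact_cache[key], "exact"

    # Layer 2: semantic lookup; promote hits so the same text is exact next time.
    cached = semantic_lookup(messages, model)
    if cached is not None:
        exact_cache[key] = cached
        return cached, "semantic"

    # Both layers missed: call the provider, then populate both caches.
    result = call_api(messages, model)
    exact_cache[key] = result
    semantic_store(messages, model, result)
    return result, "api"
```

Promoting semantic hits into the exact-match layer is a design choice: it saves an embedding call on repeats, at the cost of storing the response twice.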

Combined caching: 20,000 requests/day, GPT-4o, 600-token system prompt

No caching $900/mo

Exact-match only (35% hit rate) $585/mo

Exact + semantic (55% hit rate) $405/mo

All three layers (55% hit + provider caching on misses) $162/mo

Maximum savings $738/mo (82% reduction)
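
Working backward from these figures, the drop from $405 to $162 implies that requests missing both caches cost about 40% of their uncached price once provider caching is applied. The sketch below reproduces the arithmetic; the `miss_cost_factor` of 0.40 is inferred from the numbers above rather than stated by any provider, and it will vary with how much of each request is cacheable prefix.

```python
def layered_monthly_cost(uncached_cost, hit_rate, miss_cost_factor=1.0):
    """Monthly cost when cache hits are free and misses cost a fraction of full price."""
    return uncached_cost * (1 - hit_rate) * miss_cost_factor


no_cache = 900.0
exact_only = layered_monthly_cost(no_cache, 0.35)            # about 585
exact_plus_semantic = layered_monthly_cost(no_cache, 0.55)   # about 405
all_layers = layered_monthly_cost(no_cache, 0.55, 0.40)      # about 162 (factor inferred)

reduction = 1 - all_layers / no_cache
print(round(all_layers, 2), f"{reduction:.0%}")  # 162.0 82%
```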

Cache Invalidation: The Hard Part

Caching is easy. Invalidating caches correctly is where most teams struggle. Strategies:

Time-based expiration (TTL) — simplest approach. Set 1-24 hour TTLs based on how stale your data can be. Pricing data? 24 hours is fine. Real-time support? 5-15 minutes.
Event-driven invalidation — when your data changes (new pricing, updated docs), invalidate affected cache entries. More complex but more precise.
Version-prefixed keys — include a version number in cache keys: v2:hash(prompt). Bump the version to invalidate everything at once.
Write-through caching — update cache and DB simultaneously on writes. Ensures consistency but adds write latency.

Rule of thumb: Use TTL for most cases. Only build event-driven invalidation if stale responses would cause user-facing issues.
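
TTL expiration and version-prefixed keys combine naturally in one small wrapper. The sketch below is an in-memory illustration only; the class name `TTLCache` and the injectable `clock` parameter are assumptions made for this example, and a Redis deployment would rely on the server's own key expiry instead.

```python
import hashlib
import time


class TTLCache:
    """Hypothetical in-memory cache with per-entry expiry and version-prefixed keys."""

    def __init__(self, ttl_seconds, version="v1", clock=time.monotonic):
        self.ttl = ttl_seconds
        self.version = version  # bump to invalidate every existing entry at once
        self._clock = clock     # injectable so expiry can be tested without sleeping
        self._store = {}

    def _key(self, prompt):
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        return f"{self.version}:{digest}"

    def get(self, prompt):
        key = self._key(prompt)
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, prompt, value):
        self._store[self._key(prompt)] = (value, self._clock() + self.ttl)


cache = TTLCache(ttl_seconds=15 * 60, version="v2")
cache.set("What's the return policy?", "30 days with receipt.")
print(cache.get("What's the return policy?"))
```

Entries written under an old version are never read again after a bump, so they still need eviction or a TTL to free memory.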

Measuring Cache Performance

Track these metrics to know if your caching is working:

Metric	What It Tells You	Target
Hit rate	% of requests served from cache	30%+ (exact), 50%+ (semantic)
Cost per request	Average API cost divided by total requests	Decreasing over time
Cache latency	Time to serve a cached response	<5ms (exact), <25ms (semantic)
Stale response rate	% of cached responses that were outdated	<1%
Cache size	Storage used by cache entries	Monitor growth, set eviction policies
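
The first, second, and fourth metrics in the table can be derived from a handful of counters. The sketch below is one possible shape for such a tracker; the class name `CacheMetrics` and the idea of flagging stale hits at record time are assumptions, since how you detect staleness depends on your data.

```python
from dataclasses import dataclass


@dataclass
class CacheMetrics:
    """Hypothetical counters for hit rate, cost per request, and stale rate."""
    hits: int = 0
    misses: int = 0
    stale_hits: int = 0
    api_cost: float = 0.0

    def record_hit(self, stale=False):
        self.hits += 1
        if stale:
            self.stale_hits += 1

    def record_miss(self, cost):
        self.misses += 1
        self.api_cost += cost

    @property
    def total(self):
        return self.hits + self.misses

    @property
    def hit_rate(self):
        return self.hits / self.total if self.total else 0.0

    @property
    def cost_per_request(self):
        # Total API spend divided by all requests, cached or not.
        return self.api_cost / self.total if self.total else 0.0

    @property
    def stale_response_rate(self):
        # Share of cached responses later found to be outdated.
        return self.stale_hits / self.hits if self.hits else 0.0


metrics = CacheMetrics()
metrics.record_miss(cost=0.01)
metrics.record_hit()
print(metrics.hit_rate, metrics.cost_per_request)  # 0.5 0.005
```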

Implementation Checklist

Identify which request types have high repetition (FAQ, templates, classification)
Start with exact-match caching — it's the easiest win
Measure your natural hit rate before optimizing
Add semantic caching if exact-match hit rate is below 30%
Enable provider prompt caching (it's free and automatic)
Set appropriate TTLs based on data freshness requirements
Monitor hit rate, cost per request, and stale response rate
Implement cache invalidation for data-sensitive workloads
Consider combining all three layers for maximum savings
Use APIpulse to track your cost-per-request trends
