Description
Problem Statement
The current `GeminiModel` implementation does not support Gemini's explicit context caching feature, which provides up to a 90% cost reduction on cached tokens. While Gemini 2.5 models have implicit caching, it doesn't work reliably with Strands' request structure (the system prompt and tools are passed in `config` rather than `contents`).
Current behavior:
- Every request sends full system prompt + tools (e.g., 13,494 tokens)
- No visibility into cached tokens
- No control over cache lifecycle
- `cached_content_token_count` always returns `None`
Expected behavior:
- Ability to explicitly cache system prompt + tools
- 75-90% discount on cached tokens
- Cache visibility via `usage_metadata.cached_content_token_count` (see the snippet below)
- Cache lifecycle management (create, delete, TTL)
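For reference, this is roughly how the cached-token count surfaces in the raw google-genai SDK today; the request shape here is illustrative, not Strands' actual request:

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="...")

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What tools do you have available?",
    config=types.GenerateContentConfig(
        system_instruction="...large system prompt...",  # resent in full on every request today
    ),
)

# With implicit caching only, this is typically None for Strands-shaped requests.
print(response.usage_metadata.cached_content_token_count)
print(response.usage_metadata.prompt_token_count)
```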
Proposed Solution
Add explicit context caching support to `GeminiModel`, similar to how `BedrockModel` implements its `cache_prompt` parameter.
API design:

```python
from strands.models.gemini import GeminiModel

model = GeminiModel(
    model_id="gemini-2.5-flash",
    client_args={"api_key": "..."},
    enable_caching=True,  # Enable auto-caching
    cache_ttl="3600s",    # Cache TTL (default 1 hour)
)

# Or manual cache management
model.create_cache(system_prompt, tool_specs, ttl="7200s")
model.delete_cache()
```

Key features:
- Auto-cache creation: automatically creates a cache on the first request when `enable_caching=True`
- Cache validation: reuses the cache when the system prompt + tools match
- Visibility: exposes `cachedTokens` in `metadata.usage`
- Cache lifecycle: methods to create/delete/manage the cache (the underlying google-genai calls are sketched below)
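For context, here is a minimal sketch of what those methods would wrap in the google-genai SDK; `system_prompt`, `tool_declarations`, and `messages` are placeholders for illustration, not Strands symbols:

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="...")

system_prompt = "...large system prompt..."
tool_declarations = [
    types.Tool(function_declarations=[
        types.FunctionDeclaration(name="get_weather", description="Get weather for a city"),
    ])
]
messages = "Summarize the project status."

# Create an explicit cache holding the reusable prefix (system prompt + tools).
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        system_instruction=system_prompt,
        tools=tool_declarations,
        ttl="3600s",
    ),
)

# Later requests reference the cache instead of resending the prefix.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=messages,
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.usage_metadata.cached_content_token_count)  # now populated

# Tear the cache down when the agent session ends (or let the TTL expire).
client.caches.delete(name=cache.name)
```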
Implementation Details
Changes needed in `strands/models/gemini.py`:
- Add `enable_caching` and `cache_ttl` to `GeminiConfig`
- Add `create_cache()` and `delete_cache()` methods
- Modify `_format_request_config()` to accept a `cached_content` parameter
- Add cache validation logic in `_format_request()` (sketched below)
- Expose `cached_content_token_count` in metadata
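A rough sketch of what the validation step could look like; the helper names (`_cache_fingerprint`, `_ensure_cache`) and the attribute layout are my own illustration, not the actual GeminiModel internals:

```python
import hashlib
import json

def _cache_fingerprint(system_prompt, tool_specs) -> str:
    """Fingerprint the cacheable prefix so the cache is only reused when it matches."""
    payload = json.dumps({"system": system_prompt, "tools": tool_specs}, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()

def _ensure_cache(self, system_prompt, tool_specs):
    """Create (or recreate) the explicit cache whenever the prefix changes."""
    key = _cache_fingerprint(system_prompt, tool_specs)
    if getattr(self, "_cache_key", None) != key:
        if getattr(self, "_cache_name", None):
            self.delete_cache()  # prefix changed: the old cache no longer applies
        self.create_cache(system_prompt, tool_specs, ttl=self.config.get("cache_ttl", "3600s"))
        self._cache_key = key
    return self._cache_name  # passed to _format_request_config() as cached_content
```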
References
- [Gemini Context Caching Docs](https://ai.google.dev/gemini-api/docs/caching)
- [Python SDK Reference](https://googleapis.github.io/python-genai/)
- [Bedrock Cache Implementation](https://strandsagents.com/latest/documentation/docs/user-guide/concepts/model-providers/amazon-bedrock/) (for API consistency)
Alternative Solutions
- Do nothing: Users pay 5-10x more in token costs
- Rely on implicit caching: Unreliable, no visibility, no control
Additional Context
Tested implementation shows:
- 68% token reduction on real workload
- `cached_content_token_count` of 9,255 out of 13,564 total tokens
- Works with 30+ tools and complex system prompts
- Compatible with existing Strands agent loop
I'm happy to submit a PR with the implementation if this feature request is accepted.
Use Case
Agents with large system prompts or many tools (e.g., 30 tools = ~9K tokens) incur high costs on every request. For production workloads with 1,000+ messages/day, this becomes expensive quickly.
Example cost impact:
- Without caching: 13,564 tokens/msg × 30K msgs/month × $0.00000035 = $142/month
- With caching: ~4,300 uncached tokens/msg plus ~9,255 cached tokens/msg billed at a 75% discount × 30K msgs/month ≈ $69/month
- Savings: $73/month ($876/year); see the back-of-envelope check below
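A quick back-of-envelope check of those numbers, assuming $0.35 per 1M input tokens and a flat 75% discount on cached tokens (cache storage fees not included):

```python
PRICE_PER_TOKEN = 0.00000035   # $0.35 per 1M input tokens (assumed)
MSGS_PER_MONTH = 30_000
CACHED_DISCOUNT = 0.75         # cached tokens billed at 25% of the normal rate (assumed)

total_tokens = 13_564          # full prefix + message
cached_tokens = 9_255          # system prompt + tools served from the cache

without_caching = total_tokens * MSGS_PER_MONTH * PRICE_PER_TOKEN
with_caching = (
    (total_tokens - cached_tokens) + cached_tokens * (1 - CACHED_DISCOUNT)
) * MSGS_PER_MONTH * PRICE_PER_TOKEN

print(f"without caching: ${without_caching:,.0f}/month")                 # ~$142
print(f"with caching:    ${with_caching:,.0f}/month")                    # ~$70
print(f"savings:         ${without_caching - with_caching:,.0f}/month")  # ~$73
```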