"LLMs are better at writing code to call tools than at calling tools directly." β Cloudflare Code Mode Research
A comprehensive benchmark comparing Code Mode (code generation) against traditional function calling for LLM tool interactions. Code Mode executes 60% faster, uses 68% fewer tokens, and needs 88% fewer API round trips while maintaining equal validation accuracy.
| Metric | Regular Agent | Code Mode | Improvement |
|---|---|---|---|
| Average Latency | 11.88s | 4.71s | 60.4% faster |
| API Round Trips | 8.0 iterations | 1.0 iteration | 87.5% reduction |
| Token Usage | 144,250 tokens | 45,741 tokens | 68.3% savings |
| Success Rate | 6/8 (75%) | 7/8 (88%) | +13 points higher |
| Validation Accuracy | 100% | 100% | Equal accuracy |
Annual Cost Savings: $9,536/year at 1,000 scenarios/day (Claude Haiku pricing)
[View Full Results](docs/BENCHMARK_SUMMARY.md) | [Raw Data Tables](docs/RESULTS_DATA.md)
- Python 3.11+
- Anthropic API key (for Claude)
- Google API key (for Gemini, optional)
```bash
# Clone the repository
git clone <repository-url>
cd codemode_benchmark

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env and add your API keys
```

Then run the benchmark:

```bash
# Run full benchmark with Claude
make run

# Run with Gemini
python benchmark.py --model gemini

# Run specific scenario
python benchmark.py --scenario 1

# Run limited scenarios
python benchmark.py --limit 3
```

Repository layout:

```text
codemode_benchmark/
├── README.md                        # This file
├── benchmark.py                     # Main benchmark runner
├── requirements.txt                 # Python dependencies
├── Makefile                         # Convenient commands
│
├── agents/                          # Agent implementations
│   ├── __init__.py
│   ├── codemode_agent.py            # Code Mode (code generation)
│   ├── regular_agent.py             # Traditional function calling
│   ├── gemini_codemode_agent.py     # Gemini Code Mode
│   └── gemini_regular_agent.py      # Gemini function calling
│
├── tools/                           # Tool definitions
│   ├── __init__.py
│   ├── business_tools.py            # Accounting/invoicing tools
│   ├── accounting_tools.py          # Core accounting logic
│   └── example_tools.py             # Simple example tools
│
├── sandbox/                         # Secure code execution
│   ├── __init__.py
│   └── executor.py                  # RestrictedPython sandbox
│
├── tests/                           # Test files
│   ├── test_api.py
│   ├── test_scenarios.py            # Scenario definitions
│   └── ...
│
├── debug/                           # Debug scripts (development)
│   └── debug_*.py
│
├── docs/                            # Documentation
│   ├── BENCHMARK_SUMMARY.md         # Comprehensive analysis
│   ├── RESULTS_DATA.md              # Raw data tables
│   ├── QUICKSTART.md                # Quick start guide
│   ├── TOOLS.md                     # Tool API documentation
│   ├── CHANGELOG.md                 # Version history
│   └── GEMINI.md                    # Gemini-specific notes
│
└── results/                         # Benchmark results
    ├── benchmark_results_claude.json
    ├── benchmark_results_gemini.json
    ├── results.log
    └── results-gemini.log
```
Traditional function calling loops back through the model after every tool call:

```text
User Query → LLM → Tool Call #1 → Execute → Result
        ↓
LLM processes result → Tool Call #2 → Execute → Result
        ↓
[Repeat 5-16 times...]
        ↓
Final Response
```
Problems:
- Multiple API round trips
- Neural network processing between each tool call
- Context grows with each iteration
- High latency and token costs (see the loop sketch below)
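For context, this loop looks roughly as follows. This is a minimal sketch using the Anthropic Messages API; `tool_schemas` and `dispatch_tool` are placeholders for illustration, not the repo's actual code:

```python
import anthropic

client = anthropic.Anthropic()
messages = [{"role": "user", "content": "Record this month's expenses"}]

while True:
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=1024,
        tools=tool_schemas,  # JSON Schema definitions for each tool (placeholder)
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        break  # the model produced its final answer
    # Execute each requested tool and feed the results back: one full round trip
    messages.append({"role": "assistant", "content": response.content})
    tool_results = [
        {"type": "tool_result", "tool_use_id": block.id,
         "content": dispatch_tool(block.name, block.input)}  # placeholder dispatcher
        for block in response.content if block.type == "tool_use"
    ]
    messages.append({"role": "user", "content": tool_results})
```

Every iteration of this `while` loop is a separate API call whose prompt includes the entire conversation so far, which is why latency and token usage grow with the number of tool calls.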
Code Mode collapses the whole interaction into a single pass:

```text
User Query → LLM generates complete code → Executes all tools → Final Response
```
Advantages:
- Single code generation pass
- Batch multiple operations
- No context re-processing
- Natural programming constructs (loops, variables, conditionals)
Example:
The Regular Agent sees this as 3 separate tool calls:

```json
{"name": "create_transaction", "input": {"amount": 2500, ...}}
{"name": "create_transaction", "input": {"amount": 150, ...}}
{"name": "get_financial_summary", "input": {}}
```

Code Mode generates efficient code:
```python
expenses = [
    ("rent", 2500, "Monthly rent"),
    ("utilities", 150, "Electricity"),
]
for category, amount, desc in expenses:
    tools.create_transaction("expense", category, amount, desc)

summary = json.loads(tools.get_financial_summary())
result = f"Total: ${summary['summary']['total_expenses']}"
```

The benchmark includes 8 realistic business scenarios:
1. Monthly Expense Recording - Record 4 expenses and generate summary
2. Client Invoicing Workflow - Create 2 invoices, update status, summarize
3. Payment Processing - Create invoice, process partial payments
4. Mixed Income/Expense Tracking - 7 transactions with financial analysis
5. Multi-Account Management - Complex transfers between 3 accounts
6. Quarter-End Analysis - Simulate 3 months of business activity
7. Complex Multi-Client Invoicing - 3 invoices with partial payments (16 operations)
8. Budget Tracking - 14 categorized expenses with analysis
Each scenario includes automated validation to ensure correctness; a sketch of what such a check can look like follows.
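For illustration only, a validation check for the first scenario might recompute the expected totals from the tools' JSON output. The function name and amounts below are hypothetical, not the repo's actual test code:

```python
import json

def validate_monthly_expenses(tools) -> bool:
    """Hypothetical check: the recorded expenses must match the scenario's expected total."""
    summary = json.loads(tools.get_financial_summary())
    expected_total = 2500 + 150 + 300 + 80  # illustrative amounts for the 4 expenses
    return summary["summary"]["total_expenses"] == expected_total
```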
The Code Mode agent's core loop:

```python
from typing import Any, Dict

class CodeModeAgent:
    def run(self, user_message: str) -> Dict[str, Any]:
        # 1. Send the message with the tools API documentation
        response = self.client.messages.create(
            system=self._create_system_prompt(),  # Contains the tools API docs
            messages=[{"role": "user", "content": user_message}],
        )
        # 2. Extract the generated code
        code = extract_code_from_response(response)
        # 3. Execute it in the sandbox
        result = self.executor.execute(code)
        return result
```
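The `extract_code_from_response` helper isn't shown in the excerpt above; a minimal sketch of one way to implement it, assuming the Anthropic SDK's content-block response shape, is:

```python
import re

def extract_code_from_response(response) -> str:
    """Illustrative only: pull the first fenced Python block from the model's reply."""
    text = "".join(block.text for block in response.content if block.type == "text")
    match = re.search(r"```(?:python)?\s*\n(.*?)```", text, re.DOTALL)
    return match.group(1) if match else text
```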
The tools API exposed to the generated code is documented with typed response contracts:

```python
from typing import TypedDict, Literal

class TransactionResponse(TypedDict):
    status: Literal["success"]
    transaction: TransactionDict
    new_balance: float

def create_transaction(
    transaction_type: Literal["income", "expense", "transfer"],
    category: str,
    amount: float,
    description: str,
    account: str = "checking",
) -> str:
    """
    Create a new transaction.

    Returns: JSON string with TransactionResponse structure

    Example:
        result = tools.create_transaction("expense", "rent", 2500.0, "Monthly rent")
        data = json.loads(result)
        print(data["new_balance"])  # 7500.0
    """
    # Implementation...
```

Code execution uses RestrictedPython for sandboxing (see the sketch after this list):
- No filesystem access
- No network access
- No dangerous imports
- Controlled builtins
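A minimal sketch of what a RestrictedPython executor can look like. The guard choices below are common RestrictedPython defaults, not necessarily what `sandbox/executor.py` actually uses:

```python
import json
from RestrictedPython import compile_restricted, safe_builtins
from RestrictedPython.Eval import default_guarded_getitem, default_guarded_getiter
from RestrictedPython.Guards import guarded_iter_unpack_sequence, safer_getattr

def run_sandboxed(code: str, tools):
    """Compile untrusted code and run it with restricted builtins and guards."""
    byte_code = compile_restricted(code, filename="<agent_code>", mode="exec")
    env = {
        "__builtins__": safe_builtins,         # no open(), __import__, etc.
        "_getattr_": safer_getattr,            # guards attribute access
        "_getitem_": default_guarded_getitem,  # guards subscripting
        "_getiter_": default_guarded_getiter,  # guards for-loops
        "_iter_unpack_sequence_": guarded_iter_unpack_sequence,
        "json": json,                          # generated code calls json.loads
        "tools": tools,                        # the only side-effecting object exposed
    }
    exec(byte_code, env)
    return env.get("result")  # generated code assigns its answer to `result`
```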
| Complexity | Scenarios | Avg Speedup | Avg Token Savings |
|---|---|---|---|
| High (10+ ops) | 2 | 79.2% | 36,389 tokens |
| Medium (5-9 ops) | 3 | 47.5% | 8,774 tokens |
| Low (3-4 ops) | 1 | 45.3% | 6,209 tokens |
Key Insight: Code Mode advantage scales with complexity, but even simple tasks benefit significantly.
| Daily Volume | Regular Annual | Code Mode Annual | Annual Savings |
|---|---|---|---|
| 100 | $252 | $77 | $175 |
| 1,000 | $2,519 | $766 | $1,753 |
| 10,000 | $25,185 | $7,665 | $17,520 |
| 100,000 | $251,850 | $76,650 | $175,200 |
(Based on Claude Haiku pricing: $0.25/1M input, $1.25/1M output)
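The dollar figures in the table follow from this pricing. A small helper for plugging in your own token counts and volume; the function is ours for illustration, not part of the repo:

```python
INPUT_RATE = 0.25 / 1_000_000   # $ per input token (Claude Haiku)
OUTPUT_RATE = 1.25 / 1_000_000  # $ per output token

def annual_cost(input_tokens: int, output_tokens: int, runs_per_day: int) -> float:
    """Annual API cost for one workload at a given daily volume."""
    per_run = input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE
    return per_run * runs_per_day * 365
```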
- Model: Claude 3 Haiku
- Performance: 60.4% faster, 68.3% fewer tokens
- Best For: Cost-sensitive production workloads
- Status: Fully tested (8/8 scenarios)
- Model: Gemini 2.0 Flash Experimental
- Performance: 15.1% faster, 70.6% fewer iterations
- Best For: Low-latency requirements
- Status: Partially tested (2/8 scenarios)
- Note: Faster baseline but more verbose code generation
```bash
# Run all tests
make test

# Run specific test file
python -m pytest tests/test_scenarios.py

# Test Code Mode agent directly
python agents/codemode_agent.py

# Test Regular Agent directly
python agents/regular_agent.py

# Test sandbox execution
python sandbox/executor.py
```

- [Benchmark Summary](docs/BENCHMARK_SUMMARY.md) - Comprehensive analysis with insights
- [Results Data](docs/RESULTS_DATA.md) - Raw performance tables
- [Quick Start Guide](docs/QUICKSTART.md) - Step-by-step setup
- [Tools Documentation](docs/TOOLS.md) - Available tools and API
- [Changelog](docs/CHANGELOG.md) - Version history
- [Gemini Notes](docs/GEMINI.md) - Gemini-specific information
1. **Batching Advantage**
   - Single code block replaces multiple API calls
   - No neural network processing between operations
   - Example: 16 iterations → 1 iteration (Scenario 7)
2. **Cognitive Efficiency**
   - LLMs have extensive training on code generation
   - Natural programming constructs (loops, variables, conditionals)
   - TypedDict provides clear type contracts
3. **Computational Efficiency**
   - No context re-processing between tool calls
   - Direct code execution in sandbox
   - Reduced token overhead
Code Mode is a good fit for:

- Multi-step workflows - Greatest benefit with many operations
- Complex business logic - Invoicing, accounting, data processing
- Batch operations - Similar actions on multiple items
- Cost-sensitive workloads - Production at scale
- Latency-critical applications - User-facing systems
Best practices:

- Use TypedDict for response types - Provides clear structure to the LLM
- Include examples in docstrings - Shows correct usage patterns
- Batch similar operations - Leverage loops in code
- Validate results - Automated checks ensure correctness
- Handle errors gracefully - Use try/except in generated code (see the sketch below)
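For illustration, generated code following these practices might look like this; the categories and amounts are hypothetical:

```python
import json

results = []
for category, amount, desc in [("rent", 2500, "Monthly rent"), ("utilities", 150, "Electricity")]:
    try:
        raw = tools.create_transaction("expense", category, amount, desc)
        results.append(json.loads(raw))
    except Exception as exc:  # record the failure instead of crashing the whole batch
        results.append({"status": "error", "category": category, "detail": str(exc)})

ok = sum(r.get("status") == "success" for r in results)
result = f"Recorded {ok} of {len(results)} expenses"
```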
Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Run tests (`make test`)
5. Commit (`git commit -m 'Add amazing feature'`)
6. Push (`git push origin feature/amazing-feature`)
7. Open a Pull Request
- Cloudflare Code Mode Blog Post
- Anthropic Building Effective Agents
- Claude API Documentation
- Gemini API Documentation
- RestrictedPython Documentation
MIT License - See LICENSE file for details
- Inspired by Cloudflare's Code Mode research
- Built on Anthropic's Building Effective Agents framework
- Uses RestrictedPython for secure code execution
For questions or feedback, please open an issue on GitHub.
Benchmark Date: January 2025
Models Tested: Claude 3 Haiku, Gemini 2.0 Flash Experimental
Test Scenarios: 8 realistic business workflows
Result: Code Mode is 60% faster, uses 68% fewer tokens, with equal accuracy