Real-World Foundation Model Costs For Agents: May 2026
Pricing pages tell you the per-million-token rate. They do not tell you what a real multi-turn conversation costs once prompt caching, reasoning tokens, output truncation, and tokenizer differences are accounted for. We could not find a recent, verifiable comparison that actually included Opus 4.7 and GPT-5.4 with reasoning effort dialed in, so we ran one.
Below is what a five-turn engineering conversation costs across four common configurations, measured against the live APIs on May 10, 2026.
Why this data is hard to find
Real multi-turn cost depends on:
- Prompt cache hit rate, which depends on how you mark cache breakpoints
- Reasoning tokens, which OpenAI bills as output but does not carry forward into the next turn's context
- Tokenizer differences (Opus 4.7 uses about 35% more tokens than Sonnet 4.6 for identical text)
- Output truncation when the visible output cap is too low for high-reasoning models
The scaffold
The harness is small and self-contained. For each configuration it does the same thing (a sketch of the per-turn bookkeeping follows the list):
- A fixed system prompt of about 1,800 tokens (a synthetic engineering persona with concrete operational context)
- Five scripted user turns of progressive depth, identical for every model
- Live calls to the official Anthropic and OpenAI SDKs, no provider abstraction in the way
- Per-turn capture of input, cache_read, cache_write, output, and reasoning tokens
- Cost computed against pricing verified the same week against the official sources
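For concreteness, here is a minimal sketch of that per-turn bookkeeping: one record per turn covering the five token categories, plus the cost arithmetic applied to it. The field names and the per-million rates are illustrative placeholders, not the harness's actual identifiers or the verified May 2026 prices.

```go
package main

import "fmt"

// turnUsage is one per-turn record as described above; field names are illustrative.
type turnUsage struct {
	Input      int // uncached input tokens
	CacheRead  int // input tokens served from the prompt cache
	CacheWrite int // input tokens written to the cache on this turn
	Output     int // visible output tokens
	Reasoning  int // reasoning tokens (billed at the output rate)
}

// rates holds per-million-token prices. Placeholder values only.
type rates struct {
	Input, CacheRead, CacheWrite, Output float64
}

func (u turnUsage) cost(r rates) float64 {
	perM := func(tokens int, rate float64) float64 { return float64(tokens) / 1e6 * rate }
	return perM(u.Input, r.Input) +
		perM(u.CacheRead, r.CacheRead) +
		perM(u.CacheWrite, r.CacheWrite) +
		perM(u.Output+u.Reasoning, r.Output) // reasoning rides along at the output rate
}

func main() {
	turn := turnUsage{Input: 220, CacheRead: 1800, Output: 650, Reasoning: 400}
	hypothetical := rates{Input: 3, CacheRead: 0.30, CacheWrite: 3.75, Output: 15}
	fmt.Printf("turn cost: $%.4f\n", turn.cost(hypothetical))
}
```

The one subtlety is the last term: reasoning tokens are billed at the output rate, so they are folded in with visible output rather than tracked as a separate price tier.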
Prompt caching was enabled for both providers, each configured according to its recommended pattern, so the cache hit rates reflect what a well-implemented production client would see.
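On the Anthropic side, the pattern the harness follows amounts to placing an ephemeral cache_control breakpoint on the latest user content block, which makes the system prompt and all prior turns the cacheable prefix. The sketch below builds the documented JSON wire shape directly rather than using the SDK's generated types; the model ID and prompt text are placeholders.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type cacheControl struct {
	Type string `json:"type"` // "ephemeral"
}

type contentBlock struct {
	Type         string        `json:"type"`
	Text         string        `json:"text"`
	CacheControl *cacheControl `json:"cache_control,omitempty"`
}

type message struct {
	Role    string         `json:"role"`
	Content []contentBlock `json:"content"`
}

type request struct {
	Model     string    `json:"model"`
	MaxTokens int       `json:"max_tokens"`
	System    string    `json:"system"`
	Messages  []message `json:"messages"`
}

func main() {
	req := request{
		Model:     "claude-sonnet-4-6", // placeholder model ID
		MaxTokens: 4000,
		System:    "…the ~1,800-token persona goes here…",
		Messages: []message{{
			Role: "user",
			Content: []contentBlock{{
				Type: "text",
				Text: "Turn 1 of the scripted conversation",
				// Breakpoint on the latest user block: everything before it,
				// system prompt included, becomes the cacheable prefix.
				CacheControl: &cacheControl{Type: "ephemeral"},
			}},
		}},
	}
	body, _ := json.MarshalIndent(req, "", "  ")
	fmt.Println(string(body))
}
```

A client POSTs a body like this to https://api.anthropic.com/v1/messages with the x-api-key and anthropic-version headers set; the official SDK produces the same shape under the hood.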
For GPT-5.4, the reasoning_effort parameter is set explicitly. With a 4,000-token output cap, high reasoning hit the cap on most turns and produced truncated visible output; the cap had to be raised to 12,000 before high reasoning could complete without truncation.
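On the OpenAI side, the two settings that matter are the reasoning effort and the completion cap. The sketch below uses the parameter names from the current reasoning-model Chat Completions API (reasoning_effort, max_completion_tokens); whether GPT-5.4 keeps exactly these names is an assumption here, and the model ID is a placeholder.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model               string        `json:"model"`
	ReasoningEffort     string        `json:"reasoning_effort"`      // "low" | "medium" | "high"
	MaxCompletionTokens int           `json:"max_completion_tokens"` // budget shared by reasoning and visible output
	Messages            []chatMessage `json:"messages"`
}

func main() {
	req := chatRequest{
		Model:           "gpt-5.4", // placeholder model ID
		ReasoningEffort: "high",
		// A 4,000-token cap left high reasoning truncating its visible answer
		// on most turns; 12,000 left room for both reasoning and output.
		MaxCompletionTokens: 12000,
		Messages: []chatMessage{
			{Role: "system", Content: "…the ~1,800-token persona goes here…"},
			{Role: "user", Content: "Turn 1 of the scripted conversation"},
		},
	}
	body, _ := json.MarshalIndent(req, "", "  ")
	fmt.Println(string(body))
}
```

If the cap covers reasoning and visible output together, as it does for current reasoning models, the 4,000-token setting fails quietly: the model spends the budget thinking and the answer gets cut off.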
The criteria for a fair comparison
- Identical prompts across configurations
- Output caps high enough that no run gets truncated
- A system prompt large enough to exceed each provider's cache minimum (1,024 tokens for both Anthropic and OpenAI)
- Multiple iterations to estimate variance
- Cold cache for the cost-of-record measurement (a warm cache skews the measured cost low)
- Pricing verified against the official source within the same month
We ran five iterations. Iterations 2 and 5 had verifiably cold caches on both sides. The reported costs are the mean across all valid runs.
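A cheap way to enforce the cold-cache requirement is to inspect the usage block on the first turn of each iteration: a genuinely cold run reports zero cache-read tokens from either provider. The field name below matches the illustrative record sketched earlier, not any SDK's exact response type.

```go
package main

import "fmt"

// firstTurnUsage carries only the field this check needs; illustrative name.
type firstTurnUsage struct {
	CacheRead int // tokens served from the prompt cache on turn 1
}

// coldStart reports whether the iteration began with no cache reuse,
// i.e. the provider had to build the prompt cache from scratch.
func coldStart(u firstTurnUsage) bool {
	return u.CacheRead == 0
}

func main() {
	fmt.Println(coldStart(firstTurnUsage{CacheRead: 0}))    // true: cold, usable as the cost of record
	fmt.Println(coldStart(firstTurnUsage{CacheRead: 1800})) // false: warm, cost will skew low
}
```

With a short cache TTL (Anthropic's default ephemeral window is five minutes), back-to-back iterations can still come up warm, which is presumably why only some of the five iterations qualified as verifiably cold.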
The numbers
| Model | Mean cost | Range | vs Sonnet | Mean wall time |
|---|---|---|---|---|
| Claude Sonnet 4.6 | $0.090 | $0.080 to $0.105 | 1.0× | ~125s |
| GPT-5.4 (medium reasoning) | $0.157 | $0.152 to $0.169 | 1.7× | ~150s |
| Claude Opus 4.7 | $0.172 | $0.148 to $0.183 | 1.9× | ~95s |
| GPT-5.4 (high reasoning) | $0.221 | $0.208 to $0.245 | 2.5× | ~280s |
The cost ranking is stable across all five iterations.
Notable points
Opus 4.7 is closer to GPT-5.4 medium than people assume. With caching wired up correctly, Opus ran at $0.172 for the test conversation, marginally above GPT-5.4 medium and 22% below GPT-5.4 high. Most online cost guides still treat Opus as the expensive option. The pricing reorganization at Opus 4.5 (down from $15/$75 to $5/$25) closed most of that gap.
Caching changes the ranking. Without caching, Sonnet would cost about $0.130 and Opus about $0.250 for this conversation. With caching wired correctly, savings are roughly 30% for both. For GPT-5.4 the savings are smaller (12% on high, 17% on medium) because reasoning output dominates the bill, not input. Caching matters everywhere, just unevenly.
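A rough model of why the savings split this way: cache reads discount input, so the achievable saving scales with how much of the bill is input in the first place. The figures below are placeholders chosen only to echo the shape of the measured results, not the measured numbers themselves.

```go
package main

import "fmt"

// bill returns the cost of a conversation where cachedShare of the input
// spend is served from cache at (1 - cacheDiscount) of the input rate.
func bill(inputCost, outputCost, cachedShare, cacheDiscount float64) float64 {
	return inputCost*(1-cachedShare) + inputCost*cachedShare*(1-cacheDiscount) + outputCost
}

// savings returns the percentage saved versus running the same bill uncached.
func savings(inputCost, outputCost, cachedShare, cacheDiscount float64) float64 {
	uncached := bill(inputCost, outputCost, 0, 0)
	return 100 * (1 - bill(inputCost, outputCost, cachedShare, cacheDiscount)/uncached)
}

func main() {
	// Input-heavy bill (Claude-like): most of the spend is the re-sent prefix.
	fmt.Printf("input-heavy savings: %.0f%%\n", savings(0.10, 0.03, 0.45, 0.9))
	// Output-heavy bill (reasoning-model-like): output dominates, caching moves it less.
	fmt.Printf("output-heavy savings: %.0f%%\n", savings(0.03, 0.18, 0.45, 0.9))
}
```

The mechanism is the point here: the same cache behavior cuts an input-dominated bill by roughly a third while barely denting an output-dominated one.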
GPT-5.4 medium is not a low-reasoning mode. On substantive prompts, reasoning tokens made up 37 to 52% of its output per run. It does drop to near zero on simple consolidation tasks (the final summary turn produced 0 reasoning tokens in 3 of 4 runs), but the default behavior is to spend meaningful reasoning compute.
GPT-5.4 high is roughly twice as slow as the next slowest model. Mean wall time was 280 seconds for a five-turn conversation, against 150 for GPT-5.4 medium and 95 for Opus 4.7.
Tokenizer differences are real. Opus 4.7 uses a new tokenizer that produces about 35% more tokens for identical text. The per-token rate is the same as Opus 4.6, but the per-text cost is higher. This shows up as Opus reporting 1,691 input tokens for the same prompt that Sonnet measured at 1,188, about 42% more on that particular prompt.
Subjective quality, lite
We also read the actual responses on the meatiest turn (debugging a feature-store cache hit rate regression from 97% to 88%). All four models caught the core insight: a hit rate drop from 97% to 88% means the miss rate went from 3% to 12%, a 4× increase in slow-path traffic, which is what pushed p99 out. On sharpness:
- Opus 4.7 led with concrete arithmetic (0.97 × 1.5 ms + 0.03 × 100 ms ≈ 4.5 ms before, 0.88 × 1.5 ms + 0.12 × 100 ms ≈ 13.3 ms after) and set up the p99 reasoning explicitly
- Sonnet 4.6 was intellectually honest, opening by walking back its own prior advice
- GPT-5.4 high was well-structured and comprehensive, slightly more formal
- GPT-5.4 medium was correct but somewhat more verbose
This is one prompt and one observer. Treat it accordingly.
What this is, what this is not
This is a single five-turn conversation on one synthetic engineering scenario. It is not a benchmark. It is a calibration data point: enough to falsify "Opus is too expensive to consider" or "medium effort barely reasons," and enough to give a dollar number that includes everything the pricing pages omit.
The harness is about 400 lines of Go, runs end-to-end in roughly 10 minutes, and costs about $1 per iteration. If you want to reproduce these numbers on your own prompts: drive the SDKs directly, mark cache_control on the latest user message for Anthropic, set reasoning_effort explicitly for OpenAI, raise the output cap to 12,000 for high-reasoning models, and run multiple cold-cache iterations.
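Putting those pieces together, the reproduction loop itself is small. The sketch below shows only its shape: runTurn is a stand-in for the real SDK call, the scripted prompts are elided, and the model ID is a placeholder.

```go
package main

import "fmt"

// usage mirrors the per-turn record used elsewhere in this post; illustrative fields.
type usage struct {
	Input, CacheRead, CacheWrite, Output, Reasoning int
	Cost                                            float64
}

// runTurn is a stand-in for the provider call (Anthropic or OpenAI SDK),
// which would post the request and read back the usage block.
func runTurn(model, prompt string) usage {
	return usage{}
}

func main() {
	prompts := []string{"turn 1", "turn 2", "turn 3", "turn 4", "turn 5"} // the scripted turns
	const iterations = 5

	for i := 0; i < iterations; i++ {
		var total float64
		cold := true
		for j, p := range prompts {
			u := runTurn("claude-sonnet-4-6", p)
			if j == 0 && u.CacheRead > 0 {
				cold = false // warm start: keep the run, but don't use it as the cost of record
			}
			total += u.Cost
		}
		fmt.Printf("iteration %d: $%.3f (cold=%v)\n", i+1, total, cold)
	}
}
```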
Pricing verified against the official Anthropic and OpenAI documentation on May 10, 2026. The model APIs and rates change. Re-measure when they do.
At Obelisk, we run measurements like this constantly. Scribes take on a wide range of jobs across very different tasks, and we are always testing models against real work to make sure each Scribe is using the right tool for the job. The data above is one slice of that ongoing evaluation. As models, prices, and capabilities shift, our routing shifts with them.
References
"Pricing"
Official per-million-token rates for Claude Sonnet 4.6 and Opus 4.7, including 5-minute and 1-hour cache write multipliers and cache read rates. Verified May 10, 2026.
"API Pricing"
Official per-million-token rates for GPT-5.4, including cached input pricing. Verified May 10, 2026.