multi-sdk-llm-notebooks: same task, two SDKs, what actually changes

Most "OpenAI vs Anthropic" writeups online are either flame-war shaped or cherry-picked for a single number. I wanted the narrower, structural answer: for the same task, at the same temperature, with the same prompt, what do the two SDKs actually cost — in dollars and in latency — to use?

So I built a small harness, ran 20 product reviews through both providers across six notebooks (function calling, structured output, prompt caching, streaming, an agent loop, and an agent loop with caching), and wrote down what the SDKs told me about themselves along the way.

The repo lives at github.com/dominic-righthere/multi-sdk-llm-notebooks. Headline tables and the notebook index are in the README. The long version of every finding, with reasoning and caveats, is in the repo, alongside a shorter "what should I actually do?" guide.

This post is the editorial cut — four things that surprised me, told quickly.

What I built

The shape is deliberately small and reproducible. Two providers, four models, six notebooks, and one shared bench() context manager that wraps every call and records latency + tokens + cost from the provider's own usage object.

with bench(f"openai/{model} — fn_call", model, "openai") as rec:
    args, usage = run_openai(review_text)
    rec.input_tokens = usage.prompt_tokens
    rec.output_tokens = usage.completion_tokens

Both providers return a usage object. Without that, any cost number is a guess. With it, the comparisons are apples-to-apples.
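
The bench() helper itself isn't shown in this post; a minimal sketch of the idea, with placeholder names and placeholder pricing (the repo's real implementation may differ), could look like this:

import time
from contextlib import contextmanager
from dataclasses import dataclass

# Placeholder $/MTok (input, output) prices; real code should load current provider pricing.
PRICES = {"openai": (0.25, 2.00), "anthropic": (1.00, 5.00)}

@dataclass
class Record:
    label: str
    model: str
    provider: str
    input_tokens: int = 0
    output_tokens: int = 0
    latency_ms: float = 0.0

    @property
    def cost_usd(self) -> float:
        in_price, out_price = PRICES[self.provider]
        return (self.input_tokens * in_price + self.output_tokens * out_price) / 1_000_000

RESULTS: list[Record] = []

@contextmanager
def bench(label: str, model: str, provider: str):
    rec = Record(label, model, provider)
    start = time.perf_counter()
    try:
        yield rec                                    # caller fills in token counts from the usage object
    finally:
        rec.latency_ms = (time.perf_counter() - start) * 1000
        RESULTS.append(rec)                          # collected for the summary tables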

What surprised me

1. Anthropic counts the tool schema as input tokens. OpenAI doesn't.

Same review text. Same system prompt. The tool wrapper costs a fortune on one side and almost nothing on the other:

                                            gpt-5.4-mini   claude-haiku-4-5
Mean input tokens (tool use)                         239                816
Mean input tokens (structured output, no tool)       144                288

Mean input tokens, same review, same prompt.

The Anthropic delta on tool use is, almost exactly, the classify_review tool definition being counted as input. Drop the tool wrapper and it evaporates.
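
That tool definition is just a JSON schema. A sketch of what it plausibly looks like (field names here are my guess, not the repo's exact schema); every key and description in it is billed as input on every call:

# Illustrative classify_review tool definition in Anthropic's tool format.
ANTHROPIC_TOOL = {
    "name": "classify_review",
    "description": "Classify a product review by sentiment and extract a short summary.",
    "input_schema": {
        "type": "object",
        "properties": {
            "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
            "summary": {"type": "string", "description": "One-sentence summary of the review."},
        },
        "required": ["sentiment", "summary"],
    },
}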

Sticker price doesn't tell the full cost story; at 1M requests/month, the same prompt is roughly 3× more expensive on Anthropic for tool-use workloads.
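
To put a scale on that, just the input-token delta from the tool wrapper, using the means above:

# Extra input tokens per month attributable to the tool wrapper alone.
extra_per_call = 816 - 239               # mean input tokens, Anthropic vs OpenAI, tool-use path
extra_per_month = extra_per_call * 1_000_000
print(f"{extra_per_month:,}")            # 577,000,000 extra input tokens per month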

2. Structured output is 25–54% cheaper than tool use, on both providers.

Same classification task, two API mechanisms. The structured-output path doesn't pay the tool-schema tax, and on Anthropic the tax is the headline number:

# Anthropic — tool calling
resp = client.messages.create(
  model="claude-haiku-4-5",
  max_tokens=1024,                 # required by the Messages API
  system=SYSTEM_PROMPT,
  messages=[{"role": "user", "content": review}],
  tools=[ANTHROPIC_TOOL],          # the schema that gets billed as input every call
  tool_choice={"type": "tool", "name": "classify_review"},
)

OpenAI saw a smaller version of the same effect — response_format={"type": "json_schema", ...} came in 25% cheaper than tool calling on the same task.
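
For contrast, the structured-output counterpart on OpenAI is roughly this shape (the schema fields are illustrative, not the repo's exact classify_review schema):

# OpenAI — structured output via json_schema response_format.
resp = client.chat.completions.create(
    model="gpt-5.4-mini",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": review},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "review_classification",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
                    "summary": {"type": "string"},
                },
                "required": ["sentiment", "summary"],
                "additionalProperties": False,
            },
        },
    },
)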

The architectural rule I'd take away: if your use case is "give me JSON", prefer structured output over tools. Reserve tools for when you actually need a name-bound function in a multi-step loop. Tutorials default to tools because tools sound more sophisticated; on this workload they're meaningfully more expensive for no reliability gain on a flat schema.

3. Prompt caching has a silent-failure mode.

Anthropic's prompt cache is the natural fix for finding 1 — the tool schema is static, exactly the shape caching is designed for. On a properly-sized prefix it cut total cost by 70% and p95 latency by 65% on this workload.

Anthropic claude-haiku-4-5, padded prompt (4,741 tokens), N=20. Same prompt, same calls; the only difference is cache_control on the last system block.

                              Cost / 1k calls   p95 latency
No cache                      $5.30             5,525 ms
Cached (cache_read fires)     $1.58             1,930 ms     (break-even at call 2)

But there's a gotcha that cost me a notebook run. Below the model's minimum cacheable prompt size (Haiku 4.5: 4096 tokens; Sonnet 4.6: 2048), cache_control is a no-op with no error: just cache_read_input_tokens: 0 on every call. You silently pay full price and never notice unless you're inspecting usage.
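
A sketch of the pattern, plus the one check that would have caught the silent failure (the prefix variable names are placeholders):

# Anthropic — cache the static prefix. The cached span must clear the model's
# minimum cacheable size (e.g. 4096 tokens on Haiku 4.5) or caching silently no-ops.
resp = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=1024,
    system=[
        {"type": "text", "text": LONG_STATIC_PREFIX},
        {"type": "text", "text": TOOL_AND_FORMAT_INSTRUCTIONS,
         "cache_control": {"type": "ephemeral"}},    # cache everything up to and including this block
    ],
    messages=[{"role": "user", "content": review}],
)

# The silent-failure check: if this stays 0 on warm calls, the cache never fired.
u = resp.usage
print(u.cache_creation_input_tokens, u.cache_read_input_tokens)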

The takeaway: caching is a fix for naturally-large prompts, not a lever you pull whenever. If your system prompt is 200 tokens, caching can't fire — and padding it to clear the minimum is, perversely, usually a net loss once you account for the extra input tokens you're now sending on every call.

4. For agents, the metric is cost-per-successful-task, not cost-per-call.

A model that takes 6 turns at $0.002/turn is more expensive than one that takes 3 turns at $0.003/turn — even though its per-call rate is lower. Per-call tables hide this; per-task numbers don't.
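
In numbers:

# The worked example from the sentence above.
cost_a = 0.002 * 6   # lower per-call price, more turns:  $0.012 per task
cost_b = 0.003 * 3   # higher per-call price, fewer turns: $0.009 per task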

On the agent-loop benchmark (12-product catalog, 18 deterministic tasks, mock tools, 20% seeded errors), both providers scored 18/18 on success. The differentiation showed up in the margins:
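
The mock tools are deliberately boring. A sketch of the shape (names, catalog, and the error-injection scheme here are assumptions, not the repo's exact harness):

import random

rng = random.Random(7)   # seeded, so the "20% errors" land on the same calls every run

# Fixed 12-product catalog; ground truth stays a pure function of this data.
CATALOG = {f"sku-{i}": {"id": f"sku-{i}", "price": 10 + i} for i in range(12)}

def lookup_product(product_id: str) -> dict:
    """Deterministic catalog lookup with a seeded 20% failure rate the agent must recover from."""
    if rng.random() < 0.20:
        return {"error": "upstream timeout, please retry"}
    return CATALOG.get(product_id, {"error": "unknown product"})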

Cost per successful task (agent loop, 18 tasks, both at 100% success):

                               Cost / task   Mean turns
OpenAI · gpt-5.4-mini          $0.0027       3.78
Anthropic · claude-haiku-4-5   $0.0078       3.44         (2.9× cost)

The per-turn token tax (Finding 1) compounds across 3-4 turns of accumulating tool results.

The full breakdown and the "if you're building an agent" guidance are in the agent section of the repo.

Caveats

N=20, one trial per cell. Variance at the tail is unstable at this scale; a serious measurement would run N≥100 with at least 3 trials and confidence intervals. This run is directionally interesting, not statistically rigorous.

The specific numbers are also server weather — the same notebook in October will produce different latency. The shape of the comparison (p50 vs p95 differ; tool schemas tax one provider but not the other; cost-per-task ≠ cost-per-call) is the durable part. The notebooks themselves are designed to age — deterministic tools, fixed catalogs, ground truth as a pure function — so re-running them on a future model gives you the same comparison axes against new numbers.

What's next

The notebooks are reproducible. Clone the repo, run them, contradict the numbers. Follow-ups I want to add: a 1-hour-TTL caching run, a much larger prefix (10K+ tokens, closer to production RAG), and the same agent loop on Sonnet and Opus to see where the cost picture inverts.

If you've measured something that disagrees with this — or want to add a notebook for a third provider — PRs welcome.


github.com/dominic-righthere/multi-sdk-llm-notebooks