Counting LLM Tokens Without Surprise Bills
The first $300 LLM bill I got was a batch summarization job over 14,000 customer support tickets through GPT-4. I expected about twenty dollars. The tickets were longer than I had guessed, the responses ran longer than I had budgeted for, and the temperature was high enough that some prompts triggered retries. I had used the API for months without thinking about cost. After that bill, I never run a batch without estimating first.
This post is what I have learned about predicting and controlling LLM costs in 2026. The short version: count tokens before you send, cap output explicitly, and pick the cheapest model that meets your quality bar. The long version is below.
Tokens, briefly
LLMs do not see characters or words. They see tokens, which are sub-word units produced by a tokenizer. Different model families use different tokenizers:
- GPT-4 family uses
cl100k_base, a byte-pair encoding (BPE) with about 100k vocabulary. - GPT-4o, GPT-5 use
o200k_base, a newer BPE with about 200k vocabulary. More efficient on non-English text. - Claude uses Anthropic's proprietary tokenizer. They do not publish the implementation.
- Gemini uses SentencePiece with custom training.
- Llama uses customized SentencePiece, similar to but distinct from Gemini.
The practical takeaway: token counts differ by model. A 1000-word English document is roughly 1300 tokens in GPT-4, 1200 tokens in GPT-4o, 1180 tokens in Claude, 1230 tokens in Gemini. Within 10% of each other for English prose, but the gaps grow on code (more punctuation density) and non-English text (more sub-word splits).
Cost arithmetic
For every model, the bill is straightforward:
cost = (input_tokens / 1_000_000) * input_price + (output_tokens / 1_000_000) * output_price
Output tokens are billed at 3-5x the input price for most providers. The reason: generating tokens is expensive (it's sequential autoregressive decoding) while reading tokens is cheap (it parallelizes well). May 2026 pricing for reference:
- GPT-5: $5/M input, $15/M output
- GPT-4o: $2.50/M input, $10/M output
- GPT-4o-mini: $0.15/M input, $0.60/M output
- Claude Opus 4: $15/M input, $75/M output
- Claude Sonnet 4: $3/M input, $15/M output
- Claude Haiku 4.5: $0.80/M input, $4/M output
- Gemini 2.0 Pro: $1.25/M input, $5/M output
- Gemini 2.0 Flash: $0.10/M input, $0.40/M output
Use the token counter to paste your prompt and see the side-by-side cost across all models. The visual comparison usually surfaces a cheaper option that works just as well.
Patterns I use to control cost
1. Always estimate before batches
If you are about to run a job over N items where each is non-trivial, multiply by N before you start. A request that costs $0.01 across 50,000 items is $500. The math is obvious in retrospect; it is also the math I forgot the day of the $300 bill. Build this estimate into your runner so the script refuses to start without an explicit confirm if the projected cost crosses your threshold.
2. Cap max_tokens on every request
Every API has a max_tokens parameter (sometimes called max_output_tokens). Setting it caps the worst-case bill. If you ask for a single-sentence summary and the model decides to give you a five-paragraph essay, max_tokens stops that at the cost you budgeted for. The default in most SDKs is "no limit", which is exactly the wrong default.
3. Prefer the cheapest model that works
For summarization, classification, extraction, and most other "process text" tasks, GPT-4o-mini, Claude Haiku, and Gemini Flash produce results within a few percent of their larger siblings at 10-50x lower cost. Run a small evaluation set across three model sizes, compare outputs, pick the cheapest one that passes your quality bar. I have moved several production pipelines from Claude Opus to Sonnet to Haiku over the last year as the smaller models improved.
4. Be deliberate about temperature
Higher temperature increases output diversity but also increases the chance of a malformed response that requires a retry. For deterministic tasks (extracting structured data, classifying), set temperature to 0. Each retry is its own billable request.
5. Use prompt caching for repeated context
Anthropic, OpenAI, and Gemini now support prompt caching. If you have a 5000-token system prompt that repeats across thousands of requests, mark it as cacheable. Subsequent requests within the cache window (5 minutes to an hour depending on provider) bill the cached portion at 10% of normal cost. For applications with long system prompts, this is a 5-10x savings.
// Anthropic example
{
"system": [
{
"type": "text",
"text": "You are an expert customer support agent for...",
"cache_control": {"type": "ephemeral"}
}
],
"messages": [...]
}
6. Batch where the API supports it
OpenAI's Batch API and Anthropic's Message Batches API charge 50% of normal pricing for asynchronous workloads with a 24-hour SLO. For overnight summarization, classification, or any non-interactive job, this is half-off without changing your prompts.
7. Watch for runaway loops
If your code re-prompts on errors, make sure there is a hard retry cap. The worst bill I have heard of (from someone else) was a customer-service bot that got stuck in a loop where each response triggered a re-prompt. Three days, $40,000.
The single most expensive mistake
Embedding endpoints. Many developers pay for embeddings on every request without realizing embeddings should be computed once and stored. If your retrieval-augmented-generation pipeline re-embeds the same documents on every query, you are paying for nothing. Compute embeddings once at ingestion, store in your vector database, and re-embed only when documents actually change.
Quick checklist
- Estimate cost per request × volume before running batches
- Set
max_tokenson every API call - Pick the cheapest model that meets your quality bar (test, do not assume)
- Use temperature 0 for deterministic tasks
- Mark long static system prompts as cacheable
- Use Batch APIs for non-urgent workloads (50% discount)
- Hard-cap retries
- Compute embeddings once and store them
- Monitor daily spend; set an alarm at your monthly budget / 30