Counting LLM Tokens Without Surprise Bills

By Λ · May 18, 2026 · 8 min read

The first $300 LLM bill I got came from a batch summarization job over 14,000 customer support tickets through GPT-4. I expected about twenty dollars. The tickets were longer than I had guessed, the responses ran longer than I had budgeted for, and the temperature was high enough that some prompts triggered retries. I had used the API for months without thinking about cost. Since that bill, I have never run a batch without estimating first.

This post is what I have learned about predicting and controlling LLM costs in 2026. The short version: count tokens before you send, cap output explicitly, and pick the cheapest model that meets your quality bar. The long version is below.

Tokens, briefly

LLMs do not see characters or words. They see tokens, which are sub-word units produced by a tokenizer. Different model families use different tokenizers.

The practical takeaway: token counts differ by model. A 1000-word English document is roughly 1300 tokens in GPT-4, 1200 in GPT-4o, 1180 in Claude, and 1230 in Gemini. These counts are within about 10% of each other for English prose, but the gaps grow on code (denser punctuation) and non-English text (more sub-word splits).
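
For a quick sanity check before reaching for a real tokenizer, the per-1000-word figures above can be turned into a rough estimator. This is a minimal sketch: the ratios are simply the numbers quoted above divided by 1000, the family names are labels I made up for the lookup table, and the result is a heuristic for English prose, not actual tokenizer output.

```python
# Rough tokens-per-word ratios derived from the 1000-word figures in this post.
# Heuristic only: real counts come from each provider's tokenizer.
TOKENS_PER_WORD = {
    "gpt-4": 1.30,
    "gpt-4o": 1.20,
    "claude": 1.18,
    "gemini": 1.23,
}


def estimate_tokens(text: str, family: str) -> int:
    """Estimate a token count from a whitespace-delimited word count."""
    words = len(text.split())
    return round(words * TOKENS_PER_WORD[family])


if __name__ == "__main__":
    sample = "word " * 1000  # stand-in for a 1000-word document
    for family in TOKENS_PER_WORD:
        print(family, estimate_tokens(sample, family))
```

Expect this to undercount on code and non-English text, for the reasons given above.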

Cost arithmetic

For every model, the bill is straightforward:

cost = (input_tokens / 1_000_000) * input_price + (output_tokens / 1_000_000) * output_price
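
The formula translates directly into a small function. In this sketch the prices are in dollars per million tokens, and the example prices are placeholders, not any provider's real rates.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost of one request in dollars; prices are dollars per million tokens."""
    input_cost = (input_tokens / 1_000_000) * input_price
    output_cost = (output_tokens / 1_000_000) * output_price
    return input_cost + output_cost


# Placeholder prices for illustration only.
cost = request_cost(input_tokens=2_000, output_tokens=500,
                    input_price=1.00, output_price=4.00)
print(f"${cost:.4f}")
```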

Output tokens are billed at 3-5x the input price for most providers. The reason: generating tokens is expensive (it's sequential autoregressive decoding) while reading tokens is cheap (it parallelizes well). May 2026 pricing for reference:

Paste your prompt into the token counter to see the side-by-side cost across all models. The visual comparison usually surfaces a cheaper option that works just as well.

Patterns I use to control cost

1. Always estimate before batches

If you are about to run a job over N items where each is non-trivial, multiply by N before you start. A request that costs $0.01 across 50,000 items is $500. The math is obvious in retrospect; it is also the math I forgot the day of the $300 bill. Build this estimate into your runner so the script refuses to start without an explicit confirm if the projected cost crosses your threshold.
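
One way to build that guard into a runner is sketched below. The function name, the exception, and the confirmed flag are all hypothetical; in a real script the flag might come from a command-line option.

```python
class BudgetExceeded(Exception):
    """Raised when a batch would cross the cost threshold without confirmation."""


def check_batch_budget(n_items: int, cost_per_item: float,
                       threshold_usd: float, confirmed: bool = False) -> float:
    """Return the projected cost, or refuse to proceed if it is over threshold."""
    projected = n_items * cost_per_item
    if projected > threshold_usd and not confirmed:
        raise BudgetExceeded(
            f"projected ${projected:,.2f} exceeds ${threshold_usd:,.2f}; "
            "re-run with explicit confirmation"
        )
    return projected


if __name__ == "__main__":
    try:
        check_batch_budget(n_items=50_000, cost_per_item=0.01, threshold_usd=100.0)
    except BudgetExceeded as err:
        print(f"refusing to start: {err}")
```

Call it before the first request goes out, not after the first few hundred.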

2. Cap max_tokens on every request

Every API has a max_tokens parameter (sometimes called max_output_tokens or max_completion_tokens). Setting it caps the worst-case bill. If you ask for a single-sentence summary and the model decides to give you a five-paragraph essay, max_tokens stops that at the cost you budgeted for. Some APIs make the parameter mandatory (Anthropic's Messages API does). Where it is optional, the default is typically the model's maximum output length, which is exactly the wrong default.
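
If you start from a per-request budget, you can work backwards to the largest max_tokens that keeps the worst case inside it. A minimal sketch, assuming prices in dollars per million tokens and that every output token up to the cap is billed; the function name and the example prices are made up for illustration.

```python
import math


def max_tokens_for_budget(budget_usd: float, input_tokens: int,
                          input_price: float, output_price: float) -> int:
    """Largest output cap whose worst-case cost stays within budget_usd."""
    input_cost = (input_tokens / 1_000_000) * input_price
    remaining = budget_usd - input_cost
    if remaining <= 0:
        return 0  # the prompt alone already uses up the budget
    return math.floor(remaining / output_price * 1_000_000)


# Placeholder prices; substitute your provider's current rates.
cap = max_tokens_for_budget(budget_usd=0.01, input_tokens=2_000,
                            input_price=1.00, output_price=4.00)
print(cap)
```

Pass the result as max_tokens (or your provider's equivalent) on the request.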

3. Prefer the cheapest model that works

For summarization, classification, extraction, and most other "process text" tasks, GPT-4o-mini, Claude Haiku, and Gemini Flash produce results within a few percent of their larger siblings at 10-50x lower cost. Run a small evaluation set across three model sizes, compare outputs, pick the cheapest one that passes your quality bar. I have moved several production pipelines from Claude Opus to Sonnet to Haiku over the last year as the smaller models improved.
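
Once the evaluation has run, the selection step is mechanical. The sketch below assumes you have already scored each model on your own evaluation set; the EvalResult fields, the model labels, and the numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvalResult:
    model: str
    quality: float            # fraction of eval items that passed review
    cost_per_1k_items: float  # measured dollars per 1000 items


def cheapest_passing(results: List[EvalResult],
                     quality_bar: float) -> Optional[EvalResult]:
    """Return the cheapest result meeting the quality bar, or None if none do."""
    passing = [r for r in results if r.quality >= quality_bar]
    if not passing:
        return None
    return min(passing, key=lambda r: r.cost_per_1k_items)


# Made-up numbers for three model sizes.
results = [
    EvalResult("large", quality=0.97, cost_per_1k_items=30.00),
    EvalResult("medium", quality=0.95, cost_per_1k_items=6.00),
    EvalResult("small", quality=0.88, cost_per_1k_items=0.80),
]
choice = cheapest_passing(results, quality_bar=0.90)
print(choice.model if choice else "no model passes")
```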

4. Be deliberate about temperature

Higher temperature increases output diversity but also increases the chance of a malformed response that requires a retry, and each retry is its own billable request. For deterministic tasks (extracting structured data, classifying), set temperature to 0.

5. Use prompt caching for repeated context

Anthropic, OpenAI, and Gemini now support prompt caching. If you have a 5000-token system prompt that repeats across thousands of requests, make sure it is cached: Anthropic needs an explicit cache_control marker, while OpenAI caches eligible prompt prefixes automatically. Subsequent requests within the cache window (5 minutes to an hour depending on provider) bill the cached portion at a steep discount, about 10% of the normal input price on Anthropic. For applications with long system prompts, this is a 5-10x savings on input cost.

// Anthropic example
{
  "system": [
    {
      "type": "text",
      "text": "You are an expert customer support agent for...",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [...]
}
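
To see where a figure like 5-10x comes from, here is the input-cost arithmetic as a sketch. It assumes cached tokens bill at 10% of the normal input price, that the cache stays warm for the whole run, and it ignores any cache-write surcharge; the prices and token counts are placeholders.

```python
def input_cost_without_cache(system_tokens: int, user_tokens: int,
                             n_requests: int, input_price: float) -> float:
    """Total input cost in dollars when every request pays full price."""
    return n_requests * (system_tokens + user_tokens) * input_price / 1_000_000


def input_cost_with_cache(system_tokens: int, user_tokens: int,
                          n_requests: int, input_price: float,
                          cached_rate: float = 0.10) -> float:
    """Total input cost when the system prompt is cached after request one."""
    if n_requests < 1:
        return 0.0
    per_token = input_price / 1_000_000
    first = (system_tokens + user_tokens) * per_token
    later = (n_requests - 1) * (system_tokens * cached_rate + user_tokens) * per_token
    return first + later


# 5000-token system prompt, 500-token user message, 1000 requests, $1 per million.
plain = input_cost_without_cache(5_000, 500, 1_000, input_price=1.00)
cached = input_cost_with_cache(5_000, 500, 1_000, input_price=1.00)
print(f"without cache ${plain:.4f}, with cache ${cached:.4f}, "
      f"ratio {plain / cached:.1f}x")
```

The ratio depends on how much of each request is the cached prefix: the longer the system prompt relative to the per-request text, the closer the savings get to 10x.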

6. Batch where the API supports it

OpenAI's Batch API and Anthropic's Message Batches API charge 50% of normal pricing for asynchronous workloads with a 24-hour SLO. For overnight summarization, classification, or any non-interactive job, this is half-off without changing your prompts.

7. Watch for runaway loops

If your code re-prompts on errors, make sure there is a hard retry cap. The worst bill I have heard of (from someone else) was a customer-service bot that got stuck in a loop where each response triggered a re-prompt. Three days, $40,000.
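
A hard cap can be as simple as a bounded loop. In this sketch, call stands in for whatever function sends one billable request and is_valid for whatever check decides a response is usable; both are hypothetical placeholders for your own code.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


class RetryLimitExceeded(Exception):
    """Raised when no valid response arrives within the attempt cap."""


def call_with_retry_cap(call: Callable[[], T],
                        is_valid: Callable[[T], bool],
                        max_attempts: int = 3) -> Tuple[T, int]:
    """Invoke call() at most max_attempts times; return (response, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        response = call()  # each invocation is one billable request
        if is_valid(response):
            return response, attempt
    raise RetryLimitExceeded(f"gave up after {max_attempts} attempts")
```

Failing loudly after the cap is the point: an exception in the logs is cheaper than a loop that runs all weekend.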

The single most expensive mistake

Embedding endpoints. Many developers pay for embeddings on every request without realizing embeddings should be computed once and stored. If your retrieval-augmented-generation pipeline re-embeds the same documents on every query, you are paying for nothing. Compute embeddings once at ingestion, store in your vector database, and re-embed only when documents actually change.
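
The compute-once rule can be enforced with a content hash. This is a minimal in-memory sketch: embed_fn is a hypothetical stand-in for your embedding API call, and the two dictionaries stand in for your vector database and its metadata.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingStore:
    """Embeds a document only when its content has changed since the last upsert."""

    def __init__(self, embed_fn: Callable[[str], List[float]]):
        self._embed_fn = embed_fn                  # the billable call
        self._hashes: Dict[str, str] = {}          # doc_id -> content hash
        self._vectors: Dict[str, List[float]] = {} # doc_id -> embedding

    def upsert(self, doc_id: str, text: str) -> bool:
        """Store the embedding; return True only if embed_fn was actually called."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self._hashes.get(doc_id) == digest:
            return False  # unchanged document: no API call, no cost
        self._vectors[doc_id] = self._embed_fn(text)
        self._hashes[doc_id] = digest
        return True

    def get(self, doc_id: str) -> List[float]:
        """Look up a stored embedding at query time without re-embedding."""
        return self._vectors[doc_id]
```

At query time only the query itself needs embedding; the documents come straight from the store.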

Quick checklist
