In the previous article, I talked about treating AI agents as systems. In any system, cost is a first-class concern. When your agent talks to a model provider, tokens are where that cost lives.
So what is a token? It's not a word but a subword unit derived during tokenization. The most common approach is Byte Pair Encoding (BPE), though alternatives like SentencePiece and WordPiece appear depending on the model family. The tokenizer breaks text into chunks that map to integer IDs in a fixed vocabulary. Common words might be a single token; uncommon ones get split into multiple. The word "tokenization," for example, often splits into "token" and "ization" — two tokens for what reads as one word.
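To make the BPE idea concrete, here's a toy sketch in Python: it repeatedly merges the most frequent adjacent pair of symbols, which is the core of the algorithm. Real tokenizers learn their merge table from a large corpus and map the resulting subwords to integer IDs; this version operates on a single string purely for illustration.

```python
from collections import Counter

def bpe_merges(word, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair.

    Real tokenizers learn merges from a training corpus and apply them
    to new text; this sketch just shows the merge mechanics.
    """
    symbols = list(word)
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(symbols):
            # Greedy left-to-right merge of the winning pair.
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols
```

After one merge on "aaabdaaabac", the most frequent pair ("a", "a") collapses into a single "aa" symbol wherever it occurs; each additional merge builds longer subwords the same way.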
Why does this matter? At a high level, more tokens means more work for the model. Language models process tokens through multiple layers of attention mechanisms, where each token is compared against other tokens in the sequence to understand context and relationships. As the number of tokens grows, the number of these comparisons increases significantly — roughly quadratically with sequence length in standard transformer architectures. There are optimizations like Flash Attention and sliding window attention that reduce the practical cost, but the general principle holds: doubling your token count doesn't just double the compute; it more than doubles it.
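A back-of-the-envelope illustration of that quadratic growth: in vanilla self-attention, every token is compared against every token in the sequence, so the pairwise work grows as n².

```python
def attention_comparisons(n: int) -> int:
    # Vanilla self-attention: each of the n tokens attends to all n tokens
    # (including itself), so the pairwise work is n * n.
    return n * n

# Doubling the context from 1,024 to 2,048 tokens quadruples the comparisons.
print(attention_comparisons(2048) // attention_comparisons(1024))  # 4
```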
This translates directly into hardware demand: more GPU time, more memory, more electricity. Model providers price their APIs based on this reality, charging per token on both the input and output side. Output tokens are typically more expensive than input tokens — often 3x to 5x more depending on the provider. So when your agent sends a bloated prompt and receives a verbose response, you're paying for every piece of that inefficiency, and the output side is where it hurts the most.
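A quick cost sketch shows why the output side hurts the most. The per-million-token prices below are hypothetical placeholders, not any provider's actual rates; check the current pricing page before relying on numbers like these.

```python
def call_cost_usd(input_tokens: int, output_tokens: int,
                  input_price_per_m: float = 3.0,    # hypothetical $/1M input tokens
                  output_price_per_m: float = 15.0   # hypothetical $/1M output tokens
                  ) -> float:
    """Estimate the cost of one API call at per-million-token rates."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000
```

At a hypothetical 5x output premium, a call with equal input and output token counts spends over 80% of its budget on the output side.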
Keeping Token Usage in Check
Prompt Hygiene
Clean your input before it hits the API. Trim whitespace, strip filler text, remove redundant context. Every token counts — literally. Watch your system prompt. Your system prompt consumes input tokens on every single API call. If your agent carries a large system prompt, that's a fixed cost repeated across every request. Keep it tight and only include what the model actually needs for the task.
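A minimal hygiene pass might look like this; the right rules depend on your inputs, so treat it as a starting point rather than a complete solution:

```python
import re

def tidy_prompt(text: str) -> str:
    """Normalize whitespace before text hits the API."""
    text = text.strip()                     # drop leading/trailing whitespace
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # cap consecutive blank lines at one
    return text
```

Pasted documents and scraped web content are the usual offenders: indentation, trailing spaces, and long runs of blank lines all bill as tokens while adding nothing the model needs.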
Set Output Limits
Use the max token parameter on your API calls (e.g., max_tokens for Claude) to cap response length. Don't let the model ramble when you need a concise answer. Since output tokens cost more, this is often where the biggest savings come from.
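As a sketch, here's a helper that builds a request payload with an explicit cap. The payload shape follows the common messages-style APIs, and the model name is a placeholder; substitute whatever you actually call.

```python
def build_request(prompt: str, max_tokens: int = 300) -> dict:
    """Assemble a messages-style request with an explicit output cap."""
    return {
        "model": "your-model-name",  # placeholder, not a real model ID
        "max_tokens": max_tokens,    # hard cap on output tokens
        "messages": [{"role": "user", "content": prompt}],
    }
```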
Dynamic Limits
Not every prompt needs a 4,000-token response. Classify intent upstream using a lightweight classifier or a cheaper model call to triage, then adjust the token budget before hitting the expensive model.
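Here's a sketch of that triage step, with hypothetical intent labels and budgets. In practice the classifier would be a cheap model call or a small trained model rather than keyword matching:

```python
def classify_intent(prompt: str) -> str:
    """Naive stand-in for a cheap upstream classifier (keyword-based)."""
    p = prompt.lower()
    if any(phrase in p for phrase in ("yes or no", "true or false")):
        return "binary"
    if any(phrase in p for phrase in ("summarize", "tl;dr")):
        return "summary"
    return "open_ended"

# Hypothetical per-intent output budgets, passed downstream as max_tokens.
BUDGETS = {"binary": 20, "summary": 400, "open_ended": 1500}

def token_budget(prompt: str) -> int:
    return BUDGETS[classify_intent(prompt)]
```

The point is the shape of the pipeline: a cheap decision upstream sets the output cap before the expensive call happens.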
Constrain the Output Format
Instruct the model to skip preamble and sign-offs. If your application layer handles the UX wrapper, there's no reason for the model to generate "Sure! Here's your answer..." every time. And if you do want intro and closing messages, don't get them from the model: hardcode them in the app layer, especially when the app serves a single, well-defined use case rather than arbitrary open-ended requests.
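One way to sketch this: tell the model to return only the answer body, and let the app layer own the wrapper. The instruction wording and wrapper text here are illustrative, not prescriptive.

```python
# Appended to the system prompt so the model skips pleasantries.
SYSTEM_SUFFIX = (
    "Return only the answer itself. "
    "Do not add greetings, preamble, or closing remarks."
)

def render_reply(model_answer: str) -> str:
    """Wrap the raw model output with hardcoded intro text owned by the app."""
    return f"Here's what I found:\n\n{model_answer.strip()}"
```

Every "Sure! Happy to help!" the model no longer generates is output tokens, the expensive kind, that you stop paying for on every call.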
Use Prompt Caching
Both Anthropic and OpenAI offer prompt caching where repeated prefixes in your input get charged at a reduced rate. If your agent reuses the same system prompt or context block across calls, caching can cut a significant chunk off your input token costs.
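As a sketch of how this looks with Anthropic's API, you mark the stable prefix with a cache_control block so repeated calls can reuse it. The field names follow my reading of the docs and the model name is a placeholder, so verify against the current API reference before use.

```python
def cached_request(system_prompt: str, user_message: str,
                   max_tokens: int = 500) -> dict:
    """Build a request whose system prompt is flagged for prompt caching."""
    return {
        "model": "your-model-name",  # placeholder, not a real model ID
        "max_tokens": max_tokens,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Marks the prefix up to this block as cacheable
                # (Anthropic-style; OpenAI caches eligible prefixes automatically).
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }
```

The savings compound with the fixed-cost point above: a large system prompt repeated across thousands of calls is exactly the kind of stable prefix caching is designed for.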
