In the previous article, I talked about treating AI agents as systems. In any system, cost is a first-class concern. When your agent talks to a model provider, tokens are where that cost lives.
So what is a token? It's not a word but a subword unit derived during tokenization. The most common approach is Byte Pair Encoding (BPE), though alternatives like SentencePiece and WordPiece appear depending on the model family. The tokenizer breaks text into chunks that map to integer IDs in a fixed vocabulary. Common words might be a single token; uncommon ones get split into multiple. The word "tokenization," for example, often splits into "token" and "ization" — two tokens for what reads as one word.
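To make the BPE idea concrete, here's a toy sketch in Python: it repeatedly merges the most frequent adjacent pair of symbols, which is the core of the algorithm. Real tokenizers learn their merge table from a large corpus and map the resulting subwords to integer IDs; this version operates on a single string purely for illustration.

```python
from collections import Counter

def bpe_merges(word, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair.

    Real tokenizers learn merges from a training corpus and apply them
    to new text; this sketch just shows the merge mechanics.
    """
    symbols = list(word)
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(symbols):
            # Greedy left-to-right merge of the winning pair.
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols
```

After one merge on "aaabdaaabac", the most frequent pair ("a", "a") collapses into a single "aa" symbol wherever it occurs; each additional merge builds longer subwords the same way.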
Why does this matter? At a high level, more tokens means more work for the model. Language models process tokens through multiple layers of attention mechanisms, where each token is compared against other tokens in the sequence to understand context and relationships. As the number of tokens grows, the number of these comparisons increases significantly — roughly quadratically with sequence length in standard transformer architectures. There are optimizations like Flash Attention and sliding window attention that reduce the practical cost, but the general principle holds: doubling your token count doesn't just double the compute; it more than doubles it.
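A back-of-the-envelope illustration of that quadratic growth: in vanilla self-attention, every token is compared against every token in the sequence, so the pairwise work grows as n².

```python
def attention_comparisons(n: int) -> int:
    # Vanilla self-attention: each of the n tokens attends to all n tokens
    # (including itself), so the pairwise work is n * n.
    return n * n

# Doubling the context from 1,024 to 2,048 tokens quadruples the comparisons.
print(attention_comparisons(2048) // attention_comparisons(1024))  # 4
```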
This translates directly into hardware demand: more GPU time, more memory, more electricity. Model providers price their APIs based on this reality, charging per token on both the input and output side. Output tokens are typically more expensive than input tokens — often 3x to 5x more depending on the provider. So when your agent sends a bloated prompt and receives a verbose response, you're paying for every piece of that inefficiency, and the output side is where it hurts the most.
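A quick cost sketch shows why the output side hurts the most. The per-million-token prices below are hypothetical placeholders, not any provider's actual rates; check the current pricing page before relying on numbers like these.

```python
def call_cost_usd(input_tokens: int, output_tokens: int,
                  input_price_per_m: float = 3.0,    # hypothetical $/1M input tokens
                  output_price_per_m: float = 15.0   # hypothetical $/1M output tokens
                  ) -> float:
    """Estimate the cost of one API call at per-million-token rates."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000
```

At a hypothetical 5x output premium, a call with equal input and output token counts spends over 80% of its budget on the output side.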
Keeping Token Usage in Check
Prompt Hygiene
Clean your input before it hits the API. Trim whitespace, strip filler text, remove redundant context. Every token counts — literally. Watch your system prompt. Your system prompt consumes input tokens on every single API call. If your agent carries a large system prompt, that's a fixed cost repeated across every request. Keep it tight and only include what the model actually needs for the task.
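A minimal hygiene pass might look like this; the right rules depend on your inputs, so treat it as a starting point rather than a complete solution:

```python
import re

def tidy_prompt(text: str) -> str:
    """Normalize whitespace before text hits the API."""
    text = text.strip()                     # drop leading/trailing whitespace
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # cap consecutive blank lines at one
    return text
```

Pasted documents and scraped web content are the usual offenders: indentation, trailing spaces, and long runs of blank lines all bill as tokens while adding nothing the model needs.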
Set Output Limits
Use the max token parameter on your API calls (e.g., max_tokens for Claude) to cap response length. Don't let the model ramble when you need a concise answer. Since output tokens cost more, this is often where the biggest savings come from.
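As a sketch, here's a helper that builds a request payload with an explicit cap. The payload shape follows the common messages-style APIs, and the model name is a placeholder; substitute whatever you actually call.

```python
def build_request(prompt: str, max_tokens: int = 300) -> dict:
    """Assemble a messages-style request with an explicit output cap."""
    return {
        "model": "your-model-name",  # placeholder, not a real model ID
        "max_tokens": max_tokens,    # hard cap on output tokens
        "messages": [{"role": "user", "content": prompt}],
    }
```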
Dynamic Limits
Not every prompt needs a 4,000-token response. Classify intent upstream using a lightweight classifier or a cheaper model call to triage, then adjust the token budget before hitting the expensive model.
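Here's a sketch of that triage step, with hypothetical intent labels and budgets. In practice the classifier would be a cheap model call or a small trained model rather than keyword matching:

```python
def classify_intent(prompt: str) -> str:
    """Naive stand-in for a cheap upstream classifier (keyword-based)."""
    p = prompt.lower()
    if any(phrase in p for phrase in ("yes or no", "true or false")):
        return "binary"
    if any(phrase in p for phrase in ("summarize", "tl;dr")):
        return "summary"
    return "open_ended"

# Hypothetical per-intent output budgets, passed downstream as max_tokens.
BUDGETS = {"binary": 20, "summary": 400, "open_ended": 1500}

def token_budget(prompt: str) -> int:
    return BUDGETS[classify_intent(prompt)]
```

The point is the shape of the pipeline: a cheap decision upstream sets the output cap before the expensive call happens.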
Constrain the Output Format
Instruct the model to skip preamble and sign-offs. If your application layer handles the UX wrapper, there's no reason for the model to generate "Sure! Here's your answer..." every time. And if you do want intro and closing messages, don't get them from the model: hardcode them in the app layer, especially when the app serves a single, well-defined use case rather than arbitrary open-ended requests.
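One way to sketch this: tell the model to return only the answer body, and let the app layer own the wrapper. The instruction wording and wrapper text here are illustrative, not prescriptive.

```python
# Appended to the system prompt so the model skips pleasantries.
SYSTEM_SUFFIX = (
    "Return only the answer itself. "
    "Do not add greetings, preamble, or closing remarks."
)

def render_reply(model_answer: str) -> str:
    """Wrap the raw model output with hardcoded intro text owned by the app."""
    return f"Here's what I found:\n\n{model_answer.strip()}"
```

Every "Sure! Happy to help!" the model no longer generates is output tokens, the expensive kind, that you stop paying for on every call.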
Use Prompt Caching
Both Anthropic and OpenAI offer prompt caching where repeated prefixes in your input get charged at a reduced rate. If your agent reuses the same system prompt or context block across calls, caching can cut a significant chunk off your input token costs.
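As a sketch of how this looks with Anthropic's API, you mark the stable prefix with a cache_control block so repeated calls can reuse it. The field names follow my reading of the docs and the model name is a placeholder, so verify against the current API reference before use.

```python
def cached_request(system_prompt: str, user_message: str,
                   max_tokens: int = 500) -> dict:
    """Build a request whose system prompt is flagged for prompt caching."""
    return {
        "model": "your-model-name",  # placeholder, not a real model ID
        "max_tokens": max_tokens,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Marks the prefix up to this block as cacheable
                # (Anthropic-style; OpenAI caches eligible prefixes automatically).
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }
```

The savings compound with the fixed-cost point above: a large system prompt repeated across thousands of calls is exactly the kind of stable prefix caching is designed for.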
