Tokens & Context Windows Explained: LLM Cost Optimization 2026

"Why does our AI bill look like this?" "The AI forgets my original instructions once the conversation gets long." Both puzzles have the same explanation: tokens and the context window.

This guide explains the two concepts that determine an LLM's capacity and cost, then walks through practical techniques to keep costs down while maintaining output quality. The content is based on the foundation lectures we use in our corporate training and online courses.

For the bigger picture of how generative AI works (next-token prediction and more), start with The Complete Guide to AI Agents for Business.

What you will learn

What a token is — the smallest unit an LLM processes
Why tokenization is necessary — from text to numeric IDs
Token estimation rules — and why some languages are "more expensive"
Input vs. output tokens — pricing mechanics with a worked example
What a context window is — comparing the limits of major models
Why AI "forgets" instructions — context contents and compaction
Six techniques to optimize token usage, plus a self-check list

What is a token? The smallest unit an LLM processes

A token is the smallest unit an LLM (large language model) uses to process text. The model splits text into chunks of words or characters before processing.

English: "Hello, world!" → ["Hello", ",", " world", "!"] (about 4 tokens)
Japanese: a five-character greeting may split into 3 tokens (varies by model)

Three key points:

Languages differ in token efficiency — Japanese, for example, consumes more tokens than English for the same meaning
One token ≈ about 4 characters in English (about 1–2 characters in Japanese)
Code and symbols follow their own tokenization rules

Why tokenization is necessary

Computers cannot understand characters directly; text must be converted to numbers:

Text: "Hello AI"
Tokenization: ["Hello", " AI"]
ID conversion: [15496, 9552]

Diagram showing the flow from text to token splits to numeric IDs

Each token gets a unique numeric ID, and these IDs are what the model takes as input. In other words, an LLM learns patterns of token IDs, not "the meaning of words."

Token estimation rules of thumb

Amount of text	Approximate tokens
1,000 English words	~750 tokens
1,000 Japanese characters	~500–700 tokens
100 lines of code	~500–1,500 tokens

Checking token counts before calling an API lets you forecast costs.

Input vs. output tokens: how pricing works

LLM usage is billed as input tokens plus output tokens — at different rates.

Category	What it includes	Price level
Input tokens	Your prompt, the system prompt, conversation history, attached file contents	Relatively cheap
Output tokens	The AI's response text, generated code, the entire answer	More expensive than input (2–8x)

Comparison chart showing the price difference between input and output tokens

A worked example from our course (GPT-5.2, 2026):

Input: 1,000 tokens × $1.75/1M = $0.00175
Output: 500 tokens × $14/1M = $0.007
Total: about $0.009 per request

Small per request — but hundreds of daily requests across a team add up fast. This is exactly why specifying the output format to limit output tokens is the single most effective saving technique.

What is a context window?

The context window is the maximum number of tokens an LLM can process at once. Major models compared (from our 2026 course material):

Model	Context window	Rough equivalent
GPT-5.2	400K tokens	~3 novels
Claude Sonnet 4.6	200K tokens (1M Beta)	~1.5 novels
Gemini 3 Pro	1M tokens	~7 novels
Llama 4 Scout	10M tokens	~70 novels
DeepSeek-V3.2	128K tokens	~1 novel

Why it matters:

It is the hard limit when processing long conversations or large files
Beyond the window, older information gets "forgotten"
A larger window = more information handled at once

Why AI "forgets" instructions: what's inside the context

"I gave careful instructions, but halfway through the AI started ignoring them." The culprit is the context window.

The crucial premise: the AI does not "remember" the conversation. On every turn, the context window is stuffed with all of the following:

System prompt — the AI's base configuration and behavior
User prompt — the current question or instruction
Tools / MCP / Rules — available tools, external connections, project rules
Documents (RAG) — reference material fetched by retrieval
Past conversation history — the entire chat so far

Close the session and the AI forgets everything. For long-term memory, you must explicitly save information using a memory feature or by writing to files.

Compaction: what happens when the window fills up

As a conversation approaches the window limit, compaction kicks in: older messages are summarized and deleted. Space is freed, but information is lost. Most cases of "the AI forgot my original instructions" trace back to this.

For scale: reading a 1,000-line file once consumes roughly 4,000 tokens. Read 30 files and run 20 commands and you can exceed 100,000 tokens — large jobs simply cannot run without compaction.

Illustration of a cluttered context window depicted as a messy desk overloaded with information

When all kinds of information pile into the context, the AI gets confused like a person at a cluttered desk. Keeping the context clean is the easiest way to maintain output quality. The remedy is simple: start a fresh session per task, and write important decisions out to files.

Six techniques to optimize token usage

Technique	What to do	Example
1. Concise prompts	Cut filler; focus instructions on essentials	"Please do X, and if possible…" → "Do X"
2. Only what's needed	Extract just the relevant part of large files	Pass the specific function or section, not the whole file
3. Specify output format	State the required format to prevent padding	"JSON, keys only", "bullet list, max 5 items"
4. Manage conversation history	Summarize and reset long conversations	Past ~20 turns, summarize key points into a new chat
5. Pick the right model	Match model to task complexity	Light models for simple tasks, top models for complex ones
6. Consider language	English can be more token-efficient	Write technical instructions in English, request output in your language

Self-check when costs feel high

The highest-impact items from our course checklist:

One task = one chat — continuing unrelated tasks in a single conversation piles up irrelevant context
Not attaching huge files whole — a 1,000+ line file alone burns 4,000+ tokens; pass only the needed range
Plan before executing — unplanned trial-and-error consumes 2–3x the tokens through re-reading and redoing
Rule files not bloated — rule files load into context every time; prune stale rules regularly
Controlling output volume — "explain in detail" inflates output tokens; specify "concise, bulleted" instead

In short: token consumption = input (context) + output (answer). Keep the input small and control the output format, and the same amount of work costs dramatically less. For instruction-writing itself, see our related guides on agent extension and retrieval below.

Frequently asked questions

Q. What is a token, and how does it differ from a character count? A. A token is the smallest unit an LLM uses to process text, and it does not map one-to-one to characters. In English, one token is roughly 4 characters; 1,000 English words come to about 750 tokens, while 1,000 Japanese characters come to about 500–700 tokens. Languages differ in token efficiency — the same meaning can cost more tokens in one language than another — which matters directly for cost management.

Q. Why does AI output cost more than input? A. LLM pricing separates input and output token rates, and output is typically 2–8 times more expensive. For example, with GPT-5.2 (2026), a request with 1,000 input tokens and 500 output tokens costs about $0.009. That is why constraining the output format — "bullet points, max 5 items" — is the cheapest, most effective optimization: it directly cuts the expensive side of the bill while improving readability.

Q. Why does the AI forget my instructions mid-conversation? A. Because the AI does not remember anything — every turn, the entire context (system prompt, rules, reference documents, full conversation history) is packed into the context window. As the conversation approaches the limit, compaction summarizes and deletes older messages, losing information. Countermeasures: run one task per chat, summarize and restart long conversations, and write important decisions out to files instead of relying on the AI's "memory."

Q. Is a bigger context window always better? A. A bigger window handles more information at once (GPT-5.2 offers 400K tokens, Gemini 3 Pro 1M, Llama 4 Scout 10M), but it is not a license to stuff everything in. Irrelevant information confuses the model and degrades accuracy, and more input tokens mean higher costs. In day-to-day use, passing only the necessary information and keeping the context clean improves output quality more than raw window size.

Q. What is the simplest cost reduction I can apply today? A. Specify the output format. Constraints like "bullet list, max 5 items" or "JSON, keys only" directly reduce the expensive output tokens. Next: enforce one task per chat, and never paste huge files whole — pass only the relevant range. All three require no tool configuration changes, work immediately, and cause no quality loss.

Related services

Public curriculumBrowse all module overviews and durations to see the full learning path.

Ready to put AI agents to work?

Turn what you just read into real workflows. AI Agent Camp helps non-technical professionals go from using to building — hands-on.

Start for free →

Last reviewed: 2026-06-10