Context window
The context window is the amount of text, measured in tokens, that a large language model can process in a single call, covering both the input you send and the output it generates. Once the conversation, documents and answer together exceed that budget, something has to be cut, summarised or retrieved on demand.
The context window is the working memory of an LLM for a single request. Everything the model needs to take into account (the system prompt, the conversation history, attached documents, tool outputs and the answer itself) has to fit inside it. When windows were small, much of prompt engineering was about ruthlessly trimming what went in. With windows now stretching into hundreds of thousands of tokens, the question has shifted from "what do I leave out?" to "what is actually useful to include?".
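The "everything has to fit" constraint is simple arithmetic, and a minimal sketch may make it concrete. The names `estimate_tokens` and `fits_in_window` are hypothetical, and the one-token-per-word estimate is an assumption made to keep the example self-contained; real tokenizers count differently, so a production check would use the model's own tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: one token per whitespace-separated word (an assumption)."""
    return len(text.split())


def fits_in_window(parts: dict[str, str], window_tokens: int,
                   reserved_for_answer: int) -> tuple[bool, int]:
    """Return (fits, tokens_used); tokens_used includes the space reserved for the answer."""
    used = sum(estimate_tokens(text) for text in parts.values()) + reserved_for_answer
    return used <= window_tokens, used


# Illustrative inputs only; none of this text comes from a real system.
parts = {
    "system_prompt": "You are a careful contracts analyst.",
    "history": "User: summarise clause 4. Assistant: Clause 4 covers termination.",
    "documents": "Clause 4. Either party may terminate with thirty days notice.",
}

ok, used = fits_in_window(parts, window_tokens=50, reserved_for_answer=20)
print(ok, used)
```

Reserving tokens for the answer up front reflects the point above: the output shares the window with the input, so an input that exactly fills the window leaves no room to respond.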
Bigger windows let teams push entire codebases, contracts or knowledge bases into a single call and sometimes skip retrieval altogether. That is genuinely useful for one-off analysis and prototypes, where the simplicity of "just paste it all in" beats building a retrieval pipeline.
The trade-off shows up in production. Larger inputs cost more per call and slow down inference, and answer quality degrades on very long inputs: models start to miss things buried in the middle even when the tokens technically fit. The rule of thumb is to treat the window as a budget, not a destination: include only what the model needs for the question at hand, and use retrieval for the rest.
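One way to treat the window as a budget is to rank candidate passages by relevance and include them greedily until the budget runs out. The sketch below assumes relevance scores already exist (for example from a retrieval step); `pack_context`, the scores and the word-based token estimate are all hypothetical and chosen for illustration, not a prescribed method.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: one token per whitespace-separated word (an assumption)."""
    return len(text.split())


def pack_context(chunks: list[tuple[float, str]], budget_tokens: int) -> list[str]:
    """Greedily keep the highest-scoring chunks that still fit in the budget."""
    chosen: list[str] = []
    remaining = budget_tokens
    # Visit chunks from most to least relevant.
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = estimate_tokens(text)
        if cost <= remaining:
            chosen.append(text)
            remaining -= cost
    return chosen


# (relevance score, passage) pairs; the scores are made up for the example.
chunks = [
    (0.9, "Termination requires thirty days written notice."),
    (0.2, "The office is closed on public holidays and weekends."),
    (0.7, "Notice must be sent by registered mail."),
]

selected = pack_context(chunks, budget_tokens=13)
print(selected)
```

With a budget of 13 estimated tokens, the two relevant passages fit and the low-scoring one is left for retrieval on another question. Greedy packing is not optimal in general, but it is simple and keeps the most relevant material first.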