Inference
Inference is running a trained model to produce an output, as opposed to training the model in the first place. In production AI features, inference is where the runtime cost lives — per token, per call, every time a user hits the system.
Training is the one-time, expensive act of teaching a model. Inference is everything that happens afterwards: a user asks a question, the large language model reads the prompt, generates tokens and returns an answer. From a finance point of view, training behaves like a capital expense and inference is the variable cost that scales roughly in proportion to usage (requests multiplied by tokens per request), which is why an AI product that looked cheap in the demo can become uncomfortable on the invoice once real traffic arrives.
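To make the demo-versus-invoice gap concrete, here is a minimal back-of-the-envelope sketch. It assumes token-based pricing, and every number in it (the per-million-token prices, the token counts, the traffic levels) is a made-up illustration rather than any vendor's actual rate. The names `Pricing`, `cost_per_request` and `monthly_cost` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in USD per million tokens (not real vendor rates)."""
    input_per_million: float
    output_per_million: float


def cost_per_request(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call: tokens read plus tokens generated, each at its own rate."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


def monthly_cost(pricing: Pricing, requests_per_day: int,
                 input_tokens: int, output_tokens: int, days: int = 30) -> float:
    """Per-request cost multiplied by traffic; grows in proportion to usage."""
    return cost_per_request(pricing, input_tokens, output_tokens) * requests_per_day * days


# Assumed figures for illustration only.
pricing = Pricing(input_per_million=3.0, output_per_million=15.0)
demo = monthly_cost(pricing, requests_per_day=50,
                    input_tokens=1500, output_tokens=500)
production = monthly_cost(pricing, requests_per_day=50_000,
                          input_tokens=1500, output_tokens=500)

print(f"demo:       ${demo:,.2f}/month")        # about $18.00
print(f"production: ${production:,.2f}/month")  # about $18,000.00
```

Under these assumed numbers the per-request cost never changes; only the traffic does, and a thousandfold increase in requests produces a thousandfold increase in the bill.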
The honest take is that faster and cheaper inference usually means a smaller or quantised model, and a smaller model often means lower quality on the genuinely hard tasks. The right model size is a deliberate choice, not a default — a foundation model for the hard questions, a cheaper model for the easy ones, and routing between them based on what the request actually needs.
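The routing idea can be sketched in a few lines. This is an illustrative toy, not a recommended design: the tier names (`small-model`, `large-model`), the relative costs, the keyword list and the threshold are all assumptions, and a real router would more likely use a trained classifier or signals from the product itself than a keyword heuristic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str             # hypothetical model identifier
    relative_cost: float  # assumed cost relative to the cheapest tier


CHEAP = ModelTier("small-model", relative_cost=1.0)
STRONG = ModelTier("large-model", relative_cost=10.0)

# Assumed phrases that hint at a harder request; purely illustrative.
HARD_HINTS = ("prove", "debug", "refactor", "legal", "step by step")


def estimate_difficulty(prompt: str) -> float:
    """Crude difficulty score in [0, 1] based on prompt length and keywords."""
    text = prompt.lower()
    length_score = min(len(text.split()) / 200, 1.0)
    hint_score = 1.0 if any(hint in text for hint in HARD_HINTS) else 0.0
    return max(length_score, hint_score)


def route(prompt: str, threshold: float = 0.5) -> ModelTier:
    """Send hard-looking requests to the strong model, the rest to the cheap one."""
    return STRONG if estimate_difficulty(prompt) >= threshold else CHEAP


print(route("What are your opening hours?").name)                     # small-model
print(route("Debug this stack trace and explain step by step").name)  # large-model
```

The threshold is the deliberate choice the paragraph above describes: lowering it buys quality on borderline requests at a higher cost per request, and raising it does the opposite.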
This is where MLOps earns its keep: monitoring inference latency, cost per request, and quality over time, so you can spot when a fine-tuned smaller model would do the same job for a fraction of the price, or when the cheap model is silently degrading the user experience and needs to be swapped out.
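As a rough illustration of what that monitoring might track, the sketch below records latency, cost and a quality score per request and flags when recent quality drops below a baseline. The class and function names are hypothetical, the quality score is assumed to come from some external evaluation (user ratings, an automated grader, or similar), and the tolerance and window values are arbitrary.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class InferenceStats:
    """In-memory log of per-request measurements for one model."""
    latencies_ms: list = field(default_factory=list)
    costs_usd: list = field(default_factory=list)
    quality_scores: list = field(default_factory=list)  # assumed scale: 0.0 to 1.0

    def record(self, latency_ms: float, cost_usd: float, quality: float) -> None:
        """Store the three measurements for a single request."""
        self.latencies_ms.append(latency_ms)
        self.costs_usd.append(cost_usd)
        self.quality_scores.append(quality)

    def summary(self) -> dict:
        """Mean latency, cost and quality over everything recorded so far."""
        return {
            "mean_latency_ms": mean(self.latencies_ms),
            "mean_cost_usd": mean(self.costs_usd),
            "mean_quality": mean(self.quality_scores),
        }


def is_degrading(stats: InferenceStats, baseline_quality: float,
                 tolerance: float = 0.05, window: int = 50) -> bool:
    """True when mean quality over the last `window` requests falls more than
    `tolerance` below the baseline."""
    recent = stats.quality_scores[-window:]
    if not recent:
        return False
    return mean(recent) < baseline_quality - tolerance


stats = InferenceStats()
for quality in (0.90, 0.90, 0.90, 0.70, 0.70, 0.70):
    stats.record(latency_ms=120.0, cost_usd=0.002, quality=quality)

print(stats.summary())
print(is_degrading(stats, baseline_quality=0.90, window=3))  # True
```

Comparing `summary()` across a cheap and an expensive model on the same traffic is one way to spot the case where the smaller model does the same job for less; `is_degrading` covers the opposite case, where the cheap model's quality has quietly slipped.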