LLM inference at 3,000 tokens/s: why AI speed is already a business problem

There are technical headlines that seem meant only for engineers, but in reality they change the rules for any business owner who is using — or thinking about using — AI. The latest comes from Kog, a French startup that has just demonstrated something that, until very recently, was considered impossible: generating 3,000 tokens per second per request on a 2B model on a standard 8-GPU node, without quantization or speculative decoding.

To put that into perspective: ChatGPT generates around 100 tokens per second. Kog's figure is about 30 times higher (on a small 2B model, so it is not a like-for-like comparison), on hardware that many companies already have in their data centers.

If your first reaction is "okay, and why should I care?", stay with me. This affects you more than it seems.

What "tokens per second" really means

A token is the smallest unit a language model works with — roughly three-quarters of a word. Tokens per second (t/s or tok/s) is the standard measure of how many tokens a model can generate each second, and it is the benchmark used to compare AI model inference speed across different hardware configurations.

In other words: it’s what determines whether your voice bot sounds natural or your customer hangs up before the sentence is over.

And here’s the detail many people overlook: there are three different metrics that constantly get mixed up.

Metric	What it measures	Who cares about it
Aggregate throughput	Total tokens generated per second across all users	The infrastructure provider
Time to first token	How long it takes to start responding	The user waiting
Per-request decode speed	Generation speed once it has started	Your AI agent and your customer

Aggregate throughput measures server utilization and rewards large batches. Time to first token measures prefill latency. Per-request decode speed defines how long a user waits before receiving the full response — and it is where AI agents get stuck.
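As a minimal sketch of how the three metrics differ, the snippet below derives time to first token and per-request decode speed from the arrival timestamps of one streamed response, and aggregate throughput from several concurrent requests. The `RequestTrace` structure, the function names, and the sample timestamps are illustrative assumptions, not measurements from Kog or any provider.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Timestamps (in seconds) for one streamed response. Hypothetical structure."""
    sent_at: float            # when the request was sent
    token_times: list[float]  # arrival time of each generated token


def time_to_first_token(trace: RequestTrace) -> float:
    """Delay between sending the request and receiving the first token."""
    return trace.token_times[0] - trace.sent_at


def decode_speed(trace: RequestTrace) -> float:
    """Tokens per second for one request, counted after the first token arrives."""
    elapsed = trace.token_times[-1] - trace.token_times[0]
    return (len(trace.token_times) - 1) / elapsed


def aggregate_throughput(traces: list[RequestTrace]) -> float:
    """Total tokens per second across all requests in the measurement window."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return total_tokens / (end - start)


# Made-up example: 4 concurrent requests, each waits 0.5 s for the first token,
# then receives 101 tokens spaced 10 ms apart (100 tokens/s per request).
traces = [
    RequestTrace(sent_at=0.0, token_times=[0.5 + 0.01 * i for i in range(101)])
    for _ in range(4)
]

print(f"TTFT: {time_to_first_token(traces[0]):.2f} s")
print(f"Per-request decode speed: {decode_speed(traces[0]):.0f} tokens/s")
print(f"Aggregate throughput: {aggregate_throughput(traces):.0f} tokens/s")
```

In this made-up trace the aggregate figure comes out well above what any single user experiences, which is why the two numbers should not be compared directly.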

Most providers sell you the first number because it is the easiest to inflate. But the one that matters for your business is the third.

Why speed matters more than you think

This is where it gets interesting. An AI agent operates in a sequential loop: inspect, plan, edit, test, review. Each step depends on the previous one. Sometimes tool time dominates, but the generation-heavy steps (planning, writing, analysis, debugging) set the pace of the loop.

Translate that into business terms:

Voice bot handling calls. If it takes 4 seconds to start speaking, the customer hangs up. If it responds with human-like fluency, you book the appointment.
Lead-qualification agent. It has to read the form, check the CRM, verify availability, and generate a response. If each step takes 3 seconds, that’s 12 seconds before the first word. If each takes 300 ms, the total is 1.2 seconds and it feels close to instant.
Assistant answering FAQs. At 100 t/s, the customer sees the answer trickle in. At 1,000 t/s, it appears all at once — and the conversation flows.
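The lead-qualification arithmetic can be sketched as a sum over sequential steps. This is a rough model under stated assumptions: the step names come from the example above, while the token counts, the default of zero tool time, and the names `Step` and `loop_seconds` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    output_tokens: int         # tokens the model generates in this step (assumed)
    tool_seconds: float = 0.0  # time spent outside the model, e.g. a CRM call


def loop_seconds(steps: list[Step], decode_tokens_per_s: float) -> float:
    """Total latency of a sequential loop: each step waits for the previous one."""
    return sum(
        s.tool_seconds + s.output_tokens / decode_tokens_per_s for s in steps
    )


steps = [
    Step("read form", 300),
    Step("check CRM", 300),
    Step("verify availability", 300),
    Step("generate response", 300),
]

for speed in (100, 1_000, 3_000):
    print(f"{speed:>5} tokens/s -> {loop_seconds(steps, speed):.1f} s per loop")
```

With these assumed numbers, 100 tokens/s reproduces the 12-second case and 1,000 tokens/s the 300 ms-per-step case. Setting `tool_seconds` above zero shows the limit of the argument: once tool time dominates, faster generation stops helping.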

At 4,000 tokens per second, a generation-bound iteration cycle that takes 5 minutes at 100 t/s drops to about 7.5 seconds. The agent drafts, reviews, tests, and refines. You think. The cycle runs 40 times in the time a standard stack completes one.

This is not just "faster." It is a different product category.

The bottleneck is not what it seems

A lot of people assume AI is slow because it "does so much computing." False. The real problem is memory bandwidth, not compute.

Standard autoregressive generation produces one token at a time, and every new token requires a full forward pass. When you run an LLM, the model weights (billions of parameters) sit in the GPU’s VRAM, and for every token the model generates, those weights have to be read from VRAM into the compute cores. At small batch sizes, the GPU spends most of its time waiting on those reads rather than calculating.

It’s like having a Formula 1 car with a garden hose as the fuel line. The GPU can calculate incredibly fast — the problem is feeding it data.
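A back-of-the-envelope ceiling makes the memory argument concrete: if every generated token requires reading all the weights once, decode speed for a single request cannot exceed memory bandwidth divided by model size in bytes. The sketch below assumes 16-bit weights (2 bytes per parameter), a hypothetical 3 TB/s of memory bandwidth per GPU, and ideal scaling across 8 GPUs. It ignores the KV cache, activations, and inter-GPU communication, so real numbers will be lower.

```python
def decode_ceiling_tokens_per_s(
    params: float,
    bytes_per_param: float,
    bandwidth_bytes_per_s: float,
    num_gpus: int = 1,
) -> float:
    """Upper bound on single-request decode speed when weight reads dominate.

    Assumes each token reads every weight once and that the weights are split
    evenly across GPUs with no communication overhead (an idealization).
    """
    model_bytes = params * bytes_per_param
    return num_gpus * bandwidth_bytes_per_s / model_bytes


PARAMS = 2e9            # 2B-parameter model, as in the Kog demo
BYTES_PER_PARAM = 2     # assumption: 16-bit weights
BANDWIDTH = 3e12        # assumption: 3 TB/s per GPU, not a quoted spec

one_gpu = decode_ceiling_tokens_per_s(PARAMS, BYTES_PER_PARAM, BANDWIDTH)
eight_gpus = decode_ceiling_tokens_per_s(PARAMS, BYTES_PER_PARAM, BANDWIDTH, 8)

print(f"1 GPU ceiling:  {one_gpu:,.0f} tokens/s")
print(f"8 GPU ceiling:  {eight_gpus:,.0f} tokens/s")
```

Under these assumptions the ceiling is 750 tokens/s on one GPU and 6,000 on eight, so a per-request result in the thousands is plausible only if the software keeps memory reads close to saturated.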

That is why what Kog has done matters: they have shown that the decoding speed ceiling of standard datacenter GPUs is much higher than current inference stacks expose, because of software bottlenecks. The hardware was already there. It just had to be used properly.

Does this change anything for a small or midsize business?

Yes. And this is where I want to be very clear, because there is a lot of noise around news like this.

What does NOT change tomorrow:

Your OpenAI or Anthropic bill. These advances take months (or years) to reach commercial APIs.
The need to design your automations well. A fast bot that is poorly designed is still a poorly designed bot.
Human judgment. Speed does not replace design.

What DOES change in the medium term:

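The cost per interaction will go down. When the same hardware serves 10x more tokens per second in aggregate, the cost per request falls roughly in proportion. Faster per-request decoding alone does not guarantee that: it depends on how many requests the node can still serve at once. That is what makes automations that are expensive today viable next year.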
Voice agents become indistinguishable from humans. Today, a bot that replies in 2 seconds sounds like a bot. With this speed, generation can fit in under 300 ms, short enough to pass for a natural pause in conversation.
Complex agentic workflows become practical. Today, chaining 10 reasoning steps is prohibitively slow. Tomorrow, it will be trivial. That opens the door to automations we are not even considering yet.
Small, specialized models win. Kog stands out particularly on small models (1B-7B parameters) that can be specialized and fine-tuned to match the accuracy of much larger models on specific tasks, at a fraction of the cost and ten times faster. For an SMB, that is gold: you do not need GPT-5 to answer the phone at your clinic. You need a small, fast model trained well on YOUR information.
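The cost point in the list above can be sketched as simple arithmetic: cost per interaction is the hourly price of the hardware divided by how many interactions it serves per hour, which depends on aggregate throughput across all users. The hourly price, throughput figures, and tokens per interaction below are hypothetical placeholders, not vendor pricing.

```python
def cost_per_interaction(
    node_cost_per_hour: float,
    aggregate_tokens_per_s: float,
    tokens_per_interaction: int,
) -> float:
    """Hardware cost of one interaction, assuming the node is fully utilized."""
    interactions_per_hour = aggregate_tokens_per_s * 3600 / tokens_per_interaction
    return node_cost_per_hour / interactions_per_hour


NODE_COST = 36.0        # assumption: dollars per hour for one node
TOKENS = 500            # assumption: tokens generated per interaction

baseline = cost_per_interaction(NODE_COST, 1_000, TOKENS)
faster = cost_per_interaction(NODE_COST, 10_000, TOKENS)

print(f"baseline:       ${baseline:.4f} per interaction")
print(f"10x throughput: ${faster:.4f} per interaction")
```

The 10x saving only appears if aggregate throughput rises 10x. A stack that speeds up each request by serving fewer requests at once would not show it.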

What this means for your decisions today

If you are thinking about implementing AI in your business, there are a few things worth keeping in mind:

Do not marry your infrastructure. What is "best" today for speed and cost will be obsolete in 6 months. Design your automations so they can switch providers without rebuilding everything.
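One way to keep that flexibility, sketched here under assumptions, is to have the automation depend on a small interface rather than on a specific vendor SDK. `TextModel`, `ProviderA`, `ProviderB`, and `qualify_lead` are hypothetical names, and the provider classes are stubs standing in for real API clients.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only surface the automation is allowed to depend on."""

    def generate(self, prompt: str) -> str: ...


class ProviderA:
    """Stub for one vendor; a real version would call that vendor's API."""

    def generate(self, prompt: str) -> str:
        return f"[provider-a] {prompt}"


class ProviderB:
    """Stub for another vendor or a self-hosted small model."""

    def generate(self, prompt: str) -> str:
        return f"[provider-b] {prompt}"


def qualify_lead(model: TextModel, form_text: str) -> str:
    """Business logic written once, against the interface only."""
    return model.generate(f"Qualify this lead: {form_text}")


# Switching providers is a one-line change at the call site.
print(qualify_lead(ProviderA(), "wants an appointment next week"))
print(qualify_lead(ProviderB(), "wants an appointment next week"))
```

The design choice is that prompts and workflow logic live in functions like `qualify_lead`, so replacing a provider means writing one new adapter class instead of touching every automation.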

Do not optimize too early. If your bot handles 10 calls a day, it does not matter whether it runs at 100 t/s or 3,000. Speed is a problem when you have volume. Start by making it work, then optimize.

Do pay attention to latency in human interactions. In any workflow where a customer is waiting for a response — voice, live chat, forms — response speed is UX. And UX is conversion.

Small, specialized models > giant generic models. For most SMB use cases (answering calls, qualifying leads, replying to FAQs), a well-tuned 7B model can beat GPT-5 on speed and cost, and often on accuracy within the specific domain.

How we see it at Studio SmartWork

We have been designing voice agents and automations since before this became a public conversation. And what we see every day is that speed is no longer a “technical detail”: it is the difference between a bot your customer tolerates and one your customer prefers over speaking with a human.

Our voice agent for an aesthetic clinic in Málaga calls each Meta Ads lead within 60 seconds of the moment they submit the form. That alone changes the conversion metric. But the conversation itself also has to flow, and that depends on inference latency, prompt design, and workflow architecture.

News like Kog’s is a good sign: the ground is shifting in the right direction. What is expensive to build well today will be cheaper, faster, and more accessible tomorrow. And businesses that already have their processes automated will be in position to take advantage from day one.

Those who keep waiting "for AI to mature" will still be waiting in two years — because AI never stops maturing. The question is not whether to wait. It is where to start.

Executive summary

Kog has demonstrated 3,000 tokens/s per request on a 2B model on a standard 8-GPU node without quantization or speculative decoding, compared with ChatGPT’s ~100 t/s.
AI’s bottleneck is not compute, it is memory bandwidth. Current hardware can do far more than software stacks expose.
For your business, this means: voice agents indistinguishable from humans, practical complex agentic workflows, and lower cost per interaction.
This does not change your decisions today, but it does confirm the direction is right: small, specialized, fast, well-integrated models beat generic, slow solutions.
Inference speed is no longer an engineering problem. It is a business lever.