AI for Customer Support: A Practitioner's Playbook (2026) – A&M Flow
An operator's guide to shipping AI customer support in 2026. Model choice, vendor reality, tool design and the eval set everyone skips until it bites.
Published: 2026-05-05 · Author: A&M Flow
The model is the cheapest decision you make. What kills support bots in month three is the eval set you skipped, the refund tool with no spending cap and the multilingual launch you treated as a model problem.
Most AI support bots that get fired in month three were doomed in week one, when somebody picked a vector database before writing fifty representative test conversations.
The takeaway most operators drew was the wrong one. It was not that LLMs cannot do support. It was that resolution rate is a vanity metric the moment a customer's only choices are auto-resolve or escalate-and-suffer. If resolution rate is the only thing you measure, you ship a system optimised for closing tickets, not solving problems. Klarna's number was real. The thing that number measured was not what the board thought they were buying.
Intercom Fin is good. It is good in a narrow way: B2B SaaS with clean help-centre docs, low SKU complexity and a CSAT bar that tolerates the occasional shrug. The pricing per resolved conversation looks reasonable until you model it against your real ticket mix and realise you are paying a premium for the easy ones the bot was going to win anyway. For consumer brands with messy product catalogues, multilingual returns flows or anything regulated, Fin is the wrong shape of tool. Zendesk AI agents are in roughly the same bucket, with the additional friction that you are now married to Zendesk's data model.
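To make "model it against your real ticket mix" concrete, here is a minimal sketch of the arithmetic. Every number in it is a made-up placeholder: the price is not a vendor quote, and the categories, volumes and resolution rates are hypothetical. Swap in your own helpdesk export before drawing any conclusion.

```python
# Hypothetical placeholder, NOT a real vendor price.
PRICE_PER_RESOLUTION = 1.00

# Hypothetical ticket mix: monthly volume and the share the bot resolves.
TICKET_MIX = {
    "password_reset": {"monthly_tickets": 4000, "bot_resolution_rate": 0.90},
    "order_status": {"monthly_tickets": 3000, "bot_resolution_rate": 0.80},
    "damaged_item_return": {"monthly_tickets": 2000, "bot_resolution_rate": 0.20},
    "billing_dispute": {"monthly_tickets": 1000, "bot_resolution_rate": 0.10},
}

# Assumption: these are the categories a plain FAQ page would mostly deflect anyway.
EASY_CATEGORIES = {"password_reset", "order_status"}


def spend_by_category(mix: dict, price: float) -> dict:
    """Monthly spend per category under per-resolution pricing."""
    return {
        name: row["monthly_tickets"] * row["bot_resolution_rate"] * price
        for name, row in mix.items()
    }


spend = spend_by_category(TICKET_MIX, PRICE_PER_RESOLUTION)
total_spend = sum(spend.values())
easy_spend = sum(spend[name] for name in EASY_CATEGORIES)
easy_share = easy_spend / total_spend

print(f"Share of spend going to easy tickets: {easy_share:.0%}")
```

With these placeholder inputs, most of the bill lands on the two easy categories. The point is the shape of the calculation, not the figures.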
Build only when the off-the-shelf path forces you to bend your business around its limits. The honest middle ground for most teams I talk to is a thin custom orchestration layer over a managed model and a managed vector store, talking to your existing helpdesk through its API. You write maybe four thousand lines of code. You own the bits that matter, which are the tool definitions, the eval harness and the escalation routing. You rent the bits that do not, which are the model weights and the storage.
The work that decides whether your bot survives contact with real traffic is the part nobody puts in the proposal deck. None of it is the model.
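The eval set does not need a framework to get started. As a rough sketch, assuming your bot can be wrapped in a function that takes a customer message and returns an action plus an optional tool call, a first harness can look like this. The names `EvalCase`, `BotResult`, `run_evals` and `stub_bot` are hypothetical, and `stub_bot` is a deliberately naive stand-in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class EvalCase:
    name: str
    customer_message: str
    expected_action: str                 # "resolve" or "escalate"
    expected_tool: Optional[str] = None  # tool the bot should call, if any


@dataclass(frozen=True)
class BotResult:
    action: str
    tool_called: Optional[str] = None


def run_evals(bot: Callable[[str], BotResult], cases: list[EvalCase]) -> dict:
    """Run every case through the bot and collect the names of failing cases."""
    failures = []
    for case in cases:
        result = bot(case.customer_message)
        matches = (
            result.action == case.expected_action
            and result.tool_called == case.expected_tool
        )
        if not matches:
            failures.append(case.name)
    return {
        "total": len(cases),
        "passed": len(cases) - len(failures),
        "failures": failures,
    }


def stub_bot(message: str) -> BotResult:
    """Naive stand-in for the real bot: refunds anything that mentions a refund."""
    if "refund" in message.lower():
        return BotResult(action="resolve", tool_called="issue_refund")
    return BotResult(action="escalate")


CASES = [
    EvalCase("simple_refund", "I want a refund for order 1001", "resolve", "issue_refund"),
    EvalCase("legal_threat", "My lawyer will be in touch", "escalate"),
    EvalCase("wants_human", "Third time asking for a refund, get me a human", "escalate"),
]

report = run_evals(stub_bot, CASES)
print(report)
```

The third case fails against the stub, which is the job of the harness: it surfaces the conversation where closing the ticket and solving the problem diverge. Fifty of these, drawn from real transcripts, is the starting point.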
The capability shift that matters in 2026 is tool calling that mutates state. The bot is no longer just answering questions; it is issuing refunds, changing addresses and pausing subscriptions. This is where the money is and also where the lawsuits live. The right design is three tiers. The wrong design is one big bucket called assistant permissions.
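One way to picture tiered permissions, as opposed to a single bucket, is a lookup from tool name to tier with a default of deny. The tier names and the tool-to-tier mapping below are my illustration, not a standard: read-only lookups, reversible account changes, and irreversible money movement. All tool names are hypothetical.

```python
from enum import Enum


class Tier(Enum):
    READ_ONLY = 1     # looks things up, changes nothing
    REVERSIBLE = 2    # changes state that can be undone
    IRREVERSIBLE = 3  # moves money or cannot be undone


# Hypothetical tool names and tier assignments; adjust to your own tools.
TOOL_TIERS = {
    "lookup_order": Tier.READ_ONLY,
    "change_address": Tier.REVERSIBLE,
    "pause_subscription": Tier.REVERSIBLE,
    "issue_refund": Tier.IRREVERSIBLE,
}


def authorize(tool_name: str, customer_confirmed: bool = False) -> str:
    """Return a policy decision for a tool call requested by the model."""
    tier = TOOL_TIERS.get(tool_name)
    if tier is None:
        return "deny"  # tools nobody classified are denied by default
    if tier is Tier.READ_ONLY:
        return "allow"
    if not customer_confirmed:
        return "ask_confirmation"  # any state change needs an explicit yes
    if tier is Tier.REVERSIBLE:
        return "allow"
    return "allow_with_caps"  # irreversible calls still pass through spend limits


print(authorize("lookup_order"))
print(authorize("change_address"))
print(authorize("issue_refund", customer_confirmed=True))
print(authorize("delete_account"))
```

The design choice worth copying is the default: a tool that has not been assigned a tier gets denied, so adding a new tool forces somebody to decide how dangerous it is.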
If your refund tool does not have a per-call spending cap and a daily aggregate cap, you do not have a refund tool, you have an incident waiting for a Hacker News thread. The cap is not there because the model is dumb. It is there because a determined customer with a working browser and a willingness to retry will find a phrasing that gets past your guardrails, and the cap is what limits the blast radius when they do.
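A minimal sketch of the two caps, assuming the check runs in your own code before the payment provider is ever called. `RefundGuard`, `RefundDenied` and `try_refund` are hypothetical names, the cap values are placeholders, and the daily total here is aggregated across every refund the bot issues, kept in memory. A real deployment would keep that total in shared storage.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal


class RefundDenied(Exception):
    """Raised when a refund request breaks a cap."""


@dataclass
class RefundGuard:
    per_call_cap: Decimal
    daily_cap: Decimal
    # Running total per calendar day (in-memory only in this sketch).
    _spent: dict = field(default_factory=dict)

    def authorize(self, amount: Decimal, today: date) -> None:
        """Record the refund if it fits both caps, otherwise raise RefundDenied."""
        if amount <= 0:
            raise RefundDenied("amount must be positive")
        if amount > self.per_call_cap:
            raise RefundDenied("per-call cap exceeded")
        spent_today = self._spent.get(today, Decimal("0"))
        if spent_today + amount > self.daily_cap:
            raise RefundDenied("daily cap exceeded")
        self._spent[today] = spent_today + amount


def try_refund(guard: RefundGuard, amount: str, today: date) -> str:
    """Convenience wrapper that turns the exception into a readable result."""
    try:
        guard.authorize(Decimal(amount), today)
        return "approved"
    except RefundDenied as exc:
        return f"denied: {exc}"


# Placeholder caps: 50 per call, 120 per day.
guard = RefundGuard(per_call_cap=Decimal("50"), daily_cap=Decimal("120"))
day = date(2026, 5, 5)

results = [
    try_refund(guard, "50", day),  # fits both caps
    try_refund(guard, "80", day),  # over the per-call cap
    try_refund(guard, "50", day),  # brings the day's total to 100
    try_refund(guard, "50", day),  # would bring the day's total to 150
]
print(results)
```

The guard sits outside the model on purpose. Whatever phrasing gets past the prompt-level guardrails, the amount still has to pass a plain comparison the model cannot talk its way around.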
I keep seeing teams treat multilingual as a model problem. It is not. Sonnet 4.5 and GPT-5 both speak twenty-plus languages well enough to fool a casual listener. The problem is that your help articles exist in English, your returns policy exists in English with a half-translated German version from 2022, and your Polish escalation team uses a different ticket schema than your Spanish one. The bot speaks the language fluently and gives confidently wrong answers because the underlying knowledge is wrong or missing.
Launch in two languages. Get your CSAT, escalation rate and refund-disputed rate roughly equal across both. Only then add the third. Teams that launch in seven on day one are choosing the same dashboard screenshot Klarna chose.
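The "roughly equal" gate can be written down so nobody argues about it later. This is a sketch under assumptions: all three metrics are expressed as fractions between 0 and 1, and the 0.05 tolerance is an arbitrary placeholder you should set yourself. The function name and the sample figures are hypothetical.

```python
GATE_METRICS = ("csat", "escalation_rate", "refund_disputed_rate")


def ready_for_next_language(metrics_by_language: dict, tolerance: float = 0.05) -> bool:
    """True when every gate metric differs by at most `tolerance` across languages."""
    for metric in GATE_METRICS:
        values = [row[metric] for row in metrics_by_language.values()]
        if max(values) - min(values) > tolerance:
            return False
    return True


# Hypothetical figures for a two-language launch.
lagging = {
    "en": {"csat": 0.86, "escalation_rate": 0.22, "refund_disputed_rate": 0.03},
    "de": {"csat": 0.78, "escalation_rate": 0.24, "refund_disputed_rate": 0.04},
}
at_parity = {
    "en": {"csat": 0.86, "escalation_rate": 0.22, "refund_disputed_rate": 0.03},
    "de": {"csat": 0.84, "escalation_rate": 0.24, "refund_disputed_rate": 0.04},
}

print(ready_for_next_language(lagging))    # CSAT gap is larger than the tolerance
print(ready_for_next_language(at_parity))  # all three gaps are within the tolerance
```

When the gate fails, the fix is usually in the knowledge base for the lagging language, not in the model.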
Article sections
- Why the headline number was never the point
- Pick the model last, not first
- Intercom Fin, Zendesk AI and the build-vs-buy trap
- Retrieval, tools and the eval set you keep skipping
- What you let the bot actually do
- Multilingual support is mostly a knowledge problem
- What we would actually ship in twelve weeks
- If you want a second pair of eyes
Key quotes
The bot is as smart as the worst-translated paragraph in your knowledge base. Fix that paragraph before you change the model.