What is the ROI for AI customer service?
Why knowledge base quality determines chatbot ROI

Chatbots can save roughly $0.50–$0.70 per interaction in the healthcare and banking sectors, per Juniper Research estimates; a human agent interaction costs $7–$15.[1] That math has convinced 85% of customer service leaders to explore or pilot conversational AI in 2025.[2] But here's the number that should follow them into every planning meeting: 64% of customers would prefer companies didn't use AI for customer service at all, and 53% would consider switching to a competitor if they did.[3] The gap between executive optimism and customer reality isn't closing — it's widening. Chatbots reduce customer service costs only when companies invest in the information infrastructure behind them. Most don't, which is why the majority of AI deployments fail and even headline successes have had to reverse course.
Key Points
64% of customers would prefer companies didn't use AI for customer service; 53% would consider switching to a competitor that doesn't (Gartner, 2024 survey of 5,728 customers).[3]
Lessons Learned
Audit your knowledge base before selecting an AI model. Incomplete or contradictory documentation is the single largest predictor of chatbot failure.
What is the real cost of deploying chatbots for customer service?
The pitch is straightforward: automate routine inquiries, cut per-interaction costs by 90%, free up human agents for complex work. Chatbot adoption has grown 4.7× since 2020, with approximately 60% of B2B and 42% of B2C companies now running chatbot software.[5] The technology works — in narrow, well-documented scenarios. Returns and cancellations see 58% resolution rates. Billing disputes hit 17%.[13]
That variance is the story. The question was never "can chatbots reduce costs?" They can. The question is what happens to the other 83% of billing disputes, the edge cases, the emotionally charged interactions, the queries that fall outside whatever the system was trained to handle. What happens is what happened to Jake Moffatt.
In November 2022, Moffatt contacted Air Canada's chatbot to ask about bereavement fares after his grandmother died. The chatbot fabricated a policy — told him he could book a full-price ticket and apply for a bereavement discount retroactively within 90 days. No such policy existed. Air Canada's actual bereavement fare policy was published on the same website the chatbot was ostensibly trained on. Moffatt booked flights costing over $700 CAD based on the chatbot's fabricated guidance. When he applied for the retroactive discount, Air Canada refused and argued that the chatbot was "a separate legal entity" responsible for its own statements. The Civil Resolution Tribunal of British Columbia rejected that argument entirely, ruling in February 2024 that Air Canada was liable for all information on its website, including chatbot outputs, and awarded Moffatt $812.02 in damages.[16][14]
The root cause was not a defective AI model. The root cause was that Air Canada deployed a language model on top of its website without ensuring the model could accurately retrieve and relay the company's own policies. The information existed. The infrastructure to deliver it reliably did not.
How do you measure whether chatbots are actually reducing costs?
Most organizations measure the wrong things. They track deflection rate — the percentage of conversations handled without a human agent — and declare victory when it rises. But deflection says nothing about whether the customer's problem was solved. Gartner found that only 14% of customer service issues are fully resolved in self-service, and in more than 43% of self-service cases, the system failed to surface content relevant to the customer's actual issue.[6]
Resolution rates vary so dramatically by issue type that a single aggregate number is meaningless. Gartner's survey of 497 B2B and B2C customers found chatbots resolved 58% of returns and cancellations but only 17% of billing disputes.[13] If your chatbot handles mostly simple inquiries and you're measuring containment rate across all interactions, you're averaging your successes with your failures and calling the result acceptable.
The industry is starting to recognize this. Zendesk migrated its pricing model from monthly active users to "automated resolutions" in November 2024 — an implicit acknowledgment that conversations started are not conversations solved.[17] The metric shift matters: when vendors charge per resolution rather than per interaction, they're incentivized to make the system actually work, not just respond.
Why is the chatbot failure problem getting worse?
Because deployment is accelerating faster than the infrastructure to support it. Eighty-five percent of customer service leaders plan to explore or pilot conversational GenAI in 2025.[2] Chatbot adoption grew 4.7× in five years.[5] But 70% of even the highest-performing AI organizations — those generating more than 10% of EBIT from generative AI — report significant difficulties with data quality and infrastructure.[18]
The compounding factor is generative AI itself. Pre-GPT chatbots failed predictably: they couldn't understand the query, said "I don't understand," and escalated. They were annoying but honest about their limitations. Large language models fail unpredictably: they generate fluent, confident, detailed answers that happen to be fabricated. The Air Canada chatbot didn't say "I don't know." It invented a policy, complete with specific procedural steps and a 90-day timeline. The failure mode shifted from "unhelpful" to "actively misleading," and the cost shifted with it.
A Stanford study testing specialized legal AI tools — systems explicitly designed with retrieval-augmented generation to reduce hallucination — found they produced incorrect information 17% to over 34% of the time, depending on the product tested.[19] These aren't general-purpose chatbots giving vague answers. These are purpose-built, RAG-enhanced tools deployed in a domain where accuracy is existential. If the best available legal AI hallucinates up to a third of the time, what's happening in your customer service queue?
What do customers actually say about chatbot experiences?
They say what the surveys say they say, and the surveys are converging on the same conclusion from multiple directions.
Sixty-four percent of customers would prefer companies didn't use AI for customer service. Fifty-three percent would consider switching to a competitor.[3] More than two-thirds report having had a bad chatbot experience.[4] Thirty percent say a negative chatbot interaction makes them likely to take their purchase to a different brand.[20] And 72% of consumers say they trust companies less than they did a year ago, with 60% saying advances in AI make trustworthiness more important, not less.[21]
The Salesforce "State of Service" report captures the split perfectly: 95% of decision-makers at organizations with AI report cost and time savings, and 92% say generative AI improves customer service.[22] The people deploying the bots think they're working. The people using the bots disagree. Somebody is wrong, and the customer is the one with the credit card.
---
What happens when chatbots fail publicly?
The Air Canada case established that chatbot failures carry legal liability. The cases that followed established that they carry reputational costs that scale at the speed of social media.
What happened with DPD's chatbot?
On January 18, 2024, customer Ashley Beauchamp contacted DPD's AI chatbot to track a missing parcel. The chatbot couldn't help. When Beauchamp pushed, the chatbot swore at him, called DPD "the worst delivery firm in the world," and composed a poem about how terrible its own company was. Beauchamp posted screenshots to X. DPD confirmed the error and disabled the AI element of the chatbot. The screenshots went viral — the kind of virality a logistics company's communications budget could never recover from.[23]
What happened with Chevrolet of Watsonville's chatbot?
In December 2023, a user named Chris Bakke asked the dealership's ChatGPT-powered chatbot — built by third-party vendor Fullpath — to sell him a 2024 Chevrolet Tahoe for $1. The chatbot agreed, responding "That's a deal, and I'll hold you to it." The exchange generated over 20 million views on X. The car was never actually sold. The dealership took the chatbot offline.[24] The Tahoe's MSRP starts at roughly $58,000. The chatbot didn't understand it was making a commitment. It was completing a sentence.
What happened with the NEDA eating disorders chatbot?
In May 2023, the National Eating Disorders Association took down its chatbot "Tessa" after counselor Sharon Maxwell discovered it was dispensing weight-loss advice — calorie counting, weigh-ins, specific deficit targets — to people seeking help for eating disorders. Tessa had been designed as a rule-based tool for body image support. The vendor, Cass (X2AI), had added generative AI capabilities without NEDA's knowledge or approval.[25] This was the most consequential failure in the set. The others cost money and reputation. This one risked lives.
What do the failure patterns tell us?
Every case follows the same structure: a language model deployed without adequate constraints on what it can say, connected to insufficient or contradictory source material, with no effective escalation mechanism when it goes off-script. Air Canada's chatbot fabricated a policy. DPD's chatbot fabricated profanity. Chevrolet's chatbot fabricated a sales agreement. NEDA's chatbot fabricated medical advice. The model was never the differentiator. The absence of reliable information architecture was.
---
How widespread is the chatbot failure problem at the enterprise level?
The visible failures — the ones that go viral — represent the tail of a much larger distribution. The systemic data is worse than the anecdotes suggest.
By RAND Corporation's estimate, more than 80% of AI projects fail to reach production — twice the failure rate of IT projects that don't involve AI.[26] MIT's Project NANDA, studying over 300 AI initiatives across industries, found that 95% of organizations deploying generative AI are getting zero measurable financial return.[27] Gartner predicted in July 2024 that at least 30% of generative AI projects would be abandoned after proof of concept by end of 2025.[7] By end of 2025, they found that figure had risen to at least 50%.[8]
These aren't pilot projects at experimental startups. These are enterprise deployments backed by billions in cumulative investment. The failure rate isn't a function of the technology being too new. It's a function of organizations treating AI deployment as a technology procurement problem instead of a data infrastructure problem.
What do users consistently complain about?
The research identifies five distinct chatbot failure categories: inability to comprehend and provide accurate information, over-enquiry of personal or sensitive information, fake humanity (simulating empathy the system cannot feel), poor integration with human agents, and inability to solve complicated queries.[28] These five categories span industries, platforms, and model architectures. They're infrastructure complaints, not intelligence complaints.
When 61% of customer service leaders report having a backlog of knowledge base articles to edit, the chatbot's source material is already compromised before it generates a single response.[2] The system is confidently retrieving information from documentation that hasn't been updated, reviewed, or structured for machine retrieval.
How often do chatbots give technically wrong answers?
Often enough that the math should concern any CFO. If a chatbot handles 1,000 conversations a day with a 5% hallucination rate, that's 50 customers per day receiving fabricated information. Over a month, that's 1,500 instances of your brand confidently delivering wrong answers. Some of those answers will involve return policies. Some will involve billing. Some, as Air Canada demonstrated, will involve legal obligations your company didn't know it was making.
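The exposure math above is simple enough to sanity-check in a few lines. The volumes and the 5% rate are the article's illustrative figures, not measurements:

```python
# Back-of-envelope estimate of monthly hallucination exposure.
# The inputs are the illustrative figures from the text, not measurements.
conversations_per_day = 1_000
hallucination_rate = 0.05   # 5% of responses contain fabricated information
days_per_month = 30

wrong_per_day = conversations_per_day * hallucination_rate
wrong_per_month = wrong_per_day * days_per_month

print(f"{wrong_per_day:.0f} fabricated answers per day")      # 50
print(f"{wrong_per_month:.0f} fabricated answers per month")  # 1500
```

At any realistic volume, even a single-digit hallucination rate produces a steady stream of confidently wrong brand statements.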
The Stanford study on legal AI tools found hallucination rates of 17% for Lexis+ AI and over 34% for Westlaw AI-Assisted Research — tools with retrieval-augmented generation specifically designed to minimize fabrication.[19] Customer service chatbots typically operate with less sophisticated retrieval pipelines and less curated knowledge bases than legal research tools. The baseline rate for uncontrolled environments is likely higher, not lower.
---
What happens if you deploy chatbots without fixing the underlying data?
The damage compounds through a feedback loop that most organizations don't detect until it's already entrenched.
What does the data say about long-term failure rates?
Gartner's revision from 30% to at least 50% abandonment tells a story about what happens after the proof of concept. Initial pilots work because they're scoped to the simplest use cases with the cleanest data. When organizations try to scale — expanding to new query types, new customer segments, new product lines — the knowledge gaps that were invisible at pilot scale become dominant at production scale.
Seventy-seven percent of consumers say a bad self-service experience is worse than not offering self-service at all.[29] This is the compounding risk: a chatbot that works 60% of the time doesn't just fail 40% of the time. It erodes trust in the entire channel, driving customers to phone queues that the chatbot was supposed to reduce. Zendesk's benchmark data shows 73% of consumers will switch to a competitor after multiple bad experiences.[30] When your cost-saving channel actively generates the bad experiences that trigger churn, the savings calculation inverts.
How does bad chatbot data create a negative feedback loop?
A chatbot gives a wrong answer. The customer calls a human agent to get the right one. The human agent resolves the issue but doesn't flag the chatbot's source material for correction because no feedback mechanism connects the two channels. The knowledge base remains unchanged. The next customer gets the same wrong answer. The chatbot's analytics show high containment because the conversation technically stayed in the bot. The quality metrics don't catch it because the customer who called the agent is logged as a separate interaction. The organization sees a cost-per-resolution it likes and a containment rate it can present to the board. The actual customer experience degrades invisibly.
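One way to surface this loop is to cross-reference the channels that the metrics keep separate. The sketch below is a minimal illustration, assuming you can export (customer, timestamp) records from both the bot and the phone queue; the function name and the 24-hour window are assumptions, not a standard:

```python
from datetime import datetime, timedelta

def find_hidden_failures(bot_sessions, agent_calls, window_hours=24):
    """Flag bot conversations followed by an agent call from the same
    customer within `window_hours` -- a proxy for 'the bot answered, but
    the customer still had to phone a human'. Each record is a
    (customer_id, timestamp) tuple."""
    window = timedelta(hours=window_hours)
    calls_by_customer = {}
    for customer_id, ts in agent_calls:
        calls_by_customer.setdefault(customer_id, []).append(ts)

    hidden = []
    for customer_id, ts in bot_sessions:
        for call_ts in calls_by_customer.get(customer_id, []):
            if ts <= call_ts <= ts + window:
                hidden.append((customer_id, ts))
                break
    return hidden

# Example: customer "c1" used the bot, then called an agent 3 hours later.
bot = [("c1", datetime(2025, 3, 1, 9, 0)), ("c2", datetime(2025, 3, 1, 10, 0))]
calls = [("c1", datetime(2025, 3, 1, 12, 0))]
print(find_hidden_failures(bot, calls))  # [("c1", ...)] -- c2 stayed resolved
```

Sessions flagged this way appear as "contained" in naive deflection metrics even though the customer escaped to a human — which is exactly the blind spot the paragraph describes.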
This is how 95% of organizations end up with zero measurable return. Not through a single catastrophic failure, but through a slow accumulation of small inaccuracies that the measurement framework was never designed to detect.
---
Why do most chatbot deployments fail to deliver ROI?
The conventional explanation is that AI models aren't smart enough yet — that organizations need to wait for better models, or invest in the most advanced ones available. The evidence says the opposite.
A 2025 peer-reviewed study in JMIR Cancer tested chatbot hallucination rates across model generations and knowledge base configurations. The findings were unambiguous: GPT-4 without a curated knowledge base hallucinated approximately 40% of the time. GPT-4 with a curated knowledge base hallucinated 0% of the time. GPT-3.5 — the previous, cheaper, less capable model — achieved a 6% hallucination rate with the same curated knowledge base, far outperforming GPT-4 running on unvetted data.[11] The knowledge base quality mattered more than the model generation.
This isn't a single anomalous finding. It converges with Andrew Ng's data-centric AI research, where improving the dataset produced a roughly 16% accuracy improvement on a defect detection task while improving the model produced 0% improvement.[31] It converges with BCG's study of 1,000 senior executives, which found that successful AI implementations allocate 10% of effort to algorithms, 20% to technology and data, and 70% to people and processes.[12] It converges with McKinsey's finding that 70% of even the highest-performing AI organizations experienced significant difficulties with data infrastructure.[18]
The root cause of chatbot failure is not artificial intelligence. It's the absence of real intelligence in the knowledge base the AI depends on.
Companies are spending months evaluating model vendors, running benchmark comparisons between GPT-4 and Claude and Gemini, debating temperature settings and prompt engineering strategies — while their knowledge base contains outdated articles, contradictory policies, and undocumented procedures that no model, regardless of sophistication, can transform into accurate answers.
What does a successful chatbot deployment look like?
The successes prove the same thesis the failures do. The difference is never the model. It's always the infrastructure.
Octopus Energy deployed the same underlying GPT technology as the companies that failed — but made five infrastructure decisions that changed the outcome. They implemented human-in-the-loop approval for AI-generated responses. They built a verification system that flagged ungrounded claims. They integrated deeply with backend customer lifecycle data so the system had real context, not just documentation. They used AI to augment agents rather than replace them. And they invested in careful onboarding and training for the humans working alongside the system. The result: AI-assisted emails scored higher customer satisfaction than human-only responses, the system processed the equivalent output of 250 full-time employees, and the underlying Kraken platform became valuable enough to anchor a separate business at an $8.65 billion valuation.[33][32]
Klarna's initial deployment followed similar principles. They whitelisted specific topics the chatbot was authorized to address. They integrated it with transactional backend systems so it had access to actual order data, not just FAQ pages. They designed escalation paths based on topic boundaries, routing conversations to human agents when the query left the chatbot's authorized domain.[34] The result: 2.3 million conversations in the first month, resolution times dropping from 11 minutes to under 2 minutes, and a projected $40 million profit improvement.[9]
Then Klarna pushed further. They optimized for cost reduction, expanded the chatbot's scope, and reduced human oversight. In 2025, they reversed course and reinvested in human agents.[10] The CEO's own framing is revealing: he had emphasized that the chatbot's quality depended on Klarna's "really crazy documentation depth."[35] When the system scaled beyond what the documentation could reliably support, quality dropped. The methodology was always the variable. The model stayed the same.
A Stanford, MIT, and NBER study of 5,179 customer support agents confirmed the pattern from the other direction: AI-assisted agents (not AI-replaced agents) saw a 14% average productivity increase, with novice agents gaining 34%.[15] The largest gains came from augmentation — using AI to surface relevant information and draft responses for human review — not from full automation.
---
How much does it actually cost to deploy a chatbot that works?
The per-interaction pricing is transparent: Intercom's Fin charges $0.99 per resolution, Zendesk's AI Agents charge $1.50 per resolution on committed plans ($2.00 pay-as-you-go), and Freshworks' Freddy AI starts at $0.10 per session.[36][37][38] These numbers are real, but they represent a fraction of the total cost.
The hidden costs are where organizations miscalculate. API integration with existing CRM, ticketing, and order management systems runs $5,000–$25,000. Custom development for domain-specific workflows runs $10,000–$100,000 or more. Knowledge base creation — structuring, writing, and curating the source material the chatbot retrieves from — costs $5,000–$50,000 depending on documentation complexity. And ongoing maintenance runs 15–25% of initial build costs annually.[1]
The pricing model itself is shifting beneath organizations' feet. Seat-based SaaS pricing dropped from 21% to 15% of software companies in 12 months, while hybrid pricing — combining platform fees with usage-based charges — surged from 27% to 41%.[39] This means chatbot costs are becoming less predictable, not more. Query volume spikes during product launches, outages, or seasonal peaks can multiply token costs dramatically.
The most expensive failure mode isn't overpaying for the platform. It's underpaying for the knowledge base. A Forrester Total Economic Impact study of Microsoft Dynamics 365 Customer Service found $14.70 million in benefits against $3.54 million in investment — a 315% ROI with payback in under six months.[40] But that study modeled a composite organization with mature data infrastructure. The 95% of organizations seeing zero return aren't lacking a good vendor. They're lacking the documentation, integration, and governance that the 5% invested in first.
The churn math closes the case. If 30% of customers who have a negative chatbot experience take their purchase elsewhere,[20] and 73% will switch after multiple bad experiences,[30] even a 5–10% increase in customer churn from chatbot friction can exceed all automation savings in a single quarter.
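The inversion is easy to demonstrate with placeholder numbers. Every input below is an assumption chosen for illustration — substitute your own volumes, per-interaction savings, and customer value:

```python
# When does chatbot-driven churn overtake automation savings?
# All inputs are illustrative assumptions, not figures from the sources.
automated_interactions = 10_000        # per quarter
saving_per_interaction = 7.00          # human cost minus bot cost, $
customers = 5_000
quarterly_value_per_customer = 250.00  # revenue at risk per churned customer, $
extra_churn = 0.07                     # 7% churn uplift from chatbot friction

savings = automated_interactions * saving_per_interaction
churn_loss = customers * extra_churn * quarterly_value_per_customer

print(f"automation savings: ${savings:,.0f}")    # $70,000
print(f"churn losses:       ${churn_loss:,.0f}")  # $87,500 -- savings inverted
```

With these (deliberately modest) assumptions, a 7% churn uplift erases the quarter's entire automation savings and then some.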
---
How do you fix chatbot customer service costs?
The evidence points to one conclusion: chatbot ROI is a knowledge infrastructure problem masquerading as an AI procurement problem. The model selection, prompt engineering, and vendor evaluation that consume most of the deployment timeline account for roughly 10% of the outcome variance.[12] The other 90% is determined by whether the chatbot has access to accurate, complete, well-structured information — and whether the system is designed to recognize the boundaries of what it knows.
This is the problem Tricky Wombat's pipeline is built to solve. Not by offering a better model, but by engineering the information layer that every model depends on.
1. Knowledge base audit and structuring
Most chatbot deployments begin with model selection. This is backwards. Sixty-one percent of customer service leaders already have a backlog of knowledge base articles to edit.[2] Deploying a chatbot on top of that backlog means the system retrieves outdated policies, contradictory instructions, and documentation gaps — then generates confident answers from them. Tricky Wombat's pipeline begins with a structured audit of existing documentation: identifying gaps, resolving contradictions, mapping content to actual customer query patterns, and formatting material for retrieval-augmented generation. The knowledge base is the product. The model is the delivery mechanism.
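A first pass at such an audit can be largely mechanical. The sketch below flags stale articles and customer query topics with no covering article; the data shapes, the 180-day staleness threshold, and the function name are illustrative assumptions, not Tricky Wombat's actual tooling:

```python
from datetime import date, timedelta

def audit(articles, query_topics, stale_after_days=180, today=None):
    """Minimal knowledge base audit: flag stale articles and customer
    query topics with no covering article.
    articles: {article_id: (topic, last_updated_date)}."""
    today = today or date.today()
    cutoff = today - timedelta(days=stale_after_days)
    stale = [a for a, (_, updated) in articles.items() if updated < cutoff]
    covered = {topic for topic, _ in articles.values()}
    gaps = [t for t in query_topics if t not in covered]
    return {"stale": stale, "gaps": gaps}

articles = {"kb-1": ("returns", date(2025, 1, 10)),
            "kb-2": ("billing", date(2023, 6, 1))}
report = audit(articles, ["returns", "billing", "bereavement"],
               today=date(2025, 3, 1))
print(report)  # {'stale': ['kb-2'], 'gaps': ['bereavement']}
```

The real work — resolving contradictions, mapping content to observed query patterns — needs human judgment, but even this mechanical pass surfaces the gaps a chatbot would otherwise paper over with fabrication.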
2. Retrieval architecture with source-grounded responses
RAG reduces hallucination — but only when retrieval quality is high. The foundational NeurIPS 2020 paper demonstrated that retrieval-augmented generation set state-of-the-art on open-domain question answering by grounding model outputs in retrieved documents.[41] The practical challenge is ensuring retrieval pulls the right documents. Tricky Wombat's pipeline implements source-grounded generation: every response is tied to a specific, citable knowledge base entry. When the system cannot retrieve a relevant source with sufficient confidence, it escalates rather than generates. This is the architectural difference between a chatbot that invents a bereavement fare policy and one that says "let me connect you with an agent who can help."
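The escalate-instead-of-generate behavior can be expressed as a thin routing layer in front of the model. Everything here — the retriever interface, the 0.75 confidence floor, the response shape — is a hedged sketch of the pattern, not a real pipeline:

```python
from dataclasses import dataclass

@dataclass
class Retrieved:
    doc_id: str   # citable knowledge base entry
    text: str
    score: float  # retrieval confidence, 0..1

CONFIDENCE_FLOOR = 0.75  # below this, escalate rather than generate

def answer(query, retriever, generator):
    """Source-grounded response: every answer cites the KB entry it came
    from; low-confidence retrievals route to a human, not to the model."""
    hits = retriever(query)
    best = max(hits, key=lambda h: h.score, default=None)
    if best is None or best.score < CONFIDENCE_FLOOR:
        return {"action": "escalate",
                "message": "Let me connect you with an agent who can help."}
    draft = generator(query, context=best.text)
    return {"action": "respond", "message": draft, "source": best.doc_id}

# Toy stand-ins for the retriever and generator:
kb = [Retrieved("policy/returns-001", "Returns accepted within 30 days.", 0.91)]
resp = answer("What is your return policy?",
              retriever=lambda q: kb,
              generator=lambda q, context: f"Per our policy: {context}")
print(resp["action"], "-", resp["source"])  # respond - policy/returns-001
```

The design choice that matters is the empty-retrieval branch: a query with no confidently matching source never reaches the generator, which is precisely the path that would have prevented the fabricated bereavement policy.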
3. Continuous monitoring and knowledge base maintenance
A knowledge base that's accurate at launch degrades immediately. Product lines change, policies update, edge cases emerge from real customer interactions. The kapa.ai analysis of 100+ technical teams found that knowledge base quality is the primary determinant of RAG system performance — not the model, not the prompt template, not the embedding strategy.[42] Tricky Wombat's pipeline includes automated monitoring for retrieval failures, flagging conversations where the system escalated or where customer satisfaction dropped, and routing those signals back to knowledge base curation. The feedback loop between customer interactions and source material is what separates systems that improve over time from systems that quietly degrade.
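The monitoring signal can be as simple as counting failure events per knowledge base article. A minimal sketch, with assumed record shapes and thresholds:

```python
from collections import Counter

def curation_queue(conversations, csat_floor=3, min_count=5):
    """Group monitored conversations by the KB article they retrieved and
    flag articles whose conversations keep escalating or scoring low CSAT.
    Each record: (kb_article_id, escalated: bool, csat: int 1..5)."""
    signals = Counter()
    for article_id, escalated, csat in conversations:
        if escalated or csat <= csat_floor:
            signals[article_id] += 1
    # Articles generating repeated failure signals go to the curation backlog.
    return [a for a, n in signals.most_common() if n >= min_count]

logs = [("billing/refunds", True, 2)] * 6 + [("returns/basics", False, 5)] * 6
print(curation_queue(logs))  # ['billing/refunds']
```

Routing this queue back to the people who maintain the documentation is what closes the loop that the negative-feedback scenario above leaves open.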
---
The bottom line
Air Canada's chatbot fabricated a bereavement policy. DPD's chatbot wrote poetry about how terrible DPD was. Chevrolet's chatbot sold a Tahoe for a dollar. NEDA's chatbot told people with eating disorders to count calories. Klarna's chatbot projected $40 million in profit improvement and then got partially replaced by the human agents it was supposed to make redundant. In every case, the AI model worked exactly as designed — it generated fluent, contextually responsive language. In every case, the information infrastructure failed — incomplete documentation, missing escalation paths, absent verification, no feedback loop between the chatbot's outputs and the organization's actual knowledge.
The organizations that will capture real cost savings from chatbots are the ones that treat their knowledge base as a product with its own roadmap, its own quality standards, and its own ongoing investment. BCG's 10-20-70 framework isn't a suggestion. It's a description of what the 5% who see returns actually do differently from the 95% who don't.
The race to deploy AI in customer service has a finish line that most companies haven't identified yet: it's not the day the chatbot goes live. It's the day your documentation is good enough that the chatbot can't get the answer wrong.
---
References

[1] Juniper Research, chatbot cost-savings estimates.
[2] Gartner, "Gartner Survey Reveals 85 Percent of Customer Service Leaders Will Explore or Pilot Customer-Facing Conversational GenAI in 2025," December 9, 2024. https://www.gartner.com/en/newsroom/press-releases/2024-12-09-gartner-survey-reveals-85-percent-of-customer-service-leaders-will-explore-or-pilot-customer-facing-conversational-genai-in-2025
[3] Gartner, "Gartner Survey Finds 64% of Customers Would Prefer That Companies Didn't Use AI for Customer Service," July 9, 2024. Survey of 5,728 customers. https://www.gartner.com/en/newsroom/press-releases/2024-07-09-gartner-survey-finds-64-percent-of-customers-would-prefer-that-companies-didnt-use-ai-for-customer-service
[4] Verint, "The State of Digital Customer Experience 2024," reported by CX Dive, August 20, 2024. https://www.customerexperiencedive.com/news/consumer-frustration-self-service-live-agent-ivr-chatbot/724620/
[5] Tidio (Jelisaveta Sapardic / Bart Turczynski), "80+ Chatbot Statistics You Should Know in 2026," updated February 26, 2026. https://www.tidio.com/blog/chatbot-statistics/
[6] Gartner, "Gartner Survey Finds Only 14% of Customer Service Issues Are Fully Resolved in Self-Service," August 19, 2024. https://www.gartner.com/en/newsroom/press-releases/2024-08-19-gartner-survey-finds-only-14-percent-of-customer-service-issues-are-fully-resolved-in-self-service
[7] Gartner, "Gartner Predicts 30 Percent of Generative AI Projects Will Be Abandoned After Proof of Concept by End of 2025," July 29, 2024. https://www.gartner.com/en/newsroom/press-releases/2024-07-29-gartner-predicts-30-percent-of-generative-ai-projects-will-be-abandoned-after-proof-of-concept-by-end-of-2025
[8] Gartner, "Why Half of GenAI Projects Fail: Avoid These 5 Common Mistakes," January 2026 (reporting on end-of-2025 data). https://www.gartner.com/en/articles/genai-project-failure
[9] Klarna, "Klarna AI Assistant Handles Two-Thirds of Customer Service Chats in Its First Month," press release, February 27, 2024. https://www.klarna.com/international/press/klarna-ai-assistant-handles-two-thirds-of-customer-service-chats-in-its-first-month/
[10] CX Dive, "Klarna reinvests in human talent for customer service," May 9, 2025. https://www.customerexperiencedive.com/news/klarna-reinvests-human-talent-customer-service-AI-chatbot/747586/
[11] Nishisako S, Higashi T, Wakao F, "Reducing Hallucinations and Trade-Offs in Responses in Generative AI Chatbots for Cancer Information: Development and Evaluation Study," *JMIR Cancer*, vol. 11, e70176, September 11, 2025. https://cancer.jmir.org/2025/1/e70176
[12] BCG, "Where's the Value in AI?" (Build for the Future 2024 Global Study), October 24, 2024. Survey of 1,000 CxOs and senior executives. https://media-publications.bcg.com/BCG-Wheres-the-Value-in-AI.pdf
[13] Gartner (Michael Rendelman), "Gartner Survey Reveals Only 8% of Customers Used a Chatbot During Their Most Recent Customer Service Interaction," June 15, 2023. Survey of 497 B2B and B2C customers. https://www.gartner.com/en/newsroom/press-releases/2023-06-15-gartner-survey-reveals-only-8-percent-of-customers-used-a-chatbot-during-their-most-recent-customer-service-interaction
[14] CBC News, "Air Canada chatbot gave a B.C. man the wrong information. Now the airline has to pay for it," February 2024. https://www.cbc.ca/news/canada/british-columbia/air-canada-chatbot-lawsuit-1.7116416
[15] Brynjolfsson E, Li D, Raymond LR, "Generative AI at Work," NBER Working Paper No. 31161, April 2023. Study of 5,179 agents. https://www.nber.org/papers/w31161
[16] *Moffatt v. Air Canada*, 2024 BCCRT 149. Civil Resolution Tribunal of British Columbia, February 14, 2024. https://www.canlii.org/en/commentary/doc/2025CanLIIDocs1963
[17] Zendesk, "Moving to automated resolutions from existing bot pricing plans," November 2024. https://support.zendesk.com/hc/en-us/articles/6931689272090-Moving-to-automated-resolutions-from-existing-bot-pricing-plans
[18] McKinsey, "The State of AI in Early 2024," May 30, 2024. Survey of 1,363 respondents. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-2024
[19] Magesh V, Surani F, Dahl M, Suzgun M, Manning C, Ho DE, "Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools," *Journal of Empirical Legal Studies*, 2025. https://law.stanford.edu/publications/hallucination-free-assessing-the-reliability-of-leading-ai-legal-research-tools/
[20] Forrester Consulting/Cyara, "Customers Aren't Mad, They're Just Disappointed," February 2023. Survey of 1,554 consumers. https://www.businesswire.com/news/home/20230201005218/en/New-Survey-Finds-Chatbots-Are-Still-Falling-Short-of-Consumer-Expectations
[21] Salesforce, "State of the AI Connected Customer," 7th Edition, October 31, 2024. https://www.salesforce.com/news/stories/ai-customer-research/
[22] Salesforce, "State of Service," 6th Edition, April 2024. Survey of 5,500+ professionals. https://www.salesforce.com/service/state-of-service-report/
[23] TIME, "AI Chatbot DPD Curses and Criticizes Company," January 2024. https://time.com/6564726/ai-chatbot-dpd-curses-criticizes-company/
[24] GM Authority, "GM Dealer Chat Bot Agrees To Sell 2024 Chevy Tahoe For $1," December 2023. https://gmauthority.com/blog/2023/12/gm-dealer-chat-bot-agrees-to-sell-2024-chevy-tahoe-for-1/
[25] NPR, "An eating disorders chatbot offered dieting advice, raising fears about AI in health," June 8, 2023. https://www.npr.org/sections/health-shots/2023/06/08/1180838096/an-eating-disorders-chatbot-offered-dieting-advice-raising-fears-about-ai-in-hea
[26] RAND Corporation, Ryseff J et al., "The Root Causes of Failure for Artificial Intelligence Projects and How They Can Succeed," RR-A2680-1, August 2024. https://www.rand.org/pubs/research_reports/RRA2680-1.html
[27] MIT Project NANDA / Challapally et al., "The GenAI Divide: State of AI in Business 2025," July 2025. Coverage: https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/
[28] Zhang RW, Liang X, Wu S, "When chatbots fail: exploring user coping following a chatbots-induced service failure," *Information Technology & People*, Vol. 37 No. 8, pp. 175–195, 2024. https://doi.org/10.1108/ITP-08-2023-0745
[29] Higher Logic, "An Exploratory Research Study: Customer Experience and Customer Self-Support," approximately 2020. Survey of 285 consumers. https://www.higherlogic.com/blog/customer-self-service-stats-2020/
[30] Zendesk, CX Trends benchmark data (approximately 2023). https://www.zendesk.com/blog/customer-service-statistics/
[31] Andrew Ng, "MLOps: From Model-centric to Data-centric AI," 2021 presentation. Steel defect detection example (~16.9% accuracy improvement from data-centric approach). Context: Strickland E, "Andrew Ng: Unbiggen AI," *IEEE Spectrum*, February 2022. https://spectrum.ieee.org/andrew-ng-data-centric-ai
[32] techUK, "Case Study: Kraken Tech's Generative AI Tool for Customer Service," 2024. https://www.techuk.org/resource/case-study-kraken-tech-s-generative-ai-tool-for-customer-service.html
[33] Yorkshire Post, "AI now doing work of 250 people three months after launch, Octopus Energy boss reveals," May 2023. https://www.yorkshirepost.co.uk/business/ai-now-doing-work-of-250-people-three-months-after-launch-octopus-energy-boss-reveals-4133923
[34] Gergely Orosz, "Klarna's AI chatbot: how revolutionary is it, really?" *The Pragmatic Engineer*, February 29, 2024. https://blog.pragmaticengineer.com/klarnas-ai-chatbot/
[35] Shreshth Sharma, "Decoding Klarna's $40 Million AI Secret: 3 Steps to Enterprise AI Success," *INFORMS Analytics Magazine*, 2024. https://pubsonline.informs.org/do/10.1287/LYTX.2024.04.10/full/
[36] Intercom, "Fin AI Agent Resolutions," Intercom Help Center. $0.99 per resolution. https://www.intercom.com/help/en/articles/8205718-fin-ai-agent-resolutions
[37] Zendesk, "Zendesk First in CX Industry to Offer Outcome-Based Pricing for AI Agents." $1.50/resolution committed; $2.00 pay-as-you-go. https://www.zendesk.com/newsroom/articles/zendesk-outcome-based-pricing/
[38] Freshworks, Freshdesk pricing. $100 per 1,000 sessions ($0.10/session). https://www.freshworks.com/freshdesk/pricing/
[39] Kyle Poyar, "The state of B2B monetization in 2025," *Growth Unhinged*, 2025. Data from 240+ software/AI companies. https://www.growthunhinged.com/p/2025-state-of-b2b-monetization
[40] Forrester Consulting, "The Total Economic Impact™ of Microsoft Dynamics 365 Customer Service," March 2024. Vendor-commissioned study; composite organization model. https://www.microsoft.com/en-us/dynamics-365/blog/business-leader/2024/03/27/forrester-tei-study-shows-315-roi-when-modernizing-customer-service-with-microsoft-dynamics-365-customer-service/
[41] Lewis P et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," NeurIPS 2020. https://arxiv.org/abs/2005.11401
[42] Emil Sorensen, "RAG Best Practices: Lessons from 100+ Technical Teams," kapa.ai blog, November 11, 2024. https://www.kapa.ai/blog/rag-best-practices
By Tricky Wombat
Last Updated: Mar 29, 2026