How people use AI matters more than which AI they use

Evidence that design and implementation decisions determine whether AI investments return 2% or 300%

When MeasuringU tested ChatGPT, Claude, and Gemini with 153 users in 2025, the products earned different usability scores: Gemini scored an A on the System Usability Scale, Claude a B+, and ChatGPT a B. Across all three, ease-of-use ratings exceeded usefulness ratings at a statistically significant level. Users did not differentiate these products on capability. They differentiated on design.

Meanwhile, AI projects fail at a rate of 80%, more than double the rate of conventional IT projects.[1] The thesis is straightforward: the design of human-AI interaction largely determines whether AI systems succeed or fail in adoption, usage, and business impact.

Key Points

  • The same LLM delivering identical health information through text, speech, and embodied interfaces produced significantly different trust levels in a 2025 within-subjects study. Trust in the interface correlated with trust in the information itself.[1]

Lessons Learned

  • Measure conversation quality and error recovery design, not just task completion rates. Research shows these drive satisfaction more than raw functional capability.[1]

Why do identical AI models produce different outcomes?

Human-centered AI design is the practice of shaping how people encounter, interpret, trust, and collaborate with AI systems. It spans conversation structure, error recovery, interface metaphor, transparency, and escalation paths. It is the layer between the model's capability and the user's experience. And it is where most AI investments break down.

The conversational AI market hit $14.79 billion in 2025 and is projected to reach $82.46 billion by 2034 at a 21% compound annual growth rate.[1] That growth masks a deeper problem. Global trust in AI dropped from 61% to 53% between 2019 and 2024. In the United States, the decline was steeper: from 50% to 35%.[1] Investment is accelerating. Trust is eroding. The gap between what AI can do and what users will accept is widening.

The pattern is consistent across industries, geographies, and organization sizes. The model is rarely the bottleneck. The interaction is.

[Figure: SUS and UX-Lite scores for ChatGPT, Claude, and Gemini. Gemini leads on both metrics, and ease-of-use outscores usefulness across all three products. Three frontier LLMs, three different usability scores: users differentiate on design, not capability. Credit: MeasuringU, Perceived Usefulness and Ease.]

What do the measurements show?

A 2022 randomized experiment published in the International Journal of Human-Computer Studies tested how interaction design choices change user perception of the same chatbot technology. Researchers at SINTEF and OsloMet found that a topic-led conversation design strengthened users' sense of anthropomorphism and hedonic quality, while offering buttons (instead of free text) strengthened both pragmatic and hedonic user experience.[1] The design frame, not the technology underneath, shifted how users perceived the interaction.

This pattern scales. The MeasuringU benchmark found that across all three frontier chat products, ease-of-use scores exceeded usefulness scores at a statistically significant level. The implication is blunt: users are not evaluating these tools primarily on what the AI can do. They are evaluating how easily they can get the AI to do it.

Research presented at ICAAI 2024 examined what drives chatbot satisfaction using a framework adapted from service quality theory. The findings suggest that conversational quality and recovery quality, both design-layer decisions, are the primary drivers of user satisfaction. Core service quality (the AI's raw functional capability) operates as a hygiene factor. It prevents dissatisfaction but does not generate satisfaction on its own.[1] Getting the model right is table stakes. Getting the conversation right is what moves the needle.

What are practitioners reporting?

The practitioner data converges on the same point from the opposite direction: when design is poor, users reject even capable systems.

A 2023 Gartner survey of 497 customers found that only 8% had used a chatbot during their most recent customer service interaction. Of those who did, only 25% said they would use one again. The chatbot resolved returns 58% of the time but billing issues only 17% of the time.[1] A year later, Gartner surveyed 5,728 customers and found that 64% would prefer companies not use AI for customer service, and 53% said they would consider switching to a competitor if they learned a company was going to use AI for service.[1]

At the same time, 85% of customer service leaders plan to explore or pilot customer-facing conversational generative AI in 2025.[1] The gap between leadership enthusiasm and customer resistance is a design gap. The technology works. The experience does not.

A 2024 Deloitte Connected Consumer Survey of 3,857 respondents quantified the mechanism. When companies provide clear AI policies, 52% of users report high or very high trust and only 10% report low trust. When policies are unclear, the numbers invert: 5% high trust, 52% low trust.[1] Only 26% of consumers feel their providers supply clear policies. Transparency is a design decision. Most organizations are not making it.

What does this look like inside real organizations?

The aggregate data tells you the pattern. The case studies tell you the mechanism. Four organizations with different industries, geographies, and scales reveal how the interaction design layer determines outcomes regardless of the model powering the system.

BMO Assist: when the AI understands but the conversation fails

BMO, Canada's fourth-largest bank, launched BMO Assist on Amazon Lex. In its first year, the chatbot served 760,000 users across 1.08 million sessions.[1] The model performed as expected. The conversation design did not. More than half of user queries were misclassified, not because the natural language understanding failed, but because the conversation lacked adequate fallback logic. A content strategist designed a "fall-forward" pattern: instead of dead-ending on misclassified intents, the system acknowledged the gap and redirected users toward the nearest productive path.[1] The fix was not a model upgrade. It was a conversation architecture decision that treated failure as a design problem rather than a technology limitation.
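
To make the pattern concrete, here is a minimal sketch of what a fall-forward fallback might look like. The intent labels, confidence threshold, and path mapping are illustrative assumptions, not BMO's actual implementation.

```python
# Illustrative "fall-forward" fallback: instead of dead-ending on a
# low-confidence intent, acknowledge the gap and redirect toward the
# nearest productive paths. All names and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class IntentResult:
    intent: str          # best-guess intent label from the NLU layer
    confidence: float    # classifier confidence, 0.0-1.0

CONFIDENCE_FLOOR = 0.6   # assumed threshold; tune against real transcripts

# Hypothetical mapping from shaky intents to adjacent, likely-useful paths.
NEAREST_PATHS = {
    "account_balance": ["recent_transactions", "account_summary"],
    "card_lost": ["card_freeze", "card_replacement"],
}

def respond(result: IntentResult) -> str:
    if result.confidence >= CONFIDENCE_FLOOR:
        return f"Routing you to {result.intent}."
    # Fall forward: admit uncertainty, then redirect instead of dead-ending.
    options = NEAREST_PATHS.get(result.intent, ["talk_to_an_agent"])
    choices = " or ".join(opt.replace("_", " ") for opt in options)
    return (f"I may have misunderstood. Were you looking for {choices}? "
            "You can also rephrase, or I can connect you with a person.")

print(respond(IntentResult("card_lost", 0.41)))
```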

Vodafone SuperTOBi: €140 million spent on the layer above the model

Vodafone's original TOBi chatbot, built on IBM Watson and Microsoft LUIS, understood 90% of incoming customer queries.[1] By any model-capability metric, it was working. By customer experience metrics, it was not. First-contact resolution sat at 15% in Portugal. Vodafone invested €140 million in a customer experience transformation that included rebuilding TOBi as SuperTOBi on Azure OpenAI, but the investment was not primarily about the model swap. The redesign centered on full-sentence conversational responses (replacing menu trees), seamless human handoff protocols, and comprehensive response generation that addressed the full scope of a customer's problem.[1] In Portugal, first-contact resolution jumped from 15% to 60%. NPS improved by 14 points, reaching 64. The system now operates across 13 countries. The original model understood the questions. The redesigned interaction actually answered them.

Booking.com: speed to market, then the hard work of scope

Booking.com built its AI Trip Planner in 10 weeks using OpenAI's technology, shifting the core search experience from keyword-based queries to conversational planning.[1] The speed was impressive. The harder design work followed. A 2024 survey found 41% of travelers expressed interest in AI-generated itineraries.[1] But the company also deployed separate AI tools for its partner ecosystem: Smart Messenger and Auto-Reply features for property owners, which produced a 73% increase in partner satisfaction.[1] The analytical lesson is about scope discipline. The Trip Planner, the Property Q&A system, and the partner messaging tools serve different users with different needs. Each required distinct conversation design, not a single chatbot bolted onto everything. Booking.com treated each use case as its own design problem.

What patterns emerge across these cases?

Three patterns recur. First, model comprehension is not the bottleneck. BMO Assist's NLU worked. Vodafone's TOBi understood 90% of queries. The failures occurred in the conversation layer above the model: fallback logic, response completeness, escalation design. Second, design investment scales. Vodafone's €140 million customer experience transformation turned a system that understood customers into one that actually served them. Booking.com's 10-week prototype worked, but the enduring value came from scoping distinct design solutions for distinct user types. Third, the metrics that matter are interaction metrics, not model metrics. Resolution rate, NPS, find-rate, partner satisfaction. None of these measure model capability. All of them measure design quality.

An InformationWeek study found that 48% of chat technology does not accurately solve issues, 38% is time-consuming and fails to self-learn, and only 6% of IT leaders believe chatbots are effective for self-service.[1] The technology is not broken. The implementations are.

What happens at the organizational level when design is treated as optional?

The gap between organizations that invest in human-AI interaction design and those that do not is measurable and growing. The evidence shows this is not a matter of preference or maturity. It is a structural variable that predicts financial performance.

What do users and practitioners consistently report?

The signal from users is unambiguous. A 2024 survey found that 64% of customers experiencing chatbot interactions with complex issues report the chatbot could not handle them. 53% found humans more thorough, and 52% found human interactions less frustrating.[1] The Gartner data showing 64% of customers prefer no AI in customer service and 53% would consider switching to a competitor lands harder when you realize the technology itself often works fine. Users are rejecting the experience, not the capability.

The practitioner view is more nuanced. A 2024 factorial experiment with 257 participants tested how failure timing affects chatbot trust. Encountering a failure early in the interaction damages trust more than encountering it late. But trust recovers after successful task completion, a "bounce-back" effect the researchers documented.[1] The design implication: error recovery is not a nice-to-have. It is the mechanism that determines whether a failed interaction destroys or preserves the relationship. And most conversational AI implementations treat error states as edge cases rather than primary design surfaces.

What drives the gap between strong and weak outcomes?

McKinsey's 2025 State of AI survey found that two-thirds of companies have not begun scaling AI.[1] Among the 25 attributes tested for their contribution to meaningful business impact, workflow redesign had the biggest effect.[1] The organizations achieving top-quartile AI outcomes are not running better models. They are redesigning the workflows, interfaces, and human processes around those models.

The McKinsey design-value study, conducted over five years across 300 publicly listed companies, found top-quartile design organizations grew revenues at nearly twice the rate of industry peers, with 32 percentage points higher revenue growth and 56 percentage points higher total returns to shareholders.[1] Connect those findings to AI, and the math becomes stark. If you invest in model capability without investing in design, you are optimizing the wrong variable.

The Deloitte 2026 Global Human Capital Trends survey of over 9,000 respondents across 89 countries found that 66% of C-suite leaders recognize the importance of designing human-AI interactions.[1] Only 6% say they are leading on this practice. Organizations that do act on this are nearly 2.5× more likely to report better financial results.[1] The knowing is not the problem. The doing is.

What is actually causing AI systems to fail?

The default diagnosis for AI adoption failure is a capability problem. The model is not good enough. The training data is insufficient. The feature set is incomplete. The evidence points in a different direction.

RAND Corporation's 2024 analysis, based on 65 interviews with data scientists and engineers, found the number-one cause of AI project failure is misunderstanding or miscommunicating the problem the AI needs to solve. The number-three cause is bias toward the latest technology rather than solving real problems for intended users.[1] Both are design and human factors failures. The 80% failure rate for AI projects, more than double the rate for conventional IT, is not a technology story. It is a design story.

McKinsey's March 2026 research made the diagnosis explicit: "This is not an AI capability problem. It's an experience problem: We're stuck using search bars and chat boxes bolted onto interaction paradigms designed for a pre-AI era."[1] They identified four specific breakdowns preventing AI from scaling, and all four are interaction design problems: intent ambiguity (the system cannot clarify what the user actually needs), context gaps (the system loses track of the conversation), generic outputs (responses are technically correct but not useful for the specific situation), and non-collaborative iteration (users cannot refine results through dialogue).[1] The consequence: "Users oscillate between accepting uncritically or abandoning."[1]

The binding constraint on AI system success is the interaction design layer: the decisions about how humans encounter, interpret, trust, and collaborate with the AI's output.

NNGroup's research across 425 AI conversations identified six distinct conversation types, each requiring different interface patterns.[1] Their analysis found that companies are repeating a pattern from the Flash era: bolting a single chatbot interface onto every use case without evaluating whether the interface matches the user's actual task.[1] As one NNGroup article stated directly: "Adding AI does not magically create value."[1] A separate analysis published on a personal channel put it more bluntly: UX professionals are "trapped in complacency" about the scale of change AI requires.[1]

The same LLM, delivering identical health information through a text interface, a voice interface, and an embodied interface, produced significantly different trust levels in a 2025 within-subjects study of 20 participants. Users trusted the text-based interface most. The correlation between trust in the interface and trust in the information was statistically significant.[1] The information did not change. The frame around it did. And the frame determined whether users believed what the AI told them.

What does a successful implementation look like?

The organizations producing the strongest AI outcomes share one trait: they invested in the design layer as the primary variable, treating the model as infrastructure rather than product.

Woebot Health built a therapeutic chatbot that achieved a bond subscale score of 3.8 out of 5.0 in a study of 36,070 users, comparable to group cognitive behavioral therapy (3.8) and approaching face-to-face therapy (4.0).[1] A 2017 randomized controlled trial with 70 participants showed significant reduction in depression symptoms (PHQ-9, p=.01) with 83% retention at two weeks.[1] The most revealing finding came later: a 2025 RCT with 160 participants compared an LLM-augmented version against the original rules-based version and found no significant between-group differences in clinical outcomes.[1] The study was not powered to detect between-group differences, an important caveat. But the directional finding is striking. The conversation design architecture, refined over years of clinical iteration, produced equivalent therapeutic outcomes regardless of whether a rules engine or a large language model powered the responses. The model did not change the experience. The design was the experience.

DBS Bank in Singapore generated S$370 million in AI economic value in 2023, up from S$180 million the prior year.[1] Their AI chatbot, DBS Joy, handles over 120,000 chats and produced a 23% improvement in customer satisfaction. The bank runs 45 million personalized customer interactions per month across its ecosystem.[1] This did not happen by deploying a better model. DBS built a 700-person Data Chapter, deployed over 800 AI models across 350 use cases, and embedded its "Managing through Journeys" design framework into every customer touchpoint.[1] A business school published a case study (Case 625-053) on the approach. The memorable detail: the bank treats AI design as a discipline with dedicated career paths, not a project with a ship date.

Nubank serves 114 million customers across Latin America and achieves an NPS of 90, according to industry analyses, compared to 40-60 for traditional banking incumbents.[1] Their AI assistant, built on GPT-4o (the same model available to every competitor), resolves up to 50% of Tier-1 issues autonomously and reduced response times by 70%.[1] The design differentiator is a five-interaction escalation ceiling: if the AI cannot resolve the issue within five exchanges, it hands off to a human. This is a design decision, not a model capability. Nubank's evaluation framework measures empathy alongside accuracy, treating conversational quality as a first-class metric.[1] Their customer acquisition cost of approximately $15, according to industry analyses, reflects what happens when AI design converts capability into experience at scale.
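
The escalation ceiling is simple enough to express in a few lines. The sketch below shows one way such a ceiling might work; the session structure and the stubbed model call are hypothetical, not Nubank's system.

```python
# Illustrative escalation ceiling: hand off to a human once the AI has
# used its interaction budget without resolving the issue. The ceiling
# of five mirrors the pattern described above; everything else is a
# hypothetical sketch.
MAX_AI_EXCHANGES = 5

def ai_reply(message: str) -> tuple[str, bool]:
    # Stand-in for the model call; returns (reply, resolved_flag).
    return ("Let me look into that.", False)

class Session:
    def __init__(self):
        self.exchanges = 0
        self.escalated = False

    def handle(self, user_message: str) -> str:
        if self.escalated:
            return "You're with a human agent now."
        self.exchanges += 1
        reply, resolved = ai_reply(user_message)
        if resolved:
            return reply
        if self.exchanges >= MAX_AI_EXCHANGES:
            self.escalated = True
            return "I haven't solved this yet, so I'm connecting you with a person."
        return reply

session = Session()
for msg in ["my card won't work"] * 6:
    print(session.handle(msg))   # escalates on the fifth unresolved exchange
```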

What are the financial consequences of getting AI design right or wrong?

The economics run in both directions. Getting interaction design wrong is expensive in ways that do not appear on a model-selection budget line. Getting it right produces returns that dwarf the cost of the design investment.

S&P Global's 2025 survey of 1,006 respondents found 42% of organizations abandoned most of their AI initiatives, up from 17% the prior year.[1] Gartner projects over 40% of agentic AI projects will be canceled by the end of 2027.[1] Each abandoned project carries sunk costs in compute, integration, data preparation, and organizational attention. When 80% of AI projects fail at double the rate of conventional IT projects, the cumulative waste across an organization is not a rounding error.[1] It is a strategic drag.

The return side is equally measurable. Forrester's Total Economic Impact study of IBM's Enterprise Design Thinking practice documented 301% ROI and $36.3 million in net present value over three years. Defects fell by more than 50%. Projects moved 2× faster to market.[1] A separate Forrester TEI study of PolyAI's conversational AI platform found 391% ROI, with $14.2 million in benefits against $2.9 million in costs. Call abandonment dropped 50%.[1] Boost.ai documented 293% ROI, LivePerson 191%, and Rasa 181%.[1] All used Forrester's standardized Total Economic Impact methodology. All are vendor-commissioned studies, which means the methodology is consistent but the sponsoring interest should be noted.

The common thread across every positive ROI case: the organizations did not just deploy a model. They designed the interaction around it. Post-launch repair costs run 3-4× higher than the cost of getting the design right up front.[1] The economics argue for front-loading design investment, not for treating it as polish after the model ships.

How do you fix AI interaction design?

The evidence converges on one claim: the design of the interaction layer, not the selection of the model, is the primary variable that determines AI system outcomes. Frameworks exist for organizations ready to act. Google's People + AI Guidebook provides design patterns used across 200+ countries.[1] Microsoft Research's Human-AI Interaction guidelines, validated with 49 design practitioners across 20 AI products, distill 20+ years of research into 18 evidence-based design principles.[1] The Rasa CALM architecture separates the language understanding layer from the business logic layer, reducing journey build time by 80% and enabling designers to control conversation flows independently of model behavior.[1] The EU AI Act's Article 50 transparency obligations, effective August 2, 2026, with fines up to €15 million or 3% of worldwide turnover, will make interaction design a compliance requirement, not just a user experience preference.[1]

At Tricky Wombat, we build the interaction layer for AI-powered knowledge systems. The problem is not the model. The problem is the pipeline between the model and the user. Three things must be right.

1. Context architecture before conversation design

Most systems dump a user's question into an LLM with minimal context and hope the model figures out intent. This produces McKinsey's first two breakdowns: intent ambiguity and context gaps. The correct approach structures the context pipeline so the AI receives the user's query alongside relevant prior interactions, domain-specific knowledge, and organizational rules before generating a response. The context architecture determines the quality ceiling. The model cannot exceed what the pipeline gives it.
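
A minimal sketch of such a pipeline follows, assuming a retrieval step and a fixed assembly order. The function names, the retrieval stub, and the ten-turn history window are illustrative choices, not a specific product's API.

```python
# Minimal context-pipeline sketch: the model sees the query plus prior
# interactions, retrieved domain knowledge, and organizational rules,
# assembled in a fixed order before generation. All names are hypothetical.
def build_context(query: str, history: list[str],
                  retrieve_docs, org_rules: str) -> str:
    docs = retrieve_docs(query)  # e.g., a vector-store lookup, stubbed below
    return "\n\n".join([
        f"Organizational rules:\n{org_rules}",
        "Relevant knowledge:\n" + "\n".join(docs),
        "Conversation so far:\n" + "\n".join(history[-10:]),  # bounded window
        f"User question:\n{query}",
    ])

prompt = build_context(
    query="Can I change my billing date?",
    history=["user: hi", "bot: Hello! How can I help?"],
    retrieve_docs=lambda q: ["Billing dates can be moved once per cycle."],
    org_rules="Never quote fees without citing the current fee schedule.",
)
print(prompt)
```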

2. Conversation design as a first-class discipline

Most implementations treat the conversation interface as a wrapper around API calls. Fallback states get an error message. Escalation happens when the system crashes. Recovery is not designed at all. The correct approach treats every failure state as a design surface, with explicit fallback logic (like BMO's "fall-forward" pattern), escalation ceilings (like Nubank's five-interaction maximum), and response completeness checks (like Vodafone's shift from menu trees to full-sentence answers). The conversation design is the product. Everything else is infrastructure.
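
The fall-forward and escalation patterns were sketched above; a response completeness check can be sketched too. The heuristic below is a deliberately naive stand-in (real systems would likely use an LLM-based judge), and none of it represents Vodafone's implementation.

```python
# Illustrative completeness check: before sending a reply, verify it
# addresses every distinct question the user asked; otherwise route it
# back for another generation pass. Naive keyword heuristic, for
# illustration only.
import re

def questions_in(text: str) -> list[str]:
    # Naive split: treat each '?'-terminated clause as a question.
    return [q.strip() for q in re.findall(r"[^.?!]*\?", text)]

def is_complete(user_message: str, draft_reply: str) -> bool:
    for q in questions_in(user_message):
        # Crude proxy: require at least one keyword from each question
        # to appear in the draft reply.
        keywords = re.findall(r"\w{4,}", q.lower())
        if keywords and not any(k in draft_reply.lower() for k in keywords):
            return False
    return True

msg = "Why did my bill go up? And can I change my plan?"
draft = "Your bill increased because the promo ended."
print(is_complete(msg, draft))  # False: the plan-change question went unanswered
```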

3. Continuous measurement against interaction metrics

Most organizations measure model accuracy, latency, and cost. They do not measure conversation quality, recovery effectiveness, or trust. The correct approach builds a measurement layer that tracks the metrics the research shows actually drive outcomes: first-contact resolution, conversation completion rate, escalation frequency, and user-reported trust. Microsoft's IDEAS team evaluates LLM-based chatbots across four categories (task completion, intelligence, relevance, and hallucination) using 240 petabytes of interaction data across 350+ product surfaces.[1] The measurement system must match the precision of the model evaluation. When it does, the system improves over time because you are optimizing for the right variable.
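
As a sketch of what that measurement layer might record, the snippet below logs per-session outcomes and rolls them up into the interaction metrics named above. The field names and in-memory store are illustrative assumptions, not Microsoft's or anyone's production schema.

```python
# Sketch of an interaction-metrics layer: log per-session outcomes and
# aggregate them into the metrics the research points at. All field
# names are hypothetical.
from dataclasses import dataclass

@dataclass
class SessionRecord:
    resolved_first_contact: bool
    completed: bool           # conversation reached a natural end
    escalated: bool           # handed off to a human
    trust_rating: int | None  # optional post-chat survey, 1-5

def rollup(records: list[SessionRecord]) -> dict[str, float]:
    n = len(records)
    rated = [r.trust_rating for r in records if r.trust_rating is not None]
    return {
        "first_contact_resolution": sum(r.resolved_first_contact for r in records) / n,
        "completion_rate": sum(r.completed for r in records) / n,
        "escalation_rate": sum(r.escalated for r in records) / n,
        "mean_trust": sum(rated) / len(rated) if rated else float("nan"),
    }

print(rollup([
    SessionRecord(True, True, False, 5),
    SessionRecord(False, False, True, 2),
]))
```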

The bottom line

The same models power the best and worst AI experiences on the market. The same GPT-4o runs a banking chatbot with NPS of 90 and a customer service bot that 64% of users wish did not exist. The same generation of LLMs produces usability scores that range from B to A depending on nothing more than design decisions made by the team that built the interface.

The organizations producing meaningful AI returns, DBS with S$370 million in value, Vodafone with a 4× improvement in resolution rate, Nubank with customer acquisition costs that incumbents cannot match, did not win on model selection. They won on interaction design: conversation structure, escalation logic, error recovery, transparency, and the relentless measurement of whether the experience works for the human on the other end.

The 66% of C-suite leaders who recognize this and the 6% acting on it define the gap where competitive advantage lives for the next decade. Every month you spend optimizing the model while neglecting the interaction layer is a month your competitors use to design the experience that makes your model irrelevant.

References (1)

By Tricky Wombat

Last Updated: Apr 3, 2026