Context engineering
How the emerging practice of engineering what your AI sees explains why 95% of companies fail to generate value from their AI investments

The gap between AI spending and AI results is not a model problem. An 11-billion-parameter model with well-engineered context outperforms a 540-billion-parameter model without it.[1] A single change to how documents are chunked before retrieval produces a 74-percentage-point accuracy swing on an identical model.[2] The binding constraint on every deployed AI system is the quality, structure, and freshness of the information it sees before it generates a single token. The emerging discipline that addresses this constraint has a name: context engineering.
Lessons Learned
Invest in context pipeline quality before upgrading models. The research shows that well-engineered context lets a model outperform a competitor 50 times its size.[1]
Why are most AI deployments failing despite massive investment?
In June 2025, Shopify CEO Tobi Lütke posted a tweet that collected 1.9 million views and 8,600 bookmarks. It said: "I really like the term 'context engineering' over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM."[19] Within a week, former OpenAI researcher Andrej Karpathy endorsed the framing, adding that the LLM is the CPU and the context window is its RAM.[17] Within a month, Anthropic, LangChain, and dozens of engineering teams had published implementation guides. A Manning textbook entered production.[20]
The term gained traction so quickly because it named something practitioners already knew: the model is not the bottleneck. The information environment you construct around the model determines whether it produces reliable output or confident nonsense.
Context engineering is the discipline of designing and building dynamic systems that provide the right information and tools, in the right format, at the right time, to an AI model.[21] It encompasses retrieval-augmented generation, system prompt design, tool selection, memory management, conversation history curation, and structured output constraints. Where prompt engineering treats the input as a static string to optimize, context engineering treats it as a system to architect.

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026
How do you measure whether an AI system has a context problem?
The symptoms are specific and quantifiable. Chroma Research evaluated 18 large language models, including GPT-4.1, Claude 4, and Gemini 2.5, and found that model performance degrades in "surprising and non-uniform" ways as context length increases, even on simple tasks.[22] The standard industry benchmark for long-context performance, Needle-in-a-Haystack, gives false confidence because it tests only lexical retrieval, not the semantic reasoning that real applications require.[22]
The degradation is measurable. Databricks Mosaic Research found that 11 out of 12 frontier models drop below 50% accuracy on reasoning tasks when context exceeds 32,000 tokens.[23] Drew Breunig's taxonomy of context failures categorizes the mechanisms: context poisoning (bad information), context distraction (irrelevant information), context confusion (contradictory information), and context clash (incompatible formatting).[24] Every failure mode traces to what the model was given, not what the model was capable of.
The most precise measurement comes from a 2025 clinical study published in Bioengineering through PMC/NIH. Researchers tested the same Gemini 1.0 Pro model with the same dataset and the same embedding model, changing only the document chunking strategy. Fixed-token chunking produced the lowest accuracy. Proposition chunking and semantic chunking produced intermediate results. Adaptive chunking produced the highest accuracy at 87%. Same model. Same data. Dramatic accuracy swing from a single pipeline decision.[2]
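To make that pipeline decision concrete, here is a minimal sketch contrasting fixed-token chunking with sentence-aware chunking. The study's adaptive strategy is more sophisticated than this, and the function names and sample document here are purely illustrative, but even in toy form the structural difference is visible: fixed splits sever facts mid-sentence.

```python
import re

def fixed_token_chunks(text, size=8):
    """Split on a fixed word count, ignoring sentence boundaries."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def sentence_chunks(text, max_words=20):
    """Greedily pack whole sentences into chunks, never splitting one."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for s in sentences:
        if current and len(" ".join(current + [s]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(s)
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = ("The policy covers bereavement fares. Refunds require a death certificate. "
       "Claims must be filed within 90 days. Partial refunds are not available.")

# Fixed-token chunking cuts mid-sentence here, stranding "90" away from "days";
# sentence-aware chunking keeps each complete statement retrievable as a unit.
```

A retriever working on the fixed-token chunks can return "filed within 90" without the unit it belongs to, which is exactly the class of error the clinical study measured.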
Why is the context problem getting worse?
Three forces are compounding simultaneously. First, context windows are growing faster than organizations' ability to curate what goes into them. Gemini supports over 1 million tokens. GPT-4 Turbo handles 128,000. The intuition that "more context is better" is empirically wrong. Anthropic's engineering team describes attention as an n² pairwise relationship: doubling the context quadruples the number of token-to-token relationships the model must resolve, while the useful information rarely grows at all.[25]
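The arithmetic behind that n² claim is simple enough to state directly:

```python
def attention_pairs(n_tokens):
    # Self-attention relates every token to every other token,
    # so pairwise relationships grow with the square of context length.
    return n_tokens ** 2

# Doubling a 32k-token context to 64k quadruples the relationships
# the model must weigh; the useful signal rarely grows with it.
growth = attention_pairs(64_000) / attention_pairs(32_000)
```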
Second, agent architectures are multiplying the problem. Agentic RAG, declared "the new baseline" at NeurIPS 2025,[26] means AI systems now autonomously decide what to retrieve, when to retrieve it, and how much to include. Every tool definition added to an agent's context window consumes tokens. StackOne documented a single MCP server configuration consuming 1.17 million tokens, a phenomenon they called "agent suicide by context."[27]
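A context-budget check of the kind StackOne's finding argues for can be sketched in a few lines. The four-characters-per-token heuristic, the 25% budget, and every name below are illustrative assumptions, not any MCP library's actual API:

```python
def approx_tokens(text):
    # Rough heuristic: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

def tool_budget_report(tool_schemas, window=128_000, max_tool_share=0.25):
    """Flag configurations whose tool definitions crowd out useful context.

    tool_schemas: mapping of tool name -> JSON schema text.
    """
    used = {name: approx_tokens(schema) for name, schema in tool_schemas.items()}
    total = sum(used.values())
    return {
        "tool_tokens": total,
        "share_of_window": total / window,
        "over_budget": total / window > max_tool_share,
        "largest_tool": max(used, key=used.get) if used else None,
    }
```

Running a report like this at configuration time, before any query is served, is the cheap way to catch an agent whose tool definitions have already eaten the window.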
Third, organizations are scaling AI from pilots to production without scaling their information infrastructure. Gartner found that 30–60% of generative AI projects are abandoned after proof of concept, most commonly because the underlying data cannot support production-quality retrieval.[28] S&P Global's 2025 survey found 42% of companies scrapped most AI initiatives that year, up from 17% in 2024.[6]
What do the people building these systems actually say?
The practitioners are blunt. In Databricks' community survey, the top two deployment challenges reported by RAG builders were "My retriever is returning irrelevant documents" and "I don't trust the quality of the content produced by the RAG app."[29] Not model complaints. Retrieval complaints.
Qodo's 2025 State of AI Code Quality survey found that 44% of developers who said AI degraded their code quality blamed missing context as the primary cause. Among developers who manually selected context for the AI, 54% said the model still missed relevant information.[30] Senior engineers reported the biggest quality gains from AI (60%) but the lowest confidence in shipping AI-generated code (22%).[30] They know enough to see what the context pipeline is getting wrong.
The Anaconda 2020 State of Data Science survey of 2,360 practitioners found that data scientists spend 45% of their time on data preparation — loading, cleaning, and structuring information — and only 11% on model selection and building.[8] The time allocation tells you where the bottleneck lives. The strategy conversations do not.
What happens when context engineering fails?
The consequences are not theoretical. They are public, measurable, and increasingly expensive. The most prominent AI failures of 2024–2025 trace to the same root cause: the model received bad, incomplete, or stale context and did exactly what language models do — generated a confident, fluent response from whatever it was given.
How did New York City's official chatbot advise business owners to break the law?
New York City launched an AI chatbot on its official NYC.gov portal to help small business owners navigate regulations. The Markup investigated and found the chatbot repeatedly gave illegal advice: telling landlords they could discriminate based on income source, telling business owners they did not need to accept cash, telling restaurant operators they could fire employees who complained about safety violations.[13]
Every piece of advice violated existing New York City law. The chatbot's knowledge base contained incomplete regulatory documents, and the retrieval pipeline had no mechanism to verify legal accuracy or flag contradictions with current statutes.[13] The city initially responded by adding a disclaimer: "Please make sure to verify with the appropriate agency."[13] They treated it as a warning label problem. It was an infrastructure problem.
How did a Chevrolet dealership's chatbot agree to sell a $76,000 vehicle for one dollar?
A Chevrolet dealership in Watsonville, California deployed an AI chatbot powered by ChatGPT. A user asked the chatbot to agree to sell a 2024 Chevy Tahoe for one dollar. The chatbot responded: "That's a deal, and that's legally binding!"[14]
The system prompt contained no pricing constraints, no scope limitations, and no escalation rules. The chatbot had been deployed with an open-ended instruction to "be helpful" and no guardrails on what commitments it could make.[14] Within days, users had also gotten it to generate Python code and agree to transactions far outside the dealership's business.
What does the pattern across these failures reveal?
Google's AI Overviews launched in May 2024 and told users to add glue to pizza sauce (to help cheese stick) and that geologists recommend eating one small rock per day for vitamins and minerals.[17] MIT Technology Review's analysis attributed these failures explicitly to the retrieval pipeline: AI Overviews pulled from Reddit joke posts, satirical content, and outdated academic papers because the retrieval system could not distinguish authoritative sources from noise.[31]

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026
Phil Schmid, who authored the most-referenced definition of context engineering, puts it flatly: "Most agent failures are not model failures anymore, they are context failures."[21] Anthropic's engineering blog confirms the same pattern.[25] LinkedIn's Dhyey Mavani argues in VentureBeat that "the limiting factor is no longer the model — it's context."[32]
How does the context problem compound at enterprise scale?
At an individual query level, a context failure produces a wrong answer. At enterprise scale, it produces a systemic erosion of trust, productivity, and decision quality that compounds with every interaction.
What do enterprise users actually experience?
The complaints are consistent across channels. On Reddit's r/LocalLLaMA, the recurring threads are not about model capability. They ask: "How do I make a local LLM retain memory of information?" and "RAG is giving wrong answers — how do you debug retrieval?"[33] On GitHub, developers report that RAG systems "return confident wrong answers for out-of-scope questions" and ask about probability thresholds for abstention.[34] On Hacker News, the concern is shifting to context budget allocation: "How much of my context window is being eaten by MCP server tool definitions vs. actual useful context?"[35]
Enterprise leaders express the same frustration in different vocabulary. CIO.com quotes AI executives: "Prompt engineering cannot deliver the accuracy, memory, or governance required in complex environments on its own."[36] On LinkedIn, Ethan Mollick argues that context engineering "cannot be a solely technical function" — it is business process design in disguise.[37]
How frequently do context failures occur in production systems?
The error rates are higher than most organizations realize. RAND Corporation studied 65 AI projects and found an 80%+ failure rate — roughly double the failure rate of conventional IT projects.[38] Capital One and Forrester surveyed 500 data leaders and found that 73% named data quality as the number-one barrier to AI success, ranking it above model accuracy, integration complexity, and talent gaps.[39]
A 1% hallucination rate sounds manageable in isolation. Multiply it by 1,000 employees asking 10 questions per day, and that is 100 fabricated answers flowing into your organization daily — into customer communications, internal decisions, regulatory filings, and institutional memory. Each wrong answer that goes uncorrected becomes training data for the next wrong answer.
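Stated as a worked calculation (the 250 working days per year is an added assumption):

```python
employees = 1_000
questions_per_day = 10
hallucination_rate = 0.01  # the "manageable" 1%

fabricated_answers_per_day = employees * questions_per_day * hallucination_rate
fabricated_answers_per_year = fabricated_answers_per_day * 250  # working days
```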
What happens if organizations don't fix their context pipelines?
The damage does not plateau. It compounds through a feedback loop that degrades organizational knowledge over time.
What does the data say about AI project abandonment?
The abandonment pattern is accelerating. Gartner surveyed 822 business leaders and found 30% of generative AI projects will be abandoned after proof of concept by end of 2025, citing poor data quality as the primary cause.[28] A separate Gartner survey of 248 organizations found 63% have not progressed past piloting.[28] S&P Global's survey of over 1,000 companies found 42% scrapped most AI initiatives in 2025, up from 17% the prior year.[6]
BCG surveyed 1,250 C-suite executives and found that only 25% are actually realizing significant value from AI. The gap is not model access. Every organization can access GPT-4, Claude, and Gemini. The gap is the infrastructure that gives those models useful context.[4]

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026
How does the context problem create a negative feedback loop?
The compounding mechanism works like this. A poorly curated knowledge base produces inaccurate AI responses. Employees lose trust and stop using the system, or worse, they use it without verifying outputs. Unverified outputs enter documents, emails, and databases. Those documents become the next generation of retrieval sources. The knowledge base degrades further. The AI's answers get worse. The cycle accelerates.
HBR research on "hidden data factories" estimates that knowledge workers spend up to 50% of their time finding, correcting, and working around bad data.[40] When an AI system ingests and amplifies that bad data, it does not just reflect the existing problem. It automates it. What was a human-speed data quality problem becomes a machine-speed data quality problem.
Why isn't a better model the answer?
The conventional response to AI underperformance is to upgrade the model. Larger parameter counts, newer architectures, bigger context windows. The assumption is intuitive: a smarter model will produce smarter outputs. The evidence says otherwise.
In 2023, researchers at Meta published Atlas, an 11-billion-parameter retrieval-augmented model. On the Natural Questions benchmark, Atlas outperformed PaLM, Google's 540-billion-parameter model, by 3 full points — despite being 50 times smaller.[1] In 2022, DeepMind's RETRO, a 7.5-billion-parameter model with retrieval, achieved comparable performance to GPT-3 and Jurassic-1 at 175 billion parameters on the Pile benchmark.[7] The retrieval infrastructure — the context pipeline — overwhelmed a 50× difference in model scale.
The model is not the product. The context pipeline is the product.
The chunking study makes this concrete. Researchers tested four document chunking strategies on the same model, same data, same embedding model. The results ranged from the lowest accuracy with fixed-token chunking to 87% with adaptive chunking, determined entirely by how the documents were split before the model ever saw them.[2] NVIDIA's technical blog corroborates this finding: chunking strategy alone can account for significant accuracy differences.[41]

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026
This is why Karpathy frames the LLM as a CPU and the context window as RAM.[17] You do not fix a slow computer by replacing the processor when the RAM is full of garbage. You fix what the processor is working with.
What does a successful context engineering implementation look like?
Klarna deployed an AI customer service assistant built on the same underlying technology as Air Canada's chatbot — a large language model answering customer questions. The outcomes could not have been more different. Within its first month, Klarna's system handled 2.3 million conversations, performed the equivalent work of 700 full-time agents, and reduced average resolution time from 11 minutes to 2 minutes.[42]
The difference was not the model. The difference was the context pipeline. Klarna built whitelisted knowledge bases that restricted the model's retrieval to verified, current documents. They implemented query safety layers that detected out-of-scope questions before the model could hallucinate a response. They defined scope boundaries that prevented the system from making commitments outside its authority.[42][43]
The Klarna story has an instructive second chapter. In mid-2025, the company began rehiring human agents after initially projecting $40 million in savings from full automation.[44] The reason was not model failure. The human escalation layer is itself a context engineering component — the system needs to know when it lacks sufficient context to respond and route to a human who has it. Removing that layer degraded output quality. The reversal proves the thesis: every layer of the context pipeline matters, including the decision logic for when the AI should not answer at all.
Five Sigma, an insurance technology company, deployed AI for claims processing using SOP-aware configuration that mapped retrieval to specific standard operating procedures. They built multi-agent architectures with explicit scope limits and implemented hallucination evaluators that tested outputs against source documents before delivery. HFS Research independently validated the results: 60% reduction in cycle time, 35% reduction in claims processing costs.[45][46]
Andrew Ng's canonical demonstration makes the mechanism inescapable. Working with a steel manufacturing company, Ng's team applied a data-centric approach — improving the quality and labeling of training data without changing the model. In roughly two weeks, the data-centric approach produced a 16.9-percentage-point improvement in defect detection accuracy. The model-centric approach, run in parallel for months, produced zero improvement.[47] Same model. Different context. Dramatically different results.
What does context engineering failure actually cost?
The financial impact operates at three levels, and organizations typically see only the first.
The direct costs are visible: Air Canada's tribunal payment, Google's $100 billion single-day market cap loss after Bard's launch hallucination went viral,[48] the engineering hours spent debugging retrieval pipelines. These make headlines.
The systemic costs are larger but less visible. Gartner surveyed 154 organizations and found that poor data quality costs an average of $12.9 million per year per enterprise.[9] Fivetran's analysis puts the figure at $406 million annually for large organizations.[10] These costs predate AI, but AI amplifies them. Every document that contains an error, every knowledge base article that is outdated, every policy that conflicts with another policy now generates wrong answers at machine speed instead of human speed.
The compounding costs are the largest and the least visible. The Data Warehousing Institute's 1-10-100 rule — a dollar to prevent bad data, ten to correct it, a hundred when it causes a failure — was updated by Matillion in 2024 to 10-100-1,000 for AI-amplified systems.[17][18] When an AI system ingests bad context, generates wrong outputs, and those outputs enter downstream systems, the correction cost multiplies at every step.
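The rule translates directly into a back-of-the-envelope model. This sketch uses Matillion's AI-era figures; the function and its stage names are illustrative:

```python
def correction_cost(errors, stage_costs=(10, 100, 1_000)):
    """Cost (in dollars) of the same errors handled at three stages:
    prevented at the source, corrected downstream, or allowed to fail."""
    prevent, correct, fail = stage_costs
    return {
        "prevented_at_source": errors * prevent,
        "corrected_downstream": errors * correct,
        "failed_in_production": errors * fail,
    }

# 50 bad source documents: $500 to fix at ingestion,
# $50,000 if their errors surface as production failures.
```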

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026
HBR research by Thomas Redman estimates that fixing data quality at the source eliminates 66–90% of the hidden factory costs that knowledge workers absorb daily.[40] The economics are straightforward: every dollar invested in context quality at the source prevents ten to a hundred dollars in downstream failure.
How do you fix context engineering?
The solution is not a single tool or technique. It is a pipeline — a system of interconnected decisions about what information reaches the model, in what form, at what time, and with what constraints. Phil Schmid's taxonomy identifies seven layers: instructions, user prompt, state and conversation history, long-term memory, retrieved information, available tools, and structured output constraints.[21] Harrison Chase of LangChain distills it to five essential properties: the right information, in the right format, at the right time, complete enough to include everything the model needs, and selective enough to exclude everything it doesn't.[15]
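One way to picture the seven layers is as a single assembly step that renders them into one model input. The dataclass below is a hedged sketch of that idea, not Schmid's or LangChain's API; every field name and label is illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class ContextPackage:
    """Seven context layers assembled into one model input (illustrative)."""
    instructions: str = ""
    user_prompt: str = ""
    history: list = field(default_factory=list)
    long_term_memory: list = field(default_factory=list)
    retrieved: list = field(default_factory=list)
    tools: list = field(default_factory=list)
    output_schema: str = ""

    def render(self):
        # Render each populated layer with a label; drop empty layers so
        # they never consume tokens they have no content to justify.
        parts = [
            f"SYSTEM: {self.instructions}",
            *(f"MEMORY: {m}" for m in self.long_term_memory),
            *(f"HISTORY: {h}" for h in self.history),
            *(f"CONTEXT: {d}" for d in self.retrieved),
            *(f"TOOL: {t}" for t in self.tools),
            f"FORMAT: {self.output_schema}",
            f"USER: {self.user_prompt}",
        ]
        return "\n".join(p for p in parts if not p.endswith(": "))
```

The useful property of making the package explicit is that each layer becomes individually testable and individually budgetable, which is the whole premise of treating context as a system rather than a string.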
The organizations getting this right share a common architecture. They do not treat the AI model as the product. They treat the context pipeline as the product, and the model as a component within it.
At Tricky Wombat, our AI systems are built around this principle. The pipeline, not the model, determines output quality. Here is what that looks like in practice.
1. Source quality and retrieval precision
Most systems ingest documents in bulk and rely on generic chunking and embedding to make them retrievable. This is how Air Canada ended up serving stale bereavement policies and New York City ended up recommending illegal business practices. The retrieval pipeline returns whatever is most similar to the query — not whatever is most accurate, current, or authoritative.
Our pipeline treats source quality as a first-class engineering concern. Documents are assessed for currency, authority, and internal consistency before they enter the knowledge base. Chunking strategies are selected based on document type and query patterns, not applied uniformly. Retrieval is tested against known-good question-answer pairs to measure precision and recall before deployment, not after a user encounters a wrong answer.
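Testing retrieval against known-good question-answer pairs reduces to standard precision and recall over retrieved chunk identifiers. A minimal sketch, with an invented golden set:

```python
def retrieval_scores(retrieved_ids, relevant_ids):
    """Precision and recall of a retriever against a known-good answer set."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Golden set: for this question, chunks 3 and 7 contain the answer.
p, r = retrieval_scores(retrieved_ids=[3, 7, 12, 40], relevant_ids=[3, 7])
# Half of what came back was relevant (precision 0.5);
# everything relevant was found (recall 1.0).
```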
2. Scope constraints and abstention logic
Most systems are configured to always generate an answer. This is how Chevrolet's chatbot agreed to sell a Tahoe for a dollar — it had no mechanism to recognize that a pricing commitment was outside its authority. The default behavior of a language model is to be helpful. Without explicit scope constraints, "helpful" includes being confidently wrong.
Our pipeline includes explicit scope boundaries that define what the system is authorized to answer and, critically, what it is not. When the retrieval pipeline returns low-confidence results or the query falls outside defined boundaries, the system abstains or escalates rather than generating a plausible-sounding fabrication. The abstention threshold is a tunable parameter, not an afterthought.
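A minimal version of that abstention gate might look like the following. The 0.75 threshold and the result format are illustrative; in practice the threshold is tuned against a labeled set of in-scope and out-of-scope queries.

```python
def answer_or_abstain(query_results, threshold=0.75):
    """Route to generation only when retrieval confidence clears a bar.

    query_results: list of (chunk_text, similarity_score) pairs.
    """
    if not query_results or max(score for _, score in query_results) < threshold:
        return {"action": "escalate", "reason": "low retrieval confidence"}
    context = [chunk for chunk, score in query_results if score >= threshold]
    return {"action": "generate", "context": context}

# An out-of-scope question that retrieves nothing confident escalates;
# a question squarely covered by the knowledge base generates.
```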
3. Continuous evaluation and context monitoring
Most systems are evaluated at deployment and then left to drift. Knowledge bases go stale. New documents introduce contradictions. Retrieval quality degrades as the corpus grows. This is the "context rot" that Chroma Research documented across 18 models: performance degrades in non-uniform and unpredictable ways as context accumulates.[22]
Our pipeline runs continuous evaluation against defined quality metrics — retrieval precision, faithfulness scoring, and citation verification. When a source document is updated, the pipeline re-processes affected chunks and re-validates downstream answers. When evaluation metrics degrade, the system flags the specific context layer responsible rather than treating output quality as a black box attributable only to "the AI." The system does not just answer questions. It monitors whether its own answers are getting better or worse over time.
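Flagging the responsible context layer can be as simple as comparing per-layer quality metrics against a deployment baseline. A sketch, with illustrative metric names and scores:

```python
def drifting_layers(baseline, current, tolerance=0.05):
    """Name the context layers whose quality metric degraded past tolerance."""
    return [
        layer for layer, score in current.items()
        if baseline.get(layer, 0.0) - score > tolerance
    ]

baseline = {"retrieval_precision": 0.91, "faithfulness": 0.88, "citation_valid": 0.95}
current  = {"retrieval_precision": 0.79, "faithfulness": 0.87, "citation_valid": 0.94}
# retrieval_precision dropped 0.12: the pipeline names that layer,
# instead of reporting only that "the AI got worse".
flagged = drifting_layers(baseline, current)
```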
The bottom line
Air Canada, New York City, Chevrolet, and Google all deployed capable AI models. None of them failed because the model was not smart enough. They failed because the information environment around the model was stale, incomplete, unscoped, or unverified. Klarna, Five Sigma, and Andrew Ng's steel manufacturing case deployed comparable or smaller models and succeeded because they engineered the context pipeline first.
The emerging discipline of context engineering gives this pattern a name and a practice. Gartner predicts that by 2028, 80% of GenAI business applications will be developed on existing data management platforms, underscoring the centrality of data infrastructure to AI success.[11] HBR frames organizational context as the competitive advantage when every company can access the same models.[49] The question is no longer "which AI model should we use?" It is "do we have the engineering discipline to give any model the right information at the right time?"
The organizations that answer yes will extract value from AI. The organizations that keep upgrading models while ignoring what those models see will keep joining the 95%.
References
- Izacard, G. et al., "Atlas: Few-shot Learning with Retrieval Augmented Language Models," JMLR, 2023. https://jmlr.org/papers/v24/23-0037.html
- Abdelghany, M. et al., "Comparative Evaluation of Advanced Chunking for Retrieval-Augmented Generation in Large Language Models for Clinical Decision Support," Bioengineering, 12(11), 1194, November 2025. https://pmc.ncbi.nlm.nih.gov/articles/PMC12649634/
- McKinsey & Company, "The State of AI: Agents, Innovation, and Transformation," McKinsey Global Survey (n=1,993), 2025. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai
- BCG, "The Widening AI Value Gap: Build for the Future 2025," BCG Global Survey (n=1,250 senior executives and AI decision makers), September 2025. https://www.bcg.com/publications/2025/are-you-generating-value-from-ai-the-widening-gap
- Writer and Workplace Intelligence, "Generative AI Adoption in the Enterprise," Survey Report (n=1,600), March 2025. https://writer.com/blog/enterprise-ai-adoption-survey-press-release/
- S&P Global Market Intelligence, "AI Experiences Rapid Adoption, but with Mixed Outcomes — Highlights from VotE: AI & Machine Learning, Use Cases 2025" (n=1,006), 2025. https://www.spglobal.com/market-intelligence/en/news-insights/research/ai-experiences-rapid-adoption-but-with-mixed-outcomes-highlights-from-vote-ai-machine-learning
- Borgeaud, S. et al., "Improving Language Models by Retrieving from Trillions of Tokens," DeepMind / ICML, 2022. https://arxiv.org/abs/2112.04426
- Anaconda, "State of Data Science 2020," Survey Report (n=2,360), 2020. https://www.anaconda.com/state-of-data-science-2020
- Gartner, "Magic Quadrant for Data Quality Solutions" (survey of n=154 customers of data quality vendors), 2020. https://www.gartner.com/en/documents/3991699
- Fivetran, "The State of Data Quality," 2024. https://www.fivetran.com/reports/data-quality
- Gartner, "Gartner Predicts by 2028, 80% of GenAI Business Apps Will Be Developed on Existing Data Management Platforms," Press Release, June 2025. https://www.gartner.com/en/newsroom/press-releases/2025-06-02-gartner-predicts-by-2028-80-percent-of-genai-business-apps-will-be-developed-on-existing-data-management-platforms
- CBC News, "Air Canada ordered to pay customer who was misled by airline's chatbot," February 2024; CanLII, Moffatt v. Air Canada, 2024 BCCRT 149. https://www.cbc.ca/news/business/air-canada-chatbot-1.7116491
- The Markup, "NYC's AI Chatbot Tells Businesses to Break the Law," March 2024. https://themarkup.org/news/2024/03/29/nycs-ai-chatbot-tells-businesses-to-break-the-law
- Futurism / Gizmodo, "Chevrolet Dealership Chatbot Agrees to Sell Tahoe for $1," December 2023. https://futurism.com/chevrolet-dealer-chatbot-sell-car-1-dollar
- Chase, H., "Context Engineering for Agents," LangChain, 2025. https://blog.langchain.dev/context-engineering/
- Faros AI, "Evaluating Context Engineering for AI Agents," 2025. https://www.faros.ai/blog
-
- Matillion, "The 10-100-1,000 Rule: How AI Amplifies the Cost of Bad Data," 2024. https://www.matillion.com/blog
- Lütke, T. (@tobi), Twitter/X, June 18, 2025. https://twitter.com/tobi/status/1835677375892004864
- García, B., Context Engineering (Manning Publications, expected 2026). https://www.manning.com/
- Schmid, P., "The New Skill in AI is Not Prompting, It's Context Engineering," June 30, 2025. https://www.philschmid.de/context-engineering
- Hong, K., Troynikov, A., and Huber, J., "Context Rot: How Increasing Input Tokens Impacts LLM Performance," Chroma Research, July 14, 2025. https://research.trychroma.com/context-rot
- Databricks Mosaic Research, "Long Context Performance Evaluation," 2025. https://www.databricks.com/research
- Breunig, D., "How Long Contexts Fail," June 2025. https://dbreunig.com/
- Anthropic, "Effective Context Engineering for AI Agents," Anthropic Engineering Blog, September 29, 2025. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- Lopatina, N., "Agentic RAG is the New Baseline," NeurIPS 2025, Contextual AI. https://neurips.cc/
- StackOne, "MCP Context Window Consumption Analysis," 2026. https://www.stackone.com/
- Gartner, "Generative AI Project Abandonment and Data Readiness Survey" (n=822 and n=248), 2025. https://www.gartner.com/en/newsroom
- Databricks Community, "Top 5 RAG Deployment Challenges," Blog Post #67078, 2025. https://community.databricks.com/
- Qodo, "State of AI Code Quality in 2025," Survey Report, 2025. https://www.qodo.ai/reports
- MIT Technology Review, "Google AI Overviews Failure Analysis," 2024. https://www.technologyreview.com/
- Mavani, D., quoted in VentureBeat, "The Limiting Factor is No Longer the Model — It's Context," 2025. https://venturebeat.com/
- Reddit r/LocalLLaMA, recurring threads on RAG stack sharing and retrieval debugging, 2024–2025. https://www.reddit.com/r/LocalLLaMA/
- GitHub Issues, Haystack Framework, Issue #658, "RAG returns confident wrong answers for out-of-scope questions." https://github.com/deepset-ai/haystack/issues
- Hacker News, Item #45418251, "Context budget allocation for MCP tool definitions," 2026. https://news.ycombinator.com/item?id=45418251
- CIO.com, "Moving from AI Pilots to Production-Scale Deployments," Article #4080592, 2025. https://www.cio.com/article/4080592
- Mollick, E., LinkedIn discussion, Activity #7343723250171977729, 2025. https://www.linkedin.com/
- RAND Corporation, "Identifying and Addressing AI Project Failure Patterns" (n=65 qualitative interviews), 2024. https://www.rand.org/
- Capital One and Forrester, "Data Quality as the #1 Barrier to AI Success" (n=500 data leaders), 2025. https://www.forrester.com/
- Redman, T., "Hidden Data Factories and the Cost of Poor Data Quality," Harvard Business Review, 2023. https://hbr.org/
- NVIDIA Technical Blog, "Chunking Strategy Impact on RAG Accuracy," 2025. https://developer.nvidia.com/blog
- Klarna, "Klarna AI Assistant Handles 2.3 Million Conversations in First Month," Press Release, February 2024. https://www.klarna.com/international/press/
- LangChain, "Klarna AI Implementation Case Study," 2024. https://blog.langchain.dev/
- Bloomberg, "Klarna Rehires Human Agents After AI Push," May 2025. https://www.bloomberg.com/
- Google Cloud and Five Sigma, "AI for Claims Processing," Whitepaper, 2024. https://cloud.google.com/
- HFS Research, "Insurance AI Claims Processing Benchmarks," 2024. https://www.hfsresearch.com/
- Ng, A., "A Data-Centric Approach to AI," ScaleUp:AI Conference, 2022; IEEE Spectrum coverage. https://spectrum.ieee.org/andrew-ng-data-centric-ai
- Reuters / CNBC, "Google Loses $100 Billion in Market Cap After Bard Demo Hallucination," February 2023. https://www.reuters.com/
- Murty, R.N. and Kumar, R.S., "When Every Company Can Use the Same AI Models, Context Becomes a Competitive Advantage," Harvard Business Review, February 18, 2026. https://hbr.org/2026/02/when-every-company-can-use-the-same-ai-models-context-becomes-a-competitive-advantage
By Tricky Wombat
Last Updated: Mar 30, 2026