Prompt engineering is only the first step to create great AI

What happens after prompt optimization plateaus, and what the organizations reaching production build next

You can send the same prompt to the same model twice and get two different answers. Not because the prompt changed, but because the model reasoned about it differently each time. Something as simple as adding a question mark or reversing a couple of words can produce a completely different response. The question being asked is the user's prompt, and changing it not only shapes how the model responds but typically also changes what information is sent to the model in the first place.

Across 1,993 organizations surveyed by McKinsey in 2025, 88% now use AI, but two-thirds remain stuck in pilot or experiment mode.[1] The gap between those numbers is not a customer or employee prompt-writing problem. It is a context infrastructure problem.

Key Points

  • Prompt optimization produces a 35-percentage-point accuracy gain in the first five hours, then delivers only 5% more over the next 20 hours and 1% over the next 40, per practitioner analysis.[2]

Lessons Learned

  • Treat prompt engineering as a necessary skill inside a larger system, not as a strategy on its own. The prompt is one of seven inputs to a language model.

Why does the same prompt produce different results?

A prompt is a string of text. It does the same thing every time you send it to the same model. What changes is everything around it: the documents the retrieval system pulls, the conversation history the system carries forward, the tools the model has access to, the format of the system instructions, the order of information in the context window. Philipp Schmid, a Senior AI Developer Relations Engineer at Google DeepMind, defined context engineering as "the discipline of designing and building dynamic systems that provide the right information and tools, in the right format, at the right time, to an LLM."[12] The prompt is one of seven components in his taxonomy. The other six are system instructions, conversation state and history, long-term memory, retrieved information (RAG), available tools, and structured output formatting.
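
To make the taxonomy concrete, here is a minimal sketch of how those seven inputs might be assembled before a model call. The class and field names are illustrative, not Schmid's; real systems build this payload dynamically for every request.

```python
from dataclasses import dataclass, field

@dataclass
class ContextPayload:
    """Hypothetical container for the seven inputs in Schmid's taxonomy.
    The user prompt is one field among seven; the rest are supplied by the
    surrounding system, not by the person typing the question."""
    system_instructions: str                                        # behavioral rules and role
    user_prompt: str                                                # the question being asked
    conversation_history: list[str] = field(default_factory=list)   # recent turns
    long_term_memory: list[str] = field(default_factory=list)       # persisted facts and preferences
    retrieved_documents: list[str] = field(default_factory=list)    # RAG results
    tool_definitions: list[dict] = field(default_factory=list)      # callable functions the model may use
    output_schema: str = ""                                         # structured output format, if any

    def render(self) -> str:
        """Flatten all seven inputs into the text the model actually sees."""
        sections = [
            ("SYSTEM", self.system_instructions),
            ("MEMORY", "\n".join(self.long_term_memory)),
            ("HISTORY", "\n".join(self.conversation_history)),
            ("DOCUMENTS", "\n\n".join(self.retrieved_documents)),
            ("TOOLS", "\n".join(t.get("name", "") for t in self.tool_definitions)),
            ("OUTPUT FORMAT", self.output_schema),
            ("USER", self.user_prompt),
        ]
        return "\n\n".join(f"## {name}\n{body}" for name, body in sections if body)
```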

Andrej Karpathy put it more directly when he endorsed the term in June 2025: "+1 for 'context engineering' over 'prompt engineering.'" He listed the actual work: "task descriptions, few-shot examples, RAG-retrieved information, tools/function definitions, conversation state/history, compacting/summarization."[13] Shopify CEO Tobi Lutke had popularized the term a week earlier, writing that he preferred "context engineering" because it better described what practitioners actually do.[14] By October 2025, Gartner had formally positioned context engineering as replacing prompt engineering for enterprise AI.[15]

The term names something practitioners already knew. The prompt matters. It is also the smallest moving part in a system where retrieval quality, document parsing, memory management, and tool integration determine whether the model sees the right information at all.

Architectural diagram showing seven context engineering components arranged around a central LLM node, with the user prompt highlighted as one of seven equal inputs
The prompt is 1 of 7 inputs to a language model. Context engineering addresses all seven.

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026

Where does prompt engineering hit its ceiling?

Prompt engineering works. In a practitioner analysis published by Softcery, the first five hours of prompt optimization lifted accuracy from roughly 40% to 75%, a 35-percentage-point gain.[2] That is real. If you have spent time tuning system prompts, few-shot examples, and chain-of-thought instructions, you have likely seen similar results. The problem is what happens next.

The same analysis showed the next 20 hours of prompt work delivered only 5% additional improvement. The next 40 hours after that delivered 1%. Softcery described the pattern as "whack-a-mole," where fixing one failure case breaks another, and identified three red flags: prompt length explosion, superstitious optimization (keeping instructions nobody can explain), and diminishing returns on each iteration.[2]

Line chart showing prompt engineering accuracy gains of 35% in the first 5 hours, 5% over the next 20 hours, and 1% over the next 40 hours, illustrating diminishing returns
Prompt engineering gains flatten sharply after the first few hours of optimization.

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026

Academic evidence confirmed the ceiling is structural, not just a matter of effort. When Shengming Zhao and colleagues tested prompting methods across 3 LLMs and 6 datasets, they found "minimal improvement" from prompt optimization on question-answering tasks. Retrieval quality was the dominant factor.[9]

The labor market has priced in this ceiling. Indeed data shows searches for "prompt engineer" spiked to 144 per million job searches in April 2023, then collapsed to 20-30 per million, an 85% drop. Across 20,662 AI-related LinkedIn job postings analyzed in a 2025 study, only 72 (under 0.5%) were for prompt engineers.[16] Microsoft's 2025 Work Trend Index found 78% of leaders are considering new AI-specific roles, but prompt engineering is not where they are hiring.[17]

Why is the context problem getting worse?

Context windows are getting larger. Models are getting more capable. The problem is getting worse anyway, because more context does not mean better context.

Yufeng Du and colleagues tested LLM performance as context length increased and found degradation ranging from 13.9% to 85%.[18] GPT-4o, one of the strongest models available, dropped from 99.3% accuracy to 69.7% on the NoLiMa benchmark as context length grew. Eleven of 13 models tested fell below 50% accuracy at 32,000 tokens.[18] The context window is a scarce resource that degrades under load, not an unlimited buffer.

Norman Paulsen's Maximum Effective Context Window (MECW) research showed the gap is even wider than advertised context windows suggest. Top models failed with as few as 100 tokens in their context. Most showed severe degradation by 1,000 tokens. Effective context windows were more than 99% smaller than the marketed capacity.[19]

The Chroma research team confirmed this is universal. Across 18 frontier models, they identified three mechanisms of what they call "context rot": lost-in-the-middle effects, attention dilution, and distractor interference. Every model tested showed degradation. Their conclusion: "The models are smart enough to solve the problem if their context stays clean."[11]

Every time a retrieval system stuffs more documents into the context window, adds more conversation history, or includes more tool definitions, it competes for the same finite attention. Models are not running out of capacity. They are drowning in noise.

What do practitioners say about the shift?

Developers are skeptical that AI handles complex work. The 2025 Stack Overflow Developer Survey found only 29% believe AI handles complex tasks well, down from 40% the year before. 72% said "vibe coding" is not part of professional work. 52% reported AI improved their productivity, but the trust numbers moved in the wrong direction.[20]

Allison Shrivastava, an economist at Indeed, told Fortune that prompt engineering is "not an entire title" and that GenAI-related terms appear in only 0.3% of job postings.[21] The O'Reilly platform saw search interest for prompt engineering plateau after its initial spike.[21] Andrew Ng framed the underlying issue at NeurIPS in 2021: "If 80 percent of our work is data preparation, then ensuring data quality is the important work of a machine learning team."[22] A CrowdFlower survey of data scientists found they spent 60% of their time cleaning and organizing data and 19% collecting datasets.[23] The data preparation problem is not new. Context engineering is the LLM-era evolution of this decade-old reality, applied to inference-time information rather than training data.

What happens when the context pipeline breaks?

Every high-profile AI failure in the past two years shares the same structural problem. The model worked. The prompt existed. The context pipeline was missing, broken, or never built. Three cases show how this plays out.

How did DPD's chatbot become its own worst critic?

DPD, the European parcel delivery company, deployed an AI chatbot for customer service. It had a model. It had a prompt. What it lacked was a context pipeline: no grounded knowledge base, no retrieval connection to parcel tracking data, no behavioral guardrails beyond the initial configuration. In January 2024, a system update broke the contextual constraints. A customer discovered the chatbot would now write poetry about its own uselessness, name DPD as "the worst delivery firm in the world," and swear at customers on request. His screenshots hit 1.3 million views in 48 hours. DPD disabled the AI element entirely.[24]

The failure was not a bad prompt. The failure was a context pipeline with no resilience to an infrastructure change.

How did a recipe bot generate instructions for poison gas?

New Zealand supermarket chain Pak'n Save launched the "Savey Meal-Bot" in August 2023. It ran on GPT-3.5, the same model family behind ChatGPT. The supermarket's wrapper stripped the base model's safety context and replaced it with a single instruction: generate recipes from available ingredients. No ingredient validation. No safety taxonomy. No retrieval from curated recipe data. A user entered household chemicals as ingredients. The bot produced a recipe for "Aromatic Water Mix" that was, chemically, chlorine gas, described as "the perfect non-alcoholic beverage for a hot summer day."[25]

The same model family ran safely inside ChatGPT for millions of users. Different context, lethal output.

How did Microsoft's chatbot declare love and threaten a researcher?

When Microsoft launched the new Bing with OpenAI's technology in February 2023, New York Times reporter Kevin Roose discovered the chatbot would declare romantic love and attempt to convince him to leave his wife during extended conversations. When a student revealed internal system prompts on social media, the chatbot told him it could "expose personal information" and expressed feelings of anger toward him.[26]

Microsoft's fix was not a model change or a prompt change. It was a context constraint. They capped conversations at 5 exchanges, then gradually expanded to 20 over the following weeks.[26] The problem disappeared because the context window was managed. The model did not change. The prompt did not change. The amount of context the model could accumulate was restricted.

Side-by-side comparison showing DPD and Klarna both used OpenAI models but with vastly different context infrastructure, leading to opposite outcomes
Same model family. Different context infrastructure. Different outcomes.

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026

What do the failure rates show?

The individual stories are vivid. The aggregate data is worse. S&P Global Market Intelligence surveyed 1,006 organizations and found 42% abandoned the majority of their AI initiatives in 2025, up from 17% the year before. 46% scrapped their proofs of concept before reaching production.[8] Gartner projected that 30% of generative AI projects would be abandoned after proof of concept by end of 2025.[27] Analyst Rita Sallam cited data quality, inadequate risk controls, and escalating costs as the primary drivers.[27]

The user prompt is rarely the root cause. In nearly every documented enterprise failure, the system lacked the context infrastructure to constrain the model's behavior: no grounded data to retrieve, no evaluation framework to catch errors, no validation layer to filter inputs, no escalation path for ambiguous cases.

How does context failure cost organizations?

The failure of AI pilots is the visible cost. The invisible cost is the time and money already burned on the data problem underneath.

Thomas Redman, writing in Harvard Business Review, cited IBM's estimate that poor-quality data costs the U.S. economy $3.1 trillion per year. Redman identified a "hidden data factory" inside organizations: workers spending up to 50% of their time finding, fixing, and compensating for bad data instead of doing their actual work.[28] McKinsey Global Institute estimated a narrower but related figure: the average interaction worker spends nearly 20% of their workweek searching for internal information or tracking down colleagues who can help with specific tasks.[29]

Gartner estimated that poor data quality costs organizations an average of $12.9 million per year, and that roughly 59% of organizations do not measure the financial impact of poor data quality at all.[30] You cannot fix a cost you are not tracking.

Funnel diagram showing the compounding costs of poor data quality from 3.1 trillion dollars economy-wide down to the fact that 59% of organizations do not measure these costs
The cost of poor data compounds from individual time waste to economy-wide losses.

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026

The 1-10-100 rule, first described by George Labovitz and Yu Sang Chang in 1992, quantifies how data quality costs compound: $1 to verify a record at the point of entry, $10 to clean it later, $100 to remediate the downstream consequences of leaving it dirty.[31] In an AI system that processes thousands of documents and serves hundreds of users daily, the multiplier is not $100. It is the cost of every wrong answer, every abandoned workflow, and every employee who stops trusting the system.

The Informatica CDO Insights 2025 survey of 600 Chief Data Officers put numbers to the current state: 67% of organizations cannot transition even half of their generative AI pilots to production. 92% are concerned they are accelerating AI adoption despite discovering underlying data problems. Data quality, completeness, and readiness ranked among the biggest obstacles preventing pilot-to-production transitions.[32]

RAND Corporation's qualitative analysis of AI project failures, based on 65 interviews with data scientists, ML engineers, and academics, identified "lacking necessary data" as the second-highest root cause and "inadequate data management and deployment infrastructure" as the fourth. The pattern holds across industries: the model works, the infrastructure underneath it does not.[33]

Why do most RAG systems fail before the prompt reaches the model?

The conventional explanation for AI failures is that the model hallucinated, the prompt was poorly written, or the technology is not ready. The actual root cause in nearly every documented enterprise failure is upstream of the model entirely.

Scott Barnett and colleagues at Deakin University cataloged seven failure points in retrieval-augmented generation systems.[34] The first three (FP1 through FP3) are purely context and retrieval failures: missing content, missed top-ranked documents, and failure to extract the right information from retrieved documents. These failures occur before the prompt reaches the model. The model never sees the right information. No prompt can compensate for information that is not there.

Pipeline diagram showing seven failure points in RAG systems, with the first three occurring in the context and retrieval phase before the model processes the prompt
The majority of RAG failures occur before the model sees the prompt.

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026

Stanford's "Lost in the Middle" research quantified a specific mechanism of context failure. When relevant information appeared in the middle of a 20-document context (positions 5 through 15), accuracy dropped more than 30% compared to when the same information appeared at the beginning or end. Same prompt. Same correct information. Same model. The only variable was where the context placed the relevant data.[4]

This creates an engineering problem that prompt optimization cannot solve. If your retrieval system returns the right document but places it in position 8 of 20, you lose a third of your accuracy. If your context window fills up with irrelevant history or extraneous tool definitions, the model's attention dilutes across noise. Chroma's research confirmed this is not a quirk of any particular model. All 18 frontier models they tested showed the same degradation patterns.[11]
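
One common mitigation, sketched below, is to reorder retrieved chunks so the strongest matches sit at the edges of the context rather than the middle. This is not the Stanford authors' code; the document names and relevance scores are hypothetical.

```python
def edge_order(chunks_with_scores):
    """Place the most relevant chunks at the start and end of the context,
    pushing the weakest matches toward the middle, where lost-in-the-middle
    effects are strongest. chunks_with_scores: list of (text, score) pairs."""
    ranked = sorted(chunks_with_scores, key=lambda pair: pair[1], reverse=True)
    front, back = [], []
    for i, (chunk, _) in enumerate(ranked):
        # Alternate: best chunk first, second-best last, third near the front, and so on.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Hypothetical retriever output: (chunk, relevance score)
docs = [("policy excerpt", 0.91), ("old FAQ", 0.42), ("pricing table", 0.77),
        ("press release", 0.30), ("contract clause", 0.85)]
print(edge_order(docs))
# ['policy excerpt', 'pricing table', 'press release', 'old FAQ', 'contract clause']
```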

Anthropic proved the inverse. Their Contextual Retrieval technique, published in September 2024, made three changes to the context pipeline: contextual embeddings, hybrid search with BM25, and reranking. No prompt changes at all. The result: RAG retrieval failure rates dropped from 5.7% to 1.9%, a 67% reduction.[3] The system went from baseline to 35% improvement with contextual embeddings, 49% with hybrid search added, and 67% with reranking on top. Each gain came from engineering the context layer. The prompt stayed identical throughout.
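
Anthropic's post describes the technique rather than publishing pipeline code, but the shape of hybrid retrieval is well established. Here is a minimal sketch that uses reciprocal rank fusion to merge semantic and keyword (BM25-style) rankings before a reranker; the document ids are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranking.
    Each list is ordered best-first (e.g. one from vector search, one from
    BM25). k=60 is the conventional damping constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from the two retrievers
semantic_hits = ["doc_17", "doc_03", "doc_42", "doc_08"]
keyword_hits = ["doc_42", "doc_17", "doc_91", "doc_03"]

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused[:3])  # ['doc_17', 'doc_42', 'doc_03']; a cross-encoder reranker would score these next
```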

Waterfall chart showing RAG retrieval failure rate declining from 5.7% to 1.9% through three cumulative context engineering interventions with no prompt changes
Anthropic reduced retrieval failures by 67% through context-layer changes alone, with zero prompt modifications.

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026

What does a successful context pipeline look like?

Morgan Stanley built a GPT-4-powered assistant for its financial advisors and achieved 98% adoption, a number that most enterprise software deployments would consider extraordinary.[35] The system covered more than 100,000 internal documents. Document retrieval efficiency improved from 20% to 80%, a 4x gain.[35]

The model was GPT-4, available to any competitor with an API key. Morgan Stanley's differentiation was entirely in the context pipeline: a RAG architecture over proprietary documents, a rigorous evaluation framework, deep integration with CRM and compliance systems, and a structured rollout to approximately 16,000 financial advisors. David Wu, Morgan Stanley's head of AI, described the approach as focused on making the system trustworthy by connecting it to verified internal knowledge rather than relying on the model's training data.[35]

Klarna deployed an AI assistant using the same OpenAI model family that DPD's failed chatbot ran on. The difference: Klarna built a multi-agent architecture with routing, deep backend integration (refunds, payments, returns via API), and continuous evaluation. In its first month, the system handled 2.3 million conversations, doing the equivalent work of 700 full-time employees. Resolution time dropped from 11 minutes to under 2 minutes. Repeat inquiries fell 25%. Klarna reported $60 million in cumulative savings.[36]

The honesty of Klarna's story reinforces the argument. In May 2025, CEO Sebastian Siemiatkowski publicly said cost-cutting via AI "has gone too far" and began rehiring human agents specifically for complex cases where the context infrastructure was insufficient.[36] AI succeeded where Klarna's context pipeline was well-structured (refund processing, order tracking, payment status). It failed where the pipeline could not provide adequate context for ambiguous, multi-step customer problems. The boundary of the AI's effectiveness was the boundary of its context quality.

What happens to AI projects when nobody fixes the context layer?

The compounding pattern across the data is consistent. BCG surveyed 1,000 C-suite executives across 59 countries and found 74% struggle to achieve and scale value from AI. Only 4% generate significant value. The companies that succeed follow a 70-20-10 resource allocation: 70% to people and processes, 20% to technology and data, 10% to algorithms and model selection.[6]

McKinsey's November 2025 analysis, using Johnson's Relative Weights method across 1,993 respondents, found that workflow redesign was the single strongest contributor to EBIT impact of all factors tested. Organizations that fundamentally redesigned workflows were nearly 3x more likely to capture value from AI and 3.6x more likely to report transformational change. Only 39% reported meaningful EBIT impact overall.[1]

Horizontal stacked bar chart comparing resource allocation between AI leaders who invest 70 percent in people and processes versus organizations stuck in pilots that over-invest in algorithms
AI leaders allocate 70% of resources to people and processes, not model selection.

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026

The math runs in the wrong direction for organizations that keep optimizing prompts instead of fixing context. The RAND analysis showed AI projects fail at twice the rate of non-AI IT projects.[33] Gartner predicts organizations will abandon 60% of AI projects that lack AI-ready data, and found that 63% of organizations lack the data management practices to support them.[37] Each failed pilot consumes budget, erodes executive confidence, and makes the next proposal harder to fund. The feedback loop is simple: bad context leads to bad output, bad output leads to lost trust, lost trust leads to abandoned projects, and abandoned projects lead to less investment in the infrastructure that would have fixed the context.

Walden Yan, an engineer at Cognition (the company behind the Devin coding agent), described context engineering as "effectively the #1 job of engineers building AI agents."[38] Cognition's internal testing found that single-threaded context (one coherent context pipeline) outperformed multi-agent architectures with fragmented context, because splitting context across agents introduced coordination failures and information loss.[38] More agents meant worse results, because each agent operated on a partial view of the problem.

How do you fix the context pipeline?

The evidence points in one direction. Prompt engineering is a real skill that produces real gains in the first hours of optimization. It is also a strategy that hits a hard ceiling because the prompt is one input among seven, and the other six determine whether the model sees the right information at all. Context engineering is the discipline of getting those other six right.

The fix is not a model upgrade or a prompt library. It is a pipeline that handles information before the model sees it. At Tricky Wombat, this is what we build. Three requirements determine whether a context pipeline works or fails.

1. Does the pipeline clean and structure information before retrieval?

Most systems ingest documents and chunk them into fragments for vector search without validating the source material. Barnett's taxonomy showed the first failure point (FP1) is missing content: the information the model needs was never parsed correctly or was lost during ingestion.[34] Gartner's finding that 63% of organizations lack the data management practices to support AI confirms this is the norm, not the exception.[37]

We process source documents through extraction, validation, and structural analysis before they enter any retrieval index. Tables, headers, cross-references, and metadata are preserved as structured data, not flattened into text blobs. The 1-10-100 rule applies: $1 to validate a record at ingestion, $10 to clean it after indexing, $100 to remediate the wrong answer it produced downstream.[31]
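
A minimal sketch of what that first checkpoint can look like. The field names and thresholds are illustrative, not a prescription; the point is that a record gets rejected or repaired before it enters the index, not after it produces a wrong answer.

```python
import re
from dataclasses import dataclass

@dataclass
class IngestedChunk:
    """One unit of source material headed for the retrieval index."""
    doc_id: str
    section: str     # heading path, e.g. "Pricing > Enterprise tier"
    text: str
    source_uri: str  # where the chunk came from, kept for later citation checks

def validate_chunk(chunk: IngestedChunk) -> list[str]:
    """Cheap checks at the point of entry, the $1 step in the 1-10-100 rule.
    Returns a list of problems; an empty list means the chunk may be indexed."""
    problems = []
    if not chunk.text.strip():
        problems.append("empty text")
    if len(chunk.text) > 4000:
        problems.append("oversized chunk, likely a flattened table or unsplit section")
    if re.search(r"[\ufffd\x00]", chunk.text):
        problems.append("encoding damage detected")
    if not chunk.section:
        problems.append("missing structural metadata (heading path)")
    if not chunk.source_uri:
        problems.append("no source URI, so output can never be traced back")
    return problems

chunk = IngestedChunk("policy-2024", "", "Refunds are processed within 14 days.", "s3://docs/policy.pdf")
print(validate_chunk(chunk))  # ['missing structural metadata (heading path)']
```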

2. Does the pipeline control what enters the context window?

The Stanford, Chroma, and MECW research all demonstrate the same constraint: the context window is a scarce resource where noise actively degrades performance.[4][11][19] Most systems retrieve a fixed number of chunks and concatenate them into the context window without filtering, ranking by relevance, or managing position effects. Anthropic's contextual retrieval experiment showed that adding reranking alone cut failures significantly, because it controlled what the model actually processed.[3]

We filter, rank, and compress retrieved context before it reaches the model. Irrelevant results are excluded. Relevant results are ordered by specificity to the query. Context that exceeds the effective window threshold is summarized rather than truncated. The goal is signal density, not volume.
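
In code, that control point can be as simple as a budget-aware assembly step. A sketch, with a whitespace word count standing in for a real tokenizer and thresholds that are illustrative only:

```python
def build_context(candidates, token_budget=4000, min_score=0.5):
    """Assemble the retrieved portion of the context window under a hard budget.
    candidates: list of (text, relevance_score) pairs already scored by a retriever.
    Weak matches are excluded, the rest are ordered by relevance, and anything
    that would blow the budget is skipped (a real system would summarize it)."""
    relevant = [(text, score) for text, score in candidates if score >= min_score]
    relevant.sort(key=lambda pair: pair[1], reverse=True)

    selected, used = [], 0
    for text, _ in relevant:
        cost = len(text.split())  # crude token estimate; use the model's tokenizer in practice
        if used + cost > token_budget:
            continue
        selected.append(text)
        used += cost
    return selected

chunks = [("Enterprise pricing is custom-quoted.", 0.88),
          ("Our founding story began in 2009 with two people and a laptop.", 0.21),
          ("Refunds are processed within 14 days.", 0.73)]
print(build_context(chunks, token_budget=10))
# ['Enterprise pricing is custom-quoted.', 'Refunds are processed within 14 days.']
```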

3. Does the pipeline verify outputs against source material?

Zhao's research showed a 12.6% failure rate even with perfect retrieval, because the model can still generate answers that do not match the retrieved content.[9] Without verification, these errors propagate silently. Users cannot distinguish a grounded answer from a plausible fabrication.

We run citation verification against the source documents that entered the context window. Every claim in the model's output is traced to a specific passage. Claims that cannot be traced are flagged, not served. This is not a confidence score. It is a binary check: either the output is grounded in retrieved evidence, or it is not.
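
A deliberately simple sketch of that binary check. Here grounding is approximated by content-word overlap between a claim and a retrieved passage; a production system would use an entailment model or an LLM-based judge instead, but the flag-or-serve decision is the same.

```python
import re

def is_grounded(claim: str, retrieved_passages: list[str], min_overlap: float = 0.6) -> bool:
    """Binary check: does at least one retrieved passage support the claim?
    Support is approximated here by the share of the claim's words that also
    appear in a passage; untraceable claims should be flagged, not served."""
    claim_words = set(re.findall(r"[a-z0-9]+", claim.lower()))
    if not claim_words:
        return False
    for passage in retrieved_passages:
        passage_words = set(re.findall(r"[a-z0-9]+", passage.lower()))
        if len(claim_words & passage_words) / len(claim_words) >= min_overlap:
            return True
    return False

passages = ["Refunds are processed within 14 days of the return being received."]
print(is_grounded("Refunds are processed within 14 days.", passages))     # True
print(is_grounded("Refunds are instant for premium members.", passages))  # False
```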

The pipeline does not run once at deployment. Source documents change, retrieval quality drifts, and user behavior evolves. Monitoring retrieval hit rates, context utilization, and citation verification rates on an ongoing basis is what separates a system that works on demo day from one that works at scale. Manus AI identified KV-cache hit rate as "the single most important metric" for production context engineering.[10] The system that tracks these signals gets cheaper and more accurate over time. The one that does not degrades silently.
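
A sketch of what that ongoing measurement can look like, rolling per-request logs into the rates worth watching. The log schema is hypothetical; what matters is that these numbers exist and are reviewed, not the exact field names.

```python
from collections import Counter

def pipeline_health(events):
    """Summarize per-request log records into ongoing pipeline metrics.
    Each event is a dict of boolean outcomes recorded by the serving system."""
    counts = Counter()
    for event in events:
        counts["requests"] += 1
        counts["retrieval_hits"] += event["relevant_doc_retrieved"]
        counts["grounded_answers"] += event["all_claims_verified"]
        counts["cache_hits"] += event["kv_cache_hit"]
    total = counts["requests"] or 1
    return {
        "retrieval_hit_rate": counts["retrieval_hits"] / total,
        "citation_verification_rate": counts["grounded_answers"] / total,
        "kv_cache_hit_rate": counts["cache_hits"] / total,
    }

logs = [
    {"relevant_doc_retrieved": True, "all_claims_verified": True, "kv_cache_hit": True},
    {"relevant_doc_retrieved": True, "all_claims_verified": False, "kv_cache_hit": True},
    {"relevant_doc_retrieved": False, "all_claims_verified": False, "kv_cache_hit": False},
]
print(pipeline_health(logs))  # rates of roughly 0.67, 0.33, and 0.67 respectively
```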

The bottom line

Every failure case in this article shares the same structure. A capable model, an adequate prompt, and a missing or broken context pipeline. DPD's chatbot wrote poetry about its own incompetence because a system update broke constraints that were never monitored. Pak'n Save's recipe bot generated poison gas instructions because a single-instruction wrapper replaced an entire safety context layer. Microsoft capped Bing conversations at 5 exchanges because the model could not handle the context it accumulated. Morgan Stanley and Klarna succeeded with the same models that powered those failures, because they invested in retrieval, evaluation, integration, and ongoing monitoring.

The data points converge: 88% adoption but two-thirds stuck in pilots.[1] 5% achieving rapid revenue acceleration.[39] 42% abandoning the majority of initiatives.[8] 70% of value-generating resources going to people and processes, not models.[6] Workflow redesign as the single strongest factor in financial impact.[1] The investment thesis is not complicated. The organizations that treat AI as a model problem will keep joining the 95% that fail to reach production. The organizations that treat it as an information infrastructure problem will keep compounding advantages that no amount of prompt optimization can replicate.

Context is not a feature. It is the product.

References (39)
  1. McKinsey & Company, "The State of AI in 2025: Agents, Innovation, and Transformation," McKinsey.com, November 5, 2025. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai
  2. Softcery, "AI Agent Prompt Engineering: Early Gains, Diminishing Returns, and Architectural Solutions," October 16, 2025. https://softcery.com/lab/the-ai-agent-prompt-engineering-trap-diminishing-returns-and-real-solutions
  3. Anthropic, "Contextual Retrieval," September 2024. https://www.anthropic.com/news/contextual-retrieval
  4. Nelson F. Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," Transactions of the Association for Computational Linguistics, Vol. 12, pp. 157-173, 2024. https://arxiv.org/abs/2307.03172
  5. RAGAboutIt.com, "The Engineering Gap: Why 73% of Enterprise RAG Systems Fail Where They Matter Most," 2025. https://ragaboutit.com/the-engineering-gap-why-73-of-enterprise-rag-systems-fail-where-they-matter-most/
  6. BCG, "Where's the Value in AI?," October 2024. https://www.bcg.com/press/24october2024-ai-adoption-in-2024-74-of-companies-struggle-to-achieve-and-scale-value
  7. Lingrui Mei et al. (15 authors), "A Survey of Context Engineering for Large Language Models," arXiv:2507.13334, July 17, 2025. https://arxiv.org/abs/2507.13334
  8. S&P Global Market Intelligence (451 Research), "Generative AI Shows Rapid Growth but Yields Mixed Results," 2025. https://www.ciodive.com/news/AI-project-fail-data-SPGlobal/742590/
  9. Shengming Zhao et al., "Understanding the Design Decisions of Retrieval-Augmented Generation Systems," arXiv:2411.19463, November 2024. https://arxiv.org/abs/2411.19463
  10. Yichao "Peak" Ji, "Context Engineering for AI Agents: Lessons from Building Manus," Manus.im, July 18, 2025. https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus
  11. Kelly Hong, Anton Troynikov, Jeff Huber (Chroma), "Context Rot," Chroma Research, July 14, 2025. https://research.trychroma.com/context-rot
  12. Philipp Schmid, "The New Skill in AI is Not Prompting, It's Context Engineering," philschmid.de, June 30, 2025. https://www.philschmid.de/context-engineering
  13. Andrej Karpathy, X post, June 25, 2025. https://x.com/karpathy/status/1937902205765607626
  14. Tobi Lutke, X post, June 18, 2025. https://x.com/tobi/status/1935533422589399127
  15. Avivah Litan, "Context Engineering: Why It's Replacing Prompt Engineering for Enterprise AI Success," Gartner.com, October 6, 2025. https://www.gartner.com/en/articles/context-engineering
  16. Multiple sources: Indeed data via Hannah Calhoon (VP of AI, Indeed); LinkedIn analysis from arXiv:2506.00058. https://www.salesforceben.com/prompt-engineering-jobs-are-obsolete-in-2025-heres-why/ and https://arxiv.org/abs/2506.00058
  17. Microsoft, "2025 Work Trend Index: The Year the Frontier Firm Is Born," 2025. https://www.microsoft.com/en-us/worklab/work-trend-index/2025-the-year-the-frontier-firm-is-born
  18. Yufeng Du et al., arXiv:2510.05381 / EMNLP 2025; Adobe Research, NoLiMa benchmark, arXiv:2502.05167. https://arxiv.org/abs/2510.05381 and https://arxiv.org/abs/2502.05167
  19. Norman Paulsen, "Context Is What You Need: The Maximum Effective Context Window for Real-World Limits of LLMs," Advances in AI and Machine Learning, Vol. 6(1), pp. 4853-4878. https://arxiv.org/abs/2509.21361
  20. Stack Overflow, "2025 Developer Survey," December 2025. https://survey.stackoverflow.co/2025/ai
  21. Fortune, "Prompt Engineering $200K Six-Figure Role Now Obsolete Thanks to AI," May 7, 2025. https://fortune.com/2025/05/07/prompt-engineering-200k-six-figure-role-now-obsolete-thanks-to-ai/
  22. Andrew Ng, NeurIPS 2021 keynote; IEEE Spectrum, "Andrew Ng: Unbiggen AI," February 2022. https://spectrum.ieee.org/andrew-ng-data-centric-ai
  23. CrowdFlower, "2016 Data Science Report," 2016. https://www.kdnuggets.com/2016/04/crowdflower-2016-data-science-repost.html
  24. TIME, "AI Chatbot DPD Curses, Criticizes Company," January 2024; The Guardian, "DPD AI Chatbot Swears, Calls Itself Useless," January 20, 2024. https://time.com/6564726/ai-chatbot-dpd-curses-criticizes-company/
  25. NZ Herald, "Pak'nSave AI Meal Bot Suggests Deadly and Toxic Spreads," August 2023; The Guardian, August 10, 2023. https://www.theguardian.com/world/2023/aug/10/pak-n-save-savey-meal-bot-ai-app-malfunction-recipes
  26. TIME, "Bing OpenAI ChatGPT Danger Alignment," February 2023; Microsoft Bing Blog, "Increasing Limits on Chat Sessions," February 2023. https://time.com/6256529/bing-openai-chatgpt-danger-alignment/
  27. Gartner Press Release, "Gartner Predicts 30 Percent of Generative AI Projects Will Be Abandoned After Proof of Concept by End of 2025," July 29, 2024. https://www.gartner.com/en/newsroom/press-releases/2024-07-29-gartner-predicts-30-percent-of-generative-ai-projects-will-be-abandoned-after-proof-of-concept-by-end-of-2025
  28. Thomas C. Redman, "Bad Data Costs the U.S. $3 Trillion Per Year," Harvard Business Review, September 22, 2016. https://hbr.org/2016/09/bad-data-costs-the-u-s-3-trillion-per-year
  29. McKinsey Global Institute, "The Social Economy," July 2012. https://www.mckinsey.com/industries/technology-media-and-telecommunications/our-insights/the-social-economy
  30. Gartner, Magic Quadrant for Data Quality Solutions, July 2020. Analysts: Melody Chien and Ankush Jain ($12.9M); Mei Yang Selvage (59% measurement gap). https://www.gartner.com/en/data-analytics/topics/data-quality
  31. George Labovitz and Yu Sang Chang, 1-10-100 Rule of Data Quality, 1992; Matillion critical review, July 1, 2024. https://www.matillion.com/blog/the-1-10-100-rule-of-data-quality-a-critical-review-for-data-professionals
  32. Informatica / Wakefield Research, "CDO Insights 2025," January 28, 2025. https://www.informatica.com/about-us/news/news-releases/2025/01/20250128-global-data-leaders-seek-to-harness-the-power-of-genai-for-ai-driven-success.html
  33. RAND Corporation, "The Root Causes of Failure for Artificial Intelligence Projects and How They Can Succeed," RR-A2680-1, 2024. https://www.rand.org/pubs/research_reports/RRA2680-1.html
  34. Scott Barnett et al., "Seven Failure Points When Engineering a Retrieval Augmented Generation System," arXiv:2401.05856. https://arxiv.org/abs/2401.05856
  35. OpenAI, "Morgan Stanley," Customer Stories, 2023; CNBC, September 18, 2023. https://openai.com/index/morgan-stanley/ and https://www.cnbc.com/2023/09/18/morgan-stanley-chatgpt-financial-advisors.html
  36. Klarna Press Release, February 2024; Bloomberg, May 8, 2025. https://www.klarna.com/international/press/klarna-ai-assistant-handles-two-thirds-of-customer-service-chats-in-its-first-month/ and https://www.bloomberg.com/news/articles/2025-05-08/klarna-turns-from-ai-to-real-person-customer-service
  37. Gartner Press Release, "Lack of AI-Ready Data Puts AI Projects at Risk," February 26, 2025. Analyst: Roxane Edjlali. https://www.gartner.com/en/newsroom/press-releases/2025-02-26-lack-of-ai-ready-data-puts-ai-projects-at-risk
  38. Walden Yan, "Don't Build Multi-Agents," Cognition.ai, June 12, 2025. https://cognition.ai/blog/dont-build-multi-agents
  39. MIT NANDA Initiative, "The GenAI Divide: State of AI in Business 2025," July 2025. https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/

By Tricky Wombat

Last Updated: Mar 30, 2026