Context engineering
How the emerging practice of engineering what your AI sees explains why 95% of companies fail to generate value from their AI investments

The gap between AI spending and AI results is not a model problem. An 11-billion-parameter model with well-engineered context outperforms a 540-billion-parameter model without it.[1] A single change to how documents are chunked before retrieval produces a 74-percentage-point accuracy swing on an identical model.[2] The binding constraint on every deployed AI system is the quality, structure, and freshness of the information it sees before it generates a single token. The emerging discipline that addresses this constraint has a name: context engineering.
Lessons Learned
Invest in context pipeline quality before upgrading models. The research shows that well-engineered context lets a model outperform a competitor 50 times its size.[1]
Why are most AI deployments failing despite massive investment?
In June 2025, Shopify CEO Tobi Lütke posted a tweet that collected 1.9 million views and 8,600 bookmarks. It said: "I really like the term 'context engineering' over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM."[19] Within a week, former OpenAI researcher Andrej Karpathy endorsed the framing, adding that the LLM is the CPU and the context window is its RAM.[17] Within a month, Anthropic, LangChain, and dozens of engineering teams had published implementation guides. A Manning textbook entered production.[20]
The term gained traction so quickly because it named something practitioners already knew: the model is not the bottleneck. The information environment you construct around the model determines whether it produces reliable output or confident nonsense.
Context engineering is the discipline of designing and building dynamic systems that provide the right information and tools, in the right format, at the right time, to an AI model.[21] It encompasses retrieval-augmented generation, system prompt design, tool selection, memory management, conversation history curation, and structured output constraints. Where prompt engineering treats the input as a static string to optimize, context engineering treats it as a system to architect.

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026
How do you measure whether an AI system has a context problem?
The symptoms are specific and quantifiable. Chroma Research evaluated 18 large language models, including GPT-4.1, Claude 4, and Gemini 2.5, and found that model performance degrades in "surprising and non-uniform" ways as context length increases, even on simple tasks.[22] The standard industry benchmark for long-context performance, Needle-in-a-Haystack, gives false confidence because it tests only lexical retrieval, not the semantic reasoning that real applications require.[22]
The degradation is measurable. Databricks Mosaic Research found that 11 out of 12 frontier models drop below 50% accuracy on reasoning tasks when context exceeds 32,000 tokens.[23] Drew Breunig's taxonomy of context failures categorizes the mechanisms: context poisoning (bad information), context distraction (irrelevant information), context confusion (contradictory information), and context clash (incompatible formatting).[24] Every failure mode traces to what the model was given, not what the model was capable of.
The most precise measurement comes from a 2025 clinical study published in Bioengineering through PMC/NIH. Researchers tested the same Gemini 1.0 Pro model with the same dataset and the same embedding model, changing only the document chunking strategy. Fixed-token chunking produced the lowest accuracy. Proposition chunking and semantic chunking produced intermediate results. Adaptive chunking produced the highest accuracy at 87%. Same model. Same data. Dramatic accuracy swing from a single pipeline decision.[2]
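To make that pipeline decision concrete, here is a minimal sketch contrasting fixed-token chunking with sentence-aware chunking. The study's adaptive strategy is more sophisticated than this, and the function names and sample document here are purely illustrative, but even in toy form the structural difference is visible: fixed splits sever facts mid-sentence.

```python
import re

def fixed_token_chunks(text, size=8):
    """Split on a fixed word count, ignoring sentence boundaries."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def sentence_chunks(text, max_words=20):
    """Greedily pack whole sentences into chunks, never splitting one."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for s in sentences:
        if current and len(" ".join(current + [s]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(s)
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = ("The policy covers bereavement fares. Refunds require a death certificate. "
       "Claims must be filed within 90 days. Partial refunds are not available.")

# Fixed-token chunking cuts mid-sentence here, stranding "90" away from "days";
# sentence-aware chunking keeps each complete statement retrievable as a unit.
```

A retriever working on the fixed-token chunks can return "filed within 90" without the unit it belongs to, which is exactly the class of error the clinical study measured.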
Why is the context problem getting worse?
Three forces are compounding simultaneously. First, context windows are growing faster than organizations' ability to curate what goes into them. Gemini supports over 1 million tokens. GPT-4 Turbo handles 128,000. The intuition that "more context is better" is empirically wrong. Anthropic's engineering team describes attention as an n² pairwise relationship: doubling the context quadruples the number of token-to-token relationships the model must resolve, while the useful information rarely grows at all.[25]
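The arithmetic behind that n² claim is simple enough to state directly:

```python
def attention_pairs(n_tokens):
    # Self-attention relates every token to every other token,
    # so pairwise relationships grow with the square of context length.
    return n_tokens ** 2

# Doubling a 32k-token context to 64k quadruples the relationships
# the model must weigh; the useful signal rarely grows with it.
growth = attention_pairs(64_000) / attention_pairs(32_000)
```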
Second, agent architectures are multiplying the problem. Agentic RAG, declared "the new baseline" at NeurIPS 2025,[26] means AI systems now autonomously decide what to retrieve, when to retrieve it, and how much to include. Every tool definition added to an agent's context window consumes tokens. StackOne documented a single MCP server configuration consuming 1.17 million tokens, a phenomenon they called "agent suicide by context."[27]
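A context-budget check of the kind StackOne's finding argues for can be sketched in a few lines. The four-characters-per-token heuristic, the 25% budget, and every name below are illustrative assumptions, not any MCP library's actual API:

```python
def approx_tokens(text):
    # Rough heuristic: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

def tool_budget_report(tool_schemas, window=128_000, max_tool_share=0.25):
    """Flag configurations whose tool definitions crowd out useful context.

    tool_schemas: mapping of tool name -> JSON schema text.
    """
    used = {name: approx_tokens(schema) for name, schema in tool_schemas.items()}
    total = sum(used.values())
    return {
        "tool_tokens": total,
        "share_of_window": total / window,
        "over_budget": total / window > max_tool_share,
        "largest_tool": max(used, key=used.get) if used else None,
    }
```

Running a report like this at configuration time, before any query is served, is the cheap way to catch an agent whose tool definitions have already eaten the window.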
Third, organizations are scaling AI from pilots to production without scaling their information infrastructure. Gartner found that 30–60% of generative AI projects are abandoned after proof of concept, most commonly because the underlying data cannot support production-quality retrieval.[28] S&P Global's 2025 survey found 42% of companies scrapped most AI initiatives that year, up from 17% in 2024.[6]
What do the people building these systems actually say?
The practitioners are blunt. In Databricks' community survey, the top two deployment challenges reported by RAG builders were "My retriever is returning irrelevant documents" and "I don't trust the quality of the content produced by the RAG app."[29] Not model complaints. Retrieval complaints.
Qodo's 2025 State of AI Code Quality survey found that 44% of developers who said AI degraded their code quality blamed missing context as the primary cause. Among developers who manually selected context for the AI, 54% said the model still missed relevant information.[30] Senior engineers reported the biggest quality gains from AI (60%) but the lowest confidence in shipping AI-generated code (22%).[30] They know enough to see what the context pipeline is getting wrong.
The Anaconda 2020 State of Data Science survey of 2,360 practitioners found that data scientists spend 45% of their time on data preparation — loading, cleaning, and structuring information — and only 11% on model selection and building.[8] The time allocation tells you where the bottleneck lives. The strategy conversations do not.
What happens when context engineering fails?
The consequences are not theoretical. They are public, measurable, and increasingly expensive. The most prominent AI failures of 2024–2025 trace to the same root cause: the model received bad, incomplete, or stale context and did exactly what language models do — generated a confident, fluent response from whatever it was given.
How did New York City's official chatbot advise business owners to break the law?
New York City launched an AI chatbot on its official NYC.gov portal to help small business owners navigate regulations. The Markup investigated and found the chatbot repeatedly gave illegal advice: telling landlords they could discriminate based on income source, telling business owners they did not need to accept cash, telling restaurant operators they could fire employees who complained about safety violations.[13]
Every piece of advice violated existing New York City law. The chatbot's knowledge base contained incomplete regulatory documents, and the retrieval pipeline had no mechanism to verify legal accuracy or flag contradictions with current statutes.[13] The city initially responded by adding a disclaimer: "Please make sure to verify with the appropriate agency."[13] They treated it as a warning label problem. It was an infrastructure problem.
How did a Chevrolet dealership's chatbot agree to sell a $76,000 vehicle for one dollar?
A Chevrolet dealership in Watsonville, California deployed an AI chatbot powered by ChatGPT. A user asked the chatbot to agree to sell a 2024 Chevy Tahoe for one dollar. The chatbot responded: "That's a deal, and that's legally binding!"[14]
The system prompt contained no pricing constraints, no scope limitations, and no escalation rules. The chatbot had been deployed with an open-ended instruction to "be helpful" and no guardrails on what commitments it could make.[14] Within days, users had also gotten it to generate Python code and agree to transactions far outside the dealership's business.
What does the pattern across these failures reveal?
Google's AI Overviews launched in May 2024 and told users to add glue to pizza sauce (to help cheese stick) and that geologists recommend eating one small rock per day for vitamins and minerals.[17] MIT Technology Review's analysis attributed these failures explicitly to the retrieval pipeline: AI Overviews pulled from Reddit joke posts, satirical content, and outdated academic papers because the retrieval system could not distinguish authoritative sources from noise.[31]

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026
Phil Schmid, who authored the most-referenced definition of context engineering, puts it flatly: "Most agent failures are not model failures anymore, they are context failures."[21] Anthropic's engineering blog confirms the same pattern.[25] LinkedIn's Dhyey Mavani argues in VentureBeat that "the limiting factor is no longer the model — it's context."[32]
How does the context problem compound at enterprise scale?
At an individual query level, a context failure produces a wrong answer. At enterprise scale, it produces a systemic erosion of trust, productivity, and decision quality that compounds with every interaction.
What do enterprise users actually experience?
The complaints are consistent across channels. On Reddit's r/LocalLLaMA, the recurring threads are not about model capability. They ask: "How do I make a local LLM retain memory of information?" and "RAG is giving wrong answers — how do you debug retrieval?"[33] On GitHub, developers report that RAG systems "return confident wrong answers for out-of-scope questions" and ask about probability thresholds for abstention.[34] On Hacker News, the concern is shifting to context budget allocation: "How much of my context window is being eaten by MCP server tool definitions vs. actual useful context?"[35]
Enterprise leaders express the same frustration in different vocabulary. CIO.com quotes AI executives: "Prompt engineering cannot deliver the accuracy, memory, or governance required in complex environments on its own."[36] On LinkedIn, Ethan Mollick argues that context engineering "cannot be a solely technical function" — it is business process design in disguise.[37]
How frequently do context failures occur in production systems?
The error rates are higher than most organizations realize. RAND Corporation studied 65 AI projects and found an 80%+ failure rate — roughly double the failure rate of conventional IT projects.[38] Capital One and Forrester surveyed 500 data leaders and found that 73% named data quality as the number-one barrier to AI success, ranking it above model accuracy, integration complexity, and talent gaps.[39]
A 1% hallucination rate sounds manageable in isolation. Multiply it by 1,000 employees asking 10 questions per day, and that is 100 fabricated answers flowing into your organization daily — into customer communications, internal decisions, regulatory filings, and institutional memory. Each wrong answer that goes uncorrected becomes training data for the next wrong answer.
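Stated as a worked calculation (the 250 working days per year is an added assumption):

```python
employees = 1_000
questions_per_day = 10
hallucination_rate = 0.01  # the "manageable" 1%

fabricated_answers_per_day = employees * questions_per_day * hallucination_rate
fabricated_answers_per_year = fabricated_answers_per_day * 250  # working days
```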
What happens if organizations don't fix their context pipelines?
The damage does not plateau. It compounds through a feedback loop that degrades organizational knowledge over time.
What does the data say about AI project abandonment?
The abandonment pattern is accelerating. Gartner surveyed 822 business leaders and found 30% of generative AI projects will be abandoned after proof of concept by end of 2025, citing poor data quality as the primary cause.[28] A separate Gartner survey of 248 organizations found 63% have not progressed past piloting.[28] S&P Global's survey of over 1,000 companies found 42% scrapped most AI initiatives in 2025, up from 17% the prior year.[6]
BCG surveyed 1,250 C-suite executives and found that only 25% are actually realizing significant value from AI. The gap is not model access. Every organization can access GPT-4, Claude, and Gemini. The gap is the infrastructure that gives those models useful context.[4]

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026
How does the context problem create a negative feedback loop?
The compounding mechanism works like this. A poorly curated knowledge base produces inaccurate AI responses. Employees lose trust and stop using the system, or worse, they use it without verifying outputs. Unverified outputs enter documents, emails, and databases. Those documents become the next generation of retrieval sources. The knowledge base degrades further. The AI's answers get worse. The cycle accelerates.
HBR research on "hidden data factories" estimates that knowledge workers spend up to 50% of their time finding, correcting, and working around bad data.[40] When an AI system ingests and amplifies that bad data, it does not just reflect the existing problem. It automates it. What was a human-speed data quality problem becomes a machine-speed data quality problem.
Why isn't a better model the answer?
The conventional response to AI underperformance is to upgrade the model. Larger parameter counts, newer architectures, bigger context windows. The assumption is intuitive: a smarter model will produce smarter outputs. The evidence says otherwise.
In 2023, researchers at Meta published Atlas, an 11-billion-parameter retrieval-augmented model. On the Natural Questions benchmark, Atlas outperformed PaLM, Google's 540-billion-parameter model, by 3 full points — despite being 50 times smaller.[1] In 2022, DeepMind's RETRO, a 7.5-billion-parameter model with retrieval, achieved comparable performance to GPT-3 and Jurassic-1 at 175 billion parameters on the Pile benchmark.[7] The retrieval infrastructure — the context pipeline — overwhelmed a 50× difference in model scale.
The model is not the product. The context pipeline is the product.
The chunking study makes this concrete. Researchers tested four document chunking strategies on the same model, same data, same embedding model. The results ranged from the lowest accuracy with fixed-token chunking to 87% with adaptive chunking, determined entirely by how the documents were split before the model ever saw them.[2] NVIDIA's technical blog corroborates this finding: chunking strategy alone can account for significant accuracy differences.[41]

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026
This is why Karpathy frames the LLM as a CPU and the context window as RAM.[17] You do not fix a slow computer by replacing the processor when the RAM is full of garbage. You fix what the processor is working with.
What does a successful context engineering implementation look like?
Klarna deployed an AI customer service assistant built on the same underlying technology as Air Canada's chatbot — a large language model answering customer questions. The outcomes could not have been more different. Within its first month, Klarna's system handled 2.3 million conversations, performed the equivalent work of 700 full-time agents, and reduced average resolution time from 11 minutes to 2 minutes.[42]
The difference was not the model. The difference was the context pipeline. Klarna built whitelisted knowledge bases that restricted the model's retrieval to verified, current documents. They implemented query safety layers that detected out-of-scope questions before the model could hallucinate a response. They defined scope boundaries that prevented the system from making commitments outside its authority.[42][43]
The Klarna story has an instructive second chapter. In mid-2025, the company began rehiring human agents after initially projecting $40 million in savings from full automation.[44] The reason was not model failure. The human escalation layer is itself a context engineering component — the system needs to know when it lacks sufficient context to respond and route to a human who has it. Removing that layer degraded output quality. The reversal proves the thesis: every layer of the context pipeline matters, including the decision logic for when the AI should not answer at all.
Five Sigma, an insurance technology company, deployed AI for claims processing using SOP-aware configuration that mapped retrieval to specific standard operating procedures. They built multi-agent architectures with explicit scope limits and implemented hallucination evaluators that tested outputs against source documents before delivery. HFS Research independently validated the results: 60% reduction in cycle time, 35% reduction in claims processing costs.[45][46]
Andrew Ng's canonical demonstration makes the mechanism inescapable. Working with a steel manufacturing company, Ng's team applied a data-centric approach — improving the quality and labeling of training data without changing the model. In roughly two weeks, the data-centric approach produced a 16.9-percentage-point improvement in defect detection accuracy. The model-centric approach, run in parallel for months, produced zero improvement.[47] Same model. Different context. Dramatically different results.
What does context engineering failure actually cost?
The financial impact operates at three levels, and organizations typically see only the first.
The direct costs are visible: Air Canada's tribunal payment, Google's $100 billion single-day market cap loss after Bard's launch hallucination went viral,[48] the engineering hours spent debugging retrieval pipelines. These make headlines.
The systemic costs are larger but less visible. Gartner surveyed 154 organizations and found that poor data quality costs an average of $12.9 million per year per enterprise.[9] Fivetran's analysis puts the figure at $406 million annually for large organizations.[10] These costs predate AI, but AI amplifies them. Every document that contains an error, every knowledge base article that is outdated, every policy that conflicts with another policy now generates wrong answers at machine speed instead of human speed.
The compounding costs are the largest and the least visible. The Data Warehousing Institute's 1-10-100 rule — a dollar to prevent bad data, ten to correct it, a hundred when it causes a failure — was updated by Matillion in 2024 to 10-100-1,000 for AI-amplified systems.[17][18] When an AI system ingests bad context, generates wrong outputs, and those outputs enter downstream systems, the correction cost multiplies at every step.
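The rule translates directly into a back-of-the-envelope model. This sketch uses Matillion's AI-era figures; the function and its stage names are illustrative:

```python
def correction_cost(errors, stage_costs=(10, 100, 1_000)):
    """Cost (in dollars) of the same errors handled at three stages:
    prevented at the source, corrected downstream, or allowed to fail."""
    prevent, correct, fail = stage_costs
    return {
        "prevented_at_source": errors * prevent,
        "corrected_downstream": errors * correct,
        "failed_in_production": errors * fail,
    }

# 50 bad source documents: $500 to fix at ingestion,
# $50,000 if their errors surface as production failures.
```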

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026
HBR research by Thomas Redman estimates that fixing data quality at the source eliminates 66–90% of the hidden factory costs that knowledge workers absorb daily.[40] The economics are straightforward: every dollar invested in context quality at the source prevents ten to a hundred dollars in downstream failure.
How do you fix context engineering?
The solution is not a single tool or technique. It is a pipeline — a system of interconnected decisions about what information reaches the model, in what form, at what time, and with what constraints. Phil Schmid's taxonomy identifies seven layers: instructions, user prompt, state and conversation history, long-term memory, retrieved information, available tools, and structured output constraints.[21] Harrison Chase of LangChain distills it to five essential properties: the right information, in the right format, at the right time, complete enough to include everything the model needs, and selective enough to exclude everything it doesn't.[15]
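One way to picture the seven layers is as a single assembly step that renders them into one model input. The dataclass below is a hedged sketch of that idea, not Schmid's or LangChain's API; every field name and label is illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class ContextPackage:
    """Seven context layers assembled into one model input (illustrative)."""
    instructions: str = ""
    user_prompt: str = ""
    history: list = field(default_factory=list)
    long_term_memory: list = field(default_factory=list)
    retrieved: list = field(default_factory=list)
    tools: list = field(default_factory=list)
    output_schema: str = ""

    def render(self):
        # Render each populated layer with a label; drop empty layers so
        # they never consume tokens they have no content to justify.
        parts = [
            f"SYSTEM: {self.instructions}",
            *(f"MEMORY: {m}" for m in self.long_term_memory),
            *(f"HISTORY: {h}" for h in self.history),
            *(f"CONTEXT: {d}" for d in self.retrieved),
            *(f"TOOL: {t}" for t in self.tools),
            f"FORMAT: {self.output_schema}",
            f"USER: {self.user_prompt}",
        ]
        return "\n".join(p for p in parts if not p.endswith(": "))
```

The useful property of making the package explicit is that each layer becomes individually testable and individually budgetable, which is the whole premise of treating context as a system rather than a string.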
The organizations getting this right share a common architecture. They do not treat the AI model as the product. They treat the context pipeline as the product, and the model as a component within it.
At Tricky Wombat, our AI systems are built around this principle. The pipeline, not the model, determines output quality. Here is what that looks like in practice.
1. Source quality and retrieval precision
Most systems ingest documents in bulk and rely on generic chunking and embedding to make them retrievable. This is how Air Canada ended up serving stale bereavement policies and New York City ended up recommending illegal business practices. The retrieval pipeline returns whatever is most similar to the query — not whatever is most accurate, current, or authoritative.
Our pipeline treats source quality as a first-class engineering concern. Documents are assessed for currency, authority, and internal consistency before they enter the knowledge base. Chunking strategies are selected based on document type and query patterns, not applied uniformly. Retrieval is tested against known-good question-answer pairs to measure precision and recall before deployment, not after a user encounters a wrong answer.
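Testing retrieval against known-good question-answer pairs reduces to standard precision and recall over retrieved chunk identifiers. A minimal sketch, with an invented golden set:

```python
def retrieval_scores(retrieved_ids, relevant_ids):
    """Precision and recall of a retriever against a known-good answer set."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Golden set: for this question, chunks 3 and 7 contain the answer.
p, r = retrieval_scores(retrieved_ids=[3, 7, 12, 40], relevant_ids=[3, 7])
# Half of what came back was relevant (precision 0.5);
# everything relevant was found (recall 1.0).
```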
2. Scope constraints and abstention logic
Most systems are configured to always generate an answer. This is how Chevrolet's chatbot agreed to sell a Tahoe for a dollar — it had no mechanism to recognize that a pricing commitment was outside its authority. The default behavior of a language model is to be helpful. Without explicit scope constraints, "helpful" includes being confidently wrong.
Our pipeline includes explicit scope boundaries that define what the system is authorized to answer and, critically, what it is not. When the retrieval pipeline returns low-confidence results or the query falls outside defined boundaries, the system abstains or escalates rather than generating a plausible-sounding fabrication. The abstention threshold is a tunable parameter, not an afterthought.
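A minimal version of that abstention gate might look like the following. The 0.75 threshold and the result format are illustrative; in practice the threshold is tuned against a labeled set of in-scope and out-of-scope queries.

```python
def answer_or_abstain(query_results, threshold=0.75):
    """Route to generation only when retrieval confidence clears a bar.

    query_results: list of (chunk_text, similarity_score) pairs.
    """
    if not query_results or max(score for _, score in query_results) < threshold:
        return {"action": "escalate", "reason": "low retrieval confidence"}
    context = [chunk for chunk, score in query_results if score >= threshold]
    return {"action": "generate", "context": context}

# An out-of-scope question that retrieves nothing confident escalates;
# a question squarely covered by the knowledge base generates.
```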
3. Continuous evaluation and context monitoring
Most systems are evaluated at deployment and then left to drift. Knowledge bases go stale. New documents introduce contradictions. Retrieval quality degrades as the corpus grows. This is the "context rot" that Chroma Research documented across 18 models: performance degrades in non-uniform and unpredictable ways as context accumulates.[22]
Our pipeline runs continuous evaluation against defined quality metrics — retrieval precision, faithfulness scoring, and citation verification. When a source document is updated, the pipeline re-processes affected chunks and re-validates downstream answers. When evaluation metrics degrade, the system flags the specific context layer responsible rather than treating output quality as a black box attributable only to "the AI." The system does not just answer questions. It monitors whether its own answers are getting better or worse over time.
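Flagging the responsible context layer can be as simple as comparing per-layer quality metrics against a deployment baseline. A sketch, with illustrative metric names and scores:

```python
def drifting_layers(baseline, current, tolerance=0.05):
    """Name the context layers whose quality metric degraded past tolerance."""
    return [
        layer for layer, score in current.items()
        if baseline.get(layer, 0.0) - score > tolerance
    ]

baseline = {"retrieval_precision": 0.91, "faithfulness": 0.88, "citation_valid": 0.95}
current  = {"retrieval_precision": 0.79, "faithfulness": 0.87, "citation_valid": 0.94}
# retrieval_precision dropped 0.12: the pipeline names that layer,
# instead of reporting only that "the AI got worse".
flagged = drifting_layers(baseline, current)
```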
The bottom line
Air Canada, New York City, Chevrolet, and Google all deployed capable AI models. None of them failed because the model was not smart enough. They failed because the information environment around the model was stale, incomplete, unscoped, or unverified. Klarna, Five Sigma, and Andrew Ng's steel manufacturing case deployed comparable or smaller models and succeeded because they engineered the context pipeline first.
The emerging discipline of context engineering gives this pattern a name and a practice. Gartner predicts that by 2028, 80% of GenAI business applications will be developed on existing data management platforms, underscoring the centrality of data infrastructure to AI success.[11] HBR frames organizational context as the competitive advantage when every company can access the same models.[49] The question is no longer "which AI model should we use?" It is "do we have the engineering discipline to give any model the right information at the right time?"
The organizations that answer yes will extract value from AI. The organizations that keep upgrading models while ignoring what those models see will keep joining the 95%.
References
- Izacard, G. et al., "Atlas: Few-shot Learning with Retrieval Augmented Language Models," JMLR, 2023. https://jmlr.org/papers/v24/23-0037.html
- Abdelghany, M. et al., "Comparative Evaluation of Advanced Chunking for Retrieval-Augmented Generation in Large Language Models for Clinical Decision Support," Bioengineering, 12(11), 1194, November 2025. https://pmc.ncbi.nlm.nih.gov/articles/PMC12649634/
- McKinsey & Company, "The State of AI: Agents, Innovation, and Transformation," McKinsey Global Survey (n=1,993), 2025. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai
- BCG, "The Widening AI Value Gap: Build for the Future 2025," BCG Global Survey (n=1,250 senior executives and AI decision makers), September 2025. https://www.bcg.com/publications/2025/are-you-generating-value-from-ai-the-widening-gap
- Writer and Workplace Intelligence, "Generative AI Adoption in the Enterprise," Survey Report (n=1,600), March 2025. https://writer.com/blog/enterprise-ai-adoption-survey-press-release/
- S&P Global Market Intelligence, "AI Experiences Rapid Adoption, but with Mixed Outcomes — Highlights from VotE: AI & Machine Learning, Use Cases 2025" (n=1,006), 2025. https://www.spglobal.com/market-intelligence/en/news-insights/research/ai-experiences-rapid-adoption-but-with-mixed-outcomes-highlights-from-vote-ai-machine-learning
- Borgeaud, S. et al., "Improving Language Models by Retrieving from Trillions of Tokens," DeepMind / ICML, 2022. https://arxiv.org/abs/2112.04426
- Anaconda, "State of Data Science 2020," Survey Report (n=2,360), 2020. https://www.anaconda.com/state-of-data-science-2020
- Gartner, "Magic Quadrant for Data Quality Solutions" (survey of n=154 customers of data quality vendors), 2020. https://www.gartner.com/en/documents/3991699
- Fivetran, "The State of Data Quality," 2024. https://www.fivetran.com/reports/data-quality
- Gartner, "Gartner Predicts by 2028, 80% of GenAI Business Apps Will Be Developed on Existing Data Management Platforms," Press Release, June 2025. https://www.gartner.com/en/newsroom/press-releases/2025-06-02-gartner-predicts-by-2028-80-percent-of-genai-business-apps-will-be-developed-on-existing-data-management-platforms
- CBC News, "Air Canada ordered to pay customer who was misled by airline's chatbot," February 2024; CanLII, Moffatt v. Air Canada, 2024 BCCRT 149. https://www.cbc.ca/news/business/air-canada-chatbot-1.7116491
- The Markup, "NYC's AI Chatbot Tells Businesses to Break the Law," March 2024. https://themarkup.org/news/2024/03/29/nycs-ai-chatbot-tells-businesses-to-break-the-law
- Futurism / Gizmodo, "Chevrolet Dealership Chatbot Agrees to Sell Tahoe for $1," December 2023. https://futurism.com/chevrolet-dealer-chatbot-sell-car-1-dollar
- Chase, H., "Context Engineering for Agents," LangChain, 2025. https://blog.langchain.dev/context-engineering/
- Faros AI, "Evaluating Context Engineering for AI Agents," 2025. https://www.faros.ai/blog
-
- Matillion, "The 10-100-1,000 Rule: How AI Amplifies the Cost of Bad Data," 2024. https://www.matillion.com/blog
- Lütke, T. (@tobi), Twitter/X, June 18, 2025. https://twitter.com/tobi/status/1835677375892004864
- García, B., Context Engineering (Manning Publications, expected 2026). https://www.manning.com/
- Schmid, P., "The New Skill in AI is Not Prompting, It's Context Engineering," June 30, 2025. https://www.philschmid.de/context-engineering
- Hong, K., Troynikov, A., and Huber, J., "Context Rot: How Increasing Input Tokens Impacts LLM Performance," Chroma Research, July 14, 2025. https://research.trychroma.com/context-rot
- Databricks Mosaic Research, "Long Context Performance Evaluation," 2025. https://www.databricks.com/research
- Breunig, D., "How Long Contexts Fail," June 2025. https://dbreunig.com/
- Anthropic, "Effective Context Engineering for AI Agents," Anthropic Engineering Blog, September 29, 2025. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- Lopatina, N., "Agentic RAG is the New Baseline," NeurIPS 2025, Contextual AI. https://neurips.cc/
- StackOne, "MCP Context Window Consumption Analysis," 2026. https://www.stackone.com/
- Gartner, "Generative AI Project Abandonment and Data Readiness Survey" (n=822 and n=248), 2025. https://www.gartner.com/en/newsroom
- Databricks Community, "Top 5 RAG Deployment Challenges," Blog Post #67078, 2025. https://community.databricks.com/
- Qodo, "State of AI Code Quality in 2025," Survey Report, 2025. https://www.qodo.ai/reports
- MIT Technology Review, "Google AI Overviews Failure Analysis," 2024. https://www.technologyreview.com/
- Mavani, D., quoted in VentureBeat, "The Limiting Factor is No Longer the Model — It's Context," 2025. https://venturebeat.com/
- Reddit r/LocalLLaMA, recurring threads on RAG stack sharing and retrieval debugging, 2024–2025. https://www.reddit.com/r/LocalLLaMA/
- GitHub Issues, Haystack Framework, Issue #658, "RAG returns confident wrong answers for out-of-scope questions." https://github.com/deepset-ai/haystack/issues
- Hacker News, Item #45418251, "Context budget allocation for MCP tool definitions," 2026. https://news.ycombinator.com/item?id=45418251
- CIO.com, "Moving from AI Pilots to Production-Scale Deployments," Article #4080592, 2025. https://www.cio.com/article/4080592
- Mollick, E., LinkedIn discussion, Activity #7343723250171977729, 2025. https://www.linkedin.com/
- RAND Corporation, "Identifying and Addressing AI Project Failure Patterns" (n=65 qualitative interviews), 2024. https://www.rand.org/
- Capital One and Forrester, "Data Quality as the #1 Barrier to AI Success" (n=500 data leaders), 2025. https://www.forrester.com/
- Redman, T., "Hidden Data Factories and the Cost of Poor Data Quality," Harvard Business Review, 2023. https://hbr.org/
- NVIDIA Technical Blog, "Chunking Strategy Impact on RAG Accuracy," 2025. https://developer.nvidia.com/blog
- Klarna, "Klarna AI Assistant Handles 2.3 Million Conversations in First Month," Press Release, February 2024. https://www.klarna.com/international/press/
- LangChain, "Klarna AI Implementation Case Study," 2024. https://blog.langchain.dev/
- Bloomberg, "Klarna Rehires Human Agents After AI Push," May 2025. https://www.bloomberg.com/
- Google Cloud and Five Sigma, "AI for Claims Processing," Whitepaper, 2024. https://cloud.google.com/
- HFS Research, "Insurance AI Claims Processing Benchmarks," 2024. https://www.hfsresearch.com/
- Ng, A., "A Data-Centric Approach to AI," ScaleUp:AI Conference, 2022; IEEE Spectrum coverage. https://spectrum.ieee.org/andrew-ng-data-centric-ai
- Reuters / CNBC, "Google Loses $100 Billion in Market Cap After Bard Demo Hallucination," February 2023. https://www.reuters.com/
- Murty, R.N. and Kumar, R.S., "When Every Company Can Use the Same AI Models, Context Becomes a Competitive Advantage," Harvard Business Review, February 18, 2026. https://hbr.org/2026/02/when-every-company-can-use-the-same-ai-models-context-becomes-a-competitive-advantage
By Tricky Wombat
Last Updated: Mar 30, 2026