AI Engineering Disciplines
Prompt engineering dominated the AI conversation from 2022 to 2024. It turned out to be one layer in a stack of seven. Between mid-2024 and early 2026, six additional engineering disciplines emerged around AI inference quality, each addressing a different failure mode: context engineering, intent engineering, specification engineering, evaluation engineering, retrieval engineering, and memory engineering. Most teams are still tuning prompts while ignoring the layers that determine whether the model sees the right information, understands what the user actually needs, and can verify its own output. Studies have found that 72% of enterprise search queries fail to return meaningful results on the first attempt. The fix is rarely a better prompt. It is better infrastructure across the full stack.
Andrej Karpathy posted a short endorsement on X in June 2025 that captured a shift years in the making. He described context engineering as "the delicate art and science of filling the context window with just the right information for the next step," and contrasted it with prompt engineering, which he said people associate with "short task descriptions you'd give an LLM in your day-to-day use."[1] Within two weeks, the term had been amplified by Shopify's CEO, formalized by Anthropic, and picked up by Gartner. The speed of adoption was unusual for a technical term. The reason: practitioners had been doing context engineering for over a year without a name for it.
The broader pattern matters more than the naming convention. Between mid-2024 and early 2026, at least six additional engineering disciplines emerged around AI inference quality, each addressing a different failure mode. Prompt engineering, which dominated the conversation from 2022 to 2024, turned out to be one layer in a much deeper stack. The other layers, context engineering, intent engineering, specification engineering, evaluation engineering, retrieval engineering, and memory engineering, collectively determine whether an AI system returns a useful answer or a plausible-sounding wrong one.

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026
For enterprise search and RAG (retrieval-augmented generation, the pattern of feeding retrieved documents to a language model before it generates an answer), this matters directly. Coveo found that 72% of enterprise search queries fail to return meaningful results on the first attempt.[2] Most teams respond by tuning the model or rewriting the prompt. The research says the problem is more often in how context is assembled, how intent is classified, how specifications are enforced, and how outputs are evaluated. Fixing the prompt while ignoring the rest of the stack is like adjusting the seasoning on a dish made with spoiled ingredients.
Prompt engineering was the starting point, not the destination
The term "prompt engineering" dates to 2020 and GPT-3's release, though Sander Schulhoff published the first dedicated guide in October 2022, two months before ChatGPT.[3] By 2023, "prompt" was runner-up for Oxford's Word of the Year and job postings for prompt engineers commanded six-figure salaries.
The discipline evolved through identifiable phases. Few-shot prompting dominated 2020 to 2022. Google Brain introduced chain-of-thought prompting in early 2022. Tree-of-thought and ReAct paradigms followed in 2023. By 2024, prompts were treated as code: versioned, A/B tested, and evaluated systematically.[4]
Current best practices are well-documented. Define the goal explicitly. Specify output format. Set constraints. Provide context data. Establish evaluation criteria. Include fallback instructions.[5] Techniques like system/user prompt separation, structured formatting with XML or markdown, role assignment, and prompt chaining are production defaults in most serious deployments.[6]
For RAG systems specifically, prompts handle three functions: grounding enforcement (instructing the model to answer only from provided context), hallucination control (teaching it to say "I don't know"), and citation requirements (specifying how sources should be referenced).[7] IBM's assessment is direct: RAG systems need precise prompt engineering to locate the right data and make sure the LLM knows what to do with it.[8]
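Those three functions can be sketched in a single prompt template. The function and wording below are illustrative, not IBM's or any vendor's recommended text:

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded RAG prompt (illustrative wording only)."""
    sources = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using ONLY the numbered sources below.\n"      # grounding enforcement
        "If the sources do not contain the answer, reply "
        "exactly: I don't know.\n"                             # hallucination control
        "Cite sources inline as [1], [2], ...\n\n"             # citation requirement
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt("What was Q3 revenue?", ["Q3 revenue was $12.4M."])
```

The three instructions map one-to-one onto the three functions: the first grounds the model, the second gives it a sanctioned way to refuse, and the third makes attribution checkable downstream.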
The limitations became apparent in production. Prompt brittleness means semantically equivalent prompts produce significantly different results. Adding a single word can shift output quality in unpredictable ways. Prompts optimized for one model version break on the next. Stanford's HELM benchmark showed no single prompting strategy consistently outperforms across models.[9]
The most important limitation is architectural. A perfectly worded prompt cannot compensate for missing context, absent tools, or incomplete data. If the retrieval system pulls the wrong documents, the prompt cannot fix that. If the user's intent is misunderstood, a better prompt template will not help. This is the ceiling that created demand for the rest of the stack.
Context engineering is the discipline that subsumes prompting
Harrison Chase, CEO of LangChain, offered the clearest framing of the relationship: prompt engineering is a subset of context engineering.[10] Where prompt engineering focuses on the instruction text sent to a model, context engineering governs everything the model sees, from system prompts to retrieved documents, tool definitions, conversation history, and structured metadata.
The term crystallized through a rapid cascade. Walden Yan of Cognition AI (the team behind Devin) is widely credited with sparking the shift in a mid-June 2025 blog post titled "Don't Build Multi-Agents." He described context engineering as "doing this [assembling the right context] automatically in a dynamic system" and called it "the #1 job of engineers building AI agents."[11] Shopify CEO Tobi Lutke posted his definition on June 18, 2025: "The art of providing all the context for the task to be plausibly solvable by the LLM." That post received 2.3 million views.[12]
Anthropic formalized the concept in a September 2025 engineering blog post, defining it as "the set of strategies for curating and maintaining the optimal set of tokens during LLM inference."[13] Gartner followed in October 2025, recommending that organizations appoint a context engineering lead or team.[14]
Karpathy's mental model is the most instructive one for understanding the relationship. Think of LLMs as CPUs and context windows as RAM. Context engineering is the operating system that loads exactly the right code and data into working memory for each task.[1]
The discipline encompasses ten components: system instructions, user input, conversation history, long-term memory, retrieved knowledge (RAG), tool definitions, tool results, structured output schemas, global state or scratchpads, and metadata.[15] Four strategies govern these components. Write: persist information externally for later retrieval. Select: retrieve relevant context through RAG and search. Compress: retain only essential tokens through summarization. Isolate: use sub-agents with focused context to avoid overloading a single window.[13]

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026
One finding from Anthropic's research deserves special attention: context rot. As token count increases, every model's accuracy at recalling information from its context declines, an effect tied to the quadratic (n²) scaling of attention over sequence length. Dex Horthy's "12 Factor Agents" framework identified a threshold where performance degrades when more than 40% of the context window is consumed.[13] The implication is counterintuitive. Context engineering is not primarily about getting more information into the window. It is about ruthlessly curating the smallest possible set of high-signal tokens.
For enterprise search, this reframes the problem. The default approach is to retrieve as many potentially relevant documents as possible and stuff them into the context. The research says this degrades performance. Better to retrieve fewer, more precisely matched documents and compress them to their essential claims.
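Under those assumptions, the "select" strategy with a hard budget is a small amount of code. The 40% fill ratio and whitespace word counts below are stand-ins; a production system would use a real tokenizer and a tuned threshold:

```python
def assemble_context(chunks, relevance, window_tokens=8000, fill_ratio=0.40):
    """Greedy 'select' step: take chunks in descending relevance until the
    budget (fill_ratio of the context window) is spent. Whitespace word
    counts approximate tokens for this sketch."""
    budget = int(window_tokens * fill_ratio)
    chosen, used = [], 0
    for chunk in sorted(chunks, key=relevance, reverse=True):
        cost = len(chunk.split())
        if used + cost <= budget:
            chosen.append(chunk)
            used += cost
    return chosen
```

The design choice worth noting: the loop stops adding marginal chunks well before the window is physically full, which is exactly the inversion the research recommends.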
LlamaIndex's guide on context engineering for agents identifies the shift happening in RAG architectures. Traditional RAG treated retrieval as a fixed pipeline: query comes in, documents come out, model generates. Modern context engineering treats retrieval as one of many dynamic context sources, alongside tool calls, memory lookups, user preferences, and structured knowledge graphs.[16] Theory Ventures coined the term "Context Platform" to describe this evolution from isolated retrieval to full context assembly infrastructure.[17]
Intent engineering addresses the gap between what users ask and what they need
A user types "Q3 revenue" into an enterprise search tool. The system retrieves the Q3 revenue figure. The user actually wanted a comparison with the previous quarter's forecast. This gap between the literal query and the underlying goal is the problem intent engineering addresses.
The discipline operates at two levels, both relevant to enterprise AI. In its query-understanding sense, rooted in information retrieval since Andrei Broder's 2002 taxonomy, intent engineering classifies queries into categories (navigational, informational, transactional) and routes them to appropriate retrieval strategies. Profound's study of over 50 million ChatGPT prompts revealed how dramatically these patterns are shifting. Generative intent now accounts for 37.5% of queries. Navigational intent collapsed from 32% to 2%. Transactional intent jumped 9x.[18] Traditional search intent models built for Google-style queries are inadequate for AI-powered systems.
In its agent-architecture sense, formalized in late 2025 and early 2026, intent engineering makes organizational purpose (goals, values, tradeoffs, decision boundaries) machine-readable. Pawel Huryn published the first detailed framework in January 2026 in Product Compass, with seven components: Objective, Desired Outcomes, Health Metrics, Strategic Context, Constraints, Decision Types/Autonomy, and Stop Rules.[19]
The progression is crisp: prompt engineering told AI what to do. Context engineering tells AI what to know. Intent engineering tells AI what to want.[20]
The Klarna case illustrates why the distinction matters. Klarna's AI assistant handled 2.3 million conversations in its first month, cut resolution times from 11 to 2 minutes, and reportedly did the work of 853 full-time employees. The CEO later acknowledged the strategy had backfired and began re-hiring humans.[20] The AI optimized for speed because speed was measurable. Customer trust and retention were not encoded in its objectives. The system had excellent prompts and adequate context. It lacked intent specification.
For enterprise search, a January 2026 analysis by Cary Huang details an "Intent-First Architecture" for RAG: add a lightweight LLM as intent classifier before retrieval to determine primary intent category, sub-intents, which backend sources to query, and confidence level.[21] The approach addresses a structural problem: standard RAG treats every query the same way, running identical retrieval logic regardless of whether the user wants a specific fact, a comparative analysis, or a creative synthesis.
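A minimal sketch of that intent-first gate, with keyword rules standing in for the lightweight LLM classifier Huang describes. The categories, field names, and backend labels are assumptions for illustration, not his schema:

```python
from dataclasses import dataclass, field

@dataclass
class IntentResult:
    primary: str                   # assumed categories: factual | analytical | generative
    sub_intents: list = field(default_factory=list)
    sources: list = field(default_factory=list)    # which backends to query
    confidence: float = 0.0

def classify_intent(query: str) -> IntentResult:
    """Stand-in for a lightweight LLM classifier: keyword rules are enough
    to show the routing shape, not to ship."""
    q = query.lower()
    if any(w in q for w in ("compare", "versus", " vs ", "trend")):
        return IntentResult("analytical", ["comparison"], ["docs", "warehouse"], 0.7)
    if any(w in q for w in ("draft", "write", "summarize")):
        return IntentResult("generative", ["synthesis"], ["docs", "wiki"], 0.7)
    return IntentResult("factual", [], ["warehouse"], 0.8)
```

Downstream retrieval then branches on `primary` and `sources` instead of running one identical pipeline for every query.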
A maturity caveat is warranted. Intent engineering is the least mature of these disciplines. No established benchmarks exist for measuring its effectiveness. Tericsoft, one of its proponents, explicitly describes performance claims as "forward-looking projections rather than confirmed results."[22] The term carries two distinct meanings that different authors conflate. But the underlying problem, the gap between what users ask and what they need, is universally recognized as critical to enterprise search quality.
Specification engineering defines "good" before the model generates
Specification engineering is the discipline of defining what AI systems should produce and how to verify they produced it correctly. A landmark December 2024 paper by Ion Stoica, Matei Zaharia, and nine co-authors from UC Berkeley and Stanford formalized the concept.[23] The paper draws a distinction between statement specifications (what a task should accomplish) and solution specifications (how to verify outputs are correct), and identifies five properties specifications enable: verifiability, debuggability, modularity, reusability, and automatic decision-making.
The tooling ecosystem followed. GitHub released Spec Kit in September 2025, establishing a four-phase workflow: Specify, Plan, Tasks, Implement. AWS launched Kiro in mid-2025, an agentic IDE whose first interaction asks users whether they want to start with specs or prompts.[24] Guardrails AI provides an open-source framework using RAIL specifications to define structure, type, and quality guarantees for LLM outputs, with over 60 validators that automatically trigger corrective re-generation when specs are not met.[25]
Birgitta Bockeler of Thoughtworks identified three maturity levels: spec-first (write specifications before any code), spec-anchored (specs guide but don't fully constrain), and spec-as-source (specifications become the canonical truth, replacing code). She also noted that the term is not well-defined yet and is already semantically diffused.[26]

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026
Specification engineering spans both pre-generation and post-generation in a pipeline. Pre-generation, it defines output formats, constraints, success criteria, and behavioral boundaries. Post-generation, it validates outputs against those definitions, enforces structured formats, runs guardrails, and triggers corrective loops when criteria are not met. This dual positioning makes it the bridge between intent (what you want) and evaluation (whether you got it).
For enterprise search and RAG, specifications drive output format requirements, source attribution rules, accuracy thresholds, completeness criteria, compliance constraints like HIPAA or GDPR, freshness requirements, and quality gates. Without specifications, there is no principled way to determine whether a given answer is good enough.
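A toy version of such a quality gate: a declarative spec checked against each answer, returning the violated rules so the pipeline can re-generate. The rule names and limits are invented for illustration, not drawn from any framework:

```python
import re

ANSWER_SPEC = {
    "max_words": 120,               # completeness / length bound (invented limit)
    "require_citation": True,       # source attribution rule
    "citation_pattern": r"\[\d+\]",
}

def check_spec(answer: str, spec=ANSWER_SPEC) -> list:
    """Post-generation gate: return the violated rules; an empty list passes."""
    violations = []
    if len(answer.split()) > spec["max_words"]:
        violations.append("too_long")
    if spec["require_citation"] and not re.search(spec["citation_pattern"], answer):
        violations.append("missing_citation")
    return violations
```

Because the spec is data rather than prose, the same gate can load different rules per query type: a compliance-sensitive corpus gets stricter rules than an internal wiki.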
Evaluation engineering is the connective tissue
Evaluation engineering is the discipline of designing, implementing, and managing evaluation processes for generative AI systems. Galileo AI, which formalized the term, defines the goal: measure whether your LLMs, RAG pipelines, and fine-tuned models actually work, and keep working as they scale.[27]
The distinction from traditional software testing is fundamental. Traditional testing checks if code does what it was told to do. Evaluation engineering asks whether the AI output is good enough, where "good enough" means accurate, safe, relevant, non-toxic, and on-brand.[27] There is no binary pass/fail. Every output exists on a spectrum, and the evaluation system must define where on that spectrum is acceptable for each use case.
The field matured rapidly. LLM-as-a-Judge, the practice of using one language model to evaluate another's output, was first tested after GPT-4's release in March 2023. The RAGAS framework for RAG evaluation launched in September 2023.[28] By 2025, evaluation-driven development (EDD) became a recognized methodology, the AI equivalent of test-driven development, with dedicated courses, tools, and emerging job titles.[29]
RAGAS introduced something that should have existed from the beginning: component-level evaluation for RAG. The framework separately assesses retrieval quality (context precision, context recall, noise sensitivity) and generation quality (faithfulness, answer relevancy, factual correctness, groundedness).[28] This decomposition matters because a RAG system can fail at retrieval, at generation, or at both, and treating these as a single pipeline obscures the root cause. If retrieval is pulling the wrong documents, no amount of generation tuning will fix the output. If retrieval is good but the model hallucinates, the generation layer needs attention.
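The decomposition can be made concrete with two toy metrics, one per side of the pipeline. RAGAS computes these with LLM-extracted claims and judgments; exact set membership stands in here:

```python
def context_precision(retrieved, relevant):
    """Retrieval side: fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(c in relevant for c in retrieved) / len(retrieved)

def faithfulness(answer_claims, context):
    """Generation side: fraction of answer claims supported by retrieved context."""
    if not answer_claims:
        return 1.0
    return sum(c in context for c in answer_claims) / len(answer_claims)
```

Tracking the two numbers separately is the whole point: low `context_precision` with high `faithfulness` means fix retrieval; the reverse means fix generation.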
LLM-as-a-Judge has become the dominant automated evaluation method. Sophisticated judge models achieve 80-90% agreement with human evaluators, comparable to the typical human-to-human agreement rate of 81%.[30] Cost savings over human review run from 500x to 5,000x at $0.01 to $0.10 per assessment.[31]

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026
Evaluation engineering is unique in spanning all three pipeline positions. Pre-deployment, it runs benchmark tests and CI/CD gates. At runtime, it scores outputs in real time with LLM-as-a-Judge and safety classifiers. Post-processing, it monitors production quality, detects drift, and collects user feedback. This ubiquity makes it the connective tissue of the entire AI engineering stack. Every other discipline, including context, retrieval, intent, and specification, requires its own evaluation methodology to improve systematically.
The feedback loop is the mechanism that ties everything together. Anaconda demonstrated that feeding evaluation results directly into LLM prompts for self-modification achieved 100% success on specific benchmark tasks.[29] Databricks formalized evaluation-driven development workflows using custom LLM judges within their Mosaic AI Agent Framework.[32]
Retrieval and memory engineering complete the stack for enterprise search
Beyond the five core disciplines, two additional engineering practices are particularly relevant to enterprise search.
Retrieval engineering governs the mechanics of document ingestion, chunking strategies, embedding model selection, hybrid search, and reranking. RAG adoption has reached 51% among enterprises according to Menlo Ventures' 2024 AI survey.[33] The market is projected at $9.86 billion by 2030.
The technical findings that matter most for practitioners: recursive or hierarchical chunking achieves 69% accuracy in benchmarks and is the production default.[34] Semantic chunking offers up to 9% recall improvement but at 10x the cost, making it economically viable only for high-value corpora. Reranking delivers 10-30% precision improvement at 50-100ms latency.[35] The "retrieve wide, rerank narrow" pattern is now standard in production deployments. Hybrid search combining vector similarity and keyword matching is essential because pure vector search misses exact token matches that keyword search handles (like product codes, acronyms, and proper names).
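The "retrieve wide, rerank narrow" pattern with hybrid scoring reduces to a two-stage sketch. Here `vector_score` stands in for embedding similarity, and the linear blend stands in for a cross-encoder reranker, which is what production systems typically use:

```python
def hybrid_retrieve(query, docs, vector_score, wide_k=30, narrow_k=3, alpha=0.5):
    """'Retrieve wide, rerank narrow': stage 1 unions a keyword pass and a
    vector pass into a wide pool; stage 2 reranks the pool with a blended
    score and keeps only the top few."""
    terms = set(query.lower().split())

    def kw(doc):  # fraction of query terms present in the document
        return len(terms & set(doc.lower().split())) / max(len(terms), 1)

    wide = sorted(docs, key=kw, reverse=True)[:wide_k]
    wide += [d for d in sorted(docs, key=vector_score, reverse=True)[:wide_k]
             if d not in wide]

    def blended(doc):
        return alpha * kw(doc) + (1 - alpha) * vector_score(doc)

    return sorted(wide, key=blended, reverse=True)[:narrow_k]
```

The keyword pass is what rescues exact-token queries (product codes, acronyms) that pure vector similarity misses.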
Memory engineering manages short-term, long-term, episodic, semantic, and procedural memory across agent sessions. A December 2025 survey titled "Memory in the Age of AI Agents" proposed a taxonomy by form (token-level, parametric, structured), function (factual, experiential, working), and dynamics (formation, evolution, retrieval).[36] Memory engineering is complementary to retrieval engineering: memory stores learned preferences and interaction history, while retrieval accesses document stores. For enterprise search, memory means the system learns from previous queries by the same user or team, improving result quality over time without manual tuning.
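A minimal illustration of the short-term/long-term split described above, not a production memory design; the class and method names are invented:

```python
from collections import defaultdict

class SessionMemory:
    """Working memory (recent turns in this session) vs long-term memory
    (per-user facts that persist across sessions)."""
    def __init__(self, keep_last=10):
        self.keep_last = keep_last
        self.working = []                    # short-term: recent turns
        self.long_term = defaultdict(list)   # persists across sessions

    def remember_turn(self, text):
        # Compress working memory by dropping the oldest turns.
        self.working = (self.working + [text])[-self.keep_last:]

    def learn(self, user, fact):
        self.long_term[user].append(fact)    # e.g. a learned preference

    def recall(self, user):
        return self.long_term[user] + self.working
```

The division of labor mirrors the retrieval/memory split in the text: `recall` feeds context assembly with preferences and history, while document retrieval remains a separate path.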
How these disciplines map to an enterprise AI search pipeline
No single widely-adopted framework maps all these disciplines into a unified stack yet, but the architecture is converging.[37] CIO Magazine documented the progression in October 2025: early generative AI was stateless, handling isolated interactions where prompt engineering was sufficient. Autonomous agents persist across interactions, make sequential decisions, and operate with varying human oversight.[38]
Here is how the disciplines layer in practice. When a query arrives, intent engineering classifies it before any retrieval happens. A factual query ("what was our Q3 revenue?") routes to a structured data source with high-precision retrieval. An analytical query ("how does our Q3 revenue compare to forecasted growth?") routes to a broader document set with different chunking and reranking. A generative query ("draft talking points for our board meeting on Q3 results") pulls from multiple sources and applies different output specifications.
Context engineering then assembles the input. It draws from retrieval engineering (pulling the right documents), memory engineering (loading relevant session history and user preferences), and prompt engineering (structuring the final LLM input). The context is compressed to stay under the 40% window-utilization threshold identified earlier as a performance boundary.
Specification engineering defines the output requirements. For the factual query, the spec requires a single number with a cited source. For the analytical query, it requires comparison tables with variance calculations and citations for each data point. For the generative query, it requires an appropriate tone and structure with source attributions.
After generation, evaluation engineering scores the output against those specifications. If the faithfulness score falls below threshold, the system triggers re-generation. If source attribution is missing, the guardrails framework rejects the output and sends it back through the pipeline with additional constraints. Over time, these evaluation results feed back into every upstream component. Retrieval tuning improves because evaluation identifies which documents produce high-quality answers. Prompt refinement improves because evaluation tracks which instruction patterns reduce hallucination.
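The walkthrough above reduces to a pipeline skeleton with a bounded retry loop. Every callable here is an injected stand-in for the corresponding discipline, and the constraint text is invented for illustration:

```python
def answer_with_retries(query, classify, retrieve, generate, evaluate,
                        max_attempts=2, threshold=0.8):
    """Intent -> retrieval -> generation -> evaluation, re-generating under
    tightened constraints when the score misses the threshold."""
    intent = classify(query)                  # intent engineering
    context = retrieve(query, intent)         # retrieval + context engineering
    constraints = []
    for _ in range(max_attempts):
        answer = generate(query, context, constraints)   # prompt engineering
        score = evaluate(answer, context)                # evaluation engineering
        if score >= threshold:
            break
        constraints.append("cite every claim")  # tighten the spec and retry
    return answer, score
```

The bounded loop matters: re-generation is a cost, so the spec is tightened a fixed number of times before the system escalates or returns its best attempt.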

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026
The enterprise adoption data maps to this progression. McKinsey's 2025 State of AI survey found 88% of respondents report regular AI use, but only one-third are scaling enterprise-wide.[39] High performers distinguished themselves through systematic evaluation processes and defined human validation workflows.[39] The gap between experimentation and production is the gap between using one discipline (prompting) and orchestrating all seven.
The Economist's framework reveals the cost of missing disciplines
The Economist's February 2025 column on OpenAI's Deep Research identified three structural weaknesses in AI research tools. Each one maps to a gap in the discipline stack.
The first weakness is data creativity. The Economist's columnist found that the model could answer straightforward statistical questions but failed on queries requiring creative data sourcing. It wrongly estimated American household whisky spending despite the data being readily available in Bureau of Labor Statistics tables. The model consulted common sources rather than the right sources. This is a context engineering and retrieval engineering failure. When context assembly retrieves only the most commonly cited sources, when it defaults to popular databases rather than domain-specific repositories, the output reflects surface-level data. Better retrieval engineering (source diversity in chunking and indexing) and better context engineering (deliberately including non-obvious sources) would address this.
The second weakness is the tyranny of the majority. Deep Research's training on massive web corpora creates a popularity bias. The model defaults to consensus views even when specialist research contradicts them. The Economist's example: unless prompted otherwise, the model assumes American income inequality has soared since the 1960s (the conventional wisdom) rather than remained relatively flat (the view of many domain experts). It knows of Emma Rothschild's 1994 Harvard paper demolishing the popular interpretation of Adam Smith's "invisible hand," but repeats the popular misconception anyway. This is an intent engineering, context engineering, and evaluation engineering failure. Intent classification could identify queries requiring specialist depth and route them to domain-specific knowledge bases. Context engineering could deliberately include contrarian and specialist sources. Evaluation engineering could test for viewpoint diversity and nuance, catching consensus-biased outputs before they reach users.
The third weakness is intellectual shortcuts. The column quotes Paul Graham: "Writing is thinking. In fact there's a kind of thinking that can only be done by writing." The same applies to research. The risk of outsourcing all research to a tool is reducing opportunities for original insight. This is an intent engineering and specification engineering challenge. Intent classification that identifies when a query requires original analysis (rather than synthesis) could prompt the system to present evidence and frameworks rather than conclusions. Specification engineering could define where human judgment must remain in the loop, preventing full automation of tasks that benefit from human reasoning.

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026
The deeper point is that these three problems are not bugs in any single model. They are architectural failures that emerge when AI systems lack infrastructure for assembling diverse context, understanding nuanced intent, enforcing quality specifications, and evaluating outputs against those specifications. The problems are systemic, so the solutions must be systemic too.
Adnan Masood, a practitioner, published a rebuttal to The Economist arguing that the limitations identified are addressable through proper system architecture.[40] His argument supports the discipline stack view: the problem is not model capability but the engineering infrastructure surrounding the model.
Walden Yan's observation from the Cognition AI blog applies broadly: "Most agent failures are context failures, not model failures."[11] Extend that across the full discipline stack and you get a more complete statement: most answer quality failures trace to inadequate intent classification, insufficient context assembly, missing specifications, or absent evaluation loops. The model is rarely the bottleneck.
What this means for enterprise search evaluation and adoption
Three findings from this research have direct implications for anyone evaluating or building enterprise AI search.
First, intent-first architecture addresses the 72% first-attempt failure rate. Adding an intent classification step before retrieval, even a lightweight one, routes queries to appropriate retrieval strategies and knowledge sources. A factual lookup requires different retrieval than a comparative analysis. A system that treats every query identically will fail on the majority of real-world queries that fall outside simple lookup.
Second, context engineering's 40% threshold inverts the common assumption. The instinct is to retrieve more documents and provide more context. The research says the opposite: performance degrades as context windows fill. The problem is not getting enough information to the model. The problem is curating the smallest, most relevant set. This means retrieval precision (finding the right 3 documents) matters more than retrieval recall (finding all 30 potentially relevant documents).
Third, evaluation engineering as continuous monitoring transforms enterprise search from a static deployment to an improving system. Every query-answer pair, when evaluated, feeds back into retrieval tuning, prompt refinement, and context optimization. Without this loop, the system degrades over time as the document corpus changes and user patterns shift. With it, quality trends upward.
An enterprise search tool that automates the orchestration of these disciplines, classifying intent before retrieval, assembling context from diverse sources while staying within token budgets, enforcing output specifications, and running continuous evaluation, addresses the structural limitations that The Economist identified. Tricky Wombat's architecture, which combines proprietary data connection, guided questioning (a form of intent refinement), RAG-based retrieval, and feedback loops, maps to a simplified version of this stack.[41] The feedback loop described on their site, where questions and answers strengthen the system over time, is evaluation engineering in action, even if the terminology differs.
The companies that will win in enterprise AI search are not those with access to the best frontier models. Frontier models are increasingly commoditized. The differentiator is how effectively a tool automates the orchestration of intent understanding, context assembly, specification enforcement, and evaluation feedback. The discipline stack itself is becoming the product. And the teams that understand this, that stop thinking of "better prompts" as the path to better answers and start thinking in terms of the full engineering infrastructure, will build systems that produce answers worth trusting.
The gap between AI experimentation and production is the gap between one discipline and seven. McKinsey's 2025 data confirms it: 88% of companies use AI regularly, but only a third are scaling enterprise-wide. The bottleneck is not the model. It is the orchestration layer, intent classification before retrieval, context assembly under token budgets, specification enforcement, and evaluation feedback loops. Tricky Wombat's architecture maps to this stack. A 30-minute call will identify which disciplines your pipeline is missing and what it takes to close the gaps. Schedule a call.
References
- Karpathy, A. X post, June 2025. https://x.com/karpathy/status/1937902205765607626
- Coveo, enterprise search benchmark study. https://www.coveo.com
- Isobe, Y. "The Evolving Art and Science of Prompt Engineering." Medium. https://medium.com/@yujiisobe/the-evolving-art-and-science-of-prompt-engineering-a-chronological-journey-948c0a5a96f9
- Schulhoff, S. et al. "Prompt engineering." Wikipedia. https://en.wikipedia.org/wiki/Prompt_engineering
- DigitalOcean. "Prompt Engineering Best Practices." https://www.digitalocean.com/resources/articles/prompt-engineering-best-practices
- Promptbuilder. "Prompt Engineering Best Practices 2025." https://promptbuilder.cc/blog/prompt-engineering-best-practices-2025
- Stack AI. "Prompt Engineering for RAG Pipelines." https://www.stackai.com/blog/prompt-engineering-for-rag-pipelines-the-complete-guide-to-prompt-engineering-for-retrieval-augmented-generation
- IBM. "RAG vs Fine-tuning vs Prompt Engineering." https://www.ibm.com/think/topics/rag-vs-fine-tuning-vs-prompt-engineering
- CodeSignal. "Prompt Engineering Best Practices 2025." https://codesignal.com/blog/prompt-engineering-best-practices-2025/
- LangChain Blog. "The Rise of Context Engineering." https://blog.langchain.com/the-rise-of-context-engineering/
- Yan, W. "Don't Build Multi-Agents." Cognition AI Blog, June 2025. https://cognition.ai/blog/dont-build-multi-agents
- Substack. "Bye Prompts, Hello Context." https://dcthemedian.substack.com/p/bye-prompts-hello-context-context
- Anthropic. "Effective Context Engineering for AI Agents." September 2025. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- Gartner. "Context Engineering: Why It's Replacing Prompt Engineering." October 2025. https://www.gartner.com/en/articles/context-engineering
- Ruan, J.T. "Context Engineering in LLM-Based Agents." Medium. https://jtanruan.medium.com/context-engineering-in-llm-based-agents-d670d6b439bc
- LlamaIndex. "Context Engineering Guide: Techniques for AI Agents." https://www.llamaindex.ai/blog/context-engineering-what-it-is-and-techniques-to-consider
- Towards Data Science. "Is RAG Dead? The Rise of Context Engineering." https://towardsdatascience.com/beyond-rag/
- Profound. "AI Search Intent Study: What 50M+ ChatGPT Prompts Reveal." https://www.tryprofound.com/blog/chatgpt-intent-landmark-study
- Huryn, P. "The Intent Engineering Framework for AI Agents." Product Compass, January 2026. https://www.productcompass.pm/p/intent-engineering-framework-for-ai-agents
- Nate's Newsletter, Substack. https://natesnewsletter.substack.com/p/klarna-saved-60-million-and-broke
- Huang, C. "Why Intent-First Architecture Fixes Conversational AI's Broken RAG Pattern." Techbuddies, January 2026. https://www.techbuddies.io/2026/01/28/why-intent-first-architecture-fixes-conversational-ais-broken-rag-pattern/
- Tericsoft. "Intent Engineering in AI: The Shift Beyond Context Engineering." https://www.tericsoft.com/blogs/intent-engineering
- Stoica, I. et al. "Specifications: The Missing Link to Making the Development of LLM Systems an Engineering Discipline." arXiv:2412.05299, December 2024. https://arxiv.org/abs/2412.05299
- GitHub Blog. "Spec-Driven Development with AI." https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/
- Devoteam. "AI Guardrails: Building a Foundation of Trust and Safety." https://www.devoteam.com/expert-view/ai-guardrails/
- Bockeler, B. "Understanding Spec-Driven Development." Martin Fowler's Blog. https://martinfowler.com/articles/exploring-gen-ai/sdd-3-tools.html
- Galileo AI. "What is Evals Engineering?" https://galileo.ai/blog/what-is-evals-engineering
- Es, S. et al. "RAGAS: Automated Evaluation of Retrieval Augmented Generation." arXiv:2309.15217, September 2023. https://arxiv.org/abs/2309.15217
- Anaconda. "Introducing Evaluations Driven Development." https://www.anaconda.com/blog/introducing-evaluations-driven-development
- Label Your Data. "LLM-as-a-Judge: Practical Guide to Automated Model Evaluation." https://labelyourdata.com/articles/llm-as-a-judge
- Monte Carlo. "LLM-As-Judge: 7 Best Practices." https://www.montecarlodata.com/blog-llm-as-judge/
- Databricks. "Evaluation-Driven Development Workflows." Data+AI Summit 2025. https://www.databricks.com/dataaisummit/session/evaluation-driven-development-workflows-best-practices-and-real-world
- Databricks. "What is Retrieval Augmented Generation (RAG)?" https://www.databricks.com/glossary/retrieval-augmented-generation-rag
- Firecrawl. "Best Chunking Strategies for RAG in 2025." https://www.firecrawl.dev/blog/best-chunking-strategies-rag
- Neo4j. "Advanced RAG Techniques for High-Performance LLM Applications." https://neo4j.com/blog/genai/advanced-rag-techniques/
- "Memory in the Age of AI Agents." arXiv:2512.13564, December 2025. https://arxiv.org/abs/2512.13564
- Atkinson, S. "Context Engineering Is Just One Piece." Medium. https://medium.com/@asatkinson/context-engineering-is-just-one-piece-the-evolution-to-ai-systems-1f2338e55368
- CIO. "Context Engineering: Improving AI by Moving Beyond the Prompt." October 2025. https://www.cio.com/article/4080592/context-engineering-improving-ai-by-moving-beyond-the-prompt.html
- McKinsey. "The State of AI in 2025." McKinsey Global Survey. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai
- Masood, A. "Deep Research, Shallow Critique." Medium. https://medium.com/@adnanmasood/deep-research-shallow-critique-a-practitioners-rebuttal-to-the-economist-6d5c005481c1
- Tricky Wombat. Product website. https://www.trickywombat.ai/