What exactly IS prompt engineering?

Prompt engineering is where AI results start, not where they peak

The same prompt, sent to the same model, produces different outputs depending on what information the model can see. That single fact explains more about AI performance than any guide to writing better prompts.

Prompt engineering is a real and necessary skill that produces measurable gains on well-scoped tasks. But sharper phrasing of instructions is only one task in a much wider discipline, one that comprises defining the right problem, decomposing it into the right questions, and building the information infrastructure that gives the model what it needs to answer.

That wider discipline is the current state of the art, not prompt engineering on its own.

Key Points

  • Worldwide AI spending will reach $2.52 trillion in 2026, but only 6% of organizations qualify as AI high performers with more than 5% of EBIT from AI.[1][2]

Lessons Learned

  • Treat prompt engineering as one component of a larger system, not the system itself. It is the question layer. The answer depends on the information layer.

What is prompt engineering, and why does it matter now?

Prompt engineering is the practice of designing, structuring, and refining the inputs you give to a large language model to get useful outputs. You write a question, instruction, or set of constraints. The model generates a response based on that input and whatever context it has access to. The skill is real. The difference between a vague prompt and a well-structured one shows up immediately in output quality, consistency, and relevance.
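
The gap between a vague prompt and a structured one is easy to demonstrate. The sketch below is illustrative only: the `build_prompt` helper and its fields are not from any particular library, just one way to make role, task, constraints, and output format explicit instead of typing a bare one-liner.

```python
def build_prompt(role, task, constraints, output_format):
    """Assemble a structured prompt from explicit parts.

    Compare with a vague one-liner like "summarize this" -- the
    structured version pins down scope, limits, and output shape.
    """
    lines = [f"You are {role}.", f"Task: {task}", "Constraints:"]
    lines += [f"- {c}" for c in constraints]
    lines.append(f"Respond as {output_format}.")
    return "\n".join(lines)


vague = "Summarize this report."
structured = build_prompt(
    role="a financial analyst",
    task="summarize the attached Q2 report for a non-technical board",
    constraints=["maximum 120 words", "flag any figure that changed >10% QoQ"],
    output_format="three bullet points followed by a one-line risk note",
)
```

The structured version costs a few extra seconds to write and removes most of the ambiguity the model would otherwise fill in on its own.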

The term entered mainstream business vocabulary around 2023, when generative AI tools became widely accessible. By 2025, the prompt engineering market reached $505.43 million, with projections of $6.7 billion by 2034.[8] Organizations hired prompt engineers, built prompt libraries, and developed internal training programs. The investment was rational. On well-scoped tasks with clean inputs, prompt engineering works.

But the conversation has already moved. In June 2025, the CEO of Shopify posted publicly that "the most important skill in working with AI is not prompt engineering, it's context engineering." Within days, a former director of AI at Tesla endorsed the framing, drawing an analogy between prompt engineering and a single CPU instruction versus context engineering as managing the full RAM environment.[11] By September 2025, Anthropic's applied AI team published a framework stating that "building with language models is becoming less about finding the right words" and more about building the information systems that feed those words their meaning.[12] The distinction matters because it changes where you invest.

Two images showing the evolution from Morse code to a holographic workstation
Prompt engineering has moved from one-off text tweaks to a full technology stack

Credit: Tricky Wombat, made with Google Gemini 3.1 Flash Image, Mar 2026

What does prompt engineering look like in practice?

Three organizations illustrate the range of what happens when prompt engineering meets real operational conditions. Each invested in prompts. Each got results. The differences in their outcomes trace back to what sat behind those prompts.

Duolingo: when prompts work because the pipeline works

Duolingo produced 148 new language courses in a single year using AI, compared to roughly 100 courses produced over the company's first 12 years of operation. Daily active users grew 51%. Subscribers grew 37%. Q2 2025 revenue reached $252.3 million.[13]

This looks like a prompt engineering success story. It is also a context engineering story in disguise. Duolingo's AI engine, Birdbrain, processes 1.25 billion exercises daily. Every prompt the system generates draws from a structured pipeline of learner performance data, CEFR curriculum frameworks, and quality review workflows. The prompts were the visible surface. The data infrastructure underneath determined whether those prompts produced a useful Japanese honorific exercise or a grammatically correct sentence that no native speaker would ever say. When Duolingo expanded into languages with complex politeness systems, the prompt layer alone could not handle the contextual nuance. The curriculum structure and review feedback loops did the work.

LinkedIn: same prompts, different information, different results

LinkedIn's customer service engineering team ran an experiment that isolates the variable this article argues matters most. They took a standard retrieval-augmented generation system and kept everything the same: the same model, the same prompts, the same questions. The only change was the structure of the information behind the model. They replaced flat text chunks with a knowledge graph that preserved the relational structure between concepts.[5]

Retrieval accuracy improved by 77.6% as measured by Mean Reciprocal Rank. Resolution time dropped by 28.6%. The prompts did not change. The information architecture did. The study was peer-reviewed and published at ACM SIGIR 2024, making it one of the highest-confidence data points available on the relative contribution of prompt design versus context structure.
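
Mean Reciprocal Rank, the metric behind that 77.6% figure, rewards a system for placing the first relevant document as high in the ranking as possible. A minimal implementation of the standard definition (not LinkedIn's code):

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """MRR: average over queries of 1/rank of the first relevant hit.

    ranked_lists: one ranked list of doc IDs per query.
    relevant_sets: the set of relevant doc IDs per query.
    A query with no relevant hit contributes 0 to the average.
    """
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


# Three queries whose first relevant hits land at ranks 1, 2, and 4:
score = mean_reciprocal_rank(
    [["a", "b"], ["x", "a"], ["p", "q", "r", "a"]],
    [{"a"}, {"a"}, {"a"}],
)  # (1 + 1/2 + 1/4) / 3
```

Because the score is dominated by where the *first* relevant document lands, improving retrieval structure moves MRR even when the prompt and model stay fixed, which is exactly the effect the LinkedIn experiment isolated.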

Klarna: the ceiling in action

Klarna deployed an AI assistant that handled 2.5 million customer conversations in its first month, performing work equivalent to 700 full-time agents. The company reported $40 million in projected annual savings and 96% daily usage among support staff. Revenue per employee increased 152%.[11]

Then the system hit a wall. Customer satisfaction scores declined. Quality issues surfaced in complex cases that required judgment beyond what the retrieval and prompt infrastructure could support. Bloomberg reported that Klarna began rehiring human agents for cases where the AI could not maintain quality.[11] The initial deployment was not a failure. It was a demonstration of both prompt engineering's genuine value and its operational ceiling. The structured context, including knowledge graphs and retrieval-augmented generation, carried the system through high-volume, well-scoped tasks. The cases that broke it were the ones where the context pipeline could not deliver the information the model needed to reason correctly.

What patterns emerge across these cases?

The pattern across Duolingo, LinkedIn, and Klarna is the same one that shows up in the aggregate data. Prompt engineering produces real results within a defined range. The ceiling arrives when the task requires information, structure, or judgment that the prompt alone cannot encode.

McKinsey's 2025 global AI survey tested organizational attributes for their correlation with AI EBIT impact. The attribute that ranked highest was not model selection, prompting skill, or talent acquisition. It was fundamental workflow redesign. Organizations that qualified as AI high performers were 3x more likely to have redesigned their workflows around AI rather than layering AI onto existing processes.[2]

The organizations that get disproportionate returns from AI are not writing better prompts. They are building better information environments for those prompts to operate in.

What separates organizations that get strong AI results from those that don't?

The data on organizational AI outcomes is brutal and consistent. Only 6% of organizations qualify as AI high performers, attributing more than 5% of EBIT to AI.[2] Only 25% of AI initiatives deliver expected ROI, according to an IBM survey of 2,000 CEOs across 33 countries.[14] 72% of CIOs report breaking even or losing money on AI investments, according to a May 2025 Gartner survey of 506 CIOs and technology leaders.[15] By some estimates, more than 80% of AI projects fail, twice the rate of non-AI IT projects, according to a RAND Corporation study that traced the root causes across 65 structured interviews.[16]

The failure mode is not "AI doesn't work." The failure mode is "AI works in the demo and doesn't work in the workflow." The MIT NANDA Initiative studied enterprise AI deployments through 52 executive interviews, surveys of 153 leaders, and analysis of 300 publicly disclosed AI initiatives. 95% of GenAI pilots delivered zero measurable P&L impact. But the same study found that externally facing AI applications, those connected to customer data, product catalogs, and transactional systems, succeeded at roughly twice the rate of internal tools.[17] The variable was not the model or the prompts. It was whether the AI system had access to structured, operational data.

What do practitioners consistently identify as the bottleneck?

Across multiple independent surveys, the answer converges. In Informatica's CDO Insights 2025 survey of 600 chief data officers, 43% identified data quality, completeness, and readiness as the top obstacle preventing GenAI initiatives from reaching production.[10] McKinsey's analysis found that the organizations with the highest EBIT impact from AI were 3x more likely to have fundamentally redesigned workflows rather than layering AI onto existing processes.[2]

This is not a talent problem. It is not a model problem. It is a plumbing problem. The organizations losing money on AI are not using worse models or writing worse prompts. They are feeding those models unstructured, incomplete, or poorly organized information, and no prompt can compensate for that.

What drives the gap between 3.7x and 10.3x ROI?

An IDC study of over 4,000 business leaders and AI decision makers found that AI leaders achieve a 10.3x ROI on their AI investments versus an average of 3.7x.[11] That is a 2.8x multiplier, and it compounds across every AI use case in the organization.

The math works in both directions. On the cost side, knowledge workers spend an average of 4.3 hours per week verifying AI outputs, costing approximately $14,200 per employee per year in lost productivity.[11] For a 500-person company where half the workforce uses AI tools, that is $3.55 million annually spent checking the AI's work. 47% of enterprise AI users have made at least one major business decision based on hallucinated content.[11] The cost of those decisions does not show up in the AI budget. It shows up in the P&L as bad strategy, misallocated resources, and lost customers.
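
The verification-cost arithmetic is simple and scales linearly with headcount; the per-employee figure is the cited survey's, the 500-person split is the example above.

```python
# Back-of-the-envelope check on the verification-cost example.
cost_per_user = 14_200     # annual cost of checking AI output, per employee
ai_users = 500 // 2        # half of a 500-person workforce uses AI tools
annual_cost = cost_per_user * ai_users   # $3.55M per year
```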

On the returns side, retrieval-augmented generation reduces hallucinations by up to 71% when the retrieval layer is well-built.[11] DoorDash reduced hallucinations by 90% and compliance issues by 99% through layered context infrastructure rather than prompt optimization.[18] Gartner projects that by 2027, organizations prioritizing semantic data structures will increase model accuracy by up to 80% and reduce costs by up to 60%.[7]

The difference between the 3.7x organizations and the 10.3x organizations is not prompting sophistication. It is infrastructure maturity.

Why does the same prompt produce different results in different systems?

This is the question that reframes everything. A systematic study of retrieval-augmented generation systems across three large language models and six datasets found that retrieval quality, not prompt design, was the dominant factor in output accuracy.[19] The same prompt, the same model, different retrieval infrastructure, different results.

Anthropic's applied AI team documented this from the model provider side. In testing infrastructure configurations for agent-based systems, they found that environmental factors like terminal setup, dependency versions, and system prompts produced multi-percentage-point swings in benchmark scores, enough to change which model appears "best" on a leaderboard. Their conclusion: what organizations assume is a model capability gap is often an infrastructure behavior gap.[12]

The concept that ties these findings together is context engineering. Where prompt engineering optimizes the words you type into the model, context engineering optimizes the entire information environment the model operates in. A framework published in mid-2025 defined seven components of that environment: the instructions, the user prompt, the conversation state and history, long-term memory, retrieved information, available tools, and structured output formatting.[20] The user prompt, the part prompt engineering addresses, is one of seven.
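
Those seven components can be sketched as a single assembly step. The structure below is illustrative: the field names are mine, following the framework's seven categories, not an actual API.

```python
from dataclasses import dataclass, field


@dataclass
class ModelContext:
    """The seven context components named in the framework."""
    instructions: str                                   # system-level rules
    user_prompt: str                                    # what prompt engineering optimizes
    history: list[str] = field(default_factory=list)    # conversation state
    memory: list[str] = field(default_factory=list)     # long-term memory
    retrieved: list[str] = field(default_factory=list)  # RAG results
    tools: list[str] = field(default_factory=list)      # available tool specs
    output_format: str = "free text"                    # structured output contract

    def render(self) -> str:
        """Flatten all seven components into the model's actual input."""
        parts = [self.instructions]
        parts += self.memory + self.retrieved + self.history
        if self.tools:
            parts.append("Tools: " + ", ".join(self.tools))
        parts += [self.user_prompt, f"Output format: {self.output_format}"]
        return "\n\n".join(parts)
```

Seen this way, the user prompt is literally one field among seven; the other six are the information environment that context engineering manages.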

Gartner predicts that by 2028, context engineering features will be embedded in 80% of AI development tools, improving agentic AI accuracy by 30%.[21] A survey of over 1,400 academic papers formalized the field in July 2025, establishing context engineering as a distinct discipline with its own research agenda and methodology.[22]

The standalone "prompt engineer" role is already declining. Microsoft's 2025 Work Trend Index ranked it second to last among AI-related job functions in growth.[11] The skill that endures is problem decomposition: knowing what to ask, what information the model needs, and how to structure that information so the model can use it. That skill starts with prompt engineering and extends into context engineering.

What does the cost of getting this wrong actually look like?

The economics run in both directions, and they compound.

Poor data quality costs organizations an average of $12.9 million per year, according to a Gartner study. That figure comes from a survey of enterprises already investing in data quality solutions, meaning the actual number for organizations without such investments is likely higher.[11] Fragmented AI implementations can significantly reduce AI ROI as organizations struggle with technical debt from uncoordinated deployments.[23] By some estimates, more than 80% of AI projects fail outright, twice the rate of non-AI IT projects.[16]

On the other side of the ledger, the returns from getting the infrastructure right are disproportionate. Industries most exposed to AI saw productivity growth nearly quadruple, from 7% to 27%, with 3x higher revenue per employee and a 56% wage premium for AI-skilled workers.[11] The 10.3x ROI achieved by AI leaders versus the 3.7x average is not a marginal difference.[11] For a company investing $10 million in AI, that is the difference between $37 million in returns and $103 million.

64% of CEOs cite fear of falling behind as a driver of AI investment.[14] The fear is rational. But the response, investing more in prompts and models while underfunding information infrastructure, is the pattern that produces 95% of pilots with zero P&L impact.

How do you fix the information pipeline behind your prompts?

The thesis is now backed by the evidence: prompt engineering is a necessary skill operating within a defined performance range. Beyond that range, the returns come from the information infrastructure feeding the model. The fix is not to stop investing in prompts. It is to invest proportionally in the context pipeline that determines whether those prompts can produce accurate, grounded results.

At Tricky Wombat, we build the context pipeline that sits between your information and the AI model. The system is designed around three requirements that most AI implementations get wrong.

1. Retrieval architecture that preserves meaning

Most retrieval systems chunk documents into flat text blocks and match them to queries by keyword or embedding similarity. This strips relational structure, context, and hierarchy. The result: the model receives fragments that contain relevant words but miss the relationships between concepts. Tricky Wombat's pipeline preserves document structure, entity relationships, and hierarchical context through the retrieval layer, so the model receives information it can reason over, not just text it can pattern-match against.
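
One concrete version of "preserving hierarchy" is to tag every chunk with the heading path it came from, so a retrieved fragment still carries its place in the document. A simplified sketch under that assumption (not Tricky Wombat's actual pipeline):

```python
def chunk_with_heading_path(markdown_text):
    """Split a markdown doc into chunks that remember their heading path.

    A flat chunker would return bare paragraphs; here each chunk is
    (path, text), e.g. ("Refunds > Digital goods", "Digital goods are ...").
    """
    path = {}       # heading level -> current heading text
    chunks = []
    for block in markdown_text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            level = len(block) - len(block.lstrip("#"))
            path[level] = block.lstrip("#").strip()
            # entering a new section invalidates any deeper headings
            path = {k: v for k, v in path.items() if k <= level}
        else:
            breadcrumb = " > ".join(path[k] for k in sorted(path))
            chunks.append((breadcrumb, block))
    return chunks


doc = "# Refunds\n\n## Digital goods\n\nDigital goods are refundable within 14 days."
chunks = chunk_with_heading_path(doc)
```

A chunk tagged "Refunds > Digital goods" disambiguates itself at retrieval time; the same sentence stripped of that path could match queries about an entirely different policy.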

2. Structured context delivery at inference time

A well-retrieved document is not automatically well-structured context. The order, formatting, and scope of what the model sees at inference time directly affect output quality. Gartner's research shows accuracy drops significantly when relevant information is embedded within longer, poorly structured contexts, even when the right information is technically present.[21] Tricky Wombat's pipeline compresses, orders, and structures retrieved information before it reaches the model, delivering the smallest possible set of high-signal tokens for each query.
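
The delivery step reduces to ranking and budgeting: keep the highest-signal snippets that fit. A minimal sketch, where both the relevance scores and the four-characters-per-token estimate are stand-ins rather than production logic:

```python
def pack_context(snippets, token_budget):
    """Order retrieved snippets by score and keep only what fits.

    snippets: list of (score, text); higher score = more relevant.
    Token counts are approximated at ~4 characters per token.
    """
    def est_tokens(text):
        return max(1, len(text) // 4)

    packed, used = [], 0
    for score, text in sorted(snippets, key=lambda s: -s[0]):
        cost = est_tokens(text)
        if used + cost > token_budget:
            continue  # skip snippets that would blow the budget
        packed.append(text)
        used += cost
    return "\n---\n".join(packed)
```

The point of the budget is not cost savings alone: trimming low-signal text keeps the relevant information from being buried in the long, poorly structured contexts that the Gartner research flags.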

3. Continuous verification and citation grounding

The pipeline does not stop at generation. Every output is checked against its source material. Citations are verified. Claims are grounded to specific documents. When the source material changes, the system re-processes affected outputs. This is not a one-time quality check. It is a continuous operation that catches drift, detects when source material has been updated, and flags outputs that no longer reflect current information. The system improves over time because the feedback loop between output quality and retrieval configuration tightens with each cycle.
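
At its simplest, the grounding check verifies that every quoted span in an output actually appears in its cited source, and fingerprints each source so later runs can detect drift. A toy sketch with hypothetical structure, not the actual verifier:

```python
import hashlib


def verify_output(claims, sources):
    """Check each (quote, source_id) claim against its source text.

    Returns the claims whose quote is missing from the source --
    candidates for regeneration or human review.
    """
    failures = []
    for quote, source_id in claims:
        source_text = sources.get(source_id, "")
        if quote not in source_text:
            failures.append((quote, source_id))
    return failures


def source_fingerprint(text):
    """Hash a source so a later run can detect that it changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Storing the fingerprint alongside each generated output is what turns a one-time quality check into the continuous loop described above: when a source's hash changes, every output grounded in it is flagged for re-processing.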

The bottom line

The organizations spending the most on prompt engineering are often the same organizations struggling to get AI to work in production. The data shows a consistent pattern: prompt optimization produces gains within a range of 20 to 50 percentage points of improvement, and then stops. The next 30 to 50 points come from retrieval architecture, data structure, workflow redesign, and the information infrastructure that most teams underfund or skip entirely.

This is not a future state. LinkedIn already demonstrated a 77.6% accuracy improvement by changing information structure alone. DoorDash cut hallucinations 90% through layered context infrastructure. The 6% of organizations that McKinsey identifies as AI high performers are 3x more likely to have redesigned their workflows than to have invested in better prompts.

The question has never been whether prompt engineering matters. It does. The question is whether you stop there. Every organization that stops at the prompt is betting that the words they type matter more than the information the model can see. The data says that bet loses.

References (23)
  1. Gartner, Inc., "Gartner Says Worldwide AI Spending Will Total $2.5 Trillion in 2026," Gartner Newsroom, January 15, 2026. https://www.gartner.com/en/newsroom/press-releases/2026-1-15-gartner-says-worldwide-ai-spending-will-total-2-point-5-trillion-dollars-in-2026
  2. McKinsey & Company, "The State of AI: Global Survey 2025," QuantumBlack/McKinsey, November 5, 2025. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai
  3. arXiv, "A Systematic Evaluation of LLM Strategies for Mental Health Text Analysis: Fine-tuning vs. Prompt Engineering vs. RAG," arXiv:2503.24307, March 31, 2025. https://arxiv.org/abs/2503.24307
  4. The Wharton School / GAIL, "Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting," arXiv:2506.07142, June 8, 2025. https://arxiv.org/abs/2506.07142
  5. ACM SIGIR 2024, "Retrieval-Augmented Generation with Knowledge Graphs for Customer Service Question Answering," July 14-18, 2024. https://dl.acm.org/doi/10.1145/3626772.3661370
  6. Stack Overflow, "2025 Developer Survey," July 29, 2025. https://survey.stackoverflow.co/2025/
  7. Gartner, "Top Data & Analytics Predictions for 2025 and Beyond," Gartner Newsroom, June 17, 2025. https://www.gartner.com/en/newsroom/press-releases/2025-06-17-gartner-announces-top-data-and-analytics-predictions
  8. Fortune Business Insights, "Prompt Engineering Market Size, Industry Share | Forecast, 2026-2034," 2025. https://www.fortunebusinessinsights.com/prompt-engineering-market-109382
  9. The Wharton School / GAIL, "Prompting Science Report 4: Playing Pretend: Expert Personas Don't Improve Factual Accuracy," December 5, 2025. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5879722
  10. Informatica, "CDO Insights 2025: Racing Ahead on GenAI and Data Investments While Navigating Potential Speed Bumps," January 2025. Survey of 600 CDOs across the U.S., Europe, and Asia. https://www.informatica.com/lp/cdo-insights-2025_5039.html
  12. Anthropic Applied AI Team, "Effective Context Engineering for AI Agents," Anthropic Engineering Blog, September 29, 2025. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
  13. Duolingo, Q2 2025 earnings and engineering blog, 2025. Publicly available via Duolingo investor relations. https://blog.duolingo.com/
  14. IBM Institute for Business Value, "2025 CEO Study," May 6, 2025. https://newsroom.ibm.com/2025-05-06-ibm-study-ceos-double-down-on-ai-while-navigating-enterprise-hurdles
  15. Gartner, Survey of 506 CIOs and technology leaders, May 2025. https://www.gartner.com/en/newsroom/press-releases/2025-10-20-gartner-survey-finds-all-it-work-will-involve-ai-by-2030-organizations-must-navigate-ai-readiness-and-human-readiness-to-find-capture-and-sustain-value
  16. RAND Corporation, "The Root Causes of Failure for Artificial Intelligence Projects and How They Can Succeed," RR-A2680-1, August 2024. https://www.rand.org/pubs/research_reports/RRA2680-1.html
  17. MIT NANDA Initiative, "The GenAI Divide: State of AI in Business 2025," MIT Media Lab, Summer 2025. Via Fortune, August 18, 2025. https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/
  18. DoorDash Engineering Blog, 2024-2025. First-party engineering documentation. https://careersatdoordash.com/blog/large-language-modules-based-dasher-support-automation/
  19. arXiv, "Understanding the Design Decisions of Retrieval-Augmented Generation Systems," arXiv:2411.19463, November 2024. https://arxiv.org/abs/2411.19463
  20. Philipp Schmid, "Context Engineering" framework, June-July 2025. https://www.philschmid.de/context-engineering
  21. Gartner, "Lead the Shift to Context Engineering," via Gartner analyst research, 2025. https://www.gartner.com/en/articles/context-engineering
  22. arXiv, Survey of 1,400+ papers formalizing context engineering, arXiv:2507.13334, July 17, 2025. https://arxiv.org/abs/2507.13334
  23. IBM, "From AI projects to profits," June 9, 2025. https://www.ibm.com/thought-leadership/institute-business-value/en-us/report/agentic-ai-profits

By Tricky Wombat

Last Updated: Mar 31, 2026