What exactly IS prompt engineering?

Prompt engineering is where AI results start, not where they peak

The same prompt, sent to the same model, produces different outputs depending on what information the model can see. That single fact explains more about AI performance than any guide to writing better prompts.

Prompt engineering is a real and necessary skill that produces measurable gains on well-scoped tasks. But sharper phrasing of instructions is only one task in a much wider discipline, one that comprises defining the right problem, decomposing it into the right questions, and building the information infrastructure that gives the model what it needs to answer.

That wider discipline is the current state of the art, not prompt engineering on its own.

Key Points

  • Worldwide AI spending will reach $2.52 trillion in 2026, but only 6% of organizations qualify as AI high performers with more than 5% of EBIT from AI.[1][2]

Lessons Learned

  • Treat prompt engineering as one component of a larger system, not the system itself. It is the question layer. The answer depends on the information layer.

What is prompt engineering, and why does it matter now?

Prompt engineering is the practice of designing, structuring, and refining the inputs you give to a large language model to get useful outputs. You write a question, instruction, or set of constraints. The model generates a response based on that input and whatever context it has access to. The skill is real. The difference between a vague prompt and a well-structured one shows up immediately in output quality, consistency, and relevance.
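
The gap between a vague prompt and a structured one is easy to demonstrate. The sketch below is illustrative only: the `build_prompt` helper and its fields are not from any particular library, just one way to make role, task, constraints, and output format explicit instead of typing a bare one-liner.

```python
def build_prompt(role, task, constraints, output_format):
    """Assemble a structured prompt from explicit parts.

    Compare with a vague one-liner like "summarize this" -- the
    structured version pins down scope, limits, and output shape.
    """
    lines = [f"You are {role}.", f"Task: {task}", "Constraints:"]
    lines += [f"- {c}" for c in constraints]
    lines.append(f"Respond as {output_format}.")
    return "\n".join(lines)


vague = "Summarize this report."
structured = build_prompt(
    role="a financial analyst",
    task="summarize the attached Q2 report for a non-technical board",
    constraints=["maximum 120 words", "flag any figure that changed >10% QoQ"],
    output_format="three bullet points followed by a one-line risk note",
)
```

The structured version costs a few extra seconds to write and removes most of the ambiguity the model would otherwise fill in on its own.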

The term entered mainstream business vocabulary around 2023, when generative AI tools became widely accessible. By 2025, the prompt engineering market reached $505.43 million, with projections of $6.7 billion by 2034.[8] Organizations hired prompt engineers, built prompt libraries, and developed internal training programs. The investment was rational. On well-scoped tasks with clean inputs, prompt engineering works.

But the conversation has already moved. In June 2025, the CEO of Shopify posted publicly that "the most important skill in working with AI is not prompt engineering, it's context engineering." Within days, a former director of AI at Tesla endorsed the framing, drawing an analogy between prompt engineering and a single CPU instruction versus context engineering as managing the full RAM environment.[11] By September 2025, Anthropic's applied AI team published a framework stating that "building with language models is becoming less about finding the right words" and more about building the information systems that feed those words their meaning.[12] The distinction matters because it changes where you invest.

Two images showing the evolution from Morse code to a holographic workstation
Prompt engineering has moved from one-off text tweaks to a full technology stack

Credit: Tricky Wombat, made with Google Gemini 3.1 Flash Image, Mar 2026

What does prompt engineering look like in practice?

Three organizations illustrate the range of what happens when prompt engineering meets real operational conditions. Each invested in prompts. Each got results. The differences in their outcomes trace back to what sat behind those prompts.

Duolingo: when prompts work because the pipeline works

Duolingo produced 148 new language courses in a single year using AI, compared to roughly 100 courses produced over the company's first 12 years of operation. Daily active users grew 51%. Subscribers grew 37%. Q2 2025 revenue reached $252.3 million.[13]

This looks like a prompt engineering success story. It is also a context engineering story in disguise. Duolingo's AI engine, Birdbrain, processes 1.25 billion exercises daily. Every prompt the system generates draws from a structured pipeline of learner performance data, CEFR curriculum frameworks, and quality review workflows. The prompts were the visible surface. The data infrastructure underneath determined whether those prompts produced a useful Japanese honorific exercise or a grammatically correct sentence that no native speaker would ever say. When Duolingo expanded into languages with complex politeness systems, the prompt layer alone could not handle the contextual nuance. The curriculum structure and review feedback loops did the work.

LinkedIn: same prompts, different information, different results

LinkedIn's customer service engineering team ran an experiment that isolates the variable this article argues matters most. They took a standard retrieval-augmented generation system and kept everything the same: the same model, the same prompts, the same questions. The only change was the structure of the information behind the model. They replaced flat text chunks with a knowledge graph that preserved the relational structure between concepts.[5]

Retrieval accuracy improved by 77.6% as measured by Mean Reciprocal Rank. Resolution time dropped by 28.6%. The prompts did not change. The information architecture did. The study was peer-reviewed and published at ACM SIGIR 2024, making it one of the highest-confidence data points available on the relative contribution of prompt design versus context structure.
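
Mean Reciprocal Rank, the metric behind that 77.6% figure, rewards a system for placing the first relevant document as high in the ranking as possible. A minimal implementation of the standard definition (not LinkedIn's code):

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """MRR: average over queries of 1/rank of the first relevant hit.

    ranked_lists: one ranked list of doc IDs per query.
    relevant_sets: the set of relevant doc IDs per query.
    A query with no relevant hit contributes 0 to the average.
    """
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


# Three queries whose first relevant hits land at ranks 1, 2, and 4:
score = mean_reciprocal_rank(
    [["a", "b"], ["x", "a"], ["p", "q", "r", "a"]],
    [{"a"}, {"a"}, {"a"}],
)  # (1 + 1/2 + 1/4) / 3
```

Because the score is dominated by where the *first* relevant document lands, improving retrieval structure moves MRR even when the prompt and model stay fixed, which is exactly the effect the LinkedIn experiment isolated.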

Klarna: the ceiling in action

Klarna deployed an AI assistant that handled 2.5 million customer conversations in its first month, performing work equivalent to 700 full-time agents. The company reported $40 million in projected annual savings and 96% daily usage among support staff. Revenue per employee increased 152%.[11]

Then the system hit a wall. Customer satisfaction scores declined. Quality issues surfaced in complex cases that required judgment beyond what the retrieval and prompt infrastructure could support. Bloomberg reported that Klarna began rehiring human agents for cases where the AI could not maintain quality.[11] The initial deployment was not a failure. It was a demonstration of both prompt engineering's genuine value and its operational ceiling. The structured context, including knowledge graphs and retrieval-augmented generation, carried the system through high-volume, well-scoped tasks. The cases that broke it were the ones where the context pipeline could not deliver the information the model needed to reason correctly.

What patterns emerge across these cases?

The pattern across Duolingo, LinkedIn, and Klarna is the same one that shows up in the aggregate data. Prompt engineering produces real results within a defined range. The ceiling arrives when the task requires information, structure, or judgment that the prompt alone cannot encode.

McKinsey's 2025 global AI survey tested organizational attributes for their correlation with AI EBIT impact. The attribute that ranked highest was not model selection, prompting skill, or talent acquisition. It was fundamental workflow redesign. Organizations that qualified as AI high performers were 3x more likely to have redesigned their workflows around AI rather than layering AI onto existing processes.[2]

The organizations that get disproportionate returns from AI are not writing better prompts. They are building better information environments for those prompts to operate in.

What separates organizations that get strong AI results from those that don't?

The data on organizational AI outcomes is brutal and consistent. Only 6% of organizations qualify as AI high performers, attributing more than 5% of EBIT to AI.[2] Only 25% of AI initiatives deliver expected ROI, according to an IBM survey of 2,000 CEOs across 33 countries.[14] 72% of CIOs report breaking even or losing money on AI investments, according to a May 2025 Gartner survey of 506 CIOs and technology leaders.[15] By some estimates, more than 80% of AI projects fail, twice the rate of non-AI IT projects, according to a RAND Corporation study that traced the root causes across 65 structured interviews.[16]

The failure mode is not "AI doesn't work." The failure mode is "AI works in the demo and doesn't work in the workflow." The MIT NANDA Initiative studied enterprise AI deployments through 52 executive interviews, surveys of 153 leaders, and analysis of 300 publicly disclosed AI initiatives. 95% of GenAI pilots delivered zero measurable P&L impact. But the same study found that externally facing AI applications, those connected to customer data, product catalogs, and transactional systems, succeeded at roughly twice the rate of internal tools.[17] The variable was not the model or the prompts. It was whether the AI system had access to structured, operational data.

What do practitioners consistently identify as the bottleneck?

Across multiple independent surveys, the answer converges. In Informatica's CDO Insights 2025 survey of 600 chief data officers, 43% identified data quality, completeness, and readiness as the top obstacle preventing GenAI initiatives from reaching production.[10] McKinsey's analysis found that the organizations with the highest EBIT impact from AI were 3x more likely to have fundamentally redesigned workflows rather than layering AI onto existing processes.[2]

This is not a talent problem. It is not a model problem. It is a plumbing problem. The organizations losing money on AI are not using worse models or writing worse prompts. They are feeding those models unstructured, incomplete, or poorly organized information, and no prompt can compensate for that.

What drives the gap between 3.7x and 10.3x ROI?

An IDC study of over 4,000 business leaders and AI decision makers found that AI leaders achieve a 10.3x ROI on their AI investments versus an average of 3.7x.[11] That is a 2.8x multiplier, and it compounds across every AI use case in the organization.

The math works in both directions. On the cost side, knowledge workers spend an average of 4.3 hours per week verifying AI outputs, costing approximately $14,200 per employee per year in lost productivity.[11] For a 500-person company where half the workforce uses AI tools, that is $3.55 million annually spent checking the AI's work. 47% of enterprise AI users have made at least one major business decision based on hallucinated content.[11] The cost of those decisions does not show up in the AI budget. It shows up in the P&L as bad strategy, misallocated resources, and lost customers.
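
The verification-cost arithmetic is simple and scales linearly with headcount; the per-employee figure is the cited survey's, the 500-person split is the example above.

```python
# Back-of-the-envelope check on the verification-cost example.
cost_per_user = 14_200     # annual cost of checking AI output, per employee
ai_users = 500 // 2        # half of a 500-person workforce uses AI tools
annual_cost = cost_per_user * ai_users   # $3.55M per year
```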

On the returns side, retrieval-augmented generation reduces hallucinations by up to 71% when the retrieval layer is well-built.[11] DoorDash reduced hallucinations by 90% and compliance issues by 99% through layered context infrastructure rather than prompt optimization.[18] Gartner projects that by 2027, organizations prioritizing semantic data structures will increase model accuracy by up to 80% and reduce costs by up to 60%.[7]

The difference between the 3.7x organizations and the 10.3x organizations is not prompting sophistication. It is infrastructure maturity.

Why does the same prompt produce different results in different systems?

This is the question that reframes everything. A systematic study of retrieval-augmented generation systems across three large language models and six datasets found that retrieval quality, not prompt design, was the dominant factor in output accuracy.[19] The same prompt, the same model, different retrieval infrastructure, different results.

Anthropic's applied AI team documented this from the model provider side. In testing infrastructure configurations for agent-based systems, they found that environmental factors like terminal setup, dependency versions, and system prompts produced multi-percentage-point swings in benchmark scores, enough to change which model appears "best" on a leaderboard. Their conclusion: what organizations assume is a model capability gap is often an infrastructure behavior gap.[12]

The concept that ties these findings together is context engineering. Where prompt engineering optimizes the words you type into the model, context engineering optimizes the entire information environment the model operates in. A framework published in mid-2025 defined seven components of that environment: the instructions, the user prompt, the conversation state and history, long-term memory, retrieved information, available tools, and structured output formatting.[20] The user prompt, the part prompt engineering addresses, is one of seven.
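
Those seven components can be sketched as a single assembly step. The structure below is illustrative: the field names are mine, following the framework's seven categories, not an actual API.

```python
from dataclasses import dataclass, field


@dataclass
class ModelContext:
    """The seven context components named in the framework."""
    instructions: str                                   # system-level rules
    user_prompt: str                                    # what prompt engineering optimizes
    history: list[str] = field(default_factory=list)    # conversation state
    memory: list[str] = field(default_factory=list)     # long-term memory
    retrieved: list[str] = field(default_factory=list)  # RAG results
    tools: list[str] = field(default_factory=list)      # available tool specs
    output_format: str = "free text"                    # structured output contract

    def render(self) -> str:
        """Flatten all seven components into the model's actual input."""
        parts = [self.instructions]
        parts += self.memory + self.retrieved + self.history
        if self.tools:
            parts.append("Tools: " + ", ".join(self.tools))
        parts += [self.user_prompt, f"Output format: {self.output_format}"]
        return "\n\n".join(parts)
```

Seen this way, the user prompt is literally one field among seven; the other six are the information environment that context engineering manages.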

Gartner predicts that by 2028, context engineering features will be embedded in 80% of AI development tools, improving agentic AI accuracy by 30%.[21] A survey of over 1,400 academic papers formalized the field in July 2025, establishing context engineering as a distinct discipline with its own research agenda and methodology.[22]

The standalone "prompt engineer" role is already declining. Microsoft's 2025 Work Trend Index ranked it second to last among AI-related job functions in growth.[11] The skill that endures is problem decomposition: knowing what to ask, what information the model needs, and how to structure that information so the model can use it. That skill starts with prompt engineering and extends into context engineering.

What does the cost of getting this wrong actually look like?

The economics run in both directions, and they compound.

Poor data quality costs organizations an average of $12.9 million per year, according to a Gartner study. That figure comes from a survey of enterprises already investing in data quality solutions, meaning the actual number for organizations without such investments is likely higher.[11] Fragmented AI implementations can significantly reduce AI ROI as organizations struggle with technical debt from uncoordinated deployments.[23] By some estimates, more than 80% of AI projects fail outright, twice the rate of non-AI IT projects.[16]

On the other side of the ledger, the returns from getting the infrastructure right are disproportionate. Industries most exposed to AI saw productivity growth nearly quadruple, from 7% to 27%, with 3x higher revenue per employee and a 56% wage premium for AI-skilled workers.[11] The 10.3x ROI achieved by AI leaders versus the 3.7x average is not a marginal difference.[11] For a company investing $10 million in AI, that is the difference between $37 million in returns and $103 million.

64% of CEOs cite fear of falling behind as a driver of AI investment.[14] The fear is rational. But the response, investing more in prompts and models while underfunding information infrastructure, is the pattern that produces 95% of pilots with zero P&L impact.

How do you fix the information pipeline behind your prompts?

The thesis is now backed by the evidence: prompt engineering is a necessary skill operating within a defined performance range. Beyond that range, the returns come from the information infrastructure feeding the model. The fix is not to stop investing in prompts. It is to invest proportionally in the context pipeline that determines whether those prompts can produce accurate, grounded results.

At Tricky Wombat, we build the context pipeline that sits between your information and the AI model. The system is designed around three requirements that most AI implementations get wrong.

1. Retrieval architecture that preserves meaning

Most retrieval systems chunk documents into flat text blocks and match them to queries by keyword or embedding similarity. This strips relational structure, context, and hierarchy. The result: the model receives fragments that contain relevant words but miss the relationships between concepts. Tricky Wombat's pipeline preserves document structure, entity relationships, and hierarchical context through the retrieval layer, so the model receives information it can reason over, not just text it can pattern-match against.
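
One concrete version of "preserving hierarchy" is to tag every chunk with the heading path it came from, so a retrieved fragment still carries its place in the document. A simplified sketch under that assumption (not Tricky Wombat's actual pipeline):

```python
def chunk_with_heading_path(markdown_text):
    """Split a markdown doc into chunks that remember their heading path.

    A flat chunker would return bare paragraphs; here each chunk is
    (path, text), e.g. ("Refunds > Digital goods", "Digital goods are ...").
    """
    path = {}       # heading level -> current heading text
    chunks = []
    for block in markdown_text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            level = len(block) - len(block.lstrip("#"))
            path[level] = block.lstrip("#").strip()
            # entering a new section invalidates any deeper headings
            path = {k: v for k, v in path.items() if k <= level}
        else:
            breadcrumb = " > ".join(path[k] for k in sorted(path))
            chunks.append((breadcrumb, block))
    return chunks


doc = "# Refunds\n\n## Digital goods\n\nDigital goods are refundable within 14 days."
chunks = chunk_with_heading_path(doc)
```

A chunk tagged "Refunds > Digital goods" disambiguates itself at retrieval time; the same sentence stripped of that path could match queries about an entirely different policy.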

2. Structured context delivery at inference time

A well-retrieved document is not automatically well-structured context. The order, formatting, and scope of what the model sees at inference time directly affect output quality. Gartner's research shows accuracy drops significantly when relevant information is embedded within longer, poorly structured contexts, even when the right information is technically present.[21] Tricky Wombat's pipeline compresses, orders, and structures retrieved information before it reaches the model, delivering the smallest possible set of high-signal tokens for each query.
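
The delivery step reduces to ranking and budgeting: keep the highest-signal snippets that fit. A minimal sketch, where both the relevance scores and the four-characters-per-token estimate are stand-ins rather than production logic:

```python
def pack_context(snippets, token_budget):
    """Order retrieved snippets by score and keep only what fits.

    snippets: list of (score, text); higher score = more relevant.
    Token counts are approximated at ~4 characters per token.
    """
    def est_tokens(text):
        return max(1, len(text) // 4)

    packed, used = [], 0
    for score, text in sorted(snippets, key=lambda s: -s[0]):
        cost = est_tokens(text)
        if used + cost > token_budget:
            continue  # skip snippets that would blow the budget
        packed.append(text)
        used += cost
    return "\n---\n".join(packed)
```

The point of the budget is not cost savings alone: trimming low-signal text keeps the relevant information from being buried in the long, poorly structured contexts that the Gartner research flags.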

3. Continuous verification and citation grounding

The pipeline does not stop at generation. Every output is checked against its source material. Citations are verified. Claims are grounded to specific documents. When the source material changes, the system re-processes affected outputs. This is not a one-time quality check. It is a continuous operation that catches drift, detects when source material has been updated, and flags outputs that no longer reflect current information. The system improves over time because the feedback loop between output quality and retrieval configuration tightens with each cycle.
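
At its simplest, the grounding check verifies that every quoted span in an output actually appears in its cited source, and fingerprints each source so later runs can detect drift. A toy sketch with hypothetical structure, not the actual verifier:

```python
import hashlib


def verify_output(claims, sources):
    """Check each (quote, source_id) claim against its source text.

    Returns the claims whose quote is missing from the source --
    candidates for regeneration or human review.
    """
    failures = []
    for quote, source_id in claims:
        source_text = sources.get(source_id, "")
        if quote not in source_text:
            failures.append((quote, source_id))
    return failures


def source_fingerprint(text):
    """Hash a source so a later run can detect that it changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Storing the fingerprint alongside each generated output is what turns a one-time quality check into the continuous loop described above: when a source's hash changes, every output grounded in it is flagged for re-processing.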

The bottom line

The organizations spending the most on prompt engineering are often the same organizations struggling to get AI to work in production. The data shows a consistent pattern: prompt optimization produces gains within a range of 20 to 50 percentage points of improvement, and then stops. The next 30 to 50 points come from retrieval architecture, data structure, workflow redesign, and the information infrastructure that most teams underfund or skip entirely.

This is not a future state. LinkedIn already demonstrated a 77.6% accuracy improvement by changing information structure alone. DoorDash cut hallucinations 90% through layered context infrastructure. The 6% of organizations that McKinsey identifies as AI high performers are 3x more likely to have redesigned their workflows than to have invested in better prompts.

The question has never been whether prompt engineering matters. It does. The question is whether you stop there. Every organization that stops at the prompt is betting that the words they type matter more than the information the model can see. The data says that bet loses.

References (23)
  1. Gartner, Inc., "Gartner Says Worldwide AI Spending Will Total $2.5 Trillion in 2026," Gartner Newsroom, January 15, 2026. https://www.gartner.com/en/newsroom/press-releases/2026-1-15-gartner-says-worldwide-ai-spending-will-total-2-point-5-trillion-dollars-in-2026
  2. McKinsey & Company, "The State of AI: Global Survey 2025," QuantumBlack/McKinsey, November 5, 2025. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai
  3. arXiv, "A Systematic Evaluation of LLM Strategies for Mental Health Text Analysis: Fine-tuning vs. Prompt Engineering vs. RAG," arXiv:2503.24307, March 31, 2025. https://arxiv.org/abs/2503.24307
  4. The Wharton School / GAIL, "Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting," arXiv:2506.07142, June 8, 2025. https://arxiv.org/abs/2506.07142
  5. ACM SIGIR 2024, "Retrieval-Augmented Generation with Knowledge Graphs for Customer Service Question Answering," July 14-18, 2024. https://dl.acm.org/doi/10.1145/3626772.3661370
  6. Stack Overflow, "2025 Developer Survey," July 29, 2025. https://survey.stackoverflow.co/2025/
  7. Gartner, "Top Data & Analytics Predictions for 2025 and Beyond," Gartner Newsroom, June 17, 2025. https://www.gartner.com/en/newsroom/press-releases/2025-06-17-gartner-announces-top-data-and-analytics-predictions
  8. Fortune Business Insights, "Prompt Engineering Market Size, Industry Share | Forecast, 2026-2034," 2025. https://www.fortunebusinessinsights.com/prompt-engineering-market-109382
  9. The Wharton School / GAIL, "Prompting Science Report 4: Playing Pretend: Expert Personas Don't Improve Factual Accuracy," December 5, 2025. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5879722
  10. Informatica, "CDO Insights 2025: Racing Ahead on GenAI and Data Investments While Navigating Potential Speed Bumps," January 2025. Survey of 600 CDOs across the U.S., Europe, and Asia. https://www.informatica.com/lp/cdo-insights-2025_5039.html
  12. Anthropic Applied AI Team, "Effective Context Engineering for AI Agents," Anthropic Engineering Blog, September 29, 2025. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
  13. Duolingo, Q2 2025 earnings and engineering blog, 2025. Publicly available via Duolingo investor relations. https://blog.duolingo.com/
  14. IBM Institute for Business Value, "2025 CEO Study," May 6, 2025. https://newsroom.ibm.com/2025-05-06-ibm-study-ceos-double-down-on-ai-while-navigating-enterprise-hurdles
  15. Gartner, Survey of 506 CIOs and technology leaders, May 2025. https://www.gartner.com/en/newsroom/press-releases/2025-10-20-gartner-survey-finds-all-it-work-will-involve-ai-by-2030-organizations-must-navigate-ai-readiness-and-human-readiness-to-find-capture-and-sustain-value
  16. RAND Corporation, "The Root Causes of Failure for Artificial Intelligence Projects and How They Can Succeed," RR-A2680-1, August 2024. https://www.rand.org/pubs/research_reports/RRA2680-1.html
  17. MIT NANDA Initiative, "The GenAI Divide: State of AI in Business 2025," MIT Media Lab, Summer 2025. Via Fortune, August 18, 2025. https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/
  18. DoorDash Engineering Blog, 2024-2025. First-party engineering documentation. https://careersatdoordash.com/blog/large-language-modules-based-dasher-support-automation/
  19. arXiv, "Understanding the Design Decisions of Retrieval-Augmented Generation Systems," arXiv:2411.19463, November 2024. https://arxiv.org/abs/2411.19463
  20. Philipp Schmid, "Context Engineering" framework, June-July 2025. https://www.philschmid.de/context-engineering
  21. Gartner, "Lead the Shift to Context Engineering," via Gartner analyst research, 2025. https://www.gartner.com/en/articles/context-engineering
  22. arXiv, Survey of 1,400+ papers formalizing context engineering, arXiv:2507.13334, July 17, 2025. https://arxiv.org/abs/2507.13334
  23. IBM, "From AI projects to profits," June 9, 2025. https://www.ibm.com/thought-leadership/institute-business-value/en-us/report/agentic-ai-profits

By Tricky Wombat

Last Updated: Mar 31, 2026