Harness engineering is the missing discipline in enterprise AI adoption
Why the companies winning with AI engineer the context pipeline, not the model, and how to apply the same approach across banking, retail, and software

Enterprises poured an estimated $302 billion into AI in 2025, according to IDC data, and 88% of organizations now run it in at least one business function, up from 78% a year earlier.[2][1] Yet only about 6% qualify as high performers, meaning firms that attribute more than 5% of EBIT to AI, and 95% of generative AI pilots stall before they touch a P&L.[2][3] The gap is not explained by who bought the better model. It is explained by harness engineering, the discipline of building the context pipeline and constraint architecture around the model. The companies winning with AI are not running better models than their competitors. They built the infrastructure that decides what the model gets to work with.
Lessons Learned
Treat model selection as table stakes and the context pipeline as the differentiator. Fund retrieval, constraints, and feedback loops before you fund another model evaluation.
What is harness engineering, and why does it decide whether AI works?
Harness engineering is the discipline of designing, building, and maintaining the infrastructure that orchestrates AI agents. It spans context engineering, architectural constraints, and feedback loops. Practitioner guides published through 2025 and 2026 distinguish it from two adjacent terms by scope. Prompt engineering shapes a single interaction. Context engineering shapes a single context window. Harness engineering governs the entire agent system across many context windows and the full lifetime of a task.[10][11] The shorthand circulating among practitioners is simple: an agent is a model plus a harness.
The reason this matters is that the model is the part most teams can already buy. The same frontier models are available to every competitor in a market. What differs is the pipeline that feeds them, the constraints that keep them in bounds, and the loops that catch them when they drift. Hold the model constant and change the context, and output quality moves. That is the whole argument, and the evidence for it is now strong enough to act on.
Start with a case that looks like a model story and is not. Spotify built an internal coding agent called Honk, running on a commercially available model inside sandboxed containers with limited permissions. Developers start jobs from Slack and the system handles the rest. Since mid-2024, Honk has merged more than 1,500 pull requests into production and saves 60 to 90% of engineering time on complex code migrations.[12] The numbers are concrete and impressive. The cause is counterintuitive. Honk only works because of infrastructure Spotify began building years earlier: Fleet Management, the Backstage developer portal, standardized build systems, and comprehensive test suites. The model was fixed from the start. The scaffolding was the variable.

Tricky Wombat made with Google Gemini 3.1 Flash Image, Jun 2026
What does the data show about where AI value actually comes from?
The cleanest evidence sits in retrieval research. A 2025 study presented at ICLR on "sufficient context" showed that it is possible to classify whether a model has enough information to answer a question, and that models given sufficient context answer correctly far more often, while models given insufficient context tend to confabulate rather than abstain.[13] Supporting deployment reviews report that context-graph-grounded retrieval reaches up to 5x the analyst accuracy of systems handed raw database schemas, with the model held constant.[14] Treat the precise multiplier as directional, since it traces to a deployment review rather than a controlled benchmark, but the direction is consistent across the literature: change the context, move the accuracy.
The strategic version of the same finding comes from BCG, which attributes roughly 10% of AI value to the algorithms, 20% to the technology, and 70% to rethinking people and process.[5] Attribute the exact split to BCG rather than treating it as settled fact, because it originates with one consultancy framework. McKinsey's 2025 research corroborates the shape of it: across every factor tested, workflow redesign correlated most strongly with value.[2] If 70% of the value lives in process and pipeline design and 10% in the algorithm, then a team pouring its energy into model selection is optimizing the smallest term in the equation.
Where the gaps persist is telling. Deloitte's 2026 survey of 3,235 leaders across 24 countries found 66% reporting productivity and efficiency gains but only 20% reporting revenue growth, against the 74% who hoped for it. Data management readiness reached only 40% across surveyed organizations, and just 21% had mature governance for autonomous agents.[15] AI is delivering tactical wins and missing strategic transformation, and the readiness numbers point at why.

What are practitioners reporting from inside the work?
Developers are adopting AI faster and trusting it less. Stack Overflow's 2025 survey of 33,662 respondents found 84% using or planning to use AI tools, up from 76%, and 51% using them daily. Positive sentiment fell to 60% from above 70% in prior years. Only 3.1% of professional developers highly trust AI output accuracy, and 46% actively distrust it.[6] The most frequent users, early-career developers, also report the lowest trust, which means the people leaning on these tools hardest are the least equipped to catch their errors.
The frustration data names the mechanism. The single largest complaint, cited by 66%, is "AI solutions that are almost right, but not quite." The second, cited by 45%, is that debugging AI-generated code takes more time, a direct downstream consequence of the first.[6] "Almost right, but not quite" is what a context failure feels like from the keyboard. The model reasons competently from information that was incomplete, so the output is plausible and wrong, which is the most expensive kind of wrong because it survives a quick read. The chief data officers responsible for the pipelines say the same thing in their own vocabulary, and we get to their numbers shortly.
What does harness engineering look like in production?
The pattern repeats across companies that look nothing alike. Each one made the same bet: invest in the operating environment around the model, not just the model. Here is what that looks like at three organizations in three different businesses.
Stripe runs 1,300 AI pull requests a week on constraint architecture, not a better model
Stripe needed to evolve a large, complex codebase while increasing shipping velocity. Its baseline was human engineers making every change by hand. The company built Minions, autonomous agents that each run in an isolated devbox: a fresh sandbox with a checkout of the relevant code that spins up in about ten seconds and cannot touch production, push to main, or act outside its defined scope. Each agent runs tests inside the sandbox, reads the output, and iterates. That feedback loop of act, observe, adjust is what separates the system from a generate-and-paste workflow. When the agent finishes, the diff becomes a pull request a human reviews like any other. The result is more than 1,300 AI-generated pull requests merged every week.[16] Engineers trigger the whole thing from Slack. The sophistication lives underneath, in the constraints and the feedback architecture, not in the prompt.
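The act, observe, adjust loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Stripe's implementation: `propose_patch` and `run_tests` are hypothetical stand-ins for the model call and the sandboxed test run, and the attempt cap is an assumed parameter rather than a reported figure.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    passed: bool
    output: str  # test log the agent reads on the next attempt


def feedback_loop(
    task: str,
    propose_patch: Callable[[str, str], str],  # hypothetical model call: (task, feedback) -> patch
    run_tests: Callable[[str], RunResult],     # hypothetical sandboxed test run: patch -> result
    max_attempts: int = 3,                     # assumed cap, not a figure from the source
) -> Optional[str]:
    """Return a patch that passes its tests, or None so a human can take over."""
    feedback = ""
    for _ in range(max_attempts):
        patch = propose_patch(task, feedback)  # act
        result = run_tests(patch)              # observe
        if result.passed:
            return patch                       # ready to become a pull request for review
        feedback = result.output               # adjust: feed the failure back in
    return None
```

The design choice worth noting is the `None` return: when the loop runs out of attempts, the sketch hands the task back instead of shipping the last failing patch.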
JPMorgan Chase rebuilt the context, not the model, to reclaim 360,000 lawyer-hours
JPMorgan Chase processed commercial credit agreements the way every large bank did, with trained lawyers reading tens of thousands of pages a year. The bank built the Contract Intelligence platform, known as COiN, around a specific problem: surfacing the right clause, the right defined term, and the right precedent inside a contract review. According to widely reported figures, the system processes 12,000 commercial credit agreements a year, has reclaimed 360,000 lawyer-hours, and has cut error rates by 80% against manual review.[17] Treat the precise figures as secondary-sourced. The instructive part holds regardless: the work was not pointing a model at contracts. It was engineering the retrieval and extraction context the model needed to be accurate. That is the difference between pointing AI at a problem and engineering the context that solves it. The 360,000 recovered hours equal roughly 180 attorney-years of capacity, assuming about 2,000 working hours per attorney per year.
General Mills found the data pipeline was the project, not the model
General Mills runs a global supply chain that generates thousands of routing and vendor decisions daily. It deployed an autonomous agent for logistics optimization that evaluates more than 5,000 shipments a day and has generated more than $20 million in savings since FY2024, according to reported figures.[17] The detail that matters for engineering leaders is the timeline. Customer-facing AI deployments in the same research reached measurable ROI in two weeks to 90 days, while supply chain work like this took 12 or more months, not because the model was harder but because the data pipeline and integration work was. The variable that predicted time-to-value was pipeline complexity, a harness engineering variable.
What pattern emerges across these cases?
Three businesses, three different problems, one shared answer. None of them won by acquiring a model their competitors could not. Stripe's edge is the sandbox-and-feedback architecture. JPMorgan's is the retrieval context. General Mills's is the data pipeline. The user-facing symptom developers report, "almost right, but not quite" at 66%, is the inverse of what these systems engineered away: they built the context and the feedback so the output would be right, not just plausible.[6] And the institutional pattern lines up. Among the surveys, the people closest to the pipeline already know where the constraint is. 67% of chief data officers report they cannot move even half their generative AI pilots to production, 43% name data quality and readiness as the top obstacle, and 92% keep accelerating adoption anyway.[18] The constraint is identified. It is just underfunded.
What separates organizations that get strong results from those that don't?
Zoom out from individual systems and the divergence is structural. Organizations at advanced AI maturity now report operating margins materially above early-stage peers, and the roughly 5% seeing substantial financial gains show three-year shareholder returns several times higher than laggards.[2] The gap is widening, not closing, which makes execution quality a competitive variable rather than a hygiene factor. RAND's research frames why the downside is so steep: AI projects fail at roughly twice the rate of non-AI IT projects, and the elevated rate is structurally distinct, a category-level design error rather than ordinary difficulty.[4]
What do users and practitioners consistently report?
The convergence across independent sources is the strongest part of the case. Developers report "almost right, but not quite" as their top frustration.[6] Chief data officers name data quality as their top obstacle.[18] Enterprise leaders in Deloitte's survey report broad efficiency gains but narrow revenue impact, with data readiness at 40%.[15] These are different populations using different language, and they describe the same constraint. The model is rarely the thing people complain about. The context, the data, and the integration are.
What drives the gap between strong and weak outcomes?
Data preparation runs 15 to 25% of total AI project cost yet is routinely underrepresented in initial business cases, and organizations frequently underestimate total AI investment, with the gap concentrated in data preparation and change management. So the single largest cost driver is also the one most consistently omitted from the plan. Pair that with the readiness data: 63% of organizations do not have, or are unsure they have, the right data management practices for AI, and Gartner expects 60% of projects lacking AI-ready data to be abandoned through 2026.[9] The compounding works both ways. Underfund the pipeline and projects get abandoned. Fund it and, as the maturity data shows, the margin gap widens in your favor.
What actually decides whether enterprise AI works?
The variable is not the model. It is the pipeline that feeds it, the context the model can retrieve at inference time, the constraints that keep it in bounds, and the feedback loops that catch it when it drifts. MIT's 2025 research names this directly. Reviewing more than 300 implementations alongside 52 executive interviews and 153 survey responses, the study concluded that "the core issue isn't talent, infrastructure, or regulation. It's the lack of learning, integration, and contextual adaptation. Most GenAI systems do not retain feedback, adapt to context, or improve over time."[3] The systems that stall fail at contextual adaptation, not raw capability.
Once you see that, the spending pattern inverts. The energy most teams put into model selection and prompt tuning is aimed at the 10 to 30% of value those choices control, while the 70% that lives in pipelines, retrieval, and workflow design goes underfunded.[5] There is a second-order warning in the data worth naming. Security research cited in practitioner guides reports a sharp rise in findings from AI-generated code as adoption outpaced quality controls, and the DORA 2025 research found higher AI adoption correlating with both more throughput and more instability.[19] Read that correctly. It is not evidence that AI is the problem. It is evidence that capability without a harness amplifies risk. The faster the model, the more it matters what surrounds it.

What does a successful implementation look like across geographies?
DBS Bank is the clearest proof, because it sequenced infrastructure before models on purpose. Its starting point in the late 2010s was strong technology sitting on fragmented data, and early deployments showed that model selection was not the binding constraint; data readiness and governance were. So the bank built in order. In 2018 it adopted the PURE governance framework (Purposeful, Unsurprising, Respectful, Explainable), with mandatory training for all employees. From 2021 it upskilled more than 9,000 staff over four years. In 2023 it consolidated 700 data professionals into a centralized Data Chapter. Only then did it scale models, reaching more than 2,000 AI and machine learning models across over 430 use cases. Because of that sequence, economic value from AI grew from S$370 million in 2023 to S$750 million in 2024 to S$1 billion in 2025, ahead of projections, with a credit-risk model flagging more than 95% of at-risk SME loans three months before default.[8][7] Competitors had the same models. The gap is what DBS built around them.
The same shape appears in Australia and Latin America. Commonwealth Bank of Australia migrated its entire data infrastructure, more than 61,000 data pipelines, to a cloud platform between mid-2024 and mid-2025, the prerequisite for the AI that followed. Its systems now make about 55 million decisions a day across 2,000-plus models, and customer fraud losses fell by more than 20% in the first half of FY2026, a number the bank attributes to serving models complete, consistent, real-time data.[20] Mercado Libre deployed an AI service assistant that handled more than 9 million inquiries in Q4 2025 and resolved 87% without a human operator, while net revenue grew 45% year over year, growth the company says its AI investments supported.[21] The 87% resolution rate is a context-quality metric. The 45% revenue growth is a data-infrastructure outcome. Two banks and a retailer on three continents, the same available models, different infrastructure, different results.

What are the real economics of getting the pipeline wrong, and right?
Both sides of the ledger are large. On the cost side, MIT Sloan research estimates companies lose 15 to 25% of revenue annually to poor data quality, while data teams spend roughly half their time on remediation instead of analysis.[22] A quality issue caught at ingestion is far cheaper than the same issue caught in a boardroom dashboard. The hidden cost is the omission already noted: data preparation is the largest single cost driver and the one most often left out of the business case.
On the return side, organizations that get implementation right report materially higher returns from agentic deployments than from traditional automation, and direct financial impact is rising as the primary success metric heading into 2026.[23] The maturity data puts a number on the spread: the roughly 5% capturing substantial gains show shareholder returns several times higher than laggards, and the margin gap between advanced and early-stage organizations is widening.[2] Both sides of the ledger trace to the same variable. The cost of getting it wrong is a pipeline that feeds models incomplete data. The return on getting it right is a pipeline that feeds them complete data. The model line item is roughly the same in both columns.

How do you fix the AI implementation gap?
The thesis restated: model selection is necessary but not sufficient, and the differentiator is the harness around the model. So the fix is not a procurement decision. It is an engineering discipline applied to the pipeline. We build that pipeline at Tricky Wombat, and the work concentrates on three things most systems get wrong.
1. Retrieve enough context to answer, and know when you have not
Most systems retrieve a fixed number of chunks by vector similarity and pass whatever comes back to the model, with no check on whether that material is sufficient to answer the question. That is the origin of "almost right, but not quite." The model reasons well over thin context and produces a confident, wrong answer. We measure context sufficiency at inference time. When the retrieved material does not clear the threshold to answer, the system says so or escalates instead of confabulating. The research underwriting this is direct: given sufficient context, models answer correctly, and given insufficient context, they fail in predictable ways.[13]
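A sufficiency gate of this kind can be sketched as follows. This is a minimal illustration, not a description of any production system: `score_sufficiency` and `generate` are hypothetical callables standing in for a sufficiency rater and a model call, and the 0.7 threshold is an assumed value that would need tuning on labeled examples.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Answer:
    text: str
    escalated: bool


def answer_with_gate(
    question: str,
    chunks: List[str],
    score_sufficiency: Callable[[str, List[str]], float],  # hypothetical rater, returns 0.0 to 1.0
    generate: Callable[[str, List[str]], str],             # hypothetical model call
    threshold: float = 0.7,                                # assumed cutoff; tune on labeled data
) -> Answer:
    """Answer only when the retrieved context clears the sufficiency threshold."""
    score = score_sufficiency(question, chunks)
    if score < threshold:
        # Abstain or hand off rather than let the model guess from thin context.
        return Answer(
            text=f"Not enough context to answer (sufficiency {score:.2f}).",
            escalated=True,
        )
    return Answer(text=generate(question, chunks), escalated=False)
```

The point of the sketch is the ordering: the sufficiency check runs before generation, so a thin retrieval result never reaches the model as if it were complete.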
2. Ground retrieval in structure, not raw text alone
Most systems treat the knowledge base as a flat pile of text and embeddings. Relationships between entities, the link from a clause to its defined term, from a customer to their contract, from a part to its supplier, are lost in the chunking. We ground retrieval in a context graph that preserves those relationships, so the model receives the connected context a human expert would assemble, not a bag of nearby sentences. Deployment reviews report context-graph grounding reaching up to 5x the accuracy of raw-schema retrieval with the model unchanged, and that gain comes entirely from the structure of what is retrieved.[14]
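To make the idea concrete, here is a minimal sketch of graph-grounded expansion: start from the nodes a similarity search returned and follow typed edges to pull in what they depend on. The graph shape, the relation names, and the node ids are all hypothetical, chosen only to mirror the clause-to-defined-term example above.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical context graph: node -> list of (relation, neighbor) edges.
Graph = Dict[str, List[Tuple[str, str]]]


def expand_context(graph: Graph, seeds: List[str], max_hops: int = 1) -> List[str]:
    """Return the seed nodes plus every node reachable within max_hops edges."""
    ordered: List[str] = list(dict.fromkeys(seeds))  # de-duplicate, keep order
    seen: Set[str] = set(ordered)
    frontier = list(ordered)
    for _ in range(max_hops):
        next_frontier: List[str] = []
        for node in frontier:
            for _relation, neighbor in graph.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    ordered.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return ordered


# Toy data: a clause that uses a defined term, which is defined in another section.
graph: Graph = {
    "clause_7.2": [("uses_term", "term:Material Adverse Effect")],
    "term:Material Adverse Effect": [("defined_in", "section_1.1")],
}
```

With the toy graph above, `expand_context(graph, ["clause_7.2"])` returns the clause and its defined term, and passing `max_hops=2` also pulls in `section_1.1`. A flat chunk retriever would have returned the clause alone.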
3. Constrain the agent and close the feedback loop
Most systems let an agent act and assume the output is correct because it is plausible. The production systems that work do the opposite. Stripe's agents run inside sandboxes that cannot touch production, execute tests, read the results, and iterate. We build the same constraint-and-feedback architecture: bounded permissions, defined scope, automated verification of every action, and a loop that feeds the result back so the agent corrects itself before a human sees the output.[16] The constraint is not a limit on capability. It is what makes capability safe to deploy.
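Bounded permissions and defined scope reduce to a check that runs before every action. The sketch below is one simple way to express that check, assuming POSIX-style paths; the `Policy` fields, the action names, and the `/sandbox/repo` workspace are hypothetical and do not describe Stripe's devboxes or any specific product.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import FrozenSet


@dataclass(frozen=True)
class Policy:
    allowed_actions: FrozenSet[str]  # e.g. read, write, run_tests
    workspace: str                   # the only directory tree the agent may touch


def is_permitted(policy: Policy, action: str, path: str) -> bool:
    """Allow an action only if it is on the allow-list and stays inside the workspace."""
    if action not in policy.allowed_actions:
        return False
    target = PurePosixPath(path)
    if ".." in target.parts:         # reject traversal out of the workspace
        return False
    root = PurePosixPath(policy.workspace)
    return target == root or root in target.parents


policy = Policy(
    allowed_actions=frozenset({"read", "write", "run_tests"}),
    workspace="/sandbox/repo",
)
```

Here a write to `/sandbox/repo/src/app.py` passes, while an action that is not on the allow-list, or any path outside the workspace, is refused before it executes. The check is deliberately deny-by-default.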
None of this is set-and-forget. The pipeline runs continuously, monitoring retrieval quality, re-processing the knowledge base as sources change, and verifying that every citation an answer rests on points to real, current source material. Because the system retains feedback and re-grounds on fresh data, it gets more accurate as it runs, which is the exact capability MIT identified as missing from the systems that stall.[3]
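One way to check that a citation still points to real, current source material is to store a fingerprint of the source text at indexing time and compare it on every run. The sketch below assumes that approach; the `Citation` fields and the in-memory `current_sources` mapping are hypothetical simplifications of whatever document store a real pipeline would use.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Citation:
    source_id: str
    content_hash: str  # fingerprint of the source text when the passage was indexed


def fingerprint(text: str) -> str:
    """Stable SHA-256 fingerprint of a source document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stale_citations(citations: List[Citation], current_sources: Dict[str, str]) -> List[str]:
    """Return the source ids that are missing or whose text changed since indexing."""
    problems: List[str] = []
    for citation in citations:
        text = current_sources.get(citation.source_id)
        if text is None or fingerprint(text) != citation.content_hash:
            problems.append(citation.source_id)
    return problems
```

Any id this returns marks an answer that needs re-grounding: either the source was removed or it changed after the passage was indexed.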
The bottom line
Stripe, JPMorgan, General Mills, DBS, Commonwealth Bank, Mercado Libre, and Spotify share no industry, no geography, and no model that their competitors could not also buy. What they share is that each one engineered the context, the constraints, and the feedback around the model and won on that, while the 95% of stalled pilots and the roughly 80% of AI projects that fail, by the estimate RAND cites, optimized the part of the system that was never the bottleneck.[3][4]
The broader principle is older than AI. The reliability of any system is set by its infrastructure, not its most visible component, and AI has simply made the infrastructure expensive to ignore. The model is the part everyone can see and the part everyone can buy. The pipeline is the part that decides whether the model is right or merely plausible. As agents move from under 5% of enterprise applications toward a far larger share, the firms that treated the harness as the real work will compound their lead, and the firms that kept shopping for a better model will keep wondering why the same technology that transformed a competitor produced a stalled pilot for them.
The model was never the variable. The infrastructure around it always was.
References
[1] Medha Cloud, "60 Enterprise AI Statistics for 2026," 2026. https://medhacloud.com/blog/enterprise-ai-statistics-2026
[2] McKinsey & Company, "The State of AI in 2025: Agents, Innovation, and Transformation," November 2025. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai
[3] MIT NANDA Initiative, "The GenAI Divide: State of AI in Business 2025," August 2025. https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/
[4] RAND Corporation, "The Root Causes of Failure for Artificial Intelligence Projects," 2024. https://www.rand.org/pubs/research_reports/RRA2680-1.html
[5] Boston Consulting Group, "AI Transformation Is a Workforce Transformation," 2026. https://www.bcg.com/publications/2026/ai-transformation-is-a-workforce-transformation
[6] Stack Overflow, "2025 Developer Survey: AI," December 2025. https://survey.stackoverflow.co/2025/ai/
[7] Singapore Economic Development Board, "How DBS, Southeast Asia's Largest Bank, Is Capturing the Full Value of AI and Machine Learning in Singapore," September 2024. https://www.edb.gov.sg/en/business-insights/insights/how-dbs-southeast-asias-largest-bank-is-capturing-the-full-value-of-ai-and-machine-learning-in-singapore.html
[8] CNBC, "CEO of Southeast Asia's Top Bank DBS Says AI Adoption Already Paying Off," November 14, 2025. https://www.cnbc.com/2025/11/14/ceo-southeast-asias-top-bank-dbs-says-ai-adoption-already-paying-off.html
[9] Gartner, "Lack of AI-Ready Data Puts AI Projects at Risk," February 26, 2025. https://www.gartner.com/en/newsroom/press-releases/2025-02-26-lack-of-ai-ready-data-puts-ai-projects-at-risk
[10] Augment Code, "Harness Engineering for AI Coding Agents: Constraints That Ship Reliable Code," 2025–2026. https://www.augmentcode.com/guides/harness-engineering-ai-coding-agents
[11] Milvus, "What Is Harness Engineering for AI Agents?", April 2026. https://milvus.io/blog/harness-engineering-ai-agents.md
[12] Spotify Engineering, "1,500+ PRs Later: Spotify's Journey with Our Background Coding Agent (Honk, Part 1)," November 2025. https://engineering.atspotify.com/2025/11/spotifys-background-coding-agent-part-1
[13] Google Research, "Deeper Insights into Retrieval-Augmented Generation: The Role of Sufficient Context," ICLR 2025. https://research.google/blog/
[14] Atlan, "What Is RAG? How Retrieval-Augmented Generation Works," 2026. https://atlan.com/know/what-is-rag/
[15] Deloitte, "The State of AI in the Enterprise," March 2026. https://www.deloitte.com/us/en/what-we-do/capabilities/applied-artificial-intelligence/content/state-of-ai-in-the-enterprise.html
[16] Stripe Dev Blog, "Minions: Stripe's One-Shot, End-to-End Coding Agents," February 2026. https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents
[17] AI Monk, "12 Agentic AI Examples With Measurable ROI: Enterprise Case Studies From 2025–2026," April 2026. https://aimonk.com/agentic-ai-examples-enterprise-roi-case-studies/
[18] Informatica, "CDO Insights 2025: Racing Ahead on GenAI and Data Investments," January 2025. https://www.informatica.com/blogs/cdo-insights-2025-global-data-leaders-racing-ahead-despite-headwinds-to-being-ai-ready-latest-supply-finds.html
[19] DORA, "DevOps Research and Assessment 2025 Report," 2025. https://dora.dev/research/2025/
[20] Computer Weekly, "Australia's CommBank Completes Migration of Data to AWS in AI Drive," June 2025. https://www.computerweekly.com/news/366625205/Australias-CommBank-completes-migration-of-data-to-AWS-in-AI-drive. Fraud reduction figure from: CommBank Newsroom, "CommBank Develops AI Agent That Spots New Fraud and Helps Build Defences," April 2026. https://www.commbank.com.au/articles/newsroom/2026/04/ai-agent-spots-fraud-in-real-time.html
[21] PYMNTS, "Mercado Libre Says AI Investments Support 45% Revenue Surge," February 24, 2026. https://www.pymnts.com/earnings/2026/mercado-libre-says-ai-investments-support-45-revenue-surge/
[22] MIT Sloan Management Review, "Seizing Opportunity in Data Quality," 2017. https://sloanreview.mit.edu/article/seizing-opportunity-in-data-quality/
[23] Futurum Group, "Enterprise AI ROI Shifts as Agentic Priorities Surge," February 17, 2026. https://futurumgroup.com/press-release/enterprise-ai-roi-shifts-as-agentic-priorities-surge/
By Tricky Wombat
Last Updated: Jun 26, 2026