RAG beats fine-tuning for most enterprise use cases, and the data backs it
What Morgan Stanley, Klarna, and a $10 million Bloomberg model reveal about where to spend first
Enterprise teams keep framing model customization as a contest between RAG and fine-tuning, then spend months arguing the wrong question. In a survey of 600 U.S. IT decision-makers, RAG reached 51% of enterprise AI deployments in production while fine-tuning accounted for 9%.[2] Enterprise generative AI spending rose sixfold from $2.3 billion in 2023 to $13.8 billion in 2024, then nearly tripled to $37 billion in 2025.[2][3] Yet 88% of organizations use AI in at least one business function and only 6% reach measurable EBIT impact.[10] The gap is not adoption and it is not the model. For most enterprise use cases RAG beats fine-tuning on cost and freshness, and fine-tuning is a premature optimization, because the variable that decides outcomes is the quality of your knowledge corpus, not the customization technique you bolt on top of it.
Key Points
RAG is in production at 51% of enterprise AI deployments versus 9% for fine-tuning, a near 6x adoption gap in a 600-respondent sample.[2]
Lessons Learned
Start with retrieval. Ship a RAG system over your existing documents before you consider training anything, because RAG delivers immediate accuracy gains without data-labeling and retraining overhead.[13]
What is the RAG vs fine-tuning decision really about?
RAG, retrieval-augmented generation, feeds a model relevant documents at query time from an external index. Fine-tuning bakes new behavior into the model's weights through additional training. RAG entered enterprise vocabulary through a 2020 Meta AI Research paper, became a standard AI budget line item by 2024, and by 2025 sat in Gartner's AI Hype Cycle as a named capability.[1] The global RAG market was valued at roughly $1.2 billion in 2023 and is projected to reach $11 billion by 2030, a 49.1% compound annual growth rate.[1]
The decision matters because the market has already voted with its budgets, and most teams still treat customization as a model problem. Prompt design is the dominant technique in enterprise deployments, RAG follows, and fine-tuning, tool calling, and reinforcement learning remain niche, used primarily by frontier teams.[3] That pattern held across both the 2024 and 2025 Menlo Ventures surveys, so this is not a transitional phase. Fine-tuning stays specialist even as the market matures.
What does the data show?
On raw capability the two techniques are not really competitors. A benchmark across Llama2-13B, GPT-3.5, and GPT-4 on an agriculture domain dataset found a base model at 75% accuracy, fine-tuning alone lifting it about 6 percentage points, and a combined fine-tuning plus RAG approach adding roughly 5 more.[13] The gains were additive and separable. That reframes the question. You are not picking a winner, you are picking what to reach for first, and for an enterprise starting cold, RAG delivers immediate gain without labeling data or running training jobs.
Cost is where the separation gets sharp. A May 2026 study assessing RAG and fine-tuning for industrial question answering in the automotive sector, using two closed BMW Group datasets, found RAG the most effective and cost-efficient adaptation method for both closed-source and open-source models.[14] Open-source models enhanced with RAG reached quality comparable to premium out-of-the-box models. That result lands in a domain where fine-tuning intuitively seems mandatory, high-specificity manufacturing data, and RAG still came out ahead on cost-efficiency.
What are practitioners reporting?
The people building these systems are using AI fast and trusting it slowly. Among 48,957 developers surveyed across 177 countries, 84% use or plan to use AI tools and 51% of professional developers use them daily, but only 3.1% highly trust AI tool accuracy and 45.7% somewhat or highly distrust the output.[12] The top frustration, cited by 66%, was "AI solutions that are almost right, but not quite."[12] That trust gap is the practical case for retrieval. RAG responses trace back to source documents. Fine-tuned model outputs cannot be audited the same way.
Buyers rank the criteria accordingly. Among 600 enterprise IT decision-makers, "measurable value delivery" was the top model selection criterion at 30% and "industry-specific customization" second at 26%, with price last at 1%.[2] Measurable value favors the approach that ships faster and produces auditable answers. Deloitte's survey of 3,235 leaders captured the wider mood as "rising AI spend but elusive ROI," with 66% reporting productivity gains but only 20% already growing revenue through AI against 74% who hope to.[11] When the burden of proof rises, expensive customization gets deferred.
What does this look like in practice?
The argument holds up best against named deployments. Three cases below run from a finance firm that could have fine-tuned anything, to a fintech that found RAG's ceiling, to a data company whose $10 million bet aged in months. Each shows a different face of the same pattern.
Morgan Stanley built a retrieval layer instead of a model
Morgan Stanley's roughly 16,000 financial advisors had a research library of more than 350,000 documents and no fast way to retrieve from it during client conversations, so follow-up tasks that needed a document lookup took days.[17] Every day, advisors worked around the bottleneck by hand. The firm deployed a RAG system on GPT-4 that indexed over 100,000 documents, with a human-in-the-loop design where advisors review AI output before it reaches a client.[20] It fine-tuned nothing. Because the retrieval layer ingested new research and regulatory changes immediately, the system stayed current without retraining, and because every answer traced to a source document, it satisfied compliance review. Over 98% of advisor teams now use the assistant, and the firm attributed $64 billion in net new assets in Q3 2024 and 100,000 new clients to the efficiency gains.[17][20] Enterprise software rarely passes 70% active adoption after sustained rollout campaigns, which makes 98% the detail worth remembering. A firm that could afford to train any model chose retrieval, because its knowledge changed faster than any training cycle could track.
Klarna automated two-thirds of support, then found the line
Klarna handles high-volume multilingual support across 23 markets and more than 35 languages, against a product surface of disputes, refund rules, and country-specific regulation that changes constantly.[21] Its AI assistant ran on a GPT-4-class model with three layers: a RAG knowledge layer over a vector database of clean documentation, an action layer with API access to execute transactions, and an orchestration layer routing between AI and humans.[18] Launched February 2024, it handled 2.3 million conversations in its first month, the workload of 700 full-time agents, automated 67% of conversations, and cut average resolution time from 11 minutes to under 2.[21] Klarna projected $40 million in profit improvement for the year. Then, by early 2025, it reintroduced human support for the roughly 5% of complex, emotionally sensitive cases where CSAT had slipped.[22] That 5% rollback is the most instructive moment in the whole case. RAG worked because most support is pattern-matching against clean, frequently-updated policy, and the team named the boundary instead of pretending it did not exist.
Bloomberg's $10 million model aged in months
In 2023 Bloomberg faced a real problem: existing LLMs handled financial terminology and reasoning poorly, which seemed to justify a custom model.[26] Every day, that gap cost the firm in specialized tasks general models fumbled. So Bloomberg invested roughly $10 million training a GPT-3.5-class model from scratch on its proprietary financial corpus and documented it in a widely cited paper. Shortly after release, GPT-4 arrived and outperformed the domain-specific model on nearly all the financial tasks used to justify the project. The firm's own retrospective conclusion pointed away from from-scratch training toward fine-tuning or retrieval on top of strong base models as they appear. This was not a failure. It was a reasonable 2022 bet with a shelf life measured in months. Organizations that build RAG systems escape that obsolescence, because they can swap the underlying model without rebuilding their knowledge infrastructure.
What patterns emerge across these cases?
Two patterns repeat. First, the winning deployments engineered information flow, not model internals. Morgan Stanley and Klarna both put a retrieval layer over clean, structured, continuously-updated data and let the model stay generic. Second, the technique choice was inseparable from governance. Source-traceable RAG output enabled the human review that compliance required, which a fine-tuned model's weights could never provide. The Klarna 5% and Bloomberg's obsolescence are the honest counterweights. RAG has a ceiling on complex reasoning, and even strong base models drift, but neither pattern argues for baking knowledge into weights. They argue for owning the pipeline that feeds the model.
What separates organizations that get results from those that don't?
Zoom out from individual cases and the divide is about foundations, not models. Among organizations using AI, only 6% qualify as high performers with measurable EBIT impact, and they are about three times as likely to have fundamentally redesigned workflows around AI.[10] Three blockers account for most of the gap: fragmented data and legacy infrastructure, workflows never redesigned for AI, and no clear scaling priorities.[10] All three are information architecture problems. None is a model problem.
The spending data makes the point in dollars. Organizations with successful AI initiatives invest up to four times more, as a share of revenue, in data quality, governance, and AI-ready infrastructure than organizations with poor outcomes, and those at the highest data maturity reach up to 65% better business outcomes.[8] The 4x differential sits in data foundations, not in model customization. If data investment predicts outcomes that strongly, a team debating RAG versus fine-tuning before fixing its corpus is optimizing the wrong variable.
What do practitioners consistently report?
The largest documented corpus of production work points the same direction. An analysis of 1,200+ production LLM deployments found that sophisticated retrieval architectures remain the most central element of successful systems, and named the governing principle outright: "context engineering > prompt engineering."[16] Fine-tuning showed up as a secondary optimization, not a primary architectural decision. Across that whole corpus, the discipline of architecting how information reaches the model mattered more than any specific retrieval method or model choice.
What drives the gap between strong and weak outcomes?
The gap compounds. Enterprise generative AI spending nearly tripled to $37 billion in 2025.[3] Despite that, returns concentrate. McKinsey put the average AI return at $3.70 per dollar spent, while the top 6% of organizations exceed $10.30 per dollar, nearly 3x the average.[10] Those high performers are five times more likely to allocate over 20% of their digital budgets to AI and nearly three times more likely to have redesigned workflows around it. Doing this well pays back at a multiple. Doing it poorly funds a proof of concept that gets abandoned, and at least 50% of generative AI projects were abandoned after proof of concept by the end of 2025, against Gartner's original 30% forecast.[5][6]
Why does the same architecture produce different results?
Here is the reframe. The customization technique is not where reliability is decided. The corpus is.
A 2025 JMIR Cancer study ran the same RAG architecture across three corpus conditions and measured hallucination. With a curated, domain-specific knowledge base, hallucination ran 0-3%. With general web search, 13-23%. With no retrieval at all, 37-40%.[25] The base models did not change, and the two retrieval conditions used the same retrieval method. What changed was the source feeding the model, and hallucination moved by more than ten times between the curated corpus and no retrieval. A second domain confirms the structural point: peer-reviewed evaluation of commercial legal AI tools found hallucinations in 17-33% of cases, traced primarily to source quality and retrieval edge cases rather than the model.[36] Two independent domains, clinical and legal, make the same argument. The architecture is not the variable. The corpus is.
This is why fine-tuning reads as a premature optimization. It spends money encoding knowledge into model weights at a snapshot in time, when the leverage was in the information pipeline feeding the model all along. The organizational data closes the loop. Successful initiatives invest up to 4x more in data foundations, the production-deployment corpus ranks context engineering above model selection, and 63% of organizations either lack or are unsure they have the right data management practices for AI.[8][16][7] Gartner predicts organizations will abandon 60% of AI projects unsupported by AI-ready data through 2026.[7] Once you see that corpus quality, not customization technique, is the variable that moves outcomes, the entire RAG-versus-fine-tuning debate resolves into a question about your data pipeline.
What does a successful implementation look like?
A major European bank shows the reframe paying off. The institution carried heavy regulatory compliance obligations and deployed a RAG-based platform to automate risk detection, document analysis, and compliance documentation, with no fine-tuning and all customization in the retrieval and structured-output layers.[24] Because source-traceability of every generated output was preserved, a requirement RAG satisfies by design, the system met audit standards out of the box. The vendor-reported results: EUR 20 million saved over three years, the equivalent of 36 full-time employees freed from routine compliance review, and return on investment within two months.[24]
The cleaner technical proof comes from LinkedIn, which combined RAG with a knowledge graph built from historical support tickets, preserving both the structure inside each issue and the relationships between issues.[15] When a new ticket arrived, the system retrieved structurally relevant past resolutions, not just semantically similar text. Median per-issue resolution time dropped 28.6% and retrieval quality, measured by Mean Reciprocal Rank, improved 77.6%, with results validated across benchmark datasets and six months of live deployment.[15] Set the LinkedIn and bank results against Bloomberg's $10 million obsolescence and the variable announces itself. The cases that won engineered the information structure feeding a generic model. The case that struggled paid to encode knowledge into weights.
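For readers unfamiliar with the metric, Mean Reciprocal Rank can be sketched in a few lines. This is a minimal illustration of the standard definition, not LinkedIn's evaluation code; the function name and the ticket ids in the example are hypothetical.

```python
from typing import Sequence


def mean_reciprocal_rank(
    ranked_results: Sequence[Sequence[str]], relevant: Sequence[str]
) -> float:
    """Average of 1/rank of the relevant item per query (0 when it is missing)."""
    if len(ranked_results) != len(relevant):
        raise ValueError("need exactly one relevant id per query")
    if not ranked_results:
        return 0.0
    total = 0.0
    for results, target in zip(ranked_results, relevant):
        for rank, doc_id in enumerate(results, start=1):
            if doc_id == target:
                total += 1.0 / rank  # only the first hit counts
                break
    return total / len(ranked_results)


if __name__ == "__main__":
    # Hypothetical ticket ids: three queries, each with one known-good resolution.
    retrieved = [["t3", "t1", "t9"], ["t4", "t2"], ["t7", "t8"]]
    gold = ["t1", "t4", "t5"]
    print(mean_reciprocal_rank(retrieved, gold))
```

In the example the relevant ticket sits at rank 2, rank 1, and nowhere, so the score is (1/2 + 1 + 0) / 3 = 0.5. A higher value means the right past resolution tends to appear nearer the top of the list.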
What are the economics of RAG vs fine-tuning?
Both sides of the ledger favor retrieval. On the cost of getting it wrong, fine-tuning's structure hides a multiplier. Documented enterprise fine-tuning cost breaks down as compute (35-50%), data preparation (20-30%), engineering time (15-25%), storage and infrastructure (10-15%), and ongoing maintenance and retraining (5-10%).[30] The maintenance line is the one teams underestimate, and some organizations spend more on retraining than on the original run, because every policy change, product update, or new feature forces a new training cycle.[30] Fine-tuning converts a one-time customization cost into a recurring operational one.
Per query, the gap is wide. At full 128K context utilization a query costs about $0.40, while the same retrieval outcome from RAG, pulling roughly 2,000 relevant tokens, costs about $0.006, a differential of more than 60x.[29] For a system handling 1 million queries a year, that is roughly $400,000 against $6,000, a gap of about $394,000 a year in inference alone. The cost of inaction sits upstream in the data: among early GenAI adopters, 55% cite time-consuming data management, 52% report data quality issues, and only 11% have more than half their unstructured data ready for LLM use.[27]
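The per-query figures above scale linearly with volume, which a short sketch makes easy to rerun against your own traffic. The two per-query costs are the ones quoted in the text; the function names are illustrative, and a flat per-query cost is a simplifying assumption.

```python
# Per-query figures quoted in the text; everything else here is illustrative.
LONG_CONTEXT_COST = 0.40  # USD per query at full 128K context
RAG_COST = 0.006          # USD per query with roughly 2,000 retrieved tokens


def annual_cost(cost_per_query: float, queries_per_year: int) -> float:
    """Inference spend for one year, assuming a flat per-query cost."""
    return cost_per_query * queries_per_year


def annual_gap(queries_per_year: int) -> float:
    """Yearly difference between long-context and RAG inference spend."""
    return annual_cost(LONG_CONTEXT_COST, queries_per_year) - annual_cost(
        RAG_COST, queries_per_year
    )


if __name__ == "__main__":
    queries = 1_000_000
    print(f"long context: ${annual_cost(LONG_CONTEXT_COST, queries):,.0f}")
    print(f"RAG:          ${annual_cost(RAG_COST, queries):,.0f}")
    print(f"gap:          ${annual_gap(queries):,.0f}")
    print(f"ratio:        {LONG_CONTEXT_COST / RAG_COST:.1f}x")
```

At 1 million queries a year the gap works out to about $394,000. Dividing the two quoted costs gives a ratio closer to 67x than 60x.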
On the return side, the numbers reward getting it right. Among 1,900 early-stage adopters, 92% report their AI investments already pay for themselves and 93% call their initiatives mostly or very successful, with South Korea leading RAG adoption at 82%.[27] McKinsey's high performers clear $10.30 per dollar invested.[10] Fine-tuning earns its place only at the extreme: above roughly 100,000 queries per day on a stable, narrow task where a small fine-tuned model runs at 10-50x lower per-query cost than a large model with appended context.[29] Below that volume, or for any knowledge that changes over weeks or months, RAG delivers lower total cost of ownership. The economics point where the outcome data already pointed, toward the corpus and the pipeline that serves it.
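The rule of thumb in the paragraph above can be written down as a small decision helper. This is a sketch of the heuristic as stated, not a substitute for a total-cost model. The 100,000 queries-per-day threshold comes from the text, while the `Workload` fields and the one-year cutoff for "stable" knowledge are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    queries_per_day: int
    days_between_knowledge_changes: int
    narrow_stable_task: bool


def first_technique(
    w: Workload,
    volume_threshold: int = 100_000,  # threshold quoted in the text
    min_stable_days: int = 365,       # assumption: "stable" means about a year between changes
) -> str:
    """Return 'rag' unless volume, stability, and task shape all favor fine-tuning."""
    high_volume = w.queries_per_day > volume_threshold
    slow_changing = w.days_between_knowledge_changes >= min_stable_days
    if high_volume and slow_changing and w.narrow_stable_task:
        return "consider fine-tuning"
    return "rag"


if __name__ == "__main__":
    # Hypothetical workloads.
    print(first_technique(Workload(5_000, 30, False)))
    print(first_technique(Workload(250_000, 730, True)))
```

All three conditions must hold before the helper suggests fine-tuning, so any workload with fast-changing knowledge falls back to retrieval regardless of volume.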
How do you fix the customization problem?
The thesis restated: outcomes ride on corpus quality and how information flows to the model, not on whether you fine-tune. That is the problem we built the Tricky Wombat pipeline to solve. We frame the work around the retrieval pipeline, not the model, because the model is the part you should be able to swap without rebuilding anything. Three requirements decide whether that pipeline holds up in production.
1. Ingest and structure the corpus before indexing
Most systems index raw documents and hope vector similarity sorts it out, which produces the naive-RAG floor where enterprise deployments hit only 10-40% task success.[29] We parse, transform, and index, cleaning and structuring source content and attaching metadata before anything reaches the vector store. The corpus condition is the variable that swings hallucination from 3% to 40%, so the pipeline treats corpus preparation as the primary engineering surface, not a preprocessing afterthought.
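As a rough picture of what "parse, transform, and index" means before anything reaches a vector store, the sketch below cleans a raw document, splits it into overlapping chunks, and attaches metadata to each chunk. It is a simplified illustration using only the standard library, not the Tricky Wombat pipeline itself; the function names, metadata fields, and chunk sizes are all assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    metadata: dict


def clean(raw: str) -> str:
    """Drop HTML-like tags and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()


def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word windows that share `overlap` words with the previous one."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end
        start += max_words - overlap
    return chunks


def prepare_document(doc_id: str, raw: str, source: str, updated: str) -> list[Chunk]:
    """Clean, chunk, and tag one document so each chunk can be traced and filtered."""
    return [
        Chunk(text=piece, metadata={"doc_id": doc_id, "source": source,
                                    "updated": updated, "chunk_index": i})
        for i, piece in enumerate(chunk_words(clean(raw)))
    ]
```

The metadata is what later makes filtering by source or freshness possible, and what lets an answer point back to the document and chunk it came from.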
2. Retrieve with hybrid search and reranking, not vector similarity alone
Most systems run pure vector search and accept whatever comes back. Hybrid search that combines keyword and dense vector retrieval, merged with reciprocal rank fusion, outperforms pure vector search on enterprise tasks by 15-30%, and a cross-encoder reranking pass adds about 23.4% retrieval accuracy on top.[29] We layer both, then add structure-aware retrieval where the data warrants it. On schema-bound queries with complex aggregations, GraphRAG reached over 90% accuracy where standard vector RAG scored near zero.[31]
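Reciprocal rank fusion itself is simple enough to show in full. The sketch below merges a keyword ranking and a dense-vector ranking by summing 1 / (k + rank) per document; k = 60 is the constant commonly used for RRF, and the document ids are hypothetical. The cross-encoder reranking pass would then run over the fused list and is not shown here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked id lists; each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    keyword_hits = ["a", "b", "c"]  # e.g. from a BM25-style keyword index
    dense_hits = ["b", "d", "a"]    # e.g. from a vector index
    print(reciprocal_rank_fusion([keyword_hits, dense_hits]))
```

Documents that appear in both lists ("a" and "b") outrank documents that appear in only one, which is the behavior that makes fusion robust when either retriever misses. Because RRF uses ranks rather than raw scores, the two retrievers do not need comparable scoring scales.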
3. Inject structured state alongside retrieved documents
Most systems retrieve policy text and stop, which is why a pure knowledge layer underperforms. Klarna's results depended on injecting each user's account state, structured data, alongside retrieved policy.[18] The pipeline carries both the unstructured retrieval and the structured context the query needs, so the model answers against the actual situation, not a generic document match.
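One way to picture this step is a prompt builder that places the user's structured state next to the retrieved passages. This is a hypothetical sketch, not Klarna's or any vendor's implementation; the field names, passage ids, and prompt wording are assumptions.

```python
import json


def build_prompt(question: str, account_state: dict, passages: list[dict]) -> str:
    """Combine structured account state and retrieved passages into one prompt."""
    state_block = json.dumps(account_state, indent=2, sort_keys=True)
    source_block = "\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    return (
        "Answer using only the account state and sources below. "
        "Cite source ids in square brackets.\n\n"
        f"ACCOUNT STATE:\n{state_block}\n\n"
        f"SOURCES:\n{source_block}\n\n"
        f"QUESTION: {question}\n"
    )


if __name__ == "__main__":
    # Hypothetical account state and policy passage.
    prompt = build_prompt(
        "Can I still get a refund on my last order?",
        {"order_id": "A-1001", "days_since_delivery": 9, "refund_eligible": True},
        [{"id": "policy-12", "text": "Refunds are available within 14 days of delivery."}],
    )
    print(prompt)
```

With the account state in the prompt, the model can answer for this order specifically instead of restating the general refund policy.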
The pipeline runs continuously. It monitors retrieval quality, re-processes the corpus as source documents change, and verifies that every answer traces to a citable source. The phased path matters here: teams that skip the hybrid-search and reranking foundation ship systems stuck at the 10-40% success floor, while those that build it before adding agentic orchestration climb steadily.[29] Because the knowledge layer updates without retraining, the system gets more accurate as the corpus improves, with no model rebuild required.
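The traceability check described above can be approximated with a simple guard: every citation in an answer must point at a chunk that was actually retrieved for that query. The sketch assumes answers cite sources as bracketed ids such as [policy-12], which is an illustrative convention. It verifies that citations resolve, not that the cited text supports the claim.

```python
import re

# Assumed citation format: ids made of letters, digits, hyphens, or underscores in brackets.
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def untraceable_citations(answer: str, retrieved_ids: set[str]) -> set[str]:
    """Cited ids that do not match any chunk retrieved for this query."""
    return set(CITATION.findall(answer)) - retrieved_ids


def is_traceable(answer: str, retrieved_ids: set[str]) -> bool:
    """True only if the answer cites at least one source and every citation resolves."""
    cited = set(CITATION.findall(answer))
    return bool(cited) and cited <= retrieved_ids


if __name__ == "__main__":
    retrieved = {"policy-12", "policy-3"}  # hypothetical chunk ids
    print(is_traceable("Refunds take up to 14 days [policy-12].", retrieved))
    print(untraceable_citations("See [policy-99].", retrieved))
```

An answer with no citations at all fails the check, so uncited output can be routed to review rather than shipped.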
The bottom line
Every case here, the finance firm with 350,000 documents, the fintech that automated two-thirds of support, the bank that met audit standards and reached return on investment within two months, and the data company whose $10 million model aged in months, turns on the same variable: the quality and structure of the knowledge feeding the model, not the technique stamped on top of it. RAG wins for most enterprises because it lets you fix that variable without paying to re-encode it into model weights every time it changes.
The broader principle reaches past this one decision. Organizations that treat AI as a model problem keep buying bigger models and abandoning projects after proof of concept, and at least half of generative AI projects met that fate by the end of 2025. Organizations that treat it as an information-infrastructure problem invest up to 4x more in data foundations, and the top performers clear $10.30 per dollar. The technology is converging. The same foundation models sit behind everyone's deployment, which means the model is the part that no longer differentiates you.
What differentiates you is the corpus you point it at and the pipeline that serves it. The companies that understand this are building retrieval infrastructure now. The ones that don't are fine-tuning a snapshot that the next base model will erase.
By Tricky Wombat
Last Updated: Jun 21, 2026