Prompt Rewriting
A 2025 error classification study found that over 60% of RAG errors originate in chunking and retrieval, not generation. The most frequent failure type was missed retrieval, where the system never found the right documents in the first place. Most teams respond by upgrading the model. The model is not the problem. Prompt rewriting addresses retrieval failures directly: before the query reaches the vector database, an LLM reformulates it into a form that better matches how information is stored. That means decomposing complex questions into sub-queries, generating multiple phrasings to activate different regions of embedding space, or expanding short keyword inputs into the topical density that dense retrieval requires. Microsoft's AI Search team reported a 22-point NDCG improvement when combining query rewriting with semantic reranking. The technique is not new. Treating it as a first-class infrastructure component in the retrieval pipeline, rather than an optional optimization, is.
Most teams troubleshooting a bad RAG answer start with the model. They swap in a larger LLM, tweak the system prompt, adjust the temperature. A 2025 error classification study found that over 60% of RAG errors originate in the chunking and retrieval stages, not generation.[1]
The single most frequent error type was "Missed Retrieval," where the system never found the right documents in the first place. Generation-stage errors accounted for 31.7%. The retrieval pipeline is the primary failure surface, and query rewriting is one of the most direct ways to address it.
Why does retrieval fail on well-indexed data?
The gap between a user's query and the stored information is wider than most teams assume. Production data from an enterprise presales assistant showed 20% of queries were under five words.[2]
Short, keyword-style inputs share little vocabulary with the dense, structured documents they need to match against. Barnett et al. studied three real RAG deployments[2] spanning 15,000 documents and 1,000 validated question-answer pairs. They catalogued seven failure points and found that the first two, "Missing Content" and "Missed in Top Results," are both query-retrieval mismatch failures.
The mismatch runs deeper than term overlap. Embedding models encode short queries and long documents into different regions of vector space.
A five-word question about quarterly revenue projections lands far from the 800-word analyst memo that contains the answer, even when both discuss the same topic. The query lacks the topical density, structure, and supporting vocabulary that the document carries.
The business cost is straightforward. If 60% of wrong answers trace back to retrieval, then improving model quality addresses at most 40% of failures. You can upgrade from one version of a foundation LLM to the newest and still get bad answers when the context window is filled with irrelevant documents.
How does prompt rewriting change what the LLM finds?
LLM-based query reformulation is the general technique: an LLM rewrites a user's query into a form that better matches how information is stored before the retrieval step runs. Several specific strategies fall under this umbrella, each targeting a different aspect of the query-document gap.
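The reformulation step itself is just an instruction to an LLM before retrieval runs. A minimal sketch, where the prompt wording is an illustrative template (not a published prompt) and `build_rewrite_prompt` is a hypothetical helper:

```python
# Illustrative rewrite prompt; the exact wording is an assumption,
# not a template from any of the cited papers.
REWRITE_PROMPT = (
    "Rewrite the user's search query so it matches how answers are "
    "phrased in a document corpus. Expand abbreviations, add likely "
    "topical vocabulary, and preserve the original intent.\n\n"
    "Query: {query}\n"
    "Rewritten:"
)

def build_rewrite_prompt(query: str) -> str:
    """Fill the rewrite template; the result is sent to the LLM,
    and the LLM's completion replaces the raw query at retrieval time."""
    return REWRITE_PROMPT.format(query=query)
```

The rewritten query, not the user's raw input, is what gets embedded and sent to the vector database.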
Query decomposition breaks a complex question into self-contained sub-queries, each retrieving independently. Results are aggregated and deduplicated. This works well for multi-hop reasoning where the answer spans multiple documents. Sub-queries can run sequentially (chained, where each builds on the previous result) or in parallel.

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026
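The aggregate-and-deduplicate step after decomposition is simple to sketch. This assumes each sub-query has already returned an ordered list of document ids; the LLM decomposition call and the retriever itself are out of scope here:

```python
def merge_subquery_results(result_lists):
    """Aggregate ordered result lists from independent sub-queries,
    deduplicating by document id while preserving first-seen rank."""
    seen, merged = set(), []
    for results in result_lists:
        for doc_id in results:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

# Two sub-queries of a multi-hop question return overlapping hits.
hits = merge_subquery_results([["doc-3", "doc-7"], ["doc-7", "doc-1"]])
# → ["doc-3", "doc-7", "doc-1"]
```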
Multi-query generation, sometimes called RAG-Fusion, takes the opposite approach. Instead of splitting the question, the LLM generates multiple rephrasings of the same question from different angles. Each variant retrieves independently, and results merge through reciprocal rank fusion. Different phrasings activate different embedding space regions, which increases recall.

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026
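Reciprocal rank fusion itself fits in a few lines. Each document earns 1/(k + rank) per variant list it appears in, so documents ranked consistently well across variants rise to the top; k=60 is the constant from the original RRF formulation:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists from several query variants into one ranking.
    A document's score is the sum of 1/(k + rank) over every list
    in which it appears."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Three rephrasings of the same question retrieve from different angles;
# "a" ranks first overall because two variants put it at rank 1.
fused = reciprocal_rank_fusion([
    ["a", "b", "c"],
    ["b", "c", "a"],
    ["a", "c", "b"],
])
# → ["a", "b", "c"]
```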
There are many prompt rewriting techniques, and each targets a different performance lever. They share a common principle: the query that reaches the retriever should resemble how the answer is stored, not how the user asked the question. Whether that means splitting the question apart (decomposition) or asking it five different ways (multi-query), the goal is the same: close the semantic gap before retrieval, so the model gets better context to work with.

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026
What does this mean for production RAG systems?
The most significant production evidence comes from Microsoft's AI Search team, who reported that query rewriting combined with a new semantic ranker delivered a 22-point NDCG@3 improvement over their hybrid baseline.[3]
Latency is the primary tradeoff. Every rewriting technique adds at least one LLM inference call before retrieval starts.
Two broader trends emerge:
- The generation model is becoming a commodity. As open-source and commercial LLMs converge on capability, the differentiator shifts to what goes into the context window. Teams that treat retrieval quality as a first-class engineering problem, not an afterthought, will pull ahead.
- The research consistently shows that combined strategies (rewriting + reranking + hybrid search) outperform any single technique.
How should you evaluate query rewriting for your pipeline?
- Measure retrieval quality independently from generation quality. If you only evaluate end-to-end answer accuracy, you cannot tell whether a bad answer came from a retrieval miss or a generation error. Track retrieval metrics (nDCG, MRR, hit rate) at the retrieval stage before the LLM runs.
- Test rewriting against a no-rewrite baseline on your own data. Technique performance is dataset-dependent.
- Start with two to three rewrites and measure from there. You likely do not need 10 query variants.
- Consider which model to use for rewriting. The rewriting model shapes every downstream retrieval, so the team choosing it should understand how rewrite quality, cost, and latency trade off.
- Combine rewriting with reranking and hybrid search rather than relying on any single technique. The highest composite RAG scores in systematic comparisons came from combined pipelines, not standalone optimizations.[4]
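The first recommendation above, tracking retrieval metrics before the LLM runs, needs only a few lines of evaluation code. A self-contained sketch of MRR, hit rate, and binary-relevance nDCG; a real eval harness would feed in your retriever's ranked lists and a gold set of relevant document ids:

```python
import math

def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant document (0.0 if none)."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def hit_rate_at_k(ranked, relevant, k):
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0

def ndcg_at_k(ranked, relevant, k):
    """nDCG with binary relevance: gain 1 per relevant doc, log2 discount."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1)
              if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

Running these per query, averaged over a validation set, isolates retrieval regressions from generation regressions when you A/B a rewriting strategy.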
The research on query rewriting converges on a point that should change how teams allocate their RAG optimization budgets. Consistent 15-45% relative improvements appear across independent studies, different datasets, and different techniques. Those gains compound when combined with reranking and hybrid retrieval.
The model is not the bottleneck for most production RAG systems. The context pipeline is. If you are spending engineering cycles on prompt tuning and model upgrades while your retrieval stage silently returns the wrong documents, you are optimizing the wrong layer.
If your RAG pipeline passes the user's raw query straight to the vector database, retrieval quality depends entirely on how well the user phrases the question. In production data, one in five queries runs under five words. Prompt rewriting closes the gap between how users ask and how documents are stored, before retrieval runs. Tricky Wombat's context engineering pipeline applies query reformulation, decomposition, and reranking as standard infrastructure, not afterthoughts. A 30-minute call will show where your retrieval stage is silently returning the wrong documents and what rewriting would change. Schedule a call.
References
- "Classifying and Addressing the Diversity of Errors in RAG." arXiv:2510.13975, 2025. https://arxiv.org/abs/2510.13975
- Barnett, S. et al. "Seven Failure Points When Engineering a RAG System." IEEE/ACM CAIN 2024. arXiv:2401.05856. https://arxiv.org/abs/2401.05856
- Microsoft Azure AI Search team. "Introducing query rewriting and semantic ranker v2." Microsoft Tech Community Blog, November 2024. (Vendor-produced.)
- Wang, X. et al. "Searching for Best Practices in RAG." EMNLP 2024. (Fudan University)