Query intent
Your RAG system treats every question the same. That is why it hallucinates.
State-of-the-art retrieval-augmented generation systems answer only 63% of questions without hallucinating.[2] At the same time, 65% of organizations now use generative AI regularly, and inaccuracy is the most commonly reported risk across all of them.[1]
The cause is specific: most RAG systems process every query through one retrieval pipeline, regardless of what the user is actually asking. A user asking "What's our parental leave policy?" gets the same processing as "Draft a competitive analysis of our Q3 positioning." One needs a document lookup. The other needs multi-step reasoning across multiple sources. Treating them identically wastes compute on the simple query and starves the complex one of the retrieval depth it needs.
Query intent classification, a lightweight decision layer that determines the type of answer needed before choosing how to produce it, closes that gap. Systems that implement it show significantly lower hallucination rates on in-domain questions.[3]
The fix is not a bigger model. It is a 5-to-50-millisecond classifier that asks a question the current pipeline ignores: what does this user actually need?
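To make the idea concrete, the decision layer can be sketched as a few patterns checked before any retrieval runs. Everything here is illustrative, not a production taxonomy: the labels, the trigger patterns, and the default bucket are all assumptions for the sake of the example.

```python
import re

# Hypothetical intent labels and trigger patterns; a real taxonomy would be
# derived from logged queries, not hand-written rules like these.
INTENT_PATTERNS = {
    "transactional": re.compile(r"\b(reset|cancel|update|delete|unsubscribe)\b", re.I),
    "navigational":  re.compile(r"\b(where is|link to|open the|find the page)\b", re.I),
    "procedural":    re.compile(r"\b(how do i|how to|steps to)\b", re.I),
}

def classify_intent(query: str) -> str:
    """Tag a query with an intent label before any retrieval happens."""
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(query):
            return intent
    return "informational"  # default bucket; most traffic lands here
```

This is the rule-based end of the spectrum: microsecond latency, but brittle on any phrasing the patterns did not anticipate.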
Why does a single retrieval pipeline fail on mixed-intent queries?
The Meta CRAG benchmark tested state-of-the-art RAG solutions across diverse question types and found a 37% hallucination rate.[2] That number is worth sitting with. These are not demos or edge-case stress tests. They are the best retrieval-augmented systems available, failing on more than a third of questions.

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026
The standard enterprise deployment embeds the query, searches a vector store, stuffs the top-k results into the context window, and generates. This works for factual lookups where the answer sits in a single chunk. It falls apart for queries that require multi-hop reasoning, structured data access, or no retrieval at all.
Gartner's 2024 Digital Worker Survey found that 34% of employees struggle to find the information they need, even when using advanced AI tools.[4] The problem is not missing data or weak models. The system cannot tell the difference between a query that needs dense vector search over a knowledge base and a query that needs a SQL join, a graph traversal, or a polite refusal.
Every misrouted query has a cost. A factual question sent to a creative model hallucinates. A navigational request processed through full vector search burns compute for a lookup that could have been a direct link. A transactional command ("reset my password") treated as open-ended chat produces a helpful paragraph when the user wanted a button. McKinsey found that 47% of organizations using generative AI experienced at least one negative consequence, with inaccuracy topping the list.[1]
These failures compound because nothing in the pipeline distinguishes a factual lookup from a creative request, a navigational query from an analytical one. The fix is architectural, not model-level. You need a classifier that sits in front of the retrieval pipeline and routes each query to the right processing strategy.
How does query intent classification work?
The taxonomy
Broder's 2002 taxonomy of web search classified queries into three types: informational (seeking knowledge), navigational (seeking a specific resource), and transactional (seeking to perform an action).[5] Studies have validated this and found a consistent distribution: roughly 80% informational, 10% navigational, 10% transactional.

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026
That taxonomy was built for web search. RAG systems handle a wider range of query types, and routing all "informational" queries to the same retrieval strategy defeats the purpose of classification.
A practical intent taxonomy for enterprise LLM systems distinguishes at least four query types: factual lookup (retrievable from a single source), analytical reasoning (requires synthesis across sources), creative generation (original content creation), and procedural/how-to (step-by-step guidance). Each demands a different retrieval strategy, different context window management, and often a different model.
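The taxonomy only earns its keep if each intent maps to a different processing plan. A minimal sketch of that mapping follows; the strategy names, top-k values, and model labels are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class RetrievalPlan:
    strategy: str  # e.g. "single_chunk", "multi_hop", or "none"
    top_k: int     # how many chunks to pull into the context window
    model: str     # hypothetical model tiers, not real product names

# Illustrative mapping from intent class to processing plan.
PLANS = {
    "factual":    RetrievalPlan("single_chunk", top_k=3,  model="small-fast"),
    "analytical": RetrievalPlan("multi_hop",    top_k=10, model="large-reasoning"),
    "creative":   RetrievalPlan("none",         top_k=0,  model="large-creative"),
    "procedural": RetrievalPlan("single_chunk", top_k=5,  model="small-fast"),
}
```

Note that creative generation retrieves nothing at all, while analytical reasoning gets both deeper retrieval and a stronger model: the taxonomy exists precisely so these plans can diverge.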
The classification methods
The methods for classifying intent span a wide accuracy-latency spectrum. Rule-based approaches (regex patterns, keyword matching) run in microseconds but break on any query that deviates from expected phrasing. LLM-based classification (few-shot or zero-shot prompting) handles novel phrasings well but runs 50 to 390 times slower than a dedicated classifier.
The production sweet spot sits between these extremes. A fine-tuned lightweight classifier trades at most 2% accuracy for 50% lower latency compared to LLM-based classification, a tradeoff worth making in almost any production system serving real-time queries.
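As a toy stand-in for that middle of the spectrum, here is a nearest-centroid classifier over bag-of-words vectors. A real system would fine-tune a small transformer on thousands of labeled queries; the two intent classes and the training examples below are invented to keep the sketch self-contained.

```python
import math
from collections import Counter

# Invented training examples for two intent classes. A production classifier
# would be fine-tuned on labeled queries sampled from real traffic.
TRAIN = {
    "factual": [
        "what is the parental leave policy",
        "when does open enrollment start",
    ],
    "analytical": [
        "compare our q3 positioning against competitors",
        "why did churn rise after the pricing change",
    ],
}

def _vec(text: str) -> Counter:
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# One centroid per intent class, built from its pooled training examples.
CENTROIDS = {label: _vec(" ".join(examples)) for label, examples in TRAIN.items()}

def classify_query(query: str) -> str:
    """Return the intent label whose centroid is closest to the query."""
    return max(CENTROIDS, key=lambda label: _cosine(_vec(query), CENTROIDS[label]))
```

The point is not this particular model but its shape: a cheap forward pass that runs on every query, with the expensive LLM reserved for answering rather than routing.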
How routing uses the classifier
Once the classifier tags a query's intent, routing logic maps it to a retrieval strategy. This has been formalized as Adaptive-RAG, a three-tier routing system: queries go to no retrieval, single-step retrieval, or multi-step retrieval based on classified complexity.[2] Adaptive-RAG demonstrated that a trained classifier routing queries to the appropriate retrieval depth outperforms any fixed-strategy baseline on open-domain question answering.
In agentic RAG architectures, the routing layer grows more sophisticated. A routing agent selects among vector search, SQL queries, graph traversal, or web search based on classified intent.
The classification step is what makes this selection possible. Without it, the system defaults to vector search for everything, which is the one-pipeline problem restated at the component level.
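A routing layer in the Adaptive-RAG style can be sketched as a dispatch table keyed by the classifier's output. The handlers below are placeholders that only describe what they would do; real ones would call a vector store and an LLM.

```python
def answer_directly(query: str) -> str:
    # Tier 1: the LLM answers from its own weights; no retrieval cost.
    return f"llm({query!r})"

def single_step_retrieve(query: str) -> str:
    # Tier 2: one vector-store lookup, then generate.
    return f"llm({query!r}, docs=vector_search({query!r}))"

def multi_step_retrieve(query: str) -> str:
    # Tier 3: iterative retrieve-and-reason loop for multi-hop questions.
    return f"llm({query!r}, docs=iterative_search({query!r}))"

# Adaptive-RAG's three tiers, keyed by classified query complexity.
TIERS = {
    "simple":   answer_directly,
    "moderate": single_step_retrieve,
    "complex":  multi_step_retrieve,
}

def route(query: str, complexity: str) -> str:
    """Dispatch a classified query to the matching retrieval strategy."""
    return TIERS[complexity](query)
```

Swapping the string handlers for real vector-search, SQL, or graph-traversal calls turns this table into the agentic router described above.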
Where should you start?
Start with a fine-tuned lightweight classifier, not LLM-based classification. Reserve LLM-based classification for bootstrapping new intent categories where you lack labeled training data.
Build your intent taxonomy from observed query patterns, not theoretical categories. Log a representative sample of queries, cluster them, and define intent classes that map to distinct retrieval strategies. If two intent classes would route to the same pipeline, they are one class.
Implement explicit out-of-scope detection from day one. Production systems where users ask unexpected questions (which is all of them) need a classifier that says "I don't know what this is" rather than forcing every query into the closest available intent.
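A minimal way to implement that rejection, assuming the classifier exposes per-label confidence scores, is a threshold on top of the argmax. The 0.5 cutoff here is illustrative and should be tuned on held-out traffic.

```python
def classify_with_rejection(scores: dict, threshold: float = 0.5) -> str:
    """scores maps intent label -> classifier confidence (e.g. softmax output).
    The 0.5 threshold is an illustrative default, not a recommended value."""
    label, confidence = max(scores.items(), key=lambda kv: kv[1])
    if confidence < threshold:
        return "out_of_scope"  # route to a clarifying question or safe refusal
    return label
```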
Set confidence thresholds and build fallback paths. CRAG's three-tier model (proceed, fallback, combine) provides a reasonable starting framework.[6] The evaluator adds latency, but catching a wrong retrieval before it hits the LLM is cheaper than generating and serving a hallucinated answer.
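The evaluator's decision rule can be sketched as follows; the thresholds are illustrative placeholders, not CRAG's published values.

```python
def crag_action(relevance: float, upper: float = 0.7, lower: float = 0.3) -> str:
    """Map a retrieval evaluator's relevance score to a corrective action.
    Thresholds are illustrative, not taken from the CRAG paper."""
    if relevance >= upper:
        return "proceed"   # retrieved documents look right; refine and use them
    if relevance <= lower:
        return "fallback"  # discard retrieval, e.g. fall back to web search
    return "combine"       # ambiguous: blend refined documents with a fallback source
```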
Plan for multi-intent queries even if you don't solve them immediately. Log compound queries, measure how often they occur, and design your taxonomy so it can be extended to handle them.
The highest-leverage decision in your LLM stack
Most conversations about improving RAG focus on the retrieval layer: better embeddings, reranking, hybrid search. Those matter, but they all assume the system has already decided what kind of retrieval to do. Intent classification is the decision that makes every downstream component more effective, because it ensures the right strategy is applied to the right query.
The gap between a system that answers 63% of questions correctly and one that answers 95%+ is not a better model. It is a classifier that runs before retrieval and asks the question no one else in the pipeline is asking.
Your model is not the bottleneck. The routing layer in front of it is. Most teams discover this after proof of concept, when the system that performed well on demo queries starts misrouting real user questions. A 30-minute architecture review with Tricky Wombat's team will map where your pipeline misroutes queries, what an intent taxonomy needs to cover for your specific use case, and what it takes to build one. Schedule a call.
References

1. McKinsey & Company. "The State of AI in Early 2024." May 2024. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-2024
2. Jeong, S. et al. "Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity." NAACL 2024. https://aclanthology.org/2024.naacl-long.389/
3. Nishisako, H. et al. "Reducing Hallucinations in Generative AI Chatbots." JMIR Cancer 11:e70176, 2025. https://pmc.ncbi.nlm.nih.gov/articles/PMC12425422/
4. Gartner. "Digital Worker Survey 2024." https://www.gartner.com/reviews/market/enterprise-ai-search
5. Broder, A. "A Taxonomy of Web Search." SIGIR Forum 36(2), 2002. https://dl.acm.org/doi/10.1145/792550.792552
6. Yan, S. et al. "Corrective Retrieval Augmented Generation." arXiv:2401.15884, 2024. https://arxiv.org/abs/2401.15884