Agglomerative clustering

Standard RAG pipelines retrieve the most relevant chunks, concatenate them, and pass the result to an LLM. The retrieval step has received enormous investment. The step between retrieval and generation, where raw search results become the information environment the model reasons over, has been largely ignored. That gap is where answer quality breaks down. Stanford's "Lost in the Middle" research found LLM performance drops more than 30% based on how information is organized in the context window, not whether it is present at all. Agglomerative clustering addresses this directly: it groups retrieved chunks by thematic similarity, eliminates redundancy, and selects a representative from each cluster so the model sees the full landscape of relevant information rather than the same narrow slice repeated. The academic results are specific. Token reductions of 46-90% with no degradation in answer quality. Quality improvements up to 54% over unstructured context. The algorithm dates to the 1960s. Its application as an intelligent curation layer between retrieval and generation is what makes it newly relevant.

Detailed Explanation

Most enterprise RAG pipelines optimize the wrong thing. They retrieve the chunks with the highest relevance scores, concatenate them, and pass the result to an LLM. The assumption is straightforward: better retrieval means better answers.

The research says otherwise. Stanford's "Lost in the Middle" study found that LLM performance drops more than 30% based on where information sits in the context window, not whether it's present at all.[1] Chroma's 2025 benchmarks across 18 frontier models showed that performance degrades at every context length increment, with semantically similar but redundant content appearing most frequently in hallucinated responses.[2] Enterprise search tools succeed on the first try about 10% of the time, according to a 2025 survey by Slite (a knowledge management vendor), compared to roughly 95% for a consumer Google query.[3]

The gap isn't in the models. It's in what happens between retrieval and generation, the step where raw search results become the information environment the LLM reasons over. Andrej Karpathy, former Senior Director of AI at Tesla and OpenAI founding member, calls this "context engineering."[4] It's the least-invested, least-understood layer in the modern AI search stack. And agglomerative clustering turns out to be one of the most effective tools for getting it right.

The enterprise search problem is worse than you think

An estimated 80-90% of enterprise data is unstructured, according to IDC: documents, emails, Slack messages, presentations, PDFs scattered across dozens of applications. A customer support question might require information from a product manual in Confluence, a policy update in SharePoint, and a known-issue thread in Jira. Traditional keyword search cannot bridge these silos, let alone handle the vocabulary mismatch problem where users ask about "PTO for remote workers" while documents say "time-off policy for telecommute employees."

Enterprise Search Success
Enterprise search tools succeed on the first try about 10% of the time, compared to roughly 95% for a consumer Google query. A 9.5x performance gap.

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026

The business impact is measurable. The Slite 2025 survey found that 45% of respondents cited productivity drain from poor search, 15% reported customer-facing delays, and 11% experienced duplicate work, with teams unknowingly rebuilding solutions that already exist somewhere in the knowledge base. And 73% of companies don't know enterprise search solutions exist, despite paying for the SaaS tools that create the very information silos search is supposed to bridge.[3]

Enter RAG, Retrieval-Augmented Generation, the architecture that was supposed to fix everything. By combining vector search with large language models, RAG promised to let employees ask natural-language questions and get accurate, sourced answers drawn from company data. In demos, it works well. But as Douwe Kiela, co-author of the original 2020 RAG paper and CEO of Contextual AI, has admitted: production-scale RAG on real enterprise data with real enterprise constraints is a fundamentally different problem than a single-document demo.[5][6]

The failures are systematic. A study from Deakin University identified seven distinct failure points in production RAG systems, from irrelevant chunk retrieval to consolidation errors.[7] IBM SVP Dinesh Nirmal has stated plainly that pure RAG is not delivering the results enterprises expected.[8] The culprit in most cases isn't the language model. It's what happens before the model ever sees a single token.

Why "top-K" retrieval is a surprisingly crude approach

To understand why enterprise RAG underperforms, you need to understand how most pipelines work. The standard architecture, used by virtually every major vendor in the space (Glean, Coveo, Elastic, Microsoft, Google), follows the same pattern: documents are chunked, embedded as vectors, and stored in a vector database. When a user asks a question, the query is embedded, the system finds the K most similar chunks via approximate nearest-neighbor search, optionally reranks them, and passes them to an LLM to generate an answer.

This approach has a flaw that rarely gets discussed: top-K retrieval optimizes for individual chunk relevance, not collective context quality. If your query is about a company's approach to data privacy, and seven of the top ten retrieved chunks all discuss the same GDPR compliance paragraph from slightly different documents, the LLM receives a narrow, repetitive view. It sees one facet of the topic seven times rather than seven different facets once each.

The "Lost in the Middle" findings, referenced above, quantify just one dimension of this damage. But the deeper issue is structural. Adding more documents barely helps: performance gains from using 50 documents instead of 20 were negligible (about 1.5% for GPT-3.5 Turbo in that same Stanford study). More context doesn't mean better answers. Better-organized context does.[1]

Chroma's "context rot" research reinforces the point. Semantically similar but irrelevant content, what researchers call "distractors," caused degradation beyond context length alone and appeared most frequently in hallucinated responses. The context window isn't a container you fill. It's an information environment, and polluting it with redundancy and noise directly degrades output quality.[2]

Context Rot
Performance degradation as input context length increases across frontier LLMs. Even the best models show declining reliability at every length increment.

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026

This is why the industry is shifting from "prompt engineering" (optimizing the question you ask) to "context engineering" (optimizing the information ecosystem you provide).

How agglomerative clustering actually works

Agglomerative clustering is a bottom-up algorithm that starts by treating every data point as its own cluster, then iteratively merges the two closest clusters until a stopping criterion is reached. Imagine a room full of people, each standing alone. At each step, the two people standing closest together join into a group. Then the two closest groups merge. This continues until you have a desired number of groups, or until the remaining groups are too far apart to justify merging.

The result is a dendrogram, a tree diagram that records every merge and the distance at which it occurred. This is the algorithm's structural advantage: you can "cut" the tree at different heights to get coarser or finer groupings without rerunning anything. Need three broad topic clusters? Cut high. Need twelve granular subtopics? Cut low. The same computation supports both.
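
The cut-once property can be sketched in a few lines with SciPy. The data here is an illustrative toy (three loose groups of points), not real embeddings:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: 12 points forming three loose groups (illustrative only)
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(4, 2)),
    rng.normal(loc=[5, 0], scale=0.3, size=(4, 2)),
    rng.normal(loc=[0, 5], scale=0.3, size=(4, 2)),
])

# One pass builds the full merge tree (the dendrogram, encoded as a matrix)
Z = linkage(points, method="average")

# "Cut" the same tree at different heights; no recomputation needed
broad = fcluster(Z, t=3, criterion="maxclust")  # three broad clusters
fine = fcluster(Z, t=6, criterion="maxclust")   # six finer subclusters
```

The same `Z` serves both cuts, which is exactly the property the dendrogram buys you.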

Dendrogram illustration
A dendrogram records every merge in agglomerative clustering. Cutting the tree at different heights produces different numbers of clusters from the same computation. Higher cuts yield broad topic groups. Lower cuts yield granular subtopics.

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026

The algorithm's behavior depends on how "closeness" between clusters is measured, known as the linkage criterion. Ward linkage minimizes total within-cluster variance, producing the most evenly sized clusters. Average linkage uses the mean distance between all pairs of points across two clusters, a strong default for text applications. Complete linkage uses the maximum distance, favoring compact clusters. Single linkage uses the minimum distance, which handles unusual cluster shapes but is sensitive to noise.
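
A quick way to watch the four criteria diverge is to cluster the same data under each, again sketched with SciPy on synthetic stand-in data (the resulting cluster sizes are illustrative, not a benchmark):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 8))  # stand-in for 30 small embeddings

# Same data, four linkage criteria; cluster size balance differs
results = {}
for method in ("ward", "average", "complete", "single"):
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=3, criterion="maxclust")
    # Cluster sizes, largest first (fcluster labels are 1-indexed)
    results[method] = sorted(np.bincount(labels)[1:].tolist(), reverse=True)

for method, sizes in results.items():
    print(f"{method:>8}: {sizes}")
```

On noisy data like this, single linkage often "chains" into one dominant cluster plus stragglers, while Ward tends to keep sizes balanced.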

For text and embeddings, two properties make agglomerative clustering a strong fit. First, it natively supports cosine similarity through precomputed distance matrices, the standard metric for comparing text embeddings and the one used by major embedding models like OpenAI's text-embedding series. K-means, by contrast, is tied to Euclidean distance, which performs poorly in the high-dimensional spaces where text embeddings live. Second, agglomerative clustering is deterministic: the same input always produces the same output. K-means depends on random initialization and can produce different results on every run, a real concern in enterprise environments where reproducibility and auditability matter.

The tradeoff is computational: agglomerative clustering runs in O(n² log n) time compared to K-means' roughly linear scaling. But in RAG pipelines, where you're typically clustering tens to hundreds of retrieved chunks rather than millions of documents, this tradeoff is irrelevant. Clustering 50-100 chunks with scipy's agglomerative implementation runs in single-digit milliseconds on commodity hardware. The scale where agglomerative clustering operates in enterprise context assembly is exactly where its advantages are strongest and its computational costs don't matter.

From "retrieve" to "retrieve, cluster, curate, generate"

The standard RAG pipeline follows a three-step workflow: retrieve, assemble, generate. The clustering-based approach inserts an intelligent organization layer, creating a four-step pipeline: retrieve, cluster, curate, generate. This additional step transforms context assembly from a mechanical process into an intelligent one.

Side-by-side RAG comparison
Standard RAG follows three steps: retrieve, assemble, generate. The clustering-based approach inserts an intelligent curation layer, creating a four-step pipeline that eliminates redundancy and ensures topical diversity before the LLM sees a single token.

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026

Here's how it works in practice. Instead of retrieving exactly K chunks and passing them straight to the LLM, the system deliberately over-fetches, pulling perhaps 50 relevant chunks instead of 10. It then computes the cosine similarity between all retrieved chunks and applies agglomerative clustering to group them by thematic similarity. Chunks discussing the same subtopic naturally cluster together. The system then selects a representative chunk (or a summary) from each cluster to assemble the final context. The result: a context window that is semantically diverse, non-redundant, and topically comprehensive, covering the full landscape of relevant information rather than echoing the same narrow slice.
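
The curate step can be sketched as follows, under illustrative assumptions: random vectors stand in for real chunk embeddings, and `curate_context` and `n_facets` are names invented here for the sketch, not any vendor's API:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def curate_context(chunk_embeddings, chunk_texts, n_facets=5):
    """Group over-fetched chunks by theme; keep one representative each."""
    dist = cosine_distances(chunk_embeddings)
    np.fill_diagonal(dist, 0.0)
    # scipy's linkage expects a condensed distance matrix
    Z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(Z, t=n_facets, criterion="maxclust")

    representatives = []
    for cluster_id in np.unique(labels):
        members = np.where(labels == cluster_id)[0]
        # Representative = member closest to the cluster's centroid
        centroid = chunk_embeddings[members].mean(axis=0, keepdims=True)
        dists = cosine_distances(chunk_embeddings[members], centroid).ravel()
        representatives.append(members[np.argmin(dists)])
    # Preserve original retrieval order among the chosen representatives
    return [chunk_texts[i] for i in sorted(representatives)]

# Toy usage: 50 over-fetched "chunks" as random embeddings
rng = np.random.default_rng(7)
embs = rng.normal(size=(50, 64))
texts = [f"chunk {i}" for i in range(50)]
context = curate_context(embs, texts, n_facets=5)
```

Representative selection here is "member closest to the cluster centroid"; a production system might instead keep the member with the highest retrieval score, or summarize each cluster, as the CRAG paper explores.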

The research backing this approach is substantial. The CRAG (Clustered Retrieval Augmented Generation) paper formalized the pipeline and demonstrated 46-90%+ token reduction compared to standard RAG without degrading answer quality, tested across GPT-4, Llama2-70B, and Mixtral8x7B.[9] A separate study by Alessio et al., published in CEUR-WS Proceedings, showed that clustering and reordering sentences in the RAG context improved response quality by up to 54% over random ordering and 8.66% over the best baseline, using the TREC CAsT 2022 conversational search benchmark. Their finding: sentences with similar meaning placed closer together in the LLM prompt generate, on average, better quality output.[10]

RAG v CRAG
Token count comparison between standard RAG and CRAG (Clustered Retrieval Augmented Generation) as the number of input documents increases. CRAG achieves 46-90%+ token reduction while maintaining answer quality.

Credit: Tricky Wombat made with Google Gemini 3.1 Flash Image, Mar 2026

The WEClustering framework, combining BERT embeddings with agglomerative clustering, outperformed every compared technique (including K-means, HDBSCAN, and several specialized methods) across seven benchmark datasets. The agglomerative variant showed silhouette coefficient improvements ranging from 36% to over 370% depending on dataset, with the improved WEClustering++ version (2025) reporting a 67% median silhouette score increase over prior methods.[11][12]

What makes this powerful in enterprise settings goes beyond raw performance metrics. Enterprise questions are inherently multi-faceted. "What is our company's approach to data residency for EU customers?" might require information from a legal policy document, an architecture diagram, a customer-facing FAQ, and an internal Slack discussion about a recent regulatory change. Top-K retrieval might surface four variations of the legal policy and miss the architecture context entirely. Clustering ensures that each distinct information facet is represented, producing answers that are accurate across all relevant dimensions rather than one.

Why no major vendor does this, and why that's telling

After analyzing how the major enterprise AI search vendors describe their technology, a pattern emerges: none of them specifically position clustering as a differentiator in context assembly. Glean relies on knowledge graphs and behavioral signals. Coveo uses a two-stage retrieval pipeline with embedding-based passage extraction. Elastic emphasizes its hybrid BM25-plus-vector search with reciprocal rank fusion. Microsoft's newest "agentic retrieval" adds query decomposition and parallel retrieval. Google leans on its search heritage and Gemini integration. Perplexity uses multi-agent verification.

Every one of these approaches shares the same fundamental architecture: embed, search for nearest neighbors, optionally rerank, pass top-K to the model. The retrieval step has received enormous investment and innovation. The context assembly step, what happens between retrieval and generation, has been largely ignored.

This gap exists for understandable reasons. Vector databases and embedding models are where the market momentum and venture capital have concentrated. Retrieval is measurable and well-understood. Context assembly is harder to benchmark, harder to sell, and requires a different kind of engineering intuition, what Karpathy describes as "the delicate art and science of filling the context window with just the right information." He notes that too little context or the wrong form leaves the LLM without what it needs for good performance, while too much or too irrelevant content drives up costs and drives down quality.[4]

The industry is beginning to recognize this gap. Forrester's Q4 2025 Wave on cognitive search platforms evaluated 14 vendors and, according to multiple vendors cited in the report, positions cognitive search as the foundation for agentic AI.[13][14][15] Microsoft has started positioning Azure AI Search as a "context engineering" platform. Elastic uses similar language. The academic world is further ahead: the Hierarchical RAG research area has produced dozens of papers in 2024-2025 alone, with frameworks like RAPTOR, HiRAG, MacRAG, and ArchRAG all exploring clustering-based approaches to context organization. Gartner predicts traditional search volume will drop 25% by 2026 as AI agents take over, and those agents will need context assembled with precision, not bulk.[16]

The companies that productize intelligent context assembly today are positioning themselves for where the market is heading, not where it is.

The Economist's warning applies to enterprise AI too

In February 2025, The Economist published an analysis of AI research tools identifying three structural limitations that apply directly to enterprise search: lack of data creativity, tyranny of the majority, and intellectual shortcuts.[17]

Data creativity describes AI's tendency to retrieve commonly cited information while missing less obvious but potentially more relevant sources. In enterprise search, this manifests as systems that surface the most-viewed documents rather than the most-relevant ones. A clustering-based approach partially addresses this by ensuring topical diversity: even if the "popular" information dominates retrieval scores, clustering identifies and preserves minority topics that contain the differentiating insight.

Tyranny of the majority is perhaps the most dangerous failure mode for enterprises. AI systems trained on broad web data gravitate toward mainstream narratives over specialist knowledge. In enterprise contexts, the most valuable knowledge is the most specialized: niche internal policies, domain-specific procedures, expert institutional knowledge. An enterprise search system that defaults to majority-rules information retrieval will systematically undervalue exactly the proprietary knowledge that differentiates one company from another. Intelligent context curation, which ensures representation across thematic clusters rather than weighting by popularity, provides a structural defense against this bias.

Intellectual shortcuts are an organizational risk. When employees stop critically evaluating AI-generated answers and treat them as authoritative, organizational knowledge quality degrades. The best mitigation is transparency: showing users not just an answer but how the answer was assembled, which sources contributed, how they were grouped, and what topics were covered. A clustering-based pipeline naturally supports this kind of explainability because each cluster represents a distinct information facet that can be surfaced and inspected.

What to look for in your AI search pipeline

The shift from prompt engineering to context engineering reflects a maturing understanding of where AI value actually comes from. The model matters, but models are commoditizing rapidly. Chroma's Context Rot research makes the case plainly: what matters is not whether relevant information is present in the context, but how that information is presented.[2]

For teams evaluating enterprise AI search solutions, or building their own RAG pipelines, the clustering-based context assembly approach points to several questions worth asking.

Does your pipeline optimize for collective context quality, or just individual chunk relevance? If the system returns the top-K most similar chunks and concatenates them, it's optimizing the wrong objective. Look for evidence that the system considers redundancy, diversity, and topical coverage in context assembly, not just relevance scores.

How does the system handle multi-faceted questions? Enterprise questions rarely have single-source answers. Ask how the pipeline ensures coverage across different information facets rather than over-representing the most common one. Agglomerative clustering's dendrogram structure naturally reveals when retrieved information clusters into distinct subtopics and ensures each subtopic gets representation.

What happens between retrieval and generation? This is where most pipelines have a gap. The most sophisticated approaches add an intelligent curation layer (clustering, deduplication, thematic organization, summarization) that transforms raw retrieval results into a coherent information environment for the LLM. This step is where context engineering lives.

Can you see how context was assembled? Deterministic algorithms like agglomerative clustering produce consistent, explainable results. The dendrogram provides a visual audit trail showing exactly how information was grouped and which sources contributed to each thematic cluster. In enterprise environments where trust and governance matter, this transparency is not optional.

Is the system getting smarter over time? The feedback loop between user questions, answer quality, and context assembly tuning represents the long-term competitive advantage. Systems that learn which clustering granularity, which linkage criteria, and which representation strategies work best for specific query types will continuously outperform static pipelines.

The curation layer is the next frontier

The enterprise AI search market is converging on a homogeneous pipeline: hybrid search, reciprocal rank fusion, cross-encoder reranking, top-K selection, and LLM generation. This architecture was a meaningful step forward from keyword search, but it has reached diminishing returns. Adding a better embedding model or a more powerful LLM to the same pipeline produces marginal improvements because the bottleneck has shifted from retrieval to context assembly.

Agglomerative clustering is one of the most promising approaches to this bottleneck. Not because the algorithm itself is new (it dates to the 1960s), but because applying it as an intelligent curation layer between retrieval and generation addresses failure modes that the rest of the pipeline cannot. It eliminates redundancy, ensures topical diversity, supports adaptive granularity, and produces deterministic, explainable results. The academic evidence is strong and growing: token reductions of 46-90%[9], quality improvements up to 54%[10], and significant outperformance of agglomerative methods over K-means and density-based alternatives on text clustering benchmarks[11][12].

The companies that will win in enterprise AI search are not the ones with the biggest models or the fastest vector databases. They are the ones that master context engineering, treating what goes into the LLM's context window not as an afterthought but as the primary engineering challenge. When every vendor has access to the same foundation models, the context curation layer becomes the true differentiator.

Every major enterprise search vendor uses the same pipeline: embed, retrieve, rerank, generate. The retrieval step is well-engineered. The context assembly step, where answer quality actually breaks down, is an afterthought. Tricky Wombat's architecture inserts an intelligent curation layer between retrieval and generation, using clustering to eliminate redundancy, ensure topical diversity, and assemble context the model can reason over effectively. A 30-minute call will map how your current pipeline handles context assembly and where a curation layer would change output quality. Schedule a call.

References (17)
  1. Liu, N.F., Lin, K., Hewitt, J., et al. "Lost in the Middle: How Language Models Use Long Contexts." TACL, 2024. https://arxiv.org/abs/2307.03172
  2. Hong, K., Troynikov, A., Huber, J. "Context Rot: How Increasing Input Tokens Impacts LLM Performance." Chroma Research, July 2025. https://research.trychroma.com/context-rot
  3. Slite Enterprise Search Survey Report 2025. https://slite.com/en/learn/enterprise-search-survey-findings
  4. Karpathy, A. Post on X, June 25, 2025. https://x.com/karpathy/status/1937902205765607626
  5. Kiela, D. Quoted in "Is RAG Dead? The Rise of Context Engineering and Semantic Layers for Agentic AI." Towards Data Science, October 2025. https://towardsdatascience.com/beyond-rag/
  6. Lewis, P., Perez, E., Piktus, A., et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020. https://arxiv.org/abs/2005.11401
  7. Barnett, S., Kurniawan, S., Thudumu, S., et al. "Seven Failure Points When Engineering a Retrieval Augmented Generation System." CAIN 2024 (IEEE/ACM). https://arxiv.org/abs/2401.05856
  8. IBM. "RAG Problems Persist. Here Are Five Ways to Fix Them." IBM Think, November 2025. https://www.ibm.com/think/insights/rag-problems-five-ways-to-fix
  9. Akesson, S. and Santos, F.A. "Clustered Retrieved Augmented Generation (CRAG)." arXiv, May 2024. https://arxiv.org/abs/2406.00029
  10. Alessio, M., Faggioli, G., Ferro, N., et al. "Improving RAG Systems via Sentence Clustering and Reordering." CEUR-WS Proceedings, Vol. 3784, pp. 34-43, 2024. https://ceur-ws.org/Vol-3784/paper4.pdf
  11. Sutrakar, V.K. and Mogre, N. "WEClustering: word embeddings based text clustering technique for large datasets." Complex & Intelligent Systems, 8, 1847-1860, 2021. https://link.springer.com/article/10.1007/s40747-021-00512-9
  12. Sutrakar, V.K. and Mogre, N. "An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets." arXiv, February 2025. https://arxiv.org/abs/2502.16139
  13. Forrester. "The Forrester Wave: Cognitive Search Platforms, Q4 2025." https://www.forrester.com/report/the-forrester-wave-tm-cognitive-search-platforms-q4-2025/RES186654
  14. Kore.ai press release, October 2025. https://www.businesswire.com/news/home/20251014583423/en/Kore.ai-Named-a-Leader-in-Cognitive-Search-Platforms-Q4-2025-Evaluation
  15. Elastic blog, October 2025. https://www.elastic.co/blog/forrester-leader-cognitive-search-platforms-2025
  16. Gartner prediction reported February 2024. Coverage: E-Commerce Times, September 2025. https://www.ecommercetimes.com/story/gartner-predicts-25-dip-in-search-volumes-by-2026-177921.html
  17. The Economist. "The danger of relying on OpenAI's Deep Research." February 13, 2025. https://www.economist.com/finance-and-economics/2025/02/13/the-danger-of-relying-on-openais-deep-research