Why Tricky Wombat

Your AI gives wrong answers because it’s asking the wrong questions. We fix the questions so the answers fix themselves.

You have a search problem. Or an AI chatbot problem. Or an enterprise knowledge problem. You’ve described it a dozen different ways to a dozen different vendors, and they’ve all responded with the same pitch: a better model, more connectors, a larger knowledge graph.

And the answers are still wrong.

Not wrong in an obvious, dramatic way. Wrong in the way that erodes trust over weeks and months. An AI assistant that confidently cites a policy document from 2021. A search result that returns twelve links when you needed one number. A chatbot that gives your customer a technically accurate response to a question they didn’t actually ask.

You’ve spent money on this. You’ve burned internal credibility on this. The team that championed the AI rollout is now the team quietly explaining why adoption stalled at 30%.

That failure has a specific cause, and it’s not the model.

The question is the problem

A user types “Q3 revenue” into your enterprise search. The system retrieves the Q3 revenue figure. Technically correct. The user wanted a comparison against the forecast, with variance by region. They got a number. They needed an analysis.

That gap, between what the user typed and what the user actually needed, is where every AI system fails. Not at the model layer. Not at the retrieval layer. At the question layer.

This is not the user’s fault. Humans ask compressed questions. They use shorthand. They leave out context that seems obvious to them but is invisible to a machine. Every day, your team types queries that are three words long and expects answers that require understanding ten paragraphs of context. The traditional approach to this problem is to make the user work harder: write longer queries, use specific keywords, learn the system’s quirks. Add more cognitive load to people who are already overloaded.

Tricky Wombat takes the opposite approach. We make the machine work harder.

The quality of an AI-generated result is determined before the model produces a single token. Change what the model sees before it reasons, and the result changes entirely.

What happens between the question and the answer

When your team asks a question, seven things need to happen before a language model should produce a single word of output. Most AI systems skip five of them.

The first two are the most consequential, and the most ignored. Intent classification determines what the user actually needs. “Q3 revenue” could be a factual lookup, a comparative analysis, a request for a presentation-ready summary, or a conversational opener before a deeper question. Each intent routes to a different retrieval strategy, different source documents, a different output format. Coveo’s research found that 72% of enterprise search queries fail to return meaningful results on the first attempt. The fix is understanding what the user needs before retrieving anything.

Then prompt rewriting takes the user’s compressed, three-word query and expands it into a precise, context-rich instruction for the language model. Not by asking the user to type more. By inferring what’s missing from the user’s history, their role, the documents they’ve accessed recently, the patterns in how their team uses the system. The user typed three words. The model receives three paragraphs of carefully assembled context.
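To make the first two stages concrete, here is a minimal sketch of intent classification and query rewriting. Every name in it (Intent, UserContext, classify_intent, rewrite_query) is illustrative, not a real API, and a production classifier would be a trained model rather than keyword rules:

```python
# Illustrative sketch: classify intent, then expand the compressed query.
from dataclasses import dataclass
from enum import Enum, auto

class Intent(Enum):
    FACTUAL_LOOKUP = auto()
    COMPARATIVE_ANALYSIS = auto()
    SUMMARY = auto()

@dataclass
class UserContext:
    role: str
    recent_docs: list          # documents the user touched recently
    team_intent_bias: Intent   # dominant intent pattern for this team

def classify_intent(query: str, ctx: UserContext) -> Intent:
    # A real system would use a trained classifier; this keyword
    # heuristic only shows the routing decision being made.
    q = query.lower()
    if any(w in q for w in ("versus", "compare", "variance")):
        return Intent.COMPARATIVE_ANALYSIS
    if any(w in q for w in ("summary", "overview", "deck")):
        return Intent.SUMMARY
    if len(q.split()) <= 3:
        # Short queries inherit the team's dominant pattern.
        return ctx.team_intent_bias
    return Intent.FACTUAL_LOOKUP

def rewrite_query(query: str, intent: Intent, ctx: UserContext) -> str:
    # Expand the three-word query with inferred context before
    # retrieval; the user never sees or types this expansion.
    if intent is Intent.COMPARATIVE_ANALYSIS:
        return (f"{query}: compare actuals against forecast, "
                f"break down variance by region, for a {ctx.role}. "
                f"Prioritize sources: {', '.join(ctx.recent_docs)}")
    return query

ctx = UserContext(role="finance analyst",
                  recent_docs=["q3-forecast.xlsx", "regional-plan.pdf"],
                  team_intent_bias=Intent.COMPARATIVE_ANALYSIS)
intent = classify_intent("Q3 revenue", ctx)
expanded = rewrite_query("Q3 revenue", intent, ctx)
```

The point of the sketch is the shape of the flow: the two-word input becomes a routed, context-rich instruction without the user changing anything.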

Most vendors stop here, if they get here at all. Two steps down, five to go.

Retrieval engineering governs what documents the model sees. Not “the top ten results.” The three most relevant chunks from the right documents, at the right granularity, with the right recency. Hybrid search combining vector similarity and keyword matching. Reranking that delivers 10-30% precision improvement. The “retrieve wide, rerank narrow” pattern that separates production systems from demos. Context assembly then compresses those documents under strict token budgets, because Anthropic’s research identified a counterintuitive threshold: model accuracy degrades when more than 40% of the context window fills up. The instinct is to stuff more documents in. The research says the opposite.
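The “retrieve wide, rerank narrow” pattern and the 40% fill threshold can be sketched in a few lines. The scoring functions and the Chunk structure below are assumptions for illustration; a production reranker would be a cross-encoder, and token counts would come from a real tokenizer rather than word counts:

```python
# Illustrative sketch: hybrid scoring, reranking, and budgeted assembly.
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str
    vector_score: float   # semantic similarity, normalized 0..1
    keyword_score: float  # lexical (BM25-style) score, normalized 0..1

def hybrid_score(c: Chunk, alpha: float = 0.6) -> float:
    # Blend vector and keyword signals; alpha is a tunable weight.
    return alpha * c.vector_score + (1 - alpha) * c.keyword_score

def rerank(candidates: list, top_k: int = 3) -> list:
    # Retrieve wide (all candidates), rerank narrow (top_k survivors).
    return sorted(candidates, key=hybrid_score, reverse=True)[:top_k]

def assemble_context(chunks, context_window=8000, fill_ratio=0.4):
    # Stop filling at 40% of the window, per the accuracy threshold
    # described above. Token cost approximated here as word count.
    budget = int(context_window * fill_ratio)
    assembled, used = [], 0
    for c in chunks:
        cost = len(c.text.split())
        if used + cost > budget:
            break
        assembled.append(c)
        used += cost
    return assembled

candidates = [
    Chunk("a", "alpha " * 100, 0.9, 0.8),
    Chunk("b", "beta " * 100, 0.5, 0.9),
    Chunk("c", "gamma " * 100, 0.2, 0.1),
    Chunk("d", "delta " * 100, 0.8, 0.2),
]
top = rerank(candidates)
context = assemble_context(top, context_window=500, fill_ratio=0.4)
```

With a 500-token window and a 0.4 fill ratio, the budget is 200 tokens, so only the two best chunks make it into the model’s input even though three survived reranking. That cutoff is the counterintuitive part: the pipeline deliberately leaves relevant material out.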

The last three steps are where trust is built or broken. Specification enforcement defines what the answer should look like: format, citations, confidence scores, and whether the system says “I don’t know” when the evidence falls short. Evaluation scores every answer against those specifications before it reaches the user, rejecting outputs that fail and feeding results back into every upstream component. Memory makes the system learn from previous interactions in a measurable way: this team asks comparative questions 70% of the time, so route their queries to the analytical pipeline by default. That document was flagged as outdated by three users last month, so deprioritize it.
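A specification gate can be sketched as a check that every answer passes before delivery. The field names and thresholds below are illustrative assumptions, not a real schema:

```python
# Illustrative sketch: enforce the answer specification before delivery.
from dataclasses import dataclass, field

@dataclass
class AnswerSpec:
    require_citations: bool = True
    min_confidence: float = 0.7   # below this, say "I don't know"

@dataclass
class Answer:
    text: str
    citations: list = field(default_factory=list)
    confidence: float = 0.0

def enforce(answer: Answer, spec: AnswerSpec) -> Answer:
    # Reject fluent-but-unsupported output instead of shipping it.
    if spec.require_citations and not answer.citations:
        return Answer(text="I don't know: no supporting source found.",
                      confidence=0.0)
    if answer.confidence < spec.min_confidence:
        return Answer(text="I don't know: the evidence is inconclusive.",
                      citations=answer.citations,
                      confidence=answer.confidence)
    return answer

spec = AnswerSpec()
uncited = enforce(Answer(text="Revenue was $12M.", confidence=0.9), spec)
grounded = enforce(Answer(text="Revenue was $12M.",
                          citations=["q3-report.pdf"],
                          confidence=0.9), spec)
```

The uncited answer is replaced with an honest refusal; the cited, high-confidence answer passes through unchanged. In a full pipeline, every rejection would also feed back into the upstream retrieval and memory components.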

These seven steps are the context pipeline. Skip any one of them and the model produces output that is fluent, confident, and wrong.

72% of enterprise search queries fail on the first attempt
40% context window fill is the threshold before accuracy degrades
7 engineering disciplines sit between your question and a correct answer

Why most AI search fails the same way

The enterprise AI search market is full of well-funded companies that all made the same architectural bet: connect to everything, index everything, and let the model figure it out. The result is 100+ connectors, a giant knowledge graph, and a language model sitting on top of a pile of unprocessed data.

The model does its best. It produces answers that sound right. Some of them are right. Many are confidently wrong. Gartner reviewers have called the AI features of leading platforms “a gimmick.” G2 users report hallucinations where AI-generated responses don’t align with source data. Enterprise buyers describe results that come back “muddy, confusing, often flat-out wrong.”

The industry’s response has been to build bigger indexes, add more connectors, and upgrade to the latest model. We find this maddening. None of it addresses the root cause. The problem is not what the model can do. The problem is what the model sees. Pour dirty water through an expensive filter and you still get dirty water. The billion-dollar platforms keep upgrading the filter. Nobody is cleaning the water.

This is the structural difference between a model-first approach and a context-first approach. Model-first assumes a more capable model will compensate for noisy, unstructured, poorly retrieved input. Context-first assumes the input determines the output, and engineers every stage of the pipeline to ensure the model receives exactly what it needs, nothing more, nothing less.

Andrej Karpathy described context engineering as “the delicate art and science of filling the context window with just the right information for the next step.” He contrasted it with prompt engineering, which he said most people associate with short task descriptions. Within two weeks of his post, the term had been amplified by Shopify’s CEO, formalized by Anthropic, and picked up by Gartner. The speed of adoption reflected something practitioners already knew: the model is rarely the bottleneck.

What Tricky Wombat actually does

We build the context pipeline. The engineering layer between your data and the model’s input that determines whether answers are trustworthy.

The user types three words. Underneath, the system classifies intent and routes the query to the right retrieval strategy. It rewrites the query automatically, expanding three-word inputs into precision-targeted instructions without asking the user to change anything about how they work. Hybrid search with metadata filtering and reranking pulls the three right documents instead of thirty tangentially related ones, then context assembly compresses them under strict token budgets. The output is checked against specifications before delivery. Citations, confidence signals, appropriate “I don’t know” responses. And the system learns from every interaction, so it measurably improves over time.

The user gets the right answer. They don’t see any of this. They don’t need to. The cognitive load stays with the machine.

A good executive assistant doesn’t hand you a stack of documents when you ask a question. They hand you the answer, with the source, and a note about what changed since last quarter. They do the work of figuring out what you actually need from the shorthand of what you said. The context pipeline does this for every query, for every user, at machine speed.

Why this requires a different kind of company

Large platforms build for the average query across millions of users. That’s the economics of scale: optimize for the middle, serve the edges poorly, and let the brand carry the rest. A $7 billion platform with Fortune 500 accounts doesn’t tune its pipeline for a 500-person engineering firm’s specific documentation structure. Your feature requests compete with the largest enterprises in the world for roadmap priority. Your support ticket enters a queue.

Tricky Wombat is built for a different economics. We work project by project. When we deploy for a customer, we tune the pipeline for that customer’s data, that customer’s query patterns, that customer’s definition of a correct answer. A specific pipeline built for a specific need, not a generic model applied to a broad market.

This is possible because the pipeline is the product, not the model. Models are interchangeable. GPT-4, Claude, Gemini. Swap one for another and output quality shifts by single-digit percentages. The context pipeline is where the real variance lives, and context pipelines are specific to the data, the domain, and the users they serve.

Our team works on the seven engineering disciplines that sit between your question and a correct answer: prompt engineering, context engineering, intent engineering, specification engineering, evaluation engineering, retrieval engineering, and memory engineering. Every day. That focus, and only that focus, is the reason the answers are better.

The gap between experimentation and production

McKinsey’s 2025 State of AI survey found that 88% of respondents use AI regularly. Only one-third are scaling enterprise-wide. That gap, between “we’re experimenting” and “this works at scale,” is the gap between using one engineering discipline (prompting) and orchestrating all seven.

Most companies that try AI search go through a predictable arc. The demo looks great. The pilot shows promise on a curated dataset. Production deployment hits real data: outdated documents, ambiguous queries, edge cases the demo never encountered. Accuracy drops. Adoption stalls. The budget is spent. The team that championed the project quietly shelves it.

An estimated 40-60% of RAG implementations fail to reach production because of retrieval quality issues. The retrieval pipeline returns the wrong documents, the model reasons over them faithfully, and the output is confidently wrong. Not model issues. Not infrastructure issues. Pipeline issues.

That’s the company we built. An engineering team focused on the 95% of the problem that sits upstream of the model, because that’s where answers are won or lost.

Would it be worth 30 minutes to find out what your pipeline is missing?

A short conversation will identify which of the seven engineering disciplines are absent from your current setup, and what it takes to close the gaps.

Schedule a call

What this means for your team

Your team has a search tool, a chatbot, an AI assistant, or some combination. Right now, they’re working around it. They know which queries work and which don’t. They’ve developed instincts about how to phrase things to get usable results. They add extra keywords. They try multiple variations. They give up and ask a colleague instead.

That workaround behavior is invisible in your analytics. Usage metrics look fine. Queries per day, sessions per week. The numbers don’t capture the 40% of queries where the user looked at the result, decided it was useless, and went elsewhere. They don’t capture the decision made on incomplete information because finding the right document would have taken another twenty minutes.

Every one of those moments is a cost. Engineers rebuilding work that already exists somewhere in the knowledge base. New hires spending their first two weeks asking colleagues questions the system should have answered on day one. Customer service agents giving inconsistent answers because the system surfaced different documents depending on who asked and how they phrased it.

These aren’t AI problems. They’re context pipeline problems. The model is fine. What it sees is not.

Tricky Wombat exists because someone needs to fix the pipe, not polish the faucet.