We built a system for Moe Delo, an online accounting service, that shows what its AI agents actually do when answering a client: what they asked the AI, how much each answer cost, and whether it was correct. Everything runs on the company's own servers, client data never leaves, and its engineers reuse the same setup on every new AI service they build.
An LLM agent that answers questions in a developer’s notebook looks finished. The same agent in production turns into a black box. You can’t see which prompt went to the model, what the knowledge-base search returned, how much one answer cost, or whether it was even correct. For a company already running agents, that blindness is the real problem, and the agent itself is secondary.
Moe Delo is an online accounting and bookkeeping-outsourcing service, and agents were already running there. What they ordered was not another agent, but the layer underneath: the ability to see, cost, and check every call to a model, and to do it inside their own infrastructure without sending data out.
We delivered a working reference implementation — a small agent wired with the full production layer: tracing for every call, cost accounting, and automatic quality evaluation. All of it stands up inside their own Kubernetes. From there the team takes the reference as a template and wires their real services into the layer.
Production-readiness here begins where the demo ends: you have to see every call to a model, understand what it cost, and be able to prove the answer was correct. None of that comes for free, and almost every one of those tasks ran into a specific engineering decision.
The most important decision in the project we got wrong at first.
Tracing records the whole chain of an agent’s actions: which prompt went to the model, what it returned, how many tokens it spent, which tools it called. The easiest way to wire it up is with the ready-made SDK from Langfuse, the very observability system we were deploying. That’s what we did in the first version, and it worked on the first try.
And we threw it out. The client copies the reference onto every one of their services, so any coupling inside it multiplies along with the template. Had we tied the app to one specific observability system, that vendor lock would have ridden into every service: swapping Langfuse for Grafana or Datadog would then be impossible without rewriting code across all services at once.
So the app emits its telemetry in the open OpenTelemetry standard and pulls in nobody’s SDK. What goes where is decided by infrastructure: between the app and observability sits a separate collector, and it routes the streams. How that path works end to end is the next section.
This doesn’t buy full independence: Langfuse stays the backend for traces, scores, and annotation, and replacing it is a separate job. But the most expensive part stays vendor-neutral: the applications themselves, which a company will eventually have dozens of and which get copied most often.
The idea is simple. Every action the agent takes while answering is recorded separately. The records flow into one place, get split there by destination, and are reassembled into a clear picture of one answer — what the agent did, what it cost, and how good the answer was. Here is that path, step by step.
gen_ai. and additionally sends them to a system built for LLMs.Before, once an agent had answered, nothing was left but the text itself. Now any answer can be opened and walked through step by step, costed, and compared with others.
The request context attaches to the records at the infrastructure level, around the agent’s business code. A developer wiring their service into the layer doesn’t place these markers by hand: they show up in the records on their own.
To make cost visible, calls to models go through a single LiteLLM gateway. When every call goes through one point, token and money accounting collects in that same place instead of being smeared across services, and quotas are set centrally. Inside the client’s perimeter the requests go to their own models, running on their own hardware.
Cost came with an honest annoyance. Langfuse computes the price of a call from a built-in model catalog, and the client’s own models aren’t in that catalog, so the cost shows up as zero. To make money accounting work on your own models, prices have to be entered by hand. You only see a detail like this on a real stack with your own models; on someone else’s cloud models it simply doesn’t come up, which is why it rarely makes the checklists.
Answer quality is scored automatically and by hand at the same time. On the automatic side, LLM judges run in Langfuse — separate scoring models that look at the agent’s steps and rate helpfulness, conciseness, toxicity, whether the answer’s language matches the question’s, and the relevance of the retrieved context. Context relevance is measured two ways, against the current question and against the whole conversation, because those are different things.
Scores land at the level of a single step, not just the final answer. That matters: when an answer is bad, you can see where exactly it broke, in retrieval or in generation.
Manual annotation runs alongside. People rate answers on their own scales, for example whether the answer is correct and how well it is grounded in the retrieved fragments, and Langfuse has annotation queues for that. The client stands up the whole set of evaluators and scales with a single script. The subtlety is that Langfuse has no public API for configuring evaluators, so the script logs in as a regular user and sets everything up through the internal interface, and a repeated run duplicates nothing. A one-time manual setup turns into a repeatable loop that lives in their cluster.
The whole stack — observability, the model gateway, background-task orchestration, evaluation — is deployed in the client’s own Kubernetes. Traces settle in their Langfuse and ClickHouse, model requests go through their LiteLLM. In their production perimeter the models run on their own hardware, and data does not leave. We ran the demo on harmless board-game data, so for the show the model could even be an external one, but for a domain where bookkeeping and clients’ personal data sit behind the agent, the data perimeter becomes a hard requirement.
One decision grew straight out of that requirement. Ready-made filtering services from external providers send the request out for inspection, which is not allowed here. So we built the guardrail — the protective filter — ourselves and showed it as an example: here’s the platform, you can plug a ready-made filter into it or write your own, here’s your own. Unlike the basic ones, it can look at the whole conversation rather than the last message alone. The guardrail itself was an optional part of this project; we added it to the demo as a bonus.
So that the app image wouldn’t depend on private access to an internal library, we vendored that library, with its automatic instrumentation, straight into the repository. The reference stays self-contained, and there is no need to build it by reaching around the closed perimeter.
Behind the dull phrase “observability layer” sits a concrete new capability. The agent in production used to be a black box; now the client’s engineers see every call to a model, its cost, and its quality score, and they can wire in the next service by copying the ready-made reference, with no vendor lock and without taking data out of their perimeter.
That is where the economics of the solution lives. A single working agent is still just a demo, and what lets you trust it in production is the unglamorous layer underneath it, the one we built. You build it once and reuse it on every agent after that, which is why it is worth taking on before the company has many agents.
Tracing for every call to a model: the call tree, tokens, cost, and a link to the user and session. Request context is written into the telemetry by a separate processor, so the business code knows nothing about the observability system.
A single gateway to the models, so token and cost accounting lives in one place and routing and quotas are set centrally. Inside their perimeter the requests go to their own models.
LLM judges score every step of the agent (helpfulness, conciseness, toxicity, language, context relevance), not just the final answer. Alongside them sit manual annotation queues. The whole set of evaluators stands up from one repeatable script.
Document ingestion into search is wrapped in a Temporal workflow: retries and timeouts per step, per-document visibility, and a failed step retried without restarting the whole pipeline.
A pydantic-ai demo agent as a reference for integrating with the platform. Deliberately trivial in what it does — the value is the production scaffolding around it, which the client copies onto their own services.
A hand-written protective filter as an example: you can plug a ready-made filter into the platform or write your own. It looks at the whole conversation. External services were rejected because they send the request out of the perimeter.
We will tell you whether the task fits AI agents and, if it does, outline a concrete plan.
Inquiry sent
We will reply within one business day to the email you provided.
or write directly to ilya@manaraga.ai