Accounting SaaS LLM Observability

Moe Delo: observability for LLM agents inside their own perimeter

We built a system for Moe Delo, an online accounting service, that shows what its AI agents actually do when answering a client: what they asked the AI, how much each answer cost, and whether it was correct. Everything runs on the company's own servers, client data never leaves, and its engineers reuse the same setup on every new AI service they build.

What separates a demo from production

An LLM agent that answers questions in a developer’s notebook looks finished. The same agent in production turns into a black box. You can’t see which prompt went to the model, what the knowledge-base search returned, how much one answer cost, or whether it was even correct. For a company already running agents, that blindness is the real problem, and the agent itself is secondary.

Moe Delo is an online accounting and bookkeeping-outsourcing service, and agents were already running there. What they ordered was not another agent, but the layer underneath: the ability to see, cost, and check every call to a model, and to do it inside their own infrastructure without sending data out.

We delivered a working reference implementation — a small agent wired with the full production layer: tracing for every call, cost accounting, and automatic quality evaluation. All of it stands up inside their own Kubernetes. From there the team takes the reference as a template and wires their real services into the layer.

Production-readiness here begins where the demo ends: you have to see every call to a model, understand what it cost, and be able to prove the answer was correct. None of that comes for free, and almost every one of those tasks ran into a specific engineering decision.

The app shouldn’t know what’s watching it

The most important decision in the project we got wrong at first.

Tracing records the whole chain of an agent’s actions: which prompt went to the model, what it returned, how many tokens it spent, which tools it called. The easiest way to wire it up is with the ready-made SDK from Langfuse, the very observability system we were deploying. That’s what we did in the first version, and it worked on the first try.

And we threw it out. The client copies the reference onto every one of their services, so any coupling inside it multiplies along with the template. Had we tied the app to one specific observability system, that vendor lock would have ridden into every service: swapping Langfuse for Grafana or Datadog would then be impossible without rewriting code across all services at once.

So the app emits its telemetry in the open OpenTelemetry standard and pulls in nobody’s SDK. What goes where is decided by infrastructure: between the app and observability sits a separate collector, and it routes the streams. How that path works end to end is the next section.

This doesn’t buy full independence: Langfuse stays the backend for traces, scores, and annotation, and replacing it is a separate job. But the most expensive part stays vendor-neutral: the applications themselves, which a company will eventually have dozens of and which get copied most often.

How observability works

The idea is simple. Every action the agent takes while answering is recorded separately. The records flow into one place, get split there by destination, and are reassembled into a clear picture of one answer — what the agent did, what it cost, and how good the answer was. Here is that path, step by step.

The path of one answer: from agent to dashboard

The agent gets a question

A single user question starts a single agent run. Take, for example, "How many cards can I play in one turn?" During the run it parses the question, goes to the knowledge base for fragments, and calls the model for the wording.

one run = one agent run

Every action records itself

The app logs nothing by hand. The instrumentation library opens a separate record for every step: a model call, a database query, an incoming request to the service.

OpenTelemetry · auto-instrumentation

The model-call record carries the numbers

Which model answered and how many tokens went in and out — that's where the cost of the answer comes from. In the record itself it looks like this:

gen_ai.systemopenai gen_ai.operation.namechat gen_ai.request.modelvllm-fast-generation gen_ai.usage.input_tokens1843 gen_ai.usage.output_tokens52

Context gets attached to the record

Who asked, in which session, the question itself, and which fragment the search returned. Without this you can't find the answer later or understand what it was built on:

user.idu-4821 session.idthread-9f3c langfuse.trace.input"How many cards can I play in one turn?" langfuse.observation.metadata.rag_context"…a player plays one card from their hand per turn…"

All records flow into one collector

The app sends everything to a single point and doesn't know where it will be stored. Which stream goes where is decided by the collector.

OpenTelemetry Collector

The collector splits the streams

Everything goes to a shared store for infrastructure monitoring. The collector recognizes model calls by the names prefixed with gen_ai. and additionally sends them to a system built for LLMs.

ClickHouse · Langfuse

The records reassemble into a tree

From the separate records the whole answer is rebuilt, and inside it the nested steps: the answer, inside it a model call, a search, the model again. That's why it's a tree and not a flat list, and on every node you see the model, the tokens, and the cost.

trace

Quality scores go on top

Judge models and real people score every step of the answer, and the score attaches to the same record.

LLM judges · manual annotation

Before, once an agent had answered, nothing was left but the text itself. Now any answer can be opened and walked through step by step, costed, and compared with others.

The request context attaches to the records at the infrastructure level, around the agent’s business code. A developer wiring their service into the layer doesn’t place these markers by hand: they show up in the records on their own.

What one answer costs

To make cost visible, calls to models go through a single LiteLLM gateway. When every call goes through one point, token and money accounting collects in that same place instead of being smeared across services, and quotas are set centrally. Inside the client’s perimeter the requests go to their own models, running on their own hardware.

Cost came with an honest annoyance. Langfuse computes the price of a call from a built-in model catalog, and the client’s own models aren’t in that catalog, so the cost shows up as zero. To make money accounting work on your own models, prices have to be entered by hand. You only see a detail like this on a real stack with your own models; on someone else’s cloud models it simply doesn’t come up, which is why it rarely makes the checklists.

How we know the answer is any good

Answer quality is scored automatically and by hand at the same time. On the automatic side, LLM judges run in Langfuse — separate scoring models that look at the agent’s steps and rate helpfulness, conciseness, toxicity, whether the answer’s language matches the question’s, and the relevance of the retrieved context. Context relevance is measured two ways, against the current question and against the whole conversation, because those are different things.

Scores land at the level of a single step, not just the final answer. That matters: when an answer is bad, you can see where exactly it broke, in retrieval or in generation.

Manual annotation runs alongside. People rate answers on their own scales, for example whether the answer is correct and how well it is grounded in the retrieved fragments, and Langfuse has annotation queues for that. The client stands up the whole set of evaluators and scales with a single script. The subtlety is that Langfuse has no public API for configuring evaluators, so the script logs in as a regular user and sets everything up through the internal interface, and a repeated run duplicates nothing. A one-time manual setup turns into a repeatable loop that lives in their cluster.

It all runs inside their perimeter

The whole stack — observability, the model gateway, background-task orchestration, evaluation — is deployed in the client’s own Kubernetes. Traces settle in their Langfuse and ClickHouse, model requests go through their LiteLLM. In their production perimeter the models run on their own hardware, and data does not leave. We ran the demo on harmless board-game data, so for the show the model could even be an external one, but for a domain where bookkeeping and clients’ personal data sit behind the agent, the data perimeter becomes a hard requirement.

One decision grew straight out of that requirement. Ready-made filtering services from external providers send the request out for inspection, which is not allowed here. So we built the guardrail — the protective filter — ourselves and showed it as an example: here’s the platform, you can plug a ready-made filter into it or write your own, here’s your own. Unlike the basic ones, it can look at the whole conversation rather than the last message alone. The guardrail itself was an optional part of this project; we added it to the demo as a bonus.

So that the app image wouldn’t depend on private access to an internal library, we vendored that library, with its automatic instrumentation, straight into the repository. The reference stays self-contained, and there is no need to build it by reaching around the closed perimeter.

Under the hood — engineering map of the layer

APPLICATION

emits all telemetry in standard OpenTelemetry · pulls in no Langfuse SDK · filters nothing on its own

Auto-instrumentation FastAPI · SQLAlchemy · pydantic-ai

spans for HTTP requests, database calls, and model calls are created on their own, by the vendored library

Context enrichment SpanProcessor + ContextVar

who asked, the session, the question are written into the span at creation — through a FastAPI dependency, after route resolution, not in middleware

the answer and the retrieved context are written later: RAG context goes into the still-open "agent run" span via a live reference (observation-level)

OpenTelemetry Collector — single routing point

ALL TELEMETRY → ClickHouse

infrastructure monitoring: HTTP, SQL, model calls — unfiltered

MODEL ONLY → Langfuse

LLM observability: model calls only

Filter instrumentation_scope + gen_ai attributes

no wildcard, the markers are listed one by one; a tool call carries only the tool name

the trace shows up in Langfuse →

call tree tokens and cost model quality scores

DOCUMENT INGESTION INTO SEARCH Temporal

the pipeline must survive a failure and not start from scratch

Workflow by steps retry + timeout per step

download from S3 · parse and chunk · generate embeddings and write to Qdrant

Embeddings and write merged into one step payload limit

large vectors can't be passed through the Temporal server, so generation and the write to Qdrant go together

Per-document visibility

each document's status is visible in Temporal; a failed step is retried, not the whole pipeline

QUALITY EVALUATION Langfuse

LLM judges at the step level observation-level

helpfulness · conciseness · toxicity · answer language · context relevance (by question and by history)

Manual annotation annotation queues

answer correctness and grounding in fragments — on human scales, via the public API

Repeatable provisioning idempotent script

no public API for evaluators → log in as a user; a repeated run duplicates nothing and restores the setup

the whole stack in the client's own Kubernetes · models on their hardware · data never leaves the perimeter
OpenTelemetry · Langfuse · ClickHouse · Temporal · LiteLLM · Qdrant · pydantic-ai · FastAPI

What the client is left with

Behind the dull phrase “observability layer” sits a concrete new capability. The agent in production used to be a black box; now the client’s engineers see every call to a model, its cost, and its quality score, and they can wire in the next service by copying the ready-made reference, with no vendor lock and without taking data out of their perimeter.

That is where the economics of the solution lives. A single working agent is still just a demo, and what lets you trust it in production is the unglamorous layer underneath it, the one we built. You build it once and reuse it on every agent after that, which is why it is worth taking on before the company has many agents.

What we learned in the pilot

A working reference implementation the client's team copies onto their own services

The app emits standard OpenTelemetry and depends on no observability vendor's SDK

Telemetry routing and filtering live in infrastructure, not in code

End-to-end per-token cost accounting — even for the client's own models

Quality is scored at every step of the agent, not just the final answer

The whole stack runs in their cluster — data never leaves the perimeter

A working reference implementation the client's team copies onto their own services

The app emits standard OpenTelemetry and depends on no observability vendor's SDK

Telemetry routing and filtering live in infrastructure, not in code

End-to-end per-token cost accounting — even for the client's own models

Quality is scored at every step of the agent, not just the final answer

The whole stack runs in their cluster — data never leaves the perimeter

Platform modules used in this project

Observability OpenTelemetry · Langfuse · ClickHouse

Tracing for every call to a model: the call tree, tokens, cost, and a link to the user and session. Request context is written into the telemetry by a separate processor, so the business code knows nothing about the observability system.

LLM Router LiteLLM

A single gateway to the models, so token and cost accounting lives in one place and routing and quotas are set centrally. Inside their perimeter the requests go to their own models.

Evaluation Langfuse

LLM judges score every step of the agent (helpfulness, conciseness, toxicity, language, context relevance), not just the final answer. Alongside them sit manual annotation queues. The whole set of evaluators stands up from one repeatable script.

Documents Temporal · Qdrant

Document ingestion into search is wrapped in a Temporal workflow: retries and timeouts per step, per-document visibility, and a failed step retried without restarting the whole pipeline.

Chat & Agents pydantic-ai

A pydantic-ai demo agent as a reference for integrating with the platform. Deliberately trivial in what it does — the value is the production scaffolding around it, which the client copies onto their own services.

Guardrails

A hand-written protective filter as an example: you can plug a ready-made filter into the platform or write your own. It looks at the whole conversation. External services were rejected because they send the request out of the perimeter.

All platform modules →

Tell us which process you want to break down.

We will tell you whether the task fits AI agents and, if it does, outline a concrete plan.

or write directly to ilya@manaraga.ai