This layer grew out of four deployments where the same work kept repeating: pulling rules out of documents and chats, finding similar real cases, and seeing before release where the agent still follows the old scenario.
In one deployment, 47% of all changes came from live conversations. In another, corporate tone went through 60+ iterations. In a third, the confidence formula grew to more than 30 parameters. In a fourth, every intermediate step needed its own verification set.
After several deployments the same pattern became obvious: the real cost is not model hosting or RAG wiring. The expensive part starts later: rules have to be pulled out of PDFs, DOCX files, spreadsheets, and chats; production mistakes have to become test cases; and the whole quality loop has to survive every contract or scenario change instead of being rebuilt from scratch. Datasetgen grew out of that repeated work.
The full platform map lives on the platform overview →
A patient writes in the DMS (voluntary health insurance) chat: “I need an MRI, tomorrow afternoon works for me.” Until yesterday the agent could prepare a clinic booking right away. Now the program says this route must go through telemedicine first, and no guarantee letter can be issued before that step.
Without a layer like this, the change lives in a PDF, in messages from the business team, and in the analyst’s head. The agent can easily keep following the old route in some scenarios: trying to book immediately or issuing the guarantee letter too early.
Datasetgen takes the new rule, pulls similar real chats, and builds the verification set from them: where direct booking is allowed, where the flow must stop at telemedicine, and where the case must go to an operator.
The team then runs the new agent version through that set and gets a concrete list of scenarios where the agent still confuses the route, the clinic choice, or the timing of the guarantee letter, rather than an abstract “quality loop”.
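To make the MRI example concrete, here is a minimal sketch of such a run. It is not Datasetgen's actual format: the route labels, the `Case` fields, the two extra chats, and the stand-in `old_agent` are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical route labels based on the three outcomes named above.
DIRECT_BOOKING = "direct_booking"
TELEMEDICINE_FIRST = "telemedicine_first"
OPERATOR = "operator"


@dataclass(frozen=True)
class Case:
    case_id: str          # illustrative identifier
    message: str          # what the patient wrote
    expected_route: str   # route required by the new rule


def failing_cases(
    cases: List[Case], agent: Callable[[str], str]
) -> List[Tuple[str, str, str]]:
    """Return (case_id, expected, actual) for every case the agent gets wrong."""
    failures = []
    for case in cases:
        actual = agent(case.message)
        if actual != case.expected_route:
            failures.append((case.case_id, case.expected_route, actual))
    return failures


def old_agent(message: str) -> str:
    # Stand-in for an agent that still follows the old rule: always book directly.
    return DIRECT_BOOKING


cases = [
    Case("mri-001", "I need an MRI, tomorrow afternoon works for me.", TELEMEDICINE_FIRST),
    Case("gp-002", "Can I see a general practitioner on Friday?", DIRECT_BOOKING),
    Case("mri-003", "My program changed last week, can I still get an MRI?", OPERATOR),
]

report = failing_cases(cases, old_agent)
for case_id, expected, actual in report:
    print(f"{case_id}: expected {expected}, got {actual}")
```

With this stand-in agent the report lists the two MRI cases, which is the kind of concrete output the team works from.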
This was not invented in isolation. Every block appeared after a concrete pain point in a live deployment.
In the investment project, 47% of all changes came from live conversations. That made one thing clear: production conversations must become new rules and test cases, not a postmortem discussion on a call.
In telecom, corporate tone and phrasing went through 60+ iterations. We needed a reusable way to lock scenarios down and verify quality after each change instead of rebuilding that loop from scratch.
On the transport project, the confidence formula grew to 30+ parameters, and every penalty came from a concrete production failure. It showed that failures often sit in the data and the checks, before anything reaches the model.
In the DMS workflow, we had to keep separate verification sets for chat parsing, service matching, visits, notes, and operator QA. One final score was useless; the loop had to be step-by-step.
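A small sketch of why a single final score hides the problem. The step names follow the list above; the result data and the `pass_rate_by_step` helper are made up for illustration and do not describe the real sets.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def pass_rate_by_step(results: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the share of passed checks separately for each pipeline step."""
    totals: Dict[str, int] = defaultdict(int)
    passed: Dict[str, int] = defaultdict(int)
    for step, ok in results:
        totals[step] += 1
        if ok:
            passed[step] += 1
    return {step: passed[step] / totals[step] for step in totals}


# Illustrative results: (step, did the check pass).
results = [
    ("chat_parsing", True), ("chat_parsing", True),
    ("service_matching", True), ("service_matching", False),
    ("visits", True), ("visits", True),
    ("notes", True), ("notes", True),
    ("operator_qa", True), ("operator_qa", True),
]

rates = pass_rate_by_step(results)
overall = sum(ok for _, ok in results) / len(results)

print(f"overall: {overall:.0%}")
for step, rate in rates.items():
    print(f"{step}: {rate:.0%}")
```

In this toy data the overall score looks healthy while service matching fails half of its checks, which only the per-step view shows.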
We pull requirements out of documents, spreadsheets, diagrams, and examples so the team has one place describing rules, constraints, and edge cases.
The sets are not created “for the domain in general”; they are built against the precise input/output schema, scenario types, and boundaries the agent must hold.
The target is not merely valid YAML or JSON. We check requirement coverage, negative and boundary cases, contradictions in expected output, and signs of quality drift.
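Two of these checks, requirement coverage and contradictions in expected output, can be sketched in a few lines. The case layout (`input`, `expected`, `covers`) and the requirement IDs are assumptions made for this example, not the actual schema.

```python
from collections import defaultdict
from typing import Dict, List, Set


def uncovered_requirements(requirement_ids: List[str], cases: List[dict]) -> List[str]:
    """Requirements that no case in the set claims to cover."""
    covered = {rid for case in cases for rid in case["covers"]}
    return sorted(set(requirement_ids) - covered)


def contradictory_inputs(cases: List[dict]) -> List[str]:
    """Inputs that appear with more than one distinct expected output."""
    expected_by_input: Dict[str, Set[str]] = defaultdict(set)
    for case in cases:
        expected_by_input[case["input"]].add(case["expected"])
    return sorted(inp for inp, exp in expected_by_input.items() if len(exp) > 1)


# Hypothetical requirement IDs and cases.
requirements = ["R1", "R2", "R3"]
cases = [
    {"input": "MRI tomorrow afternoon", "expected": "telemedicine_first", "covers": ["R1"]},
    {"input": "MRI tomorrow afternoon", "expected": "direct_booking", "covers": ["R1"]},
    {"input": "GP visit on Friday", "expected": "direct_booking", "covers": ["R2"]},
]

missing = uncovered_requirements(requirements, cases)
conflicts = contradictory_inputs(cases)
print("uncovered:", missing)
print("contradictions:", conflicts)
```

A file like this would be perfectly valid JSON and still fail both checks: one requirement has no case, and one input has two incompatible expected outputs.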
When the agent schema or product contract changes, we try not to rebuild everything from scratch. The sets are patched, extended, and moved to the new contract where possible.
The challenge is not getting a file out. The challenge is making expected output reflect the real process logic and hidden constraints instead of merely looking plausible.
The hardest part is translating analyst, QA, and operator judgment into fields, negative cases, tolerances, and verifiable scenarios. That is exactly where an ordinary open-source setup stops being enough.
And this is not one-off work. Documents change, agent contracts change, scenarios expand. The test sets and the quality loop must evolve with the system instead of being thrown away after every change.
This is not a separate product standing next to the platform. It is the junction of three modules that already exist on the platform map.
The main loop: synthetic and test sets, LLM judges, regressions, and checks for degradation after changes.
The intake into the loop: documents, spreadsheets, guidelines, and examples become one rule base for generation and verification.
BA and QA workflows on top of the agent runtime: intake, requirement normalization, set preparation, and targeted fixes after changes.
A new project no longer starts its quality loop from a blank page. Repeated steps are already packaged, so the team spends time on domain specifics instead of mechanical artifact assembly.
After changes in the agent or the documents, the whole loop does not have to be rebuilt blindly. The sets can be updated selectively and checked against the exact place where drift appeared.
For the client this means something simple: after an agent change, the team does not guess what broke. It gets updated rules, a verification set, and a report showing which scenarios no longer pass.
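One way to picture the selective update: if each case records which rules it exercises, a rule change selects only the affected cases for re-checking. This is a sketch under that assumption; the rule IDs, the `rules` field, and the `cases_to_recheck` helper are hypothetical.

```python
from typing import Iterable, List


def cases_to_recheck(cases: List[dict], changed_rule_ids: Iterable[str]) -> List[str]:
    """Return IDs of cases that exercise at least one changed rule."""
    changed = set(changed_rule_ids)
    return [case["id"] for case in cases if changed & set(case["rules"])]


# Hypothetical link between cases and the rules they exercise.
cases = [
    {"id": "mri-001", "rules": ["R-telemedicine", "R-guarantee-letter"]},
    {"id": "gp-002", "rules": ["R-direct-booking"]},
    {"id": "mri-003", "rules": ["R-telemedicine"]},
]

selected = cases_to_recheck(cases, ["R-telemedicine"])
print(selected)
```

Only the two cases tied to the changed telemedicine rule are selected; the unrelated booking case is left alone.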
The four deployments that created this layer live on the cases page. See case studies →
We will tell you whether the task fits AI agents and, if it does, outline a concrete plan.
or write directly to ilya@manaraga.ai