Datasetgen: rules, eval suites, and regressions after every change

This layer grew out of four deployments where the same work kept repeating: pull rules out of documents and chats, collect similar real cases, and find out before release where the agent is still following the old scenario.

In one deployment, 47% of all changes came from live conversations. In another, corporate tone went through 60+ iterations. In a third, the confidence formula grew to more than 30 parameters. In a fourth, every intermediate step needed its own verification set.

Across these deployments the pattern was the same: the real cost is not model hosting or RAG wiring. The expensive part starts later, when rules have to be pulled out of PDFs, DOCX files, spreadsheets, and chats; when production mistakes have to become test cases; and when the quality loop has to survive every contract or scenario change without being rebuilt from scratch.

The full platform map lives on the platform overview →

How it looks in one real change

A patient writes in the VHI (voluntary health insurance) chat: “I need an MRI, tomorrow afternoon works for me.” Until yesterday the agent could prepare a clinic booking right away. Now the program says this route must go through telemedicine first, and no guarantee letter can be issued before that step.

Without a layer like this, the change lives in a PDF, in messages from the business team, and in the analyst’s head. The agent can easily keep following the old route in some scenarios: trying to book immediately or issuing the guarantee letter too early.

Datasetgen takes the new rule, pulls similar real chats, and builds the verification set from them: where direct booking is allowed, where the flow must stop at telemedicine, and where the case must go to an operator.

The team then runs the new agent version through that set and gets a concrete list of scenarios where the agent still confuses the route, the clinic choice, or the timing of the guarantee letter, rather than an abstract “quality loop”.
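As a rough illustration of the example above, a verification case and its check might look like the sketch below. The route labels, field names, and the `Case` / `AgentDecision` types are assumptions made for this sketch, not Datasetgen's actual format.

```python
from dataclasses import dataclass

# Hypothetical route labels: the text names three outcomes, not their identifiers.
DIRECT_BOOKING = "direct_booking"
TELEMEDICINE_FIRST = "telemedicine_first"
OPERATOR = "operator"


@dataclass(frozen=True)
class Case:
    case_id: str
    message: str
    expected_route: str
    guarantee_letter_allowed: bool


@dataclass(frozen=True)
class AgentDecision:
    route: str
    issued_guarantee_letter: bool


def check(case: Case, decision: AgentDecision) -> list[str]:
    """Return readable failures for one case; an empty list means the case passes."""
    failures = []
    if decision.route != case.expected_route:
        failures.append(
            f"{case.case_id}: route {decision.route!r}, expected {case.expected_route!r}"
        )
    if decision.issued_guarantee_letter and not case.guarantee_letter_allowed:
        failures.append(f"{case.case_id}: guarantee letter issued too early")
    return failures


cases = [
    Case("mri-001", "I need an MRI, tomorrow afternoon works for me.",
         TELEMEDICINE_FIRST, guarantee_letter_allowed=False),
]

# An agent still on the old route: books directly and issues the letter.
old_behavior = AgentDecision(route=DIRECT_BOOKING, issued_guarantee_letter=True)
for case in cases:
    for failure in check(case, old_behavior):
        print(failure)
```

Each failure names the case and the specific rule that was broken, which is what turns a run into a list of scenarios rather than a single score.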

Documents and chats turned into one rule base

Test sets under the exact agent contract

Fixes and migration instead of one-off generation

BA and QA inside the same loop

A reusable layer grown out of four deployments

Which case studies created it

This was not invented in isolation. Every block appeared after a concrete pain point in a live deployment.

AI agent replaces the personal manager for small investors

In the investment project, 47% of all changes came from live conversations. That made one thing clear: production conversations must become new rules and test cases, not a postmortem discussion on a call.

Case →

Operator of an Urban Transport System

On the transport project, the confidence formula grew to 30+ parameters and every penalty came from a concrete production failure. That showed that failure often lives in the data and checks before it reaches the model.

Case →

Luchi: a decision system for the VHI service workflow

In the VHI workflow, we had to keep separate verification sets for chat parsing, service matching, visits, notes, and operator QA. One final score was useless; the loop had to be step-by-step.

Case →
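To see why a single final score hides the problem, here is a minimal sketch that scores each step separately. The step names follow the five verification sets named above; the pass/fail results are made-up illustration data, not project figures.

```python
from collections import defaultdict

# (step, passed) pairs. Illustrative data only.
results = [
    ("chat_parsing", True), ("chat_parsing", True),
    ("service_matching", True), ("service_matching", False),
    ("visits", True),
    ("notes", True),
    ("operator_qa", False),
]


def pass_rate_by_step(results):
    """Aggregate pass rate per pipeline step instead of one global number."""
    totals = defaultdict(lambda: [0, 0])  # step -> [passed, total]
    for step, passed in results:
        totals[step][0] += int(passed)
        totals[step][1] += 1
    return {step: passed / total for step, (passed, total) in totals.items()}


rates = pass_rate_by_step(results)
overall = sum(passed for _, passed in results) / len(results)
weakest_step = min(rates, key=rates.get)
```

With this data the overall rate looks tolerable, while the per-step view shows that one step fails every case it sees.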

What datasetgen actually does

Builds one rule base

We pull requirements out of documents, spreadsheets, diagrams, and examples so the team has one place describing rules, constraints, and edge cases.

Builds test sets under the exact agent contract

The sets are not created “for the domain in general” but under the precise input/output schema, scenario types, and boundaries the agent must hold.
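A minimal sketch of what "under the exact contract" can mean in practice: every generated case is checked against the agent's input/output schema before it enters a set. The `CONTRACT` structure, field names, and allowed values below are hypothetical.

```python
# Hypothetical contract: field names and allowed values are illustrative only.
CONTRACT = {
    "input": {"message": str},
    "output": {"route": str, "guarantee_letter": bool},
    "allowed": {"route": {"direct_booking", "telemedicine_first", "operator"}},
}


def contract_errors(case: dict) -> list[str]:
    """List every way a case deviates from the contract (empty list = conforms)."""
    errors = []
    for side in ("input", "output"):
        expected_fields = CONTRACT[side]
        actual = case.get(side, {})
        for name in expected_fields.keys() - actual.keys():
            errors.append(f"{side}.{name}: missing")
        for name in actual.keys() - expected_fields.keys():
            errors.append(f"{side}.{name}: not in contract")
        for name, expected_type in expected_fields.items():
            if name in actual and not isinstance(actual[name], expected_type):
                errors.append(f"{side}.{name}: expected {expected_type.__name__}")
    for name, allowed in CONTRACT["allowed"].items():
        value = case.get("output", {}).get(name)
        if isinstance(value, str) and value not in allowed:
            errors.append(f"output.{name}: {value!r} not allowed")
    return errors


good = {"input": {"message": "x"},
        "output": {"route": "operator", "guarantee_letter": False}}
bad = {"input": {"message": "x"}, "output": {"route": "book_now"}}
```

Here `good` conforms, while `bad` is rejected twice: a required output field is missing and the route value is outside the allowed set.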

Checks meaning, not just format

The target is not merely valid YAML or JSON. We check requirement coverage, negative and boundary cases, contradictions in expected output, and signs of quality drift.
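Two of these semantic checks can be sketched in a few lines: requirement coverage, and contradictions where the same input carries different expected outputs. The case layout (`id`, `input`, `expected`, `covers`) and the requirement ids are assumptions for illustration.

```python
from collections import defaultdict


def uncovered_requirements(requirement_ids, cases):
    """Requirements that no case claims to cover."""
    covered = {rid for case in cases for rid in case["covers"]}
    return sorted(set(requirement_ids) - covered)


def contradictions(cases):
    """Groups of case ids whose inputs match but whose expected outputs differ."""
    by_input = defaultdict(dict)  # normalized input -> {expected: [case ids]}
    for case in cases:
        key = " ".join(case["input"].lower().split())  # ignore case and spacing
        by_input[key].setdefault(case["expected"], []).append(case["id"])
    return [
        sorted(cid for ids in groups.values() for cid in ids)
        for groups in by_input.values()
        if len(groups) > 1
    ]


cases = [
    {"id": "c1", "input": "Need an MRI", "expected": "telemedicine_first", "covers": ["R1"]},
    {"id": "c2", "input": "need an  mri", "expected": "direct_booking", "covers": ["R1"]},
    {"id": "c3", "input": "Blood test", "expected": "direct_booking", "covers": ["R2"]},
]
missing = uncovered_requirements(["R1", "R2", "R3"], cases)
conflicts = contradictions(cases)
```

All three cases are well-formed, yet the set still has two defects: requirement `R3` has no case, and `c1` and `c2` disagree about the same request.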

Keeps the sets alive after changes

When the agent schema or product contract changes, we try not to rebuild everything from scratch. The sets are patched, extended, and moved to the new contract where possible.
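One way to move a case to a new contract without regenerating it is sketched below. The v1 to v2 change (renaming `clinic` to `clinic_id` and adding a required `route` field) is invented for this example; the point is that fields that can be mapped are mapped, and anything that cannot be derived is flagged for review rather than guessed.

```python
import copy


def migrate_case(case: dict) -> dict:
    """Move one case from hypothetical contract v1 to v2 without mutating the original."""
    migrated = copy.deepcopy(case)
    output = migrated["output"]
    if "clinic" in output:
        output["clinic_id"] = output.pop("clinic")  # mechanical rename
    migrated["schema_version"] = 2
    # A new required field has no safe default: mark the case for human review.
    migrated["needs_review"] = "route" not in output
    return migrated


v1_case = {"id": "c1", "schema_version": 1, "output": {"clinic": "north"}}
v2_case = migrate_case(v1_case)
```

The original case is left untouched, so the old set stays usable until the migrated one has been reviewed.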

What is harder here than an ordinary open-source setup

The challenge is not getting a file out. The challenge is making expected output reflect the real process logic and hidden constraints instead of merely looking plausible.

The hardest part is translating analyst, QA, and operator judgment into fields, negative cases, tolerances, and verifiable scenarios. That is exactly where an ordinary open-source setup stops being enough.

And this is not one-off work. Documents change, agent contracts change, scenarios expand. The test sets and the quality loop must evolve with the system instead of being thrown away after every change.

Where it sits inside the platform

This is not a separate product standing next to the platform. It is the junction of three modules that already exist on the platform map.

Evaluation

The main loop: synthetic and test sets, LLM judges, regressions, and checks for degradation after changes.

Documents

The intake into the loop: documents, spreadsheets, guidelines, and examples become one rule base for generation and verification.

Chat & Agents

BA and QA workflows on top of the agent runtime: intake, requirement normalization, set preparation, and targeted fixes after changes.

platform overview →

How we assemble this loop

INPUTS

PDF / DOCX / spreadsheets · production conversations · QA artifacts · input / output schemas

in parallel

CONTEXT

turn fragmented evidence into one source of truth

Requirement normalization · documents → requirements

rules, constraints, examples, and edge cases are extracted from documents

Research and intake · BA workflow

when context is still raw, the layer first exposes uncertainty and locks down what is actually known

CONTRACT

fix the exact contract the agent must satisfy

Schema and examples · strict shape

input / output structure, field order, allowed values, and invariants

Domain rules · project-specific

RAG, subscriber mocks, Excel scenarios, and other inputs get their own generation logic

one source of truth →

requirements.md · agent schema · examples and constraints

GENERATION

dataset creation

positive, negative, edge, and cross-entity cases are generated under the exact schema

expected outputs must stay grounded in source documents or known process rules

light checks stop broken format from moving to the next stage

EVALUATION

quality proof

Dataset review · coverage + semantics

coverage, suspicious cases, repetition, and internal contradictions are checked

Fast and detailed reports · judge loop

a quick sanity check during generation and a deeper offline report before release

EVOLUTION

after the agent changes

Targeted fixes · patch, not rewrite

problematic cases are fixed or added without rebuilding the whole set

Schema migration · contract drift

existing eval suites survive a new agent contract instead of dying after every refactor

Reverse spec · handover

when code outruns documentation, the layer reconstructs the spec from the actual implementation

documents → requirements → datasets → eval reports → fixes / migrations · BA + QA workflows in one loop · reusable across projects while domain logic stays client-specific

What changes in the project

A new project no longer starts its quality loop from a blank page. Repeated steps are already packaged, so the team spends time on domain specifics instead of mechanical artifact assembly.

After changes in the agent or the documents, the whole loop does not have to be rebuilt blindly. The sets can be updated selectively and checked against the exact place where drift appeared.

For the client this means something simple: after an agent change, the team does not guess what broke. It gets updated rules, a verification set, and a report showing which scenarios no longer pass.
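The report of scenarios that no longer pass can be as simple as a diff between two runs over the same set. A minimal sketch, assuming each run is stored as a mapping from case id to pass/fail (the ids and results here are invented):

```python
def regressions(baseline: dict, candidate: dict) -> list[str]:
    """Case ids that passed on the baseline run and fail or are missing on the candidate."""
    return sorted(
        case_id for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


def fixes(baseline: dict, candidate: dict) -> list[str]:
    """Case ids that failed on the baseline run and pass on the candidate."""
    return sorted(
        case_id for case_id, passed in candidate.items()
        if passed and not baseline.get(case_id, False)
    )


baseline_run = {"mri-001": True, "mri-002": True, "lab-001": False}
candidate_run = {"mri-001": True, "mri-002": False, "lab-001": True}

broken = regressions(baseline_run, candidate_run)
repaired = fixes(baseline_run, candidate_run)
```

Treating a missing case as a failure is a deliberate choice in this sketch: a scenario that silently disappears from a run should show up in the report, not vanish from it.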

The four deployments that created this layer live on the cases page. See case studies →

Tell us which process you want to break down.

We will tell you whether the task fits AI agents and, if it does, outline a concrete plan.

or write directly to ilya@manaraga.ai