The legal team of a government customer litigates against contractors under Russia's public procurement law (44-FZ): recovering penalties and terminating contracts. We built a system that prepares the lawyer's position on a dispute — the applicable rules, case law, risks, and the other side's counterarguments — a draft that the lawyer reviews and takes to court.
A large government customer has hundreds of contracts with contractors, and some of them end up in court. Every dispute lands on the legal team.
The contractor missed the deadlines, failed to deliver, broke the terms. The customer has to recover a penalty, terminate the contract, or fend off a counterclaim.
For each one the position is assembled by hand from four sources. We built a system that prepares a draft of that position for the lawyer. The genuinely hard part turned out to be nothing like what you would expect from a task like this.
Take a penalty for late delivery: can it be recovered, and what is the ceiling? Any legal database will return the relevant articles in a second — liability applies, here is the formula, go to court. That is how a first-week intern would answer.
A lawyer who has actually run these disputes asks something else first: did the customer itself breach anything, hand over the site on time, sign off what was required, accept the milestones? If the customer failed to cooperate, the court can cut the penalty down or refuse to award it at all. These claims fall apart more often on counterclaims, on the customer’s own delay, or on a botched termination procedure than on a weak position on the merits.
The law here is public, it sits in open databases, and anyone can find an article. The lawyer’s value is elsewhere: remembering what gets used against you under that same law, and not repeating the mistake that tripped you up last time. The system assembles the position on a dispute — applicable rules, case law, risks, the other side’s counterarguments — and hands it over as a draft memorandum: a finished opinion with a stated position that the lawyer reviews and takes to court.
The obvious first move for a product like this is document search: take the question, find the most similar passages from laws and rulings, hand them to the model, and let it write the answer. For a quick reference question that works. For a court dispute it breaks on the very first case.
Back to the article on the customer’s duty to cooperate. In a court ruling about a contractor’s delay those words may not appear at all: cooperation is the contractor’s defense, something it raises in response, and it does not feature in the subject of the claim. Semantic search will not pull it up, because it is not in the text of the case. And the lawyer who forgot about it will hear it in the courtroom from the other side.
That is where the central decision of the whole system came from: when it comes to completeness, do not trust search. A rule being absent from the retrieved documents does not mean it is not needed. More often the opposite is true: the rules most dangerous to the customer are exactly the ones it would never write into its own claim.
So half of the engineering here lives outside search, in what the system adds to the retrieved material on its own. From the wording of the question it identifies the type of dispute and pulls in the rules the lawyer is obliged to check, even when they are nowhere in the case.
Ask about a penalty for delay, and the system lays out what the contractor will defend with before the lawyer even thinks of it. The links between a dispute type and the rules that must be checked are written by hand, like the memory of someone who has sat through hundreds of hearings and knows which rules travel to court together. It is manual work, and for every new type of dispute the map is written out again.
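A hand-written map of this kind can be sketched as a plain lookup that runs after retrieval. This is a minimal illustration, not the production code: the dispute types, keyword cues, and rule labels below are hypothetical placeholders, and how the real system classifies a question is not described here.

```python
from typing import List, Optional

# Hypothetical map: dispute type -> rules the lawyer is obliged to check.
# In the real system these links are written by hand by lawyers.
MANDATORY_RULES = {
    "penalty_for_delay": [
        "44-FZ: contractor liability and penalty",
        "Civil Code: customer's duty to cooperate",
        "Civil Code: customer's own delay",
    ],
    "contract_termination": [
        "44-FZ: unilateral termination procedure",
    ],
}

# Hypothetical keyword cues used to guess the dispute type from the question.
DISPUTE_CUES = {
    "penalty_for_delay": ("penalty", "late", "delay"),
    "contract_termination": ("terminate", "termination"),
}


def detect_dispute_type(question: str) -> Optional[str]:
    """Return the first dispute type whose cue appears in the question."""
    text = question.lower()
    for dispute_type, cues in DISPUTE_CUES.items():
        if any(cue in text for cue in cues):
            return dispute_type
    return None


def with_mandatory_rules(question: str, retrieved: List[str]) -> List[str]:
    """Keep what search found and append mandatory rules it did not surface."""
    required = MANDATORY_RULES.get(detect_dispute_type(question), [])
    result = list(retrieved)
    for rule in required:
        if rule not in result:
            result.append(rule)
    return result
```

The point of the design is that the appended rules do not depend on the search results at all: even an empty retrieval still yields the contractor's likely defenses for a recognized dispute type.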
There is caution in the other direction too: the system just as deliberately drops a rule that does not belong. The Civil Code and the public procurement law run on different regimes, and mixing their sanctions in one argument is a mistake the opponent spots immediately. So on a purely civil question the system does not drag in procurement articles, and when the question is specifically about reducing a penalty it removes the cooperation rules: that is a different axis of the dispute, and the extra material only adds noise.
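The two exclusions just described can be sketched as a filter over the candidate rules. The `Rule` fields and the regime and topic labels are assumptions made for this illustration; the actual tagging scheme is not given in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rule:
    label: str
    regime: str  # hypothetical tag: "civil" or "procurement"
    topic: str   # hypothetical tag, e.g. "penalty", "cooperation"


def drop_out_of_place(rules: List[Rule], question_regime: str,
                      question_topic: str) -> List[Rule]:
    """Remove rules that do not belong to the question being asked."""
    kept = []
    for rule in rules:
        # Purely civil question: procurement articles stay out.
        if question_regime == "civil" and rule.regime == "procurement":
            continue
        # Penalty reduction is a different axis from cooperation.
        if question_topic == "penalty_reduction" and rule.topic == "cooperation":
            continue
        kept.append(rule)
    return kept
```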
In an ordinary chatbot the worst that happens is an invented fact. In a legal assistant something else is more dangerous: a smooth, well-phrased untruth about the law.
The same cooperation article again. Its text says “the customer is obliged,” and the model cheerfully concludes that under this article the customer has the right to recover from the contractor. It sounds confident, on point, with the right citations. The meaning is the reverse: the article protects the contractor, and a lawyer who believes the model walks into court with an argument against himself.
So a layer of checks sits on top of generation, and each one grew out of a specific mistake the model had already made. The system learned to spot these failures phrase by phrase and cut or soften the sentence they appear in — whether it is the cooperation article turned against the customer, or a penalty reduction that has drifted into a 44-FZ argument.
The most important articles have a safety net: if the model went silent or got it wrong, a wording vetted in advance by a lawyer goes into the answer. There must be no empty space where a key rule should be.
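As a rough sketch of how a phrase-level check and the safety net could fit together: the regular expression, the `cooperat` marker, and the placeholder fallback text below are all assumptions for illustration. The real filters were built from mistakes the model actually made, and the real fallback wordings are written by lawyers.

```python
import re

# Hypothetical pattern for one known failure: the cooperation article
# read as if it gave the customer a right to recover.
INVERTED_COOPERATION = re.compile(
    r"cooperat\w*.*\bcustomer\b.*\b(has the right|is entitled|may recover)\b",
    re.IGNORECASE,
)

# Placeholder standing in for a wording vetted in advance by a lawyer.
COOPERATION_FALLBACK = "[lawyer-vetted wording on the customer's duty to cooperate]"


def split_sentences(text: str) -> list:
    """Naive sentence splitter, good enough for a sketch."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def check_cooperation(text: str) -> str:
    """Cut inverted sentences; add the vetted wording if the rule went missing."""
    kept = [s for s in split_sentences(text) if not INVERTED_COOPERATION.search(s)]
    cleaned = " ".join(kept)
    # Safety net: no empty space where the key rule should be.
    if "cooperat" not in cleaned.lower():
        cleaned = (cleaned + " " + COOPERATION_FALLBACK).strip()
    return cleaned
```

This version only cuts a sentence; softening it instead, as the text mentions, would need a rewrite step that is not shown here.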
The answer is built like a memorandum a senior lawyer would write. Each block is assembled separately and under its own constraints, so the recommendations do not dissolve into polite mush. Here is what such a draft looks like on the running penalty example.
The benchmark here is set by real answers from lawyers. For each of the seven dispute categories there is a model analysis in the same memorandum structure, and the system measures itself against them: did it match on substance, what did it miss, what did it add that did not belong. The same benchmarks drive a regular run so that quality does not drift after the next round of edits.
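The comparison against a benchmark answer can be illustrated with two small functions: a word-level similarity and a matched / missed / extra breakdown over the rules each answer cites. This is a sketch under assumptions: Jaccard overlap stands in for whatever word-level measure is actually used, the rule labels are hypothetical, and the judge-model assessment is not shown.

```python
import re


def words(text: str) -> set:
    """Lowercased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def word_similarity(answer: str, benchmark: str) -> float:
    """Jaccard overlap of word sets (an assumed stand-in metric)."""
    a, b = words(answer), words(benchmark)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def compare_rules(answer_rules, benchmark_rules) -> dict:
    """What matched, what was missed, what was added that did not belong."""
    a, b = set(answer_rules), set(benchmark_rules)
    return {
        "matched": sorted(a & b),
        "missed": sorted(b - a),
        "extra": sorted(a - b),
    }
```

Running both over every benchmark category after each round of edits gives a simple regression signal: a drop in similarity or a growing "missed" list flags a change worth reviewing.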
The whole system runs inside the customer’s perimeter. Court cases, correspondence, and internal materials never leave it, and model inference runs in the same place, inside the boundary. For a government body this is the condition without which the conversation does not even start.
And all of it was assembled on the customer’s real cases, not on a clean training corpus. The data arrives dirty:
All of this has to be normalized to a single form, otherwise the customer’s own case cannot be found by its own number. The work is thankless, but there is no way around it: a citation to a rule or a case has to be exact down to the article and the number, and a wrong citation in a memorandum is worse than a missing one. A lawyer notices an absence right away; a wrong reference he catches only in court.
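To make the normalization concrete, here is a minimal sketch for case numbers. It assumes the usual commercial-court format (a Cyrillic "А", a court code, a dash, the case number, a slash, the year) and assumes that two-digit years belong to the 2000s; the specific kinds of dirt handled here are illustrative, not a list of what the customer's data actually contained.

```python
import re
from typing import Optional

# Latin "A" (U+0041) and Cyrillic "А" (U+0410) look identical on screen.
# The canonical form is assumed to use the Cyrillic letter.
CASE_RE = re.compile(r"^[A\u0410](\d{1,3})-(\d+)/(\d{2}|\d{4})$")


def normalize_case_number(raw: str) -> Optional[str]:
    """Return a canonical case number, or None if the input does not parse."""
    text = raw.strip().upper()
    text = text.replace("\u2116", "")                    # drop the numero sign
    text = re.sub(r"\s+", "", text)                      # drop inner whitespace
    text = re.sub(r"[\u2010-\u2015\u2212]", "-", text)   # unify dash variants
    match = CASE_RE.match(text)
    if match is None:
        return None
    court, number, year = match.groups()
    if len(year) == 2:
        year = "20" + year  # assumption: all cases are from the 2000s
    return f"\u0410{court}-{number}/{year}"
```

Returning `None` for anything that does not parse, rather than a best guess, follows the same logic as the text: a wrong citation is worse than a missing one.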
The system works as an assistant to the customer’s legal team. The lawyer used to start a dispute from a blank page, assembling the position by hand across several databases. Now he opens a ready draft that already brings together the applicable rules, including the ones the other side will strike with, the customer’s relevant past cases, the risks, and the recommendations. From there he reviews, edits, and decides what to take to court. The final word stays with him.
The system does not know the law better than the lawyer, and it does not need to. It holds what a human holds poorly: it remembers the inconvenient rule that was left out, and it does not confuse whom each article actually helps. Getting either of those wrong is what loses procurement disputes.
Three corpora in Qdrant: statutes (44-FZ, the Civil Code, government decrees, letters from the ministries), general case law (reviews and plenary rulings of the Supreme and Higher Commercial Courts), and the customer's own cases. On top of retrieval, a manual injection of rules that are not in the case text but that the lawyer is obliged to check.
pydantic-ai over the legal pipeline. The answer is assembled section by section (bottom line, rules, case law, the customer's experience, risks, recommendations), each section with its own constraints. Conversation history lives in threads.
Phrase-level filters against inverted interpretation: the cooperation article without false recovery rights for the customer, the Civil Code rule on reducing a penalty kept out of a 44-FZ argument. For key articles, lawyer-vetted wordings in case the model fails.
Benchmark answers from lawyers across seven dispute categories. Word-level similarity plus a judge-model assessment: what matched, what was missed, what was added that did not belong. A regular run so quality does not drift after edits.
DeepSeek for section-by-section memorandum assembly, Qwen3-Embedding-0.6B for vectorizing the three corpora. Inference runs inside the customer's perimeter.