We are building a development harness for the live party-game platform PARTYstation, written in PHP and Node: wrapping the working system in tests, maps, and rules so that a couple of people with agents can rewrite and maintain it without breaking anything in production. The goal is to cover the whole system with the harness; how much of it gets rewritten is secondary.
PARTYstation is a platform for party games. The game runs on a big screen — a TV at home, a birthday, a corporate event — and players join from their own phones as controllers: quizzes, charades, associations, meme rounds. Around ten game types, up to a few dozen players in one room, web-TV, mobile apps, Android TV, Samsung. The same PARTYstation games are also embedded inside streaming services, Kinopoisk and Wink, where you launch them right from the TV. The platform earns through subscriptions and paid games, so every interrupted game hits revenue directly: a guest didn’t finish playing and cancelled the subscription.
By the time we joined, the platform had been in production for several years. Its backend had drifted across two languages: a REST API in PHP and game servers in Node.js that hold a websocket connection to every phone in the room. On top of that, years of accumulated legacy: code no one holds in their head anymore, integrations left “just in case,” and behavior that works but nobody remembers why. Technical debt piled up faster than new features shipped, and a large team could no longer keep it under control.
What such legacy costs shows up in a New Year incident. Tournament results actually live in the database, but the game server read them only from a fast Redis cache and couldn’t fall back to the database when the cache was gone. On New Year’s night the shared Redis restarted mid-tournament, the cache of results emptied, and players never saw their standings until the cache refilled. The data was safe the whole time; the system simply couldn’t reach it.
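
The missing fallback can be illustrated with a minimal sketch. It is not the platform's code: the function name, the key layout, and the in-memory stand-ins for Redis and the database are all assumptions made for illustration. The point is only the shape of the read: try the cache, and on a miss go to the source of truth and refill.

```python
from typing import Callable, Optional


def read_standings(
    tournament_id: str,
    cache_get: Callable[[str], Optional[dict]],
    cache_set: Callable[[str, dict], None],
    db_load: Callable[[str], Optional[dict]],
) -> Optional[dict]:
    """Return standings from the cache, falling back to the database on a miss."""
    key = f"standings:{tournament_id}"  # hypothetical key layout
    cached = cache_get(key)
    if cached is not None:
        return cached
    # Cache miss (for example after a cache restart): the database still
    # holds the results, so read them from there and refill the cache.
    stored = db_load(tournament_id)
    if stored is not None:
        cache_set(key, stored)
    return stored


# In-memory stand-ins for Redis and the database, for illustration only.
cache: dict = {}
database = {"t1": {"alice": 3, "bob": 1}}

result = read_standings("t1", cache.get, cache.__setitem__, database.get)
```

With the cache empty, as it was after the restart, `result` still comes back from `database`, and the next read is served from `cache`.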
The deploy, meanwhile, is run by hand: a shell script enters each repository, runs git reset --hard, pulls master, and restarts the processes. There is no application-level observability; incidents are dug through over SSH and by reading logs by hand.
The company chose not to bring the large team back. Instead it is changing the development model itself: keep a small group of engineers and run the product through coding agents. The bet is simple — a couple of people with agents should hold and develop the system that a whole team used to carry.
Our job is to build what makes that model possible. Internally we call it a development harness: the scaffolding around the code that lets an agent understand someone else’s system, avoid harming it, and carry a task to completion with minimal human involvement.
We designated the system’s current behavior as the spec. A product company will not read and sign off on technical documentation for a refactor: it has no time to validate what looks like an internal reshuffle. The reference is the code on the client’s production branch and how it responds right now.
The rewritten version has to pass the same regression as the original, and that is the acceptance criterion. A document that would be stale by the next release was replaced with a check you can run at any moment.
That decision is where the whole method comes from. It fits into four steps, each building on the last, and the rest of this case takes them one at a time.
If the spec is behavior, you have to pin it down before you touch the code. We run the same test suite against two stands (test environments): the old one on PHP and Node and the new one on Python. The test doesn’t know which system it’s running against, and branches like “if this is the new stand, check differently” are forbidden: a divergence in behavior between the stands is exactly the error signal.
First the same suite runs against their live, not-yet-rewritten system. A green run there confirms the test describes real behavior, and only then does it become the reference for the new code.
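
A minimal sketch of that order of operations, under stated assumptions: a stand is modeled as a plain callable, and the endpoint path, the 400 status, and both toy stands are hypothetical. The real suite talks to running systems; the toy callables here exist only to make the example executable.

```python
from typing import Callable, List, Tuple

# A stand is anything that takes (method, path, body) and returns (status, json).
Stand = Callable[[str, str, dict], Tuple[int, dict]]


def check_login_rejects_empty_password(call: Stand) -> None:
    # The check never asks which stand it is talking to.
    status, body = call("POST", "/auth/login", {"login": "host", "password": ""})
    assert status == 400, f"expected 400, got {status}"
    assert "error" in body


SUITE = [check_login_rejects_empty_password]


def run_suite(call: Stand) -> List[str]:
    """Run every check against one stand and return the names that failed."""
    failed = []
    for check in SUITE:
        try:
            check(call)
        except AssertionError:
            failed.append(check.__name__)
    return failed


def regressions(legacy: Stand, rewritten: Stand) -> List[str]:
    """Accept a check as reference only if it is green on legacy first."""
    not_green = run_suite(legacy)
    if not_green:
        raise ValueError(f"not a valid reference, fails on legacy: {not_green}")
    return run_suite(rewritten)


# Toy stands, for illustration only.
def legacy_stand(method: str, path: str, body: dict) -> Tuple[int, dict]:
    if not body.get("password"):
        return 400, {"error": "password required"}
    return 200, {"ok": True}


def careless_rewrite(method: str, path: str, body: dict) -> Tuple[int, dict]:
    return 200, {"ok": True}  # forgot the validation
```

Here `regressions(legacy_stand, careless_rewrite)` reports the login check, while `regressions(legacy_stand, legacy_stand)` returns an empty list.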
We test on a ladder of levels, from small to large.
The hardest level is a real-time game for several people. To check one scenario, the test spins up several isolated browsers at once: the big screen where the game runs, and players’ phones — the host plays from a phone too. Each has its own cookies, and they’re all brought together into one room through the shared pool of game servers over websockets.
Then the harness plays the real game protocol — pick a game, answer a question, move to the next round, pause — and checks the room state that is broadcast to all participants. There are no mock stand-ins for real parts of the system anywhere here. The only thing we’re allowed to change in the product itself is adding markup identifiers to interface elements so the tests can latch onto them.
There are two things to keep apart here: what the system does on the outside and how it’s built on the inside. The outside is the endpoint’s response, its status, and the side effects other parts of the system depend on. That’s the contract, and we hold it one to one. The inside we rebuild cleanly, and we don’t drag architectural stupidity into the new code.
So parity is more than a matching response from an endpoint (a single API address). The old code has side effects that something already relies on: writes to the database, rows in the audit log, calls to neighboring services, cache invalidation. A faithful port reproduces those too. The admin login, for example, writes a row into the logs table with a specific category, level, and message text. We write such details out for every endpoint under a “side effects” heading; otherwise a naive rewrite silently loses them.
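
One way to picture parity with side effects is to compare what was observed on each stand, including the log rows. The sketch below is an assumption-laden illustration: the class names, field names, and the sample category, level, and message are hypothetical and not taken from the real logs table.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class LogRow:
    category: str
    level: str
    message: str


@dataclass
class Observed:
    """What one stand did in response to one request."""
    status: int
    body: dict
    log_rows: List[LogRow] = field(default_factory=list)


def parity_gaps(legacy: Observed, rewritten: Observed) -> List[str]:
    """List every part of the contract where the two stands differ."""
    gaps = []
    if legacy.status != rewritten.status:
        gaps.append("status")
    if legacy.body != rewritten.body:
        gaps.append("body")
    if legacy.log_rows != rewritten.log_rows:
        gaps.append("side effect: logs table")
    return gaps


# Hypothetical observations of an admin login on both stands.
legacy_seen = Observed(200, {"ok": True}, [LogRow("auth", "info", "admin login")])
naive_seen = Observed(200, {"ok": True})  # same response, no log row

gaps = parity_gaps(legacy_seen, naive_seen)
```

The naive port answers identically and still fails parity: `gaps` contains only the logs-table entry, which is exactly the kind of loss a response-only comparison would miss.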
The most counterintuitive decision follows from the same logic. If mobile clients, an admin proxy, and internal jobs already depend on a bug in the business logic, we carry it over as is: quietly fixing it would break everyone who relied on the old shape. An internal stupidity such as dozens of extra database queries per player action, on the other hand, we do not reproduce, because no one sees it from the outside and rebuilding it cleanly is safe.
Every defect we find goes into a ledger that travels with the code, and each one gets its own decision: carry the behavior over or rebuild it from scratch. We actually fix a defect only where the change is cheap and safe, with the estimated risk at roughly ten to fifteen percent or below.
| Defect in legacy | Decision | Why |
|---|---|---|
| A service endpoint returns player data with no access check | Carry as is, fix on a separate plan | Closing access would change the contract for everyone who calls it: the admin proxy, internal jobs. The risk is above the threshold |
| Dozens of extra DB queries per player action for achievements | Rebuild from scratch | It’s an internal hack no one sees from the outside. A fan-out like that must not be reproduced in the new code |
| A request with an unknown content-type returns 200 instead of an error | Leave as is | The old mobile app depends on the current status code, so a fix would break live clients |
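
The ledger's decision rule can be sketched as a small data structure. This is one reading of the text, not the team's tooling: the field names, the `decide` function, and the risk numbers in the sample entries are illustrative assumptions; only the ten to fifteen percent band comes from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    CARRY = "carry as is"
    REBUILD = "rebuild from scratch"
    FIX = "fix now"


RISK_THRESHOLD = 0.15  # upper end of the "ten to fifteen percent" band


@dataclass(frozen=True)
class Defect:
    summary: str
    externally_visible: bool  # does anything outside the service observe it?
    fix_risk: float  # estimated chance that a fix breaks a caller, 0..1


def decide(defect: Defect) -> Decision:
    if not defect.externally_visible:
        # Internal only: no contract to hold, so rebuild cleanly.
        return Decision.REBUILD
    if defect.fix_risk <= RISK_THRESHOLD:
        return Decision.FIX
    # Callers rely on the current shape: keep it, fix on a separate plan.
    return Decision.CARRY


# Sample entries mirroring the table; the risk values are made up.
ledger = [
    Defect("player data endpoint has no access check", True, 0.6),
    Defect("dozens of extra DB queries per action", False, 0.0),
    Defect("unknown content-type returns 200", True, 0.5),
]
decisions = [decide(d) for d in ledger]
```

For these three entries the rule yields carry, rebuild, carry, matching the table.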
The main trap of agentic development on legacy surfaced early: when an agent reads the old PHP directly, it starts writing Python in PHP’s style and drags the foreign structure into the new code instead of clean layers. So we isolated access to the legacy.
The only way for the main agent to look into the old code is a separate archaeologist sub-agent. It works read-only, it’s handed the exact path to the repository (guessing and roaming the disk are forbidden), and it returns a short summary with references to specific lines and a mandatory account of side effects; the files themselves never reach the main context. That way the main agent’s context stays clean and the new code’s architecture doesn’t drift toward legacy.
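
The shape of what the archaeologist hands back can be made checkable. The sketch below is hypothetical throughout: the class names, the size budget, and the sample path are assumptions, and the real sub-agent's output format is not described in this text. It only shows how "line references and side effects are mandatory" can be enforced before a report enters the main context.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LineRef:
    path: str
    line: int


@dataclass(frozen=True)
class LegacyReport:
    summary: str
    refs: List[LineRef]
    side_effects: List[str]  # state "none found" explicitly rather than leave empty


MAX_SUMMARY_CHARS = 1200  # hypothetical budget for a "short summary"


def problems(report: LegacyReport) -> List[str]:
    """Return the reasons a report should be rejected; empty means accepted."""
    found = []
    if len(report.summary) > MAX_SUMMARY_CHARS:
        found.append("summary too long")
    if not report.refs:
        found.append("no line references")
    if not report.side_effects:
        found.append("side effects not accounted for")
    return found


# A hypothetical report; the path and line number are invented.
report = LegacyReport(
    summary="Admin login checks credentials and records the attempt.",
    refs=[LineRef("controllers/AdminAuth.php", 42)],
    side_effects=["writes one row to the logs table"],
)
```

A report that passes `problems` carries references and side effects but no file contents, which is the property the isolation relies on.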
Then come the rules the agent writes our own code by. Each service has its own layering instruction with a strict decision table and one honest exit: if a change doesn’t fit the layers, the agent must stop and ask rather than invent a new layer.
The same instructions damp the models’ usual habit of over-engineering — no abstract interfaces where there is only one implementation, no per-entity repositories “for the future.” Separate rules require failing immediately when a setting is missing rather than substituting a silent default that surfaces as a bug later.
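
The fail-fast rule for settings is simple enough to show directly. The helper name, the exception class, and the `DATABASE_URL` setting are assumptions for illustration; the technique is just refusing to start instead of inventing a default.

```python
import os
from typing import Mapping


class MissingSetting(RuntimeError):
    """Raised at startup when a required setting is absent."""


def require(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the setting's value, or fail immediately if it is missing or empty."""
    value = env.get(name)
    if value is None or value == "":
        # No silent default: a wrong guess here would surface later as a bug.
        raise MissingSetting(f"required setting {name} is not set")
    return value


# Usage with an explicit mapping instead of the real environment.
settings = {"DATABASE_URL": "postgresql://localhost/party"}  # hypothetical value
database_url = require("DATABASE_URL", settings)
```

Calling `require("DATABASE_URL", {})` raises `MissingSetting` at once, so a misconfigured service never gets as far as serving a request.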
The rollout is built so you can roll back at any moment by removing one line in a config. Two stands are brought up locally, the old and the new, identical at first. On the new one a container with the rewritten service is added, and nginx, the web server out front, decides where each request goes.
| Request hits | Where it goes now |
|---|---|
| Rewritten endpoints — right now, authorization | The new Python service |
| Everything else | The old PHP, as before |
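
In the real setup nginx makes this decision; the sketch below only models the rule in Python so the rollback property is easy to see. The upstream names and the `/api/auth/` prefix are hypothetical.

```python
from typing import Tuple

LEGACY = "legacy-php"
REWRITTEN = "python-service"

# One entry per migrated path prefix. Deleting an entry sends that
# traffic back to the legacy upstream.
MIGRATED_PREFIXES: Tuple[str, ...] = (
    "/api/auth/",
)


def upstream_for(path: str, migrated: Tuple[str, ...] = MIGRATED_PREFIXES) -> str:
    """Pick the upstream for a request path; legacy is the default."""
    for prefix in migrated:
        if path.startswith(prefix):
            return REWRITTEN
    return LEGACY


auth_target = upstream_for("/api/auth/login")
other_target = upstream_for("/api/games/list")
rolled_back = upstream_for("/api/auth/login", migrated=())
```

Legacy is the default and the new service is the exception, so an empty list of migrated prefixes restores the original behavior for every request.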
The migration goes one endpoint at a time, under the protection of tests, with no “big bang” where everything is switched at once. The new code runs on the same production database as the old system, so any change to its structure has to coexist with what legacy keeps writing into that same database.
Today, authorization already works through this routing: ten endpoints for players and admins, covered by tests first and then rewritten in Python. For now it runs in a working environment alongside the live system, brought up in the client’s Russian cloud so that player data does not leave its perimeter; the new code will roll out to production the same way, endpoint by endpoint.
The next large chunk is the game server, the websocket one: in legacy that’s about 32 thousand lines, a dozen state machines (essentially a separate engine for each game type, running a match by its rules), and more than eighty message types in both directions. It’s in progress now: the skeleton, the routing, and a check that Redis doesn’t lose data on restart (the very thing that left players without their tournament results that night) are already in Python; by the team’s estimate that’s about a fifth of the server.
The main goal of the project is to cover the whole system with a development harness: the scaffolding through which any task on it can be driven by an agent with minimal human involvement. The rewrite to Python is secondary. As much gets rewritten as we manage, while the harness has to cover everything, including the parts we won’t touch.
So the result of this first part is measured not in lines of rewritten code. Work that used to require a whole team is now carried by a couple of people with agents, and the legacy stays whole in production the whole time.
| Piece of the system | State |
|---|---|
| Legacy map, bug ledger, agent harness | Done |
| Authorization (10 endpoints) | Rewritten, running in a working environment |
| Game server | In progress, about a fifth |
| Coverage map — 1032 scenarios / 15 domains | Being automated, a small part so far |
| The rest of the clients, observability | Next on the plan |
The contract itself is structured to be honest about unfinished work: the timeline and the harness coverage of the system are fixed, while the depth of the rewrite stays variable, going as deep as we manage and starting with the heaviest pieces of legacy. The order is the same everywhere: a test first, then a line of new code.