Entertainment Development harness

PARTYstation: a development harness for a live legacy platform

We are building a development harness for PARTYstation, a live party-game platform written in PHP and Node.js. The harness wraps the working system in tests, maps, and rules so that a couple of people with agents can rewrite and maintain it without breaking anything in production. The goal is to cover the whole system with the harness; how much of it gets rewritten is secondary.

A live product you can’t stop

PARTYstation is a platform for party games. The game runs on a big screen (a TV at home, at a birthday party, at a corporate event), and players join from their own phones as controllers for quizzes, charades, associations, and meme rounds. There are around ten game types and up to a few dozen players in one room, and the platform runs on web-TV, mobile apps, Android TV, and Samsung. The same PARTYstation games are also embedded inside the streaming services Kinopoisk and Wink, where you launch them right from the TV. The platform earns through subscriptions and paid games, so every interrupted game hits revenue directly: a guest who couldn’t finish a game cancels the subscription.

Figure: a PARTYstation quiz runs on the big screen; players’ phones are the controllers.

By the time we joined, the platform had been in production for several years. Its backend had drifted across two languages: a REST API in PHP and game servers in Node.js that hold a websocket connection to every phone in the room. On top of that, years of accumulated legacy: code no one holds in their head anymore, integrations left “just in case,” and behavior that works but nobody remembers why. Technical debt piled up faster than new features shipped, and a large team could no longer keep it under control.

The cost of such legacy showed up in a New Year incident. Tournament results live in the database, but the game server read them only from a fast Redis cache and couldn’t fall back to the database when the cache was gone. On New Year’s night the shared Redis restarted mid-tournament, the results cache emptied, and players couldn’t see their standings until the cache refilled. The data was safe the whole time; the system simply couldn’t reach it.
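The failure mode in this incident is a read path with no fallback. A minimal sketch of the alternative, a cache-aside read that falls back to the database and refills the cache, might look like the following. The class name, the `load_from_db` callable, the sample data, and the dict standing in for Redis are all illustrative assumptions, not the platform's code.

```python
from typing import Callable, Optional

Standings = list[tuple[str, int]]  # (player, score) pairs


class StandingsReader:
    """Read tournament standings from a cache, falling back to the database."""

    def __init__(
        self,
        cache: dict[str, Standings],
        load_from_db: Callable[[str], Optional[Standings]],
    ) -> None:
        self._cache = cache                # stand-in for Redis
        self._load_from_db = load_from_db  # stand-in for a SQL query

    def get(self, tournament_id: str) -> Optional[Standings]:
        cached = self._cache.get(tournament_id)
        if cached is not None:
            return cached
        # Cache miss (for example after a Redis restart): go to the source of truth.
        standings = self._load_from_db(tournament_id)
        if standings is not None:
            self._cache[tournament_id] = standings  # refill for later reads
        return standings


# Usage: the cache is empty, as after a restart, but the data is still in the database.
database = {"t1": [("anna", 120), ("boris", 95)]}
reader = StandingsReader(cache={}, load_from_db=database.get)
standings = reader.get("t1")
```

With this shape, an emptied cache costs one slower read per tournament instead of hiding the standings until the cache refills.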

The deploy, meanwhile, is run by hand: a shell script enters each repository, runs git reset --hard, pulls master, and restarts the processes. There is no application-level observability; incidents are dug through over SSH and by reading logs by hand.

The company chose not to bring the large team back. Instead it is changing the development model itself: keep a small group of engineers and run the product through coding agents. The bet is simple — a couple of people with agents should hold and develop the system that a whole team used to carry.

Our job is to build what makes that model possible. Internally we call it a development harness: the scaffolding around the code that lets an agent understand someone else’s system, avoid harming it, and carry a task to completion with minimal human involvement.

The product itself becomes the spec

We designated the system’s current behavior as the spec. A product company will not read and sign off technical documentation for a refactor: they have no time to validate what looks to them like an internal reshuffle. The reference is the code on their production branch and how it responds right now.

The rewritten version has to pass the same regression suite as the original, and that is the acceptance criterion. Instead of a document that would be stale by the next release, there is a check you can run at any moment.

The whole process in four steps

That decision is where the whole method comes from. It fits into four steps, each building on the last, and the rest of this case takes them one at a time.

01 Map the legacy (understand it)
An architecture map across 11 repositories. The agent reads the foreign code only through an isolated archaeologist sub-agent. Every bug found goes into a ledger.

02 Pin behavior down with tests (the reference)
One test suite on two stands, old and new. A big screen and players' phones in one room for the multiplayer flow. A divergence between the stands is the error.

03 Rewrite at parity (Python)
Behavior identical, down to side effects. We mirror legacy bugs on purpose and fix only the cheap, safe ones.

04 Roll out one endpoint at a time (no big bang)
nginx sends the rewritten addresses to the new service, the rest to the old PHP. Rollback is one line of config.

A test first, then a line of new code

If the spec is behavior, you have to pin it down before you touch the code. We run the same test suite against two stands: the old one on PHP and Node and the new one on Python. The test doesn’t know which system it’s running on, and branches like “if this is the new stand, check differently” are forbidden: divergence in behavior between the stands is exactly the error signal.

First the same suite runs against their live, not-yet-rewritten system. A green run there confirms the test describes real behavior, and only then does it become the reference for the new code.
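As a rough illustration of one suite on two stands, the sketch below writes a scenario once and runs it unchanged against each stand. The endpoint path, the payload, the expected response, and the two in-process stand functions are invented for the example; in a real setup a stand would be an HTTP client bound to a base URL.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Response:
    status: int
    body: dict


# A "stand" is anything that answers a request. Plain callables keep the
# sketch self-contained.
Stand = Callable[[str, dict], Response]


def check_login_rejects_bad_password(stand: Stand) -> None:
    """One scenario, written once. It never asks which stand it runs on."""
    response = stand("/auth/login", {"login": "player", "password": "wrong"})
    assert response.status == 401
    assert response.body == {"error": "invalid_credentials"}


def run_on_all(scenario: Callable[[Stand], None], stands: dict[str, Stand]) -> dict[str, bool]:
    """Run the same scenario on every stand and report pass or fail per stand."""
    results = {}
    for name, stand in stands.items():
        try:
            scenario(stand)
            results[name] = True
        except AssertionError:
            results[name] = False
    return results


def legacy_stand(path: str, payload: dict) -> Response:
    return Response(401, {"error": "invalid_credentials"})


def new_stand(path: str, payload: dict) -> Response:
    return Response(401, {"error": "invalid_credentials"})


results = run_on_all(
    check_login_rejects_bad_password,
    {"legacy": legacy_stand, "new": new_stand},
)
```

A stand that fails while the other passes is the divergence signal; the scenario itself contains no stand-specific branch to hide it.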

We test with a ladder of levels, from small to large.

The hardest level is a real-time game for several people. To check one scenario, the test spins up several isolated browsers at once: the big screen where the game runs, and players’ phones — the host plays from a phone too. Each has its own cookies, and they’re all brought together into one room through the shared pool of game servers over websockets.

Then the harness plays the real game protocol — pick a game, answer a question, move to the next round, pause — and checks the room state that is broadcast to all participants. There are no mock stand-ins for real parts of the system anywhere here. The only thing we’re allowed to change in the product itself is adding markup identifiers to interface elements so the tests can latch onto them.
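The isolation described above maps naturally onto browser contexts. The sketch below assumes a Playwright-style browser object, where `new_context()` gives each participant its own cookies and storage; the `/tv` and `/join` paths and the role names are hypothetical, since the case does not give the real entry points.

```python
from typing import Any


def open_room(browser: Any, base_url: str, phone_count: int) -> dict[str, Any]:
    """Open one big screen and several phones, each in its own browser context.

    `browser` is expected to behave like a Playwright Browser: `new_context()`
    returns an isolated context, and `context.new_page()` returns a page
    with a `goto(url)` method.
    """
    pages: dict[str, Any] = {}
    roles = ["screen"] + [f"phone-{i + 1}" for i in range(phone_count)]
    for role in roles:
        context = browser.new_context()  # nothing is shared between participants
        page = context.new_page()
        path = "/tv" if role == "screen" else "/join"  # hypothetical URLs
        page.goto(f"{base_url}{path}")
        pages[role] = page
    return pages


# With Playwright's sync API this could be driven roughly as:
#
#   from playwright.sync_api import sync_playwright
#   with sync_playwright() as p:
#       browser = p.chromium.launch()
#       pages = open_room(browser, "http://localhost:8080", phone_count=3)
```

Each page can then play its part of the protocol, and the test compares the room state that every participant ends up seeing.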

Parity on the outside, clean architecture inside

There are two things to keep apart here: what the system does on the outside and how it’s built on the inside. The outside is the endpoint’s response, its status, and the side effects other parts of the system depend on. That’s the contract, and we hold it one to one. The inside we rebuild cleanly, without dragging the legacy’s architectural mistakes into the new code.

So parity is not only a matching response from an endpoint (a single API address). The old code has side effects that something already relies on: writes to the database, rows in the audit log, calls to neighboring services, cache invalidation. A faithful port reproduces those too. The admin login, for example, writes a row into the logs table with a specific category, level, and message text. We write such details out for every endpoint under a “side effects” heading; otherwise a naive rewrite silently loses them.
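One way to make a side effect checkable is to compare the rows an endpoint writes, not just its response. The sketch below uses an in-memory SQLite database as a stand-in for the production one; the `logs` table name comes from the text, while the column layout, the concrete values, and the `admin_login` stub are assumptions made for the example.

```python
import sqlite3

# Expected side effect of an admin login, as it might be written under the
# endpoint's "side effects" heading. The values are invented for the sketch.
EXPECTED_LOG_ROW = ("auth", "info", "admin login")


def last_log_id(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COALESCE(MAX(id), 0) FROM logs").fetchone()[0]


def new_log_rows(conn: sqlite3.Connection, after_id: int) -> list[tuple]:
    """Return (category, level, message) for rows written after `after_id`."""
    cursor = conn.execute(
        "SELECT category, level, message FROM logs WHERE id > ? ORDER BY id",
        (after_id,),
    )
    return cursor.fetchall()


def admin_login(conn: sqlite3.Connection) -> int:
    """Stand-in for the endpoint under test: writes the log row, returns a status."""
    conn.execute(
        "INSERT INTO logs (category, level, message) VALUES (?, ?, ?)",
        EXPECTED_LOG_ROW,
    )
    return 200


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logs (id INTEGER PRIMARY KEY, category TEXT, level TEXT, message TEXT)"
)

before = last_log_id(conn)
status = admin_login(conn)
written = new_log_rows(conn, before)

# Parity means both the response and the side effect match.
assert status == 200
assert written == [EXPECTED_LOG_ROW]
```

Run against both stands, the same check fails on a port that returns the right response but forgets the log row.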

The most counterintuitive decision grows from the same logic. A bug in the business logic that mobile clients, an admin proxy, and internal jobs already depend on, we carry over as is: quietly fixing it means breaking those who relied on the old shape. But an internal stupidity like dozens of extra database queries per player action we do not reproduce, because no one sees it from the outside and rebuilding it cleanly is safe.

Every defect we find goes into a ledger that travels with the code, and each one gets its own decision: carry the behavior over or rebuild it from scratch. We fix a defect outright only where the change is cheap and safe, with an estimated risk of about ten to fifteen percent or lower.

| Defect in legacy | Decision | Why |
| --- | --- | --- |
| A service endpoint returns player data with no access check | Carry as is, fix on a separate plan | Closing access would change the contract for everyone who calls it: the admin proxy, internal jobs. The risk is above the threshold |
| Dozens of extra DB queries per player action for achievements | Rebuild from scratch | It’s an internal hack no one sees from the outside. A fan-out like that must not be reproduced in the new code |
| A request with an unknown content-type returns 200 instead of an error | Leave as is | The old mobile app depends on the current status code, so a fix would break live clients |
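The decisions in the table above follow a rule that can be written down. The sketch below is one possible reading of it, not the team's actual ledger format: the field names, the numeric risk estimates, and the use of 0.15 as the cut-off (the upper end of the "ten to fifteen percent" range) are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    CARRY_AS_IS = "carry as is"
    REBUILD = "rebuild from scratch"
    FIX = "fix"


RISK_THRESHOLD = 0.15  # assumed cut-off, taken from the range given in the text


@dataclass(frozen=True)
class Defect:
    title: str
    visible_from_outside: bool  # do clients or jobs depend on this behavior?
    fix_risk: float             # estimated chance that a fix breaks someone, 0..1


def decide(defect: Defect) -> Decision:
    if not defect.visible_from_outside:
        return Decision.REBUILD      # internal only: safe to redo cleanly
    if defect.fix_risk <= RISK_THRESHOLD:
        return Decision.FIX          # cheap and safe enough to correct
    return Decision.CARRY_AS_IS      # others rely on it: mirror the behavior


ledger = [
    Defect("player data endpoint has no access check", True, 0.6),
    Defect("dozens of extra DB queries per player action", False, 0.0),
]
decisions = {defect.title: decide(defect) for defect in ledger}
```

Keeping the rule explicit means a reviewer can argue with the risk estimate for one defect without reopening the policy for all of them.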

Teaching an agent to work in someone else’s code

The main trap of agentic development on legacy surfaced early: when an agent reads the old PHP directly, it starts writing Python in PHP’s style and drags the foreign structure into the new code instead of clean layers. So we isolated access to the legacy.

The only way for the main agent to look into the old code is a separate archaeologist sub-agent. It works read-only, it’s handed the exact path to the repository (guessing and roaming the disk are forbidden), and it returns a short summary with references to specific lines and a mandatory account of side effects; the files themselves never reach the main context. That way the main agent’s context stays clean and the new code’s architecture doesn’t drift toward legacy.
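The constraints on the archaeologist's output can be expressed as a small schema that refuses incomplete reports. This is a hypothetical shape, not the project's real format: the class and field names are invented, and treating "exact path" as "absolute path" is an assumption of the sketch.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class LineRef:
    file: str
    start: int
    end: int


@dataclass(frozen=True)
class ArchaeologyReport:
    repo_path: str
    summary: str
    references: tuple[LineRef, ...]
    side_effects: tuple[str, ...]  # explicit, even when the answer is "none found"

    def __post_init__(self) -> None:
        if not PurePosixPath(self.repo_path).is_absolute():
            raise ValueError("repo_path must be an exact absolute path, not a guess")
        if not self.references:
            raise ValueError("a summary needs at least one line reference")
        if not self.side_effects:
            raise ValueError("the side-effects account is mandatory")


report = ArchaeologyReport(
    repo_path="/srv/legacy/api",  # hypothetical path
    summary="Admin login checks the password hash and opens a session.",
    references=(LineRef("controllers/Auth.php", 40, 72),),  # hypothetical file
    side_effects=("writes one row to the logs table",),
)
```

Only an object like this reaches the main agent, which keeps the legacy source itself out of its context.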

Then come the rules the agent writes our own code by. Each service has its own layering instruction with a strict decision table and one honest exit: if a change doesn’t fit the layers, the agent must stop and ask rather than invent a new layer.

The same instructions damp the models’ usual habit of over-engineering — no abstract interfaces where there is only one implementation, no per-entity repositories “for the future.” Separate rules require failing immediately when a setting is missing rather than substituting a silent default that surfaces as a bug later.
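The "fail immediately when a setting is missing" rule is small enough to show in full. The helper and the setting name below are illustrative; the point is only that there is no default to fall back on.

```python
import os
from typing import Mapping


class MissingSettingError(RuntimeError):
    """Raised at startup when a required setting is absent."""


def require_setting(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the setting or stop the process from starting half-configured."""
    value = env.get(name)
    if value is None or value == "":
        raise MissingSettingError(f"required setting {name} is not set")
    return value


# Usage at service startup (DATABASE_URL is a hypothetical setting name):
#
#   database_url = require_setting("DATABASE_URL")
```

The contrast is with `os.environ.get("DATABASE_URL", "localhost")`, which starts fine and fails later, far from the cause.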

One endpoint at a time

The rollout is built so you can roll back at any moment by removing one line in a config. Two stands are brought up locally, the old and the new, identical at first. On the new one a container with the rewritten service is added, and nginx, the web server out front, decides where each request goes.

| Request hits | Where it goes now |
| --- | --- |
| Rewritten endpoints (right now, authorization) | The new Python service |
| Everything else | The old PHP, as before |
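To make the one-line rollback concrete, the sketch below renders nginx `location` blocks from a list of rewritten prefixes. `location` and `proxy_pass` are standard nginx directives, but the path prefix and upstream addresses are hypothetical, and the sketch assumes both services are reachable as HTTP upstreams, which the case does not state.

```python
REWRITTEN_PREFIXES = [
    "/api/auth/",  # hypothetical path; one line per migrated endpoint group
]


def render_locations(rewritten: list[str], new_upstream: str, legacy_upstream: str) -> str:
    """Render nginx location blocks: rewritten prefixes plus a legacy catch-all."""
    blocks = [
        f"location {prefix} {{\n    proxy_pass {new_upstream};\n}}"
        for prefix in rewritten
    ]
    # Everything that matches no rewritten prefix falls through to the old system.
    blocks.append(f"location / {{\n    proxy_pass {legacy_upstream};\n}}")
    return "\n\n".join(blocks)


config = render_locations(
    REWRITTEN_PREFIXES,
    new_upstream="http://new-python:8000",
    legacy_upstream="http://legacy-php:8080",
)
print(config)
```

Removing a prefix from the list removes its block, and requests for that address fall back to the catch-all that points at the legacy system.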

The migration goes one endpoint at a time, under the protection of tests, with no “big bang” where everything is switched at once. The new code runs on the same production database as the old system, so any change to its structure has to coexist with what legacy keeps writing into that same database.
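A schema change that has to coexist with legacy writes is usually additive and safe to run twice. The sketch below shows that idea on an in-memory SQLite database so it runs anywhere; the table and column names are invented, and on PostgreSQL the same effect is commonly achieved with `ALTER TABLE ... ADD COLUMN IF NOT EXISTS`.

```python
import sqlite3


def column_exists(conn: sqlite3.Connection, table: str, column: str) -> bool:
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return any(row[1] == column for row in rows)  # row[1] is the column name


def add_nullable_column(conn: sqlite3.Connection, table: str, column: str, sql_type: str) -> bool:
    """Add a column only if it is missing. Returns True when a change was made.

    Identifiers are interpolated into SQL, so pass trusted constants only.
    """
    if column_exists(conn, table, column):
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {sql_type}")
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT)")

first_run = add_nullable_column(conn, "players", "locale", "TEXT")
second_run = add_nullable_column(conn, "players", "locale", "TEXT")  # no-op

# Legacy code that knows nothing about the new column keeps writing as before.
conn.execute("INSERT INTO players (name) VALUES ('anna')")
row = conn.execute("SELECT name, locale FROM players").fetchone()
```

Because the new column is nullable and has no constraint, the old insert statement still succeeds and simply leaves it empty.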

Today, authorization already works through this routing: ten endpoints for players and admins, covered by tests before the rewrite and then rewritten in Python. For now this runs in a working environment beside the live system, brought up in the client’s Russian cloud so that player data doesn’t leave its perimeter. The new code rolls out to production itself the same way, one endpoint at a time.

The next large chunk is the game server, the websocket one: in legacy that’s about 32 thousand lines, a dozen state machines (essentially a separate engine for each game type, running a match by its rules), and more than eighty message types in both directions. It’s in progress now: the skeleton, the routing, and a check that Redis doesn’t lose data on restart (the very thing that left players without their tournament results that night) are already in Python; by the team’s estimate that’s about a fifth of the server.
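For readers unfamiliar with the pattern, a per-game state machine can be reduced to a transition table keyed by state and message type. The sketch below is a toy with four states and five messages; the state and message names are illustrative and are not taken from the legacy server, which has far more of both.

```python
from enum import Enum


class State(Enum):
    LOBBY = "lobby"
    QUESTION = "question"
    ROUND_RESULTS = "round_results"
    PAUSED = "paused"


# (current state, message type) -> next state. Message names are illustrative.
TRANSITIONS: dict[tuple[State, str], State] = {
    (State.LOBBY, "pick_game"): State.QUESTION,
    (State.QUESTION, "all_answered"): State.ROUND_RESULTS,
    (State.ROUND_RESULTS, "next_round"): State.QUESTION,
    (State.QUESTION, "pause"): State.PAUSED,
    (State.PAUSED, "resume"): State.QUESTION,
}


def step(state: State, message: str) -> State:
    """Apply one message; reject anything the table does not allow."""
    try:
        return TRANSITIONS[(state, message)]
    except KeyError:
        raise ValueError(f"{message!r} is not allowed in state {state.value}") from None


def replay(messages: list[str], start: State = State.LOBBY) -> State:
    """Drive the machine through a sequence of messages and return the final state."""
    state = start
    for message in messages:
        state = step(state, message)
    return state


final = replay(["pick_game", "all_answered", "next_round", "pause", "resume"])
```

A table like this is also convenient for parity work: the same message sequence can be replayed against the old and the new server and the resulting room states compared.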

Under the hood — how the harness is built
THE LIVE LEGACY
11 repositories · REST API in PHP · game servers in Node.js · web-TV, mobile, Android TV, Samsung · one Postgres, Redis, deploy by shell scripts

Two parallel tracks start from the legacy.

MAP: UNDERSTAND WHAT EXISTS
Build a shared picture of someone else's system before the first line of new code.

Architecture map of the legacy (C4 notation)
Across 11 repositories: context, containers, components, plus a read-only inspection of production.

Archaeologist sub-agent (read-only)
The path to the repository is handed in; guessing is forbidden. It returns a summary with references to lines and an account of side effects.

Legacy-bug ledger (travels with the code)
A status and a decision for each bug: carry as is, deviate, or fix. Fix only if the risk of the change is below the threshold.

SPEC IN TESTS: PIN DOWN BEHAVIOR
Behavior, not a document, becomes the acceptance criterion.

One suite on two stands (old and new)
The test doesn't know where it runs; a divergence in behavior is the error. First comes a run against the live system, as a check that the test is right.

Multiplayer in browsers (websocket)
A host on web-TV plus phone-controllers in separate contexts in one room. No mocks; in the product we change only the markup identifiers.

Coverage map (1032 scenarios / 15 domains)
The single source of truth for scenarios and the work plan.

Both tracks feed one reference for behavior: the map of the system, the bug ledger, and the tests on two stands.

PORT TO PYTHON (parity down to side effects)
A uv monorepo · FastAPI + Pydantic v2 · async SQLAlchemy.
Schema changes are made safe to re-run so they coexist with what legacy keeps writing into the same database.
Contracts are the single source of message shape: snake_case in code, camelCase on the wire; frontend types are derived from them.
Layers per service plus rules for the agent: fail when a setting is missing instead of silent defaults, no over-engineering "for the future".

ROLLOUT (one endpoint at a time)
nginx rules route the rewritten addresses to the new container, everything else to the old PHP.
Rollback: remove the route line in the config, no "big bang".

Status: authorization runs in a working environment beside the live system · game server: skeleton and routing in Python (~a fifth) · 1032 scenarios, a small part automated

Stack: Python · FastAPI · Pydantic v2 · SQLAlchemy async · uv monorepo · archaeologist sub-agent · bug ledger · coverage map

The agent scaffolding stays with the developers and is not part of the client's product.
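The "snake_case in code, camelCase on the wire" convention from the contracts item can be illustrated with the standard library alone. The message type and its fields below are invented; in a Pydantic v2 codebase the same mapping would more likely be configured with an alias generator on the model than written by hand.

```python
from dataclasses import asdict, dataclass


def to_camel(name: str) -> str:
    """Convert snake_case to camelCase: current_round -> currentRound."""
    head, *tail = name.split("_")
    return head + "".join(part.capitalize() for part in tail)


@dataclass(frozen=True)
class RoomState:  # hypothetical message; field names are illustrative
    room_id: str
    current_round: int
    is_paused: bool


def to_wire(message: RoomState) -> dict:
    """Serialize a message with camelCase keys for clients."""
    return {to_camel(key): value for key, value in asdict(message).items()}


wire = to_wire(RoomState(room_id="r1", current_round=2, is_paused=False))
# wire == {"roomId": "r1", "currentRound": 2, "isPaused": False}
```

Defining the shape once and deriving the wire format from it is what lets frontend types be generated from the same contract.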

What changes and where we are now

The main goal of the project is to cover the whole system with a development harness: the scaffolding through which any task on it can be driven by an agent with minimal human involvement. The rewrite to Python is secondary. As much gets rewritten as we manage, while the harness has to cover everything, including the parts we won’t touch.

So the result of this first part is measured not in lines of rewritten code. Work that used to require a whole team is now carried by a couple of people with agents, and the legacy stays whole in production the whole time.

| Piece of the system | State |
| --- | --- |
| Legacy map, bug ledger, agent harness | Done |
| Authorization: 10 endpoints | Rewritten, running in a working environment beside the live system |
| Game server | In progress, about a fifth |
| Coverage map: 1032 scenarios / 15 domains | Being automated, a small part so far |
| The rest of the clients, observability | Next on the plan |

The deal itself is structured honestly for unfinished work: the timeline and the harness coverage of the system are fixed, while the depth of the rewrite stays variable — as deep as we manage, starting with the heaviest pieces of legacy. The order is the same everywhere: a test first, then a line of new code.

What we learned in the pilot

Legacy as an executable spec instead of documentation to sign off
One test suite passes on both the old and the new code
Parity down to side effects: DB writes, audit logs, cache invalidations
We mirror legacy bugs on purpose — a ledger with a risk threshold for fixing
An architecture map of 11 repositories before the first line of new code
The agent touches the legacy only through an isolated archaeologist sub-agent
Authorization is already rewritten and serving through nginx routing
