PARTYstation: a development harness for a live legacy platform

A live product you can’t stop

PARTYstation is a platform for party games. The game runs on a big screen (a TV at home, at a birthday, at a corporate event), and players join from their own phones as controllers: quizzes, charades, associations, meme rounds. There are around ten game types and up to a few dozen players in one room, and the clients span web-TV, mobile apps, Android TV, and Samsung. The same PARTYstation games are also embedded in the streaming services Kinopoisk and Wink, where you launch them right from the TV. The platform earns through subscriptions and paid games, so every interrupted game hits revenue directly: a guest who didn’t get to finish playing cancels the subscription.

Figure: PARTYstation, a quiz on the big screen with players' phones as controllers.

By the time we joined, the platform had been in production for several years. Its backend had drifted across two languages: a REST API in PHP and game servers in Node.js that hold a websocket connection to every phone in the room. On top of that, years of accumulated legacy: code no one holds in their head anymore, integrations left “just in case,” and behavior that works but nobody remembers why. Technical debt piled up faster than new features shipped, and a large team could no longer keep it under control.

What such legacy costs became clear in a New Year incident. Tournament results live in the database, but the game server read them only from a fast Redis cache and couldn’t fall back to the database when the cache was gone. On New Year’s night the shared Redis restarted mid-tournament, the cache of results emptied, and players could not see their standings until the cache refilled. The data was safe the whole time; the system simply couldn’t reach it.
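To make "fall back to the database" concrete, here is a minimal sketch of a cache-first read that survives an empty cache. It is an illustration, not the platform's code: the `ResultsReader` class, the key format, and the in-memory stand-ins for Redis and the database are all assumptions, and the sample standings are made up.

```python
from typing import Callable, Optional


class ResultsReader:
    """Reads tournament standings from a cache, falling back to the database."""

    def __init__(
        self,
        cache_get: Callable[[str], Optional[list]],
        cache_set: Callable[[str, list], None],
        db_load: Callable[[str], list],
    ) -> None:
        self._cache_get = cache_get
        self._cache_set = cache_set
        self._db_load = db_load

    def standings(self, tournament_id: str) -> list:
        key = f"tournament:{tournament_id}:standings"  # hypothetical key format
        cached = self._cache_get(key)
        if cached is not None:
            return cached
        # Cache miss (for example, right after a cache restart):
        # read the durable copy and refill the cache.
        rows = self._db_load(tournament_id)
        self._cache_set(key, rows)
        return rows


# In-memory stand-ins for Redis and the database (illustration only).
cache: dict = {}
database = {"demo": [("alice", 120), ("bob", 90)]}
reader = ResultsReader(cache.get, cache.__setitem__, lambda tid: database[tid])

first = reader.standings("demo")   # cache is empty, served from the database
cache.clear()                      # simulate the cache restarting
second = reader.standings("demo")  # still served, and the cache is refilled
```

In the incident described above, the equivalent of the `db_load` branch did not exist, so an empty cache meant no standings at all.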

The deploy, meanwhile, is run by hand: a shell script enters each repository, runs git reset --hard, pulls master, and restarts the processes. There is no application-level observability; incidents are investigated over SSH, by reading logs by hand.

The company chose not to bring the large team back. Instead it is changing the development model itself: keep a small group of engineers and run the product through coding agents. The bet is simple — a couple of people with agents should hold and develop the system that a whole team used to carry.

Our job is to build what makes that model possible. Internally we call it a development harness: the scaffolding around the code that lets an agent understand someone else’s system, avoid harming it, and carry a task to completion with minimal human involvement.

The product itself becomes the spec

We designated the system’s current behavior as the spec. A product company will not read and sign off on technical documentation for a refactor: they have no time to validate what looks to them like an internal reshuffle. The reference is the code on their production branch and how it responds right now.

The rewritten version has to pass the same regression as the original, and that is the acceptance criterion. A document that would be stale by the next release was replaced with a check you can run at any moment.

The whole process in four steps

That decision is where the whole method comes from. It fits into four steps, each building on the last, and the rest of this case takes them one at a time.

1. Map the legacy (understand it). An architecture map across 11 repositories. The agent reads the foreign code only through an isolated archaeologist sub-agent. Every bug found goes into a ledger.

2. Pin behavior down with tests (the reference). One test suite on two stands (test environments), old and new. A big screen and players' phones in one room for the multiplayer flow. A divergence between the stands is the error.

3. Rewrite at parity (Python). Behavior identical, down to side effects. We mirror legacy bugs on purpose and fix only the cheap, safe ones.

4. Roll out one endpoint at a time (no big bang). nginx sends the rewritten addresses to the new service, the rest to the old PHP. Rollback is one line of config.

A test first, then a line of new code

If the spec is behavior, you have to pin it down before you touch the code. We run the same test suite against two stands: the old one on PHP and Node and the new one on Python. The test doesn’t know which system it’s running on, and branches like “if this is the new stand, check differently” are forbidden: divergence in behavior between the stands is exactly the error signal.

First the same suite runs against their live, not-yet-rewritten system. A green run there confirms the test describes real behavior, and only then does it become the reference for the new code.
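The two-stand idea can be sketched in a few lines. This is a simplified model under stated assumptions: a "stand" is reduced to a callable, and the `/auth/login` path, the 400 status, and the fake stands are hypothetical; a real suite would wrap an HTTP client instead.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Response:
    status: int
    body: dict


# A stand is any callable taking (method, path, payload) and returning a Response.
Stand = Callable[[str, str, dict], Response]


def check_login_rejects_empty_password(stand: Stand) -> Response:
    """One scenario, written once, with no knowledge of which stand it hits."""
    response = stand("POST", "/auth/login", {"login": "demo", "password": ""})
    assert response.status == 400
    return response


def run_on_both(scenario: Callable[[Stand], Response], old: Stand, new: Stand) -> None:
    # The legacy stand goes first: a pass there confirms that the test
    # describes real behavior before it is used to judge the new code.
    reference = scenario(old)
    candidate = scenario(new)
    if reference != candidate:
        raise AssertionError(f"stands diverge: old={reference} new={candidate}")


# Fake stands for illustration only.
def fake_old(method: str, path: str, payload: dict) -> Response:
    return Response(400, {"error": "password_required"})


def fake_new_same(method: str, path: str, payload: dict) -> Response:
    return Response(400, {"error": "password_required"})


def fake_new_different(method: str, path: str, payload: dict) -> Response:
    return Response(400, {"error": "Password required"})
```

Note that `fake_new_different` returns the right status and a sensible message, and still fails: the scenario has no branch for "the new stand", so any difference from the reference counts as an error.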

We test with a ladder of levels, from small to large:

Unit tests check business logic in isolation: everything external, from the database to neighboring services, is replaced with fakes. The developer writes them next to the code.
Integration tests call a real endpoint on a running server that genuinely hits the database and the neighboring services. Only the network to external login providers like VK and Yandex is mocked.
End-to-end tests via Playwright spin up a real browser and walk a scenario the way a user would: open, click, type, wait for the right screen.

The hardest level is a real-time game for several people. To check one scenario, the test spins up several isolated browsers at once: the big screen where the game runs, and players’ phones — the host plays from a phone too. Each has its own cookies, and they’re all brought together into one room through the shared pool of game servers over websockets.

Then the harness plays the real game protocol — pick a game, answer a question, move to the next round, pause — and checks the room state that is broadcast to all participants. There are no mock stand-ins for real parts of the system anywhere here. The only thing we’re allowed to change in the product itself is adding markup identifiers to interface elements so the tests can latch onto them.
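A rough sketch of how several participants can share one test with Playwright's Python API, where each browser context keeps its own cookies. The URLs, the room code handling, and the test ids (`pick-quiz`, `room-phase`) are hypothetical, and the sketch assumes the room already exists; the real harness plays a much longer protocol.

```python
from typing import List


def participant_names(phones: int) -> List[str]:
    """The big screen plus one entry per phone; the first phone is the host."""
    return ["screen"] + [f"phone-{i}" for i in range(1, phones + 1)]


def play_one_round(base_url: str, room_code: str, phones: int = 3) -> None:
    # Imported lazily so this module can be loaded without Playwright installed.
    from playwright.sync_api import expect, sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        pages = {}
        for name in participant_names(phones):
            # Each browser context has its own cookies and storage,
            # so every participant looks like a separate device.
            context = browser.new_context()
            pages[name] = context.new_page()

        # Hypothetical routes for the big screen and for joining from a phone.
        pages["screen"].goto(f"{base_url}/tv/{room_code}")
        for name, page in pages.items():
            if name != "screen":
                page.goto(f"{base_url}/join/{room_code}")

        # The host picks a game from a phone; every participant
        # should then see the same room state.
        pages["phone-1"].get_by_test_id("pick-quiz").click()
        for page in pages.values():
            expect(page.get_by_test_id("room-phase")).to_have_text("question")

        browser.close()
```

`get_by_test_id` looks up the `data-testid` attribute by default, which is the kind of markup identifier the tests are allowed to add to the product.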

Parity on the outside, clean architecture inside

There are two things to keep apart here: what the system does on the outside and how it’s built on the inside. The outside is the endpoint’s response, its status, and the side effects that other parts of the system depend on. That’s the contract, and we hold it one to one. How it’s done on the inside we rebuild cleanly, and we don’t drag architectural stupidity into the new code.

So parity is more than a matching response from an endpoint (a single API address). The old code has side effects that something already relies on: writes to the database, rows in the audit log, calls to neighboring services, cache invalidation. A faithful port reproduces those too. The admin login, for example, writes a row into the logs table with a specific category, level, and message text. We write such details out for every endpoint under a “side effects” heading; otherwise any naively rewritten code silently loses them.
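A parity check that covers a side effect might look like the sketch below. Everything specific is an assumption made for illustration: the `logs` schema, the `admin_auth` category, the message text, and the toy `admin_login` function; SQLite stands in for the real database.

```python
import sqlite3


def admin_login(conn: sqlite3.Connection, login: str) -> dict:
    """Toy port of an admin login: returns a response and writes the audit row."""
    conn.execute(
        "INSERT INTO logs (category, level, message) VALUES (?, ?, ?)",
        ("admin_auth", "info", f"admin {login} logged in"),
    )
    conn.commit()
    return {"status": 200, "body": {"ok": True}}


def check_admin_login_parity(conn: sqlite3.Connection) -> None:
    before = conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
    response = admin_login(conn, "root")

    # Part 1 of the contract: the visible response.
    assert response["status"] == 200

    # Part 2: the side effect that something else already relies on.
    rows = conn.execute(
        "SELECT category, level, message FROM logs ORDER BY rowid"
    ).fetchall()
    assert len(rows) == before + 1
    assert rows[-1] == ("admin_auth", "info", "admin root logged in")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (category TEXT, level TEXT, message TEXT)")
check_admin_login_parity(conn)
```

A port that returned the same response but skipped the insert would pass the first assertion and fail the second, which is exactly the silent loss the "side effects" heading is meant to prevent.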

The most counterintuitive decision grows from the same logic. A bug in the business logic that mobile clients, an admin proxy, and internal jobs already depend on, we carry over as is: quietly fixing it means breaking those who relied on the old shape. But an internal stupidity like dozens of extra database queries per player action we do not reproduce, because no one sees it from the outside and rebuilding it cleanly is safe.

Every defect we find goes into a ledger that travels with the code, and each one gets its own decision: carry the behavior over or rebuild it from scratch. We actually fix a defect only where the change is cheap and safe, with a risk on the order of ten to fifteen percent or below.

Defect in legacy	Decision	Why
A service endpoint returns player data with no access check	Carry as is, fix on a separate plan	Closing access would change the contract for everyone who calls it: the admin proxy, internal jobs. The risk is above the threshold
Dozens of extra DB queries per player action for achievements	Rebuild from scratch	It’s an internal hack no one sees from the outside. A fan-out like that must not be reproduced in the new code
A request with an unknown content-type returns 200 instead of an error	Leave as is	The old mobile app depends on the current status code, so a fix would break live clients
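The decision rule behind the table can be written down as a small sketch. The field names, the 0.15 threshold (one reading of "ten to fifteen percent or below"), and the risk numbers in the usage are assumptions; the real ledger records a human decision for each defect rather than computing one.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    CARRY_AS_IS = "carry as is"
    REBUILD = "rebuild from scratch"
    FIX = "fix"


# Assumed reading of "ten to fifteen percent or below".
RISK_THRESHOLD = 0.15


@dataclass
class LedgerEntry:
    defect: str
    externally_visible: bool  # do clients or jobs depend on this behavior?
    fix_risk: float           # estimated chance that a fix breaks a caller


def decide(entry: LedgerEntry) -> Decision:
    if not entry.externally_visible:
        # Internal structure is not part of the contract.
        return Decision.REBUILD
    if entry.fix_risk <= RISK_THRESHOLD:
        return Decision.FIX
    return Decision.CARRY_AS_IS


# Risk values below are illustrative guesses, not measured figures.
no_access_check = LedgerEntry("endpoint returns player data with no access check", True, 0.5)
extra_queries = LedgerEntry("dozens of extra DB queries per player action", False, 0.0)
cheap_fix = LedgerEntry("hypothetical low-risk defect", True, 0.05)
```

With these inputs, `decide` carries the access-check defect over, rebuilds the query fan-out, and fixes only the low-risk entry.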

Teaching an agent to work in someone else’s code

The main trap of agentic development on legacy surfaced early: when an agent reads the old PHP directly, it starts writing Python in PHP’s style and drags the foreign structure into the new code instead of clean layers. So we isolated access to the legacy.

The only way for the main agent to look into the old code is a separate archaeologist sub-agent. It works read-only, it’s handed the exact path to the repository (guessing and roaming the disk are forbidden), and it returns a short summary with references to specific lines and a mandatory account of side effects; the files themselves never reach the main context. That way the main agent’s context stays clean and the new code’s architecture doesn’t drift toward legacy.
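One way to picture what the archaeologist hands back is a small report structure with a validation step. This is a hypothetical shape, not the team's actual format: the class names, fields, and checks are assumptions that mirror the constraints listed above (line references, a mandatory account of side effects, no roaming outside the given repository).

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import List


@dataclass(frozen=True)
class LineRef:
    path: str   # file path, relative to the repository root
    start: int
    end: int


@dataclass
class ArchaeologyReport:
    repo_root: str
    summary: str
    references: List[LineRef]
    # An explicit "none" still counts as an account of side effects.
    side_effects: List[str] = field(default_factory=list)


def validate(report: ArchaeologyReport) -> List[str]:
    """Return a list of problems; an empty list means the report is acceptable."""
    problems = []
    if not report.references:
        problems.append("no line references")
    if not report.side_effects:
        problems.append("side effects not accounted for")
    for ref in report.references:
        path = PurePosixPath(ref.path)
        if path.is_absolute() or ".." in path.parts:
            problems.append(f"reference escapes the repository: {ref.path}")
        if ref.start < 1 or ref.end < ref.start:
            problems.append(f"bad line range in {ref.path}")
    return problems
```

The report carries references and a summary only; file contents are deliberately absent, which is what keeps the legacy style out of the main agent's context.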

Then come the rules by which the agent writes our own code. Each service has its own layering instruction with a strict decision table and one honest exit: if a change doesn’t fit the layers, the agent must stop and ask rather than invent a new layer.

The same instructions damp the models’ usual habit of over-engineering — no abstract interfaces where there is only one implementation, no per-entity repositories “for the future.” Separate rules require failing immediately when a setting is missing rather than substituting a silent default that surfaces as a bug later.
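The "fail immediately" rule for settings can be illustrated with a short helper. The `require` function and the `MissingSetting` exception are hypothetical names, shown only to contrast with a silent default.

```python
import os
from typing import Mapping


class MissingSetting(RuntimeError):
    """Raised at startup when a required setting is absent."""


def require(name: str, env: Mapping[str, str] = os.environ) -> str:
    value = env.get(name)
    if value is None or value == "":
        # No silent default: a missing setting stops the service at startup
        # instead of surfacing later as a confusing bug.
        raise MissingSetting(f"required setting {name} is not set")
    return value
```

The pattern this replaces is something like `env.get("REDIS_URL", "redis://localhost")`, which keeps a misconfigured service running against the wrong target.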

One endpoint at a time

The rollout is built so you can roll back at any moment by removing one line in a config. Two stands are brought up locally, the old and the new, identical at first. On the new one a container with the rewritten service is added, and nginx, the web server out front, decides where each request goes.

Request hits	Where it goes now
Rewritten endpoints — right now, authorization	The new Python service
Everything else	The old PHP, as before
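The routing decision in the table can be modeled in Python. This is a model of the rule, not nginx configuration: the `/api/auth/` prefix and the upstream names are hypothetical.

```python
from typing import Sequence

# Path prefixes served by the new service. Removing an entry is the rollback.
REWRITTEN_PREFIXES = ("/api/auth/",)


def upstream_for(path: str, rewritten: Sequence[str] = REWRITTEN_PREFIXES) -> str:
    for prefix in rewritten:
        if path.startswith(prefix):
            return "python"
    # Anything not explicitly rewritten keeps going to the legacy backend.
    return "php"
```

Calling `upstream_for("/api/auth/login", rewritten=())` returns `"php"`, which is the one-line rollback expressed in this model.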

The migration goes one endpoint at a time, under the protection of tests, with no “big bang” where everything is switched at once. The new code runs on the same production database as the old system, so any change to its structure has to coexist with what legacy keeps writing into that same database.
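One way to make a schema change coexist with legacy writes is to keep it additive and safe to re-run. The sketch below uses SQLite purely as a runnable stand-in for the production database; the `players` table, the `locale` column, and the helper name are hypothetical.

```python
import sqlite3


def add_column_if_missing(
    conn: sqlite3.Connection, table: str, column: str, definition: str
) -> bool:
    """Add a column only if it is absent; return True when a change was made."""
    # Identifiers come from migration code, never from user input.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {definition}")
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT)")

first_run = add_column_if_missing(conn, "players", "locale", "TEXT DEFAULT 'unknown'")
second_run = add_column_if_missing(conn, "players", "locale", "TEXT DEFAULT 'unknown'")

# A legacy-style write that knows nothing about the new column still succeeds.
conn.execute("INSERT INTO players (name) VALUES (?)", ("demo",))
row = conn.execute("SELECT name, locale FROM players").fetchone()
```

The new column has a default, so the old code can keep inserting rows without naming it, and running the migration twice changes nothing the second time.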

Today, authorization already works through this routing: ten endpoints for players and admins, rewritten to Python and covered by tests before the rewrite. For now this is a working environment running beside the live system, brought up in the client’s Russian cloud so that player data doesn’t leave its perimeter; the new code rolls out onto production itself in the same endpoint-by-endpoint way.

The next large chunk is the game server, the websocket one: in legacy that’s about 32 thousand lines, a dozen state machines (essentially a separate engine for each game type, running a match by its rules), and more than eighty message types in both directions. It’s in progress now: the skeleton, the routing, and a check that Redis doesn’t lose data on restart (the very thing that left players without their tournament results that night) are already in Python; by the team’s estimate that’s about a fifth of the server.

Under the hood — how the harness is built

THE LIVE LEGACY

11 repositories · REST API in PHP · game servers in Node.js · web-TV, mobile, Android TV, Samsung · one Postgres, Redis, deploy by shell scripts

two parallel tracks

MAP: UNDERSTAND WHAT EXISTS

build a shared picture of someone else's system before the first line of new code

Architecture map of the legacy (C4 notation)

across 11 repositories: context · containers · components, plus a read-only inspection of production

Archaeologist sub-agent (read-only)

the path to the repository is handed in, guessing is forbidden

returns a summary with references to lines and an account of side effects

Legacy-bug ledger (travels with the code)

a status and a decision for each: carry as is, deviate, or fix

fix only if the risk of the change is below the threshold

SPEC IN TESTS: PIN DOWN BEHAVIOR

behavior, not a document, becomes the acceptance criterion

One suite on two stands (old and new)

the test doesn't know where it runs; a divergence in behavior is the error

first a run against the live system — a check that the test is right

Multiplayer in browsers (websocket)

a host on web-TV plus phone-controllers in separate contexts in one room

no mocks; in the product we change only the markup identifiers

Coverage map (1032 scenarios / 15 domains)

the single source of truth for scenarios and the work plan

one reference for behavior: map of the system, bug ledger, tests on two stands

PORT TO PYTHON: parity down to side effects

a uv monorepo · FastAPI + Pydantic v2 · async SQLAlchemy

schema changes made safe to re-run so they coexist with what legacy keeps writing into the same database

contracts are the single source of message shape

snake_case in code, camelCase on the wire; frontend types are derived from them

layers per service plus rules for the agent

fail when a setting is missing instead of silent defaults, no over-engineering "for the future"
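The contract item above (snake_case in code, camelCase on the wire) can be illustrated with the standard library alone. The `RoundStarted` message and its fields are hypothetical; with Pydantic v2 the same mapping is usually configured through an alias generator rather than written by hand.

```python
from dataclasses import asdict, dataclass


def to_camel(name: str) -> str:
    """Convert a snake_case field name to camelCase."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


@dataclass
class RoundStarted:
    # Hypothetical message: field names are snake_case in code.
    room_code: str
    round_number: int
    time_limit_sec: int

    def to_wire(self) -> dict:
        # The wire format uses camelCase keys derived from the same definition.
        return {to_camel(key): value for key, value in asdict(self).items()}


message = RoundStarted(room_code="ABCD", round_number=2, time_limit_sec=30)
wire = message.to_wire()
```

Because the wire keys are derived from one definition, the Python code and the frontend types cannot drift apart field by field.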

ROLLOUT: one endpoint at a time

nginx rules route the rewritten addresses to the new container, everything else to the old PHP

rollback — remove the route line in the config, no "big bang"

authorization runs in a working environment beside the live system · game server: skeleton and routing in Python (~a fifth) · 1032 scenarios, a small part automated
Python · FastAPI · Pydantic v2 · SQLAlchemy async · uv monorepo · archaeologist sub-agent · bug ledger · coverage map
the agent scaffolding stays with the developers and is not part of the client's product

What changes and where we are now

The main goal of the project is to cover the whole system with a development harness: the scaffolding through which any task on it can be driven by an agent with minimal human involvement. The rewrite to Python is secondary. As much gets rewritten as we manage, while the harness has to cover everything, including the parts we won’t touch.

So the result of this first part is not measured in lines of rewritten code. Work that used to require a whole team is now carried by a couple of people with agents, and the legacy stays whole in production the whole time.

Piece of the system	State
Legacy map, bug ledger, agent harness	Done
Authorization: 10 endpoints	Rewritten, running in a working environment
Game server	In progress, about a fifth
Coverage map: 1032 scenarios / 15 domains	Being automated, a small part so far
The rest of the clients, observability	Next on the plan

The deal itself is structured to be honest about unfinished work: the timeline and the harness coverage of the system are fixed, while the depth of the rewrite stays variable, going as deep as we manage and starting with the heaviest pieces of legacy. The order is the same everywhere: a test first, then a line of new code.