NextRealm Bubbles Fun Debugging
Little Bug That Almost Went Unnoticed
Prologue
Recently, I’ve been working on the tech behind my Next Realm multiplayer games portal. Part of that was replacing ECS library I use from Kevin’s Experimental ECS written in Haxe, to the new popular kid on the block – bitECS.
Of course, bitECS being JS and based around using Typed Arrays instead of objects, the usage is quite different. So obviously I’ve had to wrap it around using Haxe’s macros, to achieve similar syntax to what I’ve had before. Just compare this, the “bitECS” way:
for (eid in query(world)) {
world.position.x[eid] += world.velocity.x[eid] * dt;
world.position.y[eid] += world.velocity.y[eid] * dt;
}
Versus the “Haxe” way:
for (eid => e in query.on(world)) {
e.position.x += e.velocity.x * dt;
e.position.y += e.velocity.y * dt;
}
Might not look like too much of a difference, but there’s more. I only get the components that the query has, it’s all fully type-checked. It also handles defining components, mapping fields that are not bitECS (or rather Typed Array) compatible. Defining queries, worlds, etc. It’s all open source and you can find it at hxbitecs.
Oh, and bitECS also improved the performance. Apart from much lower GC pressure (almost no allocations), the server logic got around 30% faster. Yay.
The Testing
Because replacing a library that drives the whole game is a major undertaking, I’ve had to test that everything works as it should. The game I was upgrading, NextRealm Bubbles, is quite a simple one, but being multiplayer means that no, not really that simple.
I might write about my approach to testing multiplayer games some day, but shortly: I run simple bots that mimic real player behavior (connect, disconnect, play, die, play again), and I let it run for a few hours. That generally catches most server bugs. And because the bot is written in a way that uses as much of the real client code as it can, it tests almost all of the client logic too (not the rendering though).
On top of that, inspired by (Re)Actor system as described in IT Hare’s books, the test run of my server generates logs that allow reproducible replay. I.e. if it crashes, I can replay it again, all the way to the crash. That’s no small feat, and very useful for bunch of things.
The Result
So, one day I left the testing to run with 60 bots for ~8 hours. When I was going to sleep, I went to turn it off. To my surprise I was welcomed with empty desktop – freshly restarted. Damn Windows Updates. I forgot to disable them, and it was outside the “active hours”, so Windows just decided to update. I guess that’s my fault for snoozing them a bit too often.
The update happened just a few minutes ago. Alright, that means the testing ran for quite long. I quickly started replaying the logs (1GB of data), and while I brushed my teeth I felt happy about the whole system. Replaying 8 hours of logs should take around 20 minutes for this game, and it did. Successfully reaching 99% of logs, it crashed. Wait. It crashed right at the end of the replay. Curious. What bug could take 8 hours to manifest?
I went to bed, not feeling happy anymore, thinking it surely must be due to Windows Updates doing something weird, crashing the bots and eventually the server, who knows what happened. “I will deal with it in the morning,” I thought, and so I did.
Moar RAM Needed
A little side note: The replay system I have loaded everything all at once, because I never tested for so long, it wasn’t an issue. This time it was. I could only run it in the freshly-restarted-state, without any other apps opened.
So in the morning, I had to rewrite the system to stream the logs instead, otherwise I wouldn’t be able to actually use the PC while waiting on the replay to finish. (Not even swap space helped here.)
The Bug
Thinking with a fresh head, the update and restart couldn’t have caused an exception. It would disconnect the bots, sure, but the server should handle that. It might stop the server process, but that’s alright. Whatever it is, it’s a real issue. Just very weird coincidence that it manifested right before the Windows Update.
The bug itself wasn’t very exciting, comes down to different handling of components being added/removed to/from an entity. Old ECS used callbacks and did it immediately after a system ran, bitECS gives you the control over it using special queries, so you can run them whenever you want. I did handle that properly in all, but one place. The one that’s outside of the game loop – when a client disconnects and the connection manager triggers a component change. Fix was to simply run the update query right afterwards. One line of code.
It took so long to manifest, because the bots don’t cause that many disconnection events (I think I’ve had it configured to one every few seconds in total), and for the bug to actually happen, some other entity would have to interact with the, now removed, entity in the same tick.
Anyway, easily fixed, difficult to find. Would probably take me months if I didn’t have this replay system, and didn’t run it for long enough. I’m thinking I will have to run it over the weekend or something, for the next games.
Moral of the story? Multiplayer is difficult, testing is useful, and one should keep order-dependent logic in a single place if possible. I did have a similar bug – doing changes outside of game loop – in NextRealm HexSweep. I did not have a replay system back then, it took bots ~2 hours at increased speed to cause it. Was a pain to debug, took me over a week.