Optimizing the NetHack 4.3 save system

Tags: nethack internals save | Written by Alex Smith, 2014-11-17

In NetHack 3.4.3, saving is relatively simple: the game basically just writes out memory images of its internal structures to a save file, with the main complexities being handling pointers (which don't persist well). There are a number of separate save files: one for each level in the game, plus "lockfile 0" which stores information that isn't tied to a level, such as the character's inventory.

These files are kept as separate files on disk while playing the game, and because NetHack grew up on systems with limited memory, are not cached in memory apart from lockfile 0 and the current level. An actual save file is basically just an archive of all these files; the recover utility will bundle the files up into an archive for you, in case the game crashed and couldn't do it itself. (In such a case, typically the game will be rewound to the start of the level, because the in-memory versions of the current level and lockfile 0 will have been lost, leaving only stale disk versions. Lockfile 0 itself is flushed to disk on every level change just so that recover works correctly.)

Although this save system is simple, there is huge scope for improvement. For instance, it is not portable between platforms (something that, at least, is easy to fix by writing the structures field-by-field), and you lose your game if the game engine does a panic and you didn't keep a backup of the save. Because savescumming is frowned upon in NetHack circles, there's considerable incentive not to keep a backup just to prove that you aren't cheating. (Most public servers have now found a pretty good compromise: the server automatically keeps backups, but they aren't accessible to anyone but the server admin, who will only restore them if the game is verified to have been lost due to a bug. On nethack.alt.org, the largest 3.4.3 server, the most common reason for such a restore is the infamous "hangup while loading a save" bug that deletes the save file, and which can happen to anyone due to an unlucky timing for a network disconnect.)

NetHack 4 and its predecessors have experimented with various other save systems. NitroHack used a radically different save system that basically just records the game's starting RNG seed, plus all commands that were entered by the player, the idea being that the gamestate could be reconstructed by replaying them. (Some other roguelikes, such as Brogue, use the same principle.) Sadly, though, NitroHack's implementation was unusably buggy in practice, causing many save games to be lost (and it didn't help that an optimization in the save system, storing a 3.4.3-style binary save at the end for fast loading and deleting it on each load, meant that a corrupted save would be hard to detect, and appear to work fine for a while despite the game being doomed). It also suffered from some shortcomings even in theory; notably, any change to gameplay at all (even something as simple as adding or removing a yes/no confirmation on a prompt) would invalidate all existing saves.

NetHack 4.2's save system basically put redundant layers on top of the NitroHack save system. In particular, it stored a diff of the gamestate between each turn and the next, in addition to the commands. Loading a save was preferentially done by adding together diffs, rather than using the commands (which were a last resort used if the diffs were corrupted). This system still went wrong a lot because it was based on the NitroHack save code, but there was enough redundant information that I could nearly always recover a save file manually (typically with the occasional bizarre residual corruption, such as items with nonexistent names).

For NetHack 4.3, the save system was rewritten from scratch in order to avoid the numerous bugs that had persisted from the famously buggy NitroHack save system, ensuring that things like locking discipline was defined and respected by all processes that might read a save. For instance, if you get disconnected from the nethack4.org server and you're playing NetHack 4.3, you can just reconnect and continue your game even if the first process is still alive; the two processes will both be capable of entering commands. There were also some robustness improvements, like making a complete backup of the gamestate every now and then. The basic principle, though (command log + gamestate diffs) was the same as the 4.2 save system.

The NetHack 4.3 save system has been great for both players (allowing features like full historical farlook where you can farlook any square on any level on any previous turn of the game, and be told what your character saw or remembered on that square back then), and developers (the vast majority of engine bugs are trivially easy to reproduce, as we can just rewind the save and do the same commands). However, there's currently one aspect that causes a lot of practical trouble: because the save file contains so much information, it's very large in terms of disk space, with a compressed NetHack 4.3 save file being around the same size as an uncompressed ttyrec of the same game. This isn't normally a problem for local play (with the possible exception of pudding farmers, who arguably deserve what they get), but it is a problem on servers (over 10% of nethack4.org's disk space is consumed by save files every couple of months, which is why I need the policy of deleting old recordings).

Existing save system optimizations

Most of the existing optimizations in the NetHack 4.3 save system is currently mostly optimized in the save engine (libnethack/src/{log,memfile}.c if you want to follow along at home). In particular, there are two layers of compression. First, the gamestate is diffed against the previous gamestate, but using a content-aware diff; the game engine (libnethack/src/save.c and several parts of other files with a similar function) sends hints to the save system about what part of the save file means what, and the save system will use this information to diff a particular field of the new gamestate against the same field of the old gamestate (if it exists). These hints (called mtags in the source code) affect only the efficiency of the diff, not the correctness; a misplaced mtag will simply cause the diff to be larger than necessary.

The diff is then compressed with zlib (unless it's so short or uncompressible that compression would make it longer rather than shorter), encoded in base64, and written out to the save file. The encoding of the diff itself had some minor tweaks to increase how compressible it is, although not much was done in that direction.

However, there's a lot of scope for optimization that, up until I started writing this blog post, hadn't really been looked at. The encoding used for the diff format was a "best guess" rather than driven by profiling; and more importantly, I'd made basically no changes for compressibility to the format used to represent the gamestate itself. If a field changes, the new value is in most cases stored in the diff exactly the same way as it appears in memory, and the definition of a field changing is that its representation in memory changes. There are a few exceptions driven by concerns other than save file size (normally related to backwards compatibility with older save file formats).

There is one other optimization I made, within the last few days. I needed to replace the random number generator that NetHack 4 uses due to numerous issues with the old one. For the new random number generator, I basically just run consecutive 12-bit integers through SHA-256 (using /dev/random or a similar source to obtain the initial seed). This was partly chosen to remove an exploit (reverse-engineering the RNG seed from random results in-game, then using your knowledge of the random number generator to get infinite wishes or the like) that's been performed on various servers in the past (nobody tried it on nethack4.org, but it would have been possible, because the Mersenne Twister has no real protection against this sort of thing, and because the starting seed was just 32 bits, which is within the ability of modern consumer systems to brute-force). However, another reason I chose that particular RNG is that it diffs very well; you can make thousands of RNG calls and still typically only have to update two or three bytes in the gamestate. This counts as the first (and, before I started writing this blog post, only) change to the game engine for reducing the save file.

Thus, there's presumably huge scope for improving the save file size. However, before optimizing, it's a good idea to profile to know exactly where the problem is. I'm going to focus on reducing the size of the diff portion of the save file (a simple visual inspection reveals that this makes up the majority of the save file bytitself), and have added instrumentation to NetHack 4 to explain every single byte that goes into a diff. Here goes…

Test case: Walking into a wall

One of the simplest things we can do that actually has an effect on the gamestate is to walk into a wall. This takes no time, but it's become reasonably well known by now that it erodes engravings that you happen to be standing on, by a random number of characters. The random number is generated even if you aren't actually standing on an engraving, making "wallwalking" very useful when doing highly in-depth debugging (or something like a tool-assisted speedrun) to change the sequence of random numbers you would otherwise get.

Here's the save file entry from a test game where I did just that (NetHack 4.3 save files are append-only, so it's easy to identify which lines are produced by any particular command):

move D6
+5cd5e2b
~EMAEgMKp3FH/////sM0BgOKOzA==

We have three lines here. The first line specifies that I was performing a move command, and the argument. The second line records the timing of the command (used to make sure that realtime-based elements of gameplay are reproducible, and maybe someday to replay back a game at the same speed that it was played). The third line is the save diff, which looks reasonably incomprehensible in base 64, but we can decode it into hexadecimal, and in turn into editing commands, to produce the following diff:

10 c0 04 80 c2 a9 dc 51 ff ff ff ff b0 cd 01 80 e2 8e cc
 copy  edit data         copy  copy  copy  edit da  copy
16    4     c2 a9 dc 51 16383 16383 3504  1     e2 3214

The diff format uses "copy", "edit", and "seek" commands ("edit" comes with additional data, marked data or da for a single byte), using 14 bits for the number of bytes to copy/edit/seek, and 2 bits for the command itself. Basically, there's a pointer in both the old and new save file; copying will copy bytes from the old to the new pointer; editing will put new bytes into the new savefile, and advance both the old and new pointers by that distance; and seeking just moves the old pointer. (Thus, there's never a need to seek unless data was inserted to or deleted from the middle of the binary representation of the gamestate; editing effectively discards data from the old savefile, but in a way that's reversible by seeking backwards.)

We can see that a total of five bytes changed in the save file; the first four are flags.turntime, and the last is flags.rngstate[12], both fields that we'd expect to be modified. (The RNG diff is only one byte due to the save-friendly RNG that was mentioned earlier; the byte in question is the 12th because the first 12 bytes are used for the initial seed. Although the RNG state has more structure than just "a long list of bytes" internally, the save system sees it as just a long list of bytes.)

What sort of optimizations can we see in such a small diff? One is that perhaps there might be a more succint representation of the editing instructions; we have a 19-byte diff (expanding to 28 bytes when base64ed), for a total of 5 bytes changed in the save file. This is not terrible, but it doesn't seem impossible that we could do better. However, from a short sample, it's not yet clear what specific optimization would be useful (except that base64 is kind-of ridiculously wasteful as a textification format when base85 exists).

The other thing of note is that flags.turntime is already represented by the time line in the second line, and so is technically redundant. However, we'd have to add together all time lines in the entire save file to load a save if we removed flags.turntime from the diff, so that would be a bad idea. One possibility is to remove time lines from the save file that are immediately followed by a diff, because they're reundant; I actually planned to do this at one point but couldn't get it to work. This has its own tradeoff, though, in requiring that you have to parse diffs to be able to determine the timing of events in a save file, which would make statistical analysis harder to perform.

Time to look at a more complex example and see what happens.

Test case: waiting one turn

The simplest time-consuming action is waiting. Here's what happens when we wait on the first turn of the game:

wait
+3f17aa9d
~$131$eNrjOcDYwMR8gKUhPuTLBAkgx1MJSDCIgQhmEMF6iLGh2RnI0k4Eshw4gCwHdpCmbX+B
 xJGDjA1C5UAuWwiQ4IETbDePAWWZjjM2yLMdgBF7fjI2PNt/kqnBWSAJyGcEmc+YdggAbsEta
 Q==

(I split the diff over multiple lines for readability, but it's in one line on the original save file.)

The base64 is now long enough to be compressed. (The $131$ at the start indicates that the data has been compressed, and expands to 131 bytes uncompressed; the length value here is used to be able to allocate a buffer for holding the uncompressed data without having to uncompress it first, in a chicken-and-egg problem. We also tell zlib how long the buffer is, to avoid one potential source of buffer overflow attacks using malicious save files, although I'm pretty sure plenty of others remain; if the data doesn't fit, we report that the save file is corrupted.) As before, we can debase64, and decompress it to get the original data:

0c c0 01 80 02 03 c0 04 80 5f 54 f4 90 18 c0 01 80 49 22 c0 01 80 00 16 c0
 copy  edit da  copy  edit data         copy  edit da  copy  edit da  copy
12    1     02 3     4     5f 54 f4 90 24    1     49 34    1     00 22

01 80 00 03 c0 01 80 00 05 c2 01 80 83 43 c0 01 80 2b 61 c2 01 80 40 08 c0
 edit da  copy  edit da  copy  edit da  copy  edit da  copy  edit da  copy
1     00 3     1     00 517   1     83 67    1     2b 609   1     40 8

01 80 40 07 c0 01 80 02 b6 fd 01 80 02 c4 c1 01 80 12 77 c0 01 80 06 54 c0
 edit da  copy  edit da  copy  edit da  copy  edit da  copy  edit da  copy
1     40 7     1     02 15798 1     02 452   1     12 119   1     06 84

01 80 0c 54 c0 01 80 0c 54 c0 01 80 06 d9 c6 01 80 02 02 c7 01 80 1f 06 c0
 edit da  copy  edit da  copy  edit da  copy  edit da  copy  edit da  copy
1     0c 84    1     0c 84    1     06 1753  1     02 1794  1     1f 6

01 80 1f 06 c0 01 80 1f bc f9 01 80 e6 bf c9 02 80 43 10 62 c0 01 80 01
 edit da  copy  edit da  copy  edit da  copy  edit data   copy  edit da
1     1f 6     1     1f 14780 1     e6 2495  2     43 10 98    1     01

03 c0 01 80 01 66 c2
 copy  edit da  copy
3     1     01 614

This already is showing one potential for huge savings: much of the file is made out of alternations of "copy" and "edit 1", meaning that a combined command for "copy this many bytes, then edit one byte" would reduce this very common sequence from 5 bytes to 3 bytes. Because we're currently encoding the three possible diffing commands using two bits, we have one unused command that could be used for optimizations in a backwards-compatible way, and this is one possible use for it. (Another possible use would be for a "small-distance copy/edit/seek" encoded in one byte, but this would not be backwards compatible because it would require reversing the order of the two octets of each command; they're in their current somewhat counterintuitive order so that the 14 bytes that make up the distance are contiguous on a little-endian processor like x86, but this forces all commands to be two bytes long because you can't determine how long the command is from the first byte.) The choice of 14-bit integers for the sizes of copy commands definitely seems to have been the right decision for this save file at least; there are plenty of copies larger than 63 bits, but none larger than 16384.

Another way to gain savings would be to remove some of those edits. Here's what's being edited:

moves (1 byte)
flags.turntime (4 bytes)
flags.last_cmd (1 byte)
flags.interrupted (1 byte)
flags.last_arg.argtype (1 byte)
flags.last_arg.dir (1 byte)
u.uhunger (1 byte)
u.ublesscnt (1 byte)
u.ever_intrinsic[0] (1 byte)
u.ever_temporary[0] (1 byte)
u.ever_temporary[9] (1 byte)
lev->lastmoves (1 byte)
mon->movement (1 byte for each of 5 monsters = 5 bytes)
copy of moves in the regions save code (1 byte)
spl_book[i].sp_know (1 byte for each of 3 spells = 3 bytes)
flags.rngstate[12] (1 byte)
utrack[0].x, utrack[0].y (2 bytes)
utcnt (1 byte)
utpnt (1 byte)

These basically fall into the following categories:

Tracking the passage of time in the game in various units (moves, flags.turntime, flags.rngstate); these are not a problem.
Information about the player's previous action (the other flags edits); this is not entirely redundant to the command line (e.g. the engine needs to run to determine whether the command was interrupted). There's not much we can do about this in terms of the data being needed to be saved, but it might be possible to find a more compact representation.
Information about the recent history of the game (e.g. utrack). It's probably best that this stays in the save file where it is; otherwise, the save file would be very hard to seek. (utrack uses a circular buffer, which is a pretty diff-friendly format. History in unfriendlier formats might need to be saved differently, but I don't think that's a problem at the moment.)
Information about which events have ever happened (u.ever_*, utrack.utcnt); this is going to be irrelevant over the cause of a long game. (utrack.utcnt stores which elements of the track array have ever been in use, if you were wondering; it climbs up to 50 on the first 50 turns of the game, and then stays there forever. This means that it's redundant, but not really harmful.)
Information which changes by a fixed (or mildly varying) amount every turn, such as u.uhunger, u.ublesscnt, lev->lastmoves mon->movement, spl_book[i].sp_know, utpnt, and (especially) the redundant copy of moves. This is responsible for most of the save writing (14 separate writes, each costing 5 bytes in the diff), and also (independently of that) the sort of write that there's the most scope for optimizing.

It seems clear, then, that the major gains in the save file size for this sort of turn are in optimizing information that's tied to the turn counter. Before making a concrete plan, though, we need to check that the same pattern holds for larger turns.

Time-consuming turns in extreme circumstances

The largest ever save file on nethack4.org was a pudding-farming game by Morwen, which came to 398024142 bytes over the course of 174583 turns. This is an average of 2279 bytes per turn.

Most of this size is, as usual, in save diffs. I unquit the game (another advantage of NetHack 4.3's save system is that "unquit" is actually a meaningful operation, although it's restricted to debug mode or manual edits of the save file for obvious reasons) and played a couple of turns (attacking puddings with Morwen's katana) with save file instrumentation turned on, to see where the size was coming from. The diff came to 2279 bytes (compared to the 131 bytes for the first turn of the new game). I'm not going to paste the decoding of the entire diff here because that would be ridiculous, but here are some statistics:

The diff has 596 copy instructions (46 of which are for 16383 bytes, meaning that multiple copies were needed to copy one block of data); 548 edit instructions (every single one of which was for a single-digit number of bytes, and 539 of which were for a single byte); and 1 seek instruction.
The seek instruction performed the following resynchronization:
- MTAG_MON:000225c3 from MTAG_MON:000225ca
It thus seems as though the seek mechanism is working correctly; this seek makes total sense given the gamestate, and what I was doing on that turn (killing a monster). (There were a total of 7 seek instructions over the few turns I played, all of which look reasonable.) Incidentally, the monster ID of 0x225c3 or (in decimal) 140739 is a good example of why pudding farming is so performance-intensive.

However, it's worth noting that seeks make up a very small proportion of the typical save file, and so it's probably safe to give them a less efficient representation if it allows for gains elsewhere.

The sizes of the 7 seeks over the few turns I played ranged from -85 to 102 bytes, so again, 14 bits (this time, as a signed integer) seems like a sensible size for recording the seek offset.
The most edited tag was MTAG_MON. The movement field alone was edited 435 times, making up the bulk of the save file by itself, and (at five bytes per edit instruction) taking up over 2 kilobytes of the 2279-byte diff. (This is a strange feeling, as it's what I'd guessed was causing the save file bloat before doing this profiling. The reason that people profile is that almost all the time, they guess this sort of thing incorrectly, but it seems that my guess was spot on this time.) Even worse, there's limits to how well this can compress; although the diff contains a lot of repetitions of 54 c0 01 80 xx, what the xx is varies from one monster to the next.
Not counting monster movement, the other fields responsible for two or more edit instructions were:
- The copy of moves in the region save code (57 edits). Wow, this thing is saved once per level? Seriously? I assume its purpose was for catching up elapsed time using the "suspended animation" save code from the NetHack 3 series, but we certainly don't need it around any more.
- spl_book[i].sp_know (16 edits); Morwen had rather more spells than my first-turn character in the previous test.
- mon->mfleetim (11 edits), presumably due to Elbereth, another example of something that updates every turn for every monster (but is less bad, because not every monster is fleeing).
- mon->mtrack (4 edits), a "recent history" variable that belongs to each monster.
- mon->mx and mon->my (3 edits), which is not predictable (as any NetHack player will tell you), and also not that space-consuming.
- mon->mhp (2 edits), which is slightly more predictable than coordinates in the way it changes over time, but is unlikely to be a problem in practice (people tend not to leave large numbers of wounded monsters around except when pudding farming, and monsters regenerate slowly).
Almost every copy instruction was followed by an edit instruction. The exceptions were one copy that was followed by a seek, and the 16383-byte copies (which were, unsurprisingly, followed by other copies). At this point, it seems like a trivial way to get diffs almost 40% smaller is to just have a "copy then edit 1 byte" instruction outright replace the copy instruction (if you didn't want to do an edit after the copy, you just copy one byte less, and replace the last byte of the copy with the edit), at least for copies shorter than 16383 bytes long. However, it would probably be unwise to do this until after other optimizations, because it's looking increasingly like a large proportion of the one-byte edits can be optimized out.
The limit of 16383 bytes in one copy instruction was starting to show flaws in such a large save file; at one point, the diff contains 24 ff bytes in a row. However, the representation for copy 16383 (ff ff) was chosen specifically to compress well when repeated, so this probably isn't a problem due to diff compression.

Clearly, the biggest threat to save file size is gamestate values whose changes scale in two dimensions (both turn count, and per-level or per-monster or per-spell or per-item or whatever). Especially if the new value is unpredictable (as with ->movement), or the number of bytes to seek is unpredictable (as with the region code copy of the turn count), because both of these cases prevent zlib from just compressing the repetitions down.

One other interesting statistic: I didn't have instrumentation in place when this game was being played originally, but I can still inspect the save file manually. Most of the diffs are in the range of 6000-8000 bytes uncompressed (with a few in the 3000-byte range which I'm guessing are turns on which Morwen got a free action, and so the turn counter didn't increase; the command log shows that she wasn't doing anything different from normal). The diffs in my instrumented version are just 3000 bytes, even when the turn counter incrased. Why the difference? Well, the old RNG had a 2508-byte internal state (yes, bytes; ridiculous in retrospect that it was originally seeded with only 4 bytes of random data), and although only a small portion of it changed on any given RNG call, the large number of RNG calls involved in pudding farming may have caused the entire state to change, adding around two and a half kilobytes to the save diff (which, is, as with the nature of randomness, pretty much completely uncompressible). This possibly doesn't explain the difference by itself, but because Morwen was playing on the old RNG and I was instrumenting on the new one, it's likely a big chunk of the difference, at least.

Optimizing the save format

So, we've looked at some examples on both ends of the scale, ranging from trivial moves to a move taken on the largest save file ever. (One other situation to think about is the generation of a new level, which is pretty much entirely made out of max-length edit instructions because there's nothing to diff against; this is fine under the current system, but we need to make sure we don't pessimize it by, say, removing the large edit instructions that seem to be unused elsewhere.)

We have one awkward constraint right now, though, to do with release timing. NetHack 4.3 is currently in public beta; there are many unfinished 4.3-beta1 games on the nethack4.org server, and (although I have no way to count them) likely numerous unfinished 4.3-beta1 games on people's computers at home. For 4.3-beta2, I'd like it if players would be able to continue their existing games rather than having to start from scratch (some players can put months of effort into an existing game). Thus, although we don't have to generate save files that can be read by old copies of the same engine, we have to make sure we at least are capable of reading old saves.

Something complicating this is NetHack 4.3's "desync detector", which has both been really useful in catching bugs and preventing people losing their games, and a way of making it really difficult to change the save format. Basically, whenever the game reads or writes the save file, it loads the gamestate from the save file, saves it again to memory, and compares the two to ensure they match, byte for byte. If they don't, it reports an error to the user and rewinds the turn. This makes silent save file corruption basically impossible (the only way it can happen is due to data which isn't saved in the save file at all, typically due to global variables that nobody has noticed even exist). This causes trouble for removing unused or redundant fields from the save data, such as the turncount in the regions code; deleting them outright, setting them to zero, or the like will trigger the desync detector. About the best that can be done if we want to keep save compatibility without changing the desync detector itself to be hugely more complex is just to freeze them at whatever value they happen to have in existing files eternally, and starting them out at zero in new saves.

(Another problem with the copy of the turncount in the regions file is that it's actually read and used, and appears to have some special-casing for bones files. I'm guessing/hoping this is all dead code, but would need to trace the codepaths for a while to be sure. Maintenance on 26-year-old projects can be hard sometimes.)

Supposing we didn't have to worry about backwards compatibility, what could we do?

One simple optimization that doesn't require changing anything in the game internals: for things that naturally change at the rate of 1 per turn (like turncount), express the value in the save file as a difference from the turncount (or the sum of the value and turncount, if it goes down over time). This probably needs a special case for values that go down to zero and then stay there, but that's not conceptually very difficult. A big advantage of this method, besides not needing to touch the game engine, is that it can handle arbitrary changes to the value; any legal number will have some valid encoding, and worst-case, we'll just have to change it every turn like before. This means that it'll bring savings even for things like hunger that change by approximately 1 every turn. This would work great for spell knowledge, but not monster movement.
Alternatively, we could change the way the game engine stores these sorts of data; for instance, it already stores timers as the time that they'll expire (or the number of remaining turns if the timer is paused, as with corpses in an ice box). This is probably a little "cleaner" than the previous method, but easier to get wrong. It also allows for futher scope for optimization; the game engine knows the exact rate of nutrition consumption (whether it's 1 per turn, 1.05 per turn because you're wearing an amulet, 0.05 per turn because you have a ring of slow digestion, or whatever), so it would be able to store the current nutrition value as a "turn you get hungry" value and update it whenever the nutrition consumption rate changed.
One of the most radical possibilities is changes to gameplay in order to make the save file format smaller. A simple example would be to change the speed of black puddings to 12, which would ensure that they always had a movement value of either 12 or 0 (barring corner cases like polymorph), and in particular, that the movement values of different puddings stayed in sync with each other. Making gameplay changes just to help out the save code is, of course, a terrible idea. However, looking at the save code can suggest gameplay changes that would help, and if they independently turn out to be good ideas, it kills two birds with one stone; the save code can thus be seen as a springboard for creativity that makes you consider changes you might not think of otherwise. (Don't worry, NetHack players, we're not actually going to change black puddings to speed 12; that was just an example of the sort of thing that can be done.)

The monster movement value presents problems of its own, though, because although it changes predictably, it's not in a way that's at all easy to boil down to a formula. Here are the exact current rules:

The player, and each monster, have a "movement" value. This starts out at 0.
Whenever no player nor monster has a movement value of 12 or higher, a new turn starts. Among other things, this adds each player and monster's speed value to their movement. If this brings the player's speed value to 12 or higher, they immediately get the next turn. Otherwise, skip to step 4.
If the player has at least 12 movement and we've just started a new turn, or at least 13 movement, the player gets an action at the cost of 12 movement.
Each monster with at least 12 movement gets an action. If any monster moved, go back to step 3.
If the player still has 12 movement points (and thus can move, but we didn't go back to step 3 because none of the monsters could), they get their remaining action for the turn now. Either way, nobody has 12 or more movement points now, so go back to step 2.

The game is saved only immediately before the player takes an action (and then only sometimes; if one command takes multiple actions, such as putting on armour, the engine doesn't save mid-command for efficiency reasons, but thankfully, that seems to be irrelevant here).

The first thing to notice here is a bug, which makes things more complex than they could be; the "13" in step 3 is almost certainly a typo (caused by writing > rather than >= in the code). We could change it to a "12" to make the algorithm rather simpler; although I haven't done that yet, I'm strongly planning to (especially as I've reported it as a bug to the NetHack 3 devteam). This is a good example of a positive gameplay change (making the turn order more consistent) that can be driven by save compressibility considerations.

Fixing that bug removes many of the special cases, simplifying our algorithm to this:

The player, and each monster, have a "movement" value. This starts out at 0.
Whenever no player nor monster has a movement value of 12 or higher, a new turn starts. Among other things, this adds each player and monster's speed value to their movement.
If the player has at least 12 movement, the player gets an action at the cost of 12 movement.
Each monster with at least 12 movement gets an action. If any monster or the player moved, go back to step 3. Otherwise, go back to step 2.

Things look a lot more predictable now. This is still very hard to calculate with, though, because we're looking for a representation of the monster movement value that is constant on the player's turn, which is a kind-of awkward point in the sequence to do things. (Here's a pathological case that can actually come up under the current code: the player has speed 0 but more than 24 movement.)

One promising start is to add a new gamestate variable that tracks the number of actions that any single player or monster can have taken so far. Then, we can figure out whether someone can move using that variable, and the movement value at the start of the turn. The algorithm becomes this:

The player, and each monster, have a "movement phase" value ("phase" here with its signal processing meaning, i.e. "how it's adjusted relative to the start of the turn"). This starts out at 0.
At the start of each turn, we set an "actions within turn" value to 1. Then each player and monster's movement phase becomes equal to their current movement phase plus their speed, modulo 12.
If the player's (speed + movement phase) is greater than or equal to 12 times the "actions within turn" counter, the player moves.
For each monster whose (speed + movement phase) is greater than or equal to 12 times the "actions within turn" counter, the monster moves.
The "actions within turn" counter increases. If this is too high for a player or monster to move, go back to step 2. Otherwise, go back to step 3.

This algorithm is less save-intensive than the previous one, as nothing but the "actions within turn" value, a single small integer, changes within a turn, and the movement phase value doesn't change either for players or monsters whose speed is a multiple of 12. However, there's one subtlety here; the previous algorithm checked a player or monster's speed at the start of a turn, this one checks current speed. This is again probably best considered a bugfix rather than an actual problem, given that speed changes kicking in immediately rather than on the next turn is more intuitive. At this point, though, it starts to change some actual strategies that have been developed for the game (even if I'm probably the only person who actually uses them), such as "change into a slow monster to perform a glitch, then change back again before the end of the turn", or "gallop-dash into a wall at the end of turn 1999". It also means you get no opportunity to put on an amulet of unchanging as a blue jelly, which is something that the occasional curious player tries even though it's an absolutely terrible idea in terms of gameplay. I can't think of any such strategies that aren't also either bug exploits or bug workarounds, though, so maybe there isn't a problem; fixing the underlying bugs will make the strategies obsolete.

However, we really need a version of this algorithm that works for arbitrary speeds. 12 is a pretty common speed value, but some monsters have other values (in particular, black puddings, the main offenders for blowing up save files, are speed 6), and the player frequently has a random speed value because speed boots give a randomly large boost. (Luckily, we don't have to worry about different random rolls on different actions within a turn with the above algorithm; the random factor is never 12 or more, so there's only one action on which the random roll is relevant, and so we don't have to worry about inconsistency.)

The key here is that every turn, the movement phase of a monster or player is being set to its old value plus their speed, modulo 12. Thus, the value on a particular turn can be predicted in advance: it's equal to the value on turn 0 (projecting backwards in time if necessary), plus the speed times the turn counter modulo 12. The speed is already recorded in the save file (perhaps indirectly via reference to monster properties), so all that remains is to store the value on turn 0 in the save file. Even better, this can be done in a backwards compatible way; we simply use the existing mon->movement field that caused the trouble in the first place. This will cause monsters with a speed that isn't divisible by 12 to change phase when a save file from an old version is loaded, but this is an acceptible tradeoff, especially as it doesn't run into any of the normal problems (it doesn't trigger the desync detector, for instance). Actually, it'd even be possible to store the value with an offset depending on the monster's speed and the turn count to get perfect backwards compatibility right down to the turn order, but this seems like overkill at this point (and would be quite hard to code); although knowing if a speed-6 monster moves this turn or next turn is important in actual gameplay, and people track it for that reason, I don't know of anyone who remembers it between saving the game, and loading the game for a new play session.

I haven't implemented these changes in NetHack 4 yet, but they're almost certainly going to go into 4.3-beta2: the profiling shows that the potential gain is very large, and the actual changes that need making are pretty minimal.

As for the representation of diffs, the other potential source for optimization identified above, I'm going to leave that alone for the time being and look at it again after the engine changes. Optimizations in one place can often change the tradeoffs you need to make in other places, which is a good reason to start with the largest optimizations first and move down to the small ones later.