About this document: Last modified by Alex Smith, 2015-08-25 Copyright (C) 2013, 2014, 2015 Alex Smith. This document is licensed under the NetHack General Public License. See libnethack/dat/license for details. Saving, Watching and Replaying in NetHack 4 =========================================== History ------- NetHack 3.4.3 used a save system that was effectively just a data dump of the various internal structures used by the game; there were some modifications used, e.g. to convert pointer values to indexes before storing them to disk, but many parts of the save were simply a straight read(2) or write(2) system call. Although simple, this had many disadvantages. For instance, save files would typically not be transferrable from one computer to another unless they were running effectively the same executable (same platform, configuration options, port, etc.). If something went wrong, recovery options were limited. Power failures or game crashes were recoverable because the game stored save files in unpacked form in the lock directory, updated every level change; and thus, after a crash, the player could be returned to the start of the level via running the recover(6) program to pack the files into a save file. However, any problem with the game which lead to a malformed save file being written, or caused internal state corruption in the portion of the gamestate that was saved, would lead to a completely unrecoverable save. This was very rare (because the NetHack 3.4 series went so long with no significant changes, and was relatively conservative even compared to the 3.3 series), but did happen on occasion, especially in the devnull tournament held every November (which made tournament-specific changes to the game which could often cause problems). NitroHack used an entirely different save system. There were two ways in which a game was saved: a log of all the commands inputted by the user, and a binary save produced via serializing the game's internal structures in a processor-independent way. NitroHack would attempt to load the binary save first; if it didn't exist or was corrupted (perhaps due to a game crash), it would attempt to replay the game from scratch instead. In theory, this avoided all the problems that NetHack 3.4.3's save system had. However, it came with worse problems of its own. The log-based save system was very prone to failing altogether, due to uninitialized memory, inputs to the game that weren't correctly saved, or similar issues. Even if these issues were entirely avoided, any change to the game engine at all could change what effect a command had in some context, causing a desync and thus a failure to load the save. This meant that most of the time, the game was relying entirely on the binary save; and the binary save would not be written (or at least, not be written correctly) in situations such as a timeout at a prompt. Idling at prompts would therefore typically delete your save file, which is not an ideal property for a save system to have. The NetHack 4.2 series worked around this issue by adding a third storage format to the save file: every action, it would store the diff of the binary save between that action and the previous action. This cut down on problems involving loss of the binary save; no matter how uncleanly the process exited, it would always be possible to rewind to the start of the turn and reconstruct the binary save via adding together all the diffs to that point. It also stored messages in the save; this meant that when replaying a game, it would be possible to see what commands the player entered even if the game engine had changed to the extent that the saved commands were meaningless. There were several issues with this save system, too. The first is that it was unnecessarily complex in theory; it was storing commands, but they were rarely usable for anything at all. It was unnecessarily complex in practice, too; there were numerous special cases, for things like options and timezones, as well as things like a hardcoded dictionary of commands (meaning that it would be impossible to add new commands while the game was being played and have the commands in the save file remain meaningful). The complexity meant that it was hard to debug; there were several instances where a save file would end up with mysterious stray newlines, or blocks of options written into the middle of the file with impossible values, or the occasional block of NUL bytes, and there was no obvious reason why. This corruption was also hard to detect; because there was so much redundant information in the save file, most of it was never actually read except when something went wrong. (One good point, however, was that the huge amount of redundant information also made even highly corrupted saves possible to recover.) The NetHack 4.3 Log Format -------------------------- In NetHack 4.3, the save system is changed yet again, to keep the advantages of the NetHack 4.2 save system, while avoiding most of the disadvantages, and hopefully not introducing new ones. A NetHack 4.3 log is a binary file, but which contains only ASCII codepoints (and thus can be read easily in a text editor), using the byte 0x0A for newline. Each file starts with a header, consisting of several fields separated by spaces: `NHGAME`, a 4-character code specifying whether the game is current ("save") or finished ("done"), an 8-digit hexadecimal number (the recovery count), and the version number (e.g. `4.003.000`; this is always exactly 9 characters long). For example: NHGAME save 00000001 4.003.000 The header is a fixed length, meaning that the fields in it that need updating can be updated merely by seeking to them, then writing. The second line of the file is also a header, and gives information about the current state of the game: an approximate turn count, the character's location and various miscellaneous status information if the game is in progress, and the death reason if the game is not in progress. This is always exactly 78 columns wide, plus a newline, and is right-justified via padding it to the left with spaces. (The fixed width makes it possible to overwrite it without having to move all the rest of the data in the file.) The third line of the file is also a header, and lists summary information for use in identifying the game: start time, a zero (previously initial RNG seed, but that's now more than 32 bits), game play mode (an `enum nh_game_modes` stored as a hexadecimal integer), player name (encoded in base 64), and class, race, gender, and alignment, as ASCII strings. The start time, and all other times in the save format, are encoded in hexadecimal and count UTC UNIX time in microseconds (that is, microseconds since the epoch, except that time around leap seconds is distorted such that each day appears to be 86400000000 microseconds long). The NitroHack/NetHack 4.2 save systems contained other header information, but this is not the case with the 4.3 system, which moves straight on to a list of lines, separated by newlines: * A 'save backup' line starts with `*`, followed by eight hexadecimal digits that represent the location of the previous save backup line in the file, followed by an entire binary save, encoded in base 64. (This save can be, and typically will be, compressed using raw zlib compression to save space; in such a circumstance, the base 64 will be prefixed by the uncompressed length of the save, between dollars, e.g. `$100$`.) There is a space between the location and the save, but no spaces within the save itself. Save backup lines are generated upon new game (as the first line in the file apart from the headers), and periodically thereafter: if a save diff line would be generated, it's converted to a save backup line if the uncompressed length of the previous turn's binary save (in bytes) would be less than the number of (after compression) bytes since the start of the previous save backup line. As such, not counting the first save backup line in a file, at most 1/2 of the file can possibly be made out of save backup lines, and something more like 1/4 is most likely. This algorithm prevents the size of the save file blowing up as a result of pudding farming or similar actions; the details are chosen to produce approximately the correct ratio while being fast to calculate. A save backup line is also used in place of a save diff line if a backwards-compatible change was made to the save format while the game was running. (This is for technical reasons: the game needs to know how to be able to produce a save file in order to be able to diff against it, and even a backwards-compatible change to the save format prevents that; the backwards compatibility allows new versions to read old-format save files, but not to write them.) For the very first save backup line in the file, the number given is a hint as to the location of the last save backup line in the file. It need not be correct (and may become incorrect as a result of recovery), so when loading a file, a program must check to see if the value actually points to a save backup line before using it. * A 'save diff' line starts with `~`, followed by a binary save diff against the previous save diff or (if more recent) save backup line, and encoded in (potentially compressed) base 64. It contains no whitespace. These lines are generated whenever a command completes running. (Commands with no effect on the gamestate do not appear in the save file at all, so nor does the save diff that comes immediately after them. In a few cases, such as when a multiple-turn action is interrupted by the client, there is a need to record a minimal change to the gamestate, done via an internal `interrupt` command, purely so that a save diff line becomes legal.) These lines (and save backup lines) explicitly may not be used in any circumstance where a binary save would not load correctly: this includes helplessness, being inside a command, and similar issues. If the save loading code observes one of these lines in an inappropriate context, it will calculate the contents of the new save file (in order to act as a base for future diffs), but not attempt to load it. * A 'command' line starts with the name of a game command (and is recognised via the fact that it starts with a lowercase letter). If the command takes no argument, the line ends there. Commands that take arguments have the arguments specified after a space, and separated by spaces: * A direction is `D` followed by a decimal integer; * A position is `P` followed by two decimal integers (with a comma but no space separating them); * An inventory letter is `I` followed by that letter (which could be a letter, or `$`, `-`, or `,`); * A string is `T` followed by that string encoded in base 64; * A spell letter is `S` followed by that letter; * A limit is `L` followed by the limit (in decimal). * A 'time' line starts with `+`, and contains a number of microseconds in hexadecimal; this is a relative offset from the previous save backup, save diff, or time line in the file (whichever is most recent), and lists the time at which future lines occurred. These appear immediately after user input, which is when the engine samples the current time. * A 'user input' line starts with a capital letter, and specifies the input that the user provided to a prompt. The various sorts of user input lines are as follows, specified in terms of an example for each (all numbers are in decimal unless otherwise stated): * `P!` getpos request was cancelled * `P1,2;` position provided at getpos (x, y coords; key that accepted) * `D-1` getdir request was cancelled * `D0` direction provided at getdir (an `enum nh_direction` value) * `K65` ASCII character provided at query_key (ASCII code) * `K65,2` as above, with explicit count * `Yy` ASCII character provided at yn_function (as a literal) * `M!` menu was cancelled * `M` menu was closed normally, with no selections * `M2a` menu had 1 selected item (ID + 4, in hexadecimal) * `M1:c` menu, most general form (`:`-separated list, in hexadecimal) * `O!` object menu was cancelled * `O` object menu was closed normally, with no selections * general form of object menus is `O` followed by a `:`-separated list; list items are ID or ID,count, ID is in hexadecimal and 4 is added to it (so that small negative numbers are shorter), count is in decimal There is one other sort of user input line: `L` followed by base 64 data, representing the value given by the user as a response to a getlin request. If the user cancels a getlin requset, the base 64 data will be the base 64 encoding of the string consisting of nothing but a single codepoint 27 (0x1b). * There are also some pseudo-user-input lines, for inputs into the gamestate that don't come from the user: * `B!` no bones file was found, when an attempt was made to load one * `B` (when followed by base 64 data) a bones file was loaded * `Q` game over, the character died/quit/escaped/etc. Bones needs to be logged so that it can be reproduced throughout a fraction of a turn; game over is logged to prevent two processes trying to handle a game over at once, and so that watching processes know that a game was quit (quitting isn't technically a command, so it otherwise wouldn't appear in the save file at all). The 4.2 system also logged options and RNG state. Options are now stored in the binary save (and thus stored in save diffs); and the RNG state was stored in the binary save all along. Save file locking and I/O discipline ------------------------------------ Previous versions of NetHack allowed a save file to be opened by only one process at a time, as a solution to the problem of save file corruption due to multiple processes writing to the same file. This was typically accomplished via locking the files. (The locking on some previous versions, such as NetHack 4.2, seems to have been buggy, which may explain some cases of save file corruption.) NetHack 4.3 uses a different approach. Instead of limiting save files to be open by one process at a time, any number of processes can have a save file open at any given time, and the save file has no notion of whether it is open or not. This gives several advantages over the previous method: * If a player becomes disconnected from the game in a way that leaves their process still running, they can continue to play the game in a new process, while the old process just watches (and eventually times out). This avoids the situation seen occasionally on NAO where a player locks themself out of their own game, and needs to ask the server admin for help in killing the process that has the file open. * A process can be exited via just exiting it (even with a `kill -9`), without causing any harm to the save file. This means that game crashes will not require a separate recovery process. * Unlike with previous save systems, there is an obvious way to implement game watching: open the file with multiple processes at the same time, with the "watching" processes unable to write to the file but otherwise identical to processes that would be used to play the game. For the same reason, it is possible for one process to replay a game while another process is playing it. * There is no distinction between watching and replaying a game, and converting between watching and playing a game is very easy (a process simply changes its mind about whether it's willing to write to the file or not). This makes it possible to implement an "instant replay" feature which allows reviewing the recent game turns. * There are also some minor advantages, e.g. it is possible to have two people open the same game and both be able to control the character, which is not particularly useful but is a strict improvement over previous systems. This capability also seems like a promising way to eventually implement the input for an AceHack-style multiplayer mode. To be able to ensure that this works correctly, processes that manipulate NetHack save files need to follow the following rules: * The only ways in which the save file may ever be written to are: * Creating the file (together with its header information); * Overwriting the (fixed-length) fields containing the information about the current state of the game, or the location of the last save backup; * Recovering the file, via increasing the recovery count, and removing one or more bytes from the end of the file; * Appending to the end of the file. A process must always leave a file with exactly one trailing newline after writing to it. (Failure to do this indicates that something went wrong during the write process.) * A process may only ever write to a save file if it holds a write lock on the file. (On POSIX systems, these locks are obtained using fcntl(2); on other operating systems, other locking APIs may be needed.) Before doing any actual writing (but after obtaining the lock), it must verify that the file is in a consistent state (i.e. it ends with a newline), and recover it if it isn't. * Processes may only read from and write to a file using the POSIX read(2) and write(2) system calls (or equivalents, such as pread(2)). In particular, stdio should not be used, because the buffering it uses can mess up the lock discipline. * If a process aims to write to a file, but (after obtaining the write lock) determines that the file is longer than it expected, it should cancel its attempt to write, and ignore the input that caused it to try to write. * A process should never react to user input in a way that changes the gamestate, unless it has first managed to log that user input to the file. This means that all processes will always have the same opinion about what the current gamestate is. * If a process ever discovers that a file does not end with a newline, while it hold at least a read lock, it must recover it. A file is recovered via the following mechanism: * Obtain a write lock on the file; * Increase the recovery count in the header; * Delete characters from the end of the file until the file ends with a newline (this step is skipped if the file already ends with a newline); * Relinquish the write lock. * If a process ever discovers that the recovery count in the file header has increased since its last read (and holds at least a read lock on the file), either because it just recovered it or because some other process recovered it, it should discard its entire knowledge about the contents of the file and reread it from scratch. (This prevents corruption at the end of the file if a process recovers a file, then writes to the end of it, and happens to leave the file the same length as some other process was expecting.) It may, someday, be worth changing this rule to allow some sort of method of communicating how much of the file was deleted; this might be needed for reasonable performance in multiplayer, where recoveries happen every time one player completes a turn while the other player is halfway through theirs. * A process should aim to hold write locks for as short a time as possible. If a write lock is held for more than a second, something has gone wrong; a process that notices a long-held write lock on a save file should notify the user with a phrase such as "stuck process", "undergoing maintenance", or the like (depending on whether the process holding the lock is a NetHack 4 process or something else). In the case of a stuck process, it may attempt to free the lock using `SIGHUP`, `SIGTERM` or `SIGKILL` (in that order), but only upon request from the user. * A process may not even read from a save file, not even to check the status in the header, unless it has a read lock on it. (It is possible for multiple processes to have a read lock on the file at the same time; this does not cause problems.) * If a process wishes to receive notification about changes to a file (the usual case when watching a game, and also true while playing a game because it will need to know about moves made from another process), on an operating system where it is possible to query which processes have a file locked, it should maintain a read lock on the file. (It doesn't need to do any actual reading; the lock is simply to notify other processes that the file is interesting to them.) On operating systems where it is impossible to determine which process has a file locked, processes will need to use polling instead; they should read lock the file at the start of each second to check its length, and relinquish the lock before the end of the second. * In order to establish a write lock on a file, a process should start by attempting to grab the lock in a non-blocking way. If it fails, it should check to see which processes are blocking the lock: * If there are no processes blocking the placement of the lock, it should try again immediately; * If there is a conflicting write lock, then the process should block on the placement of the lock (and complain that something is wrong after 2 seconds); * If there are one or more conflicting read locks, then the process should request all existing readers to relinquish their lock. The way in which this is done is to send SIGRTMIN+1 to one reader, then block on the placement of the lock. To protect against a very unlikely race condition (the file becoming completly unlocked, then some new process placing a read lock before its write lock establishes), it should repeat this check every 500 milliseconds for 2 seconds. If it cannot place the lock within 2 seconds, it should complain that something is wrong. If it does manage to place the lock correctly, it should send SIGRTMIN+2 to the process or processes it sent SIGRTMIN+1 to before it does any actual writing. On systems where it is impossible to determine which process is holding a lock, this entire procedure is unavailable. Currently, the situation is handled by polling upon every keypress, instead. * Whenever a process holding a read lock receives SIGRTMIN+1, it should relinquish its read lock as soon as possible; then send SIGRTMIN+4 back to the original process. Then it should wait for a SIGRTMIN+2 signal, or for 1 second, before re-establishing the lock. * Whenever a process receives SIGRTMIN+4 (which only happens while it is trying to place a write lock), it should look for further processes holding a lock on the file, and send SIGRTMIN+1 to them (if any). This is similar to the initial system for establishing a lock. * SIGRTMIN+0 and SIGRTMIN+3 are also used, as synonyms for SIGRTMIN+1 and SIGRTMIN+2. Which signals are used depends on the relative values of the PIDs of the sender and recipient; the purpose of this mechanism is to provide asymmetry in order to resolve what would otherwise be deadlocks. See the comments in libnethack/src/files.c for a hugely detailed explanation of how all this works. These rules collectively ensure that processes will never operate on an inconsistent view of the file, and can (on operating systems with appropriate support) request notifications of changes to a file. Playing, watching or replaying a game ------------------------------------- Processes that are playing, watching or replaying a game all operate the same way. Unlike previous systems, the actual location of the file pointer is mostly immaterial; it's used like the cursor in curses, being moved to an appropriate location immediately before reading or writing, with its current position being irrelevant. Instead, the process remembers the location in the save file that corresponds to its currently loaded binary save, and the save itself (this can be done lazily, i.e. only loading it if required); and the location that corresponds to its currently loaded game structures (if any), together with the game structures themselves. The game structures may get ahead of the binary save, because binary saves cannot save halfway through a turn, or in the middle of a multi-turn action. The binary save must, however, always be the most recent binary save in the file before the game structure location (otherwise, diffs would be generated against the wrong save, making the save file impossible to load). A process also has a "target location" that represents the place in the file that it wants its current game structures to represent (for watching or playing, this is the end of the file; for replaying, it could well be much earlier in the file). A process should always know whether it's before or after its target location, but apart from that it can represent the target location in any units it likes (bytes relative to some anchor, actions, turns...). The target locations are currently stored implicitly, calculating them from the gamestate location and knowledge of the game mode. A process accepts user input only while its gamestate location is at its target location. If its target location is the end of the file, it also requests changes to the file (via read locks or polling). A process only allows changing of the gamestate (i.e. playing turns) while its gamestate location is at the end of the file. (Also, remember that if the file changes after the process accepts user input that changes the gamestate but before it can record it, it discards that input; this deals with the situation where two processes try to make different moves simultaneously.) If a process is not at its target location, it tries to seek its gamestate location to that location; first it tries to move its binary save location to the last possible point in the file before the target location, then it sets its gamestate location to the binary save location (unless it's already after the binary save location but before the target location), and seeks that forwards to the target location. If its binary save location is beyond its target location, it starts by rewinding to the start of the file (via moving the save file location to the first save backup in the file, starting on the third line of the file). If it's before its target location, then it attempts to move its binary save location forward in the file: * If the next save backup in the file exists and is before the target location, the binary save location is moved forward to that backup (the current binary save is marked as unloaded, and will be loaded from that save backup if it becomes relevant); * Otherwise, if the next save diff in the file exists and is before the target location, the current binary save is loaded (causing a desync if the binary save cannot be loaded), then the binary save location is moved forwards to that diff, and the binary save is changed via applying that diff; * Otherwise, the binary save location is in the right place; the game is loaded from the binary save (setting the gamestate location to the binary save location), then the game is replayed one step at a time via reading command, message, and user input lines. If a line appears in the wrong context (e.g. a message line shows the wrong message, a command line appears when no command is expected, or a user input line provides the wrong sort of input), then it's possible that the save was generated via a different version of the engine: * If watching or replaying a game, the gamestate is reloaded from the current binary save, and frozen until the next save diff. Attempts to reconstruct the gamestate between the diffs are done entirely from message and time lines (i.e. the only thing that changes is displayed messages, so the map will freeze until the next turn). * If playing the game, it causes a desync. The file is only changed when actually playing the game; any messages, input, etc. that occurs is recorded in the log file. Whenever it is safe to write a save diff, the game uses the following algorithm (to insure against mistakes in the binary save code): * The game is saved into a binary save in memory, diffing against the previous binary save; * The game is loaded from the new binary save; * The game is saved again, as a diff against the new binary save; * If the saves differ, it causes a desync. * If the saves don't differ, the diff (or possibly the entire binary save) is recorded in the log file. For more information, see the documentation in `doc/mainloop.txt`, which focuses on the same issues from the point of view of the API rather than the save system. Recovery -------- When a desync happens (whether because the binary save mechanism failed, or because the game engine was upgraded mid-turn), the user is notified, and given the option to end the process or to rewind the save. The game engine determines how far to rewind the save via using binary search to determine the last point at which the game can successfully be loaded, and tells the user how many turns would have to be rewound to recover the save. Alternatively, saves can be rewound via an administrator. This is most useful to undo the effects of game-breaking bugs, whether they make the game unwinnable, spam `impossible`s every turn, cause panics on any attempt to perform actions, or just kill the character outright. This is done via means of a command `#rewind` while watching a game. (The nature of NetHack 4.3 logfiles means that it is possible to recover a save even while someone is playing it, and everything should work correctly.) The `#rewind` command is only available if the *game* is an explore or wizard mode game, or the *user* using it is capable of entering wizard mode. Rewinding a file for recovery purposes is much the same as rewinding it because a process had crashed and left an incomplete line at the end of the file, except that it rewinds further. The process doing the rewinding picks a location to rewind to: for a manual rewind, this is the gamestate pointer of the game doing the rewind, and for recovery from a desync, this is the last save diff that can be successfully loaded. (Automatic desync recovery always recovers to a diff because this gives a player the best possible chance at avoiding doing whatever it was they did to cause the desync.) The file is truncated just after the final newline of the line that is being rewound to; and the recovery count in the header is increased. The recovering process should also fix the location in the first save backup line; it isn't incorrect to omit this step, but it helps efficiency. Binary saves ------------ The above log format, used for saving, is mostly based around binary saves, which contain the entire gamestate of the game. In order to ensure the correctness of the save system, all global and static variables are divided into the following categories: * Read-only variables; this contains things like tables of data. These must be marked as `const`, and are either hardcoded into the file, or generated by `makedefs`. Because their value never changes during the course of a game or from game to game, they do not need to be saved. * The `flags` structure contains persistent state relating to the game as a whole; most of this is game options (which are saved with the game, and include birth options as a special case). The values of the fields of `flags` are turned back into option values by the options code when it's necessary to report what a game's option values are. This structure also contains any state that describes the previous command (such as is necessary to track whether phase-of-moon updates have happened since the previous command, and as is necessary for the `repeat` command (default binding `Ctrl-A`) to work). * State that tracks the current progress of commands being executed, in the global variable `turnstate`. This contains information about, say, whether the pathfinding is doing a travel or an autoexplore, or whether a prayer is responsive. This also stores helplessness; while the player is helpless, we treat the command as not necessarily being complete yet (even if the action the player performed is), and thus that a binary save should not be produced. When a command is complete (or when a new game is started or a binary save is loaded), all the fields are reset. * Information about an entire dungeon, a single dungeon level, or a single player within the dungeon. (In previous versions of NetHack, dungeons were conflated with players within the dungeon, because only one player could be handled at one time. The NetHack 4.3 series introduced a distinction in preparation for the implementation of multiplayer.) This is responsible for the entire binary save. (Birth options are saved as dungeon state; regular options as player state.) These are currently stored in global variables of their own, the main ones being `u` and `levels` (but with plenty of smaller ones, such as `invent` and `migrating_mons`). Eventually, we hope to merge all these into just a few globals (`u`, `levels`, and a new one for dungeon-wide information). * State involved with managing the save system (current log file descriptor, target location, information about whether a state is being loaded, etc.). This is stored in `program_state`, is never saved in the binary save, and is meaningful only while the game is being played, watched or replayed (not once it's exited), apart from the fields whose purpose it is to track whether the game is being played. It is initialized from scratch in `nh_play_game`, and becomes meaningless when `nh_play_game` returns. Whenever the game engine runs, all this data (apart from target location) must have a value that depends only on the gamestate location, thus allowing all this information to be reconstructed from the log file. (If a process crashes, its target location is lost, but this is no great loss; the only effect is that if a replaying process crashes, the user has to navigate back to the same point in the replay by hand. While watching or playing, the target location is always at the end of the file.) * State involved with communicating with the interface (for instance, the list of extra option values to reply with to the UI option parsing code in an older version of the options code). This is never saved, and might not even be reset between games; instead, it is valid between `nh_lib_init` and `nh_lib_exit`. None of this information may ever be allowed to affect gameplay in any way. Currently, this is just `windowprocs` (which a special case, because it's part of the API), which is a variable of its own, because we removed all the other uses of those sorts of variable; we will invent a new namespace for this if it ever become necessary again. * Global and static variables we haven't got around to removing yet. There are so many of them :-( Save diff format ---------------- Binary saves can be saved as-is (in a save backup line); however, more commonly, they're saved as a diff from the previous backup save. This section describes the diff format. There are actual multiple diff formats in use, to be able to read old save files: ### 4.3-beta1 In 4.3-beta1, a diff consisted of a number of 16-bit commands, each of which could be followed by bytes of data. The commands were little-endian numbers, with the top two bits specifying the command, and the other 14 specifying a count. A diff is conceptually a program that writes an output save based on an input save; the output save is append-only, the input save is random access (and has a file position pointer, which starts at the start of the save). The file position pointer can go beyond the end of the input save, in which case it can't be read from, but otherwise its position is legal and remembered. Here are the commands that were in use: * `xxxxxxxx 10xxxxxx`: An "edit" command. The bottom 14 bits specify (as an unsigned integer) the number of bytes of data the command has. Those bytes are appended verbatim to the output save, and the file position in the input save is moved forwards the same distance. * `yyyyyyyy 11yyyyyy`: A "copy" command. The bottom 14 bits specify the number of bytes to read from the input save (starting at the file position pointer) and append to the output save. Then the input file pointer is moved forwards that number of bytes (placing it just beyond the last byte copied). * `zzzzzzzz 00zzzzzz`: A "seek" command. The file position pointer is moved forwards by the given number of bytes (as a 14-bit signed integer). Moving forwards a negative number of bytes moves backwards minus that many bytes. This command has no data. This diff format worked quite well in practice, but there was scope for improvement. The vast majority of most diffs consisted of medium-sized copies alternating with small edits, with the occasional very large copy. Large edits and seeks were both possible, but both rare. Additionally, when level files got larger (e.g. during farming), sometimes 14 bits was not enough to represent the size of a copy. ### 4.3-beta2 4.3-beta2 introduced a new diff format that requires less space to express these common structures. The same commands were used, but expressed differently. (The endianness is also different; commands are big-endian so that they can be variable-length.): * `0xxyyyyy yyyyyyyy`: A "copy then edit" command. The bottom 13 bits specify the number of bytes to copy (which can be 0, in the case of small edits which follow a seek or at the start of the file, but only if at least 2 bytes are edited). Adding 1 to the higher 2 bits specifies the number of bytes to edit (i.e. `0010000 00000001` copies 1 byte, then edits 2). * `100xxxxy yyyyyyyy yyyyyyyy yyyyyyyy`: A "large copy then edit" command. Works just like before, only there's more bits to specify the size of the copy, more bits to specify the size of the edit, and we don't add 1 to the size of the edit (meaning that we can encode a pure copy; this is never necessary, as you could replace the last byte by an edit, but it can be simpler to generate). * `101xxxxx xxxxxxxx xxxxxxxx xxxxxxxx`: A "large edit" command. (The first few bytes of the edit can be placed in a preceding copy-then-edit command, which requires at least one byte of edit to be presented.) This works just like the edit command from 4.3-beta1, except that it has more bits. * `111zzzzz zzzzzzzz`: A "seek" command. * `110zzzzz zzzzzzzz zzzzzzzz zzzzzzzz`: A "large seek" command. Like a normal seek, but larger. Unlike the preceding format (which has no header), this format has a two-byte header: `00000001 01000000`. (This prevents any confusion with the previous format, meaning that both formats can be mixed in a file, e.g. when porting old saves from 4.3-beta1.) The sequence `00000000 00000000` (copy 0, edit 1) is specifically disallowed, so that it can be used as an EOF marker. The runtime engine substitutes `10100000 00000000 00000000 00000001` instead. ### 4.3-beta3 4.3-beta3 makes further improvements to the save diff format, after making further observations about the sort of data that we're diffing. The key is to use different encodings for different parts of the file. We have a large number of commands, some quite specialized. The encodings of the commands change dynamically depending on how recently they've been used: 00 most recently used 01 2nd most recently used 1000 3rd most recently used 1001 4th most recently used 1010 5th most recently used 1011 6th most recently used 110000 7th most recently used 110001 8th most recently used 110010 9th most recently used The pattern continues; the general rule is that the first half of the encoding ends with a `0` and is otherwise made of `1`s; and the second half of the encoding can have arbitrary values. Each command takes one or more arguments. (Data used for an edit isn't counted as an argument to the command; the argument to an edit is the size of the edit.) All but one of these arguments has a fixed number of bits; the remaining argument has a minimum number of bits, but will be given extra bits if need be in order to bring the command up to a whole number of bytes. The number of bits for each command are listed in `decl.h`. For example, `mdiff_copyedit` is listed as having "12+ bits copy size; 2 bits edit size (1-5)". It thus could be encoded as: * `00xxyyyy yyyyyyyy` if it were the most recently used command (note that the arguments are encoded in reverse order, with the flexible one coming last); * `1000xxyy yyyyyyyy yyyyyyyyy` if it were the third most recently used command (the copy size has stretched up to 18 bits here, because 10 bits wouldn't fit the requirement to be 12+, and 18 bits is the next-largest number that makes the command a whole number of bytes long). In some cases, the argument we'd want to use to a command won't fit in the command's default size. In such a case, we precede it with the encoding for the `mdiff_wider` "command", which causes one extra byte to be added to the flexible part of the command over what would normally be necessary. For example, suppose `mdiff_copyedit` is the most recently used command, and `mdiff_wider` is the third most recently used command: * `00xxyyyy yyyyyyyy` is the simplest case of a copyedit command, with 12 bits of copy size and 2 bits of edit size. * `100001xx yyyyyyyy yyyyyyyy yyyyyyyy` is the next-simplest case of a copyedit command. The encoding of `mdiff_wider` is `1000`. We then give the `mdiff_copyedit` command immediately (without rounding up `mdiff_wider` to a whole number of bytes); it's now the second most recently used command, so its encoding becomes `01`. We now need 2 bits of edit size (which doesn't change, because it's a fixed-size argument), and at least 20 bits of copy size (`mdiff_copyedit` normally takes at least 12, but `mdiff_wider` adds on another 8 bits); this is rounded up to a whole number of bytes for the whole command, which happens in this case to be 24. * `10000001 xxyyyyyy yyyyyyyy yyyyyyyy yyyyyyyy` is what we get if we widen once more. The encoding of `mdiff_wider` is `1000`, as before. We then give the command again to make it even wider; the encoding is now `00`, because it's the most recently used command. Then we give `mdiff_copyedit`, the second most recently used command, encoded as `01`. We then need to encode our 2 bits of edit size (as always), and at least 28 bits of copy size (12 + 8 + 8), which here rounds up to 30 in order to make the command as a whole a whole number of bytes. This scheme thus allows arbitrary commands to be given arbitrarily large flexible arguments. Because `mdiff_seek`, `mdiff_copy` and `mdiff_edit` are all provided as commands, arbitrary diffs can thus be encoded. (A diff ends with an `mdiff_eof` command, and starts with a header `00000001 01000011`; this makes it possible to find the end of the diff, and recognise the format that the diff is in.) Diffs are created by starting with the same algorithm as in 4.3-beta2; this produces a list of copyedit and seek commands (probably with one copy command at the end). Seeks are just encoded using a (potentially widened) `mdiff_seek`. For copyedits, we try each possible command that could encode it to see which leads to the shortest encoding (the save system itself can also give hints). As an example, in one of the games I use to test the save system, I have the following copyedit instruction (which happened to decrement the byte that was there before): copy 125739 edit 1 Assuming the commands start in their default order, this could be encoded via any of the following methods: * `mdiff_wider` + `mdiff_wider` + `mdiff_copy` 125739; `mdiff_edit` 1; 1 byte data wider size = 125739 size = 1 -- ---------------------- ---- 11000000 10000001 11101011 00101011 10010000 xxxxxxxx ------ ---- ---- -------- wider copy edit data * `mdiff_wider` + `mdiff_copyedit` 125739, 1; 1 byte data copyedit size = 125739 -- ----------------------- 11000001 00000001 11101011 00101011 xxxxxxxx ------ -- -------- wider edit size = 1 data Unsurprisingly, a copyedit instruction is shorter than separate copy and edit instructions on short data. * `mdiff_copy` 23; `mdiff_wider` + `mdiff_coord` 1479, north size = 23 coord size = 1479 ------ ------- ------------ 01010111 11011100 00010101 11000111 -- ---- --- copy wider dir = north The main purpose of `mdiff_coord` is to abbreviate the case where the only thing changing with monsters is their x and y coordinates, but it can also be used in other cases where individual elements are just being incremented and decremented. Because `mdiff_coord` is intended for monsters, its size is given in units of 85 bytes, and measured from the end of the last `mdiff_copy` command or start of the last `mdiff_coord` command. This means that we need to specify the size using `mdiff_copy` and `mdiff_coord` as 23 + (1479 * 85) or 125738; the next byte is interpreted as an x coordinate and copied untouched; the byte after (the one we want to edit) is interpreted as a y coordinate and decremented while being copied (thus editing it in the way we want). In the actual file, the byte in question was in fact the y coordinate of a monster. The next command was `copy 168 edit 2`, with the edits being a decrement and an increment in that order. This can now be encoded as "`mdiff_coord` 1, south-west", which is a single byte: dir = south-west --- 00111001 -- --- coord size = 1 (Why does a size of 1 translate to 168 bytes copied? Because the first position at a multiple of 85 from the start of the previous `mdiff_coord` is a copy of 83, which would be a size of 0, so a size of 1 must mean 83 + 85 = 168.) The use of specific commands to handle cases like monster movement (that are responsible for the bulk of the save file size in 4.3-beta2) means that 4.3-beta3 can encode these problematic cases much more succintly.