Tags: nethack vanilla internals portability terminal libuncursed | Fri Jan 9 17:34:31 UTC 2015 | Written by Alex Smith
Recently, the NetHack 3 series devteam have asked about how to bring Unicode to NetHack. Likewise, NetHack 4's Unicode support is somewhat lacking; it handles Unicode on output but not on input.
This blog post looks at the current situation, and at the various possibilities for resolving it.
One of the most common places to see Unicode output is when rendering the map. Most NetHack variants now do this, because it's one of the more portable ways of displaying the line-drawing characters which are commonly used for walls.
The simplest approach is to treat Unicode as an alternative to IBMgraphics or DECgraphics. This is what two of the most popular variants (UnNetHack and nethack.alt.org) do. The basic idea is to store everything in an 8-bit character set such as IBMgraphics (code page 437) internally, then convert to Unicode just before output.
The big advantage of this method is that it's only a very minimal change to the game core. (In fact, there is probably no need to change the game core at all when doing this. When using a rendering library such as libuncursed, the library will do the Unicode translation if necessary for the terminal, and the game core can think entirely in terms of code page 437 or the like.) The big disadvantage is that it's not very customizable; there are more than 256 possible renderings that might need to be drawn on the map ("glyphs" in the terminology of NetHack 3.4.3, or "tilekeys" in the terminology of NetHack 4), so the game core needs to either think in more than 8 bits (in which case it may as well use Unicode directly), or else artificially restrict the set of possible renderings.
NetHack 4 uses a somewhat different method. The game core mostly doesn't deal with what map elements look like at all (actually, it is aware of a default ASCII rendering for each tilekey, but the only intended use of this is to draw the map in the dumplog when a character dies). The game core communicates with the interface in terms of "API keys" (which have a 1 to 1 correspondence with tilekeys, but are spelled differently for historical NitroHack-related reasons). The interface translates the API keys to tilekeys, and then looks up the appropriate renderings in the current tileset. When playing tiles, the tileset will contain tile images; when playing on a terminal (or fake terminal), the tileset will contain Unicode characters together with information on how to draw them (color, background, underlining, etc.).
Here's an excerpt from one of the NetHack 4 tileset files:
```
iron bars: cyan '≡'
fountain: blue '⌠'
the floor of a room: bgblack gray regular '·'
sub unlit the floor of a room: bgblack darkgray regular '·'
```
The information for the floor of a room specifies all the information
that might be necessary, because it's drawn on the bottommost map
layer: it has a black background, no underlining, is gray (or darkgray
if unlit), and is drawn with a · character. Iron bars are not on
the bottommost map layer, so they preserve background and underlining,
but override the color to cyan and the character to ≡.
I'm generally pretty pleased with the NetHack 4 approach here; it's
enabled things like per-branch colors for walls, implemented entirely
in the tileset without needing to get the game core involved at all
(beyond telling the interface which branch is being rendered). The
drawback is that it's quite complex (using a separate binary,
tilecompile, to generate the tilesets).
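As a rough illustration (the struct and names here are hypothetical sketches of mine, not NetHack 4's actual data structures), a terminal tileset entry of the sort shown in the excerpt above boils down to a character plus rendering attributes, with "keep" values for entries like iron bars that preserve attributes of the layer below:

```c
#include <wchar.h>

/* Hypothetical sketch of a per-tilekey record for a terminal tileset.
   Layered entries such as iron bars use TC_KEEP / -1 to preserve the
   background and underlining of the layer underneath. */
enum tile_color { TC_KEEP = -1, TC_GRAY, TC_DARKGRAY, TC_CYAN, TC_BLUE };

struct tile_render {
    wchar_t ch;             /* character to draw, e.g. L'≡' for iron bars */
    enum tile_color fg;     /* foreground color, or TC_KEEP */
    enum tile_color bg;     /* background color, or TC_KEEP */
    int underline;          /* 1 = on, 0 = off, -1 = keep */
};
```

An iron-bars entry would then set the character and foreground but keep everything else, while a floor entry would fill in every field.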
There's one other problem, too, and this is the ; (farlook) and /
(whatis) commands. Understanding the problem is easier with a quick
history lesson.
NetHack's predecessor Hack had a map screen that looks very similar to
NetHack's. However, there were far fewer objects in the
game: in fact, it was possible to assign a different ASCII character
to each of what NetHack 4 now calls a tilekey. So, for example, if
you wanted to know what the $ in the starting room
represented, you'd press / ("tell what this symbol represents"
according to Hack's documentation), and then $.
The output looks like this:
```
d       a dog
$       a pile, pot or chest of gold
```
There is no
; command in Hack; none was necessary, because the same
letter always represented the same thing. This came with many
gameplay limitations, though. For example, it is impossible to
determine whether a dog in Hack is tame or not, and all potions look
the same unless they're in your inventory or you're standing on them.
NetHack added many more features to the game: in particular, many more
monsters than exist in Hack. It typically distinguishes between the
monsters using colour, something which is most obvious where dragons
are concerned (a red D is a baby or adult red dragon, a green D
a baby or adult green dragon, and so on).
The / command in NetHack still supports the Hack method of specifying an object by typing its symbol:
```
Specify unknown object by cursor? [ynq] (q) n
Specify what? (type the word) D
D       a dragon--More--
More info? [yn] (n) n
```
However, as we can see, it's got a lot more complicated. The main
reason for this is that telling the
/ command that we see a D is
insufficient to get full information about it. Telling the game that the
D is red would be one way to get more information, but even
then, this would run into the problem with tame monsters that Hack
had. NetHack thus allows the object to be specified by pointing it
out with the cursor:
```
Specify unknown object by cursor? [ynq] (q) y
Please move the cursor to an unknown object.--More--
(For instructions type a ?)
D       a dragon (tame red dragon called example)--More--
More info? [yn] (n) n
```
We now have all the information about the red
D, rather than just
information on what a
D represents generally. For compatibility with
Hack, though, we're told that
D is a dragon before the game tells us
about this specific dragon, something that is IMO just confusing and
should probably be removed.
There are a lot of extraneous prompts here, so the
; command was
introduced to do the same thing as the
/ command, but making the most
common choice for each question automatically:
```
Pick an object.
D       a dragon (tame red dragon called example)
```
The ; command is very common in normal NetHack play, but the /
command is almost unused nowadays. (Its most common use is to chain
multiple farlooks via accepting with
,, something that does not
work with the
; command due to arbitrary restrictions. I've removed
these restrictions for NetHack 4.)
NetHack 4 also adds two other methods of farlooking: moving the cursor
over an object during a location prompt (either that of ;, or in any
other command); and moving the mouse pointer over an object, when using
a high terminal (so that there's space beneath the map to say what the
mouse is hovering over).
Anyway, the big offender here is the very first character on the farlook output. Here's the output from vanilla NetHack with DECgraphics:
```
Pick an object.
└       a wall
```
Oops. Our non-ASCII characters have leaked into
pline, which is
part of the game core.
This problem can be seen as a pretty minor one, because IMO the output
of ; and / is dubious anyway. Here's what NetHack 4 does
in the same situations:
```
Pick an object.
A red dragon on the floor of a room.

Pick an object.
A wall.
```
NetHack 4 tilesets can render the floor beneath the dragon, and farlook
gives the same information, so that ASCII players do not have a
disadvantage compared to tiles players as a result of layered memory.
The major change, though, is that the "this is a
D, which means
dragon" bit of the output has been removed entirely, because the game
core doesn't know what a red dragon looks like; the tileset might be
rendering it as a red underlined
D (the default), but the tileset
could also render it as any other character, or as a tiles image, or anything else entirely.
So my conclusion here is: the game core shouldn't be using Unicode for
the map, because it shouldn't be using any character set for the map.
Let the interface sort that out. This means you have to change the
/ and ; commands, but they were in need of a change as it is, and
there's no way to make them work with tiles anyway. (Besides, hardly
anyone knows how to type a └ to give it as an argument to /.)
Apart from the map, the other place where Unicode might potentially be useful is in strings inside the game: character names, fruit names, monster names, and perhaps messages printed by the game (currently NetHack is only officially in English, but the occasional non-ASCII character crops up even in English, e.g. "naïve"; interestingly, these spellings are dying out in favour of non-accented ones, perhaps due to the use of computers).
This is not currently a problem that most variants handle; for example, NetHack 4 currently uses the same text input routines as NitroHack, which disallow non-ASCII characters.
In order to implement this, there are three separate problems. One is
reading input from the user, but this is not really the concern of the
game core; reading Unicode needs to be done differently for each
windowport anyway. Another is storing the strings in memory; this is
the problem that nhmall's rgrn post is talking about. The remaining
problem is processing such strings in situations such as engraving
and the wish parser.
When storing the strings in memory, there are three real options, which also affect how easy it is to rewrite string handling functions:
A 32-bit type, such as int32_t or char32_t.
This gives us the "UTF-32" encoding of Unicode, which stores each
codepoint in one 32-bit unit. (Unicode codepoints go from 0 to
1114111 inclusive, meaning that 16 bits is not enough to store the
whole of Unicode; 32-bit types are the next-largest that are
commonly available.)
The main advantage of this is that it maximises the amount of code
that we'd expect to continue to work, given that one codepoint in
Unicode acts quite similarly to one char in ASCII. However,
there are various caveats:
There are multiple different ways to express a 32-bit type in
C: long has been around forever, but is often more than 32
bits (which might or might not be a problem depending on the
code); int32_t is C99, and might not be supported
by some compilers that are particularly slow at updating to
modern standards (after all, C99 was only released 15 years
ago); char32_t is C11, and has the advantage that it's
possible to write a char32_t * string literal:

```
const char32_t *string = U"→ this is Unicode ←";
```
It's unclear which of these types would be the best
representation. (It would be possible for compile-time
configuration to choose between them, but
you lose most of the benefit of char32_t if you make it
configurable, because then you can't use string literals.)
When using ASCII, you can normally just disallow control characters; placing a newline or escape or the like in an object name isn't something that it's reasonable to support. Thus, you can safely assume that all characters in ASCII are one em wide on the screen (at least when using a fixed-width font; most NetHack windowports do). Unicode has legitimate uses for zero-width characters (combining characters, direction overrides, and the like). As such, either you'd sacrifice the ability to render these characters correctly, or else you'd need a more complex function for counting string width (losing most of the benefits of UTF-32 in the first place).
Lookup tables would stop working (a 256-entry lookup table is
sensible; a 1114112-entry lookup table less so). In
particular, this means that strstri would need a complete
rewrite. (That said, the only use of strstri, as far as I
know, is for counting Elbereths, and Elbereth is pure ASCII.)
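To make the zero-width caveat concrete, here's a minimal sketch in C (the function name is mine, and the combining-character check is deliberately incomplete, covering only the main combining-diacritics block) of what a UTF-32 width counter has to look like once one-codepoint-one-cell counting stops being correct:

```c
#include <stddef.h>
#include <stdint.h>

/* Count the display cells a UTF-32 string occupies on a fixed-width
   terminal. The naive answer is "one per codepoint"; this sketch only
   special-cases U+0300..U+036F (combining diacritical marks) as
   zero-width, which is far from complete but shows why a strlen-style
   count is no longer the right tool. */
static size_t utf32_width(const uint32_t *s)
{
    size_t width = 0;
    for (; *s; s++) {
        if (*s >= 0x0300 && *s <= 0x036F)
            continue;           /* combining mark: takes no cell */
        width++;
    }
    return width;
}
```

A real implementation would need the full Unicode character-property tables (as the system's wcwidth does), which is exactly the "more complex function" cost described above.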
wchar_t. This is a system-dependent type designed for holding
Unicode characters. In practice, it is 32 bits wide on Linux and
16 bits wide on Windows; I haven't tested other operating systems.
The huge advantage of wchar_t is that it's been in the C
standard longer than any other form of Unicode support, and is
very widely supported by now. For example, almost every compiler
accepts the following form of string literal, not just C11 compilers:

```
const wchar_t *string = L"→ this is Unicode ←";
```
wchar_t has the best library support of any of the
options being considered here: there are printf-style
functions, length functions, substring search functions, and the
like. (wchar_t is the option I chose in libuncursed,
incidentally, mostly for compatibility with curses, but it's not
an awful choice in its own right.) This means that if you only
wanted to get a program working on Linux, wchar_t would be
outright superior to a plain 32-bit type.
The Windows API also requires that wchar_t is used for all
Unicode input and output. It handles characters outside the
16-bit range by representing them as UTF-16, for backwards
compatibility.
Being a system-specific type, a wchar_t cannot be placed
into a save file directly if you want that save file to be
portable between platforms. This is not a huge problem:
struct padding also differs between platforms, so as you have
to pack and unpack the structures manually anyway, you can
convert wchar_ts to something else upon save.
On Windows, a
wchar_t is not large enough to hold all
Unicode characters: it misses out on the "astral plane"
characters above codepoint 65535. In the case of libuncursed,
I didn't worry about this too much because the purpose of
libuncursed is to produce lowest-common-denominator terminal
output, and astral plane characters don't render correctly on
many terminals anyway. The NetHack 3 series has more of a
tradition of being able to configure the game to take
advantage of unusual features that your terminal has, so it's
more of a problem there.
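The save-file caveat is easy enough to work around. As a sketch (the helper name is hypothetical, not actual NetHack code), a wchar_t string can be widened to fixed 32-bit units on save, so a save written where wchar_t is 16 bits (Windows) reads back where it's 32 bits (Linux); note this assumes no astral-plane characters, which on Windows would arrive as UTF-16 surrogate pairs needing an extra decoding step:

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Widen a wchar_t string into fixed-size 32-bit units for a portable
   save file. Returns the number of units written, excluding the
   terminator. Surrogate-pair decoding for 16-bit wchar_t platforms
   is deliberately omitted from this sketch. */
static size_t save_wstr(const wchar_t *s, uint32_t *out, size_t outlen)
{
    size_t i = 0;
    for (; *s && i + 1 < outlen; s++, i++)
        out[i] = (uint32_t)*s;  /* same value, fixed width */
    out[i] = 0;
    return i;
}
```

Loading is the mirror image: read the 32-bit units and narrow them back to wchar_t (with a range check on 16-bit platforms).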
char (or unsigned char), encoded as UTF-8. This is a
multibyte encoding which represents ASCII as ASCII, and other
Unicode characters as sequences of non-ASCII bytes.
One obvious advantage here is that NetHack uses ASCII
string literals anyway, meaning that this would reduce the amount
of code that needed to be touched as far as possible: a UTF-8
string literal in C11 is written
u8"→ this is Unicode ←", but if
there are no non-ASCII characters (and there usually aren't), you
can omit the
u8 and get portability to old compilers too.
UTF-8 is a multibyte encoding in which different characters
have different widths, so a specialized string width function
is absolutely required in cases like engraving (which cares a
lot about the width of a string). It doesn't make sense for
it to take twice as long to engrave éééé as it does to
engrave eeee.
Because UTF-8 is equivalent to ASCII in most simple cases, but not once non-ASCII characters start being used, the Unicode code would get a lot less testing than in the other cases here: the ASCII case (common) is different from the non-ASCII case (rare). This means that using the wrong string width function, or the like, might not be spotted for several months.
The handling of UTF-8 in the standard library is mostly based
on the locale conversion functions, which are
user-configurable. This works fine if the user has configured
them to use UTF-8, but not if they're using some legacy
encoding. Trying to get functions like wcstombs to behave
is often harder than just hand-rolling them yourself, meaning
that UTF-8 basically has no viable library support.
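The engraving-width problem at least has a short fix. As a sketch (the function name is mine), codepoints in a UTF-8 string can be counted by skipping continuation bytes, so éééé costs four characters rather than eight bytes; like the UTF-32 case, this still overcounts combining characters:

```c
#include <stddef.h>

/* Count codepoints in a UTF-8 string by counting only the bytes that
   are NOT continuation bytes. Continuation bytes have the bit pattern
   10xxxxxx, i.e. (byte & 0xC0) == 0x80. */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

The subtlety is exactly the testing problem described above: on pure ASCII this returns the same answer as strlen, so accidentally calling the wrong function would go unnoticed until someone engraves a non-ASCII string.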
As usual, there does not seem to be any obvious choice here.
UTF-32, wchar_t, and char in UTF-8 all seem like
somewhat viable options.
There are also some things to watch out for regardless of the encoding used. For example, the character name is often used as part of the filename, and Unicode in filenames is a pretty nonportable topic in its own right.
wchar_t is what the leaked NetHack code uses, incidentally. This
looks like it's the best option unless the problem with the astral
planes on Windows is a dealbreaker, but it may well be. (Having to
use UTF-16 with
wchar_t is something of a disaster; it gives you
pretty much all the drawbacks listed above at once.) Before writing
this, I was in favour of UTF-8, but now I'm more dubious about it; I
think I'd prefer a widespread change in which if something breaks, it
breaks obviously, than I would a change in which everything appears to
work and then breaks much later.
I'm not planning to implement Unicode input in NetHack 4 within the
next couple of months, because there are other more urgent things to
do first. However, I'll likely implement it eventually, and most
likely I'll use
wchar_t when I do so (it's possible that I'll use a
32-bit type, though that would mean backwards-incompatible changes to
libuncursed). For the NetHack 3 series, a compile-time choice between
int32_t and char32_t would fit in most with the typical
philosophy of that codebase (perhaps alongside a macro that either
expands to a U prefix on a string literal, or to a call to an
ASCII-to-UTF-32 conversion function), but the other choices don't seem
that bad either.
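That compile-time choice could look something like the following sketch (the macro, type, and configuration-flag names are all hypothetical): with C11 available, a U"" literal is used directly; otherwise a runtime ASCII-to-UTF-32 conversion stands in:

```c
#include <stddef.h>
#include <stdint.h>

#ifdef HAVE_CHAR32_T            /* hypothetical configure-time flag */
# include <uchar.h>
typedef char32_t nhchar32;
/* Adjacent-literal concatenation gives the plain literal the U prefix. */
# define NH_U32STR(lit) (U"" lit)
#else
typedef uint32_t nhchar32;
/* Fallback: widen a pure-ASCII literal at runtime. Uses a static
   buffer, so the result is only valid until the next call. */
static const nhchar32 *nh_ascii_to_u32(const char *in)
{
    static nhchar32 buf[256];
    size_t i;
    for (i = 0; in[i] && i < 255; i++)
        buf[i] = (nhchar32)(unsigned char)in[i];
    buf[i] = 0;
    return buf;
}
# define NH_U32STR(lit) nh_ascii_to_u32(lit)
#endif
```

The usual cost of this pattern applies: the fallback path is a function call rather than a constant expression, so such strings can't be used in static initializers on old compilers.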
If you have any comments or suggestions, let me know, either directly via email, or by posting comments on news aggregators that link to this blog post; I'll be looking and responding there, and summarizing the bulk of the sentiment I hear about for the DevTeam. Perhaps there's some important point that everyone's missing that will make the choice obvious.