Tags: development linux debugging | Thu Aug 27 16:51:18 UTC 2015 | Written by Alex Smith
One fairly common problem that comes up when writing a large program such as NetHack is the dreaded infinite loop. Unlike other kinds of error, which normally just cause a crash (possibly with an error message), an infinite loop (in which the program does the same thing over and over indefinitely) leaves you with a problem: the program is still running and still doing something, and it can sometimes be very hard indeed to stop it. I know that on occasion, I've lost half an hour or more of the time I had to work on my programming hobby to an accidentally introduced bug that consumed all of memory and caused runaway swapping.
Infinite loops are also the bane of anyone who runs a public NetHack server. paxed, the administrator of nethack.alt.org (almost certainly the most popular server for NetHack in general, although it doesn't support NetHack 4), puts a huge amount of effort into getting rid of even unlikely or potential infinite loops when they're discovered or suspected, because once one starts, it's going to make the server laggy for every other player. I know that I've been personally hit by this on nethack4.org a few times, too, needing to fix these bugs on an emergency basis to prevent the server becoming unusable for players while I'm asleep and thus unable to exit the processes myself.
Normally, I develop NetHack 4 on Linux (although I do some amount of NetHack 4 development on Windows in order to ensure that it remains compatible for all the Windows players out there). One nice advantage of Linux for this is that it has a huge number of options for getting rid of your runaway processes (a process is, in most of the cases relevant here, approximately equivalent to a single invocation of a program), as a sliding scale from the contained but ineffective approaches at one end, to the reliable approaches that cause a lot of collateral damage at the other. One disadvantage, though, is that because there are so many, it can be hard to remember what they are – and when a runaway program is eating up your CPU, sometimes you need to react fast to stop it taking down the system. Thus, I decided to write this guide in the hope that its readers (and me, later on) will remember some of the techniques next time they need it.
I tend to explain everything as I go and give a lot of background, so this is pretty long. There's a cheatsheet at the end for people who just want a summary, or something to print out.
One of the tradeoffs involved in development is to do with the power available to the person sitting at the keyboard, versus the security of the system against an unwanted person sitting at the keyboard. In particular, you have to answer questions like "do I want people to be potentially able to kill the process that's busy keeping the computer locked, and thus get at my locked computer when I'm not in the room?"
Typically speaking, there isn't a huge risk of someone pulling off something like this because trying to accurately aim at a single process with the limited set of functionality available on a locked computer is almost impossible, and the less discriminate process-killing methods tend to get rid of anything that might be of use to an attacker too (sure, they could lose you any unsaved work you have, but they could also do that simply by turning the computer off, so you're not losing anything there). That said, although low, the risk is not nonexistent, which means that some of the more useful process kills for development are disabled by default on many Linux distributions nowadays.
There are a couple of "procfiles" (files that control kernel settings) relevant to this. The main one is /proc/sys/kernel/sysrq, which controls the magic SysRq key and is documented in man 5 proc. Given that the file is overwritten every boot, if you want a custom value for it, you'll want to change the relevant part of the boot process; many distributions use a program sysctl to handle setting the values of procfiles during boot, in which case you'll want to look in /etc/sysctl.d for the relevant file. (On my Ubuntu system, the file I needed to change was /etc/sysctl.d/10-magic-sysrq.conf. I went for the value 246, which is pretty permissive, but disallows things like dumping memory directly from the kernel; the main potential security hole here is an attacker managing to hit a screen lock process with one of the process-killing key combinations.)
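Assuming a distribution that reads /etc/sysctl.d at boot, the whole file can be as small as this (the filename and the value 246 are the ones from my system; kernel.sysrq is the sysctl name corresponding to the procfile):

```
# /etc/sysctl.d/10-magic-sysrq.conf
# 246 enables most SysRq functions, but disallows the ones that can
# dump memory directly from the kernel (and console loglevel control).
kernel.sysrq = 246
```

After editing, sysctl --system (or a reboot) applies it; cat /proc/sys/kernel/sysrq should then show 246.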
There's also another setting you have to look out for on Ubuntu specifically. One of the changes that Ubuntu makes in order to make life harder for attackers is to prevent processes tampering with or looking at each other's memory or control flow directly if they don't have some sort of pre-existing relationship. The control for this is /etc/sysctl.d/10-ptrace.conf, which explains the possible settings pretty well. In general, if you're doing a lot of development where you don't habitually run programs under a debugger but suddenly need to debug them retrospectively (this describes me quite well!), you'll probably want to turn this Ubuntu setting off; it might block a malicious program from looking at your passwords in memory, but it won't stop it installing a keylogger, so it mostly only helps past the point where you're already in serious trouble.
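For reference, a minimal sketch of that file with the restriction turned off (0 is the classic permissive setting; check the comments in your distribution's copy before changing it):

```
# /etc/sysctl.d/10-ptrace.conf
# 0 = classic behaviour: processes running as the same user may ptrace
#     each other, so "gdb -p <pid>" works on an already-running game.
# 1 = Ubuntu's default: only a direct ancestor may attach.
kernel.yama.ptrace_scope = 0
```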
So, your program's gone into an apparent infinite loop. There are six real possibilities:
The program is waiting, rather than actively running. Perhaps it's waiting for input, but either it doesn't visibly respond to it or the input isn't being delivered for some reason.
The program isn't in an infinite loop, it just looks like it due to a bad choice of algorithm. This can happen when you use a quadratic algorithm where you should be using a linear (or O(n log n)) one; it's more likely to fool you when it's a cubic algorithm that should be quadratic; it's effectively a tight infinite loop when it's an exponential algorithm that should be quadratic; and it's a pain to debug when the problem is that you have an off-by-one in a loop test and now the code's going to have to loop over every single representable integer before the loop can end.
I actually did this last case once; I forget whether it was in NetHack 4 or AceHack, but for some weird corner-case reason (the conditions to reproduce it being rather specific but not influenced by the player's actions nor by random numbers) it only happened in one of the levels in Vlad's Tower, and only once per game (but once every game). Given a 32-bit int and the complexity of the code it was running, it took about a minute, which is long enough that it looked like an infinite loop, but short enough that it went away while I was busy getting the debugger set up. Thus it led to quite a bit of confusion before I worked out what was going on.
In most cases, though, algorithmic mistakes like this one won't be so finely balanced. Either it takes a short enough time that you can wait for it and treat it like normal inefficient code, or it takes a long enough time that you give up waiting and can basically treat it like a genuinely infinite loop.
A softlock, where the program is still responding normally to at least some subset of its normal functionality, but a logic error prevents the program proceeding past where it is right now. A simple example would be a dialogue box that you can't close because the keybinding for Cancel is broken and you don't have any valid options to enter to get it to close normally.
Softlocks are normally pretty harmless from the development point of view (to a player, it's a different matter, as you may well need your save file to be manually edited to continue); exit-anywhere (or save-and-exit-anywhere functionality) is the second most likely part of a game to survive a softlock (the most likely is statistically the background music, which isn't really relevant in NetHack's case because most variants don't have any). This is because (as with the background music) it's something that needs to be present no matter what the game was doing.
As a terminological note, you see the term loose infinite loop sometimes for this sort of problem, typically implying a loop that's checking for input each time round the loop. This is mostly historically relevant on DOS, where most process-exiting inputs only had an effect when the program read input (I don't know whether this was because DOS's kernel only checked during input request syscalls, or if it was actually implemented in the libc rather than the kernel). The term "softlock" is typically only used in computer game communities, where the boundaries of the definition are a bit blurred (for instance, does it include any unwinnable state because you're going to have to use meta-functionality like save-and-exit to end the program rather than the "intended" solution of winning the game?), but as this blog's about a computer game and the word has pretty much the meaning I want I'm going to use it; many of the boundaries between other classifications are equally fuzzy.
Both the NetHack 3 series and NetHack 4 support exit-anywhere via Ctrl-C (the difference being that NetHack 3.4.3 will quit the game if you do this, because it doesn't know how to save midturn, and NetHack 4 will keep the save file around for you because it does). It actually wouldn't surprise me if the feature was originally added to NetHack as insurance against softlocks.
Given that softlocks can be dealt with in the same ways as any other sort of infinite loop, in addition to possibly being able to escape them using in-game functionality, there isn't much more to say about dealing with softlocks in your own code. However, one common "softlock" is when you're using a remote-access program (telnet being common in roguelike play, and ssh also common in development) and your network connection drops in a way that isn't detectable; your remote-access program is still providing most of its functionality normally, but the bit you care about (the actual remote access) is locked up and can't proceed, and the "normal" methods of escaping a softlock without collateral damage and without spending time locating the exact process number to kill won't work because they get sent to the other end of the connection.
Of course, it should be completely unsurprising at this point that the developers of telnet and ssh added softlock escape codes precisely to deal with this sort of situation, but surprisingly few people know what they actually are, so here you go:

telnet: Ctrl-], which drops you to a telnet> prompt where you can type quit
ssh: Return ~ . (i.e. a tilde, then a full stop, at the start of a line)

(Both these codes are configurable, but those are the defaults, and they tend not to actually get changed from the defaults very often.)
A tight infinite loop in which the program never reaches a point in the code where it would ask for user input (or, in a variant that's seen all too often on public servers, is repeatedly asking for user input from a source unable to provide it, like a broken network connection). It's just going round the same actions again and again, and not getting anywhere. If you leave a process in this state, it'll use up as much of a single CPU core as the operating system scheduler will let it have, indefinitely.
This is the most common sort of infinite loop, and frustrating to encounter on a production system that has nobody actively maintaining it right now, because it's slowing down your entire system noticeably and wasting electricity for no purpose at the same time. Hosting companies tend to take a dim view of that, too.
A leaky tight infinite loop; like a tight infinite loop, but now it's allocating resources on every iteration. The usual outcome of this is what's known as swap thrashing; the system tries harder and harder to find memory to satisfy the allocations in question, normally leading to the hard disk being used as temporary storage for memory (rather than the usual situation, which is the other way round). Eventually the system runs out of memory entirely and has to start denying memory requests, killing processes, or both. (Denying memory requests normally ends up effectively equivalent to killing processes because few processes can continue working in an out-of-memory state, and so they typically crash or exit. Perhaps because of this, Linux will by default cut out the middleman and kill processes directly when memory is exhausted; it tries to aim at the actual culprit.)
It's possible to leak things other than memory; for example, you can leak files (filling up the disk), filehandles (normally harmless because you run out of filehandle numbers well before you run out of memory), network sockets (saturating your bandwidth), processes (the infamous "forkbomb" which is incredibly hard to recover from because it's hard to kill the processes faster than they're created) or the like. Memory leaks are by far the most common, though.
The "good" thing about a leaky tight infinite loop is that, because it literally can't keep going forever on a finite computer, it has to come to a natural end eventually. The bad part of this is that it typically takes about half an hour to do so once regular memory is full, during which the system is incredibly unresponsive (you press a key or move the mouse and the system might or might not respond ten minutes later). I've been in this situation too many times by now, and this is why I'd have loved a guide like this earlier: you need to know in advance what methods will or won't work to get out of that state, because you can't practically use your computer to check once it's started, and a command that might normally take only ten seconds to type may well take fifteen minutes if you type it blindly and hope for no typos. In other words, the faster you are, the better.
The other good thing about a leaky tight infinite loop is that it often looks sufficiently different from normal operation that even crude heuristics can often determine that this has happened, meaning that it's the easiest sort of infinite loop to take preventative measures against.
A processor crash or kernel panic. This is when your entire system isn't responding in any normal way; your code can't progress because nothing's even attempting to run it. When this happens accidentally, it's normally because of a tight infinite loop in some low-level part of the system, such as the kernel, or (if you're really unlucky) processor microcode (this has actually happened).
Sometimes a kernel will panic intentionally because it's detected that something has gone really badly wrong and any attempt to continue would likely make things worse; this is normally an intentional softlock in which the only functionality remaining is to reboot the system (and, on recent hardware, often to log a debug dump to space set aside for this eventuality in the hardware itself; on older hardware there's nowhere to put it because you can't normally safely access a disk under circumstances this extreme). This nearly always leads to some sort of clear visual display, such as Windows' Blue Screen of Death, or Linux's eternally flashing Caps Lock light.
Given that this is at least three levels of abstraction beneath the level at which even a typical C programmer normally thinks (libc, syscall API, kernel internal API), there's not normally much point in worrying about it unless you're a kernel or driver developer yourself. Ensure that it's not a hardware failure and that you aren't overclocking, complain at whoever's responsible for the bug (on an OS with even vaguely modern security features, it isn't you unless you're actively trying to mess with system internals and are exercising admin rights to do it), and move on to something else (or, if you're really frustrated and it's reproducible, decide to become a kernel or driver developer for the purpose of fixing someone else's bug).
There are a few error states that look a lot like an infinite loop, but are actually caused by a messed-up display; it's not that the program is stuck, but rather that it's waiting for input normally, or crashed, and what's actually happened is that it's the view onscreen that's stuck.
There are two main causes of this, both of which tend to hit screen-oriented terminal-based programs (such as many ASCII roguelikes) the most because they're the programs with the largest tendency to mess around with the relevant codepaths.
The first is to do with XON/XOFF flow control, a very old protocol that uses a software-based solution to preventing a connection being overloaded. The normal way to prevent one end of a connection between computers sending faster than the other end can receive is to use a dedicated wire to say "stop sending", thus dealing with the problem in hardware; this has been a solved problem since the days of RS-232.
For those computer users who can only vaguely remember what RS-232 is: you know how nowadays, most mice connect to a computer via USB? Before the invention of USB, they typically used a round "PS/2" connector. RS-232 is the standard they normally used before that became standard, so we're talking quite a long time ago by now; there are several different connectors, but the most common was a relatively small (by the standards of the day) trapezium-shaped one, a bit like a smaller VGA port. And RS-232 gets the flow control problem right, so you have to go back even further, to days where people tried to make do with a minimum number of wires for their connections, to find systems where you need to do your flow control in software.
Anyway, despite XON/XOFF being a solution to a problem that has been a total non-issue for an incredibly long time in computer terms, hey, you might come across a system that needs it, right? So terminals and terminal emulators still have a configuration flag that lets you turn it on and off. I'm not opposed to the flag itself (I like random bits of computer history like that), but of course, its existence inevitably means that sometimes it somehow ends up being turned on by accident. libuncursed, NetHack 4's rendering library, turns it off deliberately on startup when rendering to a terminal, but it's obscure enough by this point that many programs don't know about it and don't turn it off.
The problem comes with the way that XON/XOFF works: the rule is that if you send a DEVICE CONTROL 3, then the other side of the connection queues output locally (not sending it) until you send a DEVICE CONTROL 1. Worse, the other side is queueing its output; it isn't ignoring your input. So whatever keys you're pressing are having an effect, you just can't see it. It should be clear how dangerous this can be in a roguelike, where a few random keypresses while you're "trying to get the game to respond" can kill your character.
How likely is a DEVICE CONTROL 3 to find its way into your connection? Well, thanks to the utter ambiguity of terminal codes, we find that it has precisely the same code as Ctrl-S. This is the "save without confirmation" command in Dungeon Crawl Stone Soup (thus might well be pressed intentionally by a roguelike player who doesn't know about the XON/XOFF trap), and right next to Ctrl-D on a QWERTY keyboard, a commonly used command to kick down doors in NetHack. So it definitely happens. And the combination of Ctrl-S being pressed with a misconfigured terminal? Pretty rare that it happens to any individual person, but across all the games of NetHack being played, it's happened enough times that even just counting incidents where I was around to give help, I've lost count.
The antidote is, of course, to press Ctrl-Q, but if you don't know, that's almost impossible to guess. Because of the potential negative consequences if an XOFF is the problem, Ctrl-Q is normally my first suggestion for an apparently stuck program (especially if a network connection is involved, which makes a misconfigured XON/XOFF setting much more likely).
Of course, all this means that Ctrl-S and Ctrl-Q become terrible choices for keybindings for a roguelike, or even program generally (you don't want to encourage people to press Ctrl-S, and you don't want to react in any potentially dangerous way to Ctrl-Q because players might have to press it to recover from an issue). In both the NetHack 3 series and NetHack 4, both these key combinations are unbound despite the short supply of keybindings, because of the problems that they can cause. (Making Ctrl-Q "quit and delete your save file" is thus perhaps the worst possible binding choice for that command, even though it's an obvious one; this is a mistake I'd urge ASCII-in-terminal roguelike developers not to make, unless they're really confident in the terminals of their users.)
Of course, you can go further than this if you really want to drive the point home that these bindings are dangerous. I use a range of editors; in one editor I commonly use, emacs, "save" is Ctrl-X Ctrl-S. emacs is clearly very confident in its terminal handling abilities; perhaps with good reason, as it's almost certainly one of emacs and vim that holds the record for compatibility with the most terminals (although rogue, the original roguelike, has a surprisingly good argument for being included on that list, seeing as it was a driver of terminal handling innovations at the time). Of course, this means that I often end up muscle-memorying a Ctrl-S into other editors when trying to save, and nano's reaction is pretty good:

  GNU nano 2.2.6            File: xoff-example

[ XOFF ignored, mumble mumble ]
^G Get Help  ^O WriteOut  ^R Read File  ^Y Prev Page  ^K Cut Text   ^C Cur Pos
^X Exit      ^J Justify   ^W Where Is   ^V Next Page  ^U UnCut Text ^T To Spell

Good for you, nano: your "respect XON/XOFF" flag was correctly turned off this time. But who knows where you might typo that in the future?
The other common cause also has to do with a terminal setting, and a program responding normally but with no visible effect. However, the other details are pretty different; this time, it's a terminal setting which is being used intentionally rather than a compatibility setting from decades ago, and a different program from the one we thought we were running.
The trigger for this is pretty mundane: the program you're using does an outright abnormal-termination crash (segfault, exit(EXIT_FAILURE), and the like), but doesn't have the opportunity to reset the terminal settings because it crashed so suddenly. (There are various things that programs can do to mitigate this, but they normally don't, and they're limited: a SIGKILL out of nowhere is completely unblockable, although very rare except in response to explicit action by the user. Perhaps I should add some sort of mitigation code for this to libuncursed; there are some technical obstacles to this, like needing async-signal-safe terminal status updates and dealing with competing segfault handlers, but nothing insurmountable.)
The result is that the user is dumped back into their shell, but the screen's all messed up, and user input has no visible onscreen effect. (Any output the shell produces in response does have a visible onscreen effect. Unfortunately, it may well be in an unexpected place, and in black-on-black or a similarly unnoticeable colour scheme. NetHack 4 tends to output in purple a little above the middle of the screen with this type of crash; I've seen it enough times by now that I recognised it, but given how that area is mostly purple anyway, it'd be easy for a user unfamiliar with this type of crash to miss.)
As in the previous case, keys being pressed are having an effect, but again, just not a visible one. This time they're being sent to the shell, so anything you're typing is being interpreted as shell commands. Luckily, random input normally doesn't do much when interpreted as shell commands (the worst that I'm aware of having happened is a bunch of files being created with stupid names), but there's always the risk of a particularly dangerous command being spelt out, so you'll want to deal with this possibility early to be on the safe side. (Not to mention that Ctrl-C is the first thing you were going to try anyway.)
The fix is Ctrl-C Return reset Return:

Ctrl-C: empty the current shell input line
Return: execute the (now empty) current shell input line
reset: reset all terminal settings to default
Return: execute the current shell input line (i.e. the reset command)
(Actually, this is a pretty good trick to know in general for dealing with messed-up terminals.)
The Ctrl-C and first Return are there to get rid of anything that you might have typed by mistake while trying to get the game to respond, and any text that might have been spammed with mouse movement (if the process ends suddenly like this, it doesn't get to turn mouse input back off again), so you have to keep your mouse still while doing this! Technically the first Return shouldn't be necessary, but it sometimes seems to be; I haven't figured out why yet.
reset is a program that ships with libncurses (and thus will be on pretty much any Linux system); note that it deletes your scrollback, but in this state, you're not likely to have usable scrollback anyway.
As for why the problem happens in the first place, it's because in roguelikes, you nearly always disable local echo (you don't want moving east to write actual 6s on the screen), and the sudden crash means that it never gets turned back on again. A pretty simple problem, but it can really catch out unprepared people.
So, let's assume that we have an apparent infinite loop, the program at fault is in fact running, and it's in our code rather than the kernel. It might just be a bad choice of algorithm, but if it is we may as well treat it as infinite. We also want to respond fast; it might or might not be leaky, and if it is leaky, we don't know how long it'll be before swap thrash doom starts; it might be anywhere from days to seconds. However, assuming that our computer doesn't reboot instantly and that we probably have some sort of state (unsaved files, open windows, that sort of thing) that we care about, we want to try contained methods first.
If the program's running from the foreground of a terminal window, we can start by sending various "stop running" key combinations to it. This is the case for programs that run in terminals, obviously. Perhaps less obviously, it's also typically the case for graphical programs that run in their own window, so long as you started them from a terminal, you didn't background them (typically with & on the command line or Ctrl-Z later), and the program didn't daemonize itself (not normally worth worrying about; I can't think of a reason why a GUI program would want to daemonize, and in practice they basically never do).
The most basic way to exit a misbehaving program is with Ctrl-C. By default, this sends the SIGINT signal, which tells programs to exit (and exits them crash-style with no debug dump, if they have no specific handler for it).
There are a ton of potential reasons why this wouldn't work: the program might handle SIGINT to add a confirmation, convert it into a normal-style exit, repurpose it as a softlock escape code, or block it outright. All of these are pretty likely and reasonable, too. The reason is that a crash-style exit, with no confirmation, upon a single easily typoed key command is something that programs really don't want to happen (especially with typical roguelike save mechanics where doing so would lose you your entire game, but even in other cases). Given how well-known Ctrl-C is, pretty much all sufficiently large programs do something to stop this happening.
Ctrl-C is still well worth trying, though. Even though programs nearly always take steps to change its default implementation, its intended function is sufficiently well-known that many try to preserve the meaning. Perhaps this is via adding a handler that converts it to a normal-style exit, via adding a confirmation, or via using it as a softlock escape code. In other words, most programs will at least tell you how to quit in response to Ctrl-C. (vim is a fun example of this: its entire response to Ctrl-C is to print a message telling you how to quit.)
Unfortunately, though, the fact that SIGINT has a lot of safe-shutdown logic associated with it means that it's also normally the codepath most vulnerable to getting stuck in a loop itself. Perhaps it's waiting for a "safe place" in the code to do a shutdown (NetHack 4's Ctrl-C handler works like this, for example); an infinite loop could mean it never gets there. Perhaps it calls into the same buggy code that led to the loop in the first place.
In other words, you typically can't expect this to work on a truly broken program, but it rarely hurts to try.
Ctrl-C is well-known as a "universal exit" code for programs. There's actually a subset of programs (command-line-interface terminal programs) which have an even more universal exit code: Return Ctrl-D (i.e. Ctrl-D at the start of a line). By default, Ctrl-D is interpreted as "flush standard input", causing any partial line entered so far to be sent to the program you're using (thus this won't work with roguelikes, which don't use line-at-a-time input for obvious reasons). If you press it at the start of a line, there isn't a partial line, so you send zero characters to the program, a state that looks identical to end-of-file (unless you try to read again, and why would you do that after end-of-file?).
The vast majority of terminal-based command-line-interface programs on Linux know about the "Ctrl-D at start of line = exit" convention and will exit in response to this. Even the ones that didn't intentionally have it in mind during implementation will normally exit anyway; after seeing an end to their source of commands, there's clearly nothing more that can be done, and they'll often fall into error-handling code (which normally exits a command-line-interface process):
Welcome to Adventure!!  Would you like instructions?
user closed input stream, quitting...
As for how this applies to infinite loops, clearly it won't help if the program isn't reading input, but if it's just softlocked, it's normally pretty effective at jumping it out of its current state. Command-line-interface programs normally don't bother with an explicit softlock escape code, because they have Ctrl-D.
Ctrl-C is very well known, but there's also a very similar effect that's considerably less well-known. SIGQUIT, whose default binding is Ctrl-\, was designed to be identical to Ctrl-C except that it's a true crash-style exit by default (with debug dumps if they're turned on, and all that sort of thing), rather than the Ctrl-C reaction which is just mostly crash-style by default.
Anyway, all the comments under Ctrl-C would apply to Ctrl-\ too, but with two big exceptions: it's considerably less well-known, which rather changes the whole dynamic; and it's not needed as a softlock escape code (because Ctrl-C exists already). Many developers will do something to handle or block Ctrl-C (the key combination) or SIGINT (the signal it sends) or both; putting the same effort in for Ctrl-\ or SIGQUIT is much rarer (although it happens).
This means that Ctrl-\ is, in practice, a surprisingly good command for intentionally crashing a process when Ctrl-C doesn't exit it. The downside is that considerably fewer programs will try to do cleanup, saving open files, giving confirmations, etc.. on pressing it, meaning that you don't want to use it as your very first option; perhaps you could have exited the infinite loop and saved your save file at the same time. The upside is the same thing as the downside; considerably fewer programs will try to do anything fancy, meaning it's less likely to be broken.
Programs (like NetHack 4, via libuncursed) that do handle Ctrl-\ normally use the same codepath for it as Ctrl-C. The reasoning is typically that the odds of the program being stuck in a loop are lower than the odds of someone hitting the combination by mistake, and besides, there's still a whole blogpost of combinations to try to get rid of the loop.
So if the reason that programs tend to block key combinations that induce crash-style exits is that they're normally typos rather than alternative methods of exit when the normal method is blocked by a bug, what about a key combination with lower consequences for typoing it? For example, it could just pause the program, allowing it to be crashed or resumed at the user's leisure once it's stopped using all the CPU cycles.
As you might have guessed, there is such a key combination. The signal in question is called SIGTSTP, and the default keybinding is Ctrl-Z (a binding that's by now sufficiently well-known that even some GUI programs have started implementing it, although 'undo' is still a more common interpretation). Although, like Ctrl-C and Ctrl-D, there's pretty high awareness of it among developers, there's much less of an incentive to do complex things in response; typoing it is easily reversible (fg), and it serves as a reasonably safe way to indirectly crash-kill a process (first pause it so that it stops chewing up CPU and so that you have access to a shell, then use that shell to crash-kill the process from outside).
Actually exiting a process via SIGTSTP is a little more involved than in the previous example. You basically use Ctrl-Z to pause it, then the techniques in the next section to exit or intentionally crash it from there. The difference is that (in most shells) you can reliably find out the process ID for the last thing you successfully SIGTSTP'ed with a single command:
jobs -p %%
Admittedly, I had to look it up. (You can also use %% as a substitute for the process ID as an argument to kill, so long as you haven't done job control manipulation since.)
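The whole flow can be sketched in a script, using `set -m` to turn on the job control that interactive shells have by default, and a sleep standing in for the stuck program:

```shell
#!/bin/bash
# The %% jobspec refers to the most recent job; sleep stands in for a
# stuck program that has been suspended.
set -m

sleep 100 &
pid=$!

kill -TSTP %%              # suspend the current job, as Ctrl-Z would
jobspid=$(jobs -p %%)      # the process ID of the most recent job
echo "suspended pid: $jobspid"

kill -KILL %%              # %% also works directly as a kill target
wait "$pid" 2>/dev/null || true
```

Note that SIGKILL works fine on a stopped process; you don't need to resume it first.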
SIGTSTP isn't a magic bullet for exiting processes, because many processes still need to do handling for it. A program might block it outright for interface reasons (even though you can simply resume the program with fg, that doesn't mean that the end user knows that, and if they assume the program has crashed they may try to run it recursively and cause Bad Things to happen). There are also valid reasons to handle it; NetHack 4 (via libuncursed) handles it in order to put the terminal settings back to where the user expects them (most users won't want mouse movement to spout text into their terminal, for example).
We're now starting to get into the realm of "asking nicely, with potentially destructive side effects". SIGHUP is another signal in the same basic category as the other signals we've seen in this section; it's a request to exit that can be blocked or handled. However, the usual way to send it is irreversible: you close or disconnect from the terminal the process is running in. (Alternatively, you can send it using the usual techniques for killing a process using a separate terminal, which are explained later in this blog post.)
SIGTSTP was different from the other signals in terms of developer reactions because failing to handle it isn't normally a big deal (if you aren't taking over the terminal settings, that is). SIGHUP is different in a different way: you're (under normal circumstances) not going to get any more information from the user, so this is no time for confirmation prompts; whatever you're going to do, just do it. This makes it a particularly good way of exiting programs which are stuck trying to do something interactive for some reason. (Unfortunate exception: if the reason they're stuck trying to do something interactive is that the terminal doesn't exist and the program assumes that it does. This is probably the #1 most common source of tight infinite loops in recent NetHack history; it's a surprisingly easy mistaken assumption to make that you can just repeatedly ask questions until you get a valid answer, but a missing terminal isn't going to give you one.)
SIGHUP is also often well worth a try because it's sufficiently different in meaning from the other termination signals that it often has a different codepath, giving you a second try to find a codepath that works to exit the process. For example, in NetHack 3.4.3, the SIGHUP handler tries to assemble whatever's in the game's memory into a working save file (a much better outcome than destroying the game like the SIGINT handler does, although a rather less reliable one and the source of known exploitable bugs). In NetHack 4, the game tries to navigate its own menus to produce a controlled shutdown, and crash-kills itself if it can't manage to do so within a relatively short time limit (which could happen in the case of a softlock); this is thus a rather more reliable way to exit the program than Ctrl-C, which attempts to open a menu that takes further user input.
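A SIGHUP handler of the "no prompts, just act" kind can be sketched in a few lines of shell (the save-file name and contents here are invented for the demo; a real program would write its actual state):

```shell
#!/bin/bash
# Sketch: a program that reacts to SIGHUP by writing an "autosave" and
# exiting immediately -- no confirmation prompts, since there is normally
# no terminal left to show them on.
savefile=$(mktemp)

bash -c '
  trap "echo autosaved game state > '"$savefile"'; exit 0" HUP
  while :; do sleep 0.1; done    # the "stuck" main loop
' &
pid=$!
sleep 0.2

kill -HUP "$pid"     # what closing the terminal would deliver
wait "$pid"
cat "$savefile"      # → autosaved game state
```

The key point is the `exit 0` directly in the handler: nothing waits for input after the hangup arrives.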
The clear downside, of course, is that after doing this, you certainly don't have the program in the foreground of a terminal any more! So this has to be the last thing you try in this section. Also, it can be hard to work out whether it worked or not (you won't have anywhere to see messages that might have been produced), and if it doesn't work, it can be hard to identify the process you were trying to kill. (Although if you don't have anything else running at 100% CPU, that typically gives you at least one reliable giveaway. As always, the problem is as to whether you can exploit it before the swap thrashing starts.)
Suppose that the techniques in the previous section aren't useful, either because you don't have a terminal, or because the process is overriding the keystrokes you wanted to use, but that the system is still in relatively good shape right now: you can start new programs and use other programs, it's just that one process is stuck. At this point, you can just open up a new terminal window to get to a shell prompt (or use an existing one that's running a shell), and use that shell to send signals to the process in an attempt to exit it.
The most basic way to do this is using the kill command. This command takes a process ID as its argument, and sends a signal to that process. For example, say that the stuck process in question has a process ID of 12345:
kill 12345       # this command sends SIGTERM, by default
kill -HUP 12345  # this command sends SIGHUP, as requested
kill -STOP 12345 # this command sends SIGSTOP, as requested
# and so on
There's a pretty wide range of signals you could use, and most of them will by default end the program. You can specify the signals either by name or by number (the names are normally easier to remember, but if you happen to have the numbers memorized, they can be faster to type). In addition to the signals mentioned in the previous section, here are a few of the more interesting ones that you can send:
SIGTERM: This is the default signal used to exit a program from "outside". There are only two common situations where it gets sent: manual use of kill and friends, and during a normal system shutdown process. The big advantage over something like Ctrl-C is that processes will never consider it to be a typo; if a process does handle it, it will nearly always be with its usual meaning of "urgently do an orderly shutdown". It's worth noting that SIGTERM nearly always implies some sort of time pressure – during shutdown, the process that's implementing the shutdown (init, the same process that implements bootup) will send SIGKILL after only a few seconds if the process hasn't exited – and so programs are unlikely to do any sort of UI in response to a SIGTERM and will often respond by creating an autosave file or the like. This is nearly always exactly what you want.
Numerically, this signal is written as kill -15. Basically nobody ever does this, because it's the default; the only time that knowing the number is relevant is that you sometimes see the number in a crash report.
SIGSEGV: This is pretty much the complete opposite of SIGTERM; its usual purpose is being used to crash a program when a programming error related to memory use is detected by the operating system (things like accessing beyond the end of an array, dereferencing a NULL pointer, and the like). If you work as a C programmer on Linux, you've probably seen this signal (a "segfault" or "segmentation fault") tons of times.
Sending the signal manually might thus seem kind-of weird. The advantage is that because it looks so much like a programming error, the process you're sending it to is going to make few assumptions about its own state if it handles it, meaning that you have an offchance of a useful autosave file, and that it will probably work to kill the process in question.
Numerically, this is kill -11, a number I'm familiar with by now from seeing it in crashes all too often.
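From the outside, a hand-delivered SIGSEGV is indistinguishable from a real segfault. A small sketch (with sleep standing in for the stuck process):

```shell
#!/bin/bash
# Sending SIGSEGV by hand: the exit status is the same one a genuine
# segfault would produce.
sleep 100 &
pid=$!

kill -SEGV "$pid"          # same as: kill -11 "$pid"
wait "$pid" && segstatus=0 || segstatus=$?
echo "exit status: $segstatus"   # 139 = 128 + 11 (SIGSEGV)
```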
SIGSTOP: Remember SIGTSTP? This is basically the same as that, pausing a process to enable a controlled shutdown to happen later. The advantage over SIGTSTP is that it cannot be handled; if the kernel is functioning normally (and if you have permission to send signals to the process), the process will pause whether it likes it or not.
Strangely, given how it's universally implemented on POSIX systems and has special rules of its own (making it one of the more important signals), this signal doesn't have a consistent number on all systems, although kill -19 is a common choice.
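You can watch the pause happen via the process state letter in /proc (Linux-specific; a sketch, with sleep standing in for the target):

```shell
#!/bin/bash
# SIGSTOP pauses a process whether it likes it or not; the third field of
# /proc/<pid>/stat shows T while it is stopped.
sleep 100 &
pid=$!

kill -STOP "$pid"
sleep 0.2
state=$(awk '{print $3}' "/proc/$pid/stat")
echo "state while stopped: $state"     # T = stopped

kill -CONT "$pid"
sleep 0.2
state2=$(awk '{print $3}' "/proc/$pid/stat")
echo "state after SIGCONT: $state2"    # S = sleeping again

kill -KILL "$pid"
wait "$pid" 2>/dev/null || true
```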
SIGKILL: Perhaps the most infamous of all process-ending signals, and unusual in that its number is probably better known than its name; a large proportion of Linux users might not know what the signal is called, but most of them know about kill -9. It exits a process. It cannot be blocked, handled, or intercepted by any means, and the process gets no chance to clean up or to produce a crash dump or anything of the sort; it just ends. As an example of how comprehensive this is: normally if you signal a process that's running under a debugger, the debugger will see the signal and give its user options for handling it, but after a kill -9 on the process that's being debugged, the process will just end anyway and the debugger will be confused as to where it went (printing messages like ptrace: No such process. in response to pretty much any command).
The only real exception I'm aware of is to do with permissions: you can't normally SIGKILL a process that's owned by root or by another user unless you're root yourself. The signal itself is still all-powerful, but the kernel won't let you try to send it in that case.
As another note, processes can sometimes "seem" to survive a SIGKILL, when what's actually happened is that the process has exited but its process ID is being kept around because something still needs to refer to it. This is normally indicated with the letter Z in process listings (where it appears depends on the program you're using to make the list, and it isn't shown by every such program).
For many people, this is their first resort when killing a process, because it works so unconditionally. I think it's better to try other options first, though, because you might be able to salvage more of the state of your program; perhaps if you used more caution, you could get a working autosave file, or crash dump, or the like.
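That "try gentler options first" approach can be automated. A sketch (not from the original post; the 5-second grace period is an arbitrary choice): ask politely with SIGTERM, wait for an orderly shutdown, and only then fall back to SIGKILL.

```shell
#!/bin/bash
# Escalating politely: SIGTERM first, a grace period, then SIGKILL.
gentle_kill() {
    local target=$1
    kill -TERM "$target" 2>/dev/null || return 0   # already gone
    for _ in 1 2 3 4 5; do
        kill -0 "$target" 2>/dev/null || return 0  # it exited in time
        sleep 1
    done
    kill -KILL "$target" 2>/dev/null || true       # out of patience
}

# A stand-in "stuck" process that ignores SIGTERM:
bash -c 'trap "" TERM; while :; do sleep 0.1; done' &
stuck=$!
gentle_kill "$stuck"
wait "$stuck" && status=0 || status=$?
echo "status: $status"    # 137 = 128 + 9 (SIGKILL)
```

This is essentially what init does during shutdown, on a system-wide scale.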
That handles the signal number part of kill, but what about the process ID? In most cases, you won't happen to know what it is, so here are some methods you can use to find out:
pgrep is one of my favourite methods because it just reports process IDs without actually doing anything to the process. It can identify a process in a number of ways, but the default is a substring match on the process name (pgrep nethack4 is something I find quite useful as a result, especially when I'm trying to identify a broken process on the nethack4.org server). While writing this blog post, I learned about pgrep -a, which shows the command line as well as the process ID, which seems like it could make identifying the right process even faster.
You can replace pgrep with pkill (pkill -9, etc.) in order to just kill all matching processes immediately, which is nice if you know that you aren't going to get false positives. (It takes a lot of experience to know this, though: especially with substring matches, false positives are quite common!)
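A cautious workflow, sketched with a sacrificial sleep standing in for the runaway process: check what the pattern matches with pgrep -a first, and use -x (exact name match) to reduce false positives before trusting the same pattern to pkill.

```shell
#!/bin/bash
# pgrep to inspect the matches, then pkill with the same matching rules.
sleep 100 &
pid=$!
sleep 0.2

pgrep -a -x sleep      # list what the pattern matches, with command lines
pkill -KILL -x sleep   # same matching, but sends the signal
wait "$pid" 2>/dev/null || true
echo "matching processes killed"
```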
top is a screen-oriented program for listing processes, which lets you sort via various criteria (using < and > to change the sort order). It even lets you signal processes from inside it with k. The big advantage is that it displays process name, CPU usage, and memory usage onscreen (and supports all these as sort orders), normally meaning that the infinitely looping process easily stands out from the crowd. The disadvantage is that as a relatively heavy program, it can take a while to start if it has to compete with an infinitely looping program that's tying up system resources.
If the offending program is a GUI program, you could try asking the graphics system (which on Linux, is typically X). If you're using X, then the command in question is xprop _NET_WM_PID (which allows you to specify a window via clicking on it), but I can rarely remember how to spell that and thus more commonly use xprop | grep PID. As a variant on this, you can use xkill, which is the GUI equivalent of closing standard input; it causes the graphics system to shut down all communications with the window you click on, and most programs will exit in response because they have no way to continue.
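The fiddly part of the xprop trick is pulling the PID out of its output. xprop needs a running X server, so the line it prints after you click a window is simulated here; with X available, you would pipe real output into the same filter:

```shell
#!/bin/bash
# Parsing the PID from xprop output; the helper name is invented.
pid_of_window() {
    awk '/_NET_WM_PID/ {print $NF}'
}

sample='_NET_WM_PID(CARDINAL) = 12345'   # typical xprop output line
echo "$sample" | pid_of_window           # → 12345
```

With X running, `xprop | pid_of_window` (click the window when prompted) gives you something you can pass straight to kill.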
Being able to find the process ID is also useful if you want to debug the problem, rather than just make the process go away. Running gdb with the -p option and a process ID will pause the process in question (assuming it has permission to debug that process: on Ubuntu, it probably won't if you haven't changed your settings as described above), and also allow you to debug it from there (and you can subsequently use gdb's kill command to kill the process if you want it to end, which I think simply sends SIGKILL).
Of course, if you need to kill a stuck process but don't have the permission to do so (e.g. because the process is owned by another user or because it's a service), you can simply gain permission via the normal means (sudo, etc.) alongside your kill command. As always when using elevated permissions, be careful that you know what you're doing and that you're entering the right commands: part of the reason the permission checks are there is to prevent you accidentally taking down or corrupting the system, and when overriding the checks, you could well end up killing a critical system utility and making the system unusable.
The most likely reason that the above techniques would fail is that, because X has locked up or because of swap thrashing, the system's UI isn't responding and thus you can't open a terminal window (or use an existing terminal window that's running a shell) and enter shell commands into it.
The least damaging resort, therefore, is to try to find a shell somewhere which is responding. The most accessible place is known as the "virtual terminals".
Another quick history lesson. UNIX computers used to be mainly used via physical terminals, which were a separate device from the computer itself and connected to it in much the same way as a printer or a keyboard. Nowadays, the usual way to replicate that functionality is using graphical programs like xterm that emulate the physical terminals; these use an abstraction known as a "pseudoterminal" to do their work, and need further layers of abstraction to display their windows onscreen, communicate with the user, and so on. In between came the "VGA console", which is basically what DOS uses in order to display text to the user (and which can even nowadays be seen on many systems during the early boot process); and the "framebuffer console", which is basically a part of the kernel that has the same functionality as the VGA console but uses the kernel's graphics code. These are collectively known as "virtual terminals", because they do the same job as terminal hardware, but without requiring a physical terminal.
At one point, using virtual terminals would have been the main method of using a Linux-based computer. (It still is if for some reason you're using the computer locally, i.e. not over a network, and also haven't installed any graphics software like X. This configuration is very unusual, though; most systems that don't need graphics are servers, and most servers are used over a network using programs like ssh, rather than via being physically present at the server.) They still exist, though, and are nowadays mostly used in emergencies (either due to issues during the boot sequence that happen before X has loaded, or because X has frozen). Many programs work in them, though, including NetHack 4.
You switch to a virtual terminal by holding Ctrl and Alt and pressing one of the F keys (e.g. Ctrl-Alt-F1). There are nearly always several virtual terminals available; on my laptop, I typically have six. Sometimes there's also one dedicated to boot messages, although that seems less common nowadays; and when you're running graphics software, that takes over a "virtual terminal" of its own (meaning that after pressing Ctrl-Alt-F1, I can get back to my graphical desktop using Ctrl-Alt-F7, because it gets the next available number after the first six).
Once you're at a virtual terminal, all you have to do is log in (using your username/password pair, as normal), and you'll have a working shell. You can then kill processes in the normal way; you won't have a GUI, but that doesn't really matter because you have a working command line. Unfortunately, during swap thrashing, this process can be really slow (just displaying the password prompt after the username is entered can take over a minute), but it does normally work eventually. You can log back out of a virtual terminal using the logout command, or (unsurprisingly, given the discussion earlier) by pressing Ctrl-D at the command prompt when no text is entered.
If you have reason in advance to think that you might need to kill a process in a hurry during swap thrashing, you could always try logging in on a virtual terminal pre-emptively (and perhaps even starting top pre-emptively). That way, you will need to run considerably fewer commands once the thrashing starts, meaning that you can end it sooner.
I should also mention that in addition to the virtual terminals, the original terminal system also exists, the "serial terminal", and is even lower-level (it even works during early boot). This requires a separate terminal system connected to your computer. It's actually pretty easy to get such a terminal system nowadays – although physical VT100s are rare, software for emulating their functionality, like HyperTerminal on Windows or Minicom on Linux, is readily available – but modern computer hardware rarely has the serial port needed to make the connection. (You can get USB serial ports, but they need a lot more work from the kernel to handle.) From memory, the cables also tend to be quite expensive.
Finally, if you have a working network connection, a second computer to use it, and if your firewall isn't too upset at the idea, you can use a program like ssh to get a terminal on your computer over the network. This is the way Linux servers are most commonly administered nowadays, and although less usual, it works on desktops/laptops/mobiles too. I'm not sure whether ssh is more or less badly affected by swap thrashing than the virtual terminals are; I've never tried this method myself (both because I rarely have a second computer handy that's networked with the one I'm using, and because my firewall is set to refuse incoming ssh connections), but other people have reported reasonable success with it.
Suppose that you have a particularly hard crash, or that swap thrashing is so bad that you feel powerless to even attempt to log in on a separate console (or don't have the time to type complex commands at swap-thrashing speed). Perhaps it's more important to get the system back into a usable state now even if you lose other processes in the process. You can try some of the following techniques:
In the early days of multitasking operating system design, keyboard manufacturers realised that users would need some way to communicate with the operating system: on a single-tasking operating system, a program can take over the entire keyboard and do what it likes with it, but that would mean that there would be no way to switch to a different process.
As most computer users will be aware, the solution to this problem that ended up being adopted was to add global key combinations like Alt-Tab and clicking on the taskbar that individual programs normally don't interfere with. Windows also decided to adapt the Ctrl-Alt-Delete combination (previously used for rebooting the computer) into a key that couldn't be intercepted by applications and could be used to forcibly quit them (among other things).
However, the keyboard manufacturers had a different solution in mind. If your programs are already using all the keys on the keyboard, then a simple solution is to add another key, that's reserved for communicating directly with the operating system kernel. The key in question exists on most modern keyboards, and is called SysRq (presumably standing for "system request"). In order to remove the need to add an extra physical key, it's normally a modifier/key combination, and in particular is normally Left Alt-PrtSc (and the PrtSc key may or may not also have SysRq written on it; on most older keyboards it does, but modern keyboards tend to leave the label off). Laptop keyboards might occasionally have it somewhere else. (Something I learned while writing this blog post is that there are various bindings for it on non-PC hardware, too, e.g. SPARC apparently uses Alt-Stop and PowerPC apparently uses F13 or Alt-F13, which is sometimes labeled as PrtSc.)
On Windows, Ctrl-Alt-Del is sufficient for this purpose, and so most computer users don't have much of an idea of what SysRq is for (Windows will interpret Left Alt-PrtSc as a literal Alt-PrtSc, and take a screenshot of the current window). Linux, however, uses the SysRq key with its original intended meaning (although many distributions disable much of the key's functionality for security reasons; see the configuration advice earlier in this section). You have to hold down Alt continuously while typing the combination (presumably as security against typos), which on many keyboards means that you have to let go of SysRq because they can't handle that many keys being pressed at once.
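Whether SysRq combinations are honoured at all is controlled by a kernel setting; here's a sketch of inspecting it, and (as root) changing it. The value 1 means "everything enabled"; other nonzero values are a bitmask of allowed functions whose exact meaning is kernel-documented:

```shell
# The kernel's SysRq policy lives in /proc/sys/kernel/sysrq: 1 enables
# everything, 0 disables everything, other values are a bitmask of
# allowed functions.  Reading it is harmless; changing it needs root.
cat /proc/sys/kernel/sysrq

# To enable all SysRq functions until the next reboot (as root):
#   sysctl kernel.sysrq=1
# Or persistently, via a line in /etc/sysctl.conf:
#   kernel.sysrq=1
```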
One particularly useful combination here is SysRq-F, which kills one process, asking the kernel to guess which one you want to kill (based mostly on how much memory pressure it places on the system). (This usually has to be typed as Alt-(PrtSc,F), i.e. you continue holding down Alt but let go of PrtSc before pressing F.) In the case of a leaky tight infinite loop, it is nearly always very obvious to the kernel which process you mean to kill, so you can expect it to get the right target first time. Of course, there's risk to doing things this way: it might hit the wrong process, in which case you just lost some work or destabilised the system slightly. It still seems like one of the best options for dealing with swap thrashing, though.
Over 99% of Linux laptop/desktop users will be using a graphical desktop over 99% of the time. This means that all the programs they start, whether text or graphical, will have been started via the graphical desktop, and thus killing the desktop and all its descendants will necessarily kill the process they care about (because it kills every process they care about). This is pretty much equivalent to logging out, except that the processes in question won't get a chance to do a clean shutdown. It's a pretty bad option, really, but it's still better than shutting down the entire system, as you won't have to go through the boot process; you get right back to the login screen, which can be a major time gain (especially in corporate-style configurations in which the boot process has a lot of network connectivity, installing updates, and the like).
The traditional key combination for doing this was Ctrl-Alt-BkSp, which was standard and worked for years. On many Linux systems, it still works. However, more recently, it's often been disabled, for two reasons. The first is relatively simple: it was deemed too easy to press by mistake. It's not that easy to press by mistake, but the consequences are really wide-ranging and drastic for something that isn't totally unreasonable as a typo. Emacs users might enjoy reading the documentation that lists keys that are rebound: two of the bindings in question are C-M-Delete and C-M-Backspace, which seem reasonable until you realise that on modern keyboards, the usual way to type C- is Ctrl- and the usual way to type M- is Alt- (you could use Esc- but that's much less common). (Emacs finally realised how unfortunate these bindings were in version 22.1, and removed them, but the documentation hasn't been changed to match yet.)
The other reason that the Ctrl-Alt-BkSp binding is typically disabled is that it was realised that another pre-existing binding did the same thing. SysRq-K (i.e. normally Alt-(PrtSc,K)) forcibly kills every program on the current virtual terminal, i.e. every program that was started via your graphical desktop, if that's what you're looking at at the time. This was suggested as a much harder-to-typo binding that achieves basically the same thing. Unfortunately, there's some controversy about this particular binding (including what its intended use is, and how secure it is), meaning that there's conflicting advice about always using it (including before a login), never using it, manually configuring it to something else, and setting up the system in such a way that it doesn't actually work. (Not to mention, what its name is; it's documented as "SAK" but the acronym can be expanded to "Secure Access Key" (which some people claim is a misnomer), or "System Attention Key" (which is kind-of nondescriptive of what it does).) However, I think it's widely agreed that when it works, it's good at killing processes.
As a side note, graphical logouts are also useful when you need to log out in a hurry and some program or other is taking its time during the log out process. If you have to be at a meeting in 1 minute's time, i.e. have to leave now in order to get there in time, and don't want to leave the system logged in while the log out process runs in case someone manages to look at your files in the meantime, a graphical logout (on systems that allow it) is going to be a better option than cutting the power to the computer.
Another SysRq combination that's sometimes useful for getting rid of an infinite loop is SysRq-N. This is to do with "realtime priority" processes: the special property of such processes is that they can demand CPU cycles whenever they want them and take as many as they need before allowing other programs to run. While useful when writing a program that needs special access to the hardware (it's basically the next level down from writing a kernel module in terms of what you can do with interacting with the hardware), this obviously causes particular danger when such a process gets into an infinite loop, as it then needs all the CPU cycles available and thus no process (other than possibly other realtime priority processes) will get to run.
In most cases, this probably hasn't happened. At least in theory, code reviewers look really carefully for potential infinite loops in realtime priority processes, just as they look really carefully for potential security bugs in setuid executables; because the code is using a special exemption from one of the standard assumptions made by the operating system, people will pay special attention to ensure that it doesn't abuse it. The exception is if you happen to be a programmer working on a realtime priority process yourself, in which case you probably know you are.
The SysRq-N combination is basically a "last ditch" attempt at bringing a realtime priority process under control. It removes realtime priority rules from all processes, forcing them to take turns on the CPU like normal processes do. Of course, those processes were presumably using realtime priority for a reason, so doing this can be expected to cause them to stop working reliably. Hopefully, none of them were doing anything critical for the system's survival.
Working on a realtime priority process is a relatively rare activity, and one where it's well worth taking precautions against infinite loops. Here are two things you can do without having to resort to something as global as SysRq-N:
You can keep a shell handy with a higher realtime priority. As long as you don't give the shell any input, it won't (or at least shouldn't) request any CPU cycles, so it shouldn't throw off the timing of other realtime priority processes.
Getting keyboard focus to such a shell might be difficult, and you can't run it in a graphical terminal because the terminal itself (and the X server) would need realtime priority to be able to forward your keypresses, so it's best to run it on one of the virtual terminals so that you can switch to it with Ctrl-Alt-F2 or the like. If at all possible, start your program from the shell in question via running it in the background, so that the shell keeps keyboard focus at all times.
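The tool for starting a process under a realtime scheduling policy is chrt, from util-linux. A rescue shell would be started with something like `chrt --fifo 99 bash` (which needs root); unprivileged, we can at least inspect the priority ranges and run a command under the default policy, as a sketch:

```shell
#!/bin/bash
# chrt starts a command under a chosen scheduling policy.  Realtime
# policies (SCHED_FIFO, SCHED_RR) need root; SCHED_OTHER does not.
chrt -m                                          # min/max priority per policy
chrt --other 0 echo "started under SCHED_OTHER"  # unprivileged example
```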
In remotely modern Linux kernels, you can use kernel resource limits to effectively place a "watchdog" on the process, automatically killing it if it seems to have gone into an infinite loop (via counting the time until it voluntarily relinquishes the CPU). The limit in question can be set using the shell command prlimit, or a C library function that is also called prlimit.
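The same limit is reachable through the shell builtin ulimit, which makes the watchdog behaviour easy to demonstrate: once the process has burned through its CPU-time allowance, the kernel kills it with SIGXCPU. A sketch:

```shell
#!/bin/bash
# A CPU-time resource limit as a crude infinite-loop watchdog.  ulimit -t
# is the shell interface to the same RLIMIT_CPU limit that prlimit(1) and
# the prlimit() C function manipulate.
loopstatus=0
(
  ulimit -t 1              # at most 1 second of CPU time
  while :; do :; done      # a deliberately tight infinite loop
) || loopstatus=$?
echo "loop killed, status $loopstatus"   # 128 + the signal number
```

The loop ends after about a second of CPU time, no matter what; statuses of 128 or above indicate death by signal.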
The very last resort is to shut down the system. You could just cut the power, but that comes with risks such as filesystem corruption; doing a clean shutdown is better if possible.
Obviously, at this point, you're not going to be able to shut down the system via the normal method; rather, you have to guide the kernel through the steps of a manual shutdown yourself. As you're communicating with the kernel, these are all SysRq-combinations.
Here's the traditional recommended way to manually shutdown a system under Linux, and the reasons why each step is performed. Note that many of them may take some time to work correctly, so you have to wait several seconds (at least) between most of them, and until there's no hard disk activity:
SysRq-R: forcibly take control of the keyboard. Sometimes allows you to see what you're doing. Sometimes allows you to Ctrl-Alt-F1 when you couldn't before. This is all in theory; I don't think I've ever seen a benefit from it. At least you don't have to wait long after it; it shouldn't be a time-consuming process.
SysRq-E: send SIGTERM to every process (except init, which is responsible for startup and shutdown). This probably won't have much of an effect on the actual offending process, but it does have the advantage of potentially producing useful autosave data from any other process you happened to have open, meaning that you don't lose your work there.
SysRq-I: send SIGKILL to every process, except init; the system is unlikely to be in a very usable state after that. At this point, traditionally at least, the idea is that the system would be in a quiescent state without anything trying to modify it, meaning that you don't have to worry about things interfering with your clean shutdown.
SysRq-S: write all unwritten data to disk. This is a pretty important step for protecting against filesystem corruption; the hope is that it will at least leave your disks in a consistent state.
SysRq-U: make all disks read-only. No point using the computer without a reboot from this point, because you wouldn't be able to save anything. Along similar lines, useful during a shutdown, because nothing can be changing at the point of the shutdown.
SysRq-B (reboot) or SysRq-O (shutdown). The last part of the shutdown process. It's pretty scary just how instant this is; you press it and your computer turns off. You've done the rest of the clean shutdown process already (init does basically this when shutting down, with the main difference that it takes more care to do things in the right order), but it still comes as an abrupt ending.
SysRq-R-E-I-S-U-B is seen as a shutdown code in many Linux tutorials, and I hope that this explanation makes it a bit clearer what it does and why it was traditionally recommended. (It's relatively easy to remember: reisub spelt backwards is busier, which is a real word.)
The problem is that nowadays, Linux implementations don't use sysvinit, a traditional (and very simple) init implementation that was standard even just a couple of years ago, but rather more complex implementations that handle things like parallel dependencies during startup and restarting crashed processes. (upstart was popular for a while, but nowadays many Linux distributions are standardising on the highly controversial systemd.) With both upstart and systemd, things go very differently from how they used to. The difference starts at step 2: SysRq-E (and SysRq-I) still request termination of (or outright kill) every process but init, but now init sees all these processes that "should" be running but aren't, and tries to restart them. The result is that you get a kind of weird hybrid between a shutdown and boot-up process.
Worse, in some cases (such as X), the fact that the process is
unexpectedly not running is misinterpreted as a crash (despite the use of
SIGTERM, which is one of the least likely signals to be involved
in a crash as opposed to sent intentionally). This means that you get
the graphics troubleshooter loading up (and at least on Ubuntu, the
graphics troubleshooter is kind-of buggy). Ironically, the
troubleshooter itself got stuck in an infinite loop (most likely
waiting for input that it couldn't receive) that prevented Ctrl-Alt-F1
working when I was testing out the manual shutdown process under
systemd, meaning that SysRq-E was my best option to exit it (as
SysRq-K seemed to not be working for some reason, and SysRq-F would
have been too unpredictable with no apparent memory leak; perhaps the
computer was somehow in a mode where no virtual terminals exited).
So what happened in my shutdown attempt was something like SysRq-R, SysRq-E (which brought me to the graphics troubleshooter), SysRq-E (which brought me to the login screen for some reason). I experimentally tried logging in, and although the system sort-of worked, there were definite oddities, such as editor windows disappearing from the screen. Further SysRq-E and SysRq-I entries left me in various other places, such as framebuffer console 1. I then did SysRq-S and SysRq-U, logged in on the framebuffer console, verified that I couldn't do anything to affect the filesystem (and thus it was safe to effectively "cut the power" even though I was logged in), and a final SysRq-O to shut down.
I'm not really sure what conclusions to draw from this, except that
modern Linux distributions are too complex to sanely shut down
manually. Perhaps a change could be made to
init that gives a more
sensible reaction to SysRq-E (it should be easy enough to detect that
it's happened). As it is, the main advice I can give is to do the R
(because surely there's some reason people recommend it) and the E (in
order to get programs to autosave), skip the I because it does more
harm than good nowadays, and just do the S/U/B in whatever state the
system ends up in after the E (waiting for hard disk activity to stop
before each), in the hope that it's a reasonably quiescent one.
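As an aside, the kernel exposes the same SysRq actions through the files /proc/sys/kernel/sysrq and /proc/sysrq-trigger, which can be handy when you can still reach the machine over ssh but the console is unresponsive. A sketch (all the writes require root, and the destructive ones are deliberately commented out because they affect the whole machine):

```shell
# Each single-letter write to /proc/sysrq-trigger fires that SysRq action,
# exactly as if it had been typed on the keyboard.
# echo s > /proc/sysrq-trigger   # S: sync all unwritten data to disk
# echo u > /proc/sysrq-trigger   # U: remount all filesystems read-only
# echo b > /proc/sysrq-trigger   # B: reboot immediately
cat /proc/sys/kernel/sysrq       # bitmask of currently enabled SysRq functions
```

Whether the keyboard SysRq combinations work at all is controlled by that same sysrq sysctl, so checking it is worth doing before you actually need the keys.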
Of course, all this knowledge would be less necessary if broken processes were kind enough to just exit all by themselves rather than tying up the CPU or taking down the system. It's hard to do much to distinguish a process that's just looping without using up resources from a process that's acting normally, but if a process is using up an unexpectedly large amount of memory, that makes for a sensible trigger to terminate it.
The ulimit shell builtin (supported by most shells nowadays apart
from the retro primitive ones) allows you to place limits on various
resources a process might try to consume. Its effect is limited to
the shell you run the command in and its descendants (which in
practice typically means a single terminal tab), which makes it pretty
safe to experiment with. I currently have a couple of ulimit
commands in my
/home/ais523/.bashrc file, which runs every time I
open a new terminal tab (because I use
bash as my shell), placing
limits on a couple of resources particularly likely to be used up by
a runaway process:
ulimit -Sv 3000000
ulimit -Sd 1000000
v is virtual memory, which includes all memory used by a program and
all memory it claims it intends to use (even if it hasn't actually
used it yet);
d is data segment size, which in practice includes all
statically allocated data in the program, plus (with glibc's default
malloc implementation) data allocated via
malloc requests for
small amounts of memory (it's more common to see an infinite loop that
leaks small amounts of memory every iteration, rather than large
amounts rarely, so a lower limit here can catch issues sooner). Both
of these are measured in kilobytes, meaning that we're talking
approximately 3GB and 1GB respectively here; much larger than is
needed by any command-line program I use on a regular basis, but still
small enough that a modern computer can handle the amount of memory
required without swap thrashing. (If I need to do something
particularly memory-intensive for some reason,
ulimit -Sv unlimited,
etc., turns the limit back off again.)
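A safe way to experiment with these limits is to set them in a subshell, so they evaporate when the parentheses close; the program name below is a placeholder, not a real tool:

```shell
# Limits set inside ( ... ) apply only to the subshell and its children.
(
  ulimit -Sv 3000000    # soft cap on virtual memory: ~3 GB (units are KB)
  ulimit -Sd 1000000    # soft cap on the data segment: ~1 GB
  ulimit -Sv            # verify the new limit: prints 3000000
  # ./possibly-leaky-program   # placeholder: would run under both caps
)
ulimit -Sv              # back outside: the original limit is untouched
```

This is also a convenient way to run a single suspect program under tighter limits than the ones in your .bashrc, without touching the rest of the session.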
The S, for "soft", allows you to undo the change without needing to
start a new process, but likewise means that malicious processes can
exceed the limit by turning it off again; it's thus intended to catch
the innocent mistakes that this blog post is mostly focused on, rather
than programs that might be trying to use up your memory on purpose.
Hitting these limits normally causes malloc to fail, which in turn
leads to the program exiting one way or another (whether a controlled
shutdown or a sudden crash). Very rarely, the memory allocation that
tips the process over the limit will be something other than malloc
(such as a stack allocation, most likely a VLA), in which case the
process has no real choice other than to crash (you can install a
handler for this by using a separate pool of memory reserved
specifically for the eventuality, but hardly anybody does because it's
unlikely and you can't do much about it anyway).
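You can watch this failure mode in a controlled way with dd, which allocates its bs= buffer with a single large malloc; under a deliberately small limit the allocation fails and dd exits with an error rather than running away (the sizes here are arbitrary, and this assumes GNU coreutils):

```shell
# ~195 MB virtual-memory cap; dd then cannot malloc a 400 MB buffer,
# so it reports "memory exhausted" and exits instead of running.
( ulimit -Sv 200000
  dd if=/dev/zero of=/dev/null bs=400M count=1 ) 2>/dev/null
echo "exit status: $?"    # nonzero: the allocation failed, so dd gave up
```

Rerun it with bs=1M and it succeeds under the same cap, which is exactly the behaviour you want from the limit: normal allocations pass, runaway ones die.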
Not really related to the subject of the article, but if you're
messing around with
ulimits, here's another really useful one:
ulimit -Sc 1000000
This means that if the process crashes, you'll get a dump (called
core) that can be loaded into a debugger to determine why. Not so
useful to the average user (which is why it's off by default), but
it's pretty valuable when it's the code you just wrote that crashed.
Note that you'll have to delete the
core files manually after use;
new crashes won't overwrite old ones.
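Something like the following; the program name is a placeholder, and the gdb line assumes gdb is installed (on some distributions a core_pattern handler such as systemd-coredump intercepts the dump instead of writing a local core file):

```shell
ulimit -Sc 1000000         # allow core files up to ~1 GB (units are KB)
ulimit -Sc                 # verify: prints 1000000
# ./my-crashing-program    # placeholder: a crash now leaves a file "core"
# gdb ./my-crashing-program core    # then "bt" prints the crash backtrace
# rm core                  # old core files have to be deleted by hand
```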
Typically, try these in order, skipping ones that are obviously irrelevant.
Warning: try these first, to eliminate them; otherwise, if this has happened, you're doing a lot of input, with potential side effects, blind.
Possible causes of this, and the fixes:
XON/XOFF flow control is on for some reason and someone pressed Ctrl-S: Ctrl-Q
The process crashed with terminal echo turned off, so no keypress has
a visible effect: Ctrl-C Return
The process has a working or default SIGINT handler: Ctrl-C
It's a command-line program at its command line: Ctrl-D
The process has a working or default SIGQUIT handler: Ctrl-\
The process has a working or default SIGTSTP handler: Ctrl-Z, then
if the process pauses,
jobs -p %% and Return to discover its process
ID, then kill it using the techniques in the next section
Warning: This one has to come last, because it prevents the others working:
The process has a working or default SIGHUP handler: close the terminal window or tab containing the process
Identify the process ID, using:
You know its name:
pgrep -a substring_of_name
It uses a lot of CPU or memory: top
It's a GUI program and has a window onscreen:
xprop | grep PID, then click the window
Then send the process a signal (PID here is the process ID):
The process has a working or default SIGTERM handler: kill PID
The process has a working or default SIGSEGV handler:
kill -SEGV PID
You don't mind losing autosave/crash dump information:
kill -9 PID
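Putting the pieces together, here's a sketch of the gentle-to-forceful escalation, with a long sleep standing in for the runaway process:

```shell
sleep 1000 &                 # stand-in for the runaway process
pid=$!
kill "$pid"                  # SIGTERM first: gives it a chance to clean up
sleep 1                      # allow a moment for a graceful exit
if kill -0 "$pid" 2>/dev/null; then
  kill -9 "$pid"             # still alive: SIGKILL, after which no cleanup
fi                           # is possible
```

In practice you'd get the process ID from pgrep (or top, or xprop) rather than $!, but the order is the point: only reach for -9 once the polite signal has demonstrably failed.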
Get a new console via:
The keyboard is still responding: Ctrl-Alt-F1 or Ctrl-Alt-F2
You have a serial terminal: connect via the serial terminal
You have a network connection: connect using ssh
then log in and follow the advice in the previous section. (Try other Ctrl-Alt-F key combinations in order to get back to where you were.)
The kernel can guess what process to kill: SysRq-F (i.e. Alt-(PrtSc,F))
You're logged in to a graphical desktop: Ctrl-Alt-BkSp or SysRq-K
The process has realtime priority: SysRq-N, then see previous section
You need to reboot the system: SysRq-R-E-S-U-B (no I because systemd)