Debugging tips: How to exit a process on Linux

Tags: development linux debugging | Thu Aug 27 16:51:18 UTC 2015 | Written by Alex Smith

One fairly common problem that comes up when writing a large program such as NetHack is the dreaded infinite loop. Unlike other kinds of error, which normally just cause a crash (possibly with an error message), an infinite loop (in which the program does the same thing over and over indefinitely) leaves you with a problem: the program is still running and still doing something, and it can sometimes be very hard indeed to stop it. I know that on occasion, I've lost half an hour or more of the time I had set aside for my programming hobby to an accidentally introduced bug that consumed all of memory and caused runaway swapping.

Infinite loops are also the bane of anyone who runs a public NetHack server. paxed, the administrator of nethack.alt.org (almost certainly the most popular server for NetHack in general, although it doesn't support NetHack 4), puts a huge amount of effort into getting rid of even unlikely or potential infinite loops when they're discovered or suspected, because once one starts, it's going to make the server laggy for every other player. I know that I've been personally hit by this on nethack4.org a few times, too, needing to fix these bugs on an emergency basis to prevent the server becoming unusable for players while I'm asleep and thus unable to exit the processes myself.

Normally, I develop NetHack 4 on Linux (although I do some amount of NetHack 4 development on Windows in order to ensure that it remains compatible for all the Windows players out there). One nice advantage of Linux for this is that it has a huge number of options for getting rid of your runaway processes (a process is, in most of the cases relevant here, approximately equivalent to a single invocation of a program), as a sliding scale from the contained but ineffective approaches at one end, to the reliable approaches that cause a lot of collateral damage at the other. One disadvantage, though, is that because there are so many, it can be hard to remember what they are – and when a runaway program is eating up your CPU, sometimes you need to react fast to stop it taking down the system. Thus, I decided to write this guide in the hope that its readers (and me, later on) will remember some of the techniques next time they need them.

I tend to explain everything as I go and give a lot of background, so this is pretty long. There's a cheatsheet at the end for people who just want a summary, or something to print out.

Before you start: configuring Linux for quick process kills

One of the tradeoffs involved in development is to do with the power available to the person sitting at the keyboard, versus the security of the system against an unwanted person sitting at the keyboard. In particular, you have to answer questions like "do I want people to be potentially able to kill the process that's busy keeping the computer locked, and thus get at my locked computer when I'm not in the room?"

Typically speaking, there isn't a huge risk of someone pulling off something like this because trying to accurately aim at a single process with the limited set of functionality available on a locked computer is almost impossible, and the less discriminate process-killing methods tend to get rid of anything that might be of use to an attacker too (sure, they could lose you any unsaved work you have, but they could also do that simply by turning the computer off, so you're not losing anything there). That said, although low, the risk is not nonexistent, which means that some of the more useful process kills for development are disabled by default on many Linux distributions nowadays.

There are a couple of "procfiles" (that control kernel settings) relevant to this. The main one is /proc/sys/kernel/sysrq, documented in man 5 proc. Given that the file is overwritten every boot, if you want a custom value for it, you'll want to change the relevant part of the boot process; many distributions use a program called sysctl to handle setting the values of procfiles during boot, in which case you'll want to look in /etc/sysctl.d for the relevant file. (On my Ubuntu system, the file I needed to change was /etc/sysctl.d/10-magic-sysrq.conf. I went for the value 246, which is pretty permissive, but disallows things like dumping memory directly from the kernel; the main potential security hole here is an attacker managing to hit a screen lock process with Alt-SysRq-f.)
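As a rough illustration of what a value like 246 means: the setting is a bitmask, and the sketch below (assuming bash) splits a value into the individual bits it enables. The function name `sysrq_bits` is made up for this example, and the bit meanings in the comments are taken from the kernel's sysrq documentation rather than from anything specific to Ubuntu.

```shell
# Print the feature bits that are set in a kernel.sysrq bitmask.
# Bit meanings, per the kernel's sysrq documentation:
#   2 console log level, 4 keyboard control, 8 process debugging dumps,
#   16 sync, 32 remount read-only, 64 signalling processes (incl. Alt-SysRq-f),
#   128 reboot/poweroff, 256 nicing of real-time tasks.
# The special values 0 (everything off) and 1 (everything on) are not
# bitmasks and are not handled here.
sysrq_bits() {
    local value=$1 bit out=""
    for bit in 2 4 8 16 32 64 128 256; do
        if (( value & bit )); then
            out+="${out:+ }$bit"
        fi
    done
    printf '%s\n' "$out"
}

sysrq_bits 246    # the value chosen above; prints: 2 4 16 32 64 128
```

To decode whatever the running kernel is currently using, something like `sysrq_bits "$(cat /proc/sys/kernel/sysrq)"` should work. In a sysctl.d file, the setting itself would normally be written as a line of the form `kernel.sysrq = 246`.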

There's also another setting you have to look out for on Ubuntu specifically. One of the changes that Ubuntu makes in order to make life harder for attackers is to prevent processes tampering with or looking at each others' memory or control flow directly if they don't have some sort of pre-existing relationship. The control for this is in /etc/sysctl.d/10-ptrace.conf, which explains the possible settings pretty well. In general, if you're doing a lot of development where you don't habitually run programs under a debugger but suddenly need to debug them retrospectively (this describes me quite well!), you'll probably want to turn this Ubuntu setting off; it might block a malicious program from looking at your passwords in memory but it won't stop it installing a keylogger, so it mostly only helps past the point where you're already in serious trouble.
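For checking which mode a system is currently in, here is a small sketch (assuming bash, and assuming the restriction is implemented by the kernel's Yama module, which is what exposes `/proc/sys/kernel/yama/ptrace_scope`). The function name is invented for this example, and the descriptions paraphrase the kernel's Yama documentation.

```shell
# Describe a kernel.yama.ptrace_scope value.
ptrace_scope_meaning() {
    case "$1" in
        0) echo "classic: any process may attach to another running as the same user" ;;
        1) echo "restricted: only a parent (or a declared debugger) may attach" ;;
        2) echo "admin-only: attaching requires CAP_SYS_PTRACE" ;;
        3) echo "no attach: ptrace attaching is disabled until reboot" ;;
        *) echo "unknown value: $1" ;;
    esac
}

scope_file=/proc/sys/kernel/yama/ptrace_scope
if [ -r "$scope_file" ]; then
    ptrace_scope_meaning "$(cat "$scope_file")"
else
    echo "Yama is not enabled on this kernel"
fi
```

Turning the restriction off corresponds to the value 0; attaching a debugger to an already-running process as an ordinary user generally needs that setting.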

Identifying the problem: classifying infinite loops

So, your program's gone into an apparent infinite loop. There are six real possibilities:

Don't jump the gun: is the program even running?

There are a few error states that look a lot like an infinite loop, but are actually caused by a messed-up display; it's not that the program is stuck, but rather that it's waiting for input normally, or crashed, and what's actually happened is that it's the view onscreen that's stuck.

There are two main causes of this, both of which tend to hit screen-oriented terminal-based programs (such as many ASCII roguelikes) the most because they're the programs with the largest tendency to mess around with the relevant codepaths.

XON/XOFF flow control

The first is to do with XON/XOFF flow control, a very old protocol that uses a software-based solution to preventing a connection being overloaded. The normal way to prevent one end of a connection between computers sending faster than the other end can receive is to use a dedicated wire to say "stop sending", thus dealing with the problem in hardware; this has been a solved problem since the days of RS-232.

For those computer users who can only vaguely remember what RS-232 is: you know how nowadays, most mice connect to a computer via USB? Before the invention of USB, they typically used a round "PS/2" connector. RS-232 is the standard they normally used before that became standard, so we're talking quite a long time ago by now; there are several different connectors but the most common was a relatively small (by the standards of the day) trapezium-shaped one, a bit like a smaller VGA port. And RS-232 gets the flow control problem right, so you have to go back even further, to days where people tried to make do with a minimum number of wires for their connections, to find systems where you need to do your flow control in software.

Anyway, despite XON/XOFF being a solution to a problem that has been a total non-issue for an incredibly long time in computer terms, hey, you might come across a system that needs it, right? So terminals and terminal emulators still have a configuration flag that lets you turn it on and off. I'm not opposed to the flag itself (I like random bits of computer history like that), but of course, its existence inevitably means that sometimes it somehow ends up being turned on by mistake. libuncursed, NetHack 4's rendering library, turns it off deliberately on startup when rendering to a terminal, but it's obscure enough by this point that many programs don't know about it and don't take precautions.

The problem comes with the way that XON/XOFF works: the rule is that if you send a DEVICE CONTROL 3, then the other side of the connection queues output locally (not sending it) until you send a DEVICE CONTROL 1. Worse, the other side is queueing its output; it isn't ignoring your input. So whatever keys you're pressing are having an effect, you just can't see it. It should be clear how dangerous this can be in a roguelike, where a few random keypresses while you're "trying to get the game to respond" can kill your character.

How likely is a DEVICE CONTROL 3 to find its way into your connection? Well, thanks to the utter ambiguity of terminal codes, we find that it has precisely the same code as Ctrl-S. This is the "save without confirmation" command in Dungeon Crawl Stone Soup (thus might well be pressed intentionally by a roguelike player who doesn't know about the XON/XOFF trap), and right next to Ctrl-D on a QWERTY keyboard, a commonly used command to kick down doors in NetHack. So it definitely happens. And the combination of Ctrl-S being pressed with a misconfigured terminal? Pretty rare that it happens to any individual person, but across all the games of NetHack being played, it's happened enough times that even just counting incidents where I was around to give help, I've lost count.

The antidote is, of course, to press Ctrl-Q, but if you don't know, that's almost impossible to guess. Because of the potential negative consequences if an XOFF is the problem, Ctrl-Q is normally my first suggestion for an apparently stuck program (especially if a network connection is involved, which makes a misconfigured XON/XOFF setting much more likely).

Of course, all this means that Ctrl-S and Ctrl-Q become terrible choices for keybindings for a roguelike, or even for programs generally (you don't want to encourage people to press Ctrl-S, and you don't want to react in any potentially dangerous way to Ctrl-Q because players might have to press it to recover from an issue). In both the NetHack 3 series and NetHack 4, both these key combinations are unbound despite the short supply of keybindings, because of the problems that they can cause. (Making Ctrl-Q "quit and delete your save file" is thus perhaps the worst possible binding choice for that command, even though it's an obvious one; this is a mistake I'd urge ASCII-in-terminal roguelike developers not to make, unless they're really confident in the terminals of their users.)

Of course, you can go further than this if you really want to drive the point home that these bindings are dangerous. I use a range of editors; in one editor I commonly use, emacs, "save" is Ctrl-X Ctrl-S. emacs is clearly very confident in its terminal handling abilities; perhaps with good reason, as almost certainly, it's one of emacs, vi or vim that holds the record for compatibility with the most terminals (although rogue, the original Roguelike, has a surprisingly good argument for being included on that list, seeing as it was a driver of terminal handling innovations at the time). Of course, this means that I often end up muscle-memorying a Ctrl-S into other editors when trying to save, and nano's reaction is pretty amusing:

  GNU nano 2.2.6             File: xoff-example                                 
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                        [ XOFF ignored, mumble mumble ]                         
^G Get Help  ^O WriteOut  ^R Read File ^Y Prev Page ^K Cut Text  ^C Cur Pos     
^X Exit      ^J Justify   ^W Where Is  ^V Next Page ^U UnCut Text^T To Spell    

Sure, says nano, your "respect XON/XOFF" flag was correctly turned off this time. But who knows where you might typo that in the future?
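If you'd rather check the XON/XOFF flag than wait to find out the hard way, the terminal setting can be inspected and cleared from a shell with `stty`. This is only a sketch (assuming bash and a typical Linux `stty`); the function name `flow_control_off` is made up for the example, and it's not a description of what libuncursed does internally.

```shell
# Report and, if needed, clear the XON/XOFF flag on the terminal attached to stdin.
flow_control_off() {
    if ! [ -t 0 ]; then
        echo "stdin is not a terminal; nothing to do"
        return 0
    fi
    # stty -a lists "ixon" when the flag is set and "-ixon" when it is clear
    if stty -a | grep -q -- '-ixon'; then
        echo "XON/XOFF already off"
    else
        stty -ixon
        echo "XON/XOFF turned off"
    fi
}

flow_control_off
```

The setting belongs to the terminal rather than to the shell, so it has to be applied in each terminal (or from a shell startup file) rather than once per system.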

Crash with local echo disabled

The other common cause also has to do with a terminal setting, and a program responding normally but with no visible effect. However, the other details are pretty different; this time, it's a terminal setting which is being used intentionally rather than a compatibility setting from decades ago, and a different program from the one we thought we were running.

The trigger for this is pretty mundane: the program you're using does an outright abnormal-termination crash (segfault, abort(), exit(EXIT_FAILURE), and the like), but doesn't have the opportunity to reset the terminal settings because it crashed so suddenly. (There are various things that programs can do to mitigate this, but they normally don't, and they're limited: a SIGKILL out of nowhere is completely unblockable, although very rare except in response to explicit action by the user. Perhaps I should add some sort of mitigation code for this to libuncursed; there are some technical obstacles to this like needing async-signal-safe terminal status updates and dealing with competing segfault handlers, but nothing insurmountable.)

The result is that the user is dumped back into their shell, but the screen's all messed up, and user input has no visible onscreen effect. (Any output the shell produces in response does have a visible onscreen effect. Unfortunately, it may well be in an unexpected place, and in black-on-black or a similarly unnoticeable colour scheme. NetHack 4 tends to output in purple a little above the middle of the screen with this type of crash; I've seen it enough times by now that I recognise it, but given how that area is mostly purple anyway it'd be easy for a user unfamiliar with this type of crash to miss.)

As in the previous case, keys being pressed are having an effect, but again, just not a visible one. This time they're being sent to the shell, so anything you're typing is being interpreted as shell commands. Luckily, random input normally doesn't do much when interpreted as shell commands (the worst that I'm aware of having happened is a bunch of files being created with stupid names), but there's always the risk of a particularly dangerous command being spelt out, so you'll want to deal with this possibility early to be on the safe side. (Not to mention that Ctrl-C is the first thing you were going to try anyway.)

The fix is Ctrl-C Return reset Return:

Ctrl-C      empty current shell input line
Return      execute current shell input line
reset       reset all terminal settings to default
Return      execute current shell input line

(Actually, this is a pretty good trick to know in general for dealing with messed-up terminals.)

The Ctrl-C and first Return are there to get rid of anything that you might have typed by mistake while trying to get the game to respond, and any text that might have been spammed with mouse movement (if the process ends suddenly like this, it doesn't get to turn mouse input back off again). So you have to keep your mouse still while doing this! Technically the first Return shouldn't be necessary, but it sometimes seems to be; I haven't figured out why yet. reset is a program that ships with libncurses (and thus will be on pretty much any Linux system); note that it deletes your scrollback, but in this state, you're not likely to have usable scrollback anyway.

As for why the problem happens in the first place, it's because in roguelikes, you nearly always disable local echo (you don't want moving east to write actual ls or 6s on the screen), and the sudden crash means that it never gets turned back on again. A pretty simple problem, but it can really catch out unprepared people.
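The save-and-restore discipline that a crash skips can be sketched in shell (assuming bash; the function name `with_echo_off` is invented for this example). It snapshots the terminal settings, turns local echo off, runs a command, and then puts the settings back. As with any such mitigation, it does nothing if the shell running it is itself killed with SIGKILL.

```shell
# Run a command with local echo disabled, restoring the saved terminal
# settings afterwards even if the command fails.
with_echo_off() {
    if ! [ -t 0 ]; then
        "$@"              # no terminal on stdin: nothing to save or restore
        return
    fi
    local saved status
    saved=$(stty -g)      # opaque snapshot of the current settings
    stty -echo
    "$@"
    status=$?
    stty "$saved"         # put everything back the way it was
    return "$status"
}

with_echo_off echo "this ran with echo disabled"
```

If the settings have already been lost, `stty sane` is a lighter-weight alternative to `reset` that restores reasonable defaults without clearing the screen.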

The first resort: asking the terminal nicely

So, let's assume that we have an apparent infinite loop, the program at fault is in fact running, and it's in our code rather than the kernel. It might just be a bad choice of algorithm, but if it is we may as well treat it as infinite. We also want to respond fast; it might or might not be leaky, and if it is leaky, we don't know how long it'll be before swap thrash doom starts; it might be anywhere from days to seconds. However, assuming that our computer doesn't reboot instantly and that we probably have some sort of state (unsaved files, open windows, that sort of thing) that we care about, we want to try contained methods first.

If the program's running from the foreground of a terminal window, we can start by sending various "stop running" key combinations to it. This is the case for programs that run in terminals, obviously. Perhaps less obviously, it's also typically the case for graphical programs that run in their own window, so long as you started them from a terminal, you didn't background them (typically with & on the command line or Ctrl-Z later), and the program didn't daemonize itself (not normally worth worrying about, I can't think of a reason why a GUI program would want to daemonize and in practice they basically never do).

SIGINT

The most basic way to exit a misbehaving program is with Ctrl-C. By default, this sends the SIGINT signal, which tells programs to exit (and exits them crash-style with no debug dump, if they have no specific handler for it).

There are a ton of potential reasons why this wouldn't work:

All of these are pretty likely and reasonable, too. The reason is that a crash-style exit, with no confirmation, upon a single easily typoed key command is something that programs really don't want to happen (especially with typical roguelike save mechanics where doing so would lose you your entire game, but even in other cases). Given how well-known Ctrl-C is, pretty much all sufficiently large programs do something to stop this happening.

Ctrl-C is still well worth trying, though. Even though programs nearly always take steps to change its default implementation, its intended function is sufficiently well-known that many try to preserve the meaning. Perhaps this is via adding a handler that converts it to a normal-style exit, via adding a confirmation, or via using it as a softlock escape code. In other words, most programs will at least tell you how to quit in response to Ctrl-C. (vim is a fun example of this: its entire response to Ctrl-C is to print a message telling you how to quit.)

Unfortunately, though, the fact that SIGINT has a lot of safe-shutdown logic associated with it means that it's also normally the codepath most vulnerable to getting stuck in a loop itself. Perhaps it's waiting for a "safe place" in the code to do a shutdown (NetHack 4's Ctrl-C handler works like this, for example); an infinite loop could mean it never gets there. Perhaps it calls into the same buggy code that led to the loop in the first place.

In other words, you typically can't expect this to work on a truly broken program, but it rarely hurts to try.
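To make the "safe place" idea concrete, here's a toy sketch in bash. It isn't NetHack 4's actual handler, and the names `run_turns` and `quit_requested` are invented. The signal handler only records that Ctrl-C was pressed, and the main loop acts on the request at one fixed point per iteration. The example sends SIGINT to itself to stand in for the keypress.

```shell
# Deferred Ctrl-C handling: the handler only records the request, and the
# loop acts on it at a point where shutting down is safe.
quit_requested=0

run_turns() {
    local turn=0
    quit_requested=0
    trap 'quit_requested=1' INT
    while (( turn < 5 )); do
        turn=$(( turn + 1 ))
        # Simulate the user pressing Ctrl-C during turn 2.
        if (( turn == 2 )); then kill -INT "$BASHPID"; fi
        # The "safe place": checked once per turn.
        if (( quit_requested )); then
            trap - INT
            echo "quit requested; saving and exiting at turn $turn"
            return 0
        fi
    done
    trap - INT
    echo "finished without interruption"
}

run_turns
```

The weakness is visible in the structure: if the body of the loop never finishes a turn, the check is never reached, and Ctrl-C appears to do nothing.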

Flushing stdin

Ctrl-C is well-known as a "universal exit" code for programs. There's actually a subset of programs (command-line-interface terminal programs) which have an even more universal exit code: Return Ctrl-D (i.e. Ctrl-D at the start of a line). By default, Ctrl-D is interpreted as "flush standard input", causing any partial line entered so far to be sent to the program you're using (thus this won't work with roguelikes, which don't use line-at-a-time input for obvious reasons). If you press it at the start of a line, there isn't a partial line, so you send zero characters to the program, a state that looks identical to end-of-file (unless you try to read again, and why would you do that after end-of-file?).

The vast majority of terminal-based command-line-interface programs on Linux know about the "Ctrl-D at start of line = exit" convention and will exit in response to this. Even the ones that didn't intentionally have it in mind during implementation will normally exit anyway; after seeing an end to their source of commands, there's clearly nothing more that can be done, and they'll often fall into error-handling code (which normally exits a command-line-interface process):

Welcome to Adventure!!  Would you like instructions?
user closed input stream, quitting...

As for how this applies to infinite loops, clearly it won't help if the program isn't reading input, but if it's just softlocked, it's normally pretty effective at jumping it out of its current state. Command-line-interface programs normally don't bother with an explicit softlock escape code, because they have Ctrl-D.
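A minimal sketch of why this tends to work even in programs that never planned for it (assuming bash; `command_loop` is a made-up name): a typical line-at-a-time command loop stops as soon as a read reports end of input, whether that's a closed pipe or Ctrl-D at the start of a line.

```shell
# A line-at-a-time command loop: it keeps going until a read hits end of input.
command_loop() {
    local line
    while IFS= read -r line; do
        echo "you typed: $line"
    done
    echo "end of input, quitting"
}

# Feed it two commands through a pipe; the pipe closing plays the role of Ctrl-D.
printf 'look\ninventory\n' | command_loop
```

Run interactively (just `command_loop` with no pipe), the same loop keeps echoing lines until Ctrl-D is pressed at the start of a line.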

SIGQUIT

Ctrl-C is very well known, but there's also a very similar effect that's considerably less well-known. SIGQUIT, whose default binding is Ctrl-\, was designed to be identical to Ctrl-C except that it's a true crash-style exit by default (with debug dumps if they're turned on, and all that sort of thing), rather than the Ctrl-C reaction which is just mostly crash-style by default.

Anyway, all the comments under Ctrl-C would apply to Ctrl-\ too, but with two big exceptions: it's considerably less well-known, which rather changes the whole dynamic; and it's not needed as a softlock escape code (because Ctrl-C exists already). Many developers will do something to handle or block Ctrl-C (the key combination) or SIGINT (the signal it sends) or both; putting the same effort in for Ctrl-\ or SIGQUIT is much rarer (although it happens).

This means that Ctrl-\ is, in practice, a surprisingly good command for intentionally crashing a process when Ctrl-C doesn't exit it. The downside is that considerably fewer programs will try to do cleanup, saving open files, giving confirmations, etc. on pressing it, meaning that you don't want to use it as your very first option; perhaps you could have exited the infinite loop and saved your save file at the same time. The upside is the same thing as the downside; considerably fewer programs will try to do anything fancy, meaning it's less likely to be broken.

Programs (like NetHack 4, via libuncursed) that do handle Ctrl-\ normally use the same codepath for it as Ctrl-C. The reasoning is typically that the odds of the program being stuck in a loop are lower than the odds of someone hitting the combination by mistake, and besides, there's still a whole blogpost of combinations to try to get rid of the loop.

SIGTSTP

So if the reason that programs tend to block key combinations that induce crash-style exits is that they're normally typos rather than alternative methods of exit when the normal method is blocked by a bug, what about a key combination with lower consequences for typoing it? For example, it could just pause the program, allowing it to be crashed or resumed at the user's leisure once it's stopped using all the CPU cycles.

As you might have guessed, there is such a key combination. The signal in question is called SIGTSTP, and the default keybinding is Ctrl-Z (a binding that's by now sufficiently well-known that even some GUI programs have started implementing it, although 'undo' is still a more common interpretation). Although, like Ctrl-C and Ctrl-D, there's pretty high awareness of it among developers, there's much less of an incentive to do complex things in response; typoing it is easily reversible (fg), and it serves as a reasonably safe way to indirectly crash-kill a process (first pause it so that it stops chewing up CPU and so that you have access to a shell, then use that shell to crash-kill the process from outside).

Actually exiting a process via SIGTSTP is a little more involved than in the previous example. You basically use Ctrl-Z to pause it, then the techniques in the next section to exit or intentionally crash it from there. The difference is that (in most shells) you can reliably find out the process ID for the last thing you successfully SIGTSTP'ed with a single command:

jobs -p %%

Admittedly, I had to look it up. (You can also use %% as a substitute for the process ID as an argument to kill, so long as you haven't done job control manipulation since.)
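Here's a short sketch of both uses of %% (assuming bash). A background `sleep` stands in for the program you'd have paused with Ctrl-Z, since a script can't press keys; everything else is the same as at an interactive prompt.

```shell
sleep 300 &                  # hypothetical stand-in for the paused program
pid=$(jobs -p %%)            # process ID of the current job
echo "current job has PID $pid"

kill -KILL %%                # %% used directly as the argument to kill
wait "$pid" 2>/dev/null      # reap it so the shell's job table is tidy
```

At an interactive prompt, the first line would instead be the stuck program itself followed by Ctrl-Z; the remaining lines are unchanged.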

SIGTSTP isn't a magic bullet for exiting processes, because many processes still need to do handling for it. A program might block it outright for interface reasons (even though you can simply resume the program with fg, that doesn't mean that the end user knows that, and if they assume the program has crashed they may try to run it recursively and cause Bad Things to happen). There are also valid reasons to handle it; NetHack 4 (via libuncursed) handles it in order to put the terminal settings back to where the user expects them (most users won't want mouse movement to spout text into their terminal, for example).

SIGHUP

We're now starting to get into the realm of "asking nicely, with potentially destructive side effects". SIGHUP is another signal in the same basic category as the other signals we've seen in this section; it's a request to exit that can be blocked or handled. However, the usual way to send it is irreversible: you close or disconnect from the terminal the process is running in. (Alternatively, you can send it using the usual techniques for killing a process using a separate terminal, which are explained later in this blog post.)

SIGTSTP was different from the other signals in terms of developer reactions because failing to handle it isn't normally a big deal (if you aren't taking over the terminal settings, that is). SIGHUP is different in a different way: you're (under normal circumstances) not going to get any more information from the user, so this is no time for confirmation prompts; whatever you're going to do, just do it. This makes it a particularly good way of exiting programs which are stuck trying to do something interactive for some reason. (Unfortunate exception: if the reason they're stuck trying to do something interactive is that the terminal doesn't exist and the program assumes that it does. This is probably the #1 most common source of tight infinite loops in recent NetHack history; it's a surprisingly easy mistaken assumption to make that you can just repeatedly ask questions until you get a valid answer, but a missing terminal isn't going to give you one.)

SIGHUP is also often well worth a try because it's sufficiently different in meaning from the other termination signals that it often has a different codepath, giving you a second try to find a codepath that works to exit the process. For example, in NetHack 3.4.3, the SIGHUP handler tries to assemble whatever's in the game's memory into a working save file (a much better outcome than destroying the game like the SIGINT handler does, although a rather less reliable one and the source of known exploitable bugs). In NetHack 4, the game tries to navigate its own menus to produce a controlled shutdown, and crash-kills itself if it can't manage to do so within a relatively short time limit (which could happen in the case of a softlock); this is thus a rather more reliable way to exit the program than Ctrl-C, which attempts to open a menu that takes further user input.

The clear downside, of course, is that after doing this, you certainly don't have the program in the foreground of a terminal any more! So this has to be the last thing you try in this section. Also, it can be hard to work out whether it worked or not (you won't have anywhere to see messages that might have been produced), and if it doesn't work, it can be hard to identify the process you were trying to kill. (Although if you don't have anything else running at 100% CPU, that typically gives you at least one reliable giveaway. As always, the problem is as to whether you can exploit it before the swap thrashing starts.)

Sending signals manually: if you have a working shell

Suppose that the techniques in the previous section aren't useful, either because you don't have a terminal, or because the process is overriding the keystrokes you wanted to use, but that the system is still in relatively good shape right now: you can start new programs and use other programs, it's just that one process is stuck. At this point, you can just open up a new terminal window to get to a shell prompt (or use an existing one that's running a shell), and use that shell to send signals to the process in an attempt to exit it.

The most basic way to do this is using the kill command. This command takes a process ID as its argument, and sends a signal to that process. For example, say that the stuck process in question has a process ID of 12345:

kill 12345        # this command sends SIGTERM, by default
kill -HUP 12345   # this command sends SIGHUP, as requested
kill -STOP 12345  # this command sends SIGSTOP, as requested
# and so on

There's a pretty wide range of signals you could use, and most of them will by default end the program. You can specify the signals either by name or by number (the names are normally easier to remember, but if you happen to have the numbers memorized, they can be faster to type). In addition to the signals mentioned in the previous section, here are a few of the more interesting ones that you can send using the kill command:

That handles the signal number part of kill, but what about the process ID? In most cases, you won't happen to know what it is, so here are some methods you can use to find out:

Being able to find the process ID is also useful if you want to debug the problem, rather than just make the process go away. gdb --pid and a process ID will pause the process in question (assuming it has permission to debug that process: on Ubuntu, it probably won't if you haven't changed your settings as described above), and also allow you to debug it from there (and you can subsequently use the k command to kill the process if you want it to end, which I think simply sends a kill -9).

Of course, if you need to kill a stuck process but don't have the permission to do so (e.g. because the process is owned by another user or because it's a service), you can simply gain permission via the normal means (su, sudo, etc.) alongside your kill command. As always when using elevated permissions, be careful that you know what you're doing and that you're entering the right commands: part of the reason the permission checks are there is to prevent you accidentally taking down or corrupting the system, and when overriding the checks, you could well end up killing a critical system utility and making things worse.
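On the subject of specifying signals by name or by number: if you can only remember one form, the shell can translate to the other. This sketch assumes bash's built-in `kill`. The numbers shown are the ones used on x86 and ARM Linux, and some other architectures number a few signals differently.

```shell
kill -l TERM     # prints 15: the number for SIGTERM
kill -l HUP      # prints 1
kill -l 9        # prints KILL: the name for signal 9
kill -l 137      # prints KILL: 137 is the exit status of a process killed by signal 9
```

The last form is handy when a script reports an exit status above 128 and you want to know which signal was responsible.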

Piercing through abstractions: trying to get a working shell

The most likely reason that the above techniques would fail is that, because X has locked up or because of swap thrashing, the system's UI isn't responding and thus you can't open a terminal window (or use an existing terminal window that's running a shell) and enter shell commands into it.

The least damaging resort, therefore, is to try to find a shell somewhere which is responding. The most accessible place is known as the "virtual terminals".

Another quick history lesson. UNIX computers used to be mainly used via physical terminals, which were a separate device from the computer itself and connected to it in much the same way as a printer or a keyboard. Nowadays, the usual way to replicate that functionality is using graphical programs like xterm that emulate the physical terminals; these use an abstraction known as a "pseudoterminal" to do their work, and need further layers of abstraction to display their windows onscreen, communicate with the user, and so on. In between came the "VGA console", which is basically what DOS uses in order to display text to the user (and which can even nowadays be seen on many systems during the early boot process); and the "framebuffer console" which is basically a part of the kernel that has the same functionality as the VGA console but uses the kernel's graphics code. These are collectively known as "virtual terminals", because they do the same job as terminal hardware, but without requiring a physical terminal.

At one point, using virtual terminals would have been the main method of using a Linux-based computer. (It still is if for some reason you're using the computer locally, i.e. not over a network, and also haven't installed any graphics software like X. This configuration is very unusual, though; most systems that don't need graphics are servers, and most servers are used over a network using programs like ssh, rather than via being physically present at the server.) They still exist, though, and are nowadays mostly used in emergencies (either due to issues during the boot sequence that happen before X has loaded, or because X has frozen). Many programs work in them, though, including NetHack 4.

The method you use to switch to a virtual terminal is by holding Ctrl and Alt and pressing one of the F keys (e.g. Ctrl-Alt-F1). There are nearly always several virtual terminals available; on my laptop, typically I have six. Sometimes there's also one dedicated to boot messages, although that seems less common nowadays; and when you're running graphics software, that takes over a "virtual terminal" of its own (meaning that after pressing Ctrl-Alt-F1, I can get back to my graphical desktop using Ctrl-Alt-F7, because it gets the next available number after the first six).

Once you're at a virtual terminal, all you have to do is log in (using your username/password pair, as normal), and you'll have a working shell. You can then kill processes in the normal way; you won't have a GUI but that doesn't really matter because you have a working command line. Unfortunately, during swap thrashing, this process can be really slow (just displaying the password prompt after the username is entered can take over a minute), but it does normally work eventually. You can log back out of a virtual terminal using the exit or logout command, or (unsurprisingly, given the discussion earlier) Ctrl-D at the command prompt when no text is entered.

If you have reason in advance to think that you might need to kill a process in a hurry during swap thrashing, you could always try logging in on a virtual terminal pre-emptively (and perhaps even starting top pre-emptively). That way, you will need to run considerably fewer commands once the thrashing starts, meaning that you can end it much sooner.

I should also mention that in addition to the virtual terminals, the original terminal system also exists, the "serial terminal", and is even lower-level (it even works during early boot). This requires a separate terminal system connected to your computer. It's actually pretty easy to get such a terminal system nowadays – although physical VT100s are rare, software for emulating their functionality, like HyperTerminal on Windows or Minicom on Linux, is readily available – but modern computer hardware rarely has the serial port needed to make the connection. (You can get USB serial ports, but they need a lot more work from the kernel to handle.) From memory, the cables also tend to be quite expensive.

Finally, if you have a working network connection, a second computer to use it, and if your firewall isn't too upset at the idea, you can use a program like ssh to get a terminal on your computer over the network. This is the way Linux servers are most commonly administered nowadays, and although less usual, it works on desktop/laptop/mobile too. I'm not sure whether ssh is more or less badly affected by swap thrashing than the virtual terminals are; I've never tried this method myself (both because I rarely have a second computer handy that's networked with the one I'm using, and because my firewall is set to disallow inbound ssh connections), but other people have reported reasonable success with it.

Imprecise methods: killing a process with collateral damage

Suppose that you have a particularly hard crash, or that swap thrashing is so bad that you feel powerless to even attempt to log in on a separate console (or don't have the time to type complex commands at swap-thrashing speed). Perhaps it's more important to get the system back into a usable state now even if you lose other processes in the process. You can try some of the following techniques:

Asking the kernel to guess

In the early days of multitasking operating system design, keyboard manufacturers realised that users would need some way to communicate with the operating system: on a single-tasking operating system, a program can take over the entire keyboard and do what it likes with it, but that would mean that there would be no way to switch to a different process.

As most computer users will be aware, the solution to this problem that ended up being adopted was to add global controls, like the Alt-Tab key combination and clicking on the taskbar, that individual programs normally don't interfere with. Windows also decided to adapt the Ctrl-Alt-Delete combination (previously used for rebooting the computer) into a key that couldn't be intercepted by applications and could be used to forcibly quit them (among other things).

However, the keyboard manufacturers had a different solution in mind. If your programs are already using all the keys on the keyboard, then a simple solution is to add another key, that's reserved for communicating directly with the operating system kernel. The key in question exists on most modern keyboards, and is called SysRq (presumably standing for "system request"). In order to remove the need to add an extra physical key, it's normally a modifier/key combination, and in particular is normally Left Alt-PrtSc (and the PrtSc key may or may not also have SysRq written on it; on most older keyboards it does, but modern keyboards tend to leave the label off). Laptop keyboards might occasionally have it somewhere else. (Something I learned while writing this blog post is that there are various bindings for it on non-PC hardware, too, e.g. SPARC apparently uses Alt-Stop and PowerPC apparently uses F13 or Alt-F13, which is sometimes labeled as PrtSc.)

On Windows, Ctrl-Alt-Del is sufficient for this purpose, and so most computer users don't have much of an idea of what SysRq is for (Windows will interpret Left Alt-PrtSc as a literal Alt-PrtSc, and take a screenshot of the current window). Linux, however, uses the SysRq key with its original intended meaning (although many distributions disable much of the key's functionality for security reasons; see the configuration advice earlier in this section). You have to hold down Alt continuously while typing the combination (presumably as security against typos), which on many keyboards means that you have to let go of SysRq because they can't handle that many keys being pressed at once.
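To see which SysRq functions a particular system allows from the keyboard, you can decode the value in /proc/sys/kernel/sysrq. The following is a rough sketch, assuming the bit values given in the kernel's SysRq documentation (0 disables everything, 1 enables everything, and otherwise 16 is sync, 32 is remount read-only, 64 is signalling processes, 128 is reboot/poweroff, 256 is resetting realtime priorities). The helper name sysrq_allows is made up for this example.

```shell
# sysrq_allows MASK BIT: succeed if the given function bit is enabled.
# (Hypothetical helper; the bit values are taken from the kernel docs.)
sysrq_allows() {
    if [ "$1" -eq 1 ]; then
        return 0                      # 1 means "everything enabled"
    fi
    [ $(( $1 & $2 )) -ne 0 ]          # otherwise treat the value as a bitmask
}

# Read the current setting; fall back to 0 if the file is not available.
mask=$(cat /proc/sys/kernel/sysrq 2>/dev/null || echo 0)

if sysrq_allows "$mask" 64; then
    echo "SysRq-E, SysRq-I and SysRq-F should be accepted (mask $mask)"
else
    echo "process-signalling SysRq keys are disabled (mask $mask)"
fi
```

For instance, a mask of 176 (128 + 32 + 16) permits sync, remount and reboot, but not the process-killing combinations.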

One particularly useful combination here is SysRq-F, which kills one process, asking the kernel to guess which one you want to kill (based mostly on how much memory pressure it places on the system). (This usually has to be typed as Alt-(PrtSc,F), i.e. you continue holding down Alt but let go of PrtSc before pressing F.) In the case of a leaky tight infinite loop, it is nearly always very obvious to the kernel which process you mean to kill, so you can expect it to get the right target first time. Of course, there's risk to doing things this way: it might hit the wrong process, in which case you just lost some work or destabilised the system slightly. It still seems like one of the best options for dealing with swap thrashing, though.
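If you're curious which process the kernel would be likely to pick, you can get a rough preview while the system is still healthy. As far as I know, SysRq-F invokes the kernel's out-of-memory killer, and the score that killer uses is exposed as /proc/PID/oom_score; the sketch below lists the highest-scoring processes on that assumption. The function name oom_candidates is invented for this example, and the real choice at the moment of the keypress may differ.

```shell
# Print "score pid name" for the five processes with the highest OOM score.
# (Hypothetical helper; relies on Linux's /proc layout.)
oom_candidates() {
    for d in /proc/[0-9]*; do
        # Processes can vanish while we look at them, so ignore read errors.
        score=$(cat "$d/oom_score" 2>/dev/null) || continue
        name=$(cat "$d/comm" 2>/dev/null) || continue
        printf '%s %s %s\n' "$score" "${d#/proc/}" "$name"
    done | sort -rn | head -n 5
}

oom_candidates
```

The process at the top of the list is the most likely victim, so if it's something you'd hate to lose, SysRq-F may not be the right tool on that particular day.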

The graphical logout

Over 99% of Linux laptop/desktop users will be using a graphical desktop over 99% of the time. This means that all the programs they start, whether text or graphical, will have been started via the graphical desktop, and thus killing the desktop and all its descendants will necessarily kill the process they care about (because it kills every process they care about). This is pretty much equivalent to logging out, except that the processes in question won't get a chance to do a clean shutdown. It's a pretty bad option, really, but it's still better than shutting down the entire system, as you won't have to go through the boot process; you get right back to the login screen, which can be a major time gain (especially in corporate-style configurations in which the boot process has a lot of network connectivity, installing updates, and the like).

The traditional key combination for doing this was Ctrl-Alt-BkSp, which was standard and worked for years. On many Linux systems, it still works. However, more recently, it's often been disabled, for two reasons. The first is relatively simple: it was deemed too easy to press by mistake. It's not that easy to press by mistake, but the consequences are really wide-ranging and drastic for something that isn't totally unreasonable as a typo. Emacs users might enjoy reading the documentation of normal-erase-is-backspace-mode, which lists keys that are rebound: two of the bindings in question are C-M-Delete and C-M-Backspace, which seem reasonable until you realise that on modern keyboards, the usual way to type C- is Ctrl- and the usual way to type M- is Alt- (you could use Esc- but that's much less common). (Emacs finally realised how unfortunate these bindings were in version 22.1, and removed them, but the documentation hasn't been changed to match yet.)

The other reason that the Ctrl-Alt-BkSp binding is typically disabled is that it was realised that another pre-existing binding did the same thing. SysRq-K (i.e. normally Alt-(PrtSc,K)) forcibly kills every program on the current virtual terminal, i.e. every program that was started via your graphical desktop, if that's what you're looking at at the time. This was suggested as a much harder-to-typo binding that achieves basically the same thing. Unfortunately, there's some controversy about this particular binding (including what its intended use is, and how secure it is), meaning that there's conflicting advice about always using it (including before a login), never using it, manually configuring it to something else, and setting up the system in such a way that it doesn't actually work. (Not to mention, what its name is; it's documented as "SAK" but the acronym can be expanded to "Secure Access Key" (which some people claim is a misnomer), or "System Attention Key" (which is kind-of nondescriptive of what it does).) However, I think it's widely agreed that when it works, it's good at killing processes.

As a side note, graphical logouts are also useful when you need to log out in a hurry and some program or other is taking its time during the log out process. If you have to be at a meeting in 1 minute's time, i.e. have to leave now in order to get there in time, and don't want to leave the system logged in while the log out process runs in case someone manages to look at your files in the meantime, a graphical logout (on systems that allow it) is going to be a better option than cutting the power to the computer.

Overriding realtime priority

Another SysRq combination that's sometimes useful for getting rid of an infinite loop is SysRq-N. This is to do with "realtime priority" processes: the special property of such processes is that they can demand CPU cycles whenever they want them and take as many as they need before allowing other programs to run. While useful when writing a program that needs special access to the hardware (it's basically the next level down from writing a kernel module in terms of what you can do with interacting with the hardware), this obviously causes particular danger when such a process gets into an infinite loop, as it then needs all the CPU cycles available and thus no process (other than possibly other realtime priority processes) will get to run.

In most cases, this probably hasn't happened. At least in theory, code reviewers look really carefully for potential infinite loops in realtime priority processes, just as they look really carefully for potential security bugs in setuid executables; because the code is using a special exemption from one of the standard assumptions made by the operating system, people will pay special attention to ensure that it doesn't abuse it. The exception is if you happen to be a programmer working on a realtime priority process yourself, in which case you probably know you are.

The SysRq-N combination is basically a "last ditch" attempt at bringing a realtime priority process under control. It removes realtime priority rules from all processes, forcing them to take turns on the CPU like normal processes do. Of course, those processes were presumably using realtime priority for a reason, so doing this can be expected to cause them to stop working reliably. Hopefully, none of them were doing anything critical for the system's survival.

Working on a realtime priority process is a relatively rare activity, and one that's well worth taking precautions against infinite loops. Here are two things you can do without having to resort to something as global as SysRq-N:

Manually rebooting / shutting down the system

The very last resort is to shut down the system. You could just cut the power, but that comes with risks such as filesystem corruption; doing a clean shutdown is better if possible.

Obviously, at this point, you're not going to be able to shut down the system via the normal method; rather, you have to guide the kernel through the steps of a manual shutdown yourself. As you're communicating with the kernel, these are all SysRq-combinations.

Here's the traditional recommended way to manually shut down a system under Linux, and the reasons why each step is performed. Note that many of them may take some time to work correctly, so you have to wait several seconds (at least) between most of them, and until there's no hard disk activity:

  1. SysRq-R: forcibly take control of the keyboard. Sometimes allows you to see what you're doing. Sometimes allows you to Ctrl-Alt-F1 when you couldn't before. This is all in theory; I don't think I've ever seen a benefit from it. At least you don't have to wait long after it; it shouldn't be a time-consuming process.

  2. SysRq-E: send SIGTERM to every process (except init, which is responsible for startup and shutdown). This probably won't have much of an effect on the actual offending process, but it does have the advantage of potentially producing useful autosave data from any other process you happened to have open, meaning that you don't lose your work there.

  3. SysRq-I: send SIGKILL to every process, except init. Your system is unlikely to be in a very usable state after that. At this point, traditionally at least, the idea is that the system would be in a quiescent state without anything trying to modify it, meaning that you don't have to worry about things interfering with your clean shutdown.

  4. SysRq-S: write all unwritten data to disk. This is a pretty important step for protecting against filesystem corruption; the hope is that it will at least leave your disks in a consistent state.

  5. SysRq-U: make all disks read-only. No point using the computer without a reboot from this point, because you wouldn't be able to save anything. Along similar lines, useful during a shutdown, because nothing can be changing at the point of the shutdown.

  6. SysRq-B (reboot) or SysRq-O (shutdown). The last part of the shutdown process. It's pretty scary just how instant this is; you press it and your computer turns off. You've done the rest of the clean shutdown process already (init does basically this when shutting down, with the main difference that it takes more care to do things in the right order), but it still comes as an abrupt surprise.

SysRq-R-E-I-S-U-B is seen as a shutdown code in many Linux tutorials, and I hope that this explanation makes it a bit clearer what it does and why it was traditionally recommended. (It's relatively easy to remember because reisub spelt backwards is busier, which is a real word.)

The problem is that nowadays, Linux implementations don't use sysvinit, a traditional (and very simple) init implementation that was standard even just a couple of years ago, but rather more complex implementations that handle things like parallel dependencies during startup and restarting crashed processes. (upstart was popular for a while, but nowadays many Linux distributions are standardising on the highly controversial systemd.) With both upstart and systemd, things go very differently from how they used to. The difference starts at step 2: SysRq-E (and SysRq-I) still request termination of (or outright kill) every process but init, but now init sees all these processes that "should" be running but aren't, and tries to restart them. The result is that you get a kind of weird hybrid between a shutdown and boot-up process.

Worse, in some cases (such as X), the fact that the process is unexpectedly not running is misinterpreted as a crash (despite the use of SIGTERM, which is one of the least likely signals to be involved in a crash as opposed to sent intentionally). This means that you get the graphics troubleshooter loading up (and at least on Ubuntu, the graphics troubleshooter is kind-of buggy). Ironically, the troubleshooter itself got stuck in an infinite loop (most likely waiting for input that it couldn't receive) that prevented Ctrl-Alt-F1 working when I was testing out the manual shutdown process under systemd, meaning that SysRq-E was my best option to exit it (as SysRq-K seemed to not be working for some reason, and SysRq-F would have been too unpredictable with no apparent memory leak; perhaps the computer was somehow in a mode where no virtual terminals existed).

So what happened in my shutdown attempt was something like SysRq-R, SysRq-E (which brought me to the graphics troubleshooter), SysRq-E (which brought me to the login screen for some reason). I experimentally tried logging in, and although the system sort-of worked, there were definite oddities, such as editor windows disappearing from the screen. Further SysRq-E and SysRq-I entries left me in various other places, such as framebuffer console 1. I then did SysRq-S and SysRq-U, logged in on the framebuffer console, verified that I couldn't do anything to affect the filesystem (and thus it was safe to effectively "cut the power" even though I was logged in), and a final SysRq-O to shut down.

I'm not really sure what conclusions to draw from this, except that modern Linux distributions are too complex to sanely shut down manually. Perhaps a change could be made to init that gives a more sensible reaction to SysRq-E (it should be easy enough to detect that it's happened). As it is, the main advice I can give is to do the R (because surely there's some reason people recommend it) and the E (in order to get programs to autosave), skip the I because it does more harm than good nowadays, and just do the S/U/B in whatever state the system ends up in after the E (waiting for hard disk activity to stop before each), on the hope that it's a reasonably quiescent one.

Preventative measures: keeping processes contained in advance

Of course, all this knowledge would be less necessary if broken processes were kind enough to just exit all by themselves rather than tying up the CPU or taking down the system. It's hard to do much to distinguish a process that's just looping without using up resources from a process that's acting normally, but if a process is using up an unexpectedly large amount of memory, that makes for a sensible trigger to terminate it.

The ulimit shell builtin (supported by most shells nowadays apart from the retro primitive ones) allows you to place limits on various resources a process might try to consume. Its effect is limited to the shell you run the command in and its descendants (which in practice typically means a single terminal tab), which makes it pretty safe to experiment with. I currently have a couple of ulimit commands in my /home/ais523/.bashrc file, which runs every time I open a new terminal tab (because I use bash as my shell), placing limits on a couple of resources particularly likely to be used up by runaway processes:

ulimit -Sv 3000000
ulimit -Sd 1000000

v is virtual memory, which includes all memory used by a program and all memory it claims it intends to use (even if it hasn't actually used it yet); d is data segment size, which in practice includes all statically allocated data in the program, plus (with glibc's default malloc implementation) data allocated via malloc requests for small amounts of memory (it's more common to see an infinite loop that leaks small amounts of memory every iteration, rather than large amounts rarely, so a lower limit here can catch issues sooner). Both of these are measured in kilobytes, meaning that we're talking approximately 3GB and 1GB respectively here; much larger than is needed by any command-line program I use on a regular basis, but still small enough that a modern computer can handle the amount of memory required without swap thrashing. (If I need to do something particularly memory-intensive for some reason, ulimit -Sv unlimited, etc., turns the limit back off again.)

S, for "soft", allows you to undo the change without needing to start a new process, and likewise means that malicious processes can exceed the limit via turning it off again; it's thus intended to catch the innocent mistakes that this blog post is mostly focused on, rather than programs that might be trying to use up your memory intentionally.
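A quick way to convince yourself of both properties (the limit only affects the shell it's set in and that shell's descendants, and a soft limit can be raised again afterwards) is to experiment inside subshells. This is only a sketch: the variable names are arbitrary, and it assumes the hard limit on your system is high enough (typically "unlimited") to allow the values used.

```shell
# Record the limit in the current shell before doing anything.
before=$(ulimit -Sv)

# A command substitution runs in a subshell, so the limit set inside it
# applies only there (and to anything that subshell starts).
inner=$( ulimit -Sv 3000000; ulimit -Sv )

# A soft limit can be raised again later, up to the hard limit.
raised=$( ulimit -Sv 1000000; ulimit -Sv 2000000; ulimit -Sv )

# The shell we started in has not been affected.
after=$(ulimit -Sv)

echo "before=$before inner=$inner raised=$raised after=$after"
```

Because nothing leaks out of the subshells, this is safe to paste into a terminal you're actively using.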

Hitting these limits normally causes malloc to fail. This normally leads to the program exiting one way or another (whether a controlled shutdown or a sudden crash). Very rarely, the memory allocation that tips the process over the limit will be something other than malloc (such as a stack allocation, most likely a VLA), in which case the process has no real choice other than to crash (you can install a handler for this by using a separate pool of memory reserved specifically for the eventuality, but hardly anybody does because it's unlikely and you can't do much about it anyway).

Not really related to the subject of the article, but if you're messing around with ulimits, here's another really useful one:

ulimit -Sc 1000000

This means that if the process crashes, you'll get a dump (called core) that can be loaded into a debugger to determine why. Not so useful to the average user (which is why it's off by default), but it's pretty valuable when it's the code you just wrote that crashed. Note that you'll have to delete the core files manually after use; new crashes won't overwrite old core files.

The TL;DR cheatsheet: "what to do to exit a stuck process"

Typically, try these in order, skipping ones that are obviously irrelevant.

Your inputs are being responded to, but have no visible effect

Warning: Try these first, to eliminate them; otherwise, if this has happened, you're doing a lot of input with potential side effects blind.

Possible causes of this, and the fixes:

XON/XOFF flow control is on for some reason and someone pressed Ctrl-S: Ctrl-Q

The process crashed with terminal echo turned off, so no keypress has a visible effect: Ctrl-C Return reset Return

The process is in the foreground of a terminal

The process has a working or default SIGINT handler: Ctrl-C

It's a command-line program at its command line: Ctrl-D

The process has a working or default SIGQUIT handler: Ctrl-\

The process has a working or default SIGTSTP handler: Ctrl-Z, then if the process pauses, jobs -p %% and Return to discover its process ID, then kill it using the techniques in the next section

Warning: This one has to come last, because it prevents the others working:

The process has a working or default SIGHUP handler: close the terminal window or tab containing the process

You have a working shell / can open a new terminal window

Identify the process ID, using:

You know its name: pgrep -a substring_of_name

It uses a lot of CPU or memory: top

It's a GUI program and has a window onscreen: xprop | grep PID

Then send the process a signal (PID here is the process ID):

The process has a working or default SIGTERM handler: kill PID

The process has a working or default SIGSEGV handler: kill -SEGV PID

You don't mind losing autosave/crash dump information: kill -9 PID
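The "polite first, forceful last" ordering above can be sketched as a small shell function. This is an illustration rather than a recommendation of specific timings: the name stop_pid, the default wait of 3 seconds and the TERM-ignoring demo process are all assumptions made for this example, and it skips the SIGSEGV step for brevity.

```shell
# stop_pid PID [SECONDS]: send SIGTERM, wait a little, then fall back to
# SIGKILL if the process is still there. (Hypothetical helper.)
stop_pid() {
    kill -TERM "$1" 2>/dev/null || return 0    # already gone
    tries=${2:-3}
    while [ "$tries" -gt 0 ]; do
        # kill -0 sends nothing; it only checks that the process exists.
        # Note that it also succeeds for an exited child that has not been
        # waited for yet, so this works best on processes started elsewhere.
        kill -0 "$1" 2>/dev/null || return 0
        sleep 1
        tries=$((tries - 1))
    done
    kill -KILL "$1" 2>/dev/null || true
}

# Demo: a process that ignores SIGTERM, standing in for a stuck program.
sh -c 'trap "" TERM; exec sleep 300' &
stubborn=$!
sleep 1                      # give it time to start ignoring SIGTERM
stop_pid "$stubborn" 2
wait "$stubborn"
status=$?
echo "exit status: $status"
```

An exit status of 128 + 9 here indicates that SIGTERM was ignored and the SIGKILL fallback was what finally ended the process.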

You don't have a working shell

Get a new console via:

The keyboard is still responding: Ctrl-Alt-F1 or Ctrl-Alt-F2

You have a serial terminal: connecting the serial terminal

You have a network connection: connecting using ssh

then log in and follow the advice in the previous section. (Try other Ctrl-Alt-F key combinations in order to get back to where you were.)

You can't get a shell via any means

The kernel can guess what process to kill: SysRq-F (i.e. Alt-(PrtSc,F))

You're logged in to a graphical desktop: Ctrl-Alt-BkSp or SysRq-K

The process has realtime priority: SysRq-N, then see previous section

You need to reboot the system: SysRq-R-E-S-U-B (no I because systemd)