Here’s my favorite operating systems war story, what’s yours?

Val in her Dr. Pepper days
When I was working on my operating systems project in university, I stayed up all night for weeks feverishly rebooting my test machine, hoping that THIS time my interrupt handler changes worked. I survived on a diet of raisin bagels, baby carrots, and Dr Pepper, and only left the computer lab to shower and sleep for a few hours before stumbling back in and opening up EMACS again. I loved it.

Julia Evans’ blog posts about writing an operating system in Rust at Hacker School are making me miss the days when I thought everything about operating systems was magical. Julia is (a) hilarious, (b) totally honest, (c) incredibly enthusiastic about learning systems programming. (See “What does the Linux kernel even do?”, “After 5 days, my OS doesn’t crash when I press a key”, and “After 6 days, I have problems I don’t understand at all.”) I’m sure somewhere on Hacker News there is a thread getting upvoted about how Julia is (a) faking it, (b) a bad programmer, (c) really a man, but here in the real world she’s making me and a lot of other folks nostalgic for our systems programming days.

Yesterday’s post about something mysteriously zeroing out everything above 12K in her binary reminded me of one of my favorite OS debugging stories. Since I’m stuck at home recovering from surgery, I can’t tell it to anyone unless I write a blog post about it.

VME crate (CC-BY-SA Sergio.ballestrero at en.wikipedia)
In 2001, I got a job maintaining the Linux kernel for the (now defunct) Gemini subarchitecture of the PowerPC. The Gemini was an “embedded” SMP board in a giant grey metal VME cage with a custom BIOS. Getting the board in and out of the chassis required brute strength, profanity, and a certain amount of blood loss. The thing was a beast – loud and power hungry, intended for military planes and tanks where no one noticed a few extra dozen decibels.

The Gemini subarchitecture had not had a maintainer or even been booted in about 6 months of kernel releases. This did not stop a particularly enthusiastic PowerPC developer from tinkering extensively with the Gemini-specific bootloader code, which was totally untestable without the Gemini hardware. With sinking heart, I compiled the latest kernel, tftp’d it to the VME board, and told the BIOS to boot it.

It booted! Wow! What are the chances? Flushed with success, I made some minor cosmetic change and rebooted with the new kernel. Nothing, no happy printk’s scrolling down the serial console. Okay, somehow my trivial patch broke something. I booted the old binary. Still nothing. I thought for a while, made some random change, and booted again. It worked! Okay, this time I will reboot right away to make sure it is not a fluke. Reboot. Nothing. I guess it was a fluke. A few dozen reboots later, I went to lunch, came back, and tried again. Success! Reboot. Failure. Great, a non-deterministic bug – my favorite.

Eventually I noticed that the longer the machine had been powered down before I tried to boot, the more likely it was to boot correctly. (I turned the VME cage off whenever possible because of the noise from the fans and the hard disks, which were those old SCSI drives that made a high-pitched whining noise that bored straight through your brain.) I used the BIOS to dump the DRAM (memory) on the machine and noticed that each time I dumped the memory, more and more bits were zeroes instead of ones. Of course I knew intellectually that DRAM loses data when you turn the power off (duh) but I had never followed it through to the realization that the memory would gradually turn to zeroes as the electrons trickled out of their tiny holding pens.

So I used the BIOS to zero out the section of memory where I loaded the kernel, and it booted – every time! After that, it didn’t take long to figure out that the part of the bootloader code that was supposed to zero out the kernel’s BSS section had been broken by our enthusiastic PowerPC developer. The BSS is the part of the binary that contains variables that are initialized to zero at the beginning of the program. To save space, the BSS is not usually stored as a string of zeroes in the binary image, but initialized to zero after the program is loaded but before it starts running. Obviously, it causes problems when variables that are supposed to be zero are something other than zero. I fixed the BSS zeroing code and went on to the next problem.

This bug is an example of what I love about operating systems work. There’s no step-by-step algorithm to figure out what’s wrong; you can’t just turn on the debugger and step through till you see the bug. You have to understand the computer software and hardware from top to bottom to figure out what’s going wrong and fix it (and sometimes you need to understand quite a bit of electrical engineering and mathematical logic, too).

If you have a favorite operating system debugging story to share, please leave a comment!

Updated to add: Hacker News had a strangely on-topic discussion about this post with lots more great debugging stories. Check it out!

bash lesson of the day

So I’m running a command in a for loop in bash:

SECONDS=1
for i in 1 2 3 4; do
../my_test -S "${SECONDS}"
done

And for some damn reason, the test runs for 1 second on the first loop, then 2 seconds on the second loop, then 4 seconds on the next, etc. WTF?

Ten minutes of debugging output later, I finally type “man bash”:

SECONDS
Each time this parameter is referenced, the  number  of  seconds
since  shell  invocation is returned.  If a value is assigned to
SECONDS, the value returned upon subsequent  references  is  the
number  of seconds since the assignment plus the value assigned.
If SECONDS is unset, it loses its special properties, even if it
is subsequently reset.

Geez.
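
For the record, the way out I'd reach for is simply to not name the variable SECONDS. Here is a minimal sketch, assuming nothing more than a rename: "duration" is an arbitrary name I picked, and run_test is a hypothetical stand-in for ../my_test that only echoes its arguments so the loop can be tried anywhere.

```shell
#!/usr/bin/env bash
# Sketch only: run_test is a hypothetical stand-in for ../my_test that just
# echoes its arguments, so the loop can be tried without the real test binary.
run_test() {
    echo "my_test $*"
}

# Any name other than SECONDS works; "duration" is an arbitrary choice.
# An ordinary variable keeps its value instead of counting up on its own.
duration=1
for i in 1 2 3 4; do
    run_test -S "${duration}"
done
```

With an ordinary variable, every iteration passes the same value of 1, no matter how long the earlier iterations took.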

Emacs key bindings for Firefox and other stupid “power user” tricks

I’ve just installed Fedora 9 and I’m doing the usual re-configuration drill to get things to work the Right Way.

I use emacs. I want emacs key bindings in Firefox. Check out Firemacs from Kazu Yamamoto:

http://www.mew.org/~kazu/proj/firemacs/

I agree with the advice about disabling the up and down arrow keys for editing so you can use them to select auto-complete options. To do that, go to Tools->Add-ons->Firemacs, pick the “Edit” tab and delete the fields containing “up” and “down”. (Ctrl-N and Ctrl-P still work.)

Since I’ve been using a Mac for a few months, I reset the Firefox accel key to meta, so now meta-N opens a new window, meta-Q quits, etc. This gives you both Emacs key bindings AND Firefox shortcuts. To set this, type “about:config” in the URL window. Search for “accel” and set this line:

ui.key.accelKey;224

(224 is the key code for Meta. There is an ongoing project to allow only slightly less understandable key words to be used instead.)

What used to work for emacs key bindings was to put the following in .gtkrc-2.0:

gtk-can-change-accels = 1
gtk-key-theme-name = "Emacs"
gtk-entry-select-on-focus = 0

I’m sure this has some effect on other apps. I’m not really sure if the first line is necessary or what the third line does, but I must have liked it at some point.

Finally, get rid of the irritating folders and icons on the desktop. Nautilus is responsible for this travesty. Disable with:

gconftool-2 -t bool /apps/nautilus/preferences/show_desktop -s false

Refactoring and dirt

Yesterday I noticed that I was in an exceptionally good mood for no especially obvious reason, such as winning the lottery. Thinking about it, I concluded that I was happy because I was finally able to refactor the code I was working on after several weeks of implementing new features. It got to the point where working on it in its disorganized state gave me the exact same feeling as touching a slimy rock – ew ew ew!

Interesting – we say “clean code” and “dirty hack,” but I think that we really do co-opt the “ew, dirt, contamination” module of our brains to work with code. Weird, huh? I’m going to pay attention for other interesting recruitment examples.