Here’s my favorite operating systems war story, what’s yours?

Val in her Dr. Pepper days

Val in her Dr Pepper days

When I was working on my operating systems project in university, I stayed up all night for weeks feverishly rebooting my test machine, hoping that THIS time my interrupt handler changes worked. I survived on a diet of raisin bagels, baby carrots, and Dr Pepper, and only left the computer lab to shower and sleep for a few hours before stumbling back in and opening up EMACS again. I loved it.

Julia Evans‘ blog posts about writing an operating system in Rust at Hacker School are making me miss the days when I thought everything about operating systems was magical. Julia is (a) hilarious, (b) totally honest, (c) incredibly enthusiastic about learning systems programming. (See “What does the Linux kernel even do?“, “After 5 days, my OS doesn’t crash when I press a key“, and “After 6 days, I have problems I don’t understand at all.”) I’m sure somewhere on Hacker News there is a thread getting upvoted about how Julia is (a) faking it, (b) a bad programmer, (c) really a man, but here in the real world she’s making me and a lot of other folks nostalgic for our systems programming days.

Yesterday’s post about something mysteriously zeroing out everything about 12K in her binary reminded me of one of my favorite OS debugging stories. Since I’m stuck at home recovering from surgery, I can’t tell anyone it unless I write a blog post about it.

VME crate (CC-BY-SA Sergio.ballestrero at en.wikipedia)

VME crate (CC-BY-SA Sergio.ballestrero at en.wikipedia)

In 2001, I got a job maintaining the Linux kernel for the (now defunct) Gemini subarchitecture of the PowerPC. The Gemini was an “embedded” SMP board in a giant grey metal VME cage with a custom BIOS. Getting the board in and out of the chassis required brute strength, profanity, and a certain amount of blood loss. The thing was a beast – loud and power hungry, intended for military planes and tanks where no one noticed a few extra dozen decibels.

The Gemini subarchitecture had not had a maintainer or even been booted in about 6 months of kernel releases. This did not stop a particularly enthusiastic PowerPC developer from tinkering extensively with the Gemini-specific bootloader code, which was totally untestable without the Gemini hardware. With sinking heart, I compiled the latest kernel, tftp’d it to the VME board, and told the BIOS to boot it.

It booted! Wow! What are the chances? Flushed with success, I made some minor cosmetic change and rebooted with the new kernel. Nothing, no happy printk’s scrolling down the serial console. Okay, somehow my trivial patch broke something. I booted the old binary. Still nothing. I thought for a while, made some random change, and booted again. It worked! Okay, this time I will reboot right away to make sure it is not a fluke. Reboot. Nothing. I guess it was a fluke. A few dozen reboots later, I went to lunch, came back, and tried again. Success! Reboot. Failure. Great, a non-deterministic bug – my favorite.

Eventually I noticed that the longer the machine had been powered down before I tried to boot, the more likely it was to boot correctly. (I turned the VME cage off whenever possible because of the noise from the fans and the hard disks, which were those old SCSI drives that made a high-pitched whining noise that bored straight through your brain.) I used the BIOS to dump the DRAM (memory) on the machine and noticed that each time I dumped the memory, more and more bits were zeroes instead of ones. Of course I knew intellectually that DRAM loses data when you turned the power off (duh) but I never followed it through to the realization that the memory would gradually turn to zeroes as the electrons trickled out of their tiny holding pens.

So I used the BIOS to zero out the section of memory where I loaded the kernel, and it booted – every time! After that, it didn’t take long to figure out that the part of the bootloader code that was supposed to zero out the kernel’s BSS section had been broken by our enthusiastic PowerPC developer. The BSS is the part of the binary that contains variables that are initialized to zero at the beginning of the program. To save space, the BSS is not usually stored as a string of zeroes in the binary image, but initialized to zero after the program is loaded but before it starts running. Obviously, it causes problems when variables that are supposed to be zero are something other than zero. I fixed the BSS zeroing code and went on to the next problem.

This bug is an example of what I love about operating systems work. There’s no step-by-step algorithm to figure out what’s wrong; you can’t just turn on the debugger and step through till you see the bug. You have to understand the computer software and hardware from top to bottom to figure out what’s going wrong and fix it (and sometimes you need to understand quite a bit of electrical engineering and mathematical logic, too).

If you have a favorite operating system debugging story to share, please leave a comment!

Updated to add: Hacker News had a strangely on-topic discussion about this post with lots more great debugging stories. Check it out!

This entry was posted in Uncategorized and tagged , , , . Bookmark the permalink.

12 Responses to Here’s my favorite operating systems war story, what’s yours?

  1. Johannes Berg says:

    My favourite debugging story isn’t really so much about operating systems but networks – I was an intern in the network maintenance department, and mysteriously a bunch of machines had decided to stop PXE booting, depending on which DHCP server was used. I ended up using a network sniffer, and hacking the DHCP server to try to duplicate the response bit by bit, until I had them all the same – yet the problem persisted. Eventually it turned out I hadn’t really looked at the IP header, and the DF bit was causing the PXE BIOS to drop the response.

    Now, the fix was about as tricky – the DHCP servers that were causing the problem were still running HP-UX (though slated to be replaced with the (working) Linux servers, but not immediately), and an HP-UX security update had caused the DF bit to be force-set with no way to disable it again. I had to write raw sockets code for HP-UX (called DLPI or DPLI or something like that) and build all the stuff myself in the ISC DHCP code, just to get this one bit unset.

  2. Kind of long, but this is kind of my “go to” story when people ask about the challenges of debugging system software.

    I was working at Revivio, which made a storage appliance, and we were trying to deal with some mysterious hangs. These were hard hangs which would kill even our kernel gdb stub, so normal debugging wasn’t possible. I was wishing for some kind of hardware support so we could examine memory even though the CPU was out to lunch, when I realized we had it. These machines were connected via IB, so I wrote some code to export one machine’s kernel+vmalloc regions to another and wrote a little program to slurp that into an ELF file that gdb on the second machine could understand. Those were two fun hacks all by themselves, but I wasn’t done yet.

    In phase two, I started collecting dumps but they didn’t make sense. I started to notice that some kernel stacks were getting corrupted, e.g. with one frame pointing into the middle of another, and addresses (including code addresses) on the stack that could have nothing to do with each other in any possible call sequence. I added guard pages and memory scrubs and extra pointer checks in various places, to no avail. I sort of gave up and started looking at the few dumps where it seemed like a task struct had been corrupted instead of a stack (which I had previously written off as likely bugs in my IB/ELF hacks). Finally I realized what was happening. Sometimes a task would allocate so much in a single stack frame that it would jump all the way over the rest of its stack, all the way over its own task struct and any guard areas, into the next task’s stack. Then it would return happily, but the next time the “victim” ran it would explode.

    In phase three, I wrote a script to disassemble each of our many kernel modules and find large subtractions from the stack pointer. I found and fixed not only the likely culprits for the hangs we were seeing, but many more that were likely to cause problems later.

  3. http://mjg59.dreamwidth.org/11235.html helped get me my job. Least favourite was probably the one where HP’s System Management Mode handler would always restore CPU state to CPU0 no matter which core it was running on, resulting in the system crashing 50% of the time when you opened the lid. ThHp.

    • vaurora says:

      HILARIOUS: “That’s a lot of 0xffs (around 31MB of them) with one small section that contains an 802.11 probe packet with the SSID of the hospital across the road from my house.”

      That’s a great bug, but how did you figure out that it was restoring state to CPU0?

      • mjg59 says:

        The small number of times it would oops instead of just dying would always show an entirely impossible backtrace on cpu0. I think it was Bjoern Helgaas who suggested that might be the problem? Turns out Windows always behaves that way, so who would ever notice, etc.

  4. Mike Stimpson says:

    On a VME system, I had to determine which version of a (non-CPU) board was in it.

    The first version had an 8-bit register at a particular address, whereas the second version had a 16-bit register. So I wrote out a 16-bit value to that address, and then read it back. If I got all 16 bits back correctly, then the system had the second version of the board. If I only got 8 bits back correctly, then the system had the first version.

    But that didn’t work. The first (8 bit) version got all 16 bits back correctly.

    To understand why, you have to think about what happens to the data bus on the board 8 bit version of the board. You write out 16 bits. The VMEbus interface drives a voltage onto all 16 bits of the data bus on the board. 8 of those bits get written into latches (the register). The other 8 bits go nowhere – there are no chips enabled that are reading those bits, so all chips on the bus have very high impedance for those bits. The unused 8 bits of the data bus just sit there – but they sit there at the voltage levels specified by the data bits that I was trying to write to the non-existent half of the data bus.

    When I tried to read the data back out, no chips were enabled to write data onto that half of hte data bus. But the drivers were enabled to drive the data onto the VMEbus. What did those drivers drive? Whatever was on the data bus… which just happend to be the voltage left over from the write of the data to the non-existent half of the register. That is, THE DATA BUS WAS ACTING AS 8 BITS OF DRAM! (To work, it needed to be refreshed about 10 times a second.)

    So I wrote 16 bits of data to that register, wrote 16 bits of different data to a different register, and then read back the first register. That worked.

  5. Julia Evans says:

    That’s a fantastic story! I love all of these. I’m so happy none of my problems are caused by hardware… yet

    Thanks for the kind words about my blog <3

  6. Kevin says:

    Love the post and story, but this bit was unnecessary and counter-productive: “I’m sure somewhere on Hacker News there is a thread getting upvoted about how Julia is (a) faking it, (b) a bad programmer, (c) really a man”

    How is a passive-agressive swipe like this any different than what you probably thought you were deploring? You made me feel bad unnecessarily and all I did was read the post.

  7. Your adventures with BSS reminded me of one of my earliest FOSS experiences. I was still hacking on Windows, but slowly replacing all of the interesting bits with FOSS, including using MinGW rather than Visual C. I was trying to get a large pile of non-trivial C and assembly to compile (porting it all to GCC inline assembly), and I finally got it to compile, but it mysteriously crashed. Turns out that MinGW marked the .text section read-only, and the assembly often declared variables without bothering to switch to .data; took quite an adventure via gdb’s “stepi” to finally figure that one out. (Sadly, this was before I figured out that if you don’t send in your changes right away and in small chunks, they’ll bitrot.)

    These days, I’m having fun playing with BIOS via Python, using BITS, and it really reminds me of the old DOS days where you could go poke video memory or I/O ports yourself.

    Thanks for the great nostalgia adventure! “Works if I leave it turned off for a while” is definitely one of the strangest heisenbugs I’ve seen.

Comments are closed.