Back when I worked at Sun, I used to listen starry-eyed at the knees of senior engineers while they told their tales of debugging silent data corruption. They were really good stories – hardware with obscure manufacturing defects that didn’t show up until really optimized code ran on the chip, rogue SCSI drivers overwriting blocks in memory, wild pointers of every sort. “Wow,” I thought, “That sounds really hard! I want to do that!” Ever since then, I’ve jumped at any opportunity to debug silent data corruption and I’ve loved it every time. Now I have my own set of awesome silent data corruption stories, though sadly some of the best of them are still under NDA.
Lately though, I’ve realized two things: (1) Most people think silent data corruption is almost impossible to debug, (2) I’ve debugged silent data corruption often enough that I actually have a fairly specific system for doing it. So here’s HOWTO Debug Silent Data Corruption:
- Gather data. Somehow you noticed something was going wrong; turn on all your logging and crash dumps and anything else you need. Getting as many cases as possible is key. Plan ahead to gather all the data you possibly can when the data corruption is detected since you don’t know when it will happen again. Laziness is a serious error here. One of the hardest silent data corruption bugs I ever solved was made that much harder because a co-worker threw away the log output from his debugging program because he thought it was irrelevant. The program exited just before it triggered a crash dump, so I couldn’t just crawl through its memory in the debugger, but I was sure that some of the pages from the output of the program had not been reused and were floating around in memory somewhere. I ended up writing a script to pull out the entire contents of memory from some weird compressed format the dump program used and grepping through it all for my guess as to what the program would output if the corruption was of a certain type. (This data was the key to finding the root cause of the corruption.)
- Get the data out of the computer and into your brain via the highest bandwidth channel available: your eyeballs. Most people instinctively get the first step and gather vast reams of data. Then they dink around with manually paging through the data or running a debugger by hand on the crash dump and give up after a few days (or call me, which I prefer). You need to somehow sift the relevant data out of the giant pile of junk and then present it all in close physical and temporal proximity on the screen, in a manner optimized for data transmission through your optic nerves. Just keep fiddling with the presentation of the data until you start to see something meaningful. A typical simple visualization hack: I recently wrote a script to take only the first part of a disk and print out an ASCII character for each 512 byte block representing the content of the block (all zeroes, all ones, or mixed), and then fiddled with the output of the script and resized the window back and forth. Then I took the output for 10 different corrupted disks – and one uncorrupted disk – and ran a while loop popping each one up with less. Hitting ‘q’ quickly to cycle through the output for different disks let me visually detect common patterns in the corruption; the uncorrupted disk allowed me to distinguish signal from noise.
- Look for patterns. The exact pattern of corruption is the key to tracking down its cause. How big is the corruption? How is it aligned? What does it contain? 8 bytes aligned at an offset of N * 280 bytes plus 48 and containing a number consistent with a kernel address means a wild kernel pointer; search memory for an address close to the location of the corruption and find out what data structure it’s living in – it should be a pointer to a 280 byte structure with a pointer in it at offset 48 bytes. 128KB of all zeroes at 128KB alignment screams “Corruption somewhere in the block I/O stack! Almost certainly a crappy SATA driver!” Single bit errors are very occasionally kernel bugs (set bit operations on wild pointers) but nearly always indicate some hideous incurable hardware problem, like bad memory or broken trace lines.
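The arithmetic behind those two examples can be sketched as a pair of quick checks (the 280-byte structure size and 48-byte field offset come straight from the example above; the function names are mine):

```python
import math

def fits_struct_pattern(offsets, struct_size=280, field_offset=48):
    """True if every corruption offset lands at the same field offset
    within consecutive instances of a struct_size-byte structure --
    the signature of a wild pointer into an array of structs."""
    return all(off % struct_size == field_offset for off in offsets)

def common_alignment(offsets):
    """Largest power of two dividing every corruption offset -- a fast
    way to spot e.g. 128KB-aligned damage from the block I/O stack."""
    g = math.gcd(*offsets)
    return g & -g  # isolate the lowest set bit of the gcd
```

Feeding in every observed corruption offset, rather than eyeballing one or two, is what turns a hunch about alignment into a statement you can defend.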
- Develop a reproducible test case. If you’re lucky, someone already did this and that’s how you gathered the data in the first place. However, more often than not, the data has been trickling in from the field at some incredibly low rate such that you can run your test system for a month and only have a 1% chance of triggering the error. Now that you have some clues about what the problem could be, you should be able to find a reproducible test case with only a little bit of blind flailing. Make sure you do the math to figure out how long you’ll have to run to get a statistically significant result. Math is your friend!
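The math in question is short. If one test run triggers the bug with probability p, the chance of seeing no failure in n runs is (1 − p)^n, so you need n ≥ log(1 − c) / log(1 − p) runs for confidence c. A sketch (the 1% and 95% figures below are illustrative, not from any real test plan):

```python
import math

def runs_needed(p: float, confidence: float = 0.95) -> int:
    """Number of independent test runs needed to trigger a bug with
    per-run probability p at least once, at the given confidence.
    Solves (1 - p)**n <= 1 - confidence for n."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))

# A 1%-per-run bug needs roughly 300 runs for 95% confidence:
print(runs_needed(0.01))  # prints 299
```

This is also the number that tells you whether your "fix" actually worked: if you only ran the test 50 times after patching a 1%-per-run bug, a clean result means very little.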
- Run the reproducible test case over and over and over and gather more and different data until you have some hypotheses to test. Talk to your friends about the problem. Walk up and down the hallways eating your vending machine dinner until your brain stops consciously thinking about the problem and comes up with some brilliant insight. Take a shower, start driving home, attempt to go to sleep – whatever your bug idea strategy is, keep using it. Don’t forget to keep looking at your data in different ways and to go back for more data every time you run out of data to analyze. I’m always tempted to reason out the problem through sheer force of mind, but it always turns out that getting just that one more piece of data simplifies the problem by another order of magnitude.
- Congratulations! You now have a hypothesis or two about what’s causing the problem! So test it. Your first 5 guesses will probably be wrong. This is why we have a reproducible test case. My favorite is when someone finds some random bug report for some piece of software or hardware that happens to be installed on the system and declares that this must be the cause of the corruption. Folks, every system has a vast number of reported bugs and most of them are not causing your silent data corruption. This part can be pretty frustrating: often your test results will make no sense at all because your mental model is wrong. Be sure to test the null case, too – run your test with both the new disk driver and the old one and see if it actually distinguishes between the two.
- Finally, get ready to be the messenger the customer shoots. Sometimes the fix for silent data corruption is easy: just apply a patch, recompile the driver, and voila! All better now! But more often the silent data corruption is just the tip of a giant hulking iceberg of very expensive hardware badness. If you manage to prove that every single disk in the entire 1024-node cluster has to be replaced, somebody is going to be out a lot of money and they are not going to be happy to see you. Be sure to have all your statistics and political machinations ready at hand. Remember, in the end, you found the cause of the corruption and nothing they do can change that. You found the cause of *silent data corruption*! Woohoo!! Yeah!
And that is how you debug silent data corruption.