HOWTO debug silent data corruption

Back when I worked at Sun, I used to listen starry-eyed at the knees of senior engineers while they told their tales of debugging silent data corruption. They were really good stories – hardware with obscure manufacturing defects that didn’t show up until really optimized code ran on the chip, rogue SCSI drivers overwriting blocks in memory, wild pointers of every sort. “Wow,” I thought, “That sounds really hard! I want to do that!” Ever since then, I’ve jumped at any opportunity to debug silent data corruption and I’ve loved it every time. Now I have my own set of awesome silent data corruption stories, though sadly some of the best of them are still under NDA.

Lately though, I’ve realized two things: (1) most people think silent data corruption is almost impossible to debug, and (2) I’ve debugged silent data corruption often enough that I actually have a fairly specific system for doing it. So here’s HOWTO Debug Silent Data Corruption:

  1. Gather data. Somehow you noticed something was going wrong; turn on all your logging and crash dumps and anything else you need. Getting as many cases as possible is key. Plan ahead to gather all the data you possibly can when the data corruption is detected since you don’t know when it will happen again. Laziness is a serious error here. One of the hardest silent data corruption bugs I ever solved was made that much harder because a co-worker threw away the log output from his debugging program because he thought it was irrelevant. The program exited just before it triggered a crash dump, so I couldn’t just crawl through its memory in the debugger, but I was sure that some of the pages from the output of the program had not been reused and were floating around in memory somewhere. I ended up writing a script to pull out the entire contents of memory from some weird compressed format the dump program used and grepping through it all for my guess as to what the program would output if the corruption was of a certain type. (This data was the key to finding the root cause of the corruption.)
  2. Get the data out of the computer and into your brain via the highest bandwidth channel available: your eyeballs. Most people instinctively get the first step and gather vast reams of data. Then they dink around with manually paging through the data or running a debugger by hand on the crash dump and give up after a few days (or call me, which I prefer). You need to somehow sift the relevant data out of the giant pile of junk and then present it all in close physical and temporal proximity on the screen, in a manner optimized for data transmission through your optic nerves. Just keep fiddling with the presentation of the data until you start to see something meaningful. A typical simple visualization hack: I recently wrote a script to take only the first part of a disk and print out an ASCII character for each 512-byte block representing the content of the block (all zeroes, all ones, or mixed), and then fiddled with the output of the script and resized the window back and forth. Then I took the output for 10 different corrupted disks – and one uncorrupted disk – and ran a while loop popping each one up with less. Hitting ‘q’ quickly to cycle through the output for different disks let me visually detect common patterns in the corruption; the uncorrupted disk allowed me to distinguish signal from noise. (A minimal sketch of this kind of block-map script appears after the list.)
  3. Look for patterns. The exact pattern of corruption is the key to tracking down its cause. How big is the corruption? How is it aligned? What does it contain? 8 bytes aligned at an offset of N * 280 bytes plus 48 and containing a number consistent with a kernel address means a wild kernel pointer; search memory for an address close to the location of the corruption and find out what data structure it’s living in – it should be a pointer to a 280-byte structure with a pointer in it at offset 48 bytes. 128KB of all zeroes at 128KB alignment screams “Corruption somewhere in the block I/O stack! Almost certainly a crappy SATA driver!” Single-bit errors are very occasionally kernel bugs (set-bit operations on wild pointers) but nearly always indicate some hideous incurable hardware problem, like bad memory or broken trace lines. (A sketch of the alignment arithmetic appears after the list.)
  4. Develop a reproducible test case. If you’re lucky, someone already did this and that’s how you gathered the data in the first place. However, more often than not, the data has been trickling in from the field at some incredibly low rate such that you can run your test system for a month and only have a 1% chance of triggering the error. Now that you have some clues about what the problem could be, you should be able to find a reproducible test case with only a little bit of blind flailing. Make sure you do the math to figure out how long you’ll have to run to get a statistically significant result (a sample calculation appears after the list). Math is your friend!
  5. Run the reproducible test case over and over and over and gather more and different data until you have some hypotheses to test. Talk to your friends about the problem. Walk up and down the hallways eating your vending machine dinner until your brain stops consciously thinking about the problem and comes up with some brilliant insight. Take a shower, start driving home, attempt to go to sleep – whatever your strategy for generating ideas about the bug is, keep using it. Don’t forget to keep looking at your data in different ways and to go back for more data every time you run out of data to analyze. I’m always tempted to reason out the problem through sheer force of mind, but it always turns out that getting just that one more piece of data simplifies the problem by another order of magnitude.
  6. Congratulations! You now have a hypothesis or two about what’s causing the problem! So test it. Your first 5 guesses will probably be wrong. This is why we have a reproducible test case. My favorite is when someone finds some random bug report for some piece of software or hardware that happens to be installed on the system and declares that this must be the cause of the corruption. Folks, every system has a vast number of reported bugs and most of them are not causing your silent data corruption. This part can be pretty frustrating: often your test results won’t make any sense at all because your mental model is wrong. Be sure to test the null case, too – run your test with both the new disk driver and the old one and see if it actually distinguishes between the two (a sketch of that comparison appears after the list).
  7. Finally, get ready to be the messenger the customer shoots. Sometimes the fix for silent data corruption is easy: just apply a patch, recompile the driver, and voila! All better now! But more often the silent data corruption is just the tip of a giant hulking iceberg of very expensive hardware badness. If you manage to prove that every single disk in the entire 1024-node cluster has to be replaced, somebody is going to be out a lot of money and they are not going to be happy to see you. Be sure to have all your statistics and political machinations ready at hand. Remember, in the end, you found the cause of the corruption and nothing they do can change that. You found the cause of *silent data corruption*! Woohoo!! Yeah!
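
To make step 2 concrete, here is a minimal sketch (in Python) of the kind of block-map script described there. The details of the real script are not in the post, so the device path, the amount read, and the row width below are all assumptions to adjust to taste:

    #!/usr/bin/env python3
    # Minimal block-map sketch: print one ASCII character per 512-byte block
    # ('0' = all zeroes, 'F' = all 0xff bytes, '.' = anything else).
    # Reading a raw device usually needs root; a dd image works just as well.
    import sys

    BLOCK = 512
    WIDTH = 128          # blocks per output row; resize to taste

    def classify(block: bytes) -> str:
        if block == b"\x00" * len(block):
            return "0"
        if block == b"\xff" * len(block):
            return "F"
        return "."

    def block_map(path: str, mbytes: int = 64) -> None:
        chars = []
        with open(path, "rb") as f:
            for _ in range((mbytes * 1024 * 1024) // BLOCK):
                block = f.read(BLOCK)
                if not block:
                    break
                chars.append(classify(block))
        for i in range(0, len(chars), WIDTH):
            print("".join(chars[i:i + WIDTH]))

    if __name__ == "__main__":
        block_map(sys.argv[1], int(sys.argv[2]) if len(sys.argv) > 2 else 64)

Run it over each corrupted disk and over one known-good disk, page through the output with less, and resize the window until any pattern common to the corrupted set pops out.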
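
The alignment arithmetic in step 3 can also be automated once you have a list of corruption offsets. A sketch, with offsets invented purely to match the N * 280 + 48 example; treat the result as a hint, not proof:

    #!/usr/bin/env python3
    # Given the byte offsets of corrupted regions, look for a common stride and
    # remainder: offsets of the form N * 280 + 48 suggest corruption of a field
    # 48 bytes into a 280-byte structure.  The offsets below are made up.
    from functools import reduce
    from math import gcd

    offsets = sorted(48 + 280 * n for n in (3, 17, 101, 4096))  # hypothetical

    # The gcd of the gaps between offsets is a candidate structure size.
    stride = reduce(gcd, (b - a for a, b in zip(offsets, offsets[1:])))
    remainders = sorted({off % stride for off in offsets})

    print(f"candidate structure size: {stride} bytes")     # 280 here
    print(f"corruption offset within it: {remainders}")    # [48] here

With real data you would also look at the size and contents of each corrupted region, and check whether the values look like kernel addresses.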
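
The “do the math” part of step 4 is short enough to show in full. A sketch, assuming each test run independently reproduces the bug with some small probability p (the 1% figure is just the example from the text):

    #!/usr/bin/env python3
    # How many runs are needed to see at least one failure with a given
    # confidence, if each run reproduces the bug independently with probability p?
    from math import ceil, log

    def runs_needed(p: float, confidence: float = 0.95) -> int:
        # P(no failure in n runs) = (1 - p)**n; solve (1 - p)**n <= 1 - confidence.
        return ceil(log(1.0 - confidence) / log(1.0 - p))

    print(runs_needed(0.01))        # 299 runs for 95% confidence at p = 1%
    print(runs_needed(0.01, 0.99))  # 459 runs for 99% confidence

In other words, at 1% per run you need roughly 300 clean runs before you can claim with 95% confidence that the bug is gone (or at least much rarer), which is exactly why a faster reproducer is worth the flailing.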
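
And for the null case in step 6, a quick significance check keeps you from fooling yourself. A sketch using a one-sided Fisher exact test, with hypothetical failure counts for the old and new driver:

    #!/usr/bin/env python3
    # Does the old driver really corrupt data more often than the new one, or is
    # the difference just noise?  One-sided Fisher exact test, standard library
    # only; the failure counts below are hypothetical.
    from math import comb

    def one_sided_fisher(a: int, b: int, c: int, d: int) -> float:
        """P(the first row shows >= a failures by chance, given the margins),
        for the 2x2 table [[a, b], [c, d]] of (failures, clean runs) per config."""
        n, failures, row1 = a + b + c + d, a + c, a + b
        total = comb(n, failures)
        return sum(comb(row1, i) * comb(n - row1, failures - i)
                   for i in range(a, min(row1, failures) + 1)) / total

    old_driver = (9, 291)   # (runs with corruption, clean runs)
    new_driver = (1, 299)

    p = one_sided_fisher(*old_driver, *new_driver)
    print(f"p-value: {p:.4f}")
    print("the drivers really do differ" if p < 0.05 else
          "cannot tell the drivers apart yet; keep running")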

And that is how you debug silent data corruption.

8 thoughts on “HOWTO debug silent data corruption”

  1. Like all the best debugging, I figure there’s a step in there where you have to stand up very straight, look yourself in the eye, and say “actually, you are the smartest person in the world.” A useful skill!

  2. awesome!

    that is an awesome tutorial. while i haven’t fixed any silent data corruption, i thought i should mention a bit or two relating to the “data → optic nerve” problem, especially if you find you need to visualize binary intermixed with a bit of text, or need to visualize a few different dimensions of data:

    • turn control characters into something readable. a non-data-preserving but nevertheless useful transformation: LANG=C tr '\200-\377' '\0-\177' | LANG=C tr '\0-\037\177' '@-_?' (this strips the eighth bit and then transposes control characters into the corresponding ASCII characters you would need to combine with the control key in order to type said characters).
    • color is great stuff for quickly visualizing regions that differ. postprocess your data with something like LANG=C sed 's/things i would like to highlight/'"$(tput setaf 1)"'&'"$(tput sgr0)"'/g' and send it to a terminal with a large scrollback and the smallest font you can read comfortably. tile your screen with these if you’re visually diffing. i have good luck with xterm -fa : -fs 8 -tn xterm-256color which incidentally lets you use tput setaf (and the background-color version, tput setab ) with color numbers from 0 to 255, which is great if you need to differentiate a few different types of data. it has 8 dim colors, 8 bright colors, a 6x6x6 RGB color cube, and finally a 24-value gray ramp. remember, though, that your brain processes the color somewhat differently from brightness (lower spatial resolution) so you shouldn’t rely on color alone.
    • slight variations on these techniques can produce html with color codes that you can easily scroll around in, zoom in to, etc. the extra step you will need is escaping and putting everything inside pre: sed 's/&/&amp;/g;s/</\&lt;/g;s/>/\&gt;/g;1 s/^/<pre>/;$ s/$/<\/pre>/' (color codes for html are adequately documented elsewhere. just use <font color="red">…</font> and similar to mark up regions quickly.)
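
    a rough python equivalent of the highlight-and-colorize recipe above, for when the sed escaping gets out of hand (the pattern and the color number are just placeholders):

        #!/usr/bin/env python3
        # wrap every match of PATTERN in an ANSI color and pass the rest through.
        # the pattern and color are placeholders; 31 is red, like tput setaf 1.
        import re
        import sys

        PATTERN = re.compile(r"things i would like to highlight")
        COLOR, RESET = "\x1b[31m", "\x1b[0m"

        for line in sys.stdin:
            sys.stdout.write(PATTERN.sub(lambda m: COLOR + m.group(0) + RESET, line))
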
  3. My 2 cents

    Seven years ago I often had data corruption while copying large amounts of data (say > 500MB) within one HDD or between two HDDs. Quite randomly, a single bit in every 200-500MB got inverted. I exchanged IDE cables and the power supply, double-checked all cables and memory, and also ran memtest for a good 12 hours, with no results. I’ve never ever overclocked the CPU or memory.

    In the end I gave up and handed this PC over to a person who didn’t care about his data – he is a hardcore gamer :-)

    So, the bottom line is that some hardware failures are impossible to debug, and that’s really sad. Computers nowadays are so complicated that sometimes you just have to give up in the face of their weird behaviour.

  4. Awesome, not least the choice of colors. Where did you manage to find *black* easter-egg dye? And does the binary say anything in particular?

  5. For data visualization, I have found ggobi quite useful (www.ggobi.org), though I don’t know how much use it would be in this application.

    It’s quite nice to play with, anyway.

    At the moment, I’m trying to figure out why my algorithm takes more steps to search ten items than it does to search 1000. This does not make sense. (It’s not as trivial as it sounds, either).

  6. perfect timing

    Your post made it into my feed reader (via Kernel Planet) just as I was destroying a live filesystem by rescanning the SCSI bus I was booted from. This was apparently just as bad an idea as the documentation indicated it would be. Oh well, that’s why we have a test environment to play in. –Jay, posting anonymously since he apparently can’t figure out how flickr’s OpenID stuff is supposed to work

  7. “You don’t know where to start debugging. You have some ideas about likely causes, perhaps, but you know you are going to have to sit down for this one.”
