I’ve been putting the “finishing touches” on my ext2/3 fsck readahead patches for about, oh, 4 weeks now. I finished the proof of concept around 3 or 4 months ago and started over on a nice clean version that has a chance of being merged. That version is up, working, and finishing in about 50% of the time of vanilla fsck on my test file system (running on a big fancy RAID – there’s almost no improvement on a single disk).

Certainly, I needed to go back through and get rid of the debugging stuff, search for “XXX – check for failure”, and do a final code review. But that only takes a week, maybe. What I’ve been doing since then is attempting to make fsck no less stable with readahead than without. It’s far more important for fsck to work than for it to work quickly. My ultimate goal is that if the readahead thread has a random segfault, it will clean up after itself perfectly and allow the main fsck thread to continue and finish as though nothing had happened.
This turns out to be overkill (ha ha) and incredibly annoying to do with pthreads. In the worst case, the readahead thread will need to deallocate memory and possibly (but not always) unlock a mutex. After several hours of reading man pages, the solutions for these problems appear to be:
- Every time you lock a mutex, pthread_cleanup_push/pop a handler that will unlock the mutex.
- Use the incredibly clumsy and painful pthread key interface to create a faux data-associated cleanup function that frees allocated resources on thread exit.
- Catch SIGSEGV and, um, a bunch of other signals and, um, figure out if it came from the readahead thread dying and um…
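For the curious, here is roughly what the first two items look like in practice. This is a minimal sketch, not the code from my patches: `queue_lock`, `buf_key`, `readahead_thread`, and `run_demo` are made-up names for illustration, and the thread exits on purpose via `pthread_exit()` while holding the lock. Note that cleanup handlers and key destructors run on `pthread_exit()` or cancellation, not by themselves after a segfault, which is why the third item exists at all.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical names, for illustration only. */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_key_t buf_key;
static int buffers_freed;   /* counts destructor runs so the demo can check */

/* Cleanup handler: runs if the thread is cancelled or calls
 * pthread_exit() between the matching push and pop. */
static void unlock_mutex(void *arg)
{
    pthread_mutex_unlock((pthread_mutex_t *)arg);
}

/* Key destructor: runs at thread exit when the key's value is non-NULL. */
static void free_buffer(void *buf)
{
    free(buf);
    buffers_freed++;
}

static void *readahead_thread(void *unused)
{
    (void)unused;

    void *buf = malloc(4096);
    if (buf == NULL)
        return NULL;
    /* Tie the buffer's lifetime to the thread. */
    if (pthread_setspecific(buf_key, buf) != 0) {
        free(buf);
        return NULL;
    }

    pthread_mutex_lock(&queue_lock);
    pthread_cleanup_push(unlock_mutex, &queue_lock);

    /* Simulate bailing out in the middle of the critical section. */
    pthread_exit(NULL);

    /* Normal path: pop the handler and run it (non-zero argument). */
    pthread_cleanup_pop(1);
    return NULL;
}

/* Returns 1 if the lock was released and the buffer was freed. */
static int run_demo(void)
{
    pthread_t tid;
    int rc;

    rc = pthread_key_create(&buf_key, free_buffer);
    assert(rc == 0);
    rc = pthread_create(&tid, NULL, readahead_thread, NULL);
    assert(rc == 0);
    pthread_join(tid, NULL);

    /* If the cleanup handler ran, nobody holds the lock any more. */
    rc = pthread_mutex_trylock(&queue_lock);
    if (rc == 0)
        pthread_mutex_unlock(&queue_lock);
    return rc == 0 && buffers_freed == 1;
}
```

Call `run_demo()` from a `main()` and build with `-pthread`. Even in this toy version you can see the problem: every lock needs its own push/pop pair in the same lexical scope, and every allocation needs a key or some other hook.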
So I’ve decided that if readahead breaks, then you should rerun fsck without readahead enabled. The readahead thread won’t run at all unless it is explicitly turned on, so it’s not like this will be unfathomable to the user. I’m still making the readahead/main thread interaction as robust as possible, using an explicit test/exit system in exactly the same way as the main fsck thread. The readahead thread shuts down if it detects a bug and the main thread can kill the readahead thread if it no longer wants readahead to keep going.
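To give an idea of what I mean by an explicit test/exit system, here is a small sketch under my own assumptions, not the actual patch. The names (`struct ra_state`, `ra_init`, `ra_thread`, `ra_run`) are hypothetical, the "bug" is simulated by a designated bad block index, and the main thread stops readahead through a shared flag rather than by any particular kill mechanism.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state between the main and readahead threads. */
struct ra_state {
    atomic_bool stop_requested; /* set by the main thread */
    atomic_bool failed;         /* set by the readahead thread */
    atomic_int  blocks_done;    /* progress counter */
    int blocks_total;
    int bad_block;              /* index that simulates a bug, -1 for none */
};

static void ra_init(struct ra_state *st, int total, int bad_block)
{
    atomic_init(&st->stop_requested, false);
    atomic_init(&st->failed, false);
    atomic_init(&st->blocks_done, 0);
    st->blocks_total = total;
    st->bad_block = bad_block;
}

static void *ra_thread(void *arg)
{
    struct ra_state *st = arg;

    for (int i = 0; i < st->blocks_total; i++) {
        /* Test point: the main thread no longer wants readahead. */
        if (atomic_load(&st->stop_requested))
            break;
        /* Test point: something is wrong, so record it and shut down
         * instead of carrying on with bad state. */
        if (i == st->bad_block) {
            atomic_store(&st->failed, true);
            break;
        }
        /* A real readahead thread would issue the read here. */
        atomic_fetch_add(&st->blocks_done, 1);
    }
    return NULL;
}

/* Runs the readahead thread to completion.
 * Returns 0 on a clean finish, 1 if readahead detected a bug,
 * -1 if the thread could not be started. */
static int ra_run(struct ra_state *st)
{
    pthread_t tid;

    if (pthread_create(&tid, NULL, ra_thread, st) != 0)
        return -1;
    pthread_join(tid, NULL);
    return atomic_load(&st->failed) ? 1 : 0;
}
```

The point of the design is that every exit from the readahead thread goes through a normal `return`, so any locks and memory are released by ordinary code paths rather than by cleanup handlers.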
As for the rest of the possible targets for perfectionism, I’ve decided to just add them to the TODO list and get the damn code out there for review (as soon as I do some performance testing to make sure my perfectionist editing didn’t destroy performance). Die, perfectionism, die!
And of course, I’m expecting a barrage of suggestions for how to robust-ify readahead.