To SSD or not to SSD?

I own a Macbook Air. I am a cool file systems person, so naturally I would buy the version with the SSD rather than the boring old crappy disk, right?

Wrong.

SSD – the kind built out of flash memory – is at present far less reliable than old-fashioned spinning disks. My direct personal experience, and that of my business colleagues and friends, is that flash-based storage suffers silent data corruption at an extraordinary rate. Why is this? A few reasons:

Flash-based storage has a limited number of write cycles before it wears out and stops storing the correct data. Supposedly this is not a problem because (1) current flash can do millions of erase/write cycles, (2) SSDs implement hardware wear-leveling to spread out writes evenly over all the cells.

Let’s start with hardware wear-leveling. Basically, nearly all practical implementations of it suck. You’d imagine that it would spread out writes over all the blocks in the drive, only rewriting any particular block after every other block has been written. But I’ve heard from experts several times that hardware wear-leveling can be as dumb as a ring buffer of 12 blocks; each time you write a block, it pulls something out of the queue and sticks the old block in. If you only write one block over and over, this means that writes will be spread out over a staggering 12 blocks! My direct experience working with corrupted flash with built-in wear-leveling is that corruption was centered around frequently written blocks (with interesting patterns resulting from the interleaving of blocks from different erase blocks). As a file systems person, I know what it takes to do high-quality wear-leveling: it’s called a log-structured file system and they are non-trivial pieces of software. Your average consumer SSD is not going to have sufficient hardware to implement even a half-assed log-structured file system, so clearly it’s going to be a lot stupider than that.
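To put numbers on why that matters (the figures below are made up for illustration: assume a round 100,000 erase cycles per block, which is more conservative than the millions the datasheets claim, and a workload that rewrites one hot block once per second):

    % echo "$((12 * 100000 / 86400)) days until a block written once per second burns through a 12-block rotation"
    13 days until a block written once per second burns through a 12-block rotation
    % echo "$((1000000 * 100000 / 86400 / 365)) years if the same writes were spread over a million blocks"
    3170 years if the same writes were spread over a million blocks

The exact numbers don’t matter much; the point is that the gap between a dumb rotation and real wear-leveling is several orders of magnitude.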

How about those millions of write cycles? Let’s start with a higher-level observation: Disks have been around for decades. We have decades of experience with accelerated aging of disks to predict long-term failure rates – you stick ’em in a hot room and vibrate them around and write lots of data to them, and the failure rate under these conditions can be successfully extrapolated to long-term failure rates. When you buy a disk, you can reasonably expect that it will store your data correctly, and – almost as important – that if it doesn’t store your data correctly, you’ll know because you either get an IO error or because it makes a loud clunking noise and your computer hangs.

When it comes to flash, manufacturers are handicapped when predicting long-term failure rates for a number of reasons. First, it’s hard to extrapolate failure rates under stress tests to long-term failure rates. In particular, the failure mode of a flash cell is that the charge leaks out of the cell – slowly, over time. Stress testing by writing to the cell a lot and then reading the data back is not going to test this situation. In general, we simply don’t have a lot of experience with testing flash and it will take a few years to build it up. Second, manufacturers of devices using flash are constantly switching suppliers for the actual flash memory itself. When it comes to consumer-grade flash, manufacturers have strong pressure to drive the price down and very little pressure in the direction of quality. Frequently, the manufacturer won’t have any idea where the flash chips came from for a particular device based purely on the model number, because it used more than one supplier for that model. Third, failure rates are heavily dependent on the pattern of both reads and writes, so a device that checks out fine under one test pattern will fail miserably under another load – not generally a characteristic of disks. Another fascinating aspect of flash-based SSD is that you don’t seem to get any report of checksum failures on corruption – at least I haven’t seen one in the three confirmed cases of flash corruption I’ve seen. I don’t know if this is because the device isn’t reporting it or because the OS driver isn’t listening for it, but it’s what happens.

The exception to these observations is any flash device that costs a lot of money – commercial-grade flash as opposed to consumer-grade flash. Disks vary in quality too, but it’s usually much more along the performance axis than the reliability axis. Speaking of performance, the performance of flash-based SSD has not been the huge leap over spinning disk that we expected. At the present time, many disks still have higher bandwidth than many SSDs. SSDs still have a performance penalty for non-sequential IO vs. sequential IO – not as high as a disk seek, but enough to drop throughput by a factor of two or so. They also have high overhead for small random writes due to the need to erase the entire erase block the target block is located in. So SSD will beat the pants off of a disk on an uncached random read workload (e.g., system boot-up), but disks have the advantage on streaming reads and both streaming and random writes, generally speaking.
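To put a rough number on that small-write overhead: if the drive has to rewrite a whole erase block to change one small chunk, the worst-case write amplification is roughly the erase-block size divided by the write size. The sizes here are just typical ballpark figures, not the spec of any particular drive:

    % echo "$((128 * 1024 / 4096))x worst case for a 4 KiB write into a 128 KiB erase block"
    32x worst case for a 4 KiB write into a 128 KiB erase block
    % echo "$((128 * 1024 / 512))x worst case for a 512-byte write into a 128 KiB erase block"
    256x worst case for a 512-byte write into a 128 KiB erase block

Smarter drives absorb some of this with remapping and write caching, but the basic asymmetry between reads and small writes is baked into how the flash erases.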

Another purported advantage of flash is lower power usage than disks. This isn’t a straightforward equation; it depends on your usage pattern and the sophistication of the device and OS’s power-saving mode. Disks can not only be spun down; the internal electronics can also be put into power-saving mode – as can elements of the host-side adapter and link. Don’t automatically assume that your SSD will use less power than a sophisticated disk in power saving mode.

Note that I am explicitly *not* talking about DRAM-based SSD. Those babies are fast, reliable, and very very expensive. If you have one, more power to you.

All of these equations will change as flash-based SSD gets better. Manufacturers will figure out better quality control, hardware wear-leveling will either get better or people will use log-structured file systems at the OS level, performance will improve, prices will drop. But if you are running out and buying storage today, you should buy a disk unless you fit one of the following categories: (a) You have a lot of money and there is some particular feature flash-based SSD gives you that is worth spending that money on, (b) You don’t care much about data integrity, (c) You won’t be doing a lot of writes, (d) You’re using a full-featured log-structured file system with built-in checksums.

Let’s come back to the Macbook Air. Supposedly, you would buy the SSD version because you want lower power consumption, better shock-resistance, and higher performance. You wouldn’t get the SSD because it costs a hell of a lot of money (about $1000 more) or because it has lower capacity than the hard drive version (I think it’s 40GB for the SSD and 80GB for the disk at present). The reports are that you really only notice the performance difference at boot. I’ve personally dropped my Air about a dozen times, once hard enough to dent a corner, and so far the disk is fine. Laptop disks in general have become quite reliable and it’s been years since I had one fail, even though I’m what they call a “digital nomad” – my laptop is my primary machine and I travel all the time. The battery life on my Air is stellar – 4 or 5 hours – and almost completely dominated by the display brightness. Dialing it up to max approximately halves the battery life.

Overall, I think people buy the SSD-based Air because it’s cool and new (a perfectly good reason) and because if it costs more, it must be better, right? It’s also a status symbol. My personal recommendation: Buy the disk version of the Air. If you did buy the SSD version, back up frequently.

Postscript: Yes, this analysis is based on anecdotal evidence and personal experience, but I can’t afford the time to do real research unless someone pays me to. If you know someone who will, send me email!

42 thoughts on “To SSD or not to SSD?”

  1. FS suggestion in the meantime?

    Hi Val!

    Pia got an EeePC and then went back to her T42 tank (big screen wins), so I have this lovely little flash-disk based lappy hand-me-down to play with. At the moment, I’m just using ext3, which I don’t feel entirely happy with, but I wasn’t sure if there was a better choice…

    Would you recommend any of the current crop of file systems (shipping with 2.6.26/27ish) for general-purpose flash-disk usage? Would love to start mucking about with a log-structured/checksummed FS on the EeePC if there’s one to watch. :-)

    Thanks!

  2. Re: FS suggestion in the meantime?

    If you want to actually use your flash, go with JFFS2, which has been in mainline for ages. It breaks down around 64GB, but I don’t think that’s a problem for the EeePC. If you feel like filing a lot of bug reports but being on the bleeding edge, try LogFS:

    http://logfs.org/logfs/

    I’d put jffs2 on / and logfs on /dangerous. :)

  3. Thanks for the write-up. I am the least early-adopter of any geek ever, but I am impatient for consumer-grade SSDs with acceptable quality. (Being a victim of the power-saving problems in many Linux distros, in which power-saving modes cause HDD drive heads to park several times a second and reduce lifespans to months at best, I’m not really able to take advantage of power saving on HDDs.) Sounds like that time isn’t now.

  4. thinking about getting one

    I’ve been thinking about getting a laptop to record (music) with, and disk write noise is one of the things I wanted to avoid. Plus, some of the MLC drives aren’t so crazy expensive:
    http://www.mwave.com/mwave/Skusearch_v2.asp?scriteria=BA25460

    The JFFS2 size ceiling is good to know, thanks! I wonder if hardware wear-leveling and FS wear-leveling have any bad interactions.

    I think if I had spent large amounts of time thinking about how to deal with disk seek latency, I’d be a little put out if it went away.

  5. I’ve heard conflicting reports on this one (the drive spin up/down thing) – do you have any more details? I also know that different hard drive models are optimized for different usage patterns – e.g., the original iPod drives were optimized for lots of spin up/down and broke down if they were spun up for long periods of time – the opposite of regular disks.

  6. Most of what I know comes from this bug, which is a fair bit of difficult and pointless reading (i.e., people debating whether or not this means that the Ubuntu development model is the worst or best thing ever) with a few bits of information stuck in.

    As best I can tell, the problem is that the correct APM level on hard drives is of course usage dependent, but even setting a sensible default for laptopish use (say, aiming to not have the drive die of over-enthusiastic parking within two years) is very hard drive model dependent.

    In my case, I left Ubuntu totally alone for six months and it parked my heads over 1 million times according to smartctl. Its replacement is a Hitachi Travelstar 7K200 supplied by Dell under warranty (cheers Dell), but I don’t have a model number for the original drive.

    Ubuntu in particular also seems to have a lot of trouble keeping the settings in a single documented and easily understandable place, and their more knowledgeable devs seem to have mostly stayed out of the relevant bugs, resulting in everyone and their dog posting recipes for fixing it, and re-writing them for each new release. Current state of play seems to be here. Seems like it will be better in Ubuntu 8.10 (but I have had “do not upgrade before beta” engraved on my soul, and can’t confirm personally).
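    For anyone wanting to check their own drive, something along these lines should work on most distros (the device name is just an example, and the right -B value is, as above, very model dependent):

    % sudo smartctl -A /dev/sda | grep -i load_cycle    # how many head parks so far
    % sudo hdparm -B /dev/sda                           # query the current APM level
    % sudo hdparm -B 254 /dev/sda                       # stop aggressive parking (not persistent across reboots)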

  7. EeePC gotchas

    I have been running Ubuntu 8.04 off an MMC/SD card with ext3 in an EeePC 900 and while booting speed is very nice (no point using the readahead script) there are issues…

    Writing is very slow. I’ve switched the IO scheduler to deadline which helped a tiny bit but when it happens you notice. Worse still, when using firefox 3 the window seems to go out to lunch periodically for tens of seconds. Looking at top when this happens shows the system gobbling up lots of IO.

    Suspend to RAM is a filesystem killer in this setup. The EeePC implements its SD card reader as a USB device (which makes things somewhat simple). However if I do suspend to RAM and then resume (remember the system is running from this SD card) it seems I will guarantee corruption (although I might not notice until I reboot). Parts of the filesystem simply seem to turn into zeros and the whole experience is quite terrifying. It’s hard to know if it’s the EeePC, the SD card or the kernel.

    A quick aside: be wary of picking up the 701 or 900 (basically the Celeron M based models). These machines leak battery even when they are completely turned off.

  8. Re: EeePC SSD appears as a block device

    Slight misunderstanding there – by “actually use your flash” I intended “store data on your EeePC and get it back again.”

  9. I did some more investigation into this, but never really got round to writing it up. The issue is that Linux (userland rather than the kernel, I guess, though possibly some unfortunate interaction) seems to be generating IO on a frequent basis – you can hear this on some machines in the form of a gentle ticking noise coming from the drive. Quite what’s generating it, I have no idea, but it seems to interact poorly with the drive head unloading algorithms. They unload the head, and a second or so later there’s an IO request and the heads get loaded again. Might be the default ext3 commit interval, I guess, though even that shouldn’t do anything unless there are dirty pages.

  10. Well, there is atime update going on. I’d be surprised if any of the heavyweight desktop environments are able to idle without generating a constant stream of atime touches. (I know people have gotten, for example, xfce to do so with some hacking.)

    1e6 parks in 6 months is one park every 15 seconds! That’s impressive.

  11. Especially when it wasn’t on all the time: I suspect when it was it must have been parking about once every 3–5 seconds. So there was some serious insanity in there for sure.

  12. pointed out your relatime work to me about three months after the death of the disk in question. If only he would log into my laptop and ‘improve’ things as often as I do that to him…

  13. Like someone mentioned before, a filesystem like jffs2, yaffs, or ubifs on an SSD does not add anything. The solution will come with SSD devices that offer raw access to their NAND flash. I think that will happen when hardware vendors discover that it is cheaper to move their firmware FTL into a driver, similar to what happened with winmodems.

  14. Re: EeePC gotchas

    That issue with Firefox is the now relatively well-known Firefox vs. SQLite vs. fsync vs. Linux issue. It’s extremely annoying on every machine I have, grunty and petite. :-)

  15. I have a new fascinating broken flash device — this one a USB flash stick. The first few megs work OK, but after that it does this:


    % dd if=/dev/zero of=/dev/sda
    % hd /dev/sda
    00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
    *
    00d30000 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 |................|
    *
    00d302c0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 c0 e0 c0 |................|
    00d302d0 e0 c0 e0 c0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 |................|
    00d302e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 |................|
    *
    00d30340 e0 c0 e0 c0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 |................|
    00d30350 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 |................|
    *
    00d303e0 a0 e0 a0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 |................|
    00d303f0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 e0 |................|
    *

    The badness is written to the flash — multiple reads result in the same data — but variable across writes; writing multiple times results in different patterns. But it is consistently the same bytes being repeated with occasional single-bit variations.

    Seems like an electrical issue in the interface, rather than bad flash erase blocks. The design is a very simple USB-to-parallel-flash interface chip, so I could easily imagine an underrated capacitor or other issue resulting in this failure mode.

  16. Actually, a file system like jffs2 does add something on an SSD. As I said before, most SSDs have really terrible write-leveling. Using a file system explicitly designed to spread out writes will extend the life of the flash in this case.

    I agree entirely that SSDs should offer raw access to their NAND flash. Doing wear-leveling in hardware is extremely limited and only adds overhead.

  17. sqlite vs fsync on ext3

    The pausing is so bad I think I’m going to beg for someone to add toolkit.storage.synchronous to distro builds of Firefox (so people can optionally turn the synchronisation off via about:config ). I’m so desperate I don’t care about browser data safety any more!

  18. Interesting writeup. I knew there was good reason for me not to trust Flash memory for frequent writes. It’s nice to see it all laid out by someone with more hardware experience.

    Question: where do you see technologies like MRAM or Racetrack memory, vis a vis replacing flash and/or disks? How about replacing DRAM?

  19. Re: EeePC SSD appears as a block device

    So looking more at the JFFS2 recommendations page above, it is quite sensible but doesn’t cover the case I’m talking about. Using JFFS2 on top of a block device incurs a performance penalty, and they don’t have confidence in the mtd emulation layer since it’s really only used for testing. They also don’t want you creating a large JFFS2 file system since it doesn’t scale well, and most block devices are quite large. And they don’t recommend using it on USB stick type of devices – those are written so seldom that ext2 is a much better solution.

    All of these things are true. The consideration that is left out is when the hardware wear-leveling or error checking is not very good, which is difficult to judge until you’ve had a lot of devices out on the market. For the EeePC in particular, I think the tradeoffs make sense for using JFFS2 with the mtd emulation layer – at least until we have a lot of positive experience with the reliability of this flash under ext3.
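    For the record, the setup I have in mind is roughly the following (untested on the EeePC itself; the device name and the 128 KiB erase size are placeholders you’d need to check against your flash, and flash_eraseall comes from mtd-utils):

    % sudo modprobe block2mtd block2mtd=/dev/sdb,131072   # wrap the block device in an MTD device; 131072 = assumed 128 KiB erase size
    % sudo modprobe mtdblock
    % sudo flash_eraseall -j /dev/mtd0                    # erase and lay down JFFS2 cleanmarkers
    % sudo mount -t jffs2 /dev/mtdblock0 /mnt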

  20. nand bitflips

    Write errors are only one of the problems; bits can also be corrupted on read operations. The NAND blocks contain a CRC capable of correcting one bit flip (IIRC the CRC is larger on newer chips), but until recently there were no filesystems capable of repairing the block before another bit flip occurs. Everyone just pretends read bit-flips never happen…

  21. Alas, noatime didn’t stop the disk from spinning back up several times a minute. (Something in dbus seems to be fiddling with a log, or some such. Never did quite track it down, just made it stop parking the heads.)

  22. Yeah, we switched to relatime by default in the 8.04 installer, but didn’t manage to do that on upgrades. Upgrades to 8.10 will deal with that, assuming you’re using ext2 or ext3 (I’m not sure why only those filesystems, will check).

  23. Alas,

    Attempting to read a file will still require the system to, um, read the file, regardless of whether it updates the last-accessed-time.

    Alas, the impossible is still impossible :p

    –cwillu

  24. Re: FS suggestion in the meantime?

    JFFS2 starts to get cranky around 64 MB, not 64 GB. We are using it on the One Laptop Per Child system with a 1 GB NAND FLASH, and it is clearly way out of its league at that size. Mount times can take up to a minute. When the device starts to fill up, the garbage collector thread brings the machine to a standstill. I could go on and on… JFFS2 just wasn’t designed to scale past a few tens of megabytes.

    We are testing UBIFS as a possible replacement.

    Both JFFS2 and UBIFS are designed for use on raw NAND, but the chip industry trend is toward “managed NAND” – SSDs and similar schemes like eMMC and LBA-NAND. The reason for the trend is because the market is pushing for ever greater sizes, which translates in hardware to smaller process geometries, larger FLASH page sizes, and wider ECC. All of those factors end up requiring different interface chips, which hurts the ability of the manufacturers to deploy their new chips in existing designs. With the “managed NAND” approach, the hardware details can be hidden behind a built-in microprocessor.

    The same sort of thing happened 25 years ago in the rotational disk arena. “Raw” disk interfaces like ST-506 gave way to “smart” interfaces like IDE and SCSI. Once that trend started, the raw interfaces disappeared very quickly.

    While it’s true that many Flash Translation Layer implementations suck, that’s not the same as saying that they all suck. Hopefully, with the ever-increasing importance of Flash storage, the sucky implementations will start to go by the wayside.

  25. UDF?

    I wonder if UDF would be any good on NAND flash block devices. Not the way Linux implements it, but rather the Spared build of, say, UDF 2.60: something that incorporates native defect management so when the inevitable cell wear-out occurs, the filesystem is able to deal with it. Also, being roughly log-structured, updates end up being written to other blocks.

    Combined with some form of garbage collection when the block device is full, it’d be a non-exotic means to effectively utilize NAND flash block devices with or without any form of block sparing or wear leveling, reducing the number of erase cycles substantially and avoiding hot flaming death by frequent filesystem metadata writes at the same logical block addresses.

    UDF is widely implemented across major platforms, and I figure that with a little bit of insight and a little bit of work, it’d do very well as a filesystem for NAND solid state devices.

  26. “Doing wear-leveling in hardware is extremely limited and only adds overhead.”

    That’s what I always thought, too. Why does Linus think that any SSD worth buying will do adequate wear-leveling in hardware? Absent further information, I am going to assume that he is right and you and I are wrong, but I don’t understand how.

  27. Racetrack and all

    … quite a long way off, I think. There’s still a fair bit of work to do on the practicalities of fabricating the stuff. Long term, I personally think it has a fair bit of promise.

  28. Hmmmm.

    If you want to unload the heads at all (and you do, since it generates less heat), you can’t do frequent IO. Changing the APM parameters so that it doesn’t ever park the heads (because the timeout is longer than the period between IO ops) is not the right solution.

    ext3 and other filesystems unfortunately want to write every 5 seconds by default, whether you do anything or not (particularly if you don’t use noatime mount options).

    I get around this by running laptop_mode so I don’t get frequent writes.

  29. It’s the ext3 commit interval. Use laptop_mode (/proc/sys/vm/laptop_mode) and mount your partitions with a large commit interval (commit=21600 on my laptop when on wall power, and 86400 when on battery). laptop-mode-tools in debian is a good place to start to do the remounting when switching between power sources (and to enable the spin down timer if that’s what you want – I’ve had my disk stay off all day during some workloads, doing this. It’s nice getting a SMART reading back from the HD saying “21 degrees”).

    If you still can’t get rid of the reads (writes will be postponed until a sync or the cache fills up or the commit interval is reached), then consult /proc/sys/vm/block_dump and your syslog to see what is being done. Ah, that reminds me. If you want long spindown times as well as long park times, in /etc/syslog.conf, you may want to change all files to be not synced (prepend the filename with a dash). You will want to at least temporarily change all logfiles to not sync when you have set block_dump, otherwise each IO operation will cause another IO operation on the logfiles. Fun and joy joy.

    Oh, and don’t do what I did the other day and cause a kernel panic during a dist-upgrade, with such long timeouts :) Always sync manually when you absolutely rely on something hitting the disk. Ah, this reminds me again. vi fsyncs annoyingly by default. set nofsync in .vimrc. Oh, and get a real browser instead of that firefox crud. firefox kills disks and children alike.
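    To collect the knobs from above in one place (the laptop_mode value of 5 is just the usual example; the rest are the values mentioned above, so adjust to taste):

    % sudo sh -c 'echo 5 > /proc/sys/vm/laptop_mode'     # batch up writes
    % sudo mount -o remount,noatime,commit=86400 /       # long ext3 commit interval
    % sudo sh -c 'echo 1 > /proc/sys/vm/block_dump'      # log every block IO to the kernel log
    % dmesg | tail                                       # see who is still hitting the disk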

  30. I like your work, but relatime wasn’t one of my favorites – I used relatime for a little while, but then I found that I relied on an accurate atime just too often. It’d be a lot better if atimes could be cached until the filesystem is explicitly synced (absent any crashes, they would eventually make it to disk at system shutdown at the latest). Better yet, to reduce the time required to get all these atime updates onto the disk during that sync (too late now that ext[234] has been designed, I guess), atimes could be located contiguously on the disk (as a single integer for each inode, so the layout isn’t so sparse that you still end up having to write too much).

    Most of the justifications for keeping an accurate atime in the comments of that kerneltrap article were pretty bogus (defrag? auditing? tmpwatch, on the other hand…), but nevertheless, I use it frequently enough for diagnostics (as long as the backup scheme you run works off an LVM snapshot, so that your atimes stay accurate, of course) that I greatly missed it when I didn’t have it.

  31. Your laptop drives last?

    Mine die after (almost exactly) a year, near every time. Although I did replace my last drive early because I was so out of space.

    I very nearly bought one of the Intel SSDs, however the AU$1500 purchase price for the 160GB unit turned me off, even if the performance is awesome.

  32. Re: UDF?

    I’ve tried out UDF with a 32GB Throttle ESATA pen drive after it got progressively slower with use when formatted with NTFS or FAT. Looks like formatting with UDF (I used version 2.5) completely avoids the problems and the drive stays as fast as when it’s first formatted!

  33. How to check your SSD for corruption

    I have an Acer Aspire One 110 / ZG5 with the 8GB SSD running Ubuntu 9.04.

    I’ve been having occasional lockups and needing to e2fsck a lot on reboot and am wondering if the SSD is going bad, but I don’t know how to check it for corruption.

    I memtest’ed it which was fine.

    I tried e2fsck -fcckpv and it didn’t turn anything up. (-cc is the non-destructive read/write test of the entire disk.)

    I’m thinking about just converting it to ext3 and hoping that the problems go away.
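    One rough check that occurred to me, though I’m not sure how conclusive it is: write a big file of random data, flush the caches so the data has to come back from the flash, and compare checksums. The size and paths are just examples, and the test burns write cycles on half a gigabyte of the flash, so don’t run it in a loop:

    % dd if=/dev/urandom of=./flashtest bs=1M count=512 conv=fsync
    % sha1sum ./flashtest > /tmp/flashtest.sha1
    % sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'     # force the re-read to come from the device
    % sha1sum -c /tmp/flashtest.sha1

    If the second checksum doesn’t match, the device handed back something other than what it was told to store.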
