Don’t Panic – fsync(), ext3/4, and your data

Worried about the recent firestorm over fsync() and ext3/4? Wondering if you should rewrite your applications to fsync() every other line of code? Afraid that you’ll boot into a new kernel one day and suddenly start losing all your data?

Don’t Panic. Don’t start rewriting your applications. And don’t worry about waking up one day and finding that your file system has silently switched to data=loseallmystuff mode. I will post something in more detail in the next few days, but for now, here are the top data points:

  • The majority of Linux file systems developers believe that applications that Just Work on ext3 today should also Just Work on your Linux file system in the future. Specifically, rename() implies that the file’s data will hit disk before the rename() does. (In other words, you won’t have to add an fsync() before the rename() in order to guarantee this behavior, as is technically required by POSIX – we figure it’s implied by the rename().)
  • Please don’t go add a load of fsync()s to your applications “just to be safe.” On 99.99% of ext3 systems in use, fsync() won’t return until all outstanding writes to the ENTIRE file system have hit disk. This causes enormous, unacceptable latencies if anyone else is using the file system, and in most cases isn’t what you actually want. (See data=guarded mode, below.)
  • Don’t worry about the new default journaling mode for ext3 planned for 2.6.30 (data=writeback, which is much faster than the old default, data=ordered, but has enormous security and data integrity problems). No distro would ship this as the default. The only way it could happen at Red Hat is over the dead bodies of the security team, who, let me tell you, keep an eagle eye on file system data leaks like this.
  • Chris Mason is taking time off btrfs to work on a new journaling mode for ext3, data=guarded, which will get around the current performance issues of data=ordered while preserving many of the old consistency guarantees. Please test it – the more testing it gets, the sooner data=writeback will stop being the default. Latest patches are here, here, and here.
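
For reference, the pattern at the center of this discussion is "write a new copy, then rename() it over the old one." The sketch below is an illustration, not code from any particular application: the helper name `atomic_replace` and its `sync_data` flag are made up for this example. With `sync_data=True` it performs the fsync() that POSIX technically requires; with `sync_data=False` it relies on the file system ordering data before the rename(), as ext3 data=ordered does.

```python
import os
import tempfile


def atomic_replace(path, data, sync_data=True):
    """Replace `path` with `data` (bytes) via a temporary file and rename().

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so that the rename()
    # never crosses a file system boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()  # push Python's userspace buffer into the kernel
            if sync_data:
                # The step POSIX requires before rename() if the new
                # contents must survive a crash on any file system.
                os.fsync(tmp.fileno())
        # os.replace() calls rename() on POSIX systems: readers see either
        # the old file or the new one, never a partially written file.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Note that mkstemp() creates the file with restrictive permissions, so a real application would probably also copy the mode and ownership of the file it is replacing.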

If you are an applications developer trying to figure out whether to rewrite all your file I/O code, please sit back and wait for things to settle down for a few months. My prediction is that by the time 2.6.31 is released (and possibly earlier), Linux file systems will actually be more reliable and better performing than in 2.6.29, without application developers or distros having to lift a finger.

Edited to remove “rename() implies fsync()”. Keep the comments coming! But note that I generally don’t approve anonymous comments unless they are polite and informative.

15 thoughts on “Don’t Panic – fsync(), ext3/4, and your data”

  1. Rename *does* *not* imply fsync.

    NO! Rename *does* *not* imply fsync.

    ext3, and now ext4, effectively add one if needed, but other filesystems (XFS, etc) don’t.

    See Stewart Smith’s “Eat My Data” talk (video is available on the LCA2007 site) for more.

  2. Common sense and distributions

    “No distro would ship this as the default. The only way it could happen at Red Hat is over the dead bodies of the security team”

    Well, you say that. But given some of the comments from people I would expect to know better[1], I’m not so sure. I’m quite certain it won’t appear in RHEL or Fedora. But I wouldn’t bet against it appearing in some other distributions. Indeed, my money is on Ubuntu shipping with data=writeback as the default in the future. I hope to be proved wrong…

    [1] For example, Jon Corbet: “Security is a smaller issue than it once was, for the simple reason that multiuser Linux systems are relatively scarce in 2009.” Yikes! That may be true of desktop Linux (not my desktops, but I accept I’m an outlier in that respect), but on the server side, it’s demonstrably false.

  3. Re: Rename *does* *not* imply fsync.

    Yep, it is a little more complicated. On ext3 data=writeback, or ext4 (unless you mount -o noauto_da_alloc), a rename which clobbers an existing file gets the new file’s data moving toward disk, but is not a synchronous fsync guarantee. With ext3 data=ordered you really do get the ponies you wish for; due to the vagaries of the journaling mode, the rename won’t hit disk until the new file’s data does, so that is more or less the same semantics. And you’re right, xfs doesn’t do this at all.

    So I’d still agree with Val that the world isn’t ending for app developers, but as filesystems run off and implement behavior to try to inflict least surprise, I still worry that it is making things somewhat less predictable in the end, overall.

  4. Re: Rename *does* *not* imply fsync.

    I’m sorry – I was just trying to keep things short. The longer version of “rename() implies fsync()” is:

    On ext3 with data=ordered, the data of a file will be written before the metadata of the file, which means that when a rename() is on disk, the data of the file as of the point of the rename() is also on disk.

    In the very near future, ext4 and btrfs will behave this way by default (with options for fast ‘n’ loose).

    At the file systems workshop, we discussed defining a useful set of behaviors beyond POSIX that all Linux file systems would adhere to, including rename() implies data is on disk, and that XFS was a candidate for behaving this way as well. (We theorize that a lot of the “XFS ate my data” complaints were a result of delayed writes of file data, just as in ext4, and so adopting this behavior would perhaps increase the popularity of XFS.)

  5. what about existing fsyncs

    What about applications that already use fsync(), because XFS users complained about them? At what point is it safe to remove those fsync()s?

    For applications that are cross-platform, is there anything we can do other than conditionally skipping fsync() just on Linux?

  6. Re: what about existing fsyncs

    fsync() is still necessary and required in some cases – don’t expect or plan to remove existing fsync()s. I’m just recommending to wait a few months and see what we end up with as policy for Linux.

    Right now, application developers are caught in a catch-22 – if they fsync(), performance will be terrible (in certain cases on existing ext3), and if they don’t fsync(), they might lose their data (existing ext4 and XFS). (I am talking about the default settings – each fs has many options which affect this behavior.) The problem is that ext3-with-poor-fsync-performance composes the vast majority of Linux systems in existence at this time, and at present, there is no reasonable programmatic way to detect the difference between two kinds of file systems: file systems which don’t need the fsync() and may perform poorly with it, and file systems which require the fsync().

    Linux kernel developers are still discussing what we should do and nothing is certain at this point, which is why I recommend waiting for a few months. This is a good time to express what you need, as an application developer, in order to write correct and fast programs across a range of operating systems and file systems. Stable interfaces giving information about the data consistency guarantees and performance of the underlying file system may be part of what you need. Another solution for systems going forward is to remove the performance/consistency tradeoff (Chris Mason’s data=guarded work).
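
To make the detection problem concrete: about the closest an application can get today is reading the mount table and looking at the file system type and options. The sketch below is only an illustration of that idea (the function name `find_mount` is made up, and it takes the text of a file such as /proc/mounts as an argument rather than reading it). It shows why this is not a reasonable interface: a default journaling mode need not appear in the options at all, and nothing in the result states what fsync() or rename() actually guarantee.

```python
import os


def find_mount(path, mounts_text):
    """Return (mount_point, fstype, options) for the mount containing `path`.

    `mounts_text` is the content of a mount table in /proc/mounts format.
    Hypothetical helper; escaped characters in mount points (such as \\040
    for a space) are not decoded in this sketch.
    """
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        mount_point, fstype, options = fields[1], fields[2], fields[3]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the longest matching mount point; on a tie, the later
            # entry wins because it shadows the earlier mount.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fstype, options.split(","))
    return best
```

An application would still have to map the returned type and options onto a guess about behavior, per file system and per kernel version, which is exactly the kind of guesswork a stable interface would remove.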

  7. What would be cool would be a writeup of the various sequences of rename(), truncate(), link(), unlink(), etc. that apps folks use as some sort of atomic checkpoint and what actually happens with the data, metadata, and other outstanding writes for that filesystem and disk as a result on the various filesystems with the various options. And then there’s laptop mode…

    Do you know of anything like that already out there?

  8. Re: Common sense and distributions

    Honestly, most servers are not “multi user” in the sense of multiple actual people with independent Unix level users. They have multiple users stored in some app database somewhere, but unix users tend to be mysql, www-data, postfix, memcacheduser, etc…

    And also, desktop users are a lot less likely to be fiddling their filesystem options than server administrators in my experience.

    (also, at least in the last few Ubuntu releases, multiple-concurrent-gnome-users has been a buggy piece of crap, with sound breaking randomly and the X server being pretty unreliable as well!)

  9. Re: Common sense and distributions

    They may not be separate unix users, but they are still separate users from the perspective of analysing the security requirements of the server – and its filesystem.

    Jon is correct if he meant that the unix user mechanism is less important than it once was, but if a filesystem can accidentally leak data written by another ‘unix level user’ then it can just as well accidentally leak data written on behalf of another ‘application level user’, and this is just as bad.

  10. POSIX-compatible applications should be fast

    I think the most important thing is that POSIX-typical constructs are fast. So if an application calls fsync it shouldn’t have to wait for the system’s whole write set to be committed to disk, but only the relevant parts. This ensures portability between file systems and Unixes.

    And then it’s OK if file systems provide more than that.

    Best regards, Thomas

  11. So…

    If I’m managing >2,000 servers, I still need to force add data=ordered to all my /etc/fstab’s in order to ensure that I get data consistency guarantees in the event that someone pushes out a vanilla kernel somewhere in my infrastructure. Then if data=guarded ever goes live, I’ll have to cook up some obnoxious regexp processing against the running kernel version to figure out if the kernel does or does not support that mode of operation and force that mode.

    Thanks LKML kernel devs for making a simple sysadmin’s life substantially more difficult… What was the problem with leaving things the way they are? What problem are they trying to solve other than having a soapbox on which they can destroy other people’s data and then lecture to them about how it’s their own fault…

    Heh, ironically, the code that I wrote that adds noatime automagically to every mountpoint on our servers opens up /etc/, writes to it, then renames it without calling fdatasync(), and that is probably the place I’d add the data=ordered hack….

    I’ve written that code about a hundred other times as well on a hundred other config files…

    Sometimes I really wish BSD had gotten all the corporate mindshare back around 1998 instead of Linux…
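
The fstab rewrite described in this comment might look roughly like the sketch below. It is an assumption-laden illustration, not the commenter's code: the function name `force_data_mode` and its parameters are invented, it only transforms text (writing the result back safely is a separate problem), and it does not check whether the running kernel accepts the chosen mode.

```python
def force_data_mode(fstab_text, mode="ordered", fstype="ext3"):
    """Return fstab text with data=<mode> set on every `fstype` entry.

    Hypothetical helper. Rewritten entries are re-joined with tabs, so
    their original column alignment is not preserved.
    """
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_entry = len(fields) >= 4 and not fields[0].startswith("#")
        if is_entry and fields[2] == fstype:
            # Drop any existing data= option, then append the wanted one.
            options = [o for o in fields[3].split(",")
                       if not o.startswith("data=")]
            options.append("data=" + mode)
            fields[3] = ",".join(options)
            line = "\t".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"
```

Running it a second time on its own output changes nothing, which matters for a script pushed repeatedly to thousands of machines.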

Comments are closed.