I found the link to Quibble, an open and extensible reverse engineering of the Windows kernel bootloader to be much more intriguing: https://github.com/maharmstone/quibble
Thinking of how I'd do this for ZFS... I think I'd do something like: add a layer that can read other filesystem types and synthesize ZFS block pointers, so ZFS could read other filesystems, and as it writes it could rewrite the whole thing slowly. If ZFS had block pointer rewrite (and I've explained here before why it does not and cannot have BP rewrite capabilities, not being a proper CAS filesystem), one could just make it rewrite the whole thing to finish the conversion.
I've been using it for a few years now on my main PC (which has a couple of SSDs and a large HDD) and my laptop; it was the default on openSUSE and I just used that. Then I realized that snapshots are a feature I didn't know I wanted :-P.
Never had a problem, though it is annoying that what BTRFS thinks is free space and what the rest of the OS thinks is free space do not always align. It has rarely been a problem in practice though.
"Flagship"? I don't know a single person who uses it in production systems. It's the only filesystem I've lost data to. Ditto for friends.
Please go look up survivor bias. That's what all you btrfs fanboys don't seem to understand. It doesn't matter how well it has worked for 99.9% of you. Filesystems have to be the most reliable component in an operating system.
It's a flagship whose fsck requires you to contact developers to seek advice on how to use it because otherwise it might destroy your filesystem.
It's a flagship whose userspace tools, fifteen years in, are still seeing major changes.
It's a flagship whose design is so poor that, fifteen years in, the developers are making major changes to its structure and deprecating old features in ways that do not trigger an automatic upgrade or an informative error telling you to upgrade, but cause the filesystem to panic with error messages for which there is no documentation and little clue what the problem is.
Btrfs is in production all over the damn place, at big corporations and all kinds of different deployments. Synology has their own btrfs setup that they ship to customers with their NAS software for example.
I found it incredibly annoying the first time I ran out of disk space on btrfs, but many of these points are hyperbolic and honestly just silly. For example, btrfs doesn't really do offline fsck. fsck.btrfs has a zero percent chance of destroying your volume because it does nothing. As for the user space utilities changing... I'm not sure how that demonstrates the filesystem is not production ready.
Personally I usually use either XFS or btrfs as my root filesystem. While I've caught some snags with btrfs, I've never lost any data. I don't actually know anyone who has; I've merely heard about it.
And it's not like other well-regarded filesystems have never run into data loss situations: even OpenZFS recently (about a year ago) uncovered a data-eating bug that called its reliability into question.
I'm sure some people will angrily tell me that actually btrfs is shit and the worst thing to ever be created and honestly whatever. I am not passionate about filesystems. Wake me up when there's a better one and it's mainlined. Maybe it will eventually be bcachefs. (Edit: and just to be clear, I do realize bcachefs is mainline and Kent Overstreet considers it to be stable and safe. However, it's still young and its upstream future has been called into question. For non-technical reasons, but still; it does make me less confident.)
> For example, btrfs doesn't really do offline fsck. fsck.btrfs has a zero percent chance of destroying your volume because it does nothing.
fsck.btrfs does indeed do nothing, but that's not the tool they were complaining about. From the btrfs-check(8) manpage:
Warning
Do not use --repair unless you are advised to do so by a
developer or an experienced user, and then only after having
accepted that no fsck can successfully repair all types of
filesystem corruption. E.g. some other software or hardware
bugs can fatally damage a volume.
[...]
DANGEROUS OPTIONS
--repair
enable the repair mode and attempt to fix problems where possible
Note there’s a warning and 10 second delay when this option is
run without --force to give users a chance to think twice
before running repair, the warnings in documentation have
shown to be insufficient
Yes, but that doesn't do the job that a fsck implementation does. fsck is something you stuff into your initrd to do some quick checks/repairs prior to mounting, but btrfs intentionally doesn't need those.
If you need btrfs-check, you have probably hit either a catastrophic bug or hardware failure. This is not the same as fsck for some other filesystems. However, ZFS is designed the same way and also has no fsck utility.
So whatever point was intended to be made was not, in any case.
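For reference, the non-destructive ways to sanity-check a btrfs filesystem look roughly like this (a sketch; the device and mount point are placeholders):

    # read-only check of an unmounted filesystem; modifies nothing on disk
    btrfs check --readonly /dev/sdb1

    # on a mounted filesystem, scrub re-reads everything and verifies checksums
    btrfs scrub start -B /mnt/data
    btrfs scrub status /mnt/data

Neither of these is the --repair mode the manpage warns about.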
Contrary to popular belief, people on a forum you happen to participate in are still just strangers. In line with popular belief, anecdotal evidence is not a good basis to form an opinion.
Exactly how do you propose to form an opinion on filesystem reliability then? Do my own testing with thousands of computers over the course of 15 years?
You don't determine what CPUs are fast or reliable by reading forum comments and guessing, why would filesystems be any different?
That said, you make a good point. It's actually pretty hard to quantify how "stable" a filesystem is in any meaningful way. It's not like anyone is doing Jepsen-style analysis of filesystems right now, so the best thing we can go off of is testimony. And right now for btrfs, the two types of data points are essentially: companies that have been using it in production successfully, and people on the internet saying it sucks. I'm not saying either of those is great, and I am not trying to tell anyone that btrfs is "good" by some subjective measure. I'm just here to tell people it's apparently stable enough to be used in production... because, well, it's being used in production.
Would I argue it is a particularly stable filesystem? No, in large part because it's huge. It's a filesystem with an integrated volume manager, snapshots, transparent compression and much more. Something vastly simpler with a lower surface area and more time in the oven is simply less likely to run into bugs.
Would I argue it is perfectly reasonable to use btrfs for your PC? Without question. A home use case with a simple volume setup is exceedingly unlikely to be challenging for btrfs. It has some rough edges, but I don't expect to be any more likely to lose data to btrfs bugs than to hardware failures. The bottom line is, if you absolutely must not lose data, having proper redundancy and backups is probably a much bigger concern than btrfs bugs for most people.
>You don't determine what CPUs are fast or reliable by reading forum comments and guessing, why would filesystems be any different?
Your premise is entirely wrong. How else would I determine what CPUs are fast or reliable? Buy dozens of them and stress-test them all? No, I use online sites like cpu-monkey.com that compare different CPUs' features and performance according to various benchmarks, for the performance part at least. For reliability, what way can you possibly think of other than simply aggregating user ratings (i.e. anecdotes)? If you aren't running a datacenter or something, you have no practical alternative.
At least for spinning-rust HDDs, the helpful folks at Backblaze have made a treasure trove of long-term data available to us. But this isn't available for most other things.
> It's not like anyone is doing Jepsen-style analysis of filesystems right now, so the best thing we can go off of is testimony.
This is exactly my point. We have nothing better, for most of this stuff.
>companies that have been using it in production successfully, and people on the internet saying it sucks
Companies using something doesn't always mean it's any good, especially for individual/consumer use. Companies can afford teams of professionals to manage stuff, and they can also make their own custom versions of things (esp. true with OSS code). They're also using things in ways that aren't comparable to individuals. These companies may be using btrfs in a highly feature-restricted way that they've found, through testing, is safe and reliable for their use case.
> It's a filesystem with an integrated volume manager, snapshots, transparent compression and much more. Something vastly simpler with a lower surface area and more time in the oven is simply less likely to run into bugs.
This is all true, but ZFS has generally all the same features, yet I don't see remotely as many testimonials from people saying "ZFS ate my data!" as I have with btrfs over the years. Maybe btrfs has gotten better over time, but as the American car manufacturers found out, it takes very little time to ruin your reputation for reliability, and a very long time to repair that reputation.
> Your premise is entirely wrong. How else would I determine what CPUs are fast or reliable? Buy dozens of them and stress-test them all? No, I use online sites like cpu-monkey.com that compare different CPUs' features and performance according to various benchmarks, for the performance part at least. For reliability, what way can you possibly think of other than simply aggregating user ratings (i.e. anecdotes)? If you aren't running a datacenter or something, you have no practical alternative.
My point is just that anecdotes alone don't tell you much. I'm not suggesting that everyone needs to conduct studies on how reliable something is, but if nobody has done the groundwork then the best thing we can really say is we're not sure how stable it is because the best evidence is not very good and it conflicts.
> Companies using something doesn't always mean it's any good, especially for individual/consumer use. Companies can afford teams of professionals to manage stuff, and they can also make their own custom versions of things (esp. true with OSS code). They're also using things in ways that aren't comparable to individuals. These companies may be using btrfs in a highly feature-restricted way that they've found, through testing, is safe and reliable for their use case.
For Synology you can take a look at what they're shipping, since they're shipping it to consumers. It does seem like they're not using many of the volume management features, instead using some proprietary volume management scheme at the block layer. Otherwise, however, there's nothing particularly special that I can see; it's just btrfs. Other advanced features like transparent compression are available and exposed in the UI.
(edit: Small correction. While I'm still pretty sure Synology has custom volume management for RAID which works on the block level, as it turns out, they are actually using btrfs subvolumes as well.)
I think the Synology case is an especially interesting bit of evidence because it has got to be one of the hardest scenarios for shipping a filesystem: you're shipping it to customer machines you don't control and can't easily inspect later. It's not the only case of shipping btrfs to the customer either; I believe ChromeOS does this and even uses subvolumes, though I didn't actually look for myself when I was using it, so I'm not 100% sure on that one.
> This is all true, but ZFS has generally all the same features, yet I don't see remotely as many testimonials from people saying "ZFS ate my data!" as I have with btrfs over the years. Maybe btrfs has gotten better over time, but as the American car manufacturers found out, it takes very little time to ruin your reputation for reliability, and a very long time to repair that reputation.
In my opinion, ZFS and other Solaris technologies that came out around that time period set a very high bar for reliable, genuinely innovative system features. I think we're going to have to live with the fact that just having a production-ready filesystem dropped onto the world is not going to be the common case, especially in the open source world: the filesystem will need to go through its growing pains in the open.
Btrfs has earned a reputation as the perpetually-unfinished filesystem. Maybe it's tainted and it will simply never approach the degree of stability that ZFS has. Or, maybe it already has, and it will just take a while for people to acknowledge it. It's hard to be sure.
My favorite option would be if I just simply don't have to find out, because an option arrives that quickly proves itself to be much better. bcachefs is a prime contender since it not only seems to have better bones but it's also faster than btrfs in benchmarks anyways (which is not saying much because btrfs is actually quite slow.) But for me, I'm still waiting. And until then, ZFS is not in mainline Linux, and it never will be. So for now, I'm using btrfs and generally OK recommending it for users that want more advanced features than ext4 can offer, with the simple caveat that you should always keep sufficient backups of your important data at all times.
I only joined in on this discussion because I think that the btrfs hysteria train has gone off the rails. Btrfs is a flawed filesystem, but how flawed it is continues to be vastly overstated every time it comes up. It's just, simply put, not that bad. It does generally work as expected.
>Synology has their own btrfs setup that they ship to customers with their NAS software for example.
Synology infamously/hilariously does not use btrfs as the underlying file system because even they don't trust btrfs's RAID subsystem. Synology uses LVM RAID that is presented to btrfs as a single drive. btrfs isn't managing any of the volumes/disks.
Their reason for not using btrfs as a multi-device volume manager is not specified, though it's reasonable to infer that it is because btrfs's own built-in volume management/RAID wasn't suitable. That's not really very surprising: back in ~2016 when Synology started using btrfs, these features were still somewhat nascent even though other parts of the filesystem were starting to become more mature. To this day, btrfs RAID is still pretty limited, and I wouldn't recommend it. (As far as I know, btrfs RAID5/6 is even still considered incomplete upstream.) On the other hand, btrfs subvolumes as a whole are relatively stable, and that and other features are used in Synology DSM and ChromeOS.
That said, there's really nothing particularly wrong with using btrfs with another block-level volume manager. I'm sure it seems silly since it's something btrfs ostensibly supports, but filesystem-level redundancy is still one of those things that I think I would generally be afraid to lean on too hard. More traditional RAID at the block level is simply going to be less susceptible to bugs, and it might even be a bit easier to manage. (I've used ZFS raidz before and ran into issues/confusion when trying to manage the zpool. I have nothing but respect for the developers of ZFS but I think the degree to which people portray ZFS as an impeccable specimen of filesystem perfection is a little bit unrealistic, it can be confusing, limited, and even, at least very occasionally, buggy too.)
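A minimal sketch of that kind of layering, with mdadm providing the redundancy and btrfs sitting on top as a single-device filesystem (device names are placeholders):

    # classic RAID1 at the block layer
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    # btrfs only sees one device, but still checksums everything it reads
    mkfs.btrfs -L data /dev/md0
    mount -o compress=zstd /dev/md0 /mnt/data

You give up btrfs's ability to self-heal from a second copy (it only knows about one copy of the data), but you keep checksum-based detection and let the long-tested md code handle the redundancy.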
>That's not really very surprising: back in ~2016 when Synology started using btrfs, these features were still somewhat nascent even though other parts of the filesystem were starting to become more mature.
btrfs was seven years old at that point and declared "stable" three years before that.
ZFS is an example of amazingly written code by awesome engineers. It's simple to manage, scales well, and easy to grok. btrfs sadly will go by the wayside once bcachefs reaches maturity. I wouldn't trust btrfs for important data, and neither should you. If you experience data loss on a Synology box, the answer you'll get from them is "tough shit, hope you have backups, and here's a coupon for a new Synology unit."
> btrfs was seven years old at that point and declared "stable" three years before that.
The on-disk format was declared stable in 2013[1]. That just meant that barring an act of God, they were not going to break the on-disk format, e.g. a filesystem created at that point would continue to be mountable for the foreseeable future. It was not a declaration that the filesystem was itself now stable necessarily, but especially was not suggesting that all of the features were stable. (As far as I know, many features still carried warning labels.)
Furthermore, the "it's been X years!" thing with open source projects has to stop. This is the same nonsense that happens with every other thing that is developed in the open. Who cares? What matters isn't how long it took to get here. What matters is where it's at. I know there's going to be some attempt at rationalizing this bit, but it's wasted on me because I'm tired of hearing this.
> ZFS is an example of amazingly written code by awesome engineers. It's simple to manage, scales well, and easy to grok.
Agreed. But ZFS was written by developers at Sun Microsystems for their commercial UNIX. We should all be gracious to live in a world where Sun Microsystems existed. We should also accept that Sun Microsystems is not the standard any more than Bell Labs was the standard, they are extreme outliers. If we measure everything based on whether it's as good as what Sun Microsystems was doing in the 2000s, we're going to have a bad time.
As an example, DTrace is still better than LTTng is right now. I hope that sinks in for everyone.
However, OpenZFS is not backed by Sun Microsystems, because Sun Microsystems is dead. Thankfully and graciously at that, it has been maintained for many years by volunteers, including at least one person who worked on ZFS at Sun. (Probably more, but I only know of one.)
Now if OpenZFS eats your data, there is no big entity to go to any more than there is for btrfs. As far as I know, there's no big entity funding development, improvements, or maintenance. That's fine; that's how many filesystems are. But still, that's not what propelled ZFS to where it stood when Sun was murdered.
> btrfs sadly will go by the wayside once bcachefs reaches maturity.
I doubt it will disappear quickly: it will probably continue to see ongoing development. Open Source is generally pretty good at keeping things alive in a zombie state. That's pretty important since it is typically non-trivial to do online conversion of filesystems. (Of course, we're in a thread about a tool that does seamless offline conversion of filesystems, which is pretty awesome and impressive in and of itself.)
But for what it's worth, I am fine with bcachefs supplanting btrfs eventually. It seems like it had a better start, it benchmarks faster, and it's maturing nicely. Is it safer today? Depends on who you ask. But it seems likely that the point at which bcachefs will be considered stable by most is no more than a year or two away, tops, assuming kernel drama doesn't hold back upstream.
Should users trust bcachefs with their data? I think you probably can right now with decent safety, if you're using mainline kernels, but bcachefs is still pretty new. I'm not aware of anyone using it in production yet. It really could use a bit more time before recommending that people jump over to it.
> I wouldn't trust btrfs for important data, and neither should you.
I stand by my statement: you should always ensure you have sufficient backups for important data, but most users should absolutely fear hardware failures more than btrfs bugs. Hardware failures are a when, not an if: hardware will always fail eventually. Data-eating btrfs bugs have certainly existed, but it's not like they just appear left and right. When such a bug appears, it is often newsworthy, and usually has to do with some unforeseen case that you are not so likely to run into by accident.
Rather than lose data, btrfs is instead more likely to just piss you off by being weird. There are known quirks that probably won't lose you any data, but that are horribly annoying. It is still possible, to my knowledge, to get stuck in a state where the filesystem is too full to delete files and the only way out is in recovery. This is pretty stupid.
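The usual first aid for the too-full situation is a filtered balance to free up half-empty chunks; a sketch, and it doesn't always help once metadata space is completely exhausted:

    # repack data chunks that are less than 10% used, returning them to unallocated space
    btrfs balance start -dusage=10 /mnt
    # compare allocated vs. actually used space
    btrfs filesystem usage /mnt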
It's also not particularly fast, so if someone isn't looking for a feature-rich CoW filesystem with checksums, I strongly recommend just going with XFS instead. But if you run Linux and you do want that, btrfs is the only mainline game in town. ZFS is out-of-tree and holds back your kernel version, not to mention you can never really ship products using it (with Linux) because of silly licensing issues.
> If you experience data loss on a Synology box, the answer you'll get from them is "tough shit, hope you have backups, and here's a coupon for a new Synology unit."
If so, that suggests their brand image depends somewhat on btrfs bugs being rare in their setup, and Synology actually has a fairly good reputation. If anything really hurts their reputation, it's mainly the usual stuff (enshittification). The fact that DSM defaults to btrfs is one of the more boring things about it at this point.
I agree with what you say, and I would never trust btrfs with my data because of issues that I've seen in the past. At my last job I installed my Ubuntu desktop with btrfs, and within three days it had been corrupted so badly by a power outage that I had to completely wipe and reinstall the system.
That said:
> but cause the filesystem to panic with error messages for which there is no documentation and little clue what the problem is.
The one and only time I experimented with ZFS as a root filesystem I got bit in the ass because the zfs tools one day added a new feature flag to the filesystem that the boot loader (grub) didn't understand and therefore it refused to read the filesystem, even read-only. Real kick in the teeth, that one, especially since the feature flag was completely irrelevant to just reading enough of the filesystem for the boot loader to load the kernel and there was no way to override it without patching grub's zfs module on another system then porting it over.
Aside from that, ZFS has been fantastic, and now that we're all using UEFI and our kernels and initrds are on FAT32 filesystems I'm much less worried, but I'm still a bit gunshy. Not as much as with BTRFS, mind you, but somewhat.
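For what it's worth, newer OpenZFS releases grew a pool-level compatibility property aimed at exactly this problem; a sketch of how I understand it is meant to be used (pool and device names are placeholders):

    # restrict a boot pool to feature flags GRUB can read
    zpool create -o compatibility=grub2 bpool mirror /dev/sda2 /dev/sdb2
    zpool get compatibility bpool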
> Please go look up survivor bias. That's what all you btrfs fanboys don't seem to understand. It doesn't matter how well it has worked for 99.9% of you. Filesystems have to be the most reliable component in an operating system.
Not sure. It's useful if they are reliable, but they only need to be roughly as reliable as your storage media. If your storage media breaks down once in a thousand years (or once a year for a thousand disks), then it doesn't matter much if your filesystem breaks down once in a million years or once in a trillion years.
Meta (Facebook) has millions of instances of Btrfs in production. More than any other filesystem by far. A few years ago when Fedora desktop variants started using Btrfs by default, Meta’s experience showed it was no less reliable than ext4 or XFS.
Haven't had any issues with it after using it for years on my work and home PCs. I use transparent compression, snapshots, and send/receive, and they all work great.
The main complaint was always about parity RAID, which I still wouldn't recommend running from what I've heard. But RAID 1-10 have been stable.
I tried btrfs for the first time a few weeks ago. I had been looking for mature r/w filesystems that support realtime compression, and btrfs seemed like a good choice.
My use case is this: I normally make full disk images of my systems and store them on a (100TB) NAS. As the number of systems grows, the space available for multiple backup generations shrinks. So compression of disk images is good, until you want to recover something from a compressed disk image without doing a full restore. If I put an uncompressed disk image in a compressed (zstd) btrfs filesystem, I can mount volumes and access specific files without waiting days to uncompress.
So I gave it a try and did a backup of an 8TB SSD image to a btrfs filesystem, and it consumed less than 4TB, which was great. I was able to mount partitions and access individual files within the compressed image.
The next thing I tried, was refreshing the backup of a specific partition within the disk image. That did not go well.
Here's what I did to make the initial backup:
(This was done on an up-to-date Ubuntu 24.04 desktop.)
cd /btrfs.backups
truncate -s 7696581394432 8TB-Thinkpad.btrfs
mkfs.btrfs -L btrfs.backup 8TB-Thinkpad.btrfs
mount -o compress=zstd 8TB-Thinkpad.btrfs /mnt/1
pv < /dev/nvme0n1 > /mnt/1/8TB-Thinkpad.nvme0n1
All good so far. The backup took about three hours, but would probably go twice as fast if I had used my TB4 dock instead of a regular USB-C port for the backup media.
Things went bad when I tried to update one of the backed up partitions:
kpartx -a /mnt/1/8TB-Thinkpad.nvme0n1
pv < /dev/nvme0n1p5 > /dev/mapper/loop20p5
This sort of thing works just fine on a normal, uncompressed ext4 filesystem.
I did not really expect this to work here, and my expectations were met.
The result was a bunch of kernel errors, the backup device being remounted r/o, and a corrupt btrfs filesystem with a corrupt backup file.
So for this use case, btrfs is a big improvement for reading compressed disk images on the fly, but it is not suitable for re-writing sections of disk images.
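The workaround that usually gets suggested for rewriting disk images in place on btrfs is to disable copy-on-write on the image file, but that also disables checksums and the compression that motivated this setup in the first place; an untested sketch, with a hypothetical separate file name:

    # chattr +C only takes effect on an empty file, so set it before writing any data
    touch /mnt/1/8TB-Thinkpad.nocow.img
    chattr +C /mnt/1/8TB-Thinkpad.nocow.img
    pv < /dev/nvme0n1 > /mnt/1/8TB-Thinkpad.nocow.img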
I’ve been using it for a few years on my NAS for all the data drives (with SnapRAID for parity and data validation), and as the boot drive on a few SBCs that run various services. I also use it as the boot drive for my Linux desktop PC. So far no problems at all and I make heavy use of snapshots; I have also had various things like power outages that have shut down the various machines multiple times.
I’ve never used BTRFS raid so can’t speak to that, but in my personal experience I’ve found BTRFS and the snapshot system to be reliable.
Seems like most (all?) stories I hear about corruption and other problems are all from years ago when it was less stable (years before I started using it). Or maybe I just got lucky ¯\_(ツ)_/¯
BTRFS RAID10 can seamlessly combine multiple raw disks without trying to match capacities.
Next time I'll just replace the 4TB disk in my 5-disk RAID10 with a 20TB one. Currently I have 4+8+8+16+20 TB disks.
MD RAID does not do checksumming, although I believe XFS is about to add support for it in the future.
I have had my BTRFS raid filesystem survive a lot during the past 14 years:
- a burned power supply: no loss of data
- failing RAM that started corrupting memory: after a little hack 1), BTRFS scrub saved most of the data even though the situation got so bad the kernel would crash within 10 minutes
- a buggy PCIe SATA expansion card: I tried to add a 6th disk, but noticed after a few million write errors to one disk that the card just randomly stopped passing data through: no data corruption, although the btrfs write error counters are in the tens of millions now
- 4 disk failures: I have only one original disk still running and it is showing a lot of bad sectors
1) one of the corrupted sectors was in the btrfs tree that contains the checksums for the rest of the filesystem, and both copies were broken. It prevented access to some 200 files. I patched the kernel to log the exact sector in addition to the expected and actual values. Turns out it was just a single bit flip, so I used a hex editor to flip it back to the correct value and got the files back.
I don’t use BTRFS raid, I don’t actually use any RAID. I use SnapRAID which is really more of a parity system than real RAID.
I have a bunch of data disks that are formatted BTRFS, then 2 parity disks formatted using ext4 since they don’t require any BTRFS features. Then I use snapraid-btrfs which is a wrapper around SnapRAID to automatically generate BTRFS snapshots on the data disks when doing a SnapRAID sync.
Since the parity is file based, it’s best to use it with snapshots, so that’s the solution I went with. I’m sure you could also use LVM snapshots with ext4 or ZFS snapshots, but BTRFS with SnapRAID is well supported and I like how BTRFS snapshots/subvolumes works so I went with that. Also BTRFS has some nice features over ext4 like CoW and checksumming.
I considered regular RAID but I don’t need the bandwidth increase over single disks and I didn’t ever want the chance of losing a whole RAID pool. With my SnapRAID setup I can lose any 2 drives and not lose any data, and if I lose 3 drives, I only lose the data on any lost data drives, not all the data. Also it’s easy to add a single drive at a time as I need more space. That was my thought process when choosing it anyway and it’s worked for my use case (I don’t need much IOPS or bandwidth, just lots of cheap fairly resilient and easy to expand storage).
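For anyone curious, a setup like that is mostly just a small config file plus a periodic sync; roughly (paths and disk names are placeholders, not my actual config):

    # /etc/snapraid.conf (sketch)
    parity   /mnt/parity1/snapraid.parity
    2-parity /mnt/parity2/snapraid.parity
    content  /var/snapraid/snapraid.content
    content  /mnt/disk1/.snapraid.content
    data d1  /mnt/disk1/
    data d2  /mnt/disk2/
    data d3  /mnt/disk3/

    # plain SnapRAID would be `snapraid sync`; the snapraid-btrfs wrapper
    # snapshots the data disks first and syncs against the snapshots
    snapraid-btrfs sync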
BTRFS RAID is usage-aware, so a rebuild will not need to do a bit-for-bit copy of the entire disk, but only of the parts that are actually in use. Also, because btrfs has data checksumming, it can detect read errors even when the disk reports a successful read (checksums are verified on every read; a scrub is needed to proactively verify data that isn't otherwise being read).
More flexibility in drives. Btrfs's RAID1 isn't actually RAID1 where everything is written to all the drives; it's closer to RAID10 in that it writes two copies of each piece of data across the drives. So you can have 1+2+3 TB drives in an array and still get 3TB of usable storage, or even 1+1+1+1+4. And you can add/remove single drives easily.
You can also set different RAID levels for metadata versus data, because the raid knows the difference. At some point in the future you might be able to set it per-file too.
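Setting the profiles separately is just a pair of flags; a sketch with placeholder devices:

    # two copies of data, three copies of metadata, spread across the drives
    mkfs.btrfs -d raid1 -m raid1c3 /dev/sda /dev/sdb /dev/sdc
    # or convert the profiles of an existing, mounted filesystem
    btrfs balance start -dconvert=raid1 -mconvert=raid1c3 /mnt/data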
The documentation describes 'btrfs check' as being dangerous to run without consulting the mailing list first.
That sums up btrfs pretty well.
Fifteen years in, and the filesystem's design is still so half-baked that their "check" program can't reliably identify problems and fix them correctly. You have to have a developer look at the errors and then tell you what to do. Fifteen years in.
Nobody cares about btrfs anymore because everyone knows someone who has been burned by it. Which is a shame, because it can do both metadata and data rebalancing and defragmentation, as well as things like spreading N copies of data across X drives (though this feature is almost entirely negated by metadata not having this capability; again, fifteen years in, why is this still a thing?), and one can add/remove drives from a btrfs volume without consequence. But... it's not able to do stuff like have a volume made up of mirrored pairs, and RAID5/6 are (still) unstable (fifteen years in, why is this still a thing?).
Do yourself a favor and just stick with ext4 for smaller/simple filesystem needs, XFS where you need the best possible speed or for anything big with lots of files (on md if necessary), or OpenZFS.
Now that the BSD and Linux folks have combined forces and are developing OpenZFS together it keeps getting better and better; btrfs's advantages over ZFS just aren't worth the headaches.
ZFS's major failing is that it offers no way to address inevitable filesystem data and free space fragmentation, and while you can remove devices from a ZFS pool, it incurs a permanent performance penalty, because they work around ZFS's architectural inflexibilities by adding a mapping table so it can find the moved chunks of data. That mapping table never goes away unless you erase and re-create the files. Which I suppose isn't the end of the world; technically, you could have a script that walked the filesystem re-creating files, but that brings its own problems.
That the fs can't address this stuff internally is particularly a bummer considering that ZFS is intended to be used in massive (petabyte to exabyte) filesystems where it would be completely impractical to "just" move the data to a fresh ZFS filesystem and back again (the main suggestion for fragmentation).
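The device-removal workaround mentioned above looks deceptively clean from the command line; a sketch with a placeholder pool and device:

    # evacuates the data, then leaves a permanent indirect mapping behind
    zpool remove tank sdc
    # the removed vdev lingers in the pool layout as an indirect entry
    zpool status tank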
But...btrfs doesn't offer external (and mirrored!) transaction logging devices, SSD cache, or concepts like pairs of mirrored drives being used in stripes or contiguous chunks.
If ZFS ever manages to add maintenance to the list of things it excels at, there will be few arguments against it except for situations where its memory use isn't practical.
>ZFS's major failing is that it offers no way to address inevitable filesystem data and free space fragmentation, and while you can remove devices from a ZFS pool, it incurs a permanent performance penalty, because they work around ZFS's architectural inflexibilities by adding a mapping table so it can find the moved chunks of data.
I'm not an FS specialist, but by chance a couple of days ago I found an interesting discussion about the reliability of SSDs where there was a strong warning about the extreme wear ZFS puts on consumer SSDs (up to the suggestion to never use ZFS unless you have heavy-duty/RAID-grade SSDs). So ZFS also has unfinished work, and not only the software improvements you mentioned.
BTW, from my (nonspecialist) point of view, it is easier to resist the urge to use the unreliable features of Btrfs than to replace a bunch of SSD drives. At least if you're paying for them out of your own pocket.
The standard technique is to reserve a big file on the old filesystem for the new filesystem metadata, and then walk all files on the old filesystem and use fiemap() to create new extents that point to the existing data - only writing to the space you reserved.
You only overwrite the superblock at the very end, and you can verify that the old and new filesystems have the same contents before you do.
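You can see the extent map such a converter would be pointing at with filefrag, which uses the same FIEMAP ioctl (a sketch; the path is a placeholder):

    # list the physical extents backing a file; FIEMAP under the hood
    filefrag -v /mnt/old/some-large-file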
I believe that is also the method [btrfs-convert](https://btrfs.readthedocs.io/en/latest/Convert.html) uses. A cool trick that tool uses is to keep the ext4 structures on disk (as a subvolume), which allows reverting to ext4 if the conversion didn't go as planned (as long as you don't do anything to mess with the ext4 extents, such as defragmenting or balancing the filesystem, and you can't revert after deleting the subvolume of course).
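The happy path with btrfs-convert is roughly this (a sketch; the device is a placeholder, run it against an unmounted filesystem, and back up first regardless):

    # make sure the ext4 filesystem is clean before converting
    e2fsck -f /dev/sdb1
    # build btrfs metadata in free space; the old filesystem is kept as the ext2_saved subvolume
    btrfs-convert /dev/sdb1
    mount /dev/sdb1 /mnt && ls /mnt/ext2_saved
    # if it didn't go as planned and ext2_saved is untouched, roll back
    btrfs-convert -r /dev/sdb1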
My conversion technically completed, but there were so many misaligned sectors and constant strange checksum errors (on files written after the conversion). With the cherry on top being that if there's more than X% of checksum errors, btrfs refuses to mount and you have to do multiple arcane incantations to get it to clear all its errors. Real fun if you need your laptop to solve a high-priority problem.
Lesson learned: despite whatever “hard” promises a conversion tool (and its creators) make, just backup, check the backup, then format and create your new filesystem.
I've never had the conversion corrupt a filesystem for me (plenty of segfaults halfway through, though). It's a neat trick for when you want to convert a filesystem that doesn't have much on it, but I wouldn't use it for anything critical. Better to format the drive and copy files back from a backup, and you probably want that anyway if you're planning on using filesystem features like snapshots.
Windows used to feature a similar tool to transition from FAT32 to NTFS. I'd have the same reservations about that tool, though. Apple also did something like this with an even weirder conversion step (source and target filesystem didn't have the same handling for case sensitivity!) and I've only read one or two articles about people losing data because of it. It can definitely be done safely, if given enough attention, but I don't think anyone cares enough to write a conversion tool with production grade quality.
I believe you are right. You can only convert back to the metadata from before. So any new or changed (different extents) files will be lost or corrupted.
So it's best to only mount read-only if you're considering a rollback. Otherwise it's pretty risky.
No, it also covers the data. As long as you don't delete the rollback subvolume, all the original data should still be there, uncorrupted.
Even if you disable copy-on-write, as long as the rollback subvolume is there to lay claim to the old data, it's considered immutable and any modification will still have to copy it.
I understood it as "it doesn't touch/ignore the data". But I guess we mean the same thing.
You are right. All of the old files will be in areas btrfs should consider used. So it should correctly restore the state from before the migration.
Thanks!
This is a weird level of pedantry induced by holding many beers tonight, but I've always thought of "Hold my beer" as in "Holy shit the sonofabitch actually pulled it off, brilliant". I think it's perfectly fitting. Jumping a riding lawnmower over a car with a beer in hand but they actually did the math first. I love it.
It’s referring to a comment a drunk person would make before doing something extremely risky. They need someone to hold the beer so it isn’t spilled during what’s coming next.
(Off-topic: I think they used "hold my beer" correctly. It can be used for any weird idea that a drunk person would actually try (usually with a stretch), regardless of whether they succeed or not. I don't think that "the SOAB actually pulled it off" is part of the usage.)
Apple did something like this with a billion live OS X/iOS deployments (HFS+ -> APFS). It can be done methodically at scale, as other commenters point out, but it obviously needs care.
You don’t need to look that far. Many of us here lived through the introduction of NTFS and did live migrations from FAT32 to NTFS in the days of Windows 2000 and Windows XP.
Yeah, thanks for recalling this. I totally forgot about that, because I never trusted Windows upgrades, let alone filesystem conversions. Always back up and do a clean install. It's Microsoft software, after all.
Craig Federighi on some podcast once said they conducted dry-runs of the process in previous iOS updates (presumably building the new APFS filesystem metadata in a file without promoting it to the superblock) and checking its integrity and submitting telemetry data to ensure success.
Apple doesn’t just deploy to the whole world in an instant though.
First it goes to the private beta users, then the public beta users, and then it slowly rolls out globally. Presumably they could slow down the roll out even more for a risky change to monitor it.
Sure, but still, whoever wrote the patch had his ass on the line even shipping to a batch of beta users. Remember, this is Apple, not Google, where the dude likely would have gotten promoted and left the team right after pressing click :)
"WinBtrfs is a Windows driver for the next-generation Linux filesystem Btrfs. A reimplementation from scratch, it contains no code from the Linux kernel, and should work on any version from Windows XP onwards. It is also included as part of the free operating system ReactOS."
Not sure what point you're making here. WinBtrfs is a driver for the same btrfs filesystem that Linux uses. Its most common use case is reading Linux partitions in Windows on machines that dual-boot both operating systems.
As someone who has witnessed Windows explode twice from in-place upgrades I would just buy a new disk or computer and start over. I get that this is different but the time that went into that data is worth way more than a new disk. It's just not worth the risk IMO. Maybe if you don't care about the data or have good backups and wish to help shake bugs out - go for it I guess.
The userland tools included with Windows are very lacking, but that's more of a distro problem than a filesystem problem. VSS works fine -- boringly, even -- and people take advantage of it all the time even if they don't know it.
Very cool, but nobody will hear about this until at least a week after they format their ntfs drives that they have been putting off formatting for 2 years
btrfs isn't that terrible for desktop use right now. I mean, I wouldn't personally use it, I lost data on it a couple of times four-plus years ago, but it's come a long way since then. (My preference is to keep everything I care about on a fileserver like TrueNAS running ZFS with proper snapshotting, replication, and backup, and live dangerously on the desktop testing out bcachefs, but I recognize not everyone can live my life and some people just want a laptop with a reasonable filesystem resistant to bit rot.)
I don't know, I just use FDE because I don't trust filesystem level encryption to protect against the many side channel attacks one can inject into a running system. LUKS is good enough for me.
Honestly the political situation will probably be a /good/ thing for long term stability, because I get a few months without any stupid arguments with upstream and finally get to write code in peace :)
It sucks for users though, because now you have to get my tree if you want the latest fixes, and there's some stuff that should be backported for forwards compatibility with the scalability improvements coming soon [1].
I have to say I do see Linus' point though. Mainline is for stuff that's production-ready and well tested.
Having said that, it's a great initiative and I hope it becomes ready for prime time. I think btrfs is taking too long and is infused with too much big-tech interest.
Don't worry about the users, we'll manage somehow, it's such a tiny burden compared to the actual development. I'm just really happy to see you're not discouraged by the petty political mess and keep pushing through. Thank you!
Hey Kent, I just wanted to thank you for all your hard work. For those of us that would like to use the code in your tree to get the latest fixes and help with testing, I was wondering how hard it would be to set up OBS to automatically build and package fresh kernels and userland tools for OpenSUSE/Debian/Arch/etc, nightly and/or for each tag. I think it would help adoption as well as comfort, knowing that your improvements will arrive as soon as they are available.
I personally stopped compiling your code in my personal repo when bcachefs was upstreamed. It was often a pain to rebase against the latest hardened code, and I'm happier now that it's upstream. I've been using your fs for 7-8 years now and I hope your latest changes to the disk format will actually improve mount performance (yes, I'm one of the silent "victims" you were talking about). I hope nothing breaks...
Anyway thank you for your work and I wish you all the best on the lkml and your work.
Echoing the sibling comment Kent, bcachefs is a really wonderful and important project. The whole world wants your filesystem to become the de-facto standard Linux filesystem for the next decade. One more month of LKML drama is a small price for that (at LKML prices).
Been using it since 6.7 on my root partition. Around 6.9 there were issues that needed fsck. Now on 6.12 it is pretty stable already. And fast: it is easy to run thousands of Postgres tests on it, which is not something ZFS or btrfs could really handle without tuning...
So if you're a cowboy, now it's a good time to test. If not, wait one more year.
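(The tuning alluded to for the other two is usually along these lines; a sketch with a placeholder dataset and path, and the settings are commonly suggested rather than universally agreed on:)

    # ZFS: match recordsize to Postgres' 8K pages, reduce double caching
    zfs set recordsize=8k tank/pgdata
    zfs set primarycache=metadata tank/pgdata
    # btrfs: disable CoW (and with it checksums/compression) on the data directory
    chattr +C /var/lib/postgresql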
I haven't lost any data yet. A few months ago it did something stupid on my laptop that looked like it was about to repeat btrfs's treatment of my data, but 15 minutes of googling on my phone and I figured out the right commands to get it to fix whatever was broken and get back to a bootable state. I'm a decade away from considering it for a file server holding data I actually care about, but as my main desktop and laptop filesystem (with dotfiles backed up to my git instance via yadm and everything I care about NFS-mounted in from my fileservers), it's totally fine.
We're still six months or so from taking the experimental label off, yeah. Getting close, though: filesystem-is-offline bugs have slowed to a trickle, and it's starting to be performance issues that people are complaining about.
Hoping to get online fsck and erasure coding finished before taking off experimental, and I want to see us scaling to petabyte sized filesystems as well.
Oh wow... That's actually really fast progress all things considered. Well done! I really hope all the... umm... misunderstandings get worked out because you're doing great work.
It's all stuff that's been in the pipeline for a long time.
(And we'll see when online fsck and erasure coding actually land, I keep getting distracted by more immediate issues).
Really, the bigger news right now is probably all the self healing work that's been going on. We're able to repair all kinds of damage without an explicit fsck now, without any user intervention: some things online, other things will cause us to go emergency read only and be repaired on the next mount (e.g. toasted btree nodes).
One of the choices you've made that I really like is sharing the kernel and userspace filesystem code so directly in the form of libbcachefs. I get the impression this means the kernel can do practically everything userspace can, and vice versa. (I think the only exception is initialising devices by writing a superblock, although the kernel can take over the initialisation of the rest of the filesystem from that point onwards? And maybe turning passphrases into keys for encrypted-fs support which does an scrypt thing?)
As well as giving you really powerful userspace tools for manipulating filesystems, this also suggests that a stripped down busybox module for bcachefs could consist of superblock writing and pretty much nothing else? Maybe a few ioctls to trigger various operations. "Just leave it all to the kernel."
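In practice that shows up as the single bcachefs binary doing nearly everything, with the kernel able to pick up from the superblock onward; a sketch with placeholder devices:

    # userspace writes the superblocks (about the only thing the kernel can't do itself)
    bcachefs format /dev/sdb /dev/sdc
    # multi-device mount: the kernel assembles and runs the filesystem
    mount -t bcachefs /dev/sdb:/dev/sdc /mnt
    # offline fsck shares the same repair code via libbcachefs
    bcachefs fsck /dev/sdb /dev/sdc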
The btrfs volume on my Synology NAS is 6 years old now, with 24/7 operation (including hosting web sites with constant read/write database activity) and going through several volume resize operations. No issues.
There's nothing "artificial" in his requirements, data checksums and efficient snapshots are required for some workloads (for example, we use them for end-to-end testing on copies of the real production database that are created and thrown away in seconds), and building your own kernel modules is a stupid idea in many cases outside of two extremes of the home desktop or a well-funded behemoth like Facebook.
Data checksums in particular is 90% of the reason I want to use a newer filesystem than XFS or Ext4. This is useful for almost any usage.
Snapshots a bit less so, but even for my laptop this would be useful, mostly for backups (that is: create snapshot, backup that, and then delete the snapshot – this is how I did things on FreeBSD back in the day). A second use case would be safer system updates and easier rollbacks – not useful that often, but when you need it, it's pretty handy.
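On btrfs that snapshot-backup-delete cycle is pretty direct (a sketch; it assumes /home is a subvolume, and the destination host/path are placeholders):

    btrfs subvolume snapshot -r /home /home/.backup-snap
    btrfs send /home/.backup-snap | ssh backuphost 'btrfs receive /srv/backups/home'
    btrfs subvolume delete /home/.backup-snap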
NILFS2 is upstream, stores checksums for all data, and has the most efficient snapshot system bar none. Although I don't think I'd push the recommendation, solely because there are fewer eyes on it.
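NILFS2's continuous checkpoints are driven by a few small tools from nilfs-utils; a sketch with a placeholder device and mount point:

    lscp /dev/sdb1                  # list the checkpoints taken automatically
    mkcp -s /dev/sdb1               # create a checkpoint and mark it as a snapshot
    chcp ss /dev/sdb1 1234          # or promote an existing checkpoint to a snapshot
    mount -t nilfs2 -o ro,cp=1234 /dev/sdb1 /mnt/old    # mount that snapshot read-only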
Heh, NILFS2. The first flash-friendly filesystem in Linux. When Flash was still expensive, I used it on an SD card to keep my most important source trees to speed up builds and things like svn (yes yes!) diff. I thought it had languished due to better resourced and more sophisticated efforts like F2FS.
I may be wrong, but I don't think it's just "excessive freeness"; the CDDL also has restrictions the GPL does not have (stuff about patents). It's a mutual incompatibility.
Apache v2 and GPLv3 were made explicitly compatible while providing different kinds of freedom.
The CDDL is more permissive, it's a weak copyleft license while the GPL is strong copyleft, and that makes the two incompatible. Calling it "excessive freeness" is inflammatory, but they're broadly correct.
> The CDDL is more permissive, it's a weak copyleft license while the GPL is strong copyleft, and that makes the two incompatible. Calling it "excessive freeness" is inflammatory, but they're broadly correct.
It's not really. Many aspects of the license are free-er, but that's not what causes the incompatibility. The GPL does not have any kind of clause saying code that is distributed under too permissive of a license may not be incorporated into a derived/combined work. It's not that it's weak copyleft, it is that it contains particular restrictions that makes it incompatible with GPL's restrictions.
BSD licenses do not have that incompatible restriction (= are freer than CDDL, in that aspect) and can be compatible with the GPL.
The basic problem is that both licenses have a "you must redistribute under this license, and you can't impose additional restrictions"-type clause. There are some other differences between the licences, but that's what the incompatibility is about.
Some have argued that this may not actually constitute an incompatibility, but not many are keen to "fuck around and find out" with Oracle's lawyers. So here we are.
I'm not sure if it's that easy, because all the contributors to OpenZFS would also need to sign off on a license change, right?
That said, they did relicense DTrace to the GPL a few years back, but I don't really know the details of that, how hard it was, or whether the same is possible for OpenZFS.
This is a complete misunderstanding. People take MIT code and put it in the kernel all the time. The issue is that the CDDL is not GPL-compatible because it has restrictions which the GPL doesn't have.
I think the whole license stuff is controlling software too much. All that legal blah has nothing to do with software and is only good for feeding overpriced lawyers.
When I publish code I don't pick any license. It's just free for anyone to use for whatever. I don't like the GPL for this, it's way too complicated. I don't want to deal with that.
It's a shame some legal BS like that is holding back the deployment of a great filesystem like ZFS.
I just don't care about jurisdictions or legal frameworks. I just care about technology.
It's also really only a big thing in the US. Here in Europe things are usually settled amicably and the courts frown on people taking up their time with bullshit. They only really get involved if all else fails and only really for businesses. Nobody really cares about software licensing except the top 200 companies.
I'm admittedly a bit biased though. Unfortunately I sometimes have to deal with our internal legal dept. in the US (I'm based in Europe) and I find them such nasty people to deal with. Always really pushy. This made me hate them and their trade.
It wouldn't have been a candidate for "the standard UNIX filesystem" even if it were in macOS, because Sun made an intentionally GPL-incompatible license for it.
Yeah the situation is unfortunate. There's a decent chance I'd be using ZFS if not for the licensing issues, but as a practical matter I'm getting too old to be futzing with kernel modules on my daily driver.
DKMS solved these "licensing issues." Dell is mum on the official motivation, but it provides a licensing demarcation point and a way for kernels to update without breaking modules, making it easier for companies to develop for Linux.
_Windows Drivers work the same way and nobody huffs and puffs about that_
I'd love to have an intelligent discussion on how one person's opinion on licensing issues stacks up against the legal teams of half the Fortune 50. Licensing doesn't work on "well, I didn't mean it THAT way."
I admit I'm not fully up to date on whether it's actually "license issues" or something else. I'm not a lawyer. As a layman here's what I know. I go to the Arch wiki (https://wiki.archlinux.org/title/ZFS) and I see this warning under the DKMS section (as you advised):
> Warning: Occasionally, the dkms package might not compile against the newest kernel packages in Arch. Using the linux-lts kernel may provide better compatibility with out-of-tree kernel modules, otherwise zfs-dkms-staging-gitAUR backports compatibility patches and fixes for the latest kernel package in Arch on top of the stable zfs branch
So... my system might fail to boot after updates. If I use linux-lts, it might break less often. Or I can use zfs-dkms-staging-git, and my system might break even less often... or more often, because it looks like that's installing kernel modules directly from the master branch of some repo.
As a practical matter I couldn't care less whether my system fails to boot because of "license issues" or some other reason; I just want the lawyers to sort their shit out so I don't have to risk my system becoming unbootable at some random inopportune time. Until then, I've never hit a btrfs bug, so I'm going to keep on using it for every new build.
I've been bitten by kernel module incompatibility making my data unavailable enough times that I no longer consider ZFS to be viable under Linux. Using an LTS kernel only delays the issue until the next release is LTS. I really hope that bcachefs goes stable soon.
I used ZFS with DKMS on CentOS back in the day, and I found it a pain. It took a long time to compile, and I had some issues with upgrades as well (it's been a few years, so I have forgotten what the exact issues were).
When it comes to filesystems, I very much appreciate the "it just works" experience – not having a working filesystem means not having a working system, and it's a pain to solve.
Again, all of this has been a while. Maybe it's better now and I'm not opposed to trying, but I consider "having to use DKMS" to be a downside of ZFS.
You don't have to be to see bug reports, bug fix patches, or test it yourself, of course. "Filesystem guys" also tend to wear rose colored glasses when it comes to their filesystem, at times (ext2, xfs, btrfs, etc.)
Linus absolutely does jump in on VFS level bugs when necessary (and major respect for that; he doesn't rest on laurels and he's always got his priorities in the right place) - but there's only so much he can keep track of, and the complexity of a modern filesystem tends to dwarf other subsystems.
The people at the top explicitly don't and can't keep track of everything, there's a lot of stuff (e.g. testing) that they leave to other people - and I do fault them for that a bit; we badly need to get a bit more organized on test infrastructure.
And I wouldn't say that filesystem people in general wear rose colored glasses; I would categorize Dave Chinner and Ted Ts'o more in the hard-nosed, realistic category, and myself as well. I'd say it's just the btrfs folks who've had that fault in the past, and I think Josef has had more than enough experience at this point to learn that lesson.
> Linus absolutely does jump in on VFS level bugs when necessary (and major respect for that; he doesn't rest on laurels and he's always got his priorities in the right place) - but there's only so much he can keep track of, and the complexity of a modern filesystem tends to dwarf other subsystems.
The point is he doesn't have to understand details of the code to see bug reports and code churn and bug fix commits and have a reasonable idea of whether it's stable enough for end users. I would trust him to make that call more than a "filesystem guy", in fact.
> The people at the top explicitly don't and can't keep track of everything, there's a lot of stuff (e.g. testing) that they leave to other people - and I do fault them for that a bit; we badly need to get a bit more organized on test infrastructure.
Absolutely not on the top nodes. Testing has to be distributed and pushed down to end nodes where development happens or even below otherwise it does not scale.
> And I wouldn't say that filesystem people in general wear rose colored glasses;
Perhaps that's what you see through your rose colored glasses? (sorry, just a cheeky dig).
> I would categorize Dave Chinner and Ted Ts'o more in the hard-nosed, realistic category, and myself as well. I'd say it's just the btrfs folks who've had that fault in the past, and I think Josef has had more than enough experience at this point to learn that lesson.
Dave Chinner for example would insist the file-of-zeroes problem of XFS is really not a problem. Not because he was flat wrong or consciously being biased for XFS I'm sure, but because according to the filesystem design and the system call interface and the big customers they talked to at SGI, it was operating completely as per specification.
I'm not singling out filesystem developers or any one person, or even software development specifically. All complex projects need advocates and input from outside stakeholders (users, other software, etc.) for this exact reason: those deep in the guts of something usually don't understand all perspectives.
> The point is he doesn't have to understand details of the code to see bug reports and code churn and bug fix commits and have a reasonable idea of whether it's stable enough for end users. I would trust him to make that call more than a "filesystem guy", in fact.
No, that's not enough, and I would not call that kind of slagging good communication to users.
Seeing bugfixes go by doesn't tell you that much, and it definitely doesn't tell you which filesystem to recommend to users because other filesystems simply may not be fixing critical bugs.
Based on (a great many) user reports that I've seen, I actually have every reason to believe that your data is much safer on bcachefs than btrfs. I'm not shouting about that while I still have hardening to do, and my goal isn't just to beat btrfs, it's to beat ext4 and xfs as well: but given what I see I have to view Linus's communications as irresponsible.
> Absolutely not on the top nodes. Testing has to be distributed and pushed down to end nodes where development happens or even below otherwise it does not scale.
No, our testing situation is crap, and we need leadership that says more than "not my problem".
> Dave Chinner for example would insist the file-of-zeroes problem of XFS is really not a problem. Not because he was flat wrong or consciously being biased for XFS I'm sure, but because according to the filesystem design and the system call interface and the big customers they talked to at SGI, it was operating completely as per specification.
Well, he had a point, and you don't want to be artificially injecting fsyncs because for applications that don't need them that gets really expensive. Fsync is really expensive, and it impacts the whole system.
Now, it turned out there is a clever and more practical solution to this (which I stole from ext4), but you simply cannot expect any one person to know the perfect solution to every problem.
By way of example, I was in an argument with Linus a month or so ago where he was talking about filesystems that "don't need fsck" (which is blatantly impossible), and making "2GB should be enough for anyone" arguments. No one is right all the time, no one has all the answers - but if you go into a conversation assuming the domain experts aren't actually the experts, that's not a recipe for a productive conversation.
> No, that's not enough, and I would not call that kind of slagging good communication to users.
It is enough. Users need to be told when something is not stable or good enough.
> Seeing bugfixes go by doesn't tell you that much, and it definitely doesn't tell you which filesystem to recommend to users because other filesystems simply may not be fixing critical bugs.
Cherry picking what I wrote. Bugfixes, code churn, and bug reports from users. It certainly tells someone like Linus a great deal without ever reading a single line of code.
> Based on (a great many) user reports that I've seen, I actually have every reason to believe that your data is much safer on bcachefs than btrfs. I'm not shouting about that while I still have hardening to do, and my goal isn't just to beat btrfs, it's to beat ext4 and xfs as well: but given what I see I have to view Linus's communications as irresponsible.
Being risk averse with my data, I think Linus's comment is a helpful and responsible one to balance other opinions.
> No, our testing situation is crap, and we need leadership that says more than "not my problem".
No. Testing is crap because developers and employers don't put enough time into testing. They know what has to be done, leadership has told them what has to be done, common sense says what has to be done. They refuse to do it.
When code gets to a pull request for Linus it should have had enough testing (including integration testing via linux-next) that it is ready to be taken up by early user testers via Linus' tree. Distros and ISVs and IHVs and so on need to be testing there if not linux-next.
> Well, he had a point, and you don't want to be artificially injecting fsyncs because for applications that don't need them that gets really expensive. Fsync is really expensive, and it impacts the whole system.
No, it was never about fsync; it was about the inode length metadata hitting persistent storage before the data writes that extend the file do, which is what leaves files full of zeroes after a crash. By a careful reading of POSIX it may be allowed, but as a matter of quality of implementation for actual users (aside from the administrator-intensive high-end file servers and databases etc. from SGI), it is the wrong thing to do. ext3, for example, solved it with "ordered" journal mode (not fsync).
You can accept that it is poor quality but decide you will do it anyway, but you can't just say it's not a problem because you language-lawyered POSIX and found out it's okay, when you have application developers and users complaining about it.
> By way of example, I was in an argument with Linus a month or so ago where he was talking about filesystems that "don't need fsck" (which is blatantly impossible), and making "2GB should be enough for anyone" arguments. No one is right all the time, no one has all the answers - but if you go into a conversation assuming the domain experts aren't actually the experts, that's not a recipe for a productive conversation.
I didn't see that so I can't really comment. It does not seem like it provides a counter example to what I wrote. I did not say Linus is never wrong. I have got into many flame wars with him so I would be the last to say he is always right. Domain experts are frequently wrong about their field of expertise too, especially in places where it interacts with things outside their field of expertise.
> I didn't see that so I can't really comment. It does not seem like it provides a counter example to what I wrote. I did not say Linus is never wrong. I have got into many flame wars with him so I would be the last to say he is always right. Domain experts are frequently wrong about their field of expertise too, especially in places where it interacts with things outside their field of expertise.
You came in with an argument to authority, and now you're saying you disagree with that authority yourself, but you trust that authority more than domain experts?
I don't think you've fully thought this through...
Everyone believes what they read in the news, until they see it reporting on something they know about - and then they forget about it a week later and go back to trusting the news.
> You came in with an argument to authority, and now you're saying you disagree with that authority yourself, but you trust that authority more than domain experts?
I wrote what I wrote. I didn't "come in" with the argument to authority though, that was you (or perhaps the OP you replied to first). Anyway, I gave examples where domain experts are myopic or don't actually have the expertise in what other stakeholders (e.g., users) might require.
honestly I think btrfs isn't bloated enough for today's VM-enabled world. ext4 and xfs and hell, exfat haven't gone anywhere, and if those fulfill your needs, just use those. but if you need the more advanced features that btrfs or zfs bring, those added features are quite welcome. imo, btrfs could use the benefits of being a cluster filesystem on top of everything it already does, because having a VM be able to access a disk that is currently mounted by the host or another VM would be useful. imagine if the disk exported to the VM could be mounted by another VM, either locally or remotely, simultaneously. arguably ceph fills this need, but having a btrfs-native solution for that would be useful.
CoW won't necessarily make the VM image bloated. In fact, as I've foolishly found out, BTRFS can be quite useful for deduplicating very similar VMs at the block level, at the cost of needing to re-allocate new disk space on writes. In my VM archive, six 50 GiB virtual machines took up 52 GiB rather than 300 GiB and that was quite impressive.
Many downsides to CoW are also present with many common alternatives (i.e. thin LVM2 snapshots). Best to leave all of that off if you're using spinning rust or native compression features, though.
ZFS performs much better than btrfs with the many small writes that VMs produce. Why exactly is a great question. Maybe it has to do with the optimizations around the ZIL, the intent log where synchronous writes are accumulated before they are written to their long-term location.
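For anyone who wants to poke at this themselves, these are the knobs people usually experiment with; the dataset name and values below are purely illustrative, not a recommendation:
# hypothetical dataset for VM disk images
zfs create -o recordsize=16K -o compression=lz4 tank/vms
# logbias controls how sync writes use the ZIL: "latency" (the default) favors the
# intent log / SLOG, "throughput" writes more directly to the main pool
zfs set logbias=latency tank/vms
# watch pool I/O while the VMs run
zpool iostat -v tank 5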
Checksum self healing on ZFS and BTRFS saved my data from janky custom NAS setups more times that I can count. Compression is also nice but the thing I like most is the possibility of creating many partition-like sub volumes without needing to allocate or manage space.
"Endless errors"? Are you talking about disk errors? Or are you referring to that one time long ago that it had a bug wrt a specific uncommon RAID setup?
APFS and ZFS aren't very interesting to me honestly, because neither are, or can be, in the Linux kernel. I also don't understand why APFS is in the same conversation as ZFS and BTRFS.
The only reason I can think of is so that they can use the same FS in both Windows and Linux, but with NTFS they already can.
Mind you, with OpenZFS (https://openzfsonwindows.org/) you get Windows (flaky), FreeBSD, NetBSD and Linux, but, as I said, I'm not sure ZFS is super reliable on Windows at this point.
Mind you, I just stick with NTFS: Linux can see it, Windows can see it, and if there are extra features btrfs provides, they're not ones I am missing.
I’m a die-hard ZFS fan and heavy user since the Solaris days (and counting) but I believe the WinBtrfs project is in better (more useable) shape than the OpenZFS for Windows project.
With ntfs you have to create a separate partition though. With btrfs you could create a subvolume and just have one big partition for both linux and windows.
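For what it's worth, the subvolume route is only a couple of commands; the paths and device below are just placeholders:
# assuming the big btrfs filesystem is mounted at /mnt/pool
btrfs subvolume create /mnt/pool/shared
# mount just that subvolume wherever it's convenient
mount -o subvol=shared /dev/sdX2 /mnt/shared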
> what ?!?! NTFS has no case sensitivity no compression.
As the sibling comment mentioned, NTFS does have a case-sensitive mode, for instance for the POSIX subsystem (which no longer exists, but it existed back when NTFS was new); I think it's also used for WSL1. And NTFS does have per-file compression, I've used it myself back in the early 2000s (as it was a good way to free a bit of space on the small disks from back then); there was even a setting you could enable on Windows Explorer which made compressed files in its listing blue-colored.
NTFS has a per-folder case sensitivity flag. You could set it online at any time prior to Windows 11, but as of 11 you can now only change it on an empty folder (probably due to latent bugs they didn't want to fix).
NTFS has had mediocre compression support from the very start, which could be enabled on a volume or directory basis, but gained modern LZ-based compression (that could be extended to whatever algorithm you wanted) in Windows 10; unfortunately that is a per-file process that must be done post-write.
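Roughly, on current Windows both features look like this; the paths are placeholders, and setCaseSensitiveInfo needs admin rights and a reasonably recent Windows 10/11 build:
:: per-directory case sensitivity flag
fsutil.exe file setCaseSensitiveInfo C:\src\myproject enable
:: classic NTFS compression, per directory
compact /c /s:C:\archive
:: newer LZX compression, applied per file after it has been written
compact /c /exe:lzx C:\tools\big.exe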
I activated it back in mid 2010 or so. I had the most amazing pikachuface when random things stopped working because it could no longer find that file it wanted to load with an all-lowercase-string even though the project builds it with CapitalCase. Sigh...
I found the link to Quibble, an open and extensible reverse engineering of the Windows kernel bootloader to be much more intriguing: https://github.com/maharmstone/quibble
Thinking of how I'd do this for ZFS... I think I'd do something like: add a layer that can read other filesystem types and synthesize ZFS block pointers, then ZFS could read other filesystems, and as it writes it could rewrite the whole thing slowly. If ZFS had block pointer rewrite (and I've explained here before why it does not and cannot have BP rewrite capabilities, not being a proper CAS filesystem), one could just make it rewrite the whole thing to finish the conversion.
Is anyone here using BTRFS and can comment on its current-day stability? I used to read horror stories about it
I've been using it for a few years now on my main PC (has a couple SSDs and a large HDD) and my laptop; it was the default on openSUSE and I just used that. Then I realized that snapshots are a feature I didn't know I wanted :-P.
Never had a problem, though it is annoying that whatever BTRFS thinks is free space and what the rest of the OS thinks is free space do not always align. It has rarely been a problem in practice though.
I've used BTRFS exclusively for over a decade now on all my personal laptops, servers, and embedded devices. I've never had a single problem.
It's the flagship Linux filesystem: outside of database workloads, I don't understand why anybody uses anything else.
"Flagship"? I don't know a single person who uses it in production systems. It's the only filesystem I've lost data to. Ditto for friends.
Please go look up survivor bias. That's what all you btrfs fanboys don't seem to understand. It doesn't matter how well it has worked for 99.9% of you. Filesystems have to be the most reliable component in an operating system.
It's a flagship whose fsck requires you to contact developers to seek advice on how to use it because otherwise it might destroy your filesystem.
It's a flagship whose userspace tools, fifteen years in, are still seeing major changes.
It's a flagship whose design is so poor that fifteen years in the developers are making major changes to its structure and deprecating old features in ways that do not trigger an automatic upgrade or informative error to upgrade, but cause the filesystem to panic with error messages for which there is no documentation and little clue what the problem is.
No other filesystem has these issues.
Btrfs is in production all over the damn place, at big corporations and all kinds of different deployments. Synology has their own btrfs setup that they ship to customers with their NAS software for example.
I found it incredibly annoying the first time I ran out of disk space on btrfs, but many of these points are hyperbolic and honestly just silly. For example, btrfs doesn't really do offline fsck. fsck.btrfs has a zero percent chance of destroying your volume because it does nothing. As for the user space utilities changing... I'm not sure how that demonstrates the filesystem is not production ready.
Personally I usually use either XFS or btrfs as my root filesystem. While I've caught some snags with btrfs, I've never lost any data. I don't actually know anyone who has, I've merely just heard about it.
And it's not like other well-regarded filesystems have never run into data loss situations: even OpenZFS recently (about a year ago) uncovered a data-eating bug that called its reliability into question.
I'm sure some people will angrily tell me that actually btrfs is shit and the worst thing to ever be created and honestly whatever. I am not passionate about filesystems. Wake me up when there's a better one and it's mainlined. Maybe it will eventually be bcachefs. (Edit: and just to be clear, I do realize bcachefs is mainline and Kent Overstreet considers it to be stable and safe. However, it's still young and its upstream future has been called into question. For non-technical reasons, but still; it does make me less confident.)
Yes, but that doesn't do the job that a fsck implementation does. fsck is something you stuff into your initrd to do some quick checks/repairs prior to mounting, but btrfs intentionally doesn't need those.
If you need btrfs-check, you have probably hit either a catastrophic bug or hardware failure. This is not the same as fsck for some other filesystems. However, ZFS is designed the same way and also has no fsck utility.
So whatever point was intended to be made was not, in any case.
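For routine checking, the tool you actually run is scrub, not check; a rough sketch (mount point and device are placeholders):
# online: re-read everything, verify checksums, repair from a redundant copy if one exists
btrfs scrub start /mnt/data
btrfs scrub status /mnt/data
# offline last resort: read-only first, and only consider --repair after seeking advice
btrfs check --readonly /dev/sdX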
>I don't actually know anyone who has, I've merely just heard about it.
Well "yarg", a few comments up in this conversation, says he lost all his data to it with the last year.
I've seen enough comments like that that I don't see it as a trustworthy filesystem. I never see comments like that about ext4 or ZFS.
Contrary to popular belief, people on a forum you happen to participate in are still just strangers. In line with popular belief, anecdotal evidence is not a good basis to form an opinion.
Exactly how do you propose to form an opinion on filesystem reliability then? Do my own testing with thousands of computers over the course of 15 years?
You don't determine what CPUs are fast or reliable by reading forum comments and guessing, why would filesystems be any different?
That said, you make a good point. It's actually pretty hard to quantify how "stable" a filesystem is meaningfully. It's not like anyone is doing Jepsen-style analysis of filesystems right now, so the best thing we can go off of is testimony. And right now for btrfs, the two types of data-points are essentially, companies that have been using it in production successfully, and people on the internet saying it sucks. I'm not saying either of those is great, and I am not trying to tell anyone that btrfs is some subjective measure of good. I'm just here to tell people it's apparently stable enough to be used in production... because, well, it's being used in production.
Would I argue it is a particularly stable filesystem? No, in large part because it's huge. It's a filesystem with an integrated volume manager, snapshots, transparent compression and much more. Something vastly simpler with a lower surface area and more time in the oven is simply less likely to run into bugs.
Would I argue it is perfectly reasonable to use btrfs for your PC? Without question. A home use case with a simple volume setup is exceedingly unlikely to be challenging for btrfs. It has some rough edges, but I don't expect to be any more likely to lose data to btrfs bugs as I expect to lose data from hardware failures. The bottom line is, if you absolutely must not lose data, having proper redundancy and backups is probably a much bigger concern than btrfs bugs for most people.
>You don't determine what CPUs are fast or reliable by reading forum comments and guessing, why would filesystems be any different?
Your premise is entirely wrong. How else would I determine what CPUs are fast or reliable? Buy dozens of them and stress-test them all? No, I use online sites like cpu-monkey.com that compare different CPUs' features and performance according to various benchmarks, for the performance part at least. For reliability, what way can you possibly think of other than simply aggregating user ratings (i.e. anecdotes)? If you aren't running a datacenter or something, you have no practical alternative.
At least for spinning-rust HDDs, the helpful folks at Backblaze have made a treasure trove of long-term data available to us. But this isn't available for most other things.
> It's not like anyone is doing Jepsen-style analysis of filesystems right now, so the best thing we can go off of is testimony.
This is exactly my point. We have nothing better, for most of this stuff.
>companies that have been using it in production successfully, and people on the internet saying it sucks
Companies using something doesn't always mean it's any good, especially for individual/consumer use. Companies can afford teams of professionals to manage stuff, and they can also make their own custom versions of things (esp. true with OSS code). They're also using things in ways that aren't comparable to individuals. These companies may be using btrfs in a highly feature-restricted way that they've found, through testing, is safe and reliable for their use case.
> It's a filesystem with an integrated volume manager, snapshots, transparent compression and much more. Something vastly simpler with a lower surface area and more time in the oven is simply less likely to run into bugs.
This is all true, but ZFS has generally all the same features, yet I don't see remotely as many testimonials from people saying "ZFS ate my data!" as I have with btrfs over the years. Maybe btrfs has gotten better over time, but as the American car manufacturers found out, it takes very little time to ruin your reputation for reliability, and a very long time to repair that reputation.
> Your premise is entirely wrong. How else would I determine what CPUs are fast or reliable? Buy dozens of them and stress-test them all? No, I use online sites like cpu-monkey.com that compare different CPUs' features and performance according to various benchmarks, for the performance part at least. For reliability, what way can you possibly think of other than simply aggregating user ratings (i.e. anecdotes)? If you aren't running a datacenter or something, you have no practical alternative.
My point is just that anecdotes alone don't tell you much. I'm not suggesting that everyone needs to conduct studies on how reliable something is, but if nobody has done the groundwork then the best thing we can really say is we're not sure how stable it is because the best evidence is not very good and it conflicts.
> Companies using something doesn't always mean it's any good, especially for individual/consumer use. Companies can afford teams of professionals to manage stuff, and they can also make their own custom versions of things (esp. true with OSS code). They're also using things in ways that aren't comparable to individuals. These companies may be using btrfs in a highly feature-restricted way that they've found, through testing, is safe and reliable for their use case.
For Synology you can take a look at what they're shipping since they're shipping it to consumers. It does seem like they're not using many of the volume management features, instead using some proprietary volume management scheme on the block layer. However otherwise there's nothing particularly special that I can see, it's just btrfs. Other advanced features like transparent compression are available and exposed in the UI.
(edit: Small correction. While I'm still pretty sure Synology has custom volume management for RAID which works on the block level, as it turns out, they are actually using btrfs subvolumes as well.)
I think the Synology case is an especially interesting bit of evidence because it's gotta be one of the worst cases of shipping a filesystem, since you're shipping it to customer machines you don't control and can't easily inspect later. It's not the only case of shipping btrfs to the customer either, I believe ChromeOS does this and even uses subvolumes, though I didn't actually look for myself when I was using it so I'm not actually 100% sure on that one.
> This is all true, but ZFS has generally all the same features, yet I don't see remotely as many testimonials from people saying "ZFS ate my data!" as I have with btrfs over the years. Maybe btrfs has gotten better over time, but as the American car manufacturers found out, it takes very little time to ruin your reputation for reliability, and a very long time to repair that reputation.
In my opinion, ZFS and other Solaris technologies that came out around that time period set a very high bar for reliable, genuinely innovative system features. I think we're going to have to live with the fact that just having a production-ready filesystem dropped onto the world is not going to be the common case, especially in the open source world: the filesystem will need to go through its growing pains in the open.
Btrfs has earned a reputation as the perpetually-unfinished filesystem. Maybe it's tainted and it will simply never approach the degree of stability that ZFS has. Or, maybe it already has, and it will just take a while for people to acknowledge it. It's hard to be sure.
My favorite option would be if I just simply don't have to find out, because an option arrives that quickly proves itself to be much better. bcachefs is a prime contender since it not only seems to have better bones but it's also faster than btrfs in benchmarks anyways (which is not saying much because btrfs is actually quite slow.) But for me, I'm still waiting. And until then, ZFS is not in mainline Linux, and it never will be. So for now, I'm using btrfs and generally OK recommending it for users that want more advanced features than ext4 can offer, with the simple caveat that you should always keep sufficient backups of your important data at all times.
I only joined in on this discussion because I think that the btrfs hysteria train has gone off the rails. Btrfs is a flawed filesystem, but it continues to be vastly overstated every time it comes up. It's just, simply put, not that bad. It does generally work as expected.
>Synology has their own btrfs setup that they ship to customers with their NAS software for example.
Synology infamously/hilariously does not use btrfs as the underlying file system because even they don't trust btrfs's RAID subsystem. Synology uses LVM RAID that is presented to btrfs as a single drive. btrfs isn't managing any of the volumes/disks.
Their reason for not using btrfs as a multi-device volume manager is not specified, though it's reasonable to infer that it is because btrfs's own built-in volume management/RAID wasn't suitable. That's not really very surprising: back in ~2016 when Synology started using btrfs, these features were still somewhat nascent even though other parts of the filesystem were starting to become more mature. To this day, btrfs RAID is still pretty limited, and I wouldn't recommend it. (As far as I know, btrfs RAID5/6 is even still considered incomplete upstream.) On the other hand, btrfs subvolumes as a whole are relatively stable, and that and other features are used in Synology DSM and ChromeOS.
That said, there's really nothing particularly wrong with using btrfs with another block-level volume manager. I'm sure it seems silly since it's something btrfs ostensibly supports, but filesystem-level redundancy is still one of those things that I think I would generally be afraid to lean on too hard. More traditional RAID at the block level is simply going to be less susceptible to bugs, and it might even be a bit easier to manage. (I've used ZFS raidz before and ran into issues/confusion when trying to manage the zpool. I have nothing but respect for the developers of ZFS but I think the degree to which people portray ZFS as an impeccable specimen of filesystem perfection is a little bit unrealistic, it can be confusing, limited, and even, at least very occasionally, buggy too.)
>That's not really very surprising: back in ~2016 when Synology started using btrfs, these features were still somewhat nascent even though other parts of the filesystem were starting to become more mature.
btrfs was seven years old at that point and declared "stable" three years before that.
ZFS is an example of amazingly written code by awesome engineers. It's simple to manage, scales well, and easy to grok. btrfs sadly will go the wayside once bcachefs reaches maturity. I wouldn't trust btrfs for important data, and neither should you. If you experience data loss on a Synology box, the answer you'll get from them is "tough shit, hope you have backups, and here's a coupon for a new Synology unit."
> btrfs was seven years old at that point and declared "stable" three years before that.
The on-disk format was declared stable in 2013[1]. That just meant that barring an act of God, they were not going to break the on-disk format, e.g. a filesystem created at that point would continue to be mountable for the foreseeable future. It was not a declaration that the filesystem was itself now stable necessarily, but especially was not suggesting that all of the features were stable. (As far as I know, many features still carried warning labels.)
Furthermore, the "it's been X years!" thing referring to open source projects has to stop. This is the same nonsense that happens with every other thing that is developed in the open. Who cares? What matters isn't how long it took to get here. What matters is where it's at. I know there's going to be some attempt at rationalizing this bit, but it's wasted on me because I'm tired of hearing this.
> ZFS is an example of amazingly written code by awesome engineers. It's simple to manage, scales well, and easy to grok.
Agreed. But ZFS was written by developers at Sun Microsystems for their commercial UNIX. We should all be gracious to live in a world where Sun Microsystems existed. We should also accept that Sun Microsystems is not the standard any more than Bell Labs was the standard, they are extreme outliers. If we measure everything based on whether it's as good as what Sun Microsystems was doing in the 2000s, we're going to have a bad time.
As an example, DTrace is still better than LTTng is right now. I hope that sinks in for everyone.
However, OpenZFS is not backed by Sun Microsystems, because Sun Microsystems is dead. Thankfully and graciously at that, it has been maintained for many years by volunteers, including at least one person who worked on ZFS at Sun. (Probably more, but I only know of one.)
Now if OpenZFS eats your data, there is no big entity to go to anymore than there is for btrfs. As far as I know, there's no big entity funding development, improvements, or maintenance. That's fine, that's how many filesystems are. But still, that's not what propelled ZFS to where it stood when Sun was murdered.
> btrfs sadly will go the wayside once bcachefs reaches maturity.
I doubt it will disappear quickly: it will probably continue to see ongoing development. Open Source is generally pretty good at keeping things alive in a zombie state. That's pretty important since it is typically non-trivial to do online conversion of filesystems. (Of course, we're in a thread about a tool that does seamless offline conversion of filesystems, which is pretty awesome and impressive in and of itself.)
But for what it's worth, I am fine with bcachefs supplanting btrfs eventually. It seems like it had a better start, it benchmarks faster, and it's maturing nicely. Is it safer today? Depends on who you ask. But it seems like the point at which most people will consider bcachefs stable is no more than a year or two away, tops, assuming kernel drama doesn't hold back upstream.
Should users trust bcachefs with their data? I think you probably can right now with decent safety, if you're using mainline kernels, but bcachefs is still pretty new. Not aware of anyone using it in production yet. It really could use a bit more time before recommending people jump over to it.
> I wouldn't trust btrfs for important data, and neither should you.
I stand by my statement: you should always ensure you have sufficient backups for important data, but most users should absolutely fear hardware failures more than btrfs bugs. Hardware failures are a when, not an if; hardware will always fail eventually. Data-eating btrfs bugs have certainly existed, but it's not like they just appear left and right. When such a bug appears, it is often newsworthy, and usually has to do with some unforeseen case that you are not so likely to run into by accident.
Rather than lose data, btrfs is instead more likely to just piss you off by being weird. There are known quirks that probably won't lose you any data, but that are horribly annoying. It is still possible, to my knowledge, to get stuck in a state where the filesystem is too full to delete files and the only way out is in recovery. This is pretty stupid.
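For anyone who hits that, the commonly suggested escape hatch (no guarantees) is to compact nearly-empty chunks so the allocator has usable free space again:
# see how space is split between data and metadata chunks
btrfs filesystem usage /
# reclaim completely empty data chunks first, then progressively fuller ones
btrfs balance start -dusage=0 /
btrfs balance start -dusage=10 /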
It's also not particularly fast, so if someone isn't looking for a feature-rich CoW filesystem with checksums, I strongly recommend just going with XFS instead. But if you run Linux and you do want that, btrfs is the only mainline game in town. ZFS is out-of-tree and holds back your kernel version, not to mention you can never really ship products using it (with Linux) because of silly licensing issues.
> If you experience data loss on a Synology box, the answer you'll get from them is "tough shit, hope you have backups, and here's a coupon for a new Synology unit."
That suggests that their brand image somewhat depends on the rarity of btrfs bugs in their implementation, but Synology has a somewhat good reputation actually. If anything really hurts their reputation, it's mainly the usual stuff (enshittification.) The fact that DSM defaults to using btrfs is one of the more boring things at this point.
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
I agree with what you say, and I would never trust btrfs with my data because of issues that I've seen in the past. At my last job I installed my Ubuntu desktop with btrfs, and within three days it had been corrupted so badly by a power outage that I had to completely wipe and reinstall the system.
That said:
> but cause the filesystem to panic with error messages for which there is no documentation and little clue what the problem is.
The one and only time I experimented with ZFS as a root filesystem I got bit in the ass because the zfs tools one day added a new feature flag to the filesystem that the boot loader (grub) didn't understand and therefore it refused to read the filesystem, even read-only. Real kick in the teeth, that one, especially since the feature flag was completely irrelevant to just reading enough of the filesystem for the boot loader to load the kernel and there was no way to override it without patching grub's zfs module on another system then porting it over.
Aside from that, ZFS has been fantastic, and now that we're all using UEFI and our kernels and initrds are on FAT32 filesystems I'm much less worried, but I'm still a bit gunshy. Not as much as with BTRFS, mind you, but somewhat.
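If I understand the newer OpenZFS releases correctly, you can now guard against exactly that by pinning a boot pool to a GRUB-compatible feature set; the pool and device names here are made up:
# OpenZFS 2.1+: restrict the pool to the features in the grub2 compatibility list
zpool create -o compatibility=grub2 bpool mirror /dev/sda3 /dev/sdb3
# later upgrades should then refuse to enable features outside that set
zpool upgrade bpool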
I lost data on btrfs on a raspberry pi with a slightly dodgy PSU.
We need more testing of filesystems and pulling the power.
I switched to a NAS with battery backup and it's been better.
So that was inconclusive, before that the last time I lost data like that was to Reiserfs in the early 2000s.
> Please go look up survivor bias. That's what all you btrfs fanboys don't seem to understand. It doesn't matter how well it has worked for 99.9% of you. Filesystems have to be the most reliable component in an operating system.
Not sure. It's useful if they are reliable, but they only need to be roughly as reliable as your storage media. If your storage media breaks down once in a thousand years (or once a year for a thousand disks), then it doesn't matter much if your filesystem breaks down once in a million years or once in a trillion years.
That being said, I had some trouble with BTRFS.
Meta (Facebook) has millions of instances of Btrfs in production. More than any other filesystem by far. A few years ago when Fedora desktop variants started using Btrfs by default, Meta’s experience showed it was no less reliable than ext4 or XFS.
I had it go boom on Tumbleweed (when the drive filled up) less than a year ago.
I tried accessing and fixing the fubar partition from a parallel install, but to no avail.
I've used it for my personal machines and backups (via btrbk) for years without any issues
Btrfs has been slowly eating my data; randomly small files or sectors of larger files will be replaced with all nulls.
Haven't had any issues with it after using it for years on my work and home PCs. I use transparent compression, snapshots, and send/receive, and they all work great.
The main complaint was always about parity RAID, which I still wouldn't recommend running from what I've heard. But RAID 1-10 have been stable.
I tried btrfs for the first time a few weeks ago. I had been looking for mature r/w filesystems that support realtime compression, and btrfs seemed like a good choice.
My use case is this: I normally make full disk images of my systems and store them on a (100TB) NAS. As the number of systems grows, the space available for multiple backup generations shrinks. So compression of disk images is good, until you want to recover something from a compressed disk image without doing a full restore. If I put an uncompressed disk image in a compressed (zstd) btrfs filesystem, I can mount volumes and access specific files without waiting days to uncompress.
So I gave it a try and did a backup of an 8TB SSD image to a btrfs filesystem, and it consumed less than 4TB, which was great. I was able to mount partitions and access individual files within the compressed image.
The next thing I tried, was refreshing the backup of a specific partition within the disk image. That did not go well.
Here's what I did to make the initial backup:
(This was done on an up-to-date Ubuntu 24.04 desktop.)
cd /btrfs.backups
truncate -s 7696581394432 8TB-Thinkpad.btrfs
mkfs.btrfs -L btrfs.backup 8TB-Thinkpad.btrfs
mount -o compress=zstd 8TB-Thinkpad.btrfs /mnt/1
pv < /dev/nvme0n1 > /mnt/1/8TB-Thinkpad.nvme0n1
All good so far. The backup took about three hours, but would probably go twice as fast if I had used my TB4 dock instead of a regular USB-C port for the backup media.
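(As an aside: if the compsize utility is installed - I believe it's packaged on most distros - it's a handy way to confirm what the compressed image actually occupies on disk, e.g. compsize /mnt/1/8TB-Thinkpad.nvme0n1, which reports uncompressed vs. on-disk size per compression type.)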
Things went bad when I tried to update one of the backed up partitions:
kpartx -a /mnt/1/8TB-Thinkpad.nvme0n1
pv < /dev/nvme0n1p5 > /dev/mapper/loop20p5
This sort of thing works just fine on a normal, uncompressed ext4 filesystem. I did not really expect this to work here, and my expectations were met.
The result was a bunch of kernel errors, the backup device being remounted r/o, and a corrupt btrfs filesystem with a corrupt backup file.
So for this use case, btrfs is a big improvement for reading compressed disk images on the fly, but it is not suitable for re-writing sections of disk images.
I’ve been using it for a few years on my NAS for the all the data drives (with Snapraid for parity and data validation), and as the boot drive on a few SBCs that run various services. Also use it as the boot drive for my Linux desktop PC. So far no problems at all and I make heavy use of snapshots, I have also had various things like power outages that have shut down the various machines multiple times.
I’ve never used BTRFS raid so can’t speak to that, but in my personal experience I’ve found BTRFS and the snapshot system to be reliable.
Seems like most (all?) stories I hear about corruption and other problems are all from years ago when it was less stable (years before I started using it). Or maybe I just got lucky ¯\_(ツ)_/¯
Why use BTRfS raid rather than good old MDadm raid?
BTRFS RAID10 can seamlessly combine multiple raw disks without trying to match capacity.
Next time I'll just replace the 4T disk in my 5-disk RAID10 with a 20T one. Currently I have 4+8+8+16+20 TB disks.
MD raid does not do checksumming. Although I believe XFS is about to add support for it in the future.
I have had my BTRFS raid filesystem survive a lot during the past 14 years:
- burned power supply: no loss of data
- failed RAM that started corrupting memory: after a little hack 1), BTRFS scrub saved most of the data, even though the situation got so bad the kernel would crash in 10 minutes
- buggy PCIe SATA extension card: I tried to add a 6th disk, but noticed after a few million write errors to one disk that it just randomly stopped passing data through: no data corruption, although the btrfs write error counters are in the tens of millions now
- 4 disk failures: I have only one original disk still running and it is showing a lot of bad sectors
1) One of the corrupted sectors was in the btrfs tree that contains the checksums for the rest of the filesystem, and both copies were broken. It prevented access to some 200 files. I patched the kernel to log the exact sector in addition to the expected and actual value. Turns out it was just a single bit flip, so I used a hex editor to flip it back to the correct value and got the files back.
I don’t use BTRFS raid, I don’t actually use any RAID. I use SnapRAID which is really more of a parity system than real RAID.
I have a bunch of data disks that are formatted BTRFS, then 2 parity disks formatted using ext4 since they don’t require any BTRFS features. Then I use snapraid-btrfs which is a wrapper around SnapRAID to automatically generate BTRFS snapshots on the data disks when doing a SnapRAID sync.
Since the parity is file based, it’s best to use it with snapshots, so that’s the solution I went with. I’m sure you could also use LVM snapshots with ext4 or ZFS snapshots, but BTRFS with SnapRAID is well supported and I like how BTRFS snapshots/subvolumes works so I went with that. Also BTRFS has some nice features over ext4 like CoW and checksumming.
I considered regular RAID but I don’t need the bandwidth increase over single disks and I didn’t ever want the chance of losing a whole RAID pool. With my SnapRAID setup I can lose any 2 drives and not lose any data, and if I lose 3 drives, I only lose the data on any lost data drives, not all the data. Also it’s easy to add a single drive at a time as I need more space. That was my thought process when choosing it anyway and it’s worked for my use case (I don’t need much IOPS or bandwidth, just lots of cheap fairly resilient and easy to expand storage).
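In case it helps anyone picture it, a setup like that boils down to a snapraid.conf roughly along these lines; the mount points and disk names are placeholders for my layout:
# two parity disks (ext4)
parity /mnt/parity1/snapraid.parity
2-parity /mnt/parity2/snapraid.2-parity
# content files kept in a couple of places
content /var/snapraid/snapraid.content
content /mnt/data1/snapraid.content
# btrfs data disks
data d1 /mnt/data1/
data d2 /mnt/data2/
data d3 /mnt/data3/
As I understand it, snapraid-btrfs then takes read-only snapshots of the data subvolumes and points SnapRAID at those for the sync, so the parity is computed against a frozen view.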
BTRFS raid is usage-aware, so a rebuild will not need to do a bit-for-bit copy of the entire disk, but only the parts that are actually in use. Also, because btrfs has data checksumming, it can detect read errors even when the disk reports a successful read (however, it will not verify the checksum during regular operation, only during scrub).
More flexibility in drives. Btrfs's RAID1 isn't actually RAID1 where everything is written to all the drives; it's closer to RAID10 in that it writes two copies of all data across the drives. So you can have 1+2+3 TB drives in an array and still get 3 TB of usable storage, or even 1+1+1+1+4. And you can add/remove single drives easily.
You can also set different RAID levels for metadata versus data, because the raid knows the difference. At some point in the future you might be able to set it per-file too.
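Concretely, that looks something like this (device names are placeholders):
# mixed-size devices: data kept as two copies, metadata as three
mkfs.btrfs -d raid1 -m raid1c3 /dev/sda /dev/sdb /dev/sdc
# or convert the profiles of an existing filesystem online
btrfs balance start -dconvert=raid1 -mconvert=raid1c3 /mnt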
The documentation describes 'btrfs check' as being dangerous to run without consulting the mailing list first.
That sums up btrfs pretty well.
Fifteen years in, and the filesystem's design is still so half-baked that their "check" program can't reliably identify problems and fix them correctly. You have to have a developer look at the errors and then tell you what to do. Fifteen years in.
Nobody cares about btrfs anymore because everyone knows someone who has been burned by it. Which is a shame, because it can do both metadata and data rebalancing and defragmentation, as well as do things like spread N copies of data across X drives (though this feature is almost entirely negated by metadata not having this capability. Again, fifteen years in, why is this still a thing?) and one can add/remove drives from a btrfs volume without consequence. But...it's not able to do stuff like have a volume made up of mirrored pairs, RAID5/6 are (still) unstable (fifteen years in, why is this still a thing?)
Do yourself a favor and just stick with ext4 for smaller/simple filesystem needs, XFS where you need the best possible speed or for anything big with lots of files (on md if necessary), or OpenZFS.
Now that the BSD and Linux folks have combined forces and are developing OpenZFS together it keeps getting better and better; btrfs's advantages over ZFS just aren't worth the headaches.
ZFS's major failing is that it offers no way to address inevitable filesystem data and free space fragmentation, and while you can remove devices from a ZFS pool, it incurs a permanent performance penalty, because they work around ZFS's architectural inflexibilities by adding a mapping table so it can find the moved chunks of data. That mapping table never goes away unless you erase and re-create the files. Which I suppose isn't the end of the world; technically, you could have a script that walked the filesystem re-creating files, but that brings its own problems.
That the fs can't address this stuff internally is particularly a bummer considering that ZFS is intended to be used in massive (petabyte to exabyte) filesystems where it would be completely impractical to "just" move data to a fresh ZFS filesystem and back again (the main suggestion for fragmentation.)
But...btrfs doesn't offer external (and mirrored!) transaction logging devices, SSD cache, or concepts like pairs of mirrored drives being used in stripes or contiguous chunks.
If ZFS ever manages to add maintenance to the list of things it excels at, there will be few arguments against it except for situations where its memory use isn't practical.
>ZFS's major failing is that it offers no way to address inevitable filesystem data and free space fragmentation, and while you can remove devices from a ZFS pool, it incurs a permanent performance penalty, because they work around ZFS's architectural inflexibilities by adding a mapping table so it can find the moved chunks of data.
I'm not an FS specialist, but by chance a couple of days ago I found an interesting discussion about the reliability of SSDs in which there was a strong warning about ZFS causing extreme wear on consumer SSDs (up to the suggestion to never use ZFS unless you have heavy-duty/RAID-grade SSDs). So ZFS also has unfinished work, and not only the software improvements you mentioned.
BTW, from my (nonspecialist) point of view, it is easier to resist the urge to use the unreliable features of Btrfs than to replace a bunch of SSD drives. At least if you pay for them from your own pocket.
Thank you for the thorough explanation
I would have needed that like 2 months ago, when I had to reformat a hard drive with more than 10TB of data from NTFS... ^^
Nice project!
I would be very surprised if it supported files that are under LZX compression.
(Not to be confused with Windows 2000-era file compression, this is something you need to activate with "compact.exe /C /EXE:LZX (filename)")
it seems to contain code that handles LZX, among other formats
https://github.com/search?q=repo%3Amaharmstone%2Fntfs2btrfs%...
I tried this one before; it resulted in a read-only disk. Hope it has improved since then.
That's 50% better than losing all your data!
The degree of hold-my-beer here is off the charts.
It's not quite as dangerous as you'd think.
The standard technique is to reserve a big file on the old filesystem for the new filesystem metadata, and then walk all files on the old filesystem and use fiemap() to create new extents that point to the existing data - only writing to the space you reserved.
You only overwrite the superblock at the very end, and you can verify that the old and new filesystems have the same contents before you do.
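You can see the extent mappings fiemap() exposes with filefrag, which uses the same ioctl under the hood:
# -v prints each extent's logical offset, physical location and length
filefrag -v /path/to/somefile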
I believe that is also the method [btrfs-convert](https://btrfs.readthedocs.io/en/latest/Convert.html) uses. A cool trick that tool uses is to keep the ext4 structures on disk (as a subvolume), which allows reverting to ext4 if the conversion didn't go as planned (as long as you don't do anything to mess with the ext4 extents, such as defragmenting or balancing the filesystem, and you can't revert after deleting the subvolume of course).
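For reference, the basic btrfs-convert flow looks roughly like this; the device name is a placeholder, and my recollection is that the saved subvolume is called ext2_saved even for ext4 sources:
e2fsck -f /dev/sdX1           # make sure the source filesystem is clean first
btrfs-convert /dev/sdX1       # in-place conversion; old metadata is kept in the ext2_saved subvolume
mount /dev/sdX1 /mnt && ls /mnt/ext2_saved
umount /mnt && btrfs-convert -r /dev/sdX1   # roll back, only valid while ext2_saved is untouched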
I tried that on a system in 2020 and it just corrupted my new FS. Cool idea though.
My conversion went fine, but there were so many misaligned sectors and constant strange checksum errors (on files written after the conversion). With the cherry on top being that if there’s more than X% of checksum errors, btrfs refuses to mount and you have to do multiple arcane incantations to get it to clear all its errors. Real fun if you need your laptop for a high priority problem to solve.
Lesson learned: despite whatever “hard” promises a conversion tool (and its creators) make, just backup, check the backup, then format and create your new filesystem.
I've never had the conversion corrupt a filesystem for me (plenty of segfaults halfway through, though). It's a neat trick for when you want to convert a filesystem that doesn't have much on it, but I wouldn't use it for anything critical. Better to format the drive and copy files back from a backup, and you probably want that anyway if you're planning on using filesystem features like snapshots.
Windows used to feature a similar tool to transition from FAT32 to NTFS. I'd have the same reservations about that tool, though. Apple also did something like this with an even weirder conversion step (source and target filesystem didn't have the same handling for case sensitivity!) and I've only read one or two articles about people losing data because of it. It can definitely be done safely, if given enough attention, but I don't think anyone cares enough to write a conversion tool with production grade quality.
You don't understand. You did get a btrfs that worked normally. /s
I believe you are right. You can only convert back to the metadata from before. So any new or changed (different extents) files will be lost or corrupted.
So it's best to only mount read-only when considering a rollback. Otherwise it's pretty risky.
No, it also covers the data. As long as you don't delete the rollback subvolume, all the original data should still be there, uncorrupted.
Even if you disable copy-on-write, as long as the rollback subvolume is there to lay claim to the old data, it's considered immutable and any modification will still have to copy it.
I understood it as "it doesn't touch/ignore the data". But I guess we mean the same thing.
You are right. All of the old files will be in areas btrfs should consider used. So it should correctly restore the state from before the migration. Thanks!
This is a weird level of pedantry induced by holding many beers tonight, but I've always thought of "Hold my beer" as in "Holy shit the sonofabitch actually pulled it off, brilliant". I think it's perfectly fitting. Jumping a riding lawnmower over a car with a beer in hand but they actually did the math first. I love it.
It’s referring to a comment a drunk person would make before doing something extremely risky. They need someone to hold the beer so it isn’t spilled during what’s coming next.
Right, but in those situations they succeed, kind of like "the cameraman never dies".
{\off I think they used "hold my beer" correctly. It can be used for any weird idea, that a drunk person would actually try (usually with a stretch), regardless if they succeed or not. I don't think that "the SOAB actually pulled it off" is part of the usage.}
A couple of years ago it was more like juggling chainsaws: https://github.com/maharmstone/ntfs2btrfs/issues/9
I tracked down a couple of nasty bugs at that time playing around with it, hopefully it's more stable now.
Apple did something like this with a billion live OS X/iOS deployments (HFS+ -> APFS). It can be done methodically at scale, as other commenters point out, but it obviously needs care.
You don’t need to look that far. Many of us here lived through the introduction of NTFS and did live migrations from FAT32 to NTFS in the days of Windows 2000 and Windows XP.
I still remember the syntax: convert C: /fs:ntfs
Yea thanks for recalling this. I totally forgot about that because I never trusted Windows upgrade, let alone filesystem conversion. Always backup and clean install. It's Microsoft software after all.
IIRC there was a similar conversion tool in Windows 98 for FAT16 -> FAT32.
When this first showed up I took 3 backups: two on networked drives and one on an external drive which was then disconnected from the system.
The second time I just went “meh” and let it run.
Craig Federighi on some podcast once said they conducted dry-runs of the process in previous iOS updates (presumably building the new APFS filesystem metadata in a file without promoting it to the superblock) and checking its integrity and submitting telemetry data to ensure success.
You can do all the testing in the world, but clicking deploy on that update must have been nerve wracking.
Apple doesn’t just deploy to the whole world in an instant though.
First it goes to the private beta users, then the public beta users, and then it slowly rolls out globally. Presumably they could slow down the roll out even more for a risky change to monitor it.
Sure, but still whoever wrote the patch had his ass on the line even shipping to a batch of beta users. Remember this is Apple not Google where the dude likely got promoted and left the team right after pressing click :)
[flagged]
Note this is not the Linux btrfs:
"WinBtrfs is a Windows driver for the next-generation Linux filesystem Btrfs. A reimplementation from scratch, it contains no code from the Linux kernel, and should work on any version from Windows XP onwards. It is also included as part of the free operating system ReactOS."
This is from the ntfs2btrfs maintainer's page.
https://github.com/maharmstone/btrfs
It's the same file system, with two different drivers for two different operating systems.
The metadata is adjusted for Windows in a way that is foreign to Linux.
Do Linux NTFS drivers deal with alternate streams?
"Getting and setting of Access Control Lists (ACLs), using the xattr security.NTACL"
"Alternate Data Streams (e.g. :Zone.Identifier is stored as the xattr user.Zone.Identifier)"
Not sure what point you're making here. WinBtrfs is a driver for the same btrfs filesystem that Linux uses. It's most common use case is reading the Linux partitions in Windows on machines that dual-boot both operating systems
What? Why would you need a Linux NTFS driver to read a btrfs filesystem? that makes no sense.
Storing Windows ACLs in xattrs is also pretty common (Samba does the same)
I'd delete my comment if I could at this point.
Yes it is?
As someone who has witnessed Windows explode twice from in-place upgrades I would just buy a new disk or computer and start over. I get that this is different but the time that went into that data is worth way more than a new disk. It's just not worth the risk IMO. Maybe if you don't care about the data or have good backups and wish to help shake bugs out - go for it I guess.
If only it had a native filesystem with snapshotting capability...
Which "it"?
Both btrfs and NTFS have snapshot capabilities.
It's so well hidden from users it might as well not exist. And you can't snapshot only a part of the NTFS filesystem.
The userland tools included with Windows are very lacking, but that's more of a distro problem than a filesystem problem. VSS works fine -- boringly, even -- and people take advantage of it all the time even if they don't know it.
And the new disk is also likely to have more longevity left in it, doesn't it?
Very cool, but nobody will hear about this until at least a week after they format their ntfs drives that they have been putting off formatting for 2 years
[flagged]
"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."
https://news.ycombinator.com/newsguidelines.html
btrfs isn't that terrible for desktop use right now. I mean, I wouldn't personally use it, I lost data on it a couple times four plus years ago, but it's come a long way since then. (My preference is to keep everything I care about on a fileserver like TrueNAS running ZFS with proper snapshotting, replication and backup, and live dangerously on the desktop testing out bcachefs, but I recognize not everyone can live my life and some people just want a laptop with a reasonable filesystem resistant to bit rot.)
I recently found out Fedora defaults to btrfs with zstd compression enabled by default. Seems to work well enough.
On my personal devices I prefer BTRFS' snapshotting ability over the risk of having to restore from backup at some point.
Is there decent encryption support, or are we stuck using full disk encryption at the block level?
I don't know, I just use FDE because I don't trust filesystem level encryption to protect against the many side channel attacks one can inject into a running system. LUKS is good enough for me.
It looks like there's slow but steady progress on FSCRYPT support on BTRFS: https://bkhome.org/news/202403/linux-kernel-btrfs-supports-f... No idea what kind of state it is in currently.
The latter at this point.
How is bcachefs for personal use these days?
The political situation for bcachefs is far from good, with pressure from Linus and a CoC violation.
The net effect will likely delay stability.
https://www.phoronix.com/news/Bcachefs-Fixes-Two-Choices
https://www.phoronix.com/news/Linux-CoC-Bcachefs-6.13
Honestly the political situation will probably be a /good/ thing for long term stability, because I get a few months without any stupid arguments with upstream and finally get to write code in peace :)
It sucks for users though, because now you have to get my tree if you want the latest fixes, and there's some stuff that should be backported for forwards compatibility with the scalability improvements coming soon [1].
[1]: https://www.patreon.com/posts/more-expensive-116975457
I'm hoping to use your filesystem when it's ready.
Everyone wishes that this were easier for you.
I have to say I do see Linus' point though. Mainline is for stuff that's production-ready and well tested.
Having said that it's a great initiative and I hope it becomes ready for prime time. I think btrfs is taking too long and is infused with too much big tech interest.
Don't worry about the users, we'll manage somehow, it's such a tiny burden compared to the actual development. I'm just really happy to see you're not discouraged by the petty political mess and keep pushing through. Thank you!
Hey Kent, I just wanted to thank you for all your hard work. For those of us that would like to use the code in your tree to get the latest fixes and help with testing, I was wondering how hard it would be to set up OBS to automatically build and package fresh kernels and userland tools for OpenSUSE/Debian/Arch/etc, nightly and/or for each tag. I think it would help adoption as well as comfort, knowing that your improvements will arrive as soon as they are available.
I personally stopped compiling your code in my personal repo when bcachefs was upstreamed. It was often a pain to rebase against the latest hardened code and I'm happier now that it's upstream. I've been using your fs for 7-8 years now and I hope your latest changes to the disk format will actually improve mount performance (yes, I'm one of the silent "victims" you were talking about). I hope nothing breaks...
Anyway thank you for your work and I wish you all the best on the lkml and your work.
Echoing the sibling comment Kent, bcachefs is a really wonderful and important project. The whole world wants your filesystem to become the de-facto standard Linux filesystem for the next decade. One more month of LKML drama is a small price for that (at LKML prices).
I've been running bcachefs on one of my laptops since it hit linux stable. Just wanted to say thank you for all your constant work on it.
I switched back to the arch default kernel for my 32TB home media server, would you recommend going back to compiling your kernel for the time being?
Not unless you've been hitting a bug you need the fix for
Been using it since 6.7 on my root partition. Around 6.9 there were issues that needed fsck. Now on 6.12 it is pretty stable already. And fast: it is easy to run thousands of Postgres tests on it. Not something ZFS or btrfs could really do without tuning...
So if you're a cowboy, now it's a good time to test. If not, wait one more year.
I haven't lost any data yet. A few months ago it did something stupid on my laptop that looked like it was about to repeat btrfs's treatment of my data, but after 15 minutes of googling on my phone I figured out the right commands to fix whatever was broken and get back to a bootable state. I'm a decade away from considering it for a file server holding data I actually care about, but as my main desktop and laptop filesystem (with dotfiles backed up to my git instance via yadm and everything I care about NFS-mounted in from my fileservers) it's totally fine.
I believe the main bcachefs maintainer does not advocate production use yet.
We're still six months or so from taking the experimental label off, yeah. Getting close, though: filesystem-is-offline bugs have slowed to a trickle, and it's starting to be performance issues that people are complaining about.
Hoping to get online fsck and erasure coding finished before taking off experimental, and I want to see us scaling to petabyte sized filesystems as well.
Oh wow... That's actually really fast progress all things considered. Well done! I really hope all the... umm... misunderstandings get worked out because you're doing great work.
It's all stuff that's been in the pipeline for a long time.
(And we'll see when online fsck and erasure coding actually land, I keep getting distracted by more immediate issues).
Really, the bigger news right now is probably all the self healing work that's been going on. We're able to repair all kinds of damage without an explicit fsck now, without any user intervention: some things online, other things will cause us to go emergency read only and be repaired on the next mount (e.g. toasted btree nodes).
One of the choices you've made that I really like is sharing the kernel and userspace filesystem code so directly in the form of libbcachefs. I get the impression this means the kernel can do practically everything userspace can, and vice versa. (I think the only exception is initialising devices by writing a superblock, although the kernel can take over the initialisation of the rest of the filesystem from that point onwards? And maybe turning passphrases into keys for encrypted-fs support which does an scrypt thing?)
As well as giving you really powerful userspace tools for manipulating filesystems, this also suggests that a stripped down busybox module for bcachefs could consist of superblock writing and pretty much nothing else? Maybe a few ioctls to trigger various operations. "Just leave it all to the kernel."
After a bit of peer pressure from a friend, I ended up using btrfs on my laptop about three months ago. It's been fine thus far.
Emphasis on 'thus far'…
To me that kind of experiment is the equivalent of swapping your car's brakes for something 'open source': maybe better, maybe not.
But when you need them you’re going to want to make sure they’re working.
How much of an assurance do you want? I've been using btrfs on my desktop for 3+ years. And I've also experienced no issues thus far.
The btrfs volume on my Synology NAS is 6 years old now, with 24/7 operation (including hosting web sites with constant read/write database activity) and going through several volume resize operations. No issues.
What filesystem would you suggest that has data checksums, efficient snapshots, and doesn't require compiling an out of tree kernel module?
If you artificially tailor your criteria such that the only answer to your question is btrfs, then that is the answer you will get.
There's nothing "artificial" in his requirements, data checksums and efficient snapshots are required for some workloads (for example, we use them for end-to-end testing on copies of the real production database that are created and thrown away in seconds), and building your own kernel modules is a stupid idea in many cases outside of two extremes of the home desktop or a well-funded behemoth like Facebook.
Data checksums in particular are 90% of the reason I want to use a newer filesystem than XFS or ext4. They're useful for almost any usage.
Snapshots a bit less so, but even for my laptop this would be useful, mostly for backups (that is: create snapshot, backup that, and then delete the snapshot – this is how I did things on FreeBSD back in the day). A second use case would be safer system updates and easier rollbacks – not useful that often, but when you need it, it's pretty handy.
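A rough sketch of that snapshot-then-backup routine on btrfs; the paths and the rsync target are placeholders for illustration, not anything the parent comment specified:

    import subprocess

    SRC = "/home"                        # subvolume to back up (placeholder)
    SNAP = "/home/.snapshots/backup"     # temporary snapshot path (placeholder)

    # take a read-only, copy-on-write snapshot; this is nearly instant
    subprocess.run(["btrfs", "subvolume", "snapshot", "-r", SRC, SNAP], check=True)
    try:
        # back up from the frozen snapshot so the source can keep changing underneath
        subprocess.run(["rsync", "-a", f"{SNAP}/", "backup-host:/backups/home/"], check=True)
    finally:
        # drop the snapshot once the backup has been taken
        subprocess.run(["btrfs", "subvolume", "delete", SNAP], check=True)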
ZFS… using a BSD kernel :)
Zfs is in tree if you use a different kernel. :p
No, that is not in-tree. It's in the build. Big difference.
There is an OpenZFS port to Windows, but I'm not sure how to find it from here:
https://github.com/openzfsonwindows/ZFSin
There is also Microsoft's own ReFS:
https://en.m.wikipedia.org/wiki/ReFS
NILFS2 is upstream, stores checksums for all data, and has the most efficient snapshot system bar none. Although I don't think I'd push the recommendation, solely because there are fewer eyes on it.
Heh, NILFS2. The first flash-friendly filesystem in Linux. When flash was still expensive, I used it on an SD card to keep my most important source trees, to speed up builds and things like svn (yes, yes!) diff. I thought it had languished due to better-resourced and more sophisticated efforts like F2FS.
zfs is licensed too freely to be in-tree, but it’s still an excellent choice.
I may be wrong, but I don't think it's just "excessive freeness"; the CDDL also has restrictions the GPL does not have (stuff about patents). It's a mutual incompatibility.
Apache v2 and GPLv3 were made explicitly compatible while providing different kinds of freedom.
The CDDL is more permissive, it's a weak copyleft license while the GPL is strong copyleft, and that makes the two incompatible. Calling it "excessive freeness" is inflammatory, but they're broadly correct.
> The CDDL is more permissive, it's a weak copyleft license while the GPL is strong copyleft, and that makes the two incompatible. Calling it "excessive freeness" is inflammatory, but they're broadly correct.
It's not really. Many aspects of the license are free-er, but that's not what causes the incompatibility. The GPL does not have any kind of clause saying code that is distributed under too permissive of a license may not be incorporated into a derived/combined work. It's not that it's weak copyleft, it is that it contains particular restrictions that makes it incompatible with GPL's restrictions.
BSD licenses do not have that incompatible restriction (= are freer than CDDL, in that aspect) and can be compatible with the GPL.
The basic problem is that both licenses have a "you must redistribute under this license, and you can't impose additional restrictions"-type clause. There are some other differences between the licences, but that's what the incompatibility is about.
Some have argued that this may not actually constitute an incompatibility, but not many are keen to "fuck around and find out" with Oracle's lawyers. So here we are.
> but not many are keen to "fuck around and find out" with Oracle's lawyers
Oracle could spend 10 minutes and clear this up, but the fact they don't should be fear enough about any large company shipping with OpenZFS code.
See Oracle vs Google
I'm not sure if it's that easy, because all the contributors to OpenZFS would also need to sign off on a license change, right?
That said, they did relicense DTrace to GPL a few years back, but I don't really know the details of that, how hard it was, or whether that's also possible for OpenZFS.
This is a complete misunderstanding. People take MIT code and put it in the kernel all the time. The issue is that the CDDL is not GPL-compatible because it has restrictions which the GPL doesn't have.
I think the whole license stuff is controlling software too much. All that legal blah has nothing to do with software and is only good for feeding overpriced lawyers.
When I publish code I don't pick any license. It's just free for anyone to use for whatever. I don't like the GPL for this, it's way too complicated. I don't want to deal with that.
It's a shame some legal BS like that is holding back the deployment of a great filesystem like ZFS.
>When I publish code I don't pick any license. It's just free for anyone to use for whatever.
That's not good in all cases as in some jurisdictions it means that any use is forbidden...
I just don't care about jurisdictions or legal frameworks. I just care about technology.
It's also really only a big thing in the US. Here in Europe things are usually settled amicably and the courts frown on people taking up their time with bullshit. They only really get involved if all else fails and only really for businesses. Nobody really cares about software licensing except the top 200 companies.
I'm admittedly a bit biased though. Unfortunately I sometimes have to deal with our internal legal dept. in the US (I'm based in Europe) and I find them such nasty people to deal with. Always really pushy. This made me hate them and their trade.
In the Linux tree.
In Windows, Satya would need to write Larry a check. It would probably be hefty.
Edit: there was a time that this was planned for MacOS.
https://arstechnica.com/gadgets/2016/06/zfs-the-other-new-ap...
> Edit: there was a time that this was planned for MacOS.
That was a joyous prospect. A single volume manager/filesystem across all UNIX platforms would be wonderful.
We had the UNIX wars of the 1990s. Since Linux won, they have been replaced by the filesystem wars.
It wouldn't have been a candidate for "the standard UNIX filesystem" even if it was in macOS because SUN made an intentionally GPL-incompatible license for it.
For what it's worth, I upvoted you.
I see no reason for the down votes.
Yeah the situation is unfortunate. There's a decent chance I'd be using ZFS if not for the licensing issues, but as a practical matter I'm getting too old to be futzing with kernel modules on my daily driver.
DKMS solved these "licensing issues." Dell is mum on the official motivation, but it provides a licensing demarcation point and a way for kernels to update without breaking modules, so it's easier for companies to develop for Linux.
_Windows Drivers work the same way and nobody huffs and puffs about that_
I'd love to have an intelligent discussion on how one person's opinion on licensing issues stacks up against the legal teams of half the Fortune 50. Licensing doesn't work on "well, I didn't mean it THAT way."
I admit I'm not fully up to date on whether it's actually "license issues" or something else. I'm not a lawyer. As a layman here's what I know. I go to the Arch wiki (https://wiki.archlinux.org/title/ZFS) and I see this warning under the DKMS section (as you advised):
> Warning: Occasionally, the dkms package might not compile against the newest kernel packages in Arch. Using the linux-lts kernel may provide better compatibility with out-of-tree kernel modules, otherwise zfs-dkms-staging-git (AUR) backports compatibility patches and fixes for the latest kernel package in Arch on top of the stable zfs branch
So... my system might fail to boot after updates. If I use linux-lts, it might break less often. Or I can use zfs-dkms-staging-git, and my system might break even less often... or more often, because it looks like that's installing kernel modules directly from the master branch of some repo.
As a practical matter I couldn't care less whether my system fails to boot because of "license issues" or some other reason; I just want the lawyers to sort their shit out so I don't have to risk my system becoming unbootable at some random inopportune time. Until then, I've never hit a btrfs bug, so I'm going to keep on using it for every new build.
I've been bitten by kernel module incompatibility making my data unavailable enough times that I no longer consider ZFS to be viable under Linux. Using an LTS kernel only delays the issue until the next release is LTS. I really hope that bcachefs goes stable soon.
It's. Not. The. Lawyers.
I used ZFS with DKMS on CentOS back in the day, and I found it a pain. It took a long time to compile, and I had some issues with upgrades as well (it's been a few years, so I have forgotten what the exact issues were).
When it comes to filesystems, I very much appreciate the "it just works" experience: not having a working filesystem means not having a working system, and that's a pain to solve.
Again, all of this has been a while. Maybe it's better now and I'm not opposed to trying, but I consider "having to use DKMS" to be a downside of ZFS.
I think ZFS on Debian is pretty safe. Debian is conservative enough that I don’t expect the DKMS build to fail on me.
I would never use a DKMS filesystem for / though.
XFS? Bcachefs? Whatever you like, because those features may not be implemented at the filesystem layer?
xfs doesn't have data checksumming.
> Note: Unlike Btrfs and ZFS, the CRC32 checksum only applies to the metadata and not actual data.
https://wiki.archlinux.org/title/XFS
---
bcachefs isn't stable enough to daily drive.
> Nobody sane uses bcachefs and expects it to be stable
—Linus Torvalds (2024)
https://lore.kernel.org/lkml/CAHk-%3Dwj1Oo9-g-yuwWuHQZU8v%3D...
You can get data checksumming for any filesystem with dm-integrity.
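For reference, a hedged sketch of what that looks like with the integritysetup tool from the cryptsetup project; the device and mapper names are placeholders, and formatting wipes the device:

    import subprocess

    DEV = "/dev/sdX"    # placeholder block device
    NAME = "inty"       # placeholder device-mapper name

    # one-time: lay down the per-sector checksum metadata (destroys existing data)
    subprocess.run(["integritysetup", "format", DEV], check=True)

    # open the device; reads whose checksum does not match come back as I/O
    # errors instead of silently returning corrupted data
    subprocess.run(["integritysetup", "open", DEV, NAME], check=True)

    # any filesystem can sit on top of the integrity mapping
    subprocess.run(["mkfs.xfs", f"/dev/mapper/{NAME}"], check=True)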
It has its own journaling mode just to really keep its integrity. Some things are just better solved at the filesystem level.
Linus is not a filesystem guy
Bcachefs is literally labeled as experimental.
You don't have to be to see bug reports, bug fix patches, or test it yourself, of course. "Filesystem guys" also tend to wear rose colored glasses when it comes to their filesystem, at times (ext2, xfs, btrfs, etc.)
Linus absolutely does jump in on VFS level bugs when necessary (and major respect for that; he doesn't rest on laurels and he's always got his priorities in the right place) - but there's only so much he can keep track of, and the complexity of a modern filesystem tends to dwarf other subsystems.
The people at the top explicitly don't and can't keep track of everything, there's a lot of stuff (e.g. testing) that they leave to other people - and I do fault them for that a bit; we badly need to get a bit more organized on test infrastructure.
And I wouldn't say that filesystem people in general wear rose colored glasses; I would categorize Dave Chinner and Ted Ts'o more in the hard-nosed realistic category, and myself as well. I'd say it's just the btrfs folks who've had that fault in the past, and I think Josef has had more than enough experience at this point to learn that lesson.
> Linus absolutely does jump in on VFS level bugs when necessary (and major respect for that; he doesn't rest on laurels and he's always got his priorities in the right place) - but there's only so much he can keep track of, and the complexity of a modern filesystem tends to dwarf other subsystems.
The point is he doesn't have to understand details of the code to see bug reports and code churn and bug fix commits and have a reasonable idea of whether it's stable enough for end users. I would trust him to make that call more than a "filesystem guy", in fact.
> The people at the top explicitly don't and can't keep track of everything, there's a lot of stuff (e.g. testing) that they leave to other people - and I do fault them for that a bit; we badly need to get a bit more organized on test infrastructure.
Absolutely not on the top nodes. Testing has to be distributed and pushed down to end nodes where development happens or even below otherwise it does not scale.
> And I wouldn't say that filesystem people in general wear rose colored glasses;
Perhaps that's what you see through your rose colored glasses? (sorry, just a cheeky dig).
> I would categorize Dave Chinner and Ted Ts'o more in the hard-nosed realistic category, and myself as well. I'd say it's just the btrfs folks who've had that fault in the past, and I think Josef has had more than enough experience at this point to learn that lesson.
Dave Chinner for example would insist the file-of-zeroes problem of XFS is really not a problem. Not because he was flat wrong or consciously being biased for XFS I'm sure, but because according to the filesystem design and the system call interface and the big customers they talked to at SGI, it was operating completely as per specification.
I'm not singling out filesystem developers or any one person or even software development specifically. All complex projects need advocates and input from outside stakeholders (users, other software, etc) for this exact reason, is that those deep in the guts of it usually don't understand all perspectives.
> The point is he doesn't have to understand details of the code to see bug reports and code churn and bug fix commits and have a reasonable idea of whether it's stable enough for end users. I would trust him to make that call more than a "filesystem guy", in fact.
No, that's not enough, and I would not call that kind of slagging good communication to users.
Seeing bugfixes go by doesn't tell you that much, and it definitely doesn't tell you which filesystem to recommend to users because other filesystems simply may not be fixing critical bugs.
Based on (a great many) user reports that I've seen, I actually have every reason to believe that your data is much safer on bcachefs than btrfs. I'm not shouting about that while I still have hardening to do, and my goal isn't just to beat btrfs, it's to beat ext4 and xfs as well: but given what I see I have to view Linus's communications as irresponsible.
> Absolutely not on the top nodes. Testing has to be distributed and pushed down to end nodes where development happens or even below otherwise it does not scale.
No, our testing situation is crap, and we need leadership that says more than "not my problem".
> Dave Chinner for example would insist the file-of-zeroes problem of XFS is really not a problem. Not because he was flat wrong or consciously being biased for XFS I'm sure, but because according to the filesystem design and the system call interface and the big customers they talked to at SGI, it was operating completely as per specification.
Well, he had a point, and you don't want to be artificially injecting fsyncs because for applications that don't need them that gets really expensive. Fsync is really expensive, and it impacts the whole system.
Now, it turned out there is a clever and more practical solution to this (which I stole from ext4), but you simply cannot expect any one person to know the perfect solution to every problem.
By way of example, I was in an argument with Linus a month or so ago where he was talking about filesystems that "don't need fsck" (which is blatantly impossible), and making "2GB should be enough for anyone" arguments. No one is right all the time, no one has all the answers - but if you go into a conversation assuming the domain experts aren't actually the experts, that's not a recipe for a productive conversation.
> No, that's not enough, and I would not call that kind of slagging good communication to users.
It is enough. Users need to be told when something is not stable or good enough.
> Seeing bugfixes go by doesn't tell you that much, and it definitely doesn't tell you which filesystem to recommend to users because other filesystems simply may not be fixing critical bugs.
Cherry-picking what I wrote. Bugfixes, code churn, and bug reports from users: together they tell someone like Linus a great deal without his ever reading a single line of code.
> Based on (a great many) user reports that I've seen, I actually have every reason to believe that your data is much safer on bcachefs than btrfs. I'm not shouting about that while I still have hardening to do, and my goal isn't just to beat btrfs, it's to beat ext4 and xfs as well: but given what I see I have to view Linus's communications as irresponsible.
Being risk-averse with my data, I think Linus's comment is a helpful and responsible one to balance other opinions.
> No, our testing situation is crap, and we need leadership that says more than "not my problem".
No. Testing is crap because developers and employers don't put enough time into testing. They know what has to be done, leadership has told them what has to be done, common sense says what has to be done. They refuse to do it.
When code gets to a pull request for Linus it should have had enough testing (including integration testing via linux-next) that it is ready to be taken up by early user testers via Linus' tree. Distros and ISVs and IHVs and so on need to be testing there if not linux-next.
> Well, he had a point, and you don't want to be artificially injecting fsyncs because for applications that don't need them that gets really expensive. Fsync is really expensive, and it impacts the whole system.
No, it was never about fsync; it was about making sure data writes that extend a file hit persistent storage before the inode length metadata write does. By a careful reading of POSIX it may be allowed, but as a matter of quality of implementation for actual users (aside from the administrator-intensive high-end file servers and databases etc. from SGI), it is the wrong thing to do. ext3, for example, solved it with "ordered" journal mode, not fsync (a minimal illustration follows below).
You can accept that it is poor quality but decide you will do it anyway; what you can't do is say it's not a problem because you language-lawyered POSIX and found out it's okay, when you have application developers and users complaining about it.
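For anyone unfamiliar with the term, "ordered" here refers to the data= journalling mode of ext3/ext4 (it is the ext4 default); a minimal illustration, with the device and mountpoint made up:

    import subprocess

    DEV, MNT = "/dev/sdX2", "/mnt/data"   # placeholders

    # ext3/ext4 ordered mode: data blocks are flushed to disk before the
    # metadata that references them is committed, so a crash never exposes a
    # freshly extended file whose contents were never written
    subprocess.run(["mount", "-o", "data=ordered", DEV, MNT], check=True)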
> By way of example, I was in an argument with Linus a month or so ago where he was talking about filesystems that "don't need fsck" (which is blatantly impossible), and making "2GB should be enough for anyone" arguments. No one is right all the time, no one has all the answers - but if you go into a conversation assuming the domain experts aren't actually the experts, that's not a recipe for a productive conversation.
I didn't see that so I can't really comment. It does not seem like it provides a counter example to what I wrote. I did not say Linus is never wrong. I have got into many flame wars with him so I would be the last to say he is always right. Domain experts are frequently wrong about their field of expertise too, especially in places where it interacts with things outside their field of expertise.
> I didn't see that so I can't really comment. It does not seem like it provides a counter example to what I wrote. I did not say Linus is never wrong. I have got into many flame wars with him so I would be the last to say he is always right. Domain experts are frequently wrong about their field of expertise too, especially in places where it interacts with things outside their field of expertise.
You came in with an argument to authority, and now you're saying you disagree with that authority yourself, but you trust that authority more than domain experts?
I don't think you've fully thought this through...
Everyone believes what they read in the news, until they see it reporting on something they know about - and then they forget about it a week later and go back to trusting the news.
> You came in with an argument to authority, and now you're saying you disagree with that authority yourself, but you trust that authority more than domain experts?
I wrote what I wrote. I didn't "come in" with the argument to authority though, that was you (or perhaps the OP you replied to first). Anyway, I gave examples where domain experts are myopic or don't actually have the expertise in what other stakeholders (e.g., users) might require.
[flagged]
Honestly, I think btrfs isn't bloated enough for today's VM-enabled world. ext4 and XFS and, hell, exFAT haven't gone anywhere, and if those fulfill your needs, just use those. But if you need the more advanced features that btrfs or ZFS bring, those added features are quite welcome. IMO, btrfs could use the benefits of being a cluster filesystem on top of everything it already does, because having a VM be able to access a disk that is currently mounted by the host or another VM would be useful. Imagine if the disk exported to the VM could be mounted by another VM, either locally or remotely, at the same time. Arguably Ceph fills this need, but having a btrfs-native solution for that would be useful.
Running VMs (and database servers) on btrfs performs really badly, so you have to disable CoW for them (a sketch of the usual workaround follows below).
Otherwise you'll get situations where your 100GB VM image will use over a TB of physical disk space.
It's a shame really that this still isn't solved.
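The workaround referred to above is the NOCOW file attribute; it has to be set before the image files are created, so the usual trick is to flag the directory. A rough sketch (the libvirt image path is just a placeholder), and note that NOCOW files also lose btrfs checksumming and compression:

    import subprocess

    VM_DIR = "/var/lib/libvirt/images"   # placeholder: directory holding VM disk images

    # +C (NOCOW) only takes effect on new, empty files, so set it on the
    # directory; images created inside it afterwards inherit the attribute
    subprocess.run(["chattr", "+C", VM_DIR], check=True)

    # verify: a 'C' should show up in the attribute flags
    print(subprocess.run(["lsattr", "-d", VM_DIR],
                         capture_output=True, text=True).stdout)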
CoW won't necessarily make the VM image bloated. In fact, as I've foolishly found out, BTRFS can be quite useful for deduplicating very similar VMs at the block level (see the sketch below), at the cost of needing to re-allocate new disk space on writes. In my VM archive, six 50 GiB virtual machines took up 52 GiB rather than 300 GiB, and that was quite impressive.
Many downsides to CoW are also present with many common alternatives (i.e. thin LVM2 snapshots). Best to leave all of that off if you're using spinning rust or native compression features, though.
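One commonly used offline deduplicator for btrfs is duperemove (not necessarily what the parent used); a rough sketch, with the archive path made up:

    import subprocess

    ARCHIVE = "/srv/vm-archive"   # placeholder: directory of similar VM images

    # scan recursively, hash extents, and submit duplicate ranges to the
    # kernel's dedupe ioctl so identical blocks end up shared on disk
    subprocess.run(["duperemove", "-dr", ARCHIVE], check=True)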
What’s the underlying issue? I used VMs with ZFS for storage for well over a decade with no issue.
ZFS performs much better than btrfs with the many small writes that VMs produce. Why exactly is a great question. Maybe it has to do with the optimizations around the ZIL, the temporary area where sync writes are accumulated before they are written to the long-term spot.
I don't think thin provisioning btrfs makes a lot of sense. Before disabling CoW I'd rather use a different filesystem.
Are you sure your TRIM is working and the VM disk image is compacting properly? It's working for me but not really great for fragmentation.
Checksum self-healing on ZFS and BTRFS saved my data from janky custom NAS setups more times than I can count. Compression is also nice, but the thing I like most is the possibility of creating many partition-like subvolumes without needing to allocate or manage space.
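That last point is easy to show: subvolumes are created inside an existing btrfs mount and all draw from the same pool of free space. The paths below are placeholders:

    import subprocess

    POOL = "/mnt/pool"   # placeholder: an already-mounted btrfs filesystem

    # each subvolume behaves like its own partition (separate snapshots, mount
    # options, quotas) but none of them needs a fixed size up front
    for name in ("photos", "music", "vm-images"):
        subprocess.run(["btrfs", "subvolume", "create", f"{POOL}/{name}"], check=True)

    subprocess.run(["btrfs", "subvolume", "list", POOL], check=True)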
[flagged]
"Endless errors"? Are you talking about disk errors? Or are you referring to that one time long ago that it had a bug wrt a specific uncommon RAID setup?
APFS and ZFS aren't very interesting to me honestly, because neither are, or can be, in the Linux kernel. I also don't understand why APFS is in the same conversation as ZFS and BTRFS.
Why would someone do that? NTFS is stable, faster than btrfs and has all the same features.
The only reason I can think of is so that they can use the same FS in both Windows and Linux, but with NTFS they already can.
Mind you, with OpenZFS (https://openzfsonwindows.org/) you get Windows (flaky), FreeBSD, NetBSD and Linux, but, as I said, I'm not sure ZFS is super reliable on Windows at this point.
Mind you, I just stick with NTFS: Linux can see it, Windows can see it, and if there are extra features btrfs provides, they're not ones I'm missing.
I'm a die-hard ZFS fan and heavy user since the Solaris days (and counting), but I believe the WinBtrfs project is in better (more usable) shape than the OpenZFS for Windows project.
With NTFS you have to create a separate partition, though. With btrfs you could create a subvolume and just have one big partition for both Linux and Windows.
For fun? To prove that it is possible? As a learning activity?
There are millions of reasons to write software other than "faster" or "more features".
I can imagine this being convenient (albeit risky) when migrating Windows to Linux if you really can't afford a spare disk to backup all your data.
What?! NTFS has no case sensitivity and no compression. And I guess a couple more things I do not want to miss.
> What?! NTFS has no case sensitivity and no compression.
As the sibling comment mentioned, NTFS does have a case-sensitive mode, for instance for the POSIX subsystem (which no longer exists, but it did back when NTFS was new); I think it's also used for WSL1. And NTFS does have per-file compression; I used it myself back in the early 2000s, as it was a good way to free a bit of space on the small disks of the time. There was even a setting you could enable in Windows Explorer that colored compressed files blue in its listing.
NTFS has a per-folder case sensitivity flag. You could set it online at any time prior to Windows 11, but as of 11 you can now only change it on an empty folder (probably due to latent bugs they didn't want to fix).
NTFS has had mediocre compression support from the very start, which could be enabled on a volume or directory basis, and it gained modern LZ-based compression (which can be extended to whatever algorithm you want) in Windows 10; unfortunately that is a per-file process that must be done post-write.
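For completeness, a sketch of both knobs from an elevated Windows prompt; the folder path is a placeholder, and behaviour differs a bit across Windows versions as noted above:

    import subprocess

    FOLDER = r"C:\src\project"   # placeholder directory

    # per-directory case sensitivity (Windows 10 1803+; on Windows 11 the
    # folder should be empty when the flag is changed)
    subprocess.run(["fsutil.exe", "file", "setCaseSensitiveInfo", FOLDER, "enable"],
                   check=True)

    # classic NTFS compression for the folder and everything inside it
    subprocess.run(["compact.exe", "/C", f"/S:{FOLDER}"], check=True)

    # the newer LZX-style compression mentioned above is applied per file,
    # e.g.: compact.exe /C /EXE:LZX <file>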
NTFS does have case sensitivity, just nobody dares to activate it. Compression is a big one, but I thought I'd read that WinBtrfs doesn't have it either.
I activated it back in mid-2010 or so. I had the most amazing Pikachu face when random things stopped working because the system could no longer find the file it wanted to load with an all-lowercase string, even though the project builds it with CapitalCase. Sigh...