8TiB HDD, 10^14 bit error rate, approaching certainty of error for each "drive of data" read

Zenaan Harkness zen at freedbms.net
Wed Jun 10 01:50:48 UTC 2015


Seems ZFS' and BTRFS' time has come. ZFS on Linux (ZFSoL) seems more
stable to me, and has 10 years of deployment under its belt too.

Any news on Debian GNU/Linux distributing ZFSoL? We see ZFS on Debian
GNU/kFreeBSD being distributed by Debian...

FYI
Zenaan


---------- Forwarded message ----------
From: Zenaan Harkness
Date: Tue, 26 May 2015 20:31:41 +1000
Subject: Re: Thank Ramen for ddrescue!!!

On 5/25/15, Michael wrote:
> The LVM volumes on the external drives are ok.

Reminds me also that I've been reading heaps about ZFS over the last
couple of days: HDD error rates are close to biting us with current-gen
filesystems (like ext4). Armour-plate your arse with some ZFS (or
possibly the less battle-tested BTRFS) armour.

At a URE (Unrecoverable Read Error) rate of one in 10^14 bits read from
a drive (most consumer drives are 10^14 - one advertises 2^15, and
enterprise drives are usually 2^16), we're talking one bit flip, on
average, per 10^14 bits read, whilst:

8TiB drive =
8 * 1024^4 * 8bits =
70368744177664 bits

So if we read each bit once, say in a mirror recovery / disk rebuild
situation, where the mirror disk has failed and a new one has been
connected and is being refilled with the data of the sole surviving
disk, that works out to (8 * 1024^4 * 8) / 10^14 ≈ 0.70 expected
unrecoverable errors - roughly a 50% chance that the "whole disk read"
(of the "good" disk) will itself hit an unrecoverable bit-flip error.
If you're using RAID hardware, you're now officially rooted - you can't
rebuild your mirror (RAID1) disk array.
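
To make the arithmetic concrete, here's a quick back-of-the-envelope
sketch in Python, assuming independent bit errors at the quoted
1-in-10^14 rate (a simplification - real UREs hit whole sectors and
aren't independent):

    import math

    BITS_PER_TIB = 1024**4 * 8         # bits in one TiB
    drive_bits = 8 * BITS_PER_TIB      # one 8 TiB drive, read end to end
    ure_rate = 1e-14                   # 1 unrecoverable error per 10^14 bits read

    expected_errors = drive_bits * ure_rate
    # Probability of at least one URE over the full read
    # (Poisson approximation to independent per-bit trials).
    p_at_least_one = 1 - math.exp(-expected_errors)

    print(f"expected UREs per full read: {expected_errors:.2f}")   # ~0.70
    print(f"P(at least one URE):         {p_at_least_one:.0%}")    # ~51%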

Now think about a 4-disk (8TiB disks) RAID5 array (one parity disk).
There it's close to a certainty (roughly 2.1 expected UREs across the
three surviving disks, or about an 88% chance of hitting at least one)
that when (not if) one disk fails in that array, you will simply never
recover/rebuild the array, because one of the remaining disks produces
its own error - and at the point the first drive fails, the remaining
drives are quite likely closer to failure anyway...
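
Extending the same (simplified) sketch to the RAID5 case - three
surviving 8TiB disks that all have to be read cleanly to rebuild the
failed one:

    import math

    drive_bits = 8 * 1024**4 * 8       # one 8 TiB drive, in bits
    ure_rate = 1e-14                   # consumer-class URE rate
    surviving_disks = 3                # 4-disk RAID5 after one failure

    total_bits = surviving_disks * drive_bits
    expected_errors = total_bits * ure_rate
    p_rebuild_hits_ure = 1 - math.exp(-expected_errors)

    print(f"expected UREs during rebuild: {expected_errors:.1f}")   # ~2.1
    print(f"P(rebuild hits a URE):        {p_rebuild_hits_ure:.0%}")  # ~88%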

Concerning stuff for data junkies like myself.

Thus RAID6, RAID7, or better yet the ZFS solutions to this problem -
RAIDZ2 and RAIDZ3 - where you have 2 or 3 parity disks respectively,
plus funky ZFS magic built in (disk scrubbing, hot spare disks and
more, all on commodity consumer disks and dumb controllers). -Any- 2
(or 3) disks in your "raid" set can fail and the set can still rebuild
itself; and if it's just sectors failing (random bit flips), ZFS will
automatically detect and repair those sectors, warn you in the logs
that this is happening, and otherwise keep using a drive that's on the
way out until you replace it.

See here to wake us all up:
http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/

http://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/1/

(That second article slags ZFS with (what seems to me to be) a claim
that ZFS COW (copy on write) functionality is per-file, not per-block,
which AIUI is total bollocks - ZFS most certainly is a per-block COW
filesystem, not per-file - but that's just a reflection of the bold
assumptions and lack of fact checking of that article's author;
otherwise I think the article is useful!)

Z

---------- Forwarded message ----------
From: Zenaan Harkness
Date: Tue, 26 May 2015 22:34:50 +1000
Subject: Re: Thank Ramen for ddrescue!!!

> On 26 May 2015 12:31, "Zenaan Harkness" wrote:
>> Reminds me, also that I've been reading heaps about zfs over the last
>> couple days, HDD error rates are close to biting us with current gen
>> filesystems (like ext4). Armour plate your arse with some ZFS- or
>> possibly the less battle tested BTRFS- armour.
>>
>> At one URE (UnRecoverable Errors) rate in 10^14 bits read from a drive
>> (most consumer drives are 10^14 - one advertises 2^15, and enterprise
>> drives are usually 2^16), we're talking 1 bit flip, on average, in
>> 10^14 bits read, whilst:
>>
>
> Base 10 or base 2? It's an order of magnitude of difference here, or one
> thousand more errors, so kinda a big deal...

Base 10. And the difference is much more than an order of magnitude:
2^14 = 16384
10^14 = 100000000000000

Unless I'm not understanding what you're asking...

For current HDDs:
a 10^15 URE rate means an order of magnitude less likely to have a problem;
10^16, one order of magnitude better again.

The problem is, 10^14, with a 10TB drive, is now approaching certainty
- you are all but guaranteed a random unrecoverable read error on that
drive every time you read it - or rather, every time you read a drive's
worth of data off of that drive, which could be "quite a bit worse in
practice" depending on your usage environment for the drive.

I believe the URE rate's been roughly the same since forever - the
only "problem" is that we've gone from 10MB drives, to (very soon)
10TB drives - i.e. 6 orders of magnitude increase in storage capacity,
with no corresponding improvement in the read error rate, or in that
ballpark anyway.

Z

---------- Forwarded message ----------
From: Zenaan Harkness
Date: Wed, 27 May 2015 00:34:44 +1000
Subject: Re: Thank Ramen for ddrescue!!!

> On 05/26/2015 08:45 AM, Zenaan Harkness wrote:
>> ZFS is f*ing awesome! Even for a single drive that's large enough to
>> guarantee errors, ZFS makes the pain go away. I think BTRFS is
>> designed to have similar functionality - but it's got a ways to go yet
>> on various fronts, even though ultimately it may end up a "better"
>> filesystem than ZFS (but who knows).
>>
>> Z
>> I guess that's Z for ZFS then ehj? :)
>
> What about XFS?? It's being recommended on the Proxmox list as requiring
> less memory. I know next to nothing about this. Ric

Yesterday I read that that's a long-standing falsity about ZFS - the
only situation in ZFS where RAM becomes significant (for performance)
is data deduplication - which is different again from COW and its
benefits. See here:
http://en.wikipedia.org/wiki/ZFS#Deduplication

These days an SSD for storing the deduplication tables is an easy way
to handle this situation if memory (and performance) is precious in
your deployment [[and you want to enable deduplication]].
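
As a rough illustration of why dedup is the RAM-hungry bit - the
figures below are ballpark assumptions on my part, not ZFS-documented
constants (roughly 320 bytes per dedup-table entry and the default
128KiB recordsize are the numbers commonly quoted):

    # Ballpark estimate of ZFS dedup table (DDT) memory footprint,
    # under the commonly quoted ~320 bytes per unique block and the
    # default 128 KiB recordsize. Real pools vary a lot.
    pool_tb = 10                        # deduped data, in TB (decimal)
    recordsize = 128 * 1024             # bytes per block (default 128 KiB)
    ddt_entry = 320                     # approx. bytes of DDT per unique block

    unique_blocks = pool_tb * 10**12 / recordsize
    ddt_bytes = unique_blocks * ddt_entry
    print(f"~{ddt_bytes / 2**30:.1f} GiB of DDT for {pool_tb} TB of unique data")
    # roughly 23 GiB, i.e. ~2.3 GiB per TB of unique data

Which is why the usual advice is plenty of RAM, an SSD to hold the
tables, or simply leaving dedup off.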

Either way, it appears just about everything including memory use is
configurable - so it would make sense to get at least a little
familiar with it if you made your root filesystem ZFS.

I can't speak to XFS - it may be better for a single-user workstation
root drive; I don't know, sorry. I do know that for large disks (by
today's standards), ZFS nails the "certainty of bitrot" problem -
which, if one's data or photos or whatever are precious, is probably
significant no matter how small the storage is. With a small dataset
it's easy to duplicate manually, but even then, automatic protection
(e.g. ZFS periodic scrubbing) is less error prone than manual backups,
of course [[when combined with some form of ZFS RAIDZ]].

These pages seemed quite useful yesterday:
http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/
https://calomel.org/zfs_raid_speed_capacity.html
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

Z

---------- Forwarded message ----------
From: Zenaan Harkness
Date: Wed, 27 May 2015 00:46:29 +1000
Subject: Re: Thank Ramen for ddrescue!!!

> On 26 May 2015 14:34, "Zenaan Harkness" <zen at freedbms.net> wrote:
>> > On 26 May 2015 12:31, "Zenaan Harkness" <zen at freedbms.net> wrote:
>> >> Reminds me, also that I've been reading heaps about zfs over the last
>> >> couple days, HDD error rates are close to biting us with current gen
>> >> filesystems (like ext4). Armour plate your arse with some ZFS- or
>> >> possibly the less battle tested BTRFS- armour.
>> >>
>> >> At one URE (UnRecoverable Errors) rate in 10^14 bits read from a drive
>> >> (most consumer drives are 10^14 - one advertises 2^15, and enterprise
>> >> drives are usually 2^16), we're talking 1 bit flip, on average, in
>> >> 10^14 bits read, whilst:
>> >
>> > Base 10 or base 2? It's an order of magnitude of difference here, or
>> > one
>> > thousand more errors, so kinda a big deal...
>>
>> Base 10. And the difference is much more than an order of magnitude:
>> 2^14 = 16384
>> 10^14 = 100000000000000
>>
>> Unless I'm not understanding what you're asking...
>
> You've used both bases in your post, and it's not clear whether you meant
> that or it was a typo.

Indeed - those should all have been base 10 (10^15 and 10^16); typo on
my part. The numbers are staggering. And the fact that we can now buy
consumer 8TB drives, which essentially guarantee the buyer a bit flip
on reading (and/or bit rot as stored) every drive's worth of data, is
really mind-blowing - as is the fact that such error guarantees are not
yet widely discussed or realized. I guess the "average home user" just
dumps photos, music and movies on their drives, and relatively rarely
reads them back off, and so the awareness is just not there.

And up until yesterday I'd been an average home user from a drive URE
rate perspective - all but oblivious. It's sorta been like "oh yeah, I
know they include error rates if you look at the specs, but this is,
like, you know, an engineered product, and products have, you know, at
least one-year warranties, and it's all engineering tolerances and
stuff and those engineers know what they're doing, so I don't have to
worry. Right?" Well, turns out we need to worry, and in fact these bit
flips are now all but a certainty.

There's the odd web page around where a fastidious individual has kept
a record over the years of corrupt files. Those error rates are actual
- neither optimistic nor pessimistic, it seems. Of course they're
averages and they're rates, but from everything I've read in the last
two days, they're relatively accurate engineering guarantees. It used
to be a guarantee that you would get no bit flips, on average, unless
you'd read/written simply enormous amounts. Now that engineering
amount is equal to about one (large) drive of data!

I just keep shaking my head, having never realized the significance of
all this prior to, oh idk, roughly say, yesterday. Might have been
about 11pm. Although it's now tomorrow, so if my engineering
calculations are right, that may have actually been the day before. I
think I need sleep.

:)
Z


