Twirrim 23 hours ago

Backblaze uses erasure encoding, which is currently the best and most efficient way to do storage. It's how every major object storage platform works.

The very quick high level explanation is that in storage you talk about "stretch factor". For every byte of file, how many bytes do you have to store to get the desired durability. If your approach to durability is you make 3 copies, that's a 3x stretch factor. Assuming you're smart, you'll have these spread across different servers, or at least different hard disks, so you'd be able to tolerate the loss of 2 servers.

With erasure encoding you apply a mathematical transformation to the incoming object and shard it up. Out of those shards you need to retrieve a certain number to reproduce the original object. The number of shards you produce and how many you need to recreate the original are configurable. Let's say it shards to 12, and you need 9 to recreate. The amount of storage that takes up is the ratio 12:9, so roughly 1.33x. For every byte that comes in, you need to store just about 1.33 bytes.
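The stretch-factor arithmetic for both schemes, as a quick sanity check (a trivial sketch; the function name is just for illustration):

```python
# Stretch factor = total bytes stored per byte of input.
# For replication that's the copy count; for an "any k of n"
# erasure code it's n / k.
def stretch(n_total, k_needed):
    return n_total / k_needed

print(stretch(3, 1))   # 3 full copies -> 3.0
print(stretch(12, 9))  # 12 shards, any 9 -> ~1.33
```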

As before you'd scatter those 12 shards across different hard disks (or servers), and only needing any 9 means you can tolerate losing 3 of them and still be able to retrieve the original object. That's better durability while taking up roughly 2.25x less storage than 3x replication (3 / 1.33).
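The "any 9 of 12" property can be sketched with polynomial interpolation over a small prime field, in the spirit of Reed-Solomon (real systems like Backblaze's use GF(2^8) and optimized matrix math; this toy version just shows the shape of the idea, and all names here are made up):

```python
# Toy "any k of n" erasure code over GF(257); every byte value 0..255
# fits in the field, so each byte is one field symbol.
P = 257

def _interp(points, x):
    """Evaluate the unique degree < len(points) polynomial through
    `points` [(xi, yi), ...] at x, all arithmetic mod P (Lagrange)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num = den = 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, P - 2, P) is the modular inverse of den (Fermat).
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def encode(data, k, n):
    """Split `data` (ints 0..255, length a multiple of k) into n shards,
    any k of which can rebuild the original.  The first k shards are the
    data itself (systematic); the rest are parity."""
    shards = [[] for _ in range(n)]
    for g in range(0, len(data), k):
        pts = [(i, data[g + i]) for i in range(k)]
        for j in range(n):
            shards[j].append(_interp(pts, j))
    return shards

def decode(available, k):
    """Rebuild the original data from any k surviving shards.
    `available` maps shard index -> shard contents."""
    idxs = sorted(available)[:k]
    data = []
    for pos in range(len(available[idxs[0]])):
        pts = [(j, available[j][pos]) for j in idxs]
        data.extend(_interp(pts, i) for i in range(k))
    return data

data = list(b"nine byte")               # 9 bytes of input
shards = encode(data, 9, 12)            # 12 shards, ~1.33x stretch
lost = {1, 5, 11}                       # lose any 3 shards...
survivors = {j: shards[j] for j in range(12) if j not in lost}
print(decode(survivors, 9) == data)     # ...and still recover: True
```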

The drawback is that to retrieve the object, you have to fetch shards from 9 different locations and apply the transformation to recreate the original object, which adds a small bit of latency, but it's largely negligible these days. The cost of extra servers for your retrieval layer is significantly less than that of storage servers, and you wouldn't need anywhere near as many as you'd otherwise need for storage.

The underlying file system doesn't really have any appreciable impact under those circumstances. I'd argue ZFS is probably even worse, because you're spending more resources on overhead. You want something as fast and lightweight as possible. Your fixity checks will catch any degradation in shards, and recreating shards in the case of failure is pretty cheap.
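A fixity check here can be as simple as storing a digest per shard at write time and periodically re-hashing; any mismatch marks the shard for rebuild from the survivors (a minimal sketch; the names and layout are hypothetical):

```python
import hashlib

# Store a SHA-256 fingerprint alongside each shard at write time.
def fingerprint(shard_bytes):
    return hashlib.sha256(shard_bytes).hexdigest()

stored = {"shard-0": fingerprint(b"original shard contents")}

# Periodic scrub: re-hash what's on disk and compare.
def fixity_ok(name, current_bytes):
    return stored[name] == fingerprint(current_bytes)

print(fixity_ok("shard-0", b"original shard contents"))  # True
print(fixity_ok("shard-0", b"bit-rotted contents"))      # False -> rebuild
```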

srean 19 hours ago

> It's how every major object storage platform works.

Very interesting. Could you name a few? I'm curious. I'd be happy to learn that erasure codes are actually being used commercially.

What I find interesting is the interaction of compression and durability -- if you lose a few compressed bytes to reconstruction error, you lose a little more than a few. Seems right up rate-distortion alley.

Twirrim 17 hours ago

That I know of (and is public, so I'm not breaching any NDA): AWS S3[1], Azure[2], GCP[3], Backblaze[4], Facebook's storage layer[5][6], and Oracle Cloud's Object Storage Platform[7].

The economies of scale mean that you really have to have something like erasure encoding in place to operate at large scale. The biggest single cost for cloud providers is the per-rack operational costs, so keeping the number of racks down is critical.

[1]https://d1.awsstatic.com/events/Summits/reinvent2022/STG203_...

[2]https://www.usenix.org/system/files/conference/atc12/atc12-f...

[3]https://cloud.google.com/storage/docs/availability-durabilit...

[4]https://www.backblaze.com/blog/reed-solomon/

[5]https://www.usenix.org/conference/hotstorage13/workshop-prog...

[6]https://research.facebook.com/publications/a-hitchhikers-gui... they even do some interesting things with erasure encoding and HDFS

[7] https://blogs.oracle.com/cloud-infrastructure/post/first-pri...

pas 19 hours ago

Ceph has a very stable EC feature, and a lot of companies use Ceph as a storage backend. Unfortunately I can't find any straightforward statement about a commercial offering, but I would bet that DreamHost's DreamObjects uses it.

While it's not "commercial", CERN uses it, as do many other institutions.

https://indico.cern.ch/event/941278/contributions/4104604/at... --- 50PB

...

ah, okay, finally an AWS S3 presentation that mentions EC :)

https://d1.awsstatic.com/events/Summits/reinvent2022/STG203_...

cantrecallmypwd 18 hours ago

More or less.

XFS or ext2 (or 3 or 4 w/o journal), without LVM or mdraid.

There's no point to adding RAID at the hardware or OS level for object storage boxes when redundancy exists at the application level. A drive with too many errors will be marked "dead" and just spun down and ignored.

Metadata servers OTOH tend to be engineered to be much more reliable beasts.