mlhpdx 7 days ago

> 92% of catastrophic failures in tested distributed systems were triggered by incorrect handling of nonfatal errors

This. If you take nothing else away from the article (which has a lot), take this: fail well, don't fail poorly.

senthil_rajasek 7 days ago

It would also be nice to list some "best practices" on how to handle non-fatal errors. I would definitely be interested to know of any sources.

jerf 7 days ago

One of the nice things about "errors as values" is that it is generally easier to shim in an error than to shim in an exception. Not that the latter is impossible, it's just easier when the error can sit as a plain value in your test code.

I have a lot of Go testing shims that look like:

    // ErrUserNotFound and User are minimal stand-ins so the shim compiles
    // on its own; the real code has its own versions of these.
    var ErrUserNotFound = errors.New("user not found")

    type User struct {
        ID   string
        Name string
    }

    type UserGetter interface {
        GetUser(userID string) (User, error)
    }

    // TestUsers is the test shim: set Error to force every call to fail,
    // or populate Users with canned results.
    type TestUsers struct {
        Users map[string]User
        Error error
    }

    func (t TestUsers) GetUser(userID string) (User, error) {
        if t.Error != nil {
            return User{}, t.Error
        }
        user, have := t.Users[userID]
        if !have {
            return User{}, ErrUserNotFound
        }
        return user, nil
    }
This makes it easy to test errors when retrieving users and to ensure the correct thing happens. I'm not a 100% maniacal "get 100% test coverage in everything" kind of guy, but on the flip side, if your test coverage only lights up the "happy path", your testing is not good enough, and as you scale up, the probability that your system will do something very wrong when an error occurs rapidly approaches 1.
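
As a rough sketch of how such a shim gets used (greetUser here is a hypothetical function under test, not anything from the article):

    // greetUser is a hypothetical consumer of UserGetter, standing in for
    // whatever production code you actually want to exercise.
    func greetUser(g UserGetter, id string) (string, error) {
        user, err := g.GetUser(id)
        if err != nil {
            return "", fmt.Errorf("greeting user %q: %w", id, err)
        }
        return "Hello, " + user.Name, nil
    }

    func TestGreetUserPropagatesError(t *testing.T) {
        boom := errors.New("database unavailable")
        _, err := greetUser(TestUsers{Error: boom}, "alice")
        if !errors.Is(err, boom) {
            t.Fatalf("want error wrapping %v, got %v", boom, err)
        }
    }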

It's more complicated when you have something like a byte stream where you want to simulate a failure at arbitrary points in the stream, but similar techniques can get you as close as you like, depending on how close that is.
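
For that case, one way to do it (a sketch, assuming a plain io.Reader pipeline) is a wrapper that injects an error after a chosen number of bytes:

    // failAfter wraps an io.Reader and returns err once limit bytes have
    // been served, simulating a failure partway through a stream.
    type failAfter struct {
        r     io.Reader
        limit int
        count int
        err   error
    }

    func (f *failAfter) Read(p []byte) (int, error) {
        if f.count >= f.limit {
            return 0, f.err
        }
        if remaining := f.limit - f.count; len(p) > remaining {
            p = p[:remaining]
        }
        n, err := f.r.Read(p)
        f.count += n
        return n, err
    }
Varying limit lets you probe a failure at whatever point in the stream you care about.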

From there, in terms of "how do you handle non-fatal errors", there really isn't a snap rule to give. Quite often you just propagate, because there isn't anything else to do. Sometimes you retry some bounded number of times, maybe with backoff. Sometimes you log things and move on. Sometimes you have a fallback you may try. It just depends on your needs. I write a lot of network code, and I find that once my systems mature, rather a lot of the errors in the system actually get some sort of handling beyond "just propagate it up", but it's hard for me to ever guess in advance what that handling will be. It's a lot easier to mentally design all the happy paths than it is to figure out all the ways the perverse external universe will screw them up and how we can try to mitigate the issues.
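
For the bounded-retry-with-backoff case, a minimal sketch (the attempt count and delays are arbitrary here; real code usually wants jitter and context cancellation too):

    // retry calls fn up to attempts times, doubling the sleep between
    // failures, and returns the last error if every attempt fails.
    func retry(attempts int, base time.Duration, fn func() error) error {
        var err error
        for i := 0; i < attempts; i++ {
            if err = fn(); err == nil {
                return nil
            }
            if i < attempts-1 {
                time.Sleep(base << i) // base, 2*base, 4*base, ...
            }
        }
        return err
    }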

skydhash 7 days ago

The same way you handle fatal errors: by specifying the exceptional circumstances and how to handle them (retry, alternative actions, or signaling to another handler up the call/request tree). One component's correct output may not be correct input for ours.

harrall 7 days ago

I think the best practice is to handle them with the same attention as the happy path. Error handling is usually an afterthought, in my experience.

What is the system state when it does error?

What is the best possible recovery from each error state?

What can the user/caller expect for an error?

hiddencost 7 days ago

One I see a lot is not being careful to use the correct error type / status code.

E.g. if you're in Python and raise a ValueError when an API is rate limited, someone downstream from you is going to have a bad time.
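
The same idea in Go, with made-up names: give rate limiting its own sentinel so callers can tell it apart from bad input and back off rather than blow up.

    // ErrRateLimited is a distinct sentinel, analogous to a 429, so callers
    // can distinguish it from a generic bad-input error.
    var ErrRateLimited = errors.New("rate limited")

    func handleErr(err error) {
        switch {
        case errors.Is(err, ErrRateLimited):
            // back off and retry later
        case err != nil:
            // genuine failure: propagate or log it
        }
    }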

Marazan 7 days ago

For me the most catastrophic situations happen when a fatal error is treated as a non-fatal error, and suddenly, instead of the system crashing, it starts propagating nulls everywhere, including into storage.

thfuran 7 days ago

And then someone decides to "fix it" by adding a null check and things really go off the rails.

dnw 7 days ago

This paper from 11 years ago had the exact same finding!! (Finding 10). https://www.usenix.org/system/files/conference/osdi14/osdi14...

ad_hockey 7 days ago

Same paper, they're just referencing it:

> In 2014, Yuan et al. found that 92% of catastrophic failures in tested distributed systems were triggered by incorrect handling of nonfatal errors.

dnw 5 days ago

Oh, thank you for pointing that out. Appreciate it.

bravesoul2 7 days ago

I've experienced this. The Go/Rust/Haskell way of avoiding exceptions in the language is better than the C#/Java/JavaScript way, to the point that I've seen exceptions cause real bugs in production.

The problem in Node is that you can either throw to produce a 4xx or you can return a 4xx, so downstream there are two things to check.

trebligdivad 7 days ago

Yeh, I think a lot of security screwups are also in error-handling paths. This article gave me two terms I'd not heard before: 'happy-case catastrophic failures' is a great one, and lower down there is 'goodput'.

Supermancho 7 days ago

How much effort should be put into "failing well"? I'd rather see the program crash than output a liability. "Fail well" is too broad to be useful in my industry.

ad_hockey 7 days ago

It gets tricky in a distributed system, or I suppose any server process. When the program crashes it just starts up again, and sometimes picks up the same input that caused the crash.

A typical example would be processing an event from a message queue that you can't handle. You don't want to crashloop, so you'd probably have to throw it away onto a dead-letter queue and continue processing. But then, is your system still correct? What happens if you later receive another event relating to the same entity, one that depends on the first event? Or sometimes you can't even tell which entity the malformed or bug-triggering event relates to, and then it's a real problem.
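
A very rough sketch of that pattern (the Message type and queue wiring are made up for illustration; real brokers have their own APIs):

    // Message is a stand-in for whatever your queue delivers.
    type Message struct {
        Entity  string
        Payload []byte
    }

    // consume keeps the consumer making progress: messages that can't be
    // handled are parked on deadLetter instead of crashlooping on them.
    func consume(in <-chan Message, deadLetter chan<- Message, handle func(Message) error) {
        for msg := range in {
            if err := handle(msg); err != nil {
                // Inspect or replay later; ordering for this entity is now suspect.
                deadLetter <- msg
            }
        }
    }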

skydhash 7 days ago

In distributed systems, that means you either want the whole system to crash (consistency), or the node that crashes is not critical to any operation and can stay out until you clean up the state that made it crash (a partition). The issue is when you fall into neither of those two categories, meaning some aspects are concurrent but some are sequential.