How much effort should be put into "failing well"? I'd rather see the program crash than output a liability. "Fail well" is too broad to be useful in my industry.
It gets tricky in a distributed system, or I suppose in any server process: when the program crashes it just starts up again, and sometimes picks up the same input that caused the crash.
A typical example is receiving an event from a message queue that you can't process. You don't want to crash-loop, so you'd probably throw it onto a dead-letter queue and continue processing. But is your system still correct then? What happens if you later receive another event relating to the same entity, one that depends on the first? And sometimes you can't even tell which entity the malformed or bug-triggering event relates to, and then it's a real problem.
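A minimal sketch of that trade-off, with hypothetical names and in-memory queues standing in for a real broker: the consumer parks poison messages on a dead-letter queue and keeps going, which avoids the crash-loop but silently drops an event that later events may depend on.

```python
from collections import deque


class Consumer:
    """Sketch of dead-letter handling (all names are illustrative)."""

    def __init__(self):
        self.dead_letters = deque()  # poison messages parked here
        self.processed = []          # entity ids handled so far

    def handle(self, event):
        # A malformed event raises instead of taking down the consumer;
        # here we can't even tell which entity it relates to.
        if "entity_id" not in event:
            raise ValueError("no entity_id on event")
        self.processed.append(event["entity_id"])

    def consume(self, queue):
        while queue:
            event = queue.popleft()
            try:
                self.handle(event)
            except Exception:
                # Park the event and continue -- no crash-loop, but any
                # later event depending on this one is now processed
                # against incomplete state.
                self.dead_letters.append(event)


consumer = Consumer()
consumer.consume(deque([{"entity_id": 1}, {"bad": True}, {"entity_id": 2}]))
# consumer.processed is [1, 2]; the malformed event sits in dead_letters
```

The correctness question from above shows up in the last line: entity 2 was processed even though an event before it was dropped, and nothing in this sketch detects that.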
In distributed systems, that means you either want the whole system to crash (consistency), or the crashing node must not be critical to any operation, so it can stay down until you clean out the state that makes it crash (a partition). The trouble is when you fall into neither category: some aspects are concurrent, but some are sequential.