sebstefan 7 days ago

>Deterministic simulation. Another lightweight method widely used at AWS is deterministic simulation testing, in which a distributed system is executed on a single-threaded simulator with control over all sources of randomness, such as thread scheduling, timing, and message delivery order. Tests are then written for particular failure or success scenarios, such as the failure of a participant at a particular stage in a distributed protocol. The nondeterminism in the system is controlled by the test framework, allowing developers to specify orderings they believe are interesting (such as ones that have caused bugs in the past). The scheduler in the testing framework can also be extended for fuzzing of orderings or exploring all possible orderings to be tested.

This is fucking amazing

12
jeffreygoesto 7 days ago

Loom for Rust does this. We also adapted it for some C++ with good success (found actual bugs that slipped tests and reviewes). Sits lower than i.e. TLA+ and is not a proof but super useful as you check the actual implementation.

https://github.com/tokio-rs/loom

wwilson 7 days ago

DST was i̶n̶v̶e̶n̶t̶e̶d̶ popularized at FoundationDB a little over a decade ago, and has been quietly gathering steam ever since. If you’re interested in the technique, the FDB paper has some good info in section 4 and section 6.2: https://www.foundationdb.org/files/fdb-paper.pdf

(Disclosure: I am an author.)

I also gave a talk overview of it at Strange Loop back in 2015, but don’t have the youtube link handy.

If you’re interested in trying DST, we have a new company that aims to make it a much easier lift, especially for existing projects that can’t be retrofitted onto one of the many simulation frameworks: https://antithesis.com

Happy to answer any questions about the approach, here or over email.

mjb 7 days ago

Hey Will. I'm a huge fan of the work you all are doing, and of FoundationDB, but I don't believe it's accurate that DST was invented at FoundationDB (or, maybe it was, but was also used in other places around the same time or before).

For example, the first implementations of AWS's internal lock service (Alf) used DST as a key part of the testing strategy, sometime around 2009. Al Vermeulen was influential in introducing it at AWS, and I believe it built on some things he'd worked on before.

Still, Anithesis is super cool, and I really admire how you all are changing the conversation around systems correctness. So this is a minor point.

we6251 7 days ago

Also a huge proponent of Antithesis and their current work, but there definitely were some notable precedents at or around that time e.g. MODIST from 2009 (https://www.usenix.org/legacy/event/nsdi09/tech/full_papers/...), which similarly tried to build a "model checker for distributed systems".

As another interesting historical side note, I have wondered about the similarities between Antithesis and "Corensic", a startup spun out of UW around a similar time period (circa 2009). Apparently, they "built a hypervisor that could on-demand turn a guest operating system into deterministic mode" (see "Deterministic Multiprocessing" at https://homes.cs.washington.edu/~oskin/). My impression is that their product was not a significant commercial success, and the company was acquired by F5 Networks in 2012 (https://comotion.uw.edu/startups/corensic/).

Overall, I don't over-index on novelty, and think it is generally good for ideas to be recycled/revived/re-explored, with updated, modern perspectives. I believe that most rigorous systems designers/engineers likely converge to similar ideas (model checking, value of determinism, etc.) after dealing with these types of complex systems for long enough. But, it is nevertheless interesting to trace the historical developments.

wwilson 7 days ago

Corensic was impressive tech. I actually debriefed with one of their founders years ago. IIRC, their product was focused on finding single-process concurrency bugs.

Deterministic hypervisors are by no means new. Somebody once told me that VMWare used to support a deterministic emulation mode (mostly used for internal debugging). Apparently they lost the capability some time ago.

wwilson 7 days ago

Hi Marc, thank you for the correction! We started doing it around 2010, and were not aware of any prior art. But I am not surprised to hear that others had the idea before us. I will give Al credit in the future.

joshstrange 7 days ago
agentultra 7 days ago

The TigerBeetle team does this too and it's an interesting approach. One of their developers did a talk on it at HYTRADBOI this year [0].

[0] https://www.hytradboi.com/2025/c222d11a-6f4d-4211-a243-f5b7f...

slt2021 7 days ago

they also made a browser game to showcase their DST, where you can inject faults and see how tigerbeetle recovers

https://sim.tigerbeetle.com/

mrngm 5 days ago

Here's their blog post from 2023 explaining what they did: https://tigerbeetle.com/blog/2023-07-11-we-put-a-distributed...

slayerjain 7 days ago

similar to the integration testing component in keploy.io, which includes time freezing.

basically certain system clocks are rolled back and incremented in a deterministic manner to emulate real issues recorded from production bugs. Had lot of fun working with building this and the effect of messing with different system clocks by intercepting system calls.

https://keploy.io/docs/keploy-cloud/time-freezing/

supriyo-biswas 7 days ago

Also somewhat similar to what Antithesis is doing.

keploy 7 days ago

Keploy is doing the same thing, in same space - https://github.com/keploy/keploy

jen20 7 days ago

Which is largely an extended version of what FoundationDB does (from many of the same people).

algorithmsRcool 7 days ago

Microsoft has a flavor of this for .NET called Coyote[0]

[0] https://microsoft.github.io/coyote/#overview/how/

aoli-al 7 days ago

https://github.com/cmu-pasta/fray

Is a concurrency testing framework for Java. It also does deterministic simulation.

whateveracct 7 days ago
jiggawatts 7 days ago

It ought to be the norm, and it is (slowly) getting there. For example, the .NET 8 standard library now has a TimeProvider type to allow test frameworks to provide substitute implementations of the system clock.

lucianbr 7 days ago

Guessing an ordering that will trigger a bug and exploring all orderings should both be basically impossible for non-trivial scenarios, no?

teeray 7 days ago

Perhaps, but there is still some value in "capturing the bug with a failing test." If some ordering presents a defect, you can recreate it and prove that you've fixed the bug.

spenczar5 7 days ago

I want something like this, but I work almost entirely in Go.

Am I correct in thinking that this would require a fork of the main Go implementation, in order to make a deterministic (and controllable) goroutine scheduler and network stack?

Does one also need to run on a particularly predictable OS?

ideal_gas 7 days ago

For in-process go scheduling, some progress has been made here; see: https://go.dev/blog/synctest

But `synctest.Wait` won't work for non-durably blocked goroutines (such as in the middle of a TCP syscall) so requires memory-based implementations of e.g. net.Conn (I've plugged in https://pkg.go.dev/google.golang.org/grpc/test/bufconn with good success)

jerf 7 days ago

That's not enough for proof purposes. It allows you to build a test that deterministically tests one path, but it does not give you control over all possible tests.

In fact I seem to be a bit iconoclastic on this matter but I'm not even a fan for testing purposes. Even non-proof-based testing needs the ability to test that goroutines may execute out of order. Nothing prevents a goroutine scheduled for 2ms from now to run all the way to completion before a goroutine scheduled to run 1ms from now even starts, but AFAIK this approach doesn't let you test that case. Such logic is not valid in a multithreaded system; it is at most the most likely progression, not the only possible progression.

But since we live in the world where the question is more will anyone write a concurrency test at all, moaning about not having a perfect one is missing the point, I suppose. I won't deny a deterministic test is better than a non-deterministic test in general.

geminiiii9 6 days ago

There may be more there than meets the eye at first glance.

I'm not saying this easy to do at the moment, but the possibility is there.

If you can take control of all timers/sleep in your Go program, then you have, in effect, complete control of Goroutine scheduling.

If you assign a distinct time point to each goroutine, say based on its ID, and then have the goroutine sleep until that point, then you have also assigned the order in which those goroutines will run. Each will "wake up" alone, only at the next point at which the clock is forced to advance--which is when the previous goroutine blocked.

jerf 7 days ago

"Am I correct in thinking that this would require a fork of the main Go implementation, in order to make a deterministic (and controllable) goroutine scheduler and network stack?"

Yes, because the Go runtime actually explicitly and deliberately leans in the opposite direction. Additional non-determinism is deliberately added to the system, most notably that a "select" call with multiple valid channels will be psuedorandomly chosen, and iteration on a map is somewhat scrambled on each iteration. It's not entirely random and as such not suitable for pretty much any randomness requirement you may have, but enough to generally prevent accidental dependencies on deterministic iteration.

This has the effect of "spreading out" concurrency issues and hopefully reducing the sort of issues you get when you only discover a major issue when something else in the environment changes three years later. It's a nice default but nothing like a proof, of course.

You can also go the "predictable OS" route, since the entire Go runtime sits on top of it. Nothing can non-deterministically run on top of a fully deterministic OS image. But a fully deterministic OS would probably want some integration with the Go runtime to be able to more deeply control it through mechanisms other than varying inputs to the random function, which is a very indirect way to go down a code path.

slayerjain 7 days ago

I worked on adding golang time freezing in keploy by using the approach from the go playground which used a fixed time (2009-11-10 23:00:00 UTC) but also increments the system clock used by goroutines.

carbondating 7 days ago

This is what hermit does, although it's no longer actively developed.

https://github.com/facebookexperimental/hermit