Sept 2016 Derecho update

Sep 23, 2016 at 1:19 PM
Edited Sep 23, 2016 at 7:12 PM
Well, we have nice progress to report! Derecho is really starting to work well now! It is a little embarrassing how well it works, actually (I think I mentioned that it is at least 15,000x faster than Vsync, and frankly, if that doesn't sound embarrassing, you aren't a "real" systems person! People like me do not enjoy writing code that turns out to be too slow by four orders of magnitude). But on the positive side, since my team did Derecho, at least I ate my own lunch.

You can read about the new system by following the links on my home page at Cornell (I prefer not to have too many different copies of the papers floating about, so this way all readers see the ones posted on my Cornell web page). So that would be here.

Quick summary: First, Derecho is coded in C++ 17, which means you compile with Clang, because most other C++ tool-chains lack constexpr and other features we are using. The code is very, very efficient: it spawns very few threads (Vsync spawns threads at a crazy pace), it copies data at most once and often not at all (Vsync copies data a lot), and when you do a multicast it uses RDMA hardware (Infiniband or RoCE, pronounced "Rocky"), so the data goes straight from the sender's memory into memory at the receiver.

We can persist a multicast to SSD or other persistent memory solutions, so g.OrderedSend ends up having a Paxos mode (like g.SafeSend, but without needing a different name for the API). In practice you specify the desired level of quality when you join the group, which in Derecho looks like the Vsync join but with an extra argument: g.Join<Paxos>(args), or g.Join<AtomicMulticast>(args), or even g.Join<Raw>(args).

Then (very cool, seriously), we found a way to offer subgroups of a group that are automatically generated for you from a pattern supplied as a function. So for example we can have a group of 1,025 members (Vsync would have trouble at that scale, but we expect this to be a common scale of use for Derecho), and it could have a subgroup of 3 members doing load-balancing over a cache of 1,000 members doing read-only work sharing, plus a back-end set of 22 more members keeping persistent state in SSD files.

That cache is probably sharded and we have a way to automate the shard-generation, too: you can tell us to make 500 shards (regular subgroups) of size 2 each, for example. State transfer and multicast all work in the subgroups and shards (each has its own associated class and the class defines the state we should marshal, much like in Vsync). And all of this runs at these crazy data rates that RDMA enables.

So, very cool progress.

Next question: when can you use it? I had hoped we could have a full release by now, but as usual, everything takes longer than it should. So a more realistic target looks like early winter. Our code is already available online, so everything is accessible as we stabilize it, but some of the new features aren't at all stable yet. At first the full feature set will work only on Linux, and only with Mellanox RDMA.

Then we also want this to work both on Windows Azure and in Linux container environments like Docker, as well as in pure Linux settings. Hopefully those are small steps. We want to support other RDMA vendors, not just Mellanox. And we also want to integrate with various forms of persistent memory, not just flash. Last, via SoftRoCE we hope to be able to run on pure TCP networks, for settings where RDMA itself isn't a hardware option.

So, lots of progress, lots more work to be done!