Derecho update

Coordinator
May 30, 2016 at 7:18 PM
So for those who are wondering, here's a status report.

First, we have RDMC working very well now and also a simple shared-memory mechanism we call the shared-state table or SST. These are key components of Derecho. You can read about RDMC on http://www.cs.cornell.edu/ken by clicking the relevant links in the first paragraphs of my research activities writeup. You might also find the Freeze Frame File System work interesting.

As promised on the thread this spring, RDMC really does give line-rate replication for up to maybe 8 replicas (so we're seeing 100Gbps on Mellanox's fastest routers), then gradually tails off. We've tested mostly on RDMA over Infiniband, but have started to work on RoCE as well.

Next, we've been working on Derecho itself, and we actually have a basic version of that working too, offering virtual synchrony over RDMC. Basically, we see the identical speeds. We also have a persisted mode that behaves like Paxos. It runs roughly 2x faster than Corfu, which we think is pretty amazing.

All of this code could be downloaded from Derecho.codeplex.com and FFFS.codeplex.com, although I wouldn't recommend doing that yet. Not as stable as you might want.

Right now we just finished getting polymorophic multicast and upcalls working on C++ 17, which involved some really cool twists on variadic templates and compile-time reflection. But it works! So the API of Derecho will be almost identical to the API of Vsync. Pretty much a trivial porting task.

Next step for us is to glue all of this together and port the main Vsync tools: the locking tool, the DHT and aggregation layer, ordered subgroup multicast, and (trivial but important) the startup logic where nodes discover one-another and join automatically.

So I think we're on track for an ~August release of the system in what could be called a complete V1.000 form. You would use it in C++ 17 or from any language that can import a C++ 17 DLL, but the target platforms would be mostly Linux and variations on Linux, like Mesos+Docker (which works on some cloud systems now, notably Azure).

And as we suspected, all of this should be 10,000x faster than Vsync...
Jun 13, 2016 at 12:00 PM
Edited Jun 13, 2016 at 12:01 PM
Good job!

About Derecho - it's unstable for the moment, ok, but does it already have implemented things like ordered queries, p2p queries, unicast mode and searching for oracle?
So could anybody that uses vsync now, switch to Derecho, at least for testing purposes?
And as we suspected, all of this should be 10,000x faster than Vsync...
That's impressive.
So, is it algorithmic optimization or you get such speed boost just by using C-plus-plus instead of C#?
Coordinator
Jun 13, 2016 at 12:42 PM
No, it isn't stable enough for you to work with yet.

As of right now, the version on derecho.codeplex.com only has the basic virtually synchronous multicast.

But here at Cornell we are starting to integrate a layer onto it that has all of those features, pretty much the same API style as for Vsync, so shifting from one to the other would be easy once this is all stabilized and debugged. I'm also moving the locking and DHT services over. This will probably take us some weeks but we do plan to do periodic checkpoints into the codeplex repository as we reach points where the code compiles and doesn't crash instantly.

Interestingly, your list is pretty much in the order we are doing it. So first we need to map from the Derecho API (this mimic of Vsync) to the Derecho reliable multicast (the work that is most solid at this point), then add things like P2P queries and unicast and state transfer, then integrate with a fault-tolerant Oracle. The first steps on this list are underway and the last ones are sort of easier tasks we postponed but will get to later in June or (more likely) in July.

To actually use this you would need access to a platform with RDMA in some form, or SoftRoCE, which emulates RDMA over TCP but will give way worse performance, like maybe 1000x slower. Among cloud providers, Microsoft Azure (the new Mesos container option) would be the most obvious option, and then there are others who also have this, but keep in mind that on Amazon AWS it isn't an option yet. On your cluster in an enterprise setting, if you have a modern switch and NICs you would just need to enable these features. Mellanox is the leading vendor in this space but not at all the only one to support these options.

The performance is 99% coming from the RDMA hardware, which completely avoids any copying from user-space to kernel or any UDP messaging. Data just moves directly from the address space of a sender to the receiver, with zero copying beyond what the DMA hardware does for us. Then we have a way to transform the one-to-one RDMA behavior into multicast using a tree-structured forwarding algorithm Jonathan Behrens developed. That particular aspect is algorithmic, but not our algorithm -- it came from work on parallel computing at Stanford, and he extended and adapted it for an RDMA over Infiniband or RoCE network.

But in fact beyond this, the use of C++ 17 does pay off. Lots of things that C# does at runtime can be done at compile time in C++ 17, and garbage collection is fully under own control. Obviously the user will need to know about some fairly modern features of C++ 17, but we are hoping it will end up being as easy to use as Vsync, and stylistically similar.

Ken