Vsync: What will V2.3 include?

Coordinator
Feb 1, 2016 at 3:57 PM
Just a quick glimpse of where Vsync is now heading.

First, after many failed attempts to find a way to more or less auto-translate to Java, I'm tabling that whole idea. C# and Java are just too different. I actually got pretty far with the tool from Tangible, down to maybe 200 hand-edits of things like LINQ expressions and annotations. But the prospect of redoing 200 hand-edits every time we release a major enhancement is unappealing. So we have a different plan for getting there, as explained below.

Right now my main focus is on replacing the implementation of the OOB layer with a much better RDMA solution Jonathan Behrens and Sagar Jha have built, in C++. This has two modules: one called RDMC (Reliable Multicast on RDMA) and the other called SST (Shared State Table), and they are both insanely fast, maybe 10,000x faster than doing the identical things in Vsync. RDMC even leaves the old OOB implementation in the dust, and that's saying something!

My basic plan is simple: I want to offer a new multicast API called g.RDMCSend() and then rewire g.OOBSend to use it, so that we won't duplicate functionality. The existing InfiniBand DLL (ib.dll) will be replaced by rdmc.dll. With RDMC we are seeing that we can deliver to anywhere from one to maybe 32 replicas at most of the full speed of an RDMA or InfiniBand network device, and to 1024 replicas at a large fraction of that same speed, even for gigabyte objects. For small objects you can get totally insane send rates too, and very low latency.

Then I want to also offer g.SST(predicate, upcall, mode), which will register a shared state test (predicate) and when it becomes true, issue the upcall and either do that once, or do it repeatedly (the mode). The table itself would be some struct composed of native data types that fit within a single cache-line (hence int, double, byte, bool but not byte[], string...). The idea is that each group member has one instance of this struct, a "local row" and can update it. Then it can also see the ones from other members. And it can run at speeds of maybe half a million of these events per second.
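To make the g.SST(predicate, upcall, mode) idea concrete, here is a minimal single-process sketch in C++. It is not the real SST API: ToySST, Row, Mode, register_predicate, remote_update and evaluate are all hypothetical names, the 64-byte cache-line size is an assumption, and a plain in-memory copy stands in for the RDMA writes that would propagate rows between members.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical row layout: fixed-size native types only, so no arrays
// or strings. 64 bytes is an assumed cache-line size.
struct Row {
    std::int64_t received = 0;  // e.g. how many messages this member has seen
    double load = 0.0;
    bool ready = false;
};
static_assert(sizeof(Row) <= 64, "a row must fit in a single cache line");

// The "mode" argument: fire the upcall once, or every time the test holds.
enum class Mode { Once, Recurrent };

class ToySST {
public:
    using Predicate = std::function<bool(const std::vector<Row>&)>;
    using Upcall = std::function<void(const std::vector<Row>&)>;

    ToySST(std::size_t members, std::size_t my_rank)
        : rows_(members), my_rank_(my_rank) {
        assert(my_rank < members);
    }

    // Each member may write only its own row.
    Row& local_row() { return rows_[my_rank_]; }

    // Stand-in for another member's update arriving in our copy of the table.
    void remote_update(std::size_t rank, const Row& r) {
        assert(rank < rows_.size());
        rows_[rank] = r;
    }

    void register_predicate(Predicate p, Upcall u, Mode m) {
        entries_.push_back({std::move(p), std::move(u), m, false});
    }

    // One evaluation pass over all registered predicates. Upcalls must not
    // register new predicates from inside this loop.
    void evaluate() {
        for (auto& e : entries_) {
            if (e.done) continue;
            if (e.pred(rows_)) {
                e.upcall(rows_);
                if (e.mode == Mode::Once) e.done = true;
            }
        }
    }

private:
    struct Entry {
        Predicate pred;
        Upcall upcall;
        Mode mode;
        bool done;
    };
    std::vector<Row> rows_;
    std::size_t my_rank_;
    std::vector<Entry> entries_;
};
```

In this sketch a caller would register something like "every row has ready set" with Mode::Once, update local_row(), and call evaluate() in a loop; a real implementation would presumably drive evaluation from a dedicated polling thread rather than from explicit calls.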

These would be hooked into the Vsync group membership layer, so that the SST is automatically the full set of group members, and RDMC automatically sends to the full group membership reliably.

Meanwhile, under the covers, RDMC itself and SST are in C++ 11, and we are extending them to create this thing I've called Derecho, which will be like a subset of Vsync, entirely in C++, and hence useful from Java, C++, or of course from C# too. I'm thinking that Derecho won't need to duplicate what Vsync has, since we can just use it from Vsync (call that the future Vsync V2.4). But it would be a free-standing library you could also use without the Vsync layer. In that future a lot of what Vsync does now can probably be replaced with calls into Derecho, but on the other hand, I'm not sure that it makes sense for Derecho to try to be a full runtime environment like Vsync.

In some sense we think of Vsync as our preferred developer environment for complex cloud applications, and Derecho will eventually be more like a kernel extension doing the basic core functionality. With RDMC and SST now working, we're about 1/3 of the way to having Derecho. But I think this is a feasible target for late summer or early fall.

Will we deliver on all these ambitious plans? No promises: this is research and sometimes setbacks occur because research is inherently risky. For example, for ages now I have honestly been convinced I could translate Vsync into C++ 11 or Java, and I am only just now accepting that this is simply way harder than it seemed. Sigh. But one advantage of the plan for Derecho and Vsync 2.3 and 2.4 is that building up from the basics is always way easier than taking an existing thing and hacking it to work in some foreign setting. And I do have a great group with incredible software skills. So I'm optimistic in a cautious way, but the odds are actually quite good for this plan. In contrast, each time I thought I saw a path forward for turning Vsync into JavaVsync or CPPVsync, I was being naïve in some sense -- I just wasn't understanding how different C# is from Java these days, or how weird Vsync would look if shoehorned into C++ 11, which was what I mostly pulled off last summer (I had that close to working, but gave up not because I doubted it would work, but because it was ugly and horribly complex to use -- a mismatch to the standard C++ 11 threading and callback model). So you can run into these deeper philosophical barriers, and this was where it went wrong, not so much that it was technically infeasible. Same with Java: technically, I am already within 200 edits or so. But philosophically, having to redo those 200 edits again and again for eternity is too high a barrier. That's why I'm drawing the line and saying no.

So in this sense by betting on ground-up approaches, the risk is lower. But not zero.
Coordinator
Apr 21, 2016 at 1:22 PM
Edited Apr 21, 2016 at 1:24 PM
konst3d wrote:
Can we expect the next vsync version this autumn?
Most likely late August. The situation is that we actually have most aspects of Derecho running now (our pure C++ version of virtually synchronous groups). It can move data at 80Gbps on a Mellanox 100Gbps network switch, over both RDMA/IB and RDMA/ROCE, and we can support RDMA/SoftROCE too, so I'm hoping this can run even in a situation where RDMA hardware isn't available. But auto-detection of which option to use will be a small puzzle, and right now SoftROCE requires rebuilding the Linux kernel for some reason (maybe to get 1-copy RDMA emulation from the TCP layer), which wouldn't be a good option for Vsync. So I have to work that out.

The remarkable thing is that this 80Gbps was measured in a group of 4 nodes (i.e. we are making 1 replica at around 97Gbps and can make 3 additional ones with a loss of less than 20% of the speed). We've tested with dozens of replicas and in fact the underlying code can run with hundreds, and even then is running at 55Gbps or so! As if the extra replicas were nearly free...
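The arithmetic behind those figures is easy to check. The helpers below are a small illustrative sketch in C++ (fraction_lost and transfer_seconds are made-up names, and decimal units are assumed, so 1 gigabyte is taken as 8 gigabits); they use only the numbers quoted in this post.

```cpp
#include <cassert>

// Fraction of throughput given up relative to a baseline rate.
double fraction_lost(double baseline_gbps, double measured_gbps) {
    assert(baseline_gbps > 0.0);
    return (baseline_gbps - measured_gbps) / baseline_gbps;
}

// Seconds needed to move an object of the given size at the given rate.
// Assumes decimal units: 1 gigabyte = 8 gigabits.
double transfer_seconds(double gigabytes, double gbps) {
    assert(gbps > 0.0);
    return gigabytes * 8.0 / gbps;
}
```

With the numbers above, fraction_lost(97.0, 80.0) is 17/97, roughly 0.175, which matches the "less than 20%" claim for the 4-node group, and transfer_seconds(1.0, 80.0) works out to 0.1 seconds for a one-gigabyte object. Measured against the same 97Gbps baseline, the 55Gbps figure for hundreds of replicas corresponds to a loss of roughly 43%.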

Derecho itself offers virtual synchrony (in fact it also offers a Paxos mode, using SSD NVRAM for persistence, but that slows it down: the SSD data transfer rate is 8Gbps, so we can only get about 7.5Gbps if we persist the data instead of just streaming it into memory), and is built on something called RDMC that offers a reliable but bare-bones multicast on RDMA, and on a shared state table we build on RDMA that we call SST. Both of those will be available under free BSD licensing too (derecho.github.com, rdmc.codeplex.com, sst.codeplex.com). None is ready for prime time yet, however. We think we'll have all three stable by end of May. And in fact we have something else too, the Freeze Frame File System. Freeze Frame FS is built a bit like Microsoft's Corfu, but with lots of logs instead of just one, and we use it to capture and hold real-time data. We use RDMA here too, for reads and writes, and are hoping to eventually use Derecho to rapidly upload data to groups of readers on Spark or a similar high speed Hadoop/MapReduce platform.

Now, Derecho has virtual synchrony, but just for individual groups: in effect we could offer g.OrderedDerechoSend() and g.OrderedDerechoDeliver(...), but with some quirks. First, it only supports byte array transfers right now, and second, they have to be in a memory region allocated from the Derecho version of malloc. But the concept at least would be that you could call this malloc, obtain a big segment of contiguous memory, capture video or something into it, and then use OrderedDerechoSend to move it and make hundreds of copies, and the data rate would be at these crazy numbers (I won't compare directly with Vsync, but let's say that this is perhaps 1000x or even 25,000x faster... and yet Vsync isn't a particularly slow technology).
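The "buffer must come from a special allocator" quirk can be pictured with a small C++ sketch. Everything here is hypothetical: Region is a toy bump allocator standing in for the Derecho version of malloc, and ordered_send only checks that the byte range lies inside that region; it does not touch a network or reflect the actual Derecho API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Toy stand-in for a pre-registered, contiguous memory region.
class Region {
public:
    explicit Region(std::size_t capacity)
        : base_(new std::uint8_t[capacity]), capacity_(capacity), used_(0) {
        assert(capacity > 0);
    }

    // Bump allocation: hands out the next n bytes, or nullptr when full.
    std::uint8_t* allocate(std::size_t n) {
        if (n > capacity_ - used_) return nullptr;
        std::uint8_t* p = base_.get() + used_;
        used_ += n;
        return p;
    }

    // True only if [p, p + n) lies inside the part of the region that has
    // already been handed out. Addresses are compared as integers so that
    // pointers from unrelated objects can be tested safely.
    bool contains(const std::uint8_t* p, std::size_t n) const {
        const auto begin = reinterpret_cast<std::uintptr_t>(base_.get());
        const auto addr = reinterpret_cast<std::uintptr_t>(p);
        return addr >= begin && n <= used_ && addr - begin <= used_ - n;
    }

private:
    std::unique_ptr<std::uint8_t[]> base_;
    std::size_t capacity_;
    std::size_t used_;
};

// Hypothetical send entry point: accepts only byte ranges from the region.
// A real implementation would hand the range to the RDMA layer at this point.
bool ordered_send(const Region& region, const std::uint8_t* buf, std::size_t n) {
    return region.contains(buf, n);
}
```

In this model you would allocate one large buffer from the region, capture the video frames directly into it, and pass that same pointer to the send call; a buffer from the stack or from ordinary malloc would simply be rejected.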

Then I could also expose RDMC and SST in a similar way, through the group API.

So these are the development steps, and they start with having stable, more or less final releases of RDMC, SST and Derecho for C++ 11. At that point the integration into Vsync would be pretty direct. I would also strip out the existing IB support, and get the OOB layer to run via Derecho too, since there is no reason to have a duplicate layer talking to RDMA. The API for the OOB stuff would remain as is, but the implementation would be on Derecho too.

If we stay on target and RDMC, SST and Derecho are all in solid shape by late May, I might actually have a beta of all of this available by late June. But we would want to give people a bit of time to play with it before making it into our next stable/recommended Vsync release; I kind of like Vsync being rock solid and reliable, and this degree of change to the platform seems a bit much...

If you do the add-on you are proposing for supporting nodes with dual IP addresses and locked-down routing rules, most likely that would be part of this beta too: we might introduce your stuff as a Beta V2.3, and then add in the Derecho, RDMC and SST logic, then redo OOB to run on it too, and then call that another round of beta, and then once it seems solid for long enough, we could shift to it becoming the next recommended release in the early fall, maybe targeting Sept 15 or so.

So something along those lines feels right to me. I do have the time for these steps, and we do have Derecho working in the lab now (but not fully completed), and RDMC and SST seem pretty stable. There are also papers to be published on all of these things, except perhaps not on Vsync -- I've never written a paper on the system, although it might be fun to do that once we add all these extra bells and whistles. Perhaps a target for the late fall...