Testing your understanding of the g.OrderedSend+g.Flush approach versus g.SafeSend (easy)
This is offered as a much easier small project. As we saw in the videos (you must watch the videos on ordering and the other properties), Vsync is unusual in offering a range of ways to implement a given kind of functionality. In the case of data replicated within a group, the system has a fast option (g.OrderedSend followed by g.Flush) for replication among processes hosted on "soft state" machines, and a slower one (g.SafeSend) for processes hosted on machines with "durable" storage.
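For concreteness, here is a minimal sketch of the two call patterns in C#. The group name, message code, payload, and handler body are placeholders, not part of any real application; the surrounding setup calls follow the patterns in the Vsync manual.

```csharp
using System;
using Vsync;

class ReplicationStyles
{
    const int UPDATE = 0;  // illustrative message code

    static void Main()
    {
        VsyncSystem.Start();
        Group g = new Group("demo");  // hypothetical group name
        g.Handlers[UPDATE] += (Action<string>)(s =>
        {
            // apply the update to the local replica here
        });
        g.Join();

        // Fast path for soft-state replicas: amortized ordered delivery,
        // then a flush before externalizing the result (e.g. replying to a client).
        g.OrderedSend(UPDATE, "some update");
        g.Flush();

        // Slower, durable path: SafeSend pays the stability cost on every message.
        g.SafeSend(UPDATE, "some update");

        VsyncSystem.Shutdown();
    }
}
```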

Build a testing program aimed at carefully measuring the performance of a server that uses g.OrderedSend+g.Flush in the manner explained in the video, versus one that uses g.SafeSend. (Note that you can actually see what we got in a similar experiment at http://www.cs.cornell.edu/projects/quicksilver/public_pdfs/2012%20Finding%20D%20in%20CAPS.pdf, but our experiment used g.Send, which can be faster than g.OrderedSend. So you'll be repeating an experiment we did in the past, but without knowing precisely what the results will be.)
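A timing harness along these lines would work; this is a sketch continuing the setup above (it assumes g and UPDATE are already defined and the group has joined), and nTrials and the payload size are illustrative. Latencies are recorded per operation so that error bars can be computed afterward.

```csharp
using System.Diagnostics;

int nTrials = 10100;                        // first ~100 samples discarded later
string payload = new string('x', 100);      // vary: 10, 100, 1000 characters, ...
double[] flushLat = new double[nTrials];
double[] safeLat = new double[nTrials];
var sw = new Stopwatch();

for (int i = 0; i < nTrials; i++)
{
    sw.Restart();
    g.OrderedSend(UPDATE, payload);
    g.Flush();                              // the stability cost shows up here
    flushLat[i] = sw.Elapsed.TotalMilliseconds;
}

for (int i = 0; i < nTrials; i++)
{
    sw.Restart();
    g.SafeSend(UPDATE, payload);            // stability cost is paid per message
    safeLat[i] = sw.Elapsed.TotalMilliseconds;
}
```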

In your experiment, be sure to do 10,100 or so operations and to exclude the first 100 or so (leaving roughly 10,000 measured operations), because startup costs can produce unusual performance outliers. Make sure to show error bars. You may also want to vary the size of the data objects: look at 10 bytes, 100 bytes, 1,000 bytes, etc.
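A minimal sketch of that post-processing, using plain .NET; the warmup cutoff and the choice of standard deviation for the error bars are assumptions you can adjust (a 95% confidence interval is a reasonable alternative).

```csharp
using System;
using System.Linq;

static class Stats
{
    // Summarize latencies (ms), discarding the first `warmup` samples.
    public static void Report(string label, double[] latencies, int warmup = 100)
    {
        double[] measured = latencies.Skip(warmup).ToArray();
        double mean = measured.Average();
        double variance = measured.Sum(x => (x - mean) * (x - mean))
                          / (measured.Length - 1);   // sample variance
        double stddev = Math.Sqrt(variance);
        // Error bars: +/- stddev, or +/- 1.96 * stddev / sqrt(n) for a 95% CI.
        Console.WriteLine($"{label}: mean {mean:F3} ms, stddev {stddev:F3} ms, n={measured.Length}");
    }
}
```

For example, Stats.Report("OrderedSend+Flush", flushLat) and Stats.Report("SafeSend", safeLat) would summarize the two arrays collected above.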

Does the cost of g.Flush vary depending on how many senders are active in the group? What did this experiment teach you about scalability and other costs in Vsync? Is it best to call g.Flush with no arguments, or to specify a value such as g.Flush(2) or g.Flush(3)?

Draw a timeline picture, cartoon style with panels, showing exactly how a system that uses g.OrderedSend+g.Flush(2) could behave differently from g.SafeSend. If you knew that machines fail once every 10 days for an average period of 5 minutes (i.e. they are back on their feet after 5 minutes), and that the failures are independent, calculate, as a concrete number, the probability of ever observing that specific sequence of events. Now suppose that failures can be correlated, but with some other probability (for example, with probability 1/1000, when machine A fails, machine B fails within 100ms). How would correlated failures change the likelihood of seeing the effect you identified?
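As a rough starting point for that calculation (a sketch only: it assumes the bad scenario requires two specific machines to be down at the same moment, which may not match your timeline exactly), the steady-state arithmetic looks like this:

```latex
% Steady-state fraction of time one machine is down
% (a 5-minute outage once every 10 days):
\[
  p_{\text{down}} = \frac{5\ \text{min}}{10 \times 24 \times 60\ \text{min}}
                  = \frac{5}{14400} \approx 3.5 \times 10^{-4}
\]
% If the bad sequence needs two specific machines down at once and
% failures are independent:
\[
  p_{\text{both}} \approx p_{\text{down}}^{\,2} \approx 1.2 \times 10^{-7}
\]
% Under the correlated model (B fails within 100ms of A with
% probability 1/1000), the joint figure becomes roughly
\[
  p_{\text{both}} \approx p_{\text{down}} \times 10^{-3} \approx 3.5 \times 10^{-7}
\]
% i.e. correlated failures make the event several times more likely.
```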
