Applying Vsync in an Air Traffic Control scenario (hard)
Air Traffic Control (ATC) systems are good examples of high-assurance systems that can benefit from a solution like Vsync, but that can't simply be implemented directly on Vsync without a lot of thought. We'll describe an approach that was first used by IBM in a system they built for the US FAA in 1995. (Their effort wasn't completely successful, but they didn't know about Vsync!)

Start by writing down, on paper, the requirements of your system. A fairly typical ATC system might support up to 100 simultaneously active flights (if handling a small area) or perhaps 10,000 (if handling a very large and busy area). Each flight has a flight plan, and at all times, each flight plan must be assigned to a single controller (the "lead controller" for that flight), who will need to be notified when handoffs occur. Only the lead controller can update the flight plan (e.g. by telling the pilot to climb 1500 feet, or to turn left onto some new heading). Other kinds of updates, like requests from the pilot to the group, normally flow "through" the lead controller's workstation. Radar tracks are treated separately, by a "lead radar track announcing system", as are radar images.
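
One way to pin down the single-lead-controller requirement before designing anything else is to write it as a tiny, checkable model. The sketch below is only that: a single-process illustration in Python, not a distributed implementation and not anything from Vsync. The class and method names (FlightOwnership, handoff, check_update) are hypothetical, and the notifications list is a stand-in for whatever mechanism your design uses to notify controllers.

```python
class OwnershipError(Exception):
    """Raised when an action would violate the single-lead-controller rule."""


class FlightOwnership:
    """Hypothetical in-memory model: every known flight has exactly one lead."""

    def __init__(self):
        self._lead = {}          # flight_id -> controller_id
        self.notifications = []  # (controller_id, message); stand-in for real alerts

    def register(self, flight_id, controller_id):
        # A new flight gets its lead at creation time, so it is never unowned.
        if flight_id in self._lead:
            raise OwnershipError(f"{flight_id} already has a lead controller")
        self._lead[flight_id] = controller_id

    def lead_of(self, flight_id):
        return self._lead[flight_id]

    def check_update(self, flight_id, controller_id):
        # Only the current lead may change the flight plan.
        if self._lead.get(flight_id) != controller_id:
            raise OwnershipError(f"{controller_id} is not the lead for {flight_id}")

    def handoff(self, flight_id, from_controller, to_controller):
        # Only the current lead can hand off; the swap is one assignment, so the
        # flight goes straight from one lead to the next: never none, never two.
        self.check_update(flight_id, from_controller)
        self._lead[flight_id] = to_controller
        self.notifications.append((from_controller, f"released {flight_id}"))
        self.notifications.append((to_controller, f"now lead for {flight_id}"))
```

In a real system the hard part is making every workstation agree on this table despite failures; the model just states what they must agree on.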

Most ATC systems have a core loop: the lead controller updates the flight plan, then hands it to a fault-tolerant (i.e. replicated) data sharing layer that must securely log each new flight plan change. Then the new record is reliably shared via multicast with all the controller workstations (or perhaps with a subset that subscribe to this particular flight plan, if your system is very large scale). The resulting data distribution system, or DDS, has a mix of consistency, reliability, durability and performance requirements. Try to write these out, using a simple design for your DDS -- perhaps it should just be a small process group with a primary and a backup member, or perhaps it should share load more evenly.

One question to think about is this: suppose controllers A and B are both looking at flight plans X and Y. Would it be important to ensure that updates to X and Y reach A and B in the identical order? What guarantees do you think are needed here? Make sure your DDS design will have the needed properties, and be ready to defend your decisions when the NASA/FAA team comes to review your architecture and asks hard questions about the safety and cost (performance, scalability) implications of every single decision you made!
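
To make the ordering question concrete, here is a minimal sketch of one classic way to get an identical delivery order at every receiver: a single sequencer stamps each update with a global sequence number, and each receiver holds back anything that arrives early. This is an illustration of the idea only, written in Python; it does not use the Vsync API, and the names Sequencer and Receiver are hypothetical. It also ignores failures entirely, which is exactly where the real design work lies.

```python
import heapq


class Sequencer:
    """Assigns one global sequence number to every update, across all flights."""

    def __init__(self):
        self._next = 0

    def stamp(self, flight_id, change):
        seq = self._next
        self._next += 1
        return (seq, flight_id, change)


class Receiver:
    """Delivers updates strictly in sequence order, buffering early arrivals."""

    def __init__(self):
        self._expected = 0
        self._pending = []   # min-heap ordered by sequence number
        self.delivered = []  # (flight_id, change) in delivery order

    def on_arrival(self, msg):
        heapq.heappush(self._pending, msg)
        # Deliver as long as the next expected sequence number is available.
        while self._pending and self._pending[0][0] == self._expected:
            _, flight_id, change = heapq.heappop(self._pending)
            self.delivered.append((flight_id, change))
            self._expected += 1
```

If the network hands A the updates in order 0, 1, 2 and hands B the same updates in order 2, 0, 1, both end up with the same delivered list. Whether you actually need this for updates to different flight plans X and Y, or only per-flight ordering, is the question you should be ready to defend.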

At this point you'll have at least a rough plan of how to build the system. Now you can design (still on paper) the GUI an ATC controller might see. Probably it should show a radar background image, flight tracks labeled with flight data, and then have some form of table of active flights (and a search tool for finding other flights). Flight tracks evolve over time, so obviously these tracks are 4-D objects: a kind of region in space and time where the plane is expected to be. You can design various ATC controller tools to help the controller plan: will two planes come within 5 miles of one another at some point? If so, the controller needs to plan a course change to keep them separated. And so forth. One can imagine quite a few such tools, all on a pull-down or push-button menu.
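
As an example of such a planning tool, the separation question can be answered with a closest-point-of-approach calculation. The sketch below, in Python, makes strong simplifying assumptions that you would need to revisit: both planes fly in straight lines at constant speed, positions are flat 2-D coordinates in miles, velocities are in miles per minute, and altitude is ignored. The function names and the look-ahead horizon parameter are hypothetical.

```python
import math


def closest_approach(p1, v1, p2, v2, horizon):
    """Return (time, distance) of closest approach within [0, horizon] minutes.

    p1, p2 are (x, y) positions in miles; v1, v2 are (vx, vy) in miles/minute.
    Assumes straight-line, constant-velocity flight and ignores altitude.
    """
    rx, ry = p2[0] - p1[0], p2[1] - p1[1]   # relative position
    vx, vy = v2[0] - v1[0], v2[1] - v1[1]   # relative velocity
    speed_sq = vx * vx + vy * vy
    if speed_sq == 0.0:
        t = 0.0                              # same velocity: the gap never changes
    else:
        # Minimize |r + v*t|^2 over t, then clamp to the look-ahead window.
        t = -(rx * vx + ry * vy) / speed_sq
        t = max(0.0, min(horizon, t))
    return t, math.hypot(rx + vx * t, ry + vy * t)


def in_conflict(p1, v1, p2, v2, horizon, min_separation=5.0):
    """True if the planes come within min_separation miles during the window."""
    _, distance = closest_approach(p1, v1, p2, v2, horizon)
    return distance < min_separation
```

For instance, a plane at (0, 0) flying east at 8 miles per minute and another at (40, 3) flying west at the same speed pass 3 miles apart after 2.5 minutes, so the tool would flag them.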

Consoles can crash and controllers need bio-breaks. Could your design allow controllers to work in teams of 2-5, each side by side with shared and basically identical screen data, but with one person planning ahead, one doing "right now" control, one entering data for new flights, etc? What sort of process group structures might be useful to support this model? Controllers can't tolerate more than about 15s of "outage" at the very most. Can Vsync be configured to sense failures and reconfigure fast enough? (Hint: Yes it can. But how? We'll leave that to you.) How should a controller-station failure be "shown" to the controller team? Remember the rules: at all times, each flight must have a single lead controller -- always one, never none, never two. So we now understand this to also mean "and if a controller system crashes, handoff must happen within a few seconds, 15s at the absolute maximum."
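
To reason about the 15s budget, it can help to sketch the two steps involved: deciding that a workstation has failed, and moving its flights to a surviving teammate. The Python sketch below uses a plain heartbeat timeout and a round-robin reassignment. It deliberately says nothing about how Vsync itself detects failures or how to configure it; every name here (detect_failed, reassign_leads, the timeout value) is hypothetical, and it runs in one process, whereas in your system every member must reach the same decision.

```python
def detect_failed(last_heartbeat, now, timeout):
    """Return the controllers whose last heartbeat is older than `timeout` seconds.

    last_heartbeat maps controller_id -> time (seconds) of the latest heartbeat.
    """
    return {c for c, seen in last_heartbeat.items() if now - seen > timeout}


def reassign_leads(lead, team, failed):
    """Move flights led by failed controllers to surviving team members.

    lead maps flight_id -> controller_id. Returns (new_lead, moved_flights).
    Flights are handled in sorted order so the outcome is deterministic, which
    matters if several members compute the reassignment independently.
    """
    survivors = [c for c in team if c not in failed]
    if not survivors:
        raise RuntimeError("no surviving controller to take over")
    new_lead = dict(lead)
    orphaned = sorted(f for f, c in lead.items() if c in failed)
    for i, flight in enumerate(orphaned):
        new_lead[flight] = survivors[i % len(survivors)]
    return new_lead, orphaned
```

The detection timeout plus the time to agree on the new assignment plus the time to notify the team must all fit inside the outage budget, so a timeout of a few seconds leaves room for the rest.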

The DDS server architecture is next on your agenda. It will be a kind of web service, and the client systems will presumably talk to it via a RESTful or WCF (Windows Communication Foundation) approach -- REST is more common on Linux and WCF on Windows, but the two are very similar. WCF is more easily used by people who program in Windows: Visual Studio has built-in templates for this case, and you'll end up just replacing a series of TODO comments with your specialized code. But neither is particularly hard to use.

You now have a client that is the legal owner of a flight record, has made a change to it, and wants to share that change. The client uses REST or WCF to communicate the change to the DDS. How should the DDS log the record? (Hint: Flight records can be fairly large; a few megabytes might not be unusual. You'll probably want to do a single disk write that appends to some form of log that can later be cleaned up: read/write updates to a random-access database will be too slow. Another hint: once a flight plan is updated, the prior version can be replaced with a "redirection" pointer to the newer one, and then the older copy can be garbage collected.) Now how should the DDS fault-tolerantly share this record within the system? How will you handle failures of the DDS group members? Is there an opportunity to load-balance within the DDS? Which protocols should be used to share information within the DDS itself? Which protocols would work best for reporting updates out to the clients who have read-only access to the record (e.g. they aren't the leader, but they are tracking the flight)?
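
The two logging hints can be sketched together: each new version is written with one append, and an older version's location is turned into a redirection to the newer one. The Python below is a minimal single-process sketch under several assumptions: the index and redirection table live in memory only (a real DDS would need to recover them after a crash), there is no garbage collection or replication, and the names FlightLog, append, resolve and read are hypothetical.

```python
import os


class FlightLog:
    """Append-only log of flight records with redirection from old versions."""

    def __init__(self, path):
        self._path = path
        self._latest = {}    # flight_id -> (offset, length) of the newest version
        self._redirect = {}  # superseded (offset, length) -> newer (offset, length)
        open(path, "ab").close()  # make sure the log file exists

    def append(self, flight_id, payload):
        """Append one record with a single write; return its (offset, length)."""
        if not payload:
            raise ValueError("empty records are not allowed")
        with open(self._path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the record to disk before acknowledging
        location = (offset, len(payload))
        previous = self._latest.get(flight_id)
        if previous is not None:
            self._redirect[previous] = location  # old copy now points at the new one
        self._latest[flight_id] = location
        return location

    def resolve(self, location):
        """Follow redirections until reaching the newest version's location."""
        while location in self._redirect:
            location = self._redirect[location]
        return location

    def read(self, location):
        """Read the newest version reachable from `location`."""
        offset, length = self.resolve(location)
        with open(self._path, "rb") as f:
            f.seek(offset)
            return f.read(length)
```

A reader holding a stale (offset, length) pair for a flight still gets the current plan, and once no one holds the old location, that region of the log is a candidate for cleanup.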

One interesting topic to think about is the performance of the resulting solution. Can you optimize it to the point that no member of the entire system ever does more than a single disk read for each record? Would these large records be better off included as data "within" a Vsync multicast, or would it be better to multicast some sort of pointer to the record ("look in the log for Tuesday 8-4-2014, and fetch a flight record that starts at offset 0xA76102 of length 232 KB")? Should you use the Vsync OOB tool for the data transfer? An important performance goal might be to ensure that no record is read on any machine more than once, or demarshalled more than once, and perhaps even that no record is demarshalled until the data within it is actually consulted.
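
The "pointer instead of payload" option and the "demarshal only when consulted" goal fit together naturally. In the Python sketch below, the small RecordPointer is what you would multicast, and a LazyRecord wraps it so that the bytes are fetched at most once and decoded at most once, and only when some field is first read. Everything here is an assumption made for illustration: the names, the use of JSON as the marshalling format, and the fetch callback standing in for a log read or an out-of-band transfer. It does not use the Vsync OOB tool.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordPointer:
    """Small, cheap-to-multicast description of where a record lives."""
    log_name: str
    offset: int
    length: int


class LazyRecord:
    """Fetches and demarshals a record only on first access to its contents."""

    def __init__(self, pointer, fetch):
        self.pointer = pointer
        self._fetch = fetch    # callable(RecordPointer) -> bytes
        self._raw = None
        self._decoded = None
        self.fetches = 0       # counters so the "at most once" goal can be checked
        self.decodes = 0

    def field(self, name):
        if self._decoded is None:
            if self._raw is None:
                self._raw = self._fetch(self.pointer)
                self.fetches += 1
            self._decoded = json.loads(self._raw)
            self.decodes += 1
        return self._decoded[name]
```

A workstation that tracks thousands of flights but only ever displays a few would pay the read and decode cost only for those few.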

Which Vsync protocol is the best choice for real-time data updates of this kind, given the critical nature of the solution? How does the answer change if we wish to support a form of disconnected operation, in which a controller who loses connectivity to the DDS can still perform a full range of actions in a kind of "offline" mode using previously received read-only records? Compare this with a solution in which only a controller who is "online" can be sure of seeing the data for current flights and can take control actions. Would it make sense to further break down the failure cases into ones in which just one controller has experienced a crash versus ones in which all the controllers in an ATC center have simultaneously lost connectivity? (After all, in the former case the controller's workstation goes down, but her colleagues, sitting right next to her, are still online and she'll know that.)

How should the DDS repair itself after a failure that occurs while a flight plan update is being transmitted? Does virtual synchrony help you analyze this case and convince yourself of the correctness of the solution you've selected?

Finally, how would you argue the full end-to-end correctness of your entire solution? Can you prove that the solution will in fact have the safety properties you specified at the outset? Could you design a testing strategy to confirm that the implementation actually works in the way required by your algorithm?

Demo of your system: Time-lagged data on real ATC flights and radar are available for free from some web sources, and older traces are also available (you "replay" the latter). Use them to show your faculty advisor how your system really works under realistic conditions!

Add-ons: Most real ATC systems operate in centers that need a second center-to-center protocol to communicate with one another. How would you interconnect the ATC centers in your basic system?

Last edited Nov 19, 2015 at 9:20 PM by birman, version 2