Using Vsync on cloud-hosted machines with dual IP addresses

Apr 19, 2016 at 12:42 PM
Hello Mr. Birman,

I'm trying to run two Vsync apps on different machines, but they seem unable to see each other. Could you please help me?
Here are the details:
One machine is my local virtual machine (the client); the other is an Amazon EC2 machine (the server).
On the local machine I set VSYNC_UNICAST_ONLY=True and VSYNC_HOSTS=amazon_machine_ip. On the Amazon machine I set VSYNC_UNICAST_ONLY=True. The Amazon (server) app starts first, initializes Vsync and creates a group; only after that do I start the app on my local (client) machine, which tries to join the existing Vsync group. Both firewalls are disabled and I've double-checked that there are no connection problems: I can send both UDP and TCP packets in both directions on ports 9753 and 9754. Moreover, I can crash the Vsync instance running on the Amazon server by sending random data to these UDP ports.
I've also tried running the server app on another of my virtual machines (in a different subnetwork), and everything works: my client app is able to find the oracle, join the group and start exchanging messages over Vsync.
The only difference I see is that my two virtual machines connect using local IPs, 192.168.0.17 (client) and 192.168.10.3 (server), whereas the Amazon server machine has two different IPs: a private one that is visible inside the Amazon network and an external one that is visible to the client.
The client uses the server's public IP to connect to it (VSYNC_HOSTS=public_amazon_machine_ip), but the server thinks it has a different IP, the private one (it writes it to the log).
Could this cause the problem? Is there anything else I can do?
Coordinator
Apr 19, 2016 at 12:56 PM
Edited Apr 19, 2016 at 11:56 PM
This is actually not something Vsync can support, unfortunately. It has to do with Amazon's firewall rules: you might be able to get it to work on other cloud platforms, but you would have to try it one by one.

Amazon creates one "internal" IP address per instance within AWS. The addresses in this set (call them A1, A2, ...) allow the instances to talk to one another. So A1 would be an IP address similar to what a wireless router might assign to your computers inside your home. Vsync can set up groups with membership over this set. Thus if you have 3 Vsync-enabled instances of some service running on three AWS nodes, they end up with IP addresses A1, A2 and A3, and they can talk to one another.

Then Amazon assigns an "external" IP address to your entire virtual private cloud. Call this "konst3d.com" if you wish. When applications running web clients try to access a service in konst3d.com, Amazon intercepts the request and directs it to one of your instances, using a load balancing policy that you can control. If they make a TCP connection, the applications will actually see a fake remote (AWS-side) endpoint which is in fact the IP address of the AWS load balancer for konst3d.com. It operates as a NAT box and dynamically translates IP addresses and port numbers to keep connectivity for TCP.

But the effect of this is that individual servers do not have any directly accessible IP address with which your client program C can actually talk directly to A1, A2, ...

This blocks Vsync from forming groups (including its internal concept of a "Vsync Client Proxy") if one application will run outside the AWS firewall and the remainder will run inside the AWS firewall. In effect, the firewall blocks the needed connectivity.

The only way you can solve this is to have your client program, which runs outside AWS, not use the Vsync library and instead talk to Amazon using RESTful RPC or some other RPC package for web services (there are a half dozen of these and Amazon supports most of them, because they all boil down to the web services SOAP standard, and this is what its load balancers understand). Then have your server inside AWS act as a kind of proxy and relay actions on behalf of the external client. For example, if the external client updates some record (maybe you are tracking mobile cars and this is a new GPS location record for my Subaru), the client (which runs on the Subaru) sends in an update web request, the proxy receives it, and then it turns around and issues a g.Send() in the group you use to track GPS data. Same idea for a query: the client issues a query and the AWS proxy does the query.

Thus the proxy uses Vsync. In fact probably it plays several roles: Proxy for clients, but also "server" in the sense of using Vsync and holding the GPS data, etc.

Sorry you had trouble with this! I don't know of any workaround that would be transparent to the developer. So what I've described is pretty much how we solve this sort of thing.

PS: If this is hard to visualize, here's an analogy I use with students. Imagine that there is a furniture store called "Burns Family Furniture" and three identical triplets run it: Harry Burns, George Burns and Willy Burns. In fact they can recognize each other because Harry has a scar on his eyebrow from when George whacked him with a rock at age 3, and Willy once had a pierced nostril, so you can still see that little hole. But customers don't know this.

So customers (clients) always call and speak to "Mr. Burns". They just have no idea which Burns they got. In fact two different calls might reach two different versions of Mr. Burns. The triplets get a blast out of that and are masters at not giving away who is who.

But within the store, the three of them know perfectly well who is who, and they use first names. "Harry, I have a new order over on State Street. Can you fit one more table into the truck?" "Sorry, George, full up! But I know Willy will have some room later, when he gets back from his delivery in Freeville."

The idea would be that Vsync inside AWS (inside the store) is helping Harry, George and Willy cooperate and coordinate. But it is not available to the outside customer, who just knows that, well, he called (perhaps using RESTful RPC) to order a table, and spoke to "Mr. Burns".
Apr 20, 2016 at 7:52 AM
Edited Apr 20, 2016 at 9:59 AM
Thank you for the quick and very detailed answer.
OK, I get the idea: you cannot run Vsync when some of the group members are hidden behind an Amazon load balancer, or anything else that prevents you from communicating directly with the server using its IP.
But we have a slightly different use case. Each of our Amazon servers has its own public IP, so no load balancer is involved at the moment and you always reach the same machine through its public IP.
The major difference from the case where all machines are in the same network is that the Amazon machine has two different IPs: a private one, visible to the Amazon server itself, and a public one, visible to machines outside Amazon.
I've done some research in the Vsync source code and, as far as I can see, the machine IP is used to create the Vsync server\node address: IP_address+process_ID (+some ports). So the question is: could this IP-address duality (public\private) prevent our local machine from joining the Vsync group running on the Amazon server? Something like: the local machine expects to find the oracle at public_IP, but the Amazon machine believes that public_IP belongs to somebody else (it is only able to see its private IP) and so ignores requests from our local machine?

Added:
Well, I'm afraid that things are even worse: our local machine also has two IP addresses, a local one that is visible to the machine itself and an external one (the router's address).
So I guess that in this case the remote machine will try to answer the local machine using its internal IP, because that IP was used to create the machine's Vsync address.
I have to check anyway.

Added #2:
Yes, this is exactly what happens in our case: the Amazon machine receives messages from the local machine and tries to reply using the IP part of the Vsync address, which in this case is the local IP.
I've enabled Vsync debug output like this:
internal const long Debug = MESSAGELAYER | LOWLEVELMSGS | VERBOSEADDRS;
And here is what I see in the Amazon machine's log:
In group VSYNC_TEST_GROUP_0 after callbacks for request STATEXFER, msg 0:5 from (3512:172.31.16.31/9753:9754)
ReliableSender.SendGroup to <VSYNCMEMBERS>... type=multicast, Msg=Msg<(3512:172.31.16.31/9753:9754)::0:0, dest=(0:224.0.75.50/0:0), flags = { }>
PendingSendBuffer.Add: sender=(3512:172.31.16.31/9753:9754), msgid=0:0, ACKID 12, theView.vid=0, dests=[(3512:172.31.16.31/9753:9754)]
Send preparing to multicast outgoming msg: type 2, seqn12, sender (3512:172.31.16.31/9753:9754), dest (0:0.0.0.0/0:0), gaddr (0:224.0.75.50/0:0), minStable -1, buffer len 752
SocketSend CAST: 752 bytes
[2] Loopback and return in doSend
finished in doSend
GotAMsg<VSYNCMEMBERS>: event type=multicast, sender (3512:172.31.16.31/9753:9754) rank is 0, m.offWire is null
Msg<(3512:172.31.16.31/9753:9754)::0:0, dest=(0:224.0.75.50/0:0), flags = { }>
In group VSYNCMEMBERS about to do callbacks for request IPMC Views, msg 0:0 from (3512:172.31.16.31/9753:9754)
In group VSYNCMEMBERS after callbacks for request IPMC Views, msg 0:0 from (3512:172.31.16.31/9753:9754)
Receive successfully parsed incoming msg (phys len 648): type 2, seqn1, sender (968:192.168.0.17/9753:9754), dest (0:0.0.0.0/0:0), gaddr (0:224.0.19.136/0:0), buffer len 496
GotAMsg<ORACLE>: event type=multicast, sender (968:192.168.0.17/9753:9754) rank is -1, m.offWire is null
Msg<(968:192.168.0.17/9753:9754)::0:-1, dest=(0:224.0.19.136/0:0), flags = { }>
In group ORACLE about to do callbacks for request JOIN, msg 0:-1 from (968:192.168.0.17/9753:9754)
NextP2PSeqn<sendp2p> dest (968:192.168.0.17/9753:9754), using P2P seqn 0
Send preparing to send pt-to-pt outgoming msg: type 4, seqn13, sender (3512:172.31.16.31/9753:9754), dest (968:192.168.0.17/9753:9754), gaddr (0:0.0.0.0/0:0), minStable -1, buffer len 368
finished in doSend
NextP2PSeqn<sendp2p> dest (968:192.168.0.17/9753:9754), using P2P seqn 1
Send preparing to send pt-to-pt outgoming msg: type 5, seqn14, sender (3512:172.31.16.31/9753:9754), dest (968:192.168.0.17/9753:9754), gaddr (0:224.0.19.136/0:0), minStable -1, buffer len 368
finished in doSend
In group ORACLE after callbacks for request JOIN, msg 0:-1 from (968:192.168.0.17/9753:9754)
SendAck(g, sender, UID);
SendAck() to 192.168.0.17.
Sending an Ack: dest=(968:192.168.0.17/9753:9754), AckID=1
Exiting doReceive();
Message received.
Resender.sendto[Now:22401, md.resendTime:22170]: dest (968:192.168.0.17/9753:9754), ID 14/0:1, len 368 bytes
Resender.sendto[Now:22402, md.resendTime:22168]: dest (968:192.168.0.17/9753:9754), ID 13/0:0, len 368 bytes
Resender.sendto[Now:22915, md.resendTime:22902]: dest (968:192.168.0.17/9753:9754), ID 14/0:1, len 368 bytes
Resender.sendto[Now:22916, md.resendTime:22902]: dest (968:192.168.0.17/9753:9754), ID 13/0:0, len 368 bytes

192.168.0.17 - it's our virtual machine's local IP.
172.31.16.31 - amazon machine's local IP.
Coordinator
Apr 20, 2016 at 11:10 AM
Well, the system is open source and I don't post updates all that often (because nobody seems to find bugs lately). So changing it wouldn't be all that hard. But you would need to do this change.

I would suggest that you just modify the source to understand this kind of aliasing. I would maintain a table and then modify the Address equality and comparator methods to treat the two IP addresses as synonyms, while using the "internal" one as the IP address inside the Address structure itself (which you shouldn't change!). You would also need to use the proper IP address when sending a UDP packet, based on who is doing the Send operation. And run in UNICAST_ONLY mode.

Do it in a clean and modular way so that when I do release an update six months from now it won't be hard to redo the change (or, if it is clean and minor enough, I might take it in as part of the main system).
Apr 21, 2016 at 8:43 AM
Yes, that was my first thought: modify the source code so that it understands this kind of public\private IP aliasing.
But I see a problem here. With Vsync group members spread across different local networks and connected over the internet, we can easily end up with two servers that have the same local\private IP and thus the same Vsync address.
Another thing to consider: when you send a message to a group member in another network you have to use its public IP, BUT servers in the same network should use its private\local IP. That is probably just a performance matter, but I believe it could be vital for a real application. So we need a way to determine whether we are in the same network as another group member.
Considering all this, here is what I think could be done:
1) Allow the public\external IP address to be specified. This could be done via an environment variable, as is done for VSYNC_HOSTS and other configuration parameters. I'm not sure it can be obtained programmatically.
2) Extend the Vsync address with: a) the external IP, to connect from outside the subnet, and b) the subnet mask, so that everybody can check whether they are in the same subnet as this node (and so can use the local\private IP).
3) When you send something to a group member, check whether you are in the same subnet; if so, use the private\local IP, otherwise the public\external one.
4) When you receive a message and (probably) check the sender address, take this into account as well and check the subnet mask before checking the sender's IP address.
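
For illustration, the check in step 3 could be sketched roughly as below. This is only a sketch under assumptions: IPv4 addresses, and a node that already knows its own subnet mask plus each peer's private and public address. The class and method names (SubnetCheck, SameSubnet, ChooseDestination) are hypothetical and are not part of Vsync.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of Vsync: decides which address to send to.
public static class SubnetCheck
{
    // Packs the four octets of an IPv4 address into one 32-bit value.
    private static uint ToUInt32(IPAddress ip)
    {
        byte[] b = ip.GetAddressBytes();
        if (b.Length != 4) throw new ArgumentException("IPv4 addresses only");
        return ((uint)b[0] << 24) | ((uint)b[1] << 16) | ((uint)b[2] << 8) | (uint)b[3];
    }

    // True when a and b have the same network part under the given mask.
    public static bool SameSubnet(IPAddress a, IPAddress b, IPAddress mask)
    {
        uint m = ToUInt32(mask);
        return (ToUInt32(a) & m) == (ToUInt32(b) & m);
    }

    // Step 3: private address for peers in our subnet, public address otherwise.
    public static IPAddress ChooseDestination(IPAddress myPrivate, IPAddress mask,
                                              IPAddress peerPrivate, IPAddress peerPublic)
    {
        return SameSubnet(myPrivate, peerPrivate, mask) ? peerPrivate : peerPublic;
    }
}
```

Note that the mask comparison alone cannot tell apart two unrelated LANs that happen to use the same private range (the duplicate-address problem mentioned above), so a real implementation would need some additional test.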

I've done a very quick and dirty test, just to check whether this is possible:
I modified the Vsync source code to accept one more configuration parameter (passed via an environment variable): an IP address to use as part of the Vsync address instead of the local machine IP that was previously fetched programmatically.
So the only place where the local\private IP is used in the modified code is inside the Socket ReliableSender.SetSocketUp(out Socket s, ref int port, RcvrDel rd, out Thread t, ThreadPriority pri, string name) method, in IPEndPoint localEndPoint = new IPEndPoint(Vsync.my_IPaddress, 0); which initializes a socket for sending.
Then I configured both my machines (Amazon and local) to use the public\external IP as part of the Vsync address and... they were finally able to see each other, join the same group and execute a few ordered queries!

P.S.
Can we expect the next vsync version this autumn?
Coordinator
Apr 21, 2016 at 12:03 PM
Edited Apr 21, 2016 at 12:03 PM
I'm not at all comfortable with anything that would change the Vsync Address class. The space impact is too disruptive and I don't want to harm the whole system just to support one scenario.

Instead, my suggestion is that you treat this as a question of "routing tables".

You could add three new environment variables: VSYNC_MY_HOSTNAME="whatever", VSYNC_MY_IPADDR="whatever", VSYNC_ROUTING_TABLE="filename".

If the former two are given, the system would trust the ones you provide. But the key thing is the routing table. This could have the form of an n x n table, which we can think of as being a map RoutingTable[ipA, ipB] => ipC, meaning, if you are node ipA and are trying to send to node ipB (using the official internal IP addresses that are inside the Vsync Address struct), then send to ipC.

As I write this, the question arises for me of whether in fact you need to map to a tuple: (nic, ipD). A modern computer has network interfaces with various names and you may actually need to tell each machine which interface to send on because sometimes, routing just isn't going to work otherwise. Worst case you might even need a triple: (nic, ipS, ipD) meaning "Send on nic, have your source IP address show as ipS in the packet, and have your destination show as ipD in the packet header."

You could even support wild-cards in this table: 128.64.*.* -> 67.221.*.*: (eth1, 67.221.?.?, 67.221.?.?) meaning that if you have a Vsync address in the form 128.64.xx.xxx, and are sending to 67.221.yy.yyy, take xx.xxx and yy.yyy and substitute them in the ipS and ipD addresses, so that on the wire you would send on nic eth1, changing the IP header to show a packet from 67.221.xx.xxx to 67.221.yy.yyy, etc.

Then you would need to find all calls to Socket.Send and have them do this mapping at the very last step before the Send, and for Socket.Receive, having them remap in the other direction (if you get a packet on nic such-and-such from ipS to ipD, change the sender IP address to ipA and the dest IP address to ipB). Leave the rest of Vsync untouched.

Notice that with this form of table you could use entirely fake internal IP addresses that would be entirely assigned by you. The only real issue is that you might need to customize the routing table because the names of interfaces are kind of non-standard. Whatever Amazon is doing, I wouldn't assume that your home computer has the same convention, or that LiquidMetal does it the identical way, etc.

So something like this would fit my concept of a modular extension. It would be narrow, solving exactly this problem. It wouldn't change anything if the new options weren't specified. It seems quite general to me. And it imposes overhead only during startup (reading the table), since a table lookup to swap the addresses should be very cheap even if you use a wild-card scheme. And it needs to be cheap, since you really don't want to slow down the lowest level packet I/O methods.
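
To make the table idea concrete, here is a minimal sketch of such a lookup. It is not Vsync code: the RoutingTable and Route classes, the per-octet "*" wildcard in match patterns and the "?" placeholder in on-the-wire templates are all assumptions made for illustration, and addresses are handled as dotted IPv4 strings to keep it short.

```csharp
using System;
using System.Collections.Generic;

// Result of a lookup: which interface to use and which addresses go on the wire.
public sealed class Route
{
    public string Nic, WireSrc, WireDst;
}

// Hypothetical wildcard routing table, keyed by the (source, destination)
// IP addresses stored in the Vsync Address structs.
public sealed class RoutingTable
{
    private sealed class Rule
    {
        public string[] Src, Dst;          // match patterns, "*" matches any octet
        public string Nic;
        public string[] WireSrc, WireDst;  // templates, "?" copies the original octet
    }

    private readonly List<Rule> rules = new List<Rule>();

    private static string[] Octets(string ip)
    {
        string[] parts = ip.Split('.');
        if (parts.Length != 4) throw new ArgumentException("expected four octets: " + ip);
        return parts;
    }

    public void Add(string src, string dst, string nic, string wireSrc, string wireDst)
    {
        rules.Add(new Rule { Src = Octets(src), Dst = Octets(dst), Nic = nic,
                             WireSrc = Octets(wireSrc), WireDst = Octets(wireDst) });
    }

    private static bool Matches(string[] pattern, string[] addr)
    {
        for (int i = 0; i < 4; i++)
            if (pattern[i] != "*" && pattern[i] != addr[i]) return false;
        return true;
    }

    private static string Fill(string[] template, string[] addr)
    {
        string[] result = new string[4];
        for (int i = 0; i < 4; i++)
            result[i] = template[i] == "?" ? addr[i] : template[i];
        return string.Join(".", result);
    }

    // Returns the first matching rule's route, or null if no rule applies
    // (in which case the caller would send exactly as it does today).
    public Route Lookup(string vsyncSrc, string vsyncDst)
    {
        string[] s = Octets(vsyncSrc), d = Octets(vsyncDst);
        foreach (Rule r in rules)
            if (Matches(r.Src, s) && Matches(r.Dst, d))
                return new Route { Nic = r.Nic, WireSrc = Fill(r.WireSrc, s), WireDst = Fill(r.WireDst, d) };
        return null;
    }
}
```

With a rule added as Add("128.64.*.*", "67.221.*.*", "eth1", "67.221.?.?", "67.221.?.?"), a lookup for 128.64.10.20 sending to 67.221.30.40 yields eth1 with on-the-wire addresses 67.221.10.20 and 67.221.30.40. The linear scan and string splitting here are for readability only; since the lookup sits on the packet I/O path, a real version would probably parse the rules once at startup and cache results per address pair.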

I'll post a status update on the V2.3 discussion thread.
Apr 22, 2016 at 9:14 AM
So VSYNC_MY_IPADDR is the IP that Vsync will use as part of its address, which means you can specify anything you wish here, because it will be replaced with the real IPs from the routing table when sending\receiving. Did I get that right?
What is the role of the VSYNC_MY_HOSTNAME parameter?
Coordinator
Apr 22, 2016 at 12:06 PM
The system uses it to print the node address in the state dump. Applications can get access to it too, as a favor to make life easier for developers (on Linux and Windows, there is no completely standard way to obtain the name of the machine on which you are running).
Apr 22, 2016 at 3:43 PM
Thank you very much for your detailed explanation. I'll try to implement it the way you proposed.
Apr 29, 2016 at 3:28 PM
Edited Apr 29, 2016 at 3:29 PM
Found this during a short Vsync debugging session (Vsync.Group):
        public bool Lock(string lockName, int timeout)
        {
            return this.Lock(lockName, int.MaxValue, WRITELOCK);
        }
        public bool ReadLock(string lockName, int timeout)
        {
            return this.Lock(lockName, int.MaxValue, READLOCK);
        }
        public bool WriteLock(string lockName, int timeout)
        {
            return this.Lock(lockName, int.MaxValue, WRITELOCK);
        }
int.MaxValue is used instead of the timeout parameter. I replaced it and did some testing, and the timeout seems to work fine, so this code was probably left over from debugging.
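
For clarity, the replacement amounts to something like the sketch below. Only the three two-argument overloads mirror the snippet above; the enclosing class, the constant values and the stubbed three-argument Lock (whose parameter types are guessed from the calls) are placeholders added so the fragment compiles on its own.

```csharp
// Stand-in for Vsync.Group, for illustration only.
public class GroupLockSketch
{
    // Placeholder values; the real constants live in Vsync.
    public const int READLOCK = 0;
    public const int WRITELOCK = 1;

    // Recorded by the stub so the forwarded arguments can be inspected.
    public int LastTimeout;
    public int LastKind;

    // Stub for the existing three-argument overload, which does the real work in Vsync.
    public bool Lock(string lockName, int timeout, int kind)
    {
        LastTimeout = timeout;
        LastKind = kind;
        return true;
    }

    // The change: pass the caller's timeout through instead of int.MaxValue.
    public bool Lock(string lockName, int timeout)
    {
        return this.Lock(lockName, timeout, WRITELOCK);
    }

    public bool ReadLock(string lockName, int timeout)
    {
        return this.Lock(lockName, timeout, READLOCK);
    }

    public bool WriteLock(string lockName, int timeout)
    {
        return this.Lock(lockName, timeout, WRITELOCK);
    }
}
```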

P.S.
Not sure where I should post such things, so I posted it here. Feel free to move\delete this post.
Coordinator
Apr 29, 2016 at 3:42 PM
I think this probably would be better as a separate posting on "issues", but in any case it seems like a reasonable fix. I have no idea why it was ignoring the timeout parameter but I completely agree that the intention was for timeout to be passed in and used.