Let’s talk failover. Most failover tools (keepalived, heartbeat, wackamole/spread) communicate over multicast. Multicast acts as a sort of “bulletin board” between computers: anybody on the network can read the bulletin board, and anybody on the network can post to it. Failover tools use it to pass messages between computers. For instance, you could have three computers on a network, all posting to and listening on the same multicast group: “Hey, I’m alive!” If one of the machines stops sending this repetitive message, the others know that something is wrong: it has been disconnected, has gone down, and so on. They can act on that information: was that computer hosting a shared IP? Give the IP to one of the computers that is still responding. This is the general idea behind IP-based failover.
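To make the “I’m alive!” idea concrete, here is a minimal Python sketch of the bookkeeping each machine would do. It is not how keepalived or heartbeat are actually implemented; the class name, the function name, and the three-second timeout in the usage note are all made up for illustration, and the network part (actually sending and receiving the messages) is left out.

```python
import time
from typing import Callable, Dict, List, Optional


class HeartbeatMonitor:
    """Remembers when each peer last said "I'm alive" (hypothetical helper)."""

    def __init__(self, timeout: float, clock: Callable[[], float] = time.monotonic):
        self.timeout = timeout          # seconds of silence before a peer counts as down
        self.clock = clock              # injectable so the logic can be tested without sleeping
        self.last_seen: Dict[str, float] = {}

    def record(self, peer: str) -> None:
        """Call this whenever a heartbeat message from `peer` arrives."""
        self.last_seen[peer] = self.clock()

    def dead_peers(self) -> List[str]:
        """Peers that have been silent for longer than the timeout."""
        now = self.clock()
        return sorted(p for p, seen in self.last_seen.items() if now - seen > self.timeout)


def pick_new_owner(monitor: HeartbeatMonitor, current_owner: str) -> Optional[str]:
    """Keep the shared IP where it is unless its owner went silent.

    If the owner is down, hand the IP to the first peer (alphabetically)
    that is still sending heartbeats, or return None if nobody is left.
    """
    dead = set(monitor.dead_peers())
    if current_owner not in dead:
        return current_owner
    alive = sorted(p for p in monitor.last_seen if p not in dead)
    return alive[0] if alive else None
```

In use, every machine would call `record()` for each heartbeat it hears and periodically call `pick_new_owner()`; with `HeartbeatMonitor(timeout=3.0)`, a machine that stays quiet for more than three seconds loses the shared IP to one of the others.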
Now, there’s no inherent problem with multicast. It’s generally known for being unreliable, but when all you’re sending is “Hi!” over the wire, data integrity isn’t a high priority. The real problem with multicast is that most “cloud” (VPS) providers (AWS, Linode, Slicehost, Rackspace, etc.) don’t support it on their networks. You can send a multicast message to a group, but your other machines listening on that group won’t hear it. The other problem is that the failover tools mentioned above ONLY support multicast. There is no way to tell them to talk to another machine directly over unicast, which cloud providers do support.
One way you can solve this is by using GRE tunnels, which let you create a tunnel to another computer with the traffic wrapped inside ordinary unicast packets (GRE encapsulates the traffic; it does not encrypt it). This allows multicast communications to pass between two computers, even if the network normally blocks them.
I recently tried to get this set up on my current host, Linode. I was not successful, even with the help of another member who had the same problem (but solved it with GRE). I just could not get two machines to talk to each other over a GRE tunnel with keepalived.
The solution
I posted my question to serverfault.com as a last resort. I’d asked more or less the same question there before, but didn’t get the answer I wanted. This time, though, I hit the jackpot.
Willy Tarreau, creator of HAProxy, responded with a patch to keepalived that allows it to communicate over unicast. I applied it, recompiled, set up the new options the patch gives (“vrrp_unicast_bind” and “vrrp_unicast_peer”), and spun it up on both machines.
Yesss!! It works! Stopping HAProxy on the first server made the second machine take the shared IP.
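For anyone trying the same thing, a configuration along these lines is roughly what the setup calls for. Treat it as a sketch, not a copy of my actual file: the two `vrrp_unicast_*` option names come from the patch, but their placement inside `vrrp_instance` is an assumption, and the addresses, interface, router ID, priorities, and the `chk_haproxy` check are all placeholders.

```
# Sketch for the first machine; every address and name here is a placeholder.
vrrp_script chk_haproxy {
    script "killall -0 haproxy"       # exits non-zero when no haproxy process exists
    interval 2
}

vrrp_instance VI_1 {
    interface eth0
    state MASTER
    virtual_router_id 51
    priority 101                      # use a lower value on the second machine

    vrrp_unicast_bind 192.0.2.10      # this machine's own address (option added by the patch)
    vrrp_unicast_peer 192.0.2.11      # the other machine's address (option added by the patch)

    virtual_ipaddress {
        192.0.2.100                   # the shared IP
    }

    track_script {
        chk_haproxy
    }
}
```

The second machine would get the mirror image: bind and peer addresses swapped, and a lower priority.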
Now, ideally there would be a bunch of machines, namely all my web servers, standing by ready to take the shared IP. This patch only allows me two machines. Failover is failover, though: when one instance goes down, I get an email and can go in and investigate. I’d still like to know if there is a way to do failover on a cluster of servers without multicast, but for now this works great.