Let’s talk failover. Most tools for failover (keepalived, heartbeat, wackamole/spread) communicate over multicast, a way of sending one message to a whole group of machines at once. Multicast acts as a sort of “bulletin board” between computers. Anybody on the network can look at the bulletin board, and anybody on the network can post to the bulletin board. Normally, failover tools use multicast to pass messages between computers. For instance, you could have three computers on a network, all posting and listening to the same multicast group: “Hey, I’m alive!” If one of the machines stops sending this repetitive message, the others know that something is wrong…either it has been disconnected or it has gone down. They can use that information to act: was that computer hosting a shared IP? Give the IP to one of the computers that are still responding. This is the general idea behind IP-based failover.
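To make the “bulletin board” idea concrete, here is a minimal Python sketch of the decision logic only. It is not how keepalived or heartbeat are implemented; the `HeartbeatMonitor` class, the peer names, and the 3-second timeout are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Remembers when each peer last said "I'm alive" (hypothetical helper)."""
    timeout: float  # seconds of silence before a peer is considered down
    last_seen: dict = field(default_factory=dict)

    def heard_from(self, peer, now):
        # Record the time of the latest heartbeat from this peer.
        self.last_seen[peer] = now

    def alive(self, now):
        # A peer is alive if its last heartbeat is recent enough.
        return sorted(p for p, t in self.last_seen.items()
                      if now - t <= self.timeout)

    def owner_of_shared_ip(self, priority, now):
        # The first live peer in priority order gets the shared IP.
        live = set(self.alive(now))
        for peer in priority:
            if peer in live:
                return peer
        return None


monitor = HeartbeatMonitor(timeout=3.0)
for peer in ("web1", "web2", "web3"):
    monitor.heard_from(peer, now=0.0)

# web1 goes silent; the other two keep posting.
monitor.heard_from("web2", now=4.0)
monitor.heard_from("web3", now=4.0)

new_owner = monitor.owner_of_shared_ip(["web1", "web2", "web3"], now=5.0)
print(new_owner)  # web2
```

The real tools add elections, priorities, and the actual IP takeover on top of this, but the core is the same: silence past a timeout means somebody else takes over.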

Now, there’s no inherent problem with multicast. It’s generally known for being unreliable, but when all you’re sending is “Hi!” over the wire, data integrity isn’t a high priority. The real problem with multicast is that most “cloud” (VPS) providers (AWS, Linode, Slicehost, Rackspace, etc.) don’t support it on their networks. You can send a multicast message to a group, but your other machines listening on that group won’t hear it. The other problem is that the failover tools mentioned above ONLY support multicast. There is no way to tell them to talk to another machine directly over unicast, which cloud providers do support.

One way you can solve this is by using GRE tunnels, which let you create a tunnel to another computer by wrapping packets (multicast included) inside ordinary unicast packets. Note that GRE only encapsulates; it does not encrypt anything by itself. This allows multicast communications to pass between two computers, even if the network normally drops them.

I recently tried to get this set up on my current host, Linode. I was not successful, even with the help of another member who had the same problem (but solved it with GRE). I just could not get two machines to talk to each other over a GRE tunnel with keepalived.

The solution

I posted my question to serverfault.com as a last resort (video). I’d asked more or less the same question there before, but didn’t get the answer I wanted. This time, though, I hit the jackpot.

Willy Tarreau, creator of HAProxy, responded with a patch to keepalived that allows it to communicate over unicast. I applied it, recompiled, set up the new options the patch gives (“vrrp_unicast_bind” and “vrrp_unicast_peer”), and spun it up on both machines.

Yesss!! It works! Stopping HAProxy on the first server made the second machine take the shared IP.
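For a rough picture of what “heartbeats over unicast” means, here is a small Python sketch that sends an “alive” message straight to one peer’s address instead of to a multicast group. To be clear, this is not what the patch does internally (keepalived speaks VRRP, not this toy UDP format); the function names, the message format, and the timeouts are all assumptions for illustration, and both “machines” live on loopback so the example is self-contained.

```python
import socket


def send_heartbeat(sock, peer_addr, name):
    # Unicast: the datagram is addressed to exactly one peer.
    sock.sendto(f"alive {name}".encode(), peer_addr)


def wait_for_heartbeat(sock, timeout):
    # Return the heartbeat text, or None if the peer stayed silent.
    sock.settimeout(timeout)
    try:
        data, _ = sock.recvfrom(1024)
    except socket.timeout:
        return None
    return data.decode()


# The backup binds to its own address (port 0 = pick any free port).
backup = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
backup.bind(("127.0.0.1", 0))
master = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

send_heartbeat(master, backup.getsockname(), "web1")
first = wait_for_heartbeat(backup, timeout=1.0)

# The master sends nothing more, so the next wait times out.
second = wait_for_heartbeat(backup, timeout=0.2)
take_over = second is None

master.close()
backup.close()
print(first, take_over)
```

Since each side has to name its peer explicitly, this style of setup naturally pairs up two machines, which matches the two-machine limit I ran into.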

Now, ideally there would be a bunch of machines, namely all my web servers, standing by ready to take the shared IP. This patch only allows me two machines. Failover is failover, though: when one instance goes down, I get an email and can go in and investigate. I’d still like to know if there is a way to do failover on a cluster of servers without multicast, but for now this works great.

A while back I wrote a post about using NginX as a reverse-proxy cache for PHP (or whatever your backend is) and mentioned how I was using HAProxy to load balance. The main author of HAProxy wrote a comment about keep-alive support and how it would make things faster.

At the time, I thought “What’s the point of keep-alive on the front-end? By the time the user navigates to the next page of your site, the timeout has expired, meaning a connection was left open for nothing.” This assumed that a user downloads the HTML for a page and doesn’t download anything else until their next page request. I forgot that some websites actually have things other than HTML, namely images, CSS, JavaScript, etc.

Well in a recent “omg I want everything 2x faster” frenzy, I decided for once to focus on the front-end. On beeets, we’re already using S3 with CloudFront (a CDN), aggressive HTTP caching, etc. I decided to try the latest HAProxy (1.4.4) with keep-alive.

I got it, compiled it, reconfigured:

defaults
	...
	option httpclose

became:

defaults
	...
	timeout client  5000
	option http-server-close

Easy enough…that tells HAProxy to close the server-side connection, but leave the client connection open for 5 seconds (HAProxy timeouts are in milliseconds unless you give a unit).

Well, a quick test and site load times were down by a little less than half…from about 1.1s client load time (empty cache) to 0.6s. An almost instant benefit. How does this work?

Normally, your browser hits the site. It requests /page.html, and the server says “here u go, lol” and closes the connection. Your browser reads page.html and says “hay wait, I need site.css too.” It opens a new connection and the web server hands the browser site.css and closes the connection. The browser then says “darn, I need omfg.js.” It opens another connection, and the server rolls its eyes, sighs, and hands it omfg.js.

That’s three connections your browser made to the server, and each one pays the connection latency all over again. Connection latency is something that, no matter how hard you try, you cannot control. Let’s say you have a connection latency of 200ms (not uncommon)…that’s 3 × 200ms = 600ms you just waited to load a very minimal HTML page.

There is hope though…instead of trying to lower latency, you can open fewer connections. This is where keep-alive comes in.

With the new version of HAProxy, your browser says “hai, give me /page.html, but keep the connection open plz!” The web server hands over page.html and holds the connection open. The browser works out which files page.html needs (site.css and omfg.js) and requests them over the connection that’s already open. The server keeps this connection open until the client closes it or until the timeout is reached (5 seconds, using the above config). In this case, you only pay the 200ms connection latency once, so the total time to load the page is 200ms plus the download time of the files (usually less than the latency).
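If you want to see connection reuse for yourself without touching HAProxy, here is a self-contained Python sketch using only the standard library. It starts a throwaway local server, fetches the three files from the story, and counts how many TCP connections the server accepted. The `CountingHandler` class and `connections_used` function are names I made up for the demo; this only illustrates HTTP keep-alive in general, not HAProxy’s behavior specifically.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # lets the stdlib server keep connections open
    connections = 0
    lock = threading.Lock()

    def setup(self):
        # setup() runs once per accepted TCP connection, not once per request.
        with CountingHandler.lock:
            CountingHandler.connections += 1
        super().setup()

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def connections_used(paths, keep_alive):
    """Fetch every path and return how many TCP connections the server saw."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        if keep_alive:
            # One connection, reused for every request.
            conn = http.client.HTTPConnection(host, port, timeout=5)
            for path in paths:
                conn.request("GET", path)
                conn.getresponse().read()
            conn.close()
        else:
            # A fresh connection per request, closed after each response.
            for path in paths:
                conn = http.client.HTTPConnection(host, port, timeout=5)
                conn.request("GET", path, headers={"Connection": "close"})
                conn.getresponse().read()
                conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return CountingHandler.connections


paths = ["/page.html", "/site.css", "/omfg.js"]
print(connections_used(paths, keep_alive=False),
      connections_used(paths, keep_alive=True))
```

On loopback the latency difference is too small to notice, which is why the sketch counts connections instead of timing them. Over a real network, each of those extra connections costs you a round of latency.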

So with keep-alive, you just turned a 650ms page-load time (600ms of latency plus, say, 50ms of downloading) into a 250ms page-load time…a much larger margin than any sort of back-end tweaking you can do. Keep in mind most servers already support keep-alive…but I’m compelled to write about it because I use HAProxy and it’s now fully implemented.
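The arithmetic behind those numbers fits in a few lines of Python. This is a deliberately naive model: requests happen one after another, each new connection costs one fixed latency, and the 50ms total download time is my assumption to make the 650ms and 250ms figures line up. The `load_time_ms` function is just an illustration.

```python
def load_time_ms(requests, latency_ms, download_ms, keep_alive):
    """Toy model: every new connection pays the latency once."""
    connections = 1 if keep_alive else requests
    return connections * latency_ms + download_ms


without_ka = load_time_ms(3, 200, 50, keep_alive=False)  # 3 * 200 + 50
with_ka = load_time_ms(3, 200, 50, keep_alive=True)      # 1 * 200 + 50
speedup = round(without_ka / with_ka, 1)

print(without_ka, with_ka, speedup)  # 650 250 2.6
```

Real browsers don’t behave this simply, so treat the 2.6x as a best case for this toy page rather than something you should expect to measure.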

Also keep in mind that the above scenario isn’t necessarily accurate. Most browsers will open up to 6 concurrent connections to a single domain when loading a page, so not every request waits in line behind the previous one. You also have to factor in that the browser blocks downloads when it encounters a JavaScript include, and then attempts to download and run the JavaScript before continuing the page load.

So although your connection latency with multiple requests goes down with keep-alive, you won’t get the full 2.6x from the example above (650ms down to 250ms). More likely you’ll see something close to a 100% speed boost, depending on how many scripts are loading in your page along with any other elements…100% is a LOT though.

So for most of us webmasters, keep-alive is a wonderful thing (assuming it has sane limits and timeouts). It can really save a lot of page load time on the front-end, which is where users spend most of their time waiting. But if you happen to have a website that’s only HTML, keep-alive won’t do you much good =).