I like THE CLOUD because with the cloud, I don’t have to know about things like SANs, Load Balancers, failover, DNS, application layers, etc. All that stuff is bullshit anyway. With the cloud, everything is everything, and everything is nothing. It’s beautiful. So beautiful, I wrote a poem. Ahem:

The cloud is fantastic,
the cloud is so great.
If it's not in the cloud,
it makes me irate.

The nodes in the cloud,
be they big, be they small;
doesn't matter their power,
I love them all.

The cloud's all around us,
even in my phone's soul.
I can talk on the cloud
as I drive into a pole...

And if I do not wake
from my hospital bed,
I live free — in the cloud —
though my body is dead.

My essence be fetched,
from the cloud where it's stored,
to be put in a body
bought off Amazon (with rewards).

And when I grow senile,
from the cloud I will gain.
It will upload sane thoughts
through a chip in my brain.

So you see the cloud's beauty,
spawned from words said so proud?
This poem will live on forever...
it's been sent to the cloud.

I decided this weekend I wanted to go down the road of trying out MySQL Cluster for beeets.com. The reason isn’t speed, it’s availability. After countless hours of research, I decided I’d rather have a plate of turds for breakfast than have to worry about Master-Master replication (or DRBD) w/heartbeat, not to mention what to do when things get out of sync. Not my cup of tea. MySQL Cluster may be a bit slower than a replicated setup (in almost all cases except for primary key lookup, I suspect), but to me it’s worth it to have a more set-it and forget-it approach. There are many benefits of cluster over replication:

  • Any server can go down. Assuming you have more than one replica of your data, you can lose any server in your setup and still be up and running. This can be achieved with replication, but it’s not as easy. You have to have some form of Master-Master replication, perhaps with DRBD, and some form of failover (usually heartbeat).
  • Your data set scales. If you start running out of disk space with a cluster, just add a few more data nodes and your data will be spread out over them. With replication, each replicated server has to have enough storage to fit the entire database. That means if your dataset grows too large, you have to either partition (a hack, essentially) or upgrade your servers.
  • Your bandwidth scales. With a cluster, if you are running out of bandwidth, you can add more mysqld processes on your www servers or add more data nodes and your bandwidth scales almost linearly. With replication, you can only add so many slaves before your writes are the bottleneck. Then, once again, you have to look into things like circular replication (dangerous) or partitioning your data set (large updates to your app unless you have an insanely good ORM, big infrastructure change).

These are the main points that helped me decide. Historically, with a clustered approach, the entire dataset would have to fit in the memory of all the data nodes, which is somewhat restrictive if the dataset gets too large. Nowadays, the cluster only needs to store indexes in memory, and can store all non-indexed data on disk. There is talk of having completely disk-based store as well.

All that being said, I set up cluster, which was surprisingly easy. I’m not going to go over how to set it up or anything, just read the manual. After some benchmarking with the web API for beeets.com, the cluster setup appeared to be running about the same speed as the InnoDB setup when testing various commands…a pleasant surprise. It also appeared to handle concurrency a bit better.

Obviously once the dataset grows past a few megs and the traffic bumps up, we’ll revisit the benchmarking, but my hope is that what cluster loses in speed on your everyday general query, it gains back by handling higher concurrency.

This weekend I went on a frenzy. I turned beeets.com from a single VPS enterprise to 4 VPSs: 2 web (haproxy, nginx, php-fpm, sphinx, memcached, ndb_mgmd) and 2 database servers (ndbmtd). There’s still some work to do, but the entire setup seems to be functioning well.

I had a few problems though. In PHP (just PHP, and nothing else) hosts were not resolving. The Linux OS was resolving hosts just fine, but PHP couldn’t. It was frustrating. Also, I was unable to sudo. I kept checking permissions on all my files in /etc, rebooting, checking again, etc.

The fix

Then I looked again. /etc itself was owned by andrew:users. Huh? I changed ownership back to root:root, chmod 755. Everything works. Now some background.

A while back, I wrote some software (bash + php) that makes it insanely easy to install software to several servers at once, and sync configurations for different sets of servers. It’s called “ssync.” It’s not ready for release yet, but I can say without it, I’d have about 10% of the work done that I’d finished already. Ssync is a command-line utility that lets you set up servers (host, internal ip, external ip) and create groups. Each group has a set of install scripts and configuration files that can be synced to /etc. The configuration files are PHP scriptable, so instead of, say, adding all my hosts by hand to the /etc/hosts file, I can just loop over all servers in the group and add them automatically. Same with my www group, I can add a server to the “www” group in ssync, and all of a sudden the HAproxy config knows about the server.

Here’s the problem. When ssync was sending configuration files to /etc on remote servers, it was also setting ownership and permissions on those files (and folders) by default. This was because I was using -vaz, which attempts to preserve ownership, groupship, and permissions from the source (not good). I added some new params (so now it’s “-vaz --no-p --no-g --no-o”). Completely fixed it.
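
For anyone who wants to see the fix spelled out, here is a minimal sketch in Python. It assumes the tool underneath is rsync (the flags suggest it, but that is my assumption here), and the function name, paths, and host are all made up for illustration. It only builds the command line; it does not run anything.

```python
import shlex


def build_sync_command(source_dir, remote_target, keep_source_ownership=False):
    """Return the argv list for pushing config files to a remote /etc.

    Hypothetical helper: assumes rsync is the transfer tool.
    """
    command = ["rsync", "-vaz"]
    if not keep_source_ownership:
        # -a implies -p (perms), -g (group) and -o (owner).
        # Switch those three back off so the remote /etc keeps its own.
        command += ["--no-p", "--no-g", "--no-o"]
    command += [source_dir, remote_target]
    return command


if __name__ == "__main__":
    # Example paths and host are placeholders, not the real setup.
    print(shlex.join(build_sync_command("configs/www/etc/", "root@www1:/etc/")))
```

The point of keeping the flags in one place is that nobody can forget them on the one server where it matters.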

A while back I wrote a post about using NginX as a reverse-proxy cache for PHP (or whatever your backend is) and mentioned how I was using HAProxy to load balance. The main author of HAProxy wrote a comment about keep-alive support and how it would make things faster.

At the time, I thought “What’s the point of keep-alive for front-end? By the time the user navigates to the next page of your site, the timeout has expired, meaning a connection was left open for nothing.” This assumed that a user downloads the HTML for a site, and doesn’t download anything else until their next page request. I forgot about how some websites actually have things other than HTML, namely images, CSS, javascript, etc.

Well in a recent “omg I want everything 2x faster” frenzy, I decided for once to focus on the front-end. On beeets, we’re already using S3 with CloudFront (a CDN), aggressive HTTP caching, etc. I decided to try the latest HAProxy (1.4.4) with keep-alive.

I got it, compiled it, reconfigured:

defaults
	...
	option httpclose

became:

defaults
	...
	timeout client  5000
	option http-server-close

Easy enough…that tells HAProxy to close the server-side connection, but leave the client connection open for 5 seconds.

Well, a quick test and site load times were down by a little less than half…from about 1.1s client load time (empty cache) to 0.6s. An almost instant benefit. How does this work?

Normally, your browser hits the site. It requests /page.html, and the server says “here u go, lol” and closes the connection. Your browser reads page.html and says “hay wait, I need site.css too.” It opens a new connection and the web server hands the browser site.css and closes the connection. The browser then says “darn, I need omfg.js.” It opens another connection, and the server rolls its eyes, sighs, and hands it omfg.js.

That’s three connections, with high latency each, your browser made to the server. Connection latency is something that, no matter how hard you try, you cannot control…and there is a certain amount of latency for each of the connections your browser opens. Let’s say you have a connection latency of 200ms (not uncommon)…that’s 600ms you just waited to load a very minimal HTML page.

There is hope though…instead of trying to lower latency, you can open fewer connections. This is where keep-alive comes in.

With the new version of HAProxy, your browser says “hai, give me /page.html, but keep the connection open plz!” The web server hands over page.html and holds the connection open. The browser reads all the files it needs from page.html (site.css and omfg.js) and requests them over the connection that’s already open. The server keeps this connection open until the client closes it or until the timeout is reached (5 seconds, using the above config). In this case, the latency is a little over 200ms, and the total time to load the page is 200ms + the download time of the files (usually less than the latency).

So with keep-alive, you just turned a 650ms page-load time into a 250ms page-load time… a much larger margin than any sort of back-end tweaking you can do. Keep in mind most servers already support keep-alive…but I’m compelled to write about it because I use HAProxy and it’s now fully implemented.
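
Here is that arithmetic as a rough Python sketch. It assumes the files are fetched strictly one after another, and that the three files take about 50ms in total to download (that number is my guess, backed out of the 650ms and 250ms figures). The function name is made up.

```python
def page_load_ms(latency_ms, total_download_ms, file_count, keep_alive):
    """Rough page-load time when files are fetched one after another.

    Without keep-alive every file pays the connection latency;
    with keep-alive only the first one does.
    """
    connections = 1 if keep_alive else file_count
    return connections * latency_ms + total_download_ms


if __name__ == "__main__":
    # page.html, site.css and omfg.js at 200ms connection latency
    print(page_load_ms(200, 50, 3, keep_alive=False))  # 650
    print(page_load_ms(200, 50, 3, keep_alive=True))   # 250
```

The more files a page pulls in, the more the connection latency dominates, which is why the saving grows with page complexity.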

Also keep in mind that the above scenario isn’t necessarily correct. Most browsers will open up to 6 concurrent connections to a single domain when loading a page, but you also have to factor in the fact that the browser blocks downloads when it encounters a javascript include, and then attempts to download and run the javascript before continuing the page load.

So although your connection latency with multiple requests goes down with keep-alive, you won’t get a 300% speed boost, more likely a 100% speed boost depending on how many scripts are loading in your page along with any other elements…100% is a LOT though.

So for most of us webmasters, keep-alive is a wonderful thing (assuming it has sane limits and timeouts). It can really save a lot of page load time on the front-end, which is where users spend most of their time waiting. But if you happen to have a website that’s only HTML, keep-alive won’t do you much good =).

Recently I’ve been working on speeding up the homepage of beeets.com. Most speed tests say it takes between 4-6 seconds. Obviously, all of them are somehow fatally flawed. I digress, though.

Everyone (who’s anyone) knows that gzipping your content is a great way to reduce download time for your users. It can cut the size of html, css, and javascript by about 60-90%. Everyone also knows that gzipping can be very cpu intensive. Not anymore.

I just installed nginx’s Gzip Static Module (compile nginx with --with-http_gzip_static_module) on beeets.com. It allows you to pre-cache your gzip files. What?

Let’s say you have the file /css/beeets.css. When a request for beeets.css comes through, the static gzip module will look for /css/beeets.css.gz. If it finds it, it will serve that file as gzipped content. This allows you to gzip your static files using the highest compression ratio (gzip -9) when deploying your site. Nginx then has absolutely no work to do besides serving the static gzip file (it’s very good at serving static content).

Wherever you have a gzip section in your nginx config, you can do:

gzip_static on;

That’s it. Note that you will have to create the .gz versions of the files yourself, and it’s mentioned in the docs that it’s better if the original and the .gz files have the same timestamp; so it may be a good idea to “touch” the files after both are created. It’s also a good idea to turn the gzip compression down (gzip_comp_level 1..3). This will minimally compress dynamic content without putting too much strain on the server.
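
Creating the .gz files is the only real work, so here is a minimal sketch of a deploy-time step in Python. The function name and the extension list are my own assumptions, not anything nginx requires. It writes a level-9 gzip copy next to each static file and copies the original’s timestamps onto it, which takes care of the “touch” step.

```python
import gzip
import os
import shutil

# Assumed set of static file types worth pre-compressing.
STATIC_EXTENSIONS = (".css", ".js", ".html")


def precompress(root_dir, extensions=STATIC_EXTENSIONS):
    """Write a gzip -9 copy next to each static file and match its mtime."""
    written = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if not name.endswith(extensions):
                continue
            source = os.path.join(dirpath, name)
            target = source + ".gz"
            with open(source, "rb") as src, gzip.open(target, "wb", compresslevel=9) as dst:
                shutil.copyfileobj(src, dst)
            # Give the .gz the same timestamps as the original file.
            stat = os.stat(source)
            os.utime(target, ns=(stat.st_atime_ns, stat.st_mtime_ns))
            written.append(target)
    return written
```

Run it against the document root after every deployment, otherwise nginx will happily keep serving a stale .gz.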

This is a great way to get the best of both worlds: gzipping (faster downloads) without the extra load on the server. Once again, nginx pulls through as the best thing since multi-cellular life. Keep in mind that this only works on static content (css, javascript, etc etc). Dynamic pages can and should be gzipped, but with a lower compression ratio to keep load off the server.

Here’s a good tip I just found. Note that this may not be for all cases. In fact, I may have stumbled on a freak coincidence. Here’s the story:

I hate java. I hate having java on a server, but hate it even more if it’s only for running one small script. Forever, beeets.com has used the YUI compressor to shrink its javascript before deployment. Well, YUI won’t run without java, so for the longest time, jre has been installed collecting dust, only to be brushed off and used once in a while during a deployment. This seems like a huge waste of space and resources.

Well, first I tried gcj. Compiling gcj was fairly straightforward, thankfully. After installing, I realized I needed to know a lot more about java in order to compile the YUI compressor with it. I needed knowledge I did not have the long-term need for, nor the will to learn in the first place. I, although revering myself as extremely tenacious, gave up.

I decided to try JSMin. This nifty program is simple, elegant, and it works well. It also has a much worse compression ratio than YUI. However, I trust any site that hosts C code and has no real layout whatsoever. Knowing the compression wasn’t as good, I still wanted to see what kind of difference gzipping the files would have.

I recorded the size of the GZipped JS files that used YUI. I then reconfigured the deployment script to use JSMin instead of YUI. I looked at the JS files with JSMin compression:

YUI:
mootools.js     88.7K (29.6K gz)
beeets.js       61.5K (20.5K gz)

JSMin:
mootools.js    106.1K (29.5K gz)
beeets.js       71.0K (17.7K gz)

Huh? GZip is actually more effective on the JS files using JSMin vs YUI! The end result is LESS download time for users.
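
If you want to run the same comparison on your own files, here is a small Python sketch. The function names are made up, and it compresses at level 9, which may not match whatever your web server uses. Feed it the bytes of the same script minified two different ways.

```python
import gzip


def sizes(data):
    """Return (raw_bytes, gzipped_bytes) for one minified file's contents."""
    return len(data), len(gzip.compress(data, compresslevel=9))


def smaller_on_the_wire(minified_a, minified_b):
    """Name whichever input is smaller after gzip: 'a', 'b' or 'tie'."""
    gz_a = sizes(minified_a)[1]
    gz_b = sizes(minified_b)[1]
    if gz_a == gz_b:
        return "tie"
    return "a" if gz_a < gz_b else "b"
```

The gzipped number is the one users actually download, so that is the one worth comparing.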

I don’t know if this is a special case, but I was able to derive a somewhat complex formula:

YUI > JSMin
YUI + GZip < JSMin + GZip

Who would have thought. See you in hell, java.

In my work as a web developer, I’ve come across many, many cases where projects, namely projects using PHP frameworks, have made use of an Object-Relational Mapping tool. I’ve used them a bit myself, in apps that use CakePHP. I have to say, after going from writing plain queries to communicating with objects, I prefer very much writing my own queries.

Let’s first talk about what an ORM is. Basically, you have an app, and you have a database. As the case with most apps, it needs to actually communicate with the database, usually by using queries. Queries are a language the database understands. They allow the application to ask the database for very specific information. An ORM sits between the application and the database. Its role is to give the application an object to communicate to. This object pretends as if it is a piece of data in the database, and allows the app to do things like data.update() or data.delete(). The ORM will write the appropriate queries to the database, regardless of the type of database. A good ORM can also perform joins between pieces of data and perform somewhat complex queries on the database. The purpose is to give a standard interface to communicate with any database.
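
To make that concrete, here is a toy sketch in Python using the built-in sqlite3 module. This is not how CakePHP or any real ORM is implemented. The Record class, the events table, and the titles are all invented for illustration. It just shows the basic trick: the app calls update() or delete() on an object, and the object writes the query.

```python
import sqlite3


class Record:
    """One row of one table, exposed as an object (toy example)."""

    def __init__(self, connection, table, row_id):
        self.connection = connection
        self.table = table  # trusted, hard-coded name; never user input
        self.row_id = row_id

    def update(self, **fields):
        # Build "col = ?" pairs; the values go through bound parameters.
        assignments = ", ".join(f"{column} = ?" for column in fields)
        self.connection.execute(
            f"UPDATE {self.table} SET {assignments} WHERE id = ?",
            (*fields.values(), self.row_id),
        )

    def delete(self):
        self.connection.execute(
            f"DELETE FROM {self.table} WHERE id = ?", (self.row_id,)
        )


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, title TEXT)")
    db.execute("INSERT INTO events (id, title) VALUES (1, 'Farmers market')")
    event = Record(db, "events", 1)
    event.update(title="Night market")
    print(db.execute("SELECT title FROM events WHERE id = 1").fetchone())
    event.delete()
    print(db.execute("SELECT COUNT(*) FROM events").fetchone())
```

Notice that anything beyond single-row updates and deletes (joins, aggregates, reports) has no home in this class, which is roughly where the trouble starts.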

So here’s my question: on a simple application, an ORM may be a good idea. It provides a standard interface to communicate with, and also allows the database to be “easily” switched out without modifying the main application code at all. But on any app I’ve worked on, there are many, many queries written that an ORM wouldn’t be able to map or understand. So what is the point of an ORM if it can’t handle everything? It’s a standard interface that becomes non-standard the second you write your first non-ORM query.

There is no way anybody could ever write an ORM that handles every query that possibly needs to be written. And instead of defining relationships between data in your queries, you have to define the relationships through the code. Also, the argument I hear over and over and over: “it allows you to switch out your database easily.” Who the hell switches out their database? Why not just pick a database that does what you want from the beginning…and for the most part, they all do the same damned thing. Also, SQL is kind-of standard, so even without an ORM it’s not like you’ll be rewriting every query from scratch…most likely you’ll have to rewrite a few database-specific functions (think SELECT last_insert_id()). Is it really so hard to do this, especially if you only do it once? If you are switching from Oracle to PGSql to MySQL to MSSQL every other day, then yes, an ORM would probably make sense, but otherwise I don’t see the point.

Data is data, it is not another object. Moving everything under the sun into the object-oriented model does not make anyone’s life easier. SQL is good. Procedural is lightning fast. Learn how to use these, because OO will not solve all your problems.

I welcome use-cases besides those I have mentioned and arguments for/against ORMs in the comments. I’m speaking from personal experience and not married to my opinion…so I’m actually very curious if any of you successfully use an ORM that does everything you need it to.

I recently read a post on a web development firm’s blog (anonymous to protect them and myself). It was talking about how open-source web software is inferior to closed-source. The main reasoning was that open-source allows attackers to find vulnerabilities just by sifting through the code. The company touts their proprietary CMS as better than Drupal or WordPress because only they (and their customers, heh) see the source code. Therefore it’s rock solid.

I was kind of blown away by this. Obviously it’s a marketing ploy to scare unknowing customers into using them instead of doing a simple WordPress install, but it’s blatantly wrong and I feel the need to respond. Oddly enough, their blog is in WordPress. Hmm.

First off, all software has vulnerabilities. All servers have vulnerabilities. Yes, it’s easier to find them if you know the setup or know the code, but what I’ve seen in my lifetime of computer work is this: if someone wants to hack your site, they will. If there is a vulnerability, they will find it. And as I just said, all software has vulnerabilities. It’s stupid to assume that because the source is only readily available to people who pay you money and the people who work on their site after you that no vulnerabilities will ever be found. They will be found. Look at Google. They were just hacked by China. Does Google open source their Gmail app? No, completely closed-source. But someone wanted to hack them, so they got hacked. That’s what happens. Also, if your proprietary CMS is written in PHP, Python, Ruby, Perl, etc etc…you’re still using open source. Someone could attack the site at the language level. Does it make sense to now develop your own closed-source programming language so nobody will ever be able to hack it?

Secondly, most well-known open-source software has been around a very long time and has had hundreds of thousands (if not millions) of people using it. This means that over time, it gets battle-hardened. The common and not-so-common vulnerabilities are found, leaving the users with the latest versions a rock-solid code base that has gone through thousands of revisions to be extremely secure. With open-source, you’ve got hundreds of eyes looking over everything that’s added/changed/removed at all times. With proprietary code, you get a few pairs of eyes at best, with far fewer installs and far fewer revisions to harden and secure it.

Is open-source better than proprietary? If you’re poor, most likely, but otherwise they both have their good and bad points. The main point of this article isn’t to bash proprietary software at all, it’s to refute the claim that because the source is open the product is less secure. I believe the exact opposite, in fact. If your code is open for everyone to look at, you damn well better be good at seeing vulnerabilities before they even get deployed…and if you don’t catch it, someone else developing the project probably will.

Is open source too open? Hell no.

So I got to thinking. There are some good caching reverse proxies out there, maybe it’s time to check one out for beeets. Not that we get a ton of traffic or we really need one, but hey what if we get digged or something? Anyway, the setup now is not really what I call simple. HAproxy sits in front of NginX, which serves static content and sends PHP requests back to PHP-FPM. That’s three steps to load a fucking page. Most sites use apache + mod_php (one step)! But I like to tinker, and I like to see requests/second double when I’m running ab on beeets.

So, I’d like to try something like Varnish (sorry, Squid) but that’s adding one more step in between my requests and my content. Sure it would add a great speed boost, but it’s another layer of complexity. Plus it’s a whole nother service to ramp up on, which is fun but these days my time is limited. I did some research and found what I was looking for.

NginX has made me cream my pants every time I log onto the server since the day I installed it. It’s fast, stable, fast, and amazing. Wow, I love it. Now I read that NginX can cache FastCGI requests based on response caching headers. So I set it up, modified the beeets api to send back some Cache-Control junk, and voilà…a 2800% speed boost on some of the more complicated functions in the API.

Here’s the config I used:

# in http {}
fastcgi_cache_path /srv/tmp/cache/fastcgi_cache levels=1:2
                           keys_zone=php:16m
                           inactive=5m max_size=500m;
# after our normal fastcgi_* stuff in server {}
fastcgi_cache php;
fastcgi_cache_key $request_uri$request_body;
fastcgi_cache_valid any 1s;
fastcgi_pass_header Set-Cookie;
fastcgi_buffers 64 4k;

So we’re giving it a 500mb cache. It says that any valid cache is saved for 1 second, but this gets overridden with the Cache-Control headers sent by PHP. I’m using $request_body in the cache key because in our API, the actual request is sent through like:

GET /events/tags/1 HTTP/1.1
Host: ...

{"page":1,"per_page":10}

The params are sent through the HTTP body even in a GET. Why? I spent a good amount of time trying to get the API to accept the params through the query string, but decided that adding $request_body to one line in an NginX config was easier than re-working the structure of the API. So far so good.
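
Why the body has to be in the key is easier to see with a toy model. The Python sketch below is not how NginX stores anything internally. The class and method names are invented, and it only mimics the two ideas from the config: the key is the URI plus the body, and an entry lives for a default of 1 second unless the backend asks for longer.

```python
import time


class ResponseCache:
    """Toy in-memory cache keyed on request URI + request body."""

    def __init__(self, default_ttl_seconds=1.0, clock=time.monotonic):
        self.default_ttl_seconds = default_ttl_seconds
        self.clock = clock
        self.entries = {}

    @staticmethod
    def make_key(request_uri, request_body):
        # Same idea as "$request_uri$request_body" in the config above.
        return request_uri + request_body

    def put(self, request_uri, request_body, response, max_age_seconds=None):
        # max_age_seconds stands in for a Cache-Control header from the backend.
        ttl = self.default_ttl_seconds if max_age_seconds is None else max_age_seconds
        key = self.make_key(request_uri, request_body)
        self.entries[key] = (self.clock() + ttl, response)

    def get(self, request_uri, request_body):
        key = self.make_key(request_uri, request_body)
        entry = self.entries.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self.clock() >= expires_at:
            del self.entries[key]  # expired, drop it
            return None
        return response
```

With only the URI in the key, page 1 and page 2 of /events/tags/1 would share one cache entry and every user would get whichever page was cached first.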

That’s NginX’s FastCGI cache acting as a reverse proxy cache. Ideally in our setup, HAproxy would be replaced by a reverse proxy cache like Varnish, and NginX would just stupidly forward requests to PHP like it was earlier today…but I like HAproxy. Having a health-checking load-balancer on every web server affords some interesting failover opportunities.

Anyway, hope this helps someone. NginX can be a caching reverse proxy. Maybe not the best, but sometimes, just sometimes, simple > faster.

Yeah, so this amazing new device will, like, revolutionize the way we all look at things and stuff. Because you can touch it, things will be way better. Our lives just got a ton better. This revolutionary device will revolutionize the way we look at news and movies. Oh, and it will also change the way cities are structured.

So, in case you haven’t heard, Apple took their iPod touch, made it 5x bigger, and are now marketing it as the iPad (or “Tablet”). Where does that leave us? A portable device that’s not portable and really fucking difficult to use. The reason laptops have keyboards and pointing devices is because people don’t like on-screen keyboards. They suck. It’s necessary on small and mobile devices like the iPod touch, but on a bigger level it’s not…which is why laptops exist.

So before you follow the marketing hype and buy your new $500 tablet, ask yourself “What the fuck am I thinking?! I already have an iPod, and I already have a laptop. Those swindling asslickers don’t need more of my money!”

That’s right, the iPad is a shitty in-between piece of shit which is shitty and smells like shit. It’s not quite a laptop, and it doesn’t quite fit in your pocket. Stay away!! Don’t be a dweeb!