We were writing some parsing code for a client today. It takes a long string (HTML) and parses it out into array items. It loops over the string recursively, running a few preg_replaces on it every pass. We got “out of memory” errors when running it. After putting in some general stats, we found that memory usage was climbing by about 400k after each block of preg_replaces, and one block ran on each loop (there were around 600 loops or so). This memory just grew and grew, even though the recursion at most got 6 levels deep. It was never being released.

I did some reading and found that the preg* functions cache up to 4096 regex results in a request. This is the problem…a pretty stupid one too. It would be nice if they made this a configurable option or at least let you turn it off when, say, you are running a regex on a different string every time (why the hell would I run the same regex on the same string twice…isn’t that what variables are for?) Unless I’m misunderstanding and PHP caches the compiled regex (but not its values)…but either way, memory was climbing based on the length of the string.

Since the regex was only looking at the beginning of the string and disregarding the rest (thank god), the fix was easy (although a bit of a hack):

$val = preg_replace('/.../', '', $long_string);

Becomes:

$short_string = substr($long_string, 0, 128);
$val = preg_replace('/.../', '', $short_string);

PHP guys: how about an option to make preg* NOT have memory leaks =).

After reading an article about how the number of phone calls made is decreasing, I feel I have to interject something. This obviously shouldn’t be news to most people, because most of us are right in the middle of it (in North America, anyway). The fact is that people are talking less and less in favor of texting each other. While this is an interesting shift in our culture, I’m starting to think things are going a bit too far.

It seems that since widespread adoption of the internet, although more and more people have become seemingly connected through social networking and other mediums, people are drifting further and further apart. A friend is no longer a friend. A real friend is now what a friend was, and a friend is someone you say “damn we haven’t talked in years, how r u?” to. Communities are popping up everywhere online that replace the communities around us physically.

This in itself I don’t feel is bad. A lot of people who never would have met are meeting and sharing new ideas. Information spreads more rapidly. Cultural consciousness is more global, which in most cases is a very good thing.

I think things start to go wrong when people get addicted to this information overload though. They use it as a fuel for everyday distraction, a replacement for the communities they live in, and a tool to deliver opinions and beliefs to them when they would have otherwise had to think (although this last item is true of most media).

Also, it’s one thing to not be in front of someone when you talk to them. A voice conversation can have emotion and depth, but it can also be quick and effortless. The fact that it’s being replaced by one-off messages that are 100% ignorable and have no real content to them is kind of sickening. I’ve heard arguments that “I text someone when it doesn’t make sense to have a whole conversation,” but I’ll see the same person texting back and forth with someone for 10 minutes straight. Or a text is delivered and the person who sent it squirms in anticipation for the reply, which may never come.

What’s wrong with a phone call? Granted, if you’re in a bar and it’s very loud, texting would be appropriate. If you call someone and they don’t pick up, either they don’t want to talk or, god forbid, they aren’t right next to their phone all times of the day. If you want to talk to someone, just call them. I don’t believe texting is a viable replacement for what was the last string of human contact we had.

That all said, I know it’s a giant ball and it rolls where it rolls and there’s no stopping it. There’s no problem with being aware of things that are going on around us though. I feel like each time a real connection between two real people is replaced with something artificial, our culture as a whole goes just a little bit more insane. I’m interested to see how this all pans out, mainly because I don’t have a whole lot of attachment to what our culture is now.

In my latest frenzy, which was focused on HA more than performance, I installed some new servers, new services on those servers, and the general complexity of the entire setup for beeets.com doubled. I was trying to remember a utility that I saw a while back that would restart services if they failed. I checked my delicious account, praying that I had thought of my future self when I originally saw it. Luckily, I had saved it under my “linux” tag. Thanks, Andrew from the past.

The tool is called monit, and I’m surprised I ever lived without it. Not only does it monitor your services and keep them running, it can restart them if they fail, use too much memory/cpu, stop responding on a certain port, etc. Not only that, but it will email you every time something happens.

While perusing monit’s site, I saw M/Monit, which essentially allows you to monitor monit over the web. The only thing I scratched my head about was that M/Monit uses port 8080 (which is fine) but NginX already uses port 8080, and I wasn’t about to change that, so I opened conf/server.xml, looked for 8080, and replaced it with 8082 (monit runs on 8081 =)). Then I reconfigured monit to communicate with M/Monit and vice versa, and now I have a kickass process monitor that alerts me when things go wrong, and also sends updates to a service that allows me to monitor the monitor.

I can’t look at things like queries/sec as I can with Cacti (which is awesome but a little clunky) but I can see which important services are running on each of my servers, and even restart them if I need to, straight from M/Monit. The free download license allows you to use M/Monit on one server, which is all I need anyway.

Great job monit team, you have gone above and beyond.

I decided this weekend I wanted to go down the road of trying out MySQL Cluster for beeets.com. The reason isn’t speed, it’s availability. After countless hours of research, I decided I’d rather have a plate of turds for breakfast than have to worry about Master-Master replication (or DRBD) w/heartbeat, not to mention what to do when things get out of sync. Not my cup of tea. MySQL Cluster may be a bit slower than a replicated setup (in almost all cases except for primary key lookup, I suspect), but to me it’s worth it to have a more set-it and forget-it approach. There are many benefits of cluster over replication:

  • Any server can go down. Assuming you have more than one replica of your data, you can lose any server in your setup and still be up and running. This can be achieved with replication, but it’s not as easy. You have to have some form of Master-Master replication, perhaps with DRBD, and some form of failover (usually heartbeat).
  • Your data set scales. If you start running out of disk space with a cluster, just add a few more data nodes and your data will be spread out over them. With replication, each replicated server has to have enough storage to fit the entire database. That means if your dataset grows too large, you have to either partition (a hack, essentially) or upgrade your servers.
  • Your bandwidth scales. With a cluster, if you are running out of bandwidth, you can add more mysqld processes on your www servers or add more data nodes and your bandwidth scales almost linearly. With replication, you can only add so many slaves before your writes are the bottleneck. Then, once again, you have to look into things like circular replication (dangerous) or partitioning your data set (large updates to your app unless you have an insanely good ORM, big infrastructure change).

These are the main points that helped me decide. Historically, with a clustered approach, the entire dataset would have to fit in the memory of all the data nodes, which is somewhat restrictive if the dataset gets too large. Nowadays, the cluster only needs to store indexes in memory, and can store all non-indexed data on disk. There is talk of having a completely disk-based store as well.

All that being said, I set up cluster, which was surprisingly easy. I’m not going to go over how to set it up or anything, just read the manual. After some benchmarking with the web API for beeets.com, the cluster setup appeared to be running about the same speed as the InnoDB setup when testing various commands…a pleasant surprise. It also appeared to handle concurrency a bit better.

Obviously once the dataset grows past a few megs and the traffic bumps up, we’ll revisit the benchmarking, but my hope is that what cluster loses in speed on your everyday general query, it gains back by handling higher concurrency.

This weekend I went on a frenzy. I turned beeets.com from a single VPS enterprise to 4 VPSs: 2 web (haproxy, nginx, php-fpm, sphinx, memcached, ndb_mgmd) and 2 database servers (ndbmtd). There’s still some work to do, but the entire setup seems to be functioning well.

I had a few problems though. In PHP (just PHP, and nothing else) hosts were not resolving. The linux OS was resolving hosts just fine, but PHP couldn’t. It was frustrating. Also, I was unable to sudo. I kept checking permissions on all my files in /etc, rebooting, checking again, etc.

The fix

Then I looked again. /etc itself was owned by andrew:users. Huh? I changed ownership back to root:root and ran chmod 755. Everything works. Now some background.

A while back, I wrote some software (bash + php) that makes it insanely easy to install software to several servers at once, and sync configurations for different sets of servers. It’s called “ssync.” It’s not ready for release yet, but I can say without it, I’d have about 10% of the work done that I’d finished already. Ssync is a command-line utility that lets you set up servers (host, internal ip, external ip) and create groups. Each group has a set of install scripts and configuration files that can be synced to /etc. The configuration files are PHP scriptable, so instead of, say, adding all my hosts by hand to the /etc/hosts file, I can just loop over all servers in the group and add them automatically. Same with my www group, I can add a server to the “www” group in ssync, and all of a sudden the HAproxy config knows about the server.

Here’s the problem. When ssync was sending configuration files to /etc on remote servers, it was also setting ownership and permissions on those files (and folders) by default. This was because I was calling rsync with -vaz, and -a attempts to preserve owner, group, and permissions from the source (not good). I added some new params (so now it’s “-vaz --no-p --no-g --no-o”). Completely fixed it.
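For anyone hitting the same thing, here is a minimal sketch of what that call might look like. This is not ssync's actual code: sync_etc is a made-up wrapper name, and the source and destination paths are whatever you pass in. The only parts taken from the post are the flags.

```shell
#!/bin/sh
# sync_etc is a hypothetical helper, not part of ssync or rsync.
# Usage: sync_etc SOURCE_DIR/ DESTINATION
sync_etc() {
    src="$1"
    dest="$2"
    # -a implies -p (perms), -g (group) and -o (owner), among others.
    # The --no-* options switch those three back off, so files land
    # with the receiving side's owner, group and default permissions.
    rsync -vaz --no-p --no-g --no-o "$src" "$dest"
}
```

Something like sync_etc ./configs/www/ root@web1:/etc/ would then push files without dragging the local user's ownership along (the host name here is just an example).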

A while back I wrote a post about using NginX as a reverse-proxy cache for PHP (or whatever your backend is) and mentioned how I was using HAProxy to load balance. The main author of HAProxy wrote a comment about keep-alive support and how it would make things faster.

At the time, I thought “What’s the point of keep-alive for front-end? By the time the user navigates to the next page of your site, the timeout has expired, meaning a connection was left open for nothing.” This assumed that a user downloads the HTML for a site, and doesn’t download anything else until their next page request. I forgot about how some websites actually have things other than HTML, namely images, CSS, javascript, etc.

Well in a recent “omg I want everything 2x faster” frenzy, I decided for once to focus on the front-end. On beeets, we’re already using S3 with CloudFront (a CDN), aggressive HTTP caching, etc. I decided to try the latest HAProxy (1.4.4) with keep-alive.

I got it, compiled it, reconfigured:

defaults
	...
	option httpclose

became:

defaults
	...
	timeout client  5000
	option http-server-close

Easy enough…that tells HAProxy to close the server-side connection, but leave the client connection open for 5 seconds.

Well, a quick test and site load times were down by a little less than half…from about 1.1s client load time (empty cache) to 0.6s. An almost instant benefit. How does this work?

Normally, your browser hits the site. It requests /page.html, and the server says “here u go, lol” and closes the connection. Your browser reads page.html and says “hay wait, I need site.css too.” It opens a new connection and the web server hands the browser site.css and closes the connection. The browser then says “darn, I need omfg.js.” It opens another connection, and the server rolls its eyes, sighs, and hands it omfg.js.

That’s three connections, with high latency each, your browser made to the server. Connection latency is something that, no matter how hard you try, you cannot control…and there is a certain amount of latency for each of the connections your browser opens. Let’s say you have a connection latency of 200ms (not uncommon)…that’s 600ms you just waited to load a very minimal HTML page.

There is hope though…instead of trying to lower latency, you can open fewer connections. This is where keep-alive comes in.

With the new version of HAProxy, your browser says “hai, give me /page.html, but keep the connection open plz!” The web server hands over page.html and holds the connection open. The browser reads all the files it needs from page.html (site.css and omfg.js) and requests them over the connection that’s already open. The server keeps this connection open until the client closes it or until the timeout is reached (5 seconds, using the above config). In this case, the connection latency is paid only once, so the total time to load the page is 200ms plus the download time of the files (usually less than the latency).

So with keep-alive, you just turned a 650ms page-load time into a 250ms page-load time… a much larger margin than any sort of back-end tweaking you can do. Keep in mind most servers already support keep-alive…but I’m compelled to write about it because I use HAProxy and it’s now fully implemented.
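The arithmetic behind those two numbers, as a quick sketch. It assumes the simplified model above (connections opened one after another, 200ms of latency each) plus an assumed 50ms of total download time, which is the figure that makes 650 and 250 work out. page_load_ms is just a name I made up for the helper.

```shell
#!/bin/sh
# page_load_ms CONNECTIONS LATENCY_MS DOWNLOAD_MS
# Simplified model: every new connection pays the full latency once,
# and the download time (an assumed 50ms here) is added on top.
page_load_ms() {
    echo $(( $1 * $2 + $3 ))
}

page_load_ms 3 200 50   # three separate connections: prints 650
page_load_ms 1 200 50   # one kept-alive connection:  prints 250
```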

Also keep in mind that the above scenario isn’t necessarily correct. Most browsers will open up to 6 concurrent connections to a single domain when loading a page, but you also have to factor in the fact that the browser blocks downloads when it encounters a javascript include, and then attempts to download and run the javascript before continuing the page load.

So although your connection latency with multiple requests goes down with keep-alive, you won’t get a 300% speed boost, more likely a 100% speed boost depending on how many scripts are loading in your page along with any other elements…100% is a LOT though.

So for most of us webmasters, keep-alive is a wonderful thing (assuming it has sane limits and timeouts). It can really save a lot of page load time on the front-end, which is where users spend the most of their time waiting. But if you happen to have a website that’s only HTML, keep-alive won’t do you much good =).

Recently I’ve been working on speeding up the homepage of beeets.com. Most speed tests say it takes between 4-6 seconds. Obviously, all of them are somehow fatally flawed. I digress, though.

Everyone (who’s anyone) knows that gzipping your content is a great way to reduce download time for your users. It can cut the size of html, css, and javascript by about 60-90%. Everyone also knows that gzipping can be very cpu intensive. Not anymore.

I just installed nginx’s Gzip Static Module (compile nginx with --with-http_gzip_static_module) on beeets.com. It allows you to pre-cache your gzip files. What?

Let’s say you have the file /css/beeets.css. When a request for beeets.css comes through, the static gzip module will look for /css/beeets.css.gz. If it finds it, it will serve that file as gzipped content. This allows you to gzip your static files using the highest compression ratio (gzip -9) when deploying your site. Nginx then has absolutely no work to do besides serving the static gzip file (it’s very good at serving static content).

Wherever you have a gzip section in your nginx config, you can do:

gzip_static on;

That’s it. Note that you will have to create the .gz versions of the files yourself, and it’s mentioned in the docs that it’s better if the original and the .gz files have the same timestamp; so it may be a good idea to “touch” the files after both are created. It’s also a good idea to turn the gzip compression down (gzip_comp_level 1..3). This will minimally compress dynamic content without putting too much strain on the server.
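Creating the .gz files could look something like this during a deploy. It's a rough sketch, not what beeets.com actually runs: precompress_dir is a made-up helper name, and the list of extensions is an assumption you'd adjust for your own site.

```shell
#!/bin/sh
# precompress_dir is a hypothetical helper, not part of nginx.
# Usage: precompress_dir /path/to/static/root
precompress_dir() {
    dir="$1"
    # Only text-like assets are worth compressing; adjust the patterns.
    find "$dir" -type f \( -name '*.css' -o -name '*.js' -o -name '*.html' \) |
    while IFS= read -r file; do
        # -9 is the slowest, smallest setting. -c writes to stdout, so the
        # original stays in place next to its .gz sibling.
        gzip -9 -c "$file" > "$file.gz"
        # Give the .gz file the same modification time as the original.
        touch -r "$file" "$file.gz"
    done
}
```

Using touch -r copies the original's timestamp onto the .gz file instead of stamping both with the current time, so the two always match.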

This is a great way to get the best of both worlds: gzipping (faster downloads) without the extra load on the server. Once again, nginx pulls through as the best thing since multi-cellular life. Keep in mind that this only works on static content (css, javascript, etc etc). Dynamic pages can and should be gzipped, but with a lower compression ratio to keep load off the server.

I never thought I’d see the day where people who build web servers would care what other people use them to host. In section 1 of LiteSpeed’s licence agreement you will see “You cannot use the SOFTWARE PRODUCT for any illegal activity or to host pornographic content.” HA!

That’s the stupidest thing I’ve ever seen. What kind of business limits the usage of its products to upstanding citizens only? Last I checked it was the government’s job to impose its views on businesses, not businesses imposing their views on their customers.

I have to say, it’s nice that someone is using their business to take a stand, I guess I’d just prefer it to be in defense of free speech and expression. Sure ALL porn is smutty and violent, but that’s expression in itself. Also, can you fight basic human nature? Perhaps, on a personal level. Repression of sexual tendencies is a lot different than acceptance and non-action though. My point is that pornography is the one place where sexual fantasies are allowed to exist in any way shape and/or form, and Americans, being extremely sexually self-repressed, need that outlet, not more repression.

I also think it’s funny when someone tries to be exclusive because they’re SO awesome when someone else is doing it way better.

Here’s a good tip I just found. Note that this may not be for all cases. In fact, I may have stumbled on a freak coincidence. Here’s the story:

I hate java. I hate having java on a server, but hate it even more if it’s only for running one small script. Forever, beeets.com has used the YUI compressor to shrink its javascript before deployment. Well, YUI won’t run without java, so for the longest time, jre has been installed collecting dust, only to be brushed off and used once in a while during a deployment. This seems like a huge waste of space and resources.

Well, first I tried gcj. Compiling gcj was fairly straightforward, thankfully. After installing, I realized I needed to know a lot more about java in order to compile the YUI compressor with it. I needed knowledge I did not have the long-term need for, nor the will to learn in the first place. I, although revering myself as extremely tenacious, gave up.

I decided to try JSMin. This nifty program is simple, elegant, and it works well. It also has a much worse compression ratio than YUI. However, I trust any site that hosts C code and has no real layout whatsoever. Knowing the compression wasn’t as good, I still wanted to see what kind of difference gzipping the files would have.

I recorded the size of the GZipped JS files that used YUI. I then reconfigured the deployment script to use JSMin instead of YUI. I looked at the JS files with JSMin compression:

YUI:
mootools.js     88.7K (29.6K gz)
beeets.js       61.5K (20.5K gz)

JSMin:
mootools.js    106.1K (29.5K gz)
beeets.js       71.0K (17.7K gz)

Huh? GZip is actually more effective on the JS files using JSMin vs YUI! The end result is LESS download time for users.
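If you want to run the same comparison on your own files, something along these lines should do it. The helper names (raw_size, gz_size, compare_sizes) are made up for this sketch, and it assumes gzip -9, which may not match whatever level your server uses.

```shell
#!/bin/sh
# Hypothetical helpers for comparing minifier output before and after gzip.

# Size of the file as-is, in bytes.
raw_size() {
    wc -c < "$1" | tr -d ' '
}

# Size of the file after gzip -9, in bytes (nothing is written to disk).
gz_size() {
    gzip -9 -c "$1" | wc -c | tr -d ' '
}

# Print one line per file: name, raw size, gzipped size.
compare_sizes() {
    for file in "$@"; do
        printf '%s\t%s bytes\t%s bytes gz\n' \
            "$file" "$(raw_size "$file")" "$(gz_size "$file")"
    done
}
```

Run compare_sizes on the YUI output and the JSMin output of the same script and look at the last column, since that's the number your users actually download.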

I don’t know if this is a special case, but I was able to derive a somewhat complex formula:

YUI > JSMin
YUI + GZip < JSMin + GZip

Who would have thought. See you in hell, java.

Being a heavy and casual marijuana user for almost 10 years, and knowing many others who also are/were, I think I have a pretty good understanding of its effects, both positive and negative. I’d like to dispel some myths.

First off, you always hear that marijuana is a gateway drug. I respond: being a teenager is a gateway drug. The emotions, the hormones, the internal and external influences pulling you in a thousand directions every second of your life…it’s a wonder most of us make it through. That alone is enough to make most people want to try just about every drug out there. Also, another reason marijuana is a gateway drug is because kids are always taught how terrible it is and how addictive it is. So what’s the next thing they do? They try it. After finding that they were lied to and misled, they learn to mistrust those telling them that “all drugs are bad.” So now heroin or cocaine doesn’t seem so bad either, even though they have much more far-reaching effects than marijuana. The point is, the only real cause of marijuana being a “gateway drug” is the fact that kids are constantly being told lies about it. The fix? Honesty.

Secondly, marijuana in moderation has no permanent effects. You can smoke till yer stupid for a few months, but take a week off and you bounce back completely. Its tar is more harmful than that of tobacco, but who aside from the most extreme users smokes a cigarette-pack’s worth of joints every day? The only way to get cancer from marijuana is to pump the smoke into a ventilator and breathe it in 24/7. With cutting-edge advances in technology, there are now vaporizers, which remove the tar from smoking. It’s safer than ever.

Thirdly, smoking marijuana is a personal choice. Here we are, in the “land of the free,” restricted from doing things that even if they do have some negative effect, only affect us personally. It’s not illegal to saw off my arm. It’s not illegal to use a pogo stick next to the grand canyon. Why can’t I take a puff on a joint? Who am I harming?

Now to my main point. We’re in an economic crisis. We’re spending a lot of money on battling imports of drugs (including marijuana), and also spending a lot of money keeping potheads in prison (thanks, prison lobby). That’s two very large drains on our economy to

  1. Fund a losing battle. I can go anywhere in almost any town in the US and within an hour, even not knowing anyone, get an eighth of weed. Good job drug war, money well spent. It’s good to know that the taxes I just filed will go to “stopping” me from buying marijuana.
  2. Keep pot offenders in prison. Yeah, these people are really dangerous. They are on the edge of the law…sitting on the couch eating chips and giggling. The more money I can spend to keep them locked up, the better. Oh sure, most of them are dealers, but our culture is founded on the principles of capitalism: if a market exists, fill the void and capitalize. Makes sense to me. Nobody would sell pot if nobody wanted to smoke it. Yes it’s illegal, but once again let’s ask ourselves why instead of pointing to a law.

Now imagine a world where the government grew, cultivated, sold & taxed pot. That’s a lot of money we’d make back. Hell even if they raised the price on it, it’d be worth it to just be able to walk into a store and buy it. They could use the revenue from pot to plug the holes caused by battling all the other drugs.

Maybe it’s time to really start thinking about this. If you are against legalization of marijuana, ask yourself why. Anyone who wants to smoke it already does. Show me a person who wants to smoke pot but doesn’t because it’s illegal, and I’ll show you the portal that takes you out of Neverland and back to reality.

Conservative America: you want a smaller government with fewer services and less control over the population in general. Why not start with drug reform?