Today is Earth Day, and we are trying to keep the Earth Day web site running under the massive amounts of traffic that have been hitting it.
We were expecting about 25 million visitors to the site. Then Google decided to be all environmentally friendly and link to the site from their home page. Then Apple decided to be all ecosensitive by making the integrated iPhone app the number one app in their featured section. We really appreciated it when the servers were suddenly all spitting out 600 pages a second with no warning.
Trying to keep the site up and running has been a real challenge. As a web developer, you always want your sites to be popular, just not so popular that you struggle to serve pages. Anyway, here is some wisdom from the trenches on what we did to compensate when traffic began to quintuple.
- Drupal scales horizontally under load, and does so effectively. We added a lot of servers in a very small amount of time. Using Pressflow, a Drupal distribution, made adding new servers and databases a lot simpler.
- On the other hand, Pressflow does no favors for authenticated users. Do not mistake it for a silver bullet for dealing with logged-in users; it behaves exactly the same way Drupal would.
- Varnish behaved badly under load at times. There were a few instances where the Varnish cache spiked to several gigabytes and stopped serving pages. This was rough.
- The slow query log was useful, but not as useful as it could have been. We made real progress when we started focusing on small, repetitive queries in core and either caching them or distributing them across the slaves. We had been missing those until we started digging in there.
- Memcached is still your best friend. We have just about every query in the bootstrapping process pulling from memcached at this point.
- We were doing something tricky with mod_geoip in Apache. Check your Apache config very carefully - there was a typo in ours that was causing problems.
- Someone suggested we start looking at virtualization settings as a way of maximizing throughput. No one on our team had real experience with this, primarily because we don't know any developers in the virtualization community. Rackspace appears to be running Xen - we are going to find some friends who work on that project.
- We were not expecting to need a CDN for this site, but kept the TTLs on the domain very low just in case. We added a CDN after traffic quintupled, and the low TTL settings allowed us to keep the site alive.
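
To make the point about small, repetitive queries concrete: the slow query log only shows queries that are individually slow, so a cheap query that runs thousands of times never appears in it. Here is a minimal sketch of one way to surface those, assuming you can get a list of raw SQL statements from somewhere (a general query log, for example). The function names and the sample queries are made up for illustration; this is not the tooling we actually used.

```python
import re
from collections import Counter


def fingerprint(sql: str) -> str:
    """Collapse literal values so queries that differ only in their
    parameters end up with the same fingerprint."""
    sql = sql.strip().lower()
    sql = re.sub(r"'(?:[^'\\]|\\.)*'", "?", sql)  # quoted strings -> ?
    sql = re.sub(r"\b\d+\b", "?", sql)            # bare numbers -> ?
    sql = re.sub(r"\s+", " ", sql)                # normalize whitespace
    return sql


def top_repeated(queries, n=5):
    """Return the n most frequent query shapes with their counts."""
    return Counter(fingerprint(q) for q in queries).most_common(n)


# Hypothetical sample; real input would come from your own query log.
sample = [
    "SELECT * FROM variable WHERE name = 'site_name'",
    "SELECT * FROM variable WHERE name = 'theme_default'",
    "SELECT nid FROM node WHERE nid = 42",
    "SELECT * FROM variable  WHERE name = 'cache'",
]

for shape, count in top_repeated(sample):
    print(count, shape)
```

The shapes at the top of that list are the candidates for caching in memcached or for routing to a read-only database server, even though none of them would ever register as "slow" on its own.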
At this point, we have no clue how much traffic the site is ultimately going to do. Measuring traffic in real time is challenging when you have Varnish, Apache, the iPhone app and analytics packages all keeping separate metrics. We do know the site is not going down again, unless we decide to take it down.