Today is Earth Day, and we are trying to keep the Earth Day web site rolling with the massive amounts of traffic that have been hitting it.
We were expecting about 25 million visitors to the site. Then Google decided to be all environmentally friendly and link to the site from their home page. Then Apple decided to be all ecosensitive by making the integrated iPhone app the number one app in their featured section. We really appreciated it when the servers were suddenly all spitting out 600 pages a second with no warning.
Trying to keep the site up and running has been a real challenge. As a web developer, you always want your sites to be popular, just not so popular you have to deal with issues affecting your ability to serve pages. Anyways, here is some wisdom from the trenches to explain what we did to compensate when traffic began to quintuple.
- Drupal scales horizontally under load, and does so effectively. We added a lot of servers in a very small amount of time. Using Pressflow, a Drupal distribution, made adding new servers and databases a lot simpler.
- On the other hand, Pressflow does no favors to authenticated users. Do not mistake it for a silver bullet for dealing with logged-in users; for them it behaves exactly the same way Drupal would.
- Varnish was behaving badly under load at times. There were a few instances where the Varnish cache spiked to several gigabytes and stopped serving pages. This was rough.
- The slow query log was useful, but not as useful as it could have been. We made real progress when we started focusing on small, repetitive queries in core and either caching them or distributing them across the slaves. We had been missing those until we started digging in there.
- Memcached is still your best friend. We have just about every query in the bootstrapping process pulling from memcached at this point.
- We were doing something tricky with mod_geoip in Apache. Check your Apache config very carefully - there was a typo in ours that was causing problems.
- Someone suggested we start looking at virtualization settings as a way of maximizing throughput. No one on our team had real experience with this, primarily because we don't know any developers in the virtualization community. Rackspace appears to be running Xen - we are going to find some friends who work on that project.
- We were not expecting to need a CDN for this site, but kept the TTLs on the domain very low just in case. We added a CDN after traffic quintupled, and the low TTL settings allowed us to keep the site alive.
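
The point above about small, repetitive queries and memcached can be sketched as a cache-aside lookup. This is only an illustration, not the code running on the site: `TTLCache` is an in-process stand-in for memcached so the example runs without a server, and `cached_query`, `load_site_variables` and the `"variables"` key are hypothetical names made up for the example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """In-process stand-in for memcached: get/set with per-key expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: Any, ttl: float) -> None:
        self._store[key] = (self._clock() + ttl, value)


def cached_query(cache: TTLCache, key: str,
                 run_query: Callable[[], Any], ttl: float = 60.0) -> Any:
    """Cache-aside: return the cached result, or run the query once and store it.

    A result of None is never cached, because None also signals a miss.
    """
    value = cache.get(key)
    if value is None:
        value = run_query()
        cache.set(key, value, ttl)
    return value


# Hypothetical small lookup that would otherwise hit the database on every page load.
db_hits = 0


def load_site_variables() -> dict:
    global db_hits
    db_hits += 1
    return {"site_name": "example"}


cache = TTLCache()
for _ in range(1000):
    cached_query(cache, "variables", load_site_variables)

print(db_hits)  # 1: the other 999 page loads were served from the cache
```

A query that takes a millisecond never shows up in the slow query log, but run on every request at 600 pages a second it adds up quickly, which is why caching these paid off.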
At this point, we have no clue how much traffic the site is ultimately going to do. Measuring traffic in real time is challenging when you have Varnish, Apache, the iPhone app and analytics packages all keeping separate metrics. We do know the site is not going down again, unless we decide to take it down.
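
As a rough illustration of pulling one of those metrics out by hand, here is a minimal sketch that finds the busiest second in an Apache access log. It assumes the common/combined log format, and the sample lines are made up. It would only count requests that actually reach Apache, so anything Varnish answers from cache is invisible to it, which is exactly the kind of gap between metrics described above.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Timestamp as written in Apache common/combined logs, e.g. [01/Jan/2000:12:00:00 +0000]
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) [+-]\d{4}\]")


def peak_requests_per_second(lines: Iterable[str]) -> Tuple[str, int]:
    """Return (timestamp, count) for the busiest one-second bucket."""
    per_second: Counter = Counter()
    for line in lines:
        match = TIMESTAMP.search(line)
        if match:
            per_second[match.group(1)] += 1
    if not per_second:
        return ("", 0)
    return per_second.most_common(1)[0]


# Made-up sample lines, for illustration only.
sample = [
    '203.0.113.5 - - [01/Jan/2000:12:00:00 +0000] "GET / HTTP/1.1" 200 5120',
    '203.0.113.6 - - [01/Jan/2000:12:00:00 +0000] "GET /about HTTP/1.1" 200 2048',
    '203.0.113.7 - - [01/Jan/2000:12:00:01 +0000] "GET / HTTP/1.1" 200 5120',
]

print(peak_requests_per_second(sample))  # ('01/Jan/2000:12:00:00', 2)
```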
M
2 Comments
Hey Mike, it's great to see reports back from the trenches, and great to see that earthday.org is being run on Pressflow! I just wanted to drop some quick points to add a bit of detail, and some helpful tips for anyone who finds this post and wonders what it might mean for their project.
In general, the biggest liability on any project is always organizational dysfunction. For instance, not knowing that the iPhone app would be featured with enough advance warning to plan accordingly: that's a communication failure that no amount of slick technology can solve. We really want to test, test and test these things again before launching under such high-pressure circumstances.
But congratz on getting it all working! Logged-in page load times are currently at about 1.5 seconds from my office, which is pretty respectable. :)
And you guys managed all of this while presenting at DrupalCon SF! I remember you mentioning the earthday.org website on Saturday at DCSF. Is Trellon responsible for the Earthday website from concept to launch or did you guys take over the project at some stage?
The site looks great and loads very fast as a logged in user from my office. I think Josh may be referring to this patch which is included as the path alias cache module in Pressflow.
Good to have met you Michael.