It's Earth Day and We Need a Nap

Apr
22

Today is Earth Day, and we are trying to keep the Earth Day web site rolling with the massive amounts of traffic that have been hitting it.

We were expecting about 25 million visitors to the site. Then Google decided to be all environmentally friendly and link to the site from their home page. Then Apple decided to be all ecosensitive by making the integrated iPhone app the number one app in their featured section. We really appreciated it when the servers were suddenly all spitting out 600 pages a second with no warning.

Trying to keep the site up and running has been a real challenge. As a web developer, you always want your sites to be popular, just not so popular you have to deal with issues affecting your ability to serve pages. Anyways, here is some wisdom from the trenches to explain what we did to compensate when traffic began to quintuple.

  • Drupal scales horizontally under load, and does so effectively. We added a lot of servers in a very small amount of time. Using Pressflow, a Drupal distribution, made adding new servers and databases a lot simpler.
  • On the other hand, Pressflow does no favors to authenticated users. Do not mistake it for a silver bullet for dealing with logged-in users; it behaves exactly the same way Drupal would.
  • Varnish was behaving badly under load at times. There were a few instances where the Varnish cache spiked to several gigabytes and stopped serving pages. This was rough.
  • The slow query log was useful, but not as useful as it could have been. We made real progress when we started focusing on small, repetitive queries in core and either caching them or distributing them across the slaves. We had been missing those until we started digging in there.
  • Memcached is still your best friend. We have just about every query in the bootstrapping process pulling from memcached at this point.
  • We were doing something tricky with mod_geoip in Apache. Check your Apache config very carefully: a typo in our configuration was causing problems.
  • Someone suggested we start looking at virtualization settings as a way of maximizing throughput. No one on our team had real experience with this, primarily because we don't know any developers in the virtualization community. Rackspace appears to be running Xen - we are going to find some friends who work on that project.
  • We were not expecting to need a CDN for this site, but kept the TTLs on the domain very low just in case. We added a CDN after traffic quintupled, and the low TTL settings allowed us to keep the site alive.
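
For anyone wondering what "caching small, repetitive queries" looks like in practice, here is a minimal cache-aside sketch. It is an illustration only, not the code we ran: `TTLCache` is a hypothetical in-memory stand-in for memcached, and `cached_query` and the key names are made up for the example. A real Drupal site would do this in PHP through its cache layer.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Hypothetical in-memory stand-in for memcached (assumption, not a real client)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._store: Dict[str, Tuple[float, Any]] = {}
        self._clock = clock

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: Any, ttl: float) -> None:
        self._store[key] = (self._clock() + ttl, value)


def cached_query(cache: TTLCache, key: str,
                 run_query: Callable[[], Any], ttl: float = 60.0) -> Any:
    """Cache-aside: return the cached value, or run the query once and store it.

    Note: a query that legitimately returns None is never cached in this sketch.
    """
    value = cache.get(key)
    if value is None:
        value = run_query()
        cache.set(key, value, ttl)
    return value
```

The point is that a query fired on every page load, such as a path lookup, reaches the database once per TTL window instead of once per request. The TTL is the trade-off knob: a longer TTL means less database load and staler data.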

At this point, we have no clue how much traffic the site is ultimately going to do. Measuring traffic in real time is challenging when you have Varnish, Apache, the iPhone app and analytics packages all keeping separate metrics. We do know the site is not going down again, unless we decide to take it down.

M

2 Comments

Hey Mike, it's great to see reports back from the trenches, and great to see that earthday.org is being run on Pressflow! I just wanted to drop some quick points to add a bit of detail, and some helpful tips for anyone who finds this post and wonders what it might mean for their project.

  1. While Pressflow is certainly not a silver bullet for logged-in user scalability — such things don't exist — it actually does contain a number of optimizations to improve the logged-in pageload times. Principally, it utilizes more up to date functions for PHP5 and drops some sql generalizations that are necessary for PostgreSQL support. These don't solve a 4x traffic spike, but they help.
  2. Speaking of query optimization and logged-in user support, Pressflow also includes a nice little module to cache path lookups, which are the most frequent query in Drupal by volume. However, more work remains to be done to identify queries which are safely "slaveable" and farm them out to decrease the load on the master database. Launchpad makes submitting a proposed merge easy if you think you found some wins here.
  3. If you had issues with gigabytes of Varnish cache and/or concurrency overload, something was likely misconfigured in the VCL. Varnish caches the responses from the webserver, so it's hard to see how you'd get to multiple gigs on just one site, even without gzip compression on (which you might want to enable). If you had a large spike of mobile traffic and the Vary headers were being incorrectly handled for user-agents, that could have meant Varnish was storing a different cached page for every kind of user-agent out there. Since the mobile space is full of these, that could mean both a lot of unnecessary cache misses and more memory consumption for the resulting cache entries. It might also be sub-optimal cookie hashing. The varnish-cache.org wiki has a lot of smart examples of things to do (and not to do) in this realm.
  4. I also noticed the site uses Panels, which provides its own caching mechanisms which can really improve page load times even for logged in users, as not every single element on the page needs to be dynamically built every time. I assume you're already using this based on the speed of some authenticated pages, but it's a good thing for people to be aware of.
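
To make the user-agent point in item 3 concrete, here is a rough Python simulation of the idea. It is not VCL and not the site's configuration. The `MOBILE_PATTERN` regex, the two device buckets and the function names are all assumptions for illustration. In Varnish itself you would normalize the header in VCL before the cache lookup.

```python
import re
from typing import Iterable, Tuple

# Illustrative pattern only (assumption); real device detection needs a longer list.
MOBILE_PATTERN = re.compile(r"iphone|ipod|android|blackberry|opera mini", re.IGNORECASE)


def device_class(user_agent: str) -> str:
    """Collapse a raw User-Agent string into a small set of buckets."""
    return "mobile" if MOBILE_PATTERN.search(user_agent or "") else "desktop"


def cache_key(url: str, user_agent: str, normalize: bool) -> Tuple[str, str]:
    """Key a cache would use when the response varies on User-Agent."""
    variant = device_class(user_agent) if normalize else user_agent
    return (url, variant)


def count_variants(url: str, user_agents: Iterable[str], normalize: bool) -> int:
    """Number of separate cached copies of one URL for the given visitors."""
    return len({cache_key(url, ua, normalize) for ua in user_agents})
```

With raw user-agent strings, every distinct browser build gets its own cached copy of each page. With normalization, the count per URL is capped at the number of buckets, which lowers both memory use and cache misses.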

In general, the biggest liability on any project is always organizational dysfunction. For instance, not knowing far enough in advance that the iPhone app would be featured, so that you could plan accordingly: that's a communication failure that no amount of slick technology can solve. We really want to test, test and test these things again before launching under such high-pressure circumstances.

But congratz on getting it all working! Logged-in page load times are currently at about 1.5 seconds from my office, which is pretty respectable. :)

And you guys managed all of this while presenting at DrupalCon SF! I remember you mentioning the earthday.org website on Saturday at DCSF. Is Trellon responsible for the Earthday website from concept to launch or did you guys take over the project at some stage?

The site looks great and loads very fast as a logged-in user from my office. I think Josh may be referring to this patch, which is included as the path alias cache module in Pressflow.

Good to have met you Michael.
