DrupalForAMajorConvergence

How to prevent a drupal meltdown when you've got a major protest or event to cover

Drupal is great. It does tons right out of the box, and between contributed modules and things like the CCK and Views, you can make it jump through just about every hoop you ever wanted as an IMC site, mostly without a single line of code. And it even comes with some intelligent caching measures that make it easy to handle the day to day load of an average imc site.

But let's say, for instance, the RNC comes to your town and all of a sudden you're getting thousands and thousands of new visitors and inbound links from all over. With a standard drupal setup, unless you're running on hardware no imc is likely to be able to afford, your site is basically going to be forced off the internet, and all of a sudden the mumblings of bitter people from #tech about mir and static page generation are going to make a lot more sense.

Why?

There are a couple of problems with Drupal's design that make performance under insane conditions less than optimal (Note - this is all mostly based on experience with Drupal 5, things may be getting better). The first problem is that by default, Drupal's internal caching ("normal" caching) uses the database for storing what it caches. This means that for every visitor to the site, even people just hitting the front page, Drupal will run some php code and then talk to the mysql database a bit, then run some more php code, then return the page to the user. Although not nearly as inefficient, this is exactly the pattern of activity that caused active, the first imc codebase, to melt down all the time.

The second problem is that by default, Drupal is set up to not play nicely with upstream caches - meaning that 1) your users' browsers won't help you out by caching parts of your site, 2) your users' ISPs won't help you out by storing cacheable content that's in heavy demand and serving it up to multiple clients, and 3) you have no way of putting caching proxies in front of your poor drupal server to buffer the insane deluge of requests.

What is to be done?

Security culture: You did turn off IP logging, right? [absolutely necessary]

By default, Drupal keeps a lot of information about your users, much of which is subpoenable and can lead to box seizure by law enforcement types, resulting in huge compromises of anonymity and security for your users and legal headaches for your imc. The moral of the story: log nothing. Yes, it's nice to be able to block spam based on originating IP, and yes, it's nice to see pretty maps of where in the world your visitors are from. But the security concerns, based on concrete and very nasty past experiences of imcs across the globe, trump these conveniences. Make sure you've turned off IP logging in both apache's and drupal's respective access logs!
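
On the apache side, one way to do this is to define an access log format that simply leaves out the client address. This is a minimal sketch, assuming a stock Apache 2.x setup: the format name "noip" and the log path are made up for the example, so adjust them to your own layout.

```apache
# Hypothetical format name "noip": timestamp, request line, status and size,
# with the client address (%h) deliberately left out.
LogFormat "%t \"%r\" %>s %b" noip

# Assumed log path - use whatever your distribution uses.
CustomLog /var/log/apache2/access.log noip
```

Keep in mind that apache's error log records client addresses too, so it is worth checking what ends up there as well, and that this does nothing about whatever drupal itself stores in its database.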

Cache your php code [absolutely necessary]

By default, PHP code is expensive in terms of server resources, because it gets reinterpreted each time a request comes in. The absolute first thing to do is to install an opcode cache so you avoid this bottleneck. Some options are:

Check the cacheability of static resources [necessary]

Now you're going to want to make sure that everything that's just a static file (images, css, javascript, etc.) is cacheable. Head over to:

http://ircache.net/cgi-bin/cacheability.py

and plug in your site's url. You'll definitely see problems for the front page itself, but the resources that page uses should be cacheable. If they aren't, or you just want to make sure, add the following to your apache conf (after enabling mod_expires!):

  ExpiresActive On
  <LocationMatch "^/files/">
   ExpiresActive on
   ExpiresDefault "access plus 7 days"
  </LocationMatch>
  <LocationMatch "^/sites/">
   ExpiresActive on
   ExpiresDefault "access plus 7 days"
  </LocationMatch>

While you're in there, go ahead and add this as well:

 <FilesMatch "\.html$">
   ExpiresByType text/html "access plus 5 minutes"
 </FilesMatch>

But drupal doesn't produce html files, so what good is this going to do? Read on....

Use boost to generate static files [necessary]

Here's the most important trick in your whole arsenal. We're going to add another layer of caching to drupal, using the excellent contributed module Boost. Boost lets you make a trade-off: you sacrifice the instantaneous updates you normally get with drupal for a huge increase in scalability, especially given your limited resources.

It does this with a two-part strategy. First, it hooks into drupal to generate static html caches of pages which are requested by anonymous users. Then, some apache configuration added to drupal's .htaccess file allows anonymous requests to be rewritten to the static html pages - without hitting drupal or mysql at all. The rewrites ensure that anyone with a drupal session cookie (i.e. a logged in user) bypasses the boost cache entirely. With Boost running, not only do you get much faster, less resource intensive responses from the server, but they are totally cacheable - no drupal session cookies, no wacky Drupal expires headers set to 1978 for maximum staleness.

Basically, just follow the directions that Boost provides, making sure to specify a list of pages to cache - definitely your front page, probably your article pages, and especially anything really intensive in terms of page generation time, like a complicated calendar page. You can see the difference in the server responses if you look at the headers, or via the ircache.net cacheability service mentioned above. If you've done it right, you should be serving nice plain html to your anonymous visitors.
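
To make the rewrite idea concrete, here is a stripped-down illustration of the technique. These are not the rules Boost ships (use those!): the cache/ directory layout and the cookie test are assumptions made up for this sketch, and it ignores the front page entirely.

```apache
# Illustration only - not Boost's actual rules. The cache/ layout and the
# cookie test below are assumptions for this sketch.
RewriteEngine On

# Only plain GET requests with no query string are candidates.
RewriteCond %{REQUEST_METHOD} ^GET$
RewriteCond %{QUERY_STRING} ^$
# Anyone carrying a Drupal session cookie (names start with SESS) goes to Drupal.
RewriteCond %{HTTP_COOKIE} !SESS
# Serve the static copy only if it actually exists on disk.
RewriteCond %{DOCUMENT_ROOT}/cache/$1.html -f
RewriteRule ^(.+)$ cache/$1.html [L]
```

Rules like these have to sit above drupal's own catch-all rewrite to index.php in the .htaccess file, otherwise drupal gets the request first and the static copy is never used.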

Fix the broken boost .htaccess file to play nice with caches

One thing to note when you've got this all set up and working is that (at least as of 9/4/2008) the Boost .htaccess file could use some tweaking regarding caching - you want to make sure that if a client makes a HEAD request to check the freshness of some piece of data they've cached, they hit static html (if appropriate) rather than drupal.

Basically just change:

RewriteCond %{REQUEST_METHOD} ^GET$

to

RewriteCond %{REQUEST_METHOD} ^(GET|HEAD)$

every time you see it in the Boost .htaccess file.

Use a lightweight webserver or caching proxy in front of php-laden apache (or alongside it) [recommended]

Now, because PHP modules tend to make using a sane Apache worker model impossible, you're still serving up all this static stuff with extremely heavy, mod_php apache child processes. Why tie these up dealing with simple stuff? Here's where a front end comes in handy: an apache instance running with a separate ultra-lightweight config, or a next generation lightweight webserver like nginx or lighttpd, or even a caching proxy running locally like squid or varnish. The front end intercepts requests on port 80 and forwards what it can't handle itself to the "backend" apache processes listening on some other port. With the right config, requests for things like images or cached pages not only don't touch drupal, they don't even touch an apache process that could be doing some drupal work.
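
For the "second apache instance" variant, the front end's config might look roughly like the sketch below. It assumes mod_proxy and mod_proxy_http are loaded in the front end, that the mod_php apache has been moved to 127.0.0.1:8080, and that the document root is /var/www/drupal - all of these are placeholders rather than anything your setup necessarily uses.

```apache
# Front-end instance: no mod_php loaded, listens on port 80.
<VirtualHost *:80>
    ServerName wherever.indymedia.org
    # Assumed path to the drupal root, so static files can be read from disk.
    DocumentRoot /var/www/drupal

    # Do not proxy common static file types; the front end serves them itself.
    # Matching by extension (rather than by directory) avoids ever handing out
    # .php files as plain text from an instance that cannot execute them.
    ProxyPassMatch "\.(png|gif|jpe?g|ico|css|js)$" !

    # Everything else goes to the heavy mod_php apache on a local port
    # (assumed here to be 127.0.0.1:8080), keeping the original Host header.
    ProxyPreserveHost On
    ProxyPass / http://127.0.0.1:8080/
    ProxyPassReverse / http://127.0.0.1:8080/
</VirtualHost>
```

The exclusion has to come before the catch-all ProxyPass, since the first matching rule wins. In this sketch the static html that boost generates still passes through the backend apache, although without touching php or mysql.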

If you're sharing your server (and ip) with a lot of virtual hosts, configuring this all correctly will probably be a pain.

If you feel like redoing your urls, you could even have all static resources on a separate domain, say static.wherever.indymedia.org. You could even put this stuff on a separate server, with an appropriate method like rsync for syncing things up.

Another advantage of using a subdomain for static files is that requests for these resources will not have any associated session cookie, which allows HTTP accelerators like Varnish to serve them from cache.

Now you're ready to really build some extra capacity into the system. Many imcs worldwide will be happy to set up a caching proxy for you using squid or varnish or some such. An hour on #tech on irc.indymedia.org will probably get you half a dozen proxies, as well as all the necessary tweaks to dns if you're using the indymedia.org domain exclusively. It's pretty amazing to see huge amounts of computational resources being thrown at you in solidarity, and it can really convince you of the virtue of being part of a global network of likeminded media projects.

Practically, the first thing you'll need to do is configure your drupal site to respond to one additional url, which you'll keep kind of secret, so we'll call it "secret.wherever.indymedia.org". Don't post it online, don't spread it around via email - it's only intended for you and the people setting up the caching proxies. They'll set up their proxy to forward requests for your regular old domain name, "wherever.indymedia.org" (and variants like "www.wherever.org"), to this domain name. Now, through what's called a DNS round-robin, we'll implement a poor-man's loadbalancer by adding a DNS record for "wherever.indymedia.org" pointing to each of the proxy servers, so that a request for your site will be routed to one of the proxies more or less at random, automagically and just based on the way dns works. An example of what you'd see now if you looked up your hostname is included below:

guest@ubuntu:~$ dig twincities.indymedia.org

; <<>> DiG 9.3.2 <<>> twincities.indymedia.org
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 39319
;; flags: qr rd ra; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;twincities.indymedia.org.      IN      A

;; ANSWER SECTION:
twincities.indymedia.org. 1463  IN      A       80.82.245.142
twincities.indymedia.org. 1463  IN      A       128.2.97.219
twincities.indymedia.org. 1463  IN      A       204.13.164.102
twincities.indymedia.org. 1463  IN      A       209.237.247.50
twincities.indymedia.org. 1463  IN      A       216.139.209.83
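
On the backend itself, answering to the extra "secret" hostname is usually just one more alias on the existing virtual host. A minimal sketch, assuming a single name-based virtual host and a document root of /var/www/drupal (both placeholders):

```apache
<VirtualHost *:80>
    ServerName wherever.indymedia.org
    # The hostname only the proxy admins know about.
    ServerAlias secret.wherever.indymedia.org
    # Assumed path to the drupal root.
    DocumentRoot /var/www/drupal
</VirtualHost>
```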

Remember, unless you've taken the above steps to ensure that what drupal emits is cacheable, you're not going to see any benefit from this network of caches - they all will just pass along every request to your poor backend, which will melt down under the load.

It's a good idea to give each cache a unique name in dns that will also proxy to your backend (like www1.wherever.indymedia.org, etc.). This way you can easily check the status of each cache. Also keep in mind that your dns records should be set with a short ttl, otherwise any changes you need to make - say if a cache goes down - will take a really long time to propagate.

Bonus - security benefits!

One of the neat things about this setup is that your "real" server is no longer facing the public at large in any meaningful way. Why is this cool? For one thing, it means that you've just made it a lot harder for someone to attack your drupal server, like for instance right-wing script kiddies trying to DoS you off the internet. Also, if a random halfbaked subpoena/warrant for your server or its data is issued, your server is now effectively multiple servers, potentially spread over multiple jurisdictions and countries. While this isn't a foolproof strategy for protecting yourself, your server, or your users (for that see above under "don't log ip addresses ever"), it just might make someone's life a little more annoying.

Also, because boost generates static files, if drupal goes offline for some reason but the backend server remains up, most visitors will still see your site, albeit with no updates until you get drupal running again. Much better than an error page!

Apache module mod_deflate [recommended]

Along with mod_expires mentioned above, it's also a good idea to enable mod_deflate, which can be used to compress Drupal's often bloated CSS and javascript files as they're being served. Add this to your Apache configuration:

AddOutputFilterByType DEFLATE application/javascript application/x-javascript text/css

If you are using Drupal's built-in page cache, pages will already be gzipped so you will not want to add text/html to this list.

Some refinements to the set up

Visualizing the setup

icanhassquid.jpg

Munin or other monitoring software

It really helps to have munin installed on the backend server and on at least one of the proxies - you'll get pretty graphs that tell you a) how badly your backend server is thrashing over time and b) how much traffic the proxies are handling for you.

Analysis

The YSlow Firefox extension analyzes your site's performance and tells you what factors could be optimized.

Other drupal things that come in handy for a convergence

Media Mover

Why rely on YouTube for embeddable video? If you've got the bandwidth to spare, Media Mover provides a seamless way to convert video within drupal into flash-playable, embeddable files:

http://drupal.org/project/media_mover

Breaking news with Drupal

I don't know exactly how this was done for twincities, or whether there's a standard way to do it, but it's pretty easy to set up in drupal with CCK+Views. One thing twincities did was add a "dispatch" role that could edit breaking news, but wasn't a regular site admin/editor.

Further steps [i.e., if you have a datacenter and money to blow]

What follows is some pointers on how you'd do this with drupal if you weren't on a shoestring budget, and had the luxury of multiple servers with a fast local ethernet connection to each other. This is the "standard" way of making drupal handle a high load - while better ultimately than the set up described above, because you don't lose instantaneous updates, it's a little more complicated and a lot more expensive.

Separate Database server

Multiple drupal web front ends

DB Cluster

Topic revision: r4 - 05 Sep 2008, JohnDudaAccount