What and How to Measure Performance


Last week I wrote about performance testing Open Site Explorer.  But I didn’t write much about how and why to collect the relevant data.  In this post I’ll write about the tools I use to collect performance data, how I aggregate it, and a little bit about what those data tell us.  This advice applies equally well when running a performance test or during normal production operations of any web application.

I collect three kinds of data:

  • system performance characteristics
  • client-side, perceived performance
  • server-side errors and per-request details

To make this a little bit more concrete, consider a pretty standard web architecture:

[Image: web architecture, with the three categories of measurement highlighted]

I’ve highlighted where to include the three categories of measurement.

System Characteristics

System characteristics are the lion’s share of performance measurement.  I want to know how my app is performing and what the bottlenecks are.  I can’t do much better than actually measuring the raw components.  On each system in your architecture you’ll want to collect at least the following:

  • Load average
  • CPU, broken out by process including I/O wait time, user/system time, idle time
  • Memory usage, broken out by process, and used, cached, free
  • Disk activity, including requests and bytes read/written per second
  • Network bytes read/written per second
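
If you just want to eyeball these numbers by hand before setting up real collection, the standard Linux tools cover most of it.  A quick sketch (nothing here is specific to my setup):

$ uptime                  # load average over the last 1, 5, and 15 minutes
$ free -m                 # memory: used, free, buffers, and cache, in MB
$ vmstat 1 5              # CPU, memory, swap, and disk activity, 5 one-second samples
$ sar -n DEV 1 5          # per-interface network bytes in/out (sysstat package)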

Make sure you understand what each of these does and does not measure.  For instance, load average may include network and disk wait, even if the CPU is idle.  But it might not.  Unused memory isn’t useful, but disk cache (often reported as unused) is useful.  So check how your OS and your tools calculate these things.

I do lots of analysis on this kind of data, but here are a few basic things to look at:

  • What’s your load average?  It’s (almost always) interpreted relative to the number of cores you have, so load average of 4 on a 4 core box probably means the box is saturated.
  • What does your memory usage look like?  Is free + cached memory very close to zero?  Most apps, daemons, etc. will work much better with a sizeable disk cache.  You don’t want to completely exhaust system memory or you’ll start swapping to disk, and that’s very bad.
  • Examine at least a week’s worth of data to get a sense for daily and weekly cycles.  Don’t tune your apps to operate optimally for weekend load; otherwise Monday morning will slam you worse than it normally does :P

And some more complicated things:

  • How much free (unused, non-cached) memory do you have?  How does this vary over time?  Tune your processes to use that free memory.  But keep enough (a small margin, perhaps 10% of total) in reserve for sudden spikes.
  • How does your total CPU usage compare to load average?  If you’ve routinely got a load average of 4 but your CPU usage is always under 50% (aggregated across all cores), then you’ve got some disk or network bottlenecks that aren’t letting you take advantage of all your cores.
  • Is your web server dumping nearly a MB/sec to disk during normal operations?  That could be some poorly tuned logging from apache or one of your applications.  Turn that chattiness down to get more performance.
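
If you suspect a chatty process like that, one way to catch it in the act is pidstat, from the same sysstat package that provides iostat:

$ pidstat -d 1            # per-process kB read/written per second, once a second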

To collect system performance data, I like collectd, RRDTool, DStat, and IOStat.  These are all simple and low-level tools.  But more importantly, I understand and trust them.  My ops guy, David, has been getting us onto Zabbix, which is a more full-featured monitoring platform.  So check that out if that’s what moves you.

Collectd is both a system performance measuring agent and a central server to aggregate data from many nodes.  It’s important that you offload aggregation and recording of the data to a central server since this can be pretty disk intensive.  For instance, my data aggregation server is usually at 50% CPU I/O wait time due to writing all the perf data it collects.  Below is a sample configuration file to give you an idea of what collectd does as a data collection agent on a node, and how it’s configured:

#gets apache stats, needs mod_status enabled
LoadPlugin apache
#gets cpu stats broken out by core and aggregate
LoadPlugin cpu
#gets filesystem usage stats
LoadPlugin df
#gets disk I/O stats
LoadPlugin disk
#gets load average
LoadPlugin load
#gets memory stats
LoadPlugin memory
#gets network stats
LoadPlugin interface
#gets system stats broken out by processes you specify below
LoadPlugin processes
#sends data back to a central ops server
LoadPlugin network

#your metrics aggregation server
<Plugin network>
	Server "ops.example.com" "27781"
</Plugin>

#measure the cpu usage of different processes
<Plugin processes>
	Process "apache"
	Process "ruby"
	Process "lighttpd"
	Process "memcached"
	Process "collectd"
</Plugin>

At the central aggregation server, collectd dumps its data to an RRDTool database.  RRDTool is a pretty well known, widely supported performance measurement storage format.  I don’t do much directly with RRDTool.  Instead I use drraw, a very lightweight web client for RRDTool.  drraw lets me quickly throw together arbitrary dashboards from my perf data.
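
For completeness, here’s a minimal sketch of what the receiving side of that setup might look like in collectd.conf.  The listen address and DataDir path are illustrative, not copied from my actual config:

#receive metrics from the nodes
LoadPlugin network
#write everything out as RRD files
LoadPlugin rrdtool

<Plugin network>
	Listen "0.0.0.0" "27781"
</Plugin>

<Plugin rrdtool>
	DataDir "/var/lib/collectd/rrd"
</Plugin>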

[Image: drraw performance dashboard]

Between collectd and drraw I collect, aggregate, and visualize all the measurements I listed above. But I also frequently collect finer grained, ad hoc data from boxes using DStat and IOStat.

DStat is a very versatile tool that collects pretty much any system metric and displays it in a very Linux-hacker interface:

[Image: dstat ad hoc performance measurement]
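
For reference, an invocation along these lines produces that view; exact flag spellings vary a bit between dstat versions:

$ dstat -cdnml --top-io   # cpu, disk, network, memory, load, plus the top I/O process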

I’ve asked for CPU, disk, memory, load average, and the most expensive I/O process.  It looks to me like:

  • One of my cores is pegged.
  • There’s nothing of note on disk or network.
  • Not much memory is free, but nearly 700MB is cached, so that looks good.
  • Xorg and “exe” (which is the Flash player running Pandora) are talking to each other an awful lot, probably over pipes or local sockets (since there’s no corresponding disk or network activity).

One common problem I’ve got is that I see a lot of CPU I/O wait time, but only a few KB or maybe a MB/sec being written to disk.  The question is: where’s all that I/O wait time coming from?  It might be random disk I/O, or it might be network I/O.  That’s where IOStat comes in:

[Image: iostat ad hoc I/O performance measurement]
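
For reference, that boils down to:

$ iostat -x 3             # extended device stats, repeating every 3 seconds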

I asked for extended information (-x) at 3 second intervals.  The first block of output is aggregated since system start.  Each block after that is aggregated over the 3 second interval.  This tells me:

  • The apps running are pushing between 1 and 10 write requests per second (the w/s column), which is pretty low.
  • Those requests have to wait between 0 and 0.25 milliseconds to complete (the await column).
  • The disk has request response time of between 0 and 0.25 milliseconds (the svctm column).  This will always be less than or equal to await.  Because it’s equal to await in this case, that tells me there’s essentially no contention for the disk at the moment.
  • Most importantly, the disk is essentially at zero utilization (the %util column).

That about wraps it up for measuring performance on the server itself.  I’ve walked through a few scenarios.  But it’s a pretty complicated landscape.  The best thing you can do is to set up measurement and wait for stuff, good or bad, to happen.  After the fact you can match up what you saw from a business standpoint (what your users or support staff are telling you) with your performance data.  If things were reported by customers as being slow, did any of your perf graphs show spikes?  If you got a massive spike of traffic, did you see the effects on your system?  In the future you can use that experience to take action (add nodes, fix bugs, build better architecture) before any negative business impact occurs.

Client-Side Performance

In addition to measuring server-side performance, you get bonus points for putting together a synthetic client (or many).  You’ll want to make sure your client can collect:

  • distribution of response times (or at least mean, median, and 90th percentile)
  • counts of successful (probably 200 OK) and failed (anything else) responses
  • throughput in total time to run a certain number of reports

The custom client I use collects all these things, plus more.  But there are plenty of tools and packages out there.  You can even set up a shell script that runs a simple curl or wget command:

$ /usr/bin/time curl --silent "http://www.nickgerner.com" 2>&1 | tail -n 2 | grep elapsed | sed 's/.* \([:.0-9]*\)elapsed .*/\1/'

0:00.79

This won’t tell you about render time or JavaScript time (unless you go with my earlier suggestion of Keynote, though I’ve never used them); but it’s better than nothing.
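
If you want the distribution rather than a single number, curl’s built-in timing makes this easy to script.  A rough sketch, where the URL and request count are placeholders:

#!/bin/sh
#hit the URL repeatedly, then report median and 90th percentile response times
URL="http://www.nickgerner.com"
N=100
for i in $(seq 1 $N); do
	curl --silent --output /dev/null --write-out '%{time_total}\n' "$URL"
done | sort -n | awk '{t[NR]=$1} END {print "median: " t[int(NR*0.5)]; print "90th:   " t[int(NR*0.9)]}'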

Server-Side Errors

You’ll almost certainly uncover some errors under load.  You’ll want to make sure your application (and other server processes) have a reasonable amount of logging.  Debug logging could result in lots of unnecessary disk writes, so be sure to turn those off.  But it’s certainly okay to log errors for perf tests and in production.

It’s also a good idea to have Apache request logging, including timing, turned on so you can see the responses the server gave out and the time it took to process them.  This will back up what you’re recording at the client.  I use the following log format (which should be compatible with lighttpd and Apache):

%h %V %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{X-Forwarded-For}i %T
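
In Apache, that string goes inside a LogFormat directive and gets attached to a log with CustomLog.  A sketch, where the nickname and log path are illustrative:

LogFormat "%h %V %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{X-Forwarded-For}i %T" perf
CustomLog /var/log/apache2/access.log perf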

Throw some monitoring on this log.  I use monit.  But for performance analysis, a simple grep command does the trick:

$ grep -c '" [5][0-9][0-9] ' /var/log/lighttpd/access.log
0

$ grep -c '" 200 ' /var/log/lighttpd/access.log
434849
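
And since %T puts the request time in the last field, awk can pull out slow responses just as easily (the 5-second threshold here is arbitrary):

$ awk '$NF >= 5' /var/log/lighttpd/access.log | wc -l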

I hope that covers some basics, and the finer points around what to measure and how to measure performance in any web application.  The important point is to start collecting data.  The analysis of it comes in plenty of flavors and levels of complexity.  But the old 80-20 rule applies: just get started and you’ll quickly see benefits.

  1. #1 by Carter Cole on February 16th, 2010

    way cool stuff… I hope one day to build a big clustered app… right now all I’ve done is tiny line-of-business apps for companies of about 10-50

  2. #2 by Nick Gerner on February 17th, 2010

    I’m glad you like it :) I like your toolbar. I think it has a lot of potential to go big.

  3. #3 by Dag Wieers on May 19th, 2010

    Hi Nick,

    I like the way you show how Dstat can be useful by a nice screenshot and a story. In fact something I always wanted to do was create some scenarios (and related Dstat commands) that show the reasoning and functionality of Dstat. Your example is a good one.

    However, Dstat is an evolving project, and in the meantime we have plugins that show disk utilization rates (--disk-util) and other plugins showing more of the information that iostat provides.  These would be perfect for explaining which disk is causing iowait situations.

    The true strength of Dstat compared to similar tools is that it is written in Python, and writing your own plugins is quite easy.  That gives you a wealth of possibilities to correlate system events and counters.  I am limited by my own experiences, but I hope other people contribute plugins that extend Dstat in ways I never envisioned.

    If you come across other scenarios, blog about them and send me a note ;-) Thanks again !
