Last week I wrote about performance testing Open Site Explorer. But I didn’t write much about how and why to collect the relevant data. In this post I’ll write about the tools I use to collect performance data, how I aggregate it, and a little bit about what those data tell us. This advice applies equally well when running a performance test or during normal production operations of any web application.
I collect three kinds of data:
- system performance characteristics
- client-side, perceived performance
- server-side errors and per-request details
To make this a little bit more concrete, consider a pretty standard web architecture:
System characteristics are the lion’s share of performance measurement. I want to know how my app is performing and what the bottlenecks are. I can’t do much better than actually measuring the raw components. On each system in your architecture you’ll want to collect at least the following:
- Load average
- CPU, broken out by process including I/O wait time, user/system time, idle time
- Memory usage, broken out by process, and used, cached, free
- Disk activity, including requests and bytes read/written per second
- Network bytes read/written per second
Make sure you understand what each of these does and does not measure. For instance, load average may include network and disk wait, even if the CPU is idle. But it might not. Unused memory isn’t useful, but disk cache (often reported as unused) is. So check how your OS and your tools calculate these things.
I do lots of analysis on this kind of data, but here are a few basic things to look at:
- What’s your load average? It’s (almost always) interpreted relative to the number of cores you have, so a load average of 4 on a 4-core box probably means the box is saturated.
- What does your memory usage look like? Is free + cached memory very close to zero? Most apps, daemons, etc. will work much better with a sizeable disk cache. You don’t want to completely exhaust system memory or you’ll start swapping to disk, and that’s very bad.
- Examine at least a week’s worth of data to get a sense for daily and weekly cycles. Don’t tune your apps to operate optimally for weekend load; otherwise Monday morning will slam you worse than it normally does.
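The free + cached check above takes only a few lines of shell. This is a minimal sketch assuming a Linux-style `/proc/meminfo`; it parses a captured sample rather than the live file, and the numbers in the sample are made up.

```shell
# Free + cached check, assuming Linux's /proc/meminfo format.
# This parses a captured sample; on a real box, point awk at
# /proc/meminfo itself. The sample numbers are made up.
sample='MemTotal:        2048000 kB
MemFree:           51200 kB
Cached:           716800 kB'
reclaim_kb=$(printf '%s\n' "$sample" |
  awk '/^MemFree:|^Cached:/ {s += $2} END {print s}')
echo "free + cached: ${reclaim_kb} kB"
```

If free + cached is within a few percent of zero, you’re close to swapping; here the box still has roughly 768 MB of reclaimable memory, which looks healthy.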
And some more complicated things:
- How much free (unused, non-cached) memory do you have? How does this vary over time? Tune your processes to use that free memory. But keep enough (a small margin, perhaps 10% of total) in reserve for sudden spikes.
- How does your total CPU usage compare to load average? If you’ve routinely got a load average of 4 but your CPU usage is always under 50% (aggregated across all cores), then you’ve got some disk or network bottlenecks that aren’t letting you take advantage of all your cores.
- Is your web server dumping nearly a MB/sec to disk during normal operations? That could be some poorly tuned logging from Apache or one of your applications. Turn that chattiness down to get more performance.
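One quick way to catch chatty logging like that is to sample the log’s size twice and compute the write rate. In this sketch a temp file stands in for your real Apache or application log, and the byte counts are simulated.

```shell
# Estimate a log's write rate by sampling its size at two points in time.
# A temp file simulates the log here; on a real box, set $log to your
# Apache or application log and add a `sleep` between the two samples.
log=$(mktemp)
printf 'x%.0s' $(seq 1 1000) > "$log"   # pretend 1000 bytes already logged
size1=$(wc -c < "$log")
printf 'x%.0s' $(seq 1 500) >> "$log"   # pretend 500 more bytes arrive
size2=$(wc -c < "$log")
growth=$((size2 - size1))
echo "log grew by ${growth} bytes"
rm -f "$log"
```

Divide the growth by the sampling interval and you have bytes/sec; if that number is anywhere near what your disk graphs show, logging is your write load.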
To collect system performance data, I like collectd, RRDTool, DStat, and IOStat. These are all simple and low-level tools. But more importantly, I understand and trust them. My ops guy, David, has been getting us on Zabbix, which is a more full-featured monitoring platform. So check that out if that’s what moves you.
Collectd is both a system performance measuring agent and a central server to aggregate data from many nodes. It’s important that you offload aggregation and recording of the data to a central server since this can be pretty disk intensive. For instance, my data aggregation server is usually at 50% CPU I/O wait time due to writing all the perf data it collects. Below is a sample configuration file to give you an idea of what collectd does as a data collection agent on a node, and how it’s configured:
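A minimal agent-side config along those lines might look like this; the plugin list and the server address are assumptions you’d adjust for your own site:

```
# Sample every 10 seconds and ship results to the central server.
Interval 10

LoadPlugin cpu
LoadPlugin load
LoadPlugin memory
LoadPlugin disk
LoadPlugin interface
LoadPlugin network

<Plugin network>
  # Address of the central aggregation server (hypothetical)
  Server "192.168.1.10" "25826"
</Plugin>
```

The network plugin is what turns a node into a data collection agent: everything sampled locally gets shipped off-box, so the node itself does almost no disk I/O for monitoring.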
At the central aggregation server, collectd dumps its data to an RRDTool database. RRDTool is a pretty well known, widely supported performance measurement storage format. I don’t do much directly with RRDTool. Instead I use drraw, a very light-weight web client for RRDTool. drraw lets me quickly throw together arbitrary dashboards on my perf data.
DStat is a very versatile tool to collect pretty much any system metrics and display them in a very Linux-hacker interface:
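An invocation covering the metrics described next looks something like this; the exact flags here are an assumption, not a capture of the original command:

```
# CPU, disk, network, memory, load average, and the top I/O process,
# refreshed every 3 seconds (flag choice is an assumption; see dstat --help)
dstat -c -d -n -m -l --top-io 3
```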
I’ve asked for CPU, disk, memory, load average, and the most expensive I/O process. It looks to me like:
- One of my cores is pegged.
- There’s nothing of note on disk or network.
- Not much memory is free, but nearly 700MB is cached, so that looks good.
- Xorg and “exe” (which is Flash player running Pandora) are talking to each other an awful lot, probably over pipes or local sockets (since there’s no corresponding disk or network).
One common problem I’ve got is that I see a lot of CPU I/O wait time, but only a few KB or maybe a MB/sec being written to disk. The question is, where’s all that I/O wait time coming from? It might be random disk I/O, or it might be network I/O. That’s where IOstat comes in:
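That invocation is just iostat with extended per-device statistics at 3-second intervals:

```
# Extended per-device statistics, repeated every 3 seconds
iostat -x 3
```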
I asked for extended information (-x) at 3 second intervals. The first block of output is aggregated since system start. Each block after that is aggregated over the 3 second interval. This tells me:
- The apps running are pushing between 1 and 10 write requests per second (the w/s column) (pretty low).
- Those requests have to wait between 0 and 0.25 milliseconds to complete (the await column).
- The disk has request response time of between 0 and 0.25 milliseconds (the svctm column). This will always be less than or equal to await. Because it’s equal to await in this case, that tells me there’s essentially no contention for the disk at the moment.
- Most importantly, the disk is essentially at zero utilization (the %util column).
That about wraps it up for measuring performance on the server itself. I’ve walked through a few scenarios. But it’s a pretty complicated landscape. The best thing you can do is to set up measurement and wait for stuff, good or bad, to happen. After the fact you can match up what you saw from a business standpoint (what your users or support staff are telling you) with your performance data. If things were reported by customers as being slow, did any of your perf graphs show spikes? If you got a massive spike of traffic, did you see the effects on your system? In the future you can use that experience to take action (add nodes, fix bugs, build better architecture) before any negative business impact occurs.
In addition to measuring server-side performance, you get bonus points for putting together a (or many) synthetic client(s). You’ll want to make sure your client can collect:
- distribution of response times (or at least mean, median and 90%)
- counts of successful (probably 200 OK) and failed (anything else) responses
- throughput in total time to run a certain number of reports
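Computing that distribution from raw timings is easy to script. Here’s a minimal sketch that takes one response time per line and reports mean, median, and 90th percentile; the sample timings are made up.

```shell
# Mean, median, and 90th percentile from one response time per line.
# The timings below are made-up sample data; in a real client you'd
# pipe in the measured times instead.
times='0.12
0.15
0.11
0.90
0.14
0.13
0.16
0.12
0.18
0.13'
stats=$(printf '%s\n' "$times" | sort -n | awk '
  { a[NR] = $1; sum += $1 }
  END {
    printf "mean=%.3f median=%.3f p90=%.3f",
      sum / NR, a[int((NR + 1) / 2)], a[int(NR * 0.9)]
  }')
echo "$stats"
```

Note how the single 0.90-second outlier drags the mean well above the median — exactly why you want a distribution, not just an average.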
You’ll almost certainly uncover some errors under load. You’ll want to make sure your application (and other server processes) have a reasonable amount of logging. Debug logging could result in lots of unnecessary disk writes, so be sure to turn those off. But it’s certainly okay to log errors for perf tests and in production.
It’s also a good idea to have Apache request logging, including timing, turned on so you can see the responses the server gave out and the time it took to process them. This will back up what you’re recording at the client. I use the following log format (which should be compatible with lighttpd and Apache):
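A sketch of such a format in Apache (this is an assumption, not the post’s exact format string) uses `%D`, which records the microseconds spent serving each request:

```
# Common-log-style fields plus request service time in microseconds (%D).
# A sketch, not the exact format string from the post.
LogFormat "%h %l %u %t \"%r\" %>s %b %D" timed
CustomLog logs/access_log timed
```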
Throw some monitoring on this log. I use monit. But for performance analysis, a simple grep command does the trick:
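For example, here’s a sketch that counts error responses in a combined-format access log; the log lines and temp file are made up for illustration, and on a real box you’d grep the live log instead.

```shell
# Count 4xx/5xx responses in a combined-format access log.
# The sample lines and temp file are made up for illustration;
# point grep at your real access log in practice.
log=$(mktemp)
cat > "$log" <<'EOF'
127.0.0.1 - - [10/Oct/2010:13:55:36 -0700] "GET / HTTP/1.1" 200 2326
127.0.0.1 - - [10/Oct/2010:13:55:37 -0700] "GET /missing HTTP/1.1" 404 512
127.0.0.1 - - [10/Oct/2010:13:55:38 -0700] "POST /report HTTP/1.1" 500 0
EOF
# The status code is the field right after the closing quote of the request
errors=$(grep -cE '" [45][0-9][0-9] ' "$log")
echo "${errors} error responses"
rm -f "$log"
```

Run that from cron or monit on a short interval and alert when the count jumps, and you’ve got a crude but effective error monitor.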
I hope that covers the basics, and some of the finer points, of what and how to measure performance in any web application. The important point is to start collecting data. The analysis comes in plenty of flavors and levels of complexity. But the old 80-20 rule applies: just get started and you’ll quickly see benefits.