Below are the slides from my talk “Common Sense Performance Indicators”. I spoke about the need to measure performance indicators over time and made a case for simple, system-level measurements to augment end-to-end user experience measurements. I also presented the free, open source solution I used while I was at SEOmoz.
Tuesday was my last day at SEOmoz. Being unemployed gives me a lot of time to have coffee with various people (from VCs to entrepreneurs to recruiters). And I’ve spent a lot of time trying to explain who I am. That got me thinking about core values.
If you’re not familiar, core values are principles that organizations (and, as I’ll explain, individuals) use to guide decisions. Some rather famous examples are the core values of Microsoft, Google, and Zappos. And of course there are the core values at SEOmoz. These are tools to help explain the corporate culture to customers, partners, existing and potential employees.
I believe I (and others) can benefit from core values in the same way. What are my decision making criteria? By what principles can I measure success? How can I succinctly explain to someone what I stand for, at least professionally?
My Core Values
I’ve boiled down what drives me into three core values:
- Be customer focused
- Bias for action
- Grow in a critical atmosphere
It’s no coincidence that my first value is “be customer focused.” This lies at the heart of so many corporate core values, and I can’t get away from it myself. The “customer” might be a paying customer, or not. It might be a blog reader, an audience member, or a customer I haven’t met yet. Exercising this value has allowed me to turn angry customers, seeking refunds, into rabid fans who tout the customer service of my company. Ventures without customer focus seem obviously doomed to failure to me.
Secondly, I believe that no one knows as much about what you do as you. So you should take responsibility for what you do, and act! Success comes through action, and I firmly believe that individuals should be empowered to take that action. Much of the success I’ve had over the last few years has been by exercising this value, sometimes without consensus, and often without running it past managers first. This is what I’ve loved about working at a startup, even if it means mixed results. Taking responsibility for success and failure is a critical corollary to this value.
My third value states that I need a critical environment in order to grow. On my own I fall into patterns and routines that are comforting, but also stagnant. I need someone to pat me on the back, and I need to pat others on the back; but I also need someone to tell me what opportunities I’ve missed, and how I can improve in the future. I also need to be free to provide criticism for others to help them improve without worrying about personal politics. When I tell you I disagree with your design, or your business decision, it’s not personal. To me, this is the heart of teamwork, and seeking out feedback and getting someone to bounce ideas off of is part of what makes a team work well.
Coming to My Core Values
This is a personal code of conduct, so there’s a lot of introspection involved. But there is an external dimension to this too: I want to present myself professionally, and communicate my decision making process to others. There are plenty of core values other people have put together. To name just a few (taken from the sources I’ve already cited, with a couple of my own thrown in):
- Act with integrity and honesty.
- Take on big challenges and see them through.
- It’s best to do one thing really, really well.
- Great just isn’t good enough.
- Pursue growth and learning.
- Do one thing. (different from doing one thing well)
- Ask “why?”
- Measure it!
In addition to having statements of value that reflect what I really believe, I also want a set of core values which are concise, specific, and express different dimensions of my personal and professional profile. Conflict between values is inevitable, but when values cannot clearly be separated from one another it seems they lose value as a framework for decision making, measuring success, or communicating who you are. If it’s not clear what “Ask why?” means and how it’s different from “Grow in a critical atmosphere,” then they’ve failed to communicate an aspect of who you are.
Some of these values, while quite appropriate for a corporation, don’t seem to apply as well to an individual (at least not to me). For instance, “It’s best to do one thing really, really well” makes a lot of sense for a corporation that wants to index all the world’s information and make it available. But for individuals that want to achieve such a lofty goal, this core value seems either a little too high-level or incredibly short sighted.
I also don’t want to state that I hold something as a core value if it’s not deeply internalized into who I am. For instance, “Measure it!” is both trendy and something I think is great. But on the other hand, this is something I haven’t internalized. I don’t carefully measure everything I do and sometimes I act without collecting all the data I could. I’d like to improve in this area. “Measure it!” also seems to conflict with my second value, “bias for action,” in some ways. If you’re busy gathering data to support a decision before acting, then you’ve failed to hold true to acting in the face of uncertainty.
I suspect these values will continue to evolve as I do, as I gain experience, as I change my career and life goals. But these goals, and the process by which I’ve arrived at them, give me a framework to explain who I am. And they help shape my decisions.
- Search and browse the catalog (including movies, series, and cast)
- Get recommendations
- Get ratings and predicted ratings
- View, add to, or remove from user queues
This API is a great move by Netflix: they get developers engaged in extending functionality, plus they have an affiliate program so everybody wins. And there’s almost no business risk to Netflix. Most of the functionality is only available if a user is already a Netflix customer, and the functionality that’s available without a subscription only helps expose and upsell Netflix products.
There are plenty of great resources available about getting started with the API, but I thought I’d explain some of the early challenges I’m encountering and the solutions I’ve got so far. Just FYI, I’m about 12 hours into my project, so I might be missing some obvious solutions.
Wow, OAuth is really daunting at first. Netflix has a great walkthrough, but it’s still confusing and scary. They list 3 kinds of requests:
- Non-authenticated (content that they don’t really care who gets)
- Signed Requests (content that they do care about, but isn’t user specific)
- Protected Requests (content and interaction that is user specific)
The first type is really straightforward: just include your “consumer key” or access ID, application identifier, whatever you want to call it. Nothing else to say here.
The second is also straightforward. It’s just a signed request similar to Amazon S3 or the similarly inspired SEOmoz API. Basically you take your request, including query parameters, and compute a hash using a secret key you and Netflix share. This way Netflix can be sure that it’s really you who’s doing the request. This is pretty standard OAuth stuff, but I threw together a few lines of code to help.
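As a rough illustration of what that signing step involves (this is a sketch, not the code mentioned above), here is a minimal Python version of an OAuth 1.0 style HMAC-SHA1 signature. The endpoint URL, keys, and parameter values are placeholders, and the function names are hypothetical. In real use a maintained OAuth library is a safer choice.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def percent_encode(value):
    # OAuth 1.0 uses RFC 3986 encoding: only letters, digits, "-", ".", "_", "~" stay literal.
    return quote(str(value), safe="-._~")


def sign_request(method, url, params, consumer_secret, token_secret=""):
    """Return the base64-encoded HMAC-SHA1 signature for a request."""
    # Encode every name and value, then sort so both sides build the same string.
    pairs = sorted((percent_encode(k), percent_encode(v)) for k, v in params.items())
    param_string = "&".join(f"{k}={v}" for k, v in pairs)
    base_string = "&".join(
        [method.upper(), percent_encode(url), percent_encode(param_string)]
    )
    # The signing key joins the shared secret and the (possibly empty) token secret.
    key = f"{percent_encode(consumer_secret)}&{percent_encode(token_secret)}"
    digest = hmac.new(key.encode("utf-8"), base_string.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


# Placeholder values for illustration only.
signature = sign_request(
    "GET",
    "http://api.netflix.com/catalog/titles",
    {
        "term": "Shutter Island",
        "oauth_consumer_key": "my-consumer-key",
        "oauth_nonce": "abc123",
        "oauth_timestamp": "1267000000",
        "oauth_signature_method": "HMAC-SHA1",
        "oauth_version": "1.0",
    },
    consumer_secret="my-shared-secret",
)
print(signature)
```

The optional `token_secret` argument is what changes for user-specific requests: the same function signs them once a per-user token secret is available.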
The third kind of request, a “Protected Request,” is simple. You’ll make requests like signed requests, but you use a user-specific key. And to get that, you need to follow some really complicated authorization steps. To illustrate, here’s a diagram of what’s going on:
That’s a 9-way handshake including you, the user, and Netflix! But it’s safe, and extremely explicit. So security for the win, I guess. What’s going on is:
- You kick things off with a request to Netflix
- Netflix responds with a few security parameters, including a special login URL for the user
- You send the user that login URL (maybe with a 302 redirect), adding a callback URL parameter (see step 7)
- The user visits the login screen
- Netflix tells the user the dangers of playing with strange apps
- The user confirms
- Netflix redirects (with a 302) the user back to your callback URL, plus a user-specific authorized token (this is the first time anything has been user identifying)
- You make a final request to Netflix including that authorized token. This time notice you’re using a new key that combines your secret key with the new token from Netflix.
- Netflix responds with a final secret token
Those last two tokens (the ones you got in steps 7 and 9) are the real keys to accessing user-specific data. Again, I’ve got a couple of pieces of code to help. The first handles steps 1-3; the second handles steps 7-9. It’s up to your users to handle steps 4-6.
Once you get past the 9-way handshake and get a good OAuth lib to help out with signing requests, the rest is pretty easy. Mostly.
Rate Limits and Title Refs
I know a little bit about rate limits. And not surprisingly, Netflix has them. When you first sign up, you’re limited to 4 queries per second with a daily limit of 5000 queries overall. That’s enough to get started. And many requests support a batched interface. So you can get up to 500 predicted ratings in one request (way to go Netflix!).
But because everything in the Netflix API depends on internally assigned, opaque URIs (e.g., http://api.netflix.com/catalog/titles/movies/60021896), I find myself making a lot of search queries. For instance, if I know that Shutter Island opened last weekend, and want to get some Netflix data about it, I first have to make a search request on “Shutter Island” before I can get that predicted rating. And the search API doesn’t support a batched interface. This adds up to a lot of requests, and quickly.
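One way to stay inside limits like these on the client side is a small throttle that spaces out calls and counts them against the daily budget. The sketch below assumes the 4 per second and 5000 per day figures quoted above; the `Throttle` class and its arguments are hypothetical names, not part of any Netflix library.

```python
import time


class Throttle:
    """Space out calls to respect a per-second rate and a daily budget."""

    def __init__(self, per_second, daily_limit, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / per_second
        self.daily_limit = daily_limit
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = float("-inf")  # the first call never waits
        self.used = 0

    def wait(self):
        """Block until the next request may be sent, or raise if the budget is spent."""
        if self.used >= self.daily_limit:
            raise RuntimeError("daily request budget exhausted")
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = max(self.clock(), self.next_allowed)
        self.next_allowed = now + self.min_interval
        self.used += 1


throttle = Throttle(per_second=4, daily_limit=5000)
for _ in range(2):
    throttle.wait()  # call this right before each API request
print(throttle.used)
```

This version does not reset the daily counter at midnight; a long-running process would need to add that.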
Perhaps there’s an easy solution I’m missing (download the whole catalog? But what about ambiguities in movie titles?). This does sound like a problem well suited to caching. At least, in my application, I have a few movies for which I want to look up ratings for many people. Caching is commonly recommended by API providers. So even if I am missing something, caching isn’t a bad idea. Inspired by WP-cache, I’ve started a small disk-based caching utility. It’s not done yet, but it works in my prototype.
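To make the idea concrete, here is a minimal sketch of a disk-based cache in Python. It is not the utility mentioned above: the `DiskCache` class, the one-day expiry, and the cache key format are all assumptions chosen for illustration. Each entry is a JSON file named after a hash of the key.

```python
import hashlib
import json
import os
import tempfile
import time


class DiskCache:
    """Store JSON-serializable values on disk, one file per key, with expiry."""

    def __init__(self, directory, ttl_seconds=24 * 3600):
        self.directory = directory
        self.ttl_seconds = ttl_seconds
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return os.path.join(self.directory, digest + ".json")

    def get(self, key):
        try:
            with open(self._path(key), "r", encoding="utf-8") as handle:
                entry = json.load(handle)
        except (OSError, ValueError):
            return None  # missing or unreadable entries count as a miss
        if time.time() - entry["stored_at"] > self.ttl_seconds:
            return None  # expired
        return entry["value"]

    def set(self, key, value):
        entry = {"stored_at": time.time(), "value": value}
        # Write to a temporary file first so readers never see a half-written entry.
        fd, tmp_path = tempfile.mkstemp(dir=self.directory)
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(entry, handle)
        os.replace(tmp_path, self._path(key))

    def get_or_fetch(self, key, fetch):
        value = self.get(key)
        if value is None:
            value = fetch()
            self.set(key, value)
        return value


calls = []

def fake_search():
    # Stand-in for a real search request; the returned data is made up.
    calls.append(1)
    return {"title": "Shutter Island", "ref": "placeholder"}

cache = DiskCache(tempfile.mkdtemp())
first = cache.get_or_fetch("search:Shutter Island", fake_search)
second = cache.get_or_fetch("search:Shutter Island", fake_search)
print(len(calls), first == second)
```

The second lookup is served from disk, so only one search request counts against the rate limit.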
Speaking of my prototype, a super, super early version can be seen here. I’ll post again when I’ve got updates to the code or the application itself.
Last week I wrote about performance testing Open Site Explorer. But I didn’t write much about how and why to collect the relevant data. In this post I’ll write about the tools I use to collect performance data, how I aggregate it, and a little bit about what those data tell us. This advice applies equally well when running a performance test or during normal production operations of any web application.
I collect three kinds of data:
- system performance characteristics
- client-side, perceived performance
- server-side errors and per-request details
To make this a little bit more concrete, consider a pretty standard web architecture:
System characteristics are the lion’s share of performance measurement. I want to know how my app is performing and what the bottlenecks are. I can’t do much better than actually measuring the raw components. On each system in your architecture you’ll want to collect at least the following:
- Load average
- CPU, broken out by process including I/O wait time, user/system time, idle time
- Memory usage, broken out by process, and used, cached, free
- Disk activity, including requests and bytes read/written per second
- Network bytes read/written per second
Make sure you understand what each of these does and does not measure. For instance, load average may include network and disk wait, even if the CPU is idle. But it might not. Unused memory isn’t useful, but disk cache (often reported as unused) is useful. So check how your OS and your tools calculate these things.
I do lots of analysis on this kind of data, but here are a few basic things to look at:
- What’s your load average? It’s (almost always) interpreted relative to the number of cores you have, so load average of 4 on a 4 core box probably means the box is saturated.
- What does your memory usage look like? Is free + cached memory very close to zero? Most apps, daemons, etc. will work much better with a sizeable disk cache. You don’t want to completely exhaust system memory or you’ll start swapping to disk, and that’s very bad.
- Examine at least a week’s worth of data to get a sense for daily and weekly cycles. Don’t tune your apps to operate optimally for weekend load; otherwise Monday morning will slam you worse than it normally does.
And some more complicated things:
- How much free (unused, non-cached) memory do you have? How does this vary over time? Tune your processes to use that free memory. But keep enough (a small margin, perhaps 10% of total) in reserve for sudden spikes.
- How does your total CPU usage compare to load average? If you’ve routinely got a load average of 4 but your CPU usage is always under 50% (aggregated across all cores), then you’ve got some disk or network bottlenecks that aren’t letting you take advantage of all your cores.
- Is your web server dumping nearly a MB/sec to disk during normal operations? That could be some poorly tuned logging from apache or one of your applications. Turn that chattiness down to get more performance.
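The load average and CPU checks above can be written down as a tiny rule of thumb. The sketch below is only an illustration: the `diagnose` function and its labels are hypothetical, and the thresholds (load per core of 1.0, CPU under 50%) are taken from the examples in the bullets rather than from any tool.

```python
def diagnose(load_average, cpu_busy_fraction, cores):
    """Classify a box from its load average and aggregate CPU usage (0.0 to 1.0)."""
    load_per_core = load_average / cores
    if load_per_core < 1.0:
        return "headroom"
    if cpu_busy_fraction < 0.5:
        # Runnable work is queued, but the CPUs are mostly idle.
        return "saturated, likely waiting on disk or network"
    return "saturated, CPU bound"


# Load average of 4 on a 4-core box with CPU under 50%, as in the example above.
print(diagnose(load_average=4.0, cpu_busy_fraction=0.45, cores=4))
```

In practice the inputs would come from whatever collector you already trust, sampled over the same time window.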
To collect system performance data, I like collectd, RRDTool, DStat, and IOStat. These are all simple and low-level tools. But more importantly, I understand and trust them. My ops guy, David, has been getting us on Zabbix which is a more full featured monitoring platform. So check that out if that’s what moves you.
Collectd is both a system performance measuring agent and a central server to aggregate data from many nodes. It’s important that you offload aggregation and recording of the data to a central server since this can be pretty disk intensive. For instance, my data aggregation server is usually at 50% CPU I/O wait time due to writing all the perf data it collects. Below is a sample configuration file to give you an idea of what collectd does as a data collection agent on a node, and how it’s configured:
At the central aggregation server, collectd dumps its data to an RRDTool database. RRDTool is a pretty well known, widely supported performance measurement storage format. I don’t do much directly with RRDTool. Instead I use drraw, a very light-weight web client for RRDTool. drraw lets us quickly throw together arbitrary dashboards on my perf data.
DStat is a very versatile tool to collect pretty much any system metrics and display them in a very Linux-hacker interface:
I’ve asked for CPU, disk, memory, load average, and the most expensive I/O process. It looks to me like:
- One of my cores is pegged.
- There’s nothing of note on disk or network.
- Not much memory is free, but nearly 700MB is cached, so that looks good.
- Xorg and “exe” (which is Flash player running Pandora) are talking to each other an awful lot, probably over pipes or local sockets (since there’s no corresponding disk or network).
One common problem I’ve got is that I see a lot of CPU I/O wait time, but only a few KB or maybe a MB/sec being written to disk. The question is, where’s all that I/O wait time coming from? It might be random disk I/O, or it might be network I/O. That’s where IOStat comes in:
I asked for extended information (-x) at 3 second intervals. The first block of output is aggregated since system start. Each block after that is aggregated over the 3 second interval. This tells me:
- The apps running are pushing between 1 and 10 write requests per second (the w/s column) (pretty low).
- Those requests have to wait between 0 and 0.25 milliseconds to complete (the await column).
- The disk has request response time of between 0 and 0.25 milliseconds (the svctm column). This will always be less than or equal to await. Because it’s equal to await in this case, that tells me there’s essentially no contention for the disk at the moment.
- Most importantly, the disk is essentially at zero utilization (the %util column).
That about wraps it up for measuring performance on the server itself. I’ve walked through a few scenarios. But it’s a pretty complicated landscape. The best thing you can do is to set up measurement and wait for stuff, good or bad, to happen. After the fact you can match up what you saw from a business standpoint (what your users or support staff are telling you) with your performance data. If things were reported by customers as being slow, did any of your perf graphs show spikes? If you got a massive spike of traffic, did you see the effects on your system? In the future you can use that experience to take action (add nodes, fix bugs, build better architecture) before any negative business impact occurs.
In addition to measuring server-side performance, you get bonus points for putting together a (or many) synthetic client(s). You’ll want to make sure your client can collect:
- distribution of response times (or at least mean, median, and 90th percentile)
- counts of successful (probably 200 OK) and failed (anything else) responses
- throughput, measured as the total time to run a certain number of reports
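A synthetic client only needs to record a status code and a response time per request to produce all three of these. The following Python sketch summarizes such a list; the `summarize` function, the nearest-rank percentile, and the sample numbers are illustrative assumptions, not output from a real test run.

```python
import math
import statistics


def percentile(sorted_values, fraction):
    # Nearest-rank percentile: the smallest value with at least `fraction` of samples at or below it.
    rank = max(1, math.ceil(fraction * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(results, wall_clock_seconds):
    """results is a list of (http_status, response_seconds) pairs."""
    times = sorted(seconds for _, seconds in results)
    ok = sum(1 for status, _ in results if status == 200)
    return {
        "mean": statistics.mean(times),
        "median": statistics.median(times),
        "p90": percentile(times, 0.90),
        "ok": ok,
        "failed": len(results) - ok,
        # Successful requests per second over the whole test run.
        "throughput": ok / wall_clock_seconds,
    }


# Made-up sample: four successes and one 503 over a five second run.
sample = [(200, 1.0), (200, 2.0), (200, 3.0), (503, 4.0), (200, 10.0)]
summary = summarize(sample, wall_clock_seconds=5.0)
print(summary)
```

Note how the one slow request barely moves the median but dominates the 90th percentile, which is why a mean alone is not enough.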
You’ll almost certainly uncover some errors under load. You’ll want to make sure your application (and other server processes) have a reasonable amount of logging. Debug logging could result in lots of unnecessary disk writes, so be sure to turn those off. But it’s certainly okay to log errors for perf tests and in production.
It’s also a good idea to have Apache request logging, including timing turned on so you can see responses the server gave out, and the time to process them. This will back up what you’re recording at the client. I use the following log format (which should be compatible with lighttpd and Apache):
Throw some monitoring on this log. I use monit. But for performance analysis, a simple grep command does the trick:
I hope that covers some basics, and the finer points around what to measure and how to measure performance in any web application. The important point is to start collecting data. The analysis of it comes in plenty of flavors and levels of complexity. But the old 80-20 rule applies: just get started and you’ll quickly see benefits.
I’ve spent the last week performance testing Open Site Explorer which we launched earlier this week. Using some of the same tools and techniques I’ve described in the past, I discovered plenty of issues. Below I share the process I followed and the issues I found. Then I connect that to our actual 24-hour-later analytics data.
The process I follow is:
- Understand Objectives of Perf Testing
- Create Performance Targets
- Gather Performance Measurements
- Analyze Results
- Close the Loop with Actual Results
- Understand performance characteristics of the system:
- Response Time: how long a user waits for pages to load
- Throughput: how many pages the system can deliver per second
- Identify Bottlenecks and Scaling: CPU, memory, network, disk, database, external services, per node and cluster-wide
- Find bugs under load, including repro scenarios (e.g. under high load, under moderate load, frequency of occurrence)
- Gain confidence in launch
Hopefully you can put together realistic, aggressive performance targets:
- Response Time: how long you want your users to wait for pages to load
- Sustained Load: how many users and page views you expect to serve in general
- Peak Load: how many users and page views you expect to serve at peak
- Uptime or Request Success rate: It’s naive to think that every request will succeed, so plan for failures realistically
Remember, you want these to be realistic, but also aggressive. What is the very best case scenario for launch? It’s better to plan for more traffic than you actually expect, than to find yourself short. Even so, you might want to have standby capacity, or have some kind of scale contingency plan.
Given that our application incorporates data from an external data source (the Linkscape API) we can’t expect sub-second response times. But page loads should not take more than two and a half seconds in general.
David, our ops guy, Scott in marketing, and I sat down and put together a simple launch model to predict our load. We anticipate we’ll have significantly more traffic at launch than we will in the near future after launch. We use a lot of data we already have:
- analytics on past tool launches
- analytics from our blog and site in general
- estimates on partner promotion reach
- generous conversion rate estimation (click-throughs from any promotion to the tool itself)
- 22,000 users on launch day
- 10 page views per person
- 200,000 reports run
Again, looking at our analytics we know that we can expect nearly 35% of those users and views between 6am and 9am Pacific. That gives us throughput targets:
- 8 requests per second over those three hours
- 24 requests per second at peak
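The sustained figure can be roughly reproduced from the launch-day numbers above. The sketch below assumes the inputs are 22,000 users, 10 page views each, and 35% of that traffic in a three hour window; the exact inputs behind the targets are not spelled out, so the match is approximate. The function name is hypothetical.

```python
def sustained_rate(users, views_per_user, window_share, window_hours):
    """Average requests per second during the busy window."""
    views_in_window = users * views_per_user * window_share
    return views_in_window / (window_hours * 3600)


rate = sustained_rate(users=22_000, views_per_user=10, window_share=0.35, window_hours=3)
print(round(rate, 1))
```

That comes out a little over 7 requests per second, close to the target of 8. The peak target of 24 is three times the sustained target; that multiplier is read off the two numbers, not something stated in the model.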
With realistic performance targets, it’s time to instrument and load the application. This is a deeper topic than I’ll discuss here, but at a high level you want to:
- Put your system under load
- Collect relevant performance data
The diagram below gives a quick overview of the architecture of our performance test, including our load test client, Open Site Explorer (OSE), the Linkscape API (LSAPI), and what we’re measuring in each part of our system. Although the specifics of your system may differ, you’ll want to collect roughly the same data.
There are lots of tools out there to load a system, and discussing them is out of the scope of this post (maybe something for the future). I actually think they’re all missing something, so we’ve written our own load generation tool. But we’ll assume you’ve got a good methodology for loading your application. And we’ll assume the behavior of your tool mirrors real-world user behavior, or at least something close to it.
Eventually you’ll want to load your whole load balanced system (if you’re using a load balancer). But I always like to start with understanding the performance of just one node. That gives you a baseline and a best-case, linear scaling projection for throughput. From there you can see if, and how, your system scales sub-linearly (e.g. because of a database bottleneck).
Caching is another important factor. You’ll want to test both cold and warm cache scenarios. And make sure your test mix includes enough diversity to reflect user behavior! Your users are not going to reload the same page, with the same inputs ten thousand times in a row. They are likely to provide a very long-tailed load of work (very little stuff will be requested with any frequency), so most input permutations will be very infrequent.
Response time (or latency) and throughput usually interact. Typically to maximize one, you’ll have to trade off the other: sure, one node can push 100 requests/second if each request can take 10 seconds to respond. And vice versa, you might be able to get sub-second response time if you process your requests one-at-a-time. Neither of these is desirable.
You’ll want to vary your test to run through different latency-throughput trade-offs. I’ve found that this really boils down to the concurrency of your testing. For one node (server), a concurrency of 1-5 is unrealistic, but should give you as good response time as you can expect. Concurrency of 10-30 per node should give you some more realistic load. Concurrency of 50-100 per node should give you a good idea of what a heavily loaded system looks like. Of course, this all depends on your hardware, configuration, and your application.
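One way to reason about how concurrency ties response time and throughput together is Little's Law, which is brought in here as an aid and is not something this post relies on. For a test where each simulated client sends its next request as soon as the previous one finishes, steady-state throughput is roughly concurrency divided by response time. The function names below are hypothetical.

```python
def throughput(concurrency, response_seconds):
    """Approximate requests per second at steady state (Little's Law)."""
    return concurrency / response_seconds


def required_concurrency(target_rps, response_seconds):
    """Requests that must be in flight to sustain a target rate."""
    return target_rps * response_seconds


# 100 requests/second with 10 second responses means about 1000 requests in flight.
print(required_concurrency(target_rps=100, response_seconds=10))

# A concurrency of 30 with 2.5 second responses yields about 12 requests/second.
print(throughput(concurrency=30, response_seconds=2.5))
```

This is an idealization: once a node saturates, adding concurrency mostly lengthens response times instead of raising throughput.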
The measurements you collect while testing are very important. What you’re looking for are:
- System Characteristics and Bottlenecks (cpu, disk, memory, network)
- Client-Side performance and responses (response time, throughput, HTTP status codes)
- Server-Side errors (application and web server logs)
A deeper discussion of all these factors is valuable, but out of scope for this post. But keep an eye out for a deeper dive in the near future.
Now that you’ve loaded your application, and gathered measurements around system characteristics, client-side performance, and server-side errors, you should have enough data to understand how your system performs. Assuming you don’t have significant error rates (you’re probably looking for 99.95%+ successful requests for a reasonable mix of input), the two most important conclusions you’ll get are the simplest to analyze:
- Median Response time
- Throughput (# of successful requests / test run time)
The first is important because you know that at least 50% of your requests will be faster than this. Of course 50% are slower too, so make sure that response times fall off gracefully from 50% to 90%. The second is also important because it’s going to tell you if you have enough capacity, and how your application will scale. If one node can serve 20 requests/second, then hopefully two nodes can serve close to 40 requests/second, and so on. Of course linear scaling is an ideal which you’ll want to verify with more testing of your load balanced app.
If you’ve tried a few different levels of concurrency, you should have a pretty good idea of the trade-off between response time and throughput. And you’ll have an idea of the min/max values on throughput and response time your system should be able to put out.
In our case we immediately found response time and throughput issues with concurrency above 10. This turned out to be a problem with the PassengerMaxPoolSize which defaults to 6 (!!). A little more testing and we discovered incorrect throttling settings in LSAPI which caused our app to only serve a handful of requests before returning 503 errors. Once we sorted all that out, our database max connections were quickly all used up (another default config to blame).
We also discovered (from per-process CPU usage and disk utilization) that our database is working pretty hard compared to our web servers. This doesn’t impact our launch criteria (those perf targets above). But this is something we’ll want to investigate in the future since scaling the database is non-trivial.
In the end we discovered:
- Significant configuration errors (thread pool sizes, memory usage issues, load balancer configs, throttling, etc.)
- Several application errors (API error response handling, rare corner cases, minor perf problems)
- Median response time of 2.5 seconds under moderate load
- Median response time of 4 seconds under heavy load
- 15-30 pages per second on a single node, depending on caching
- 25-40 pages per second in a two node, load balanced configuration (this is sub-linear scaling)
Those first two bullets are great to catch before launch. Finding these now is why we test at all. The rest put us well within our aggressive performance targets. At this point perf testing (i.e. me) can sign off for launch.
There’s a lot of other analysis you can do from the data you’ve collected. But response time and throughput are the most important factors for launch.
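The single-node and two-node throughput figures above can be turned into a scaling efficiency, which makes "sub-linear" concrete. This sketch pairs the low ends of the two ranges and the high ends with each other, which is an assumption; the function name is hypothetical.

```python
def scaling_efficiency(single_node_rps, cluster_rps, nodes):
    """Fraction of ideal linear scaling actually achieved (1.0 is perfectly linear)."""
    return cluster_rps / (single_node_rps * nodes)


low = scaling_efficiency(single_node_rps=15, cluster_rps=25, nodes=2)
high = scaling_efficiency(single_node_rps=30, cluster_rps=40, nodes=2)
print(round(low, 2), round(high, 2))
```

Under that pairing the two-node setup delivers roughly 67% to 83% of what perfectly linear scaling would give.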
After collecting all that data and matching it up to our targets, we feel confident that we can handle launch. And indeed, launch went smoothly for Open Site Explorer. The final, and very important thing we have to do is to close the loop on measurement and projection with actual results. Within 24 hours of launch we had:
- No significant errors or downtime
- No performance related customer complaints
- 31 thousand users
- 100 thousand reports
Those first two bullets are exactly what I love to see. The bugs we found in testing would have had significant user impact. Fixing them saved the engineering team the stress of live hotfixes, saved customer support a flood of angry complaints, and saved the company embarrassment.
The other bullets speak to the success of our projections. Although total reports were lower than projected, that only means our aggressive projections were indeed aggressive. That’s exactly what you want. We were prepared for more traffic than we had. This is much better than the alternative.
These results, and the customer feedback we collected, tell me we have a compelling product with no significant performance issues.
Inspired by an article at Jane and Robot about domain canonicalization (and the fact that I’m the lead developer on the Linkscape API), I decided I’d write a small application using the Linkscape Free API to help with a common problem: checking canonicalization of website home pages.
I spend a lot of time answering SEOmoz Q&A and I see canonicalization problems come up all the time. No matter what the size of the site, or savvy of the engineering team, this is just an easy problem to miss. But it’s also easy to fix. You just have to find those pesky canonicalization errors.
The tool doesn’t scrape any sites. All of the data is pulled from the Linkscape API. And everything here is possible using only the free API. All of the code is available in my github repository. There’s plenty of documentation there and that’s a good place for any discussion about the code. Feel free to take it and use it in whole or in part on your site in any application. You’ll just have to sign up for a free API key.
I’m planning on pitching an excellent conference series on enterprise computing, operations, and performance. I’m hoping to talk about our cloud infrastructure at SEOmoz for Linkscape. In a nutshell we’re taking advantage of Amazon Web Services:
- For long term storage and backup we use S3
- For batch mode processing we periodically bring up an EC2 cluster
- For serving our API we have an EC2 cluster which we scale up and down
- Our API uses S3 as a high performance, mirrored block device
- For load balancing we use ELB
In a talk at Velocity, I’d like to dig into performance and operational characteristics of our set up, as well as some of the trade-offs:
- What low cost systems do we use for operational and performance monitoring?
- What is the throughput, response time, etc. of our system, end-to-end?
- How do each of the software and cloud infrastructure components contribute to end-to-end performance?
- How do we test the performance of our cloud infrastructure?
- What are some of the cost trade-offs we’ve made between cloud versus a traditional managed hosting, or co-location solution?
I’m curious what other people want to hear more about. So please feel free to make suggestions in the comments.
It’s like a database migration, only… not as well defined.
For some time my wife and I blogged together at twopieceset.blogspot.com. However, that’s always led to an awkward mix of technology, knitting, and kittens (I’m not telling you who’s responsible for each). In the future I’ll try to be true to the stated topic: “Software Engineering and Entrepreneurship”.
Some interesting posts I’ll leave on twopieceset:
- S3 Performance Benchmarks
- Lessons Learned While Indexing the Web
- Performance Measurement for Small and Large Deployments
- Why This Report is So Slow: Let the Database Handle the Data
- High Performance Computing at Amazon: A Cost Study
- Anatomy of Cross Site Request Forgery
- InfoCamp 2007: My Session on Calendaring