Performance Testing: An OSE Case Study

I’ve spent the last week performance testing Open Site Explorer, which we launched earlier this week.  Using some of the same tools and techniques I’ve described in the past, I discovered plenty of issues.  Below I share the process I followed and the issues I found.  Then I connect that to our actual analytics data from the 24 hours after launch.

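The process I follow is covered in the sections below: objectives, performance targets, gathering measurements, analyzing results, and closing the loop.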

Objectives of Perf Testing

  • Understand performance characteristics of the system:
    • Response Time: how long a user waits for pages to load
    • Throughput: how many pages the system can deliver per second
  • Identify Bottlenecks and Scaling: CPU, memory, network, disk, database, external services, per node and cluster-wide
  • Find bugs under load, including repro scenarios (e.g. under high load, under moderate load, frequency of occurrence)
  • Gain confidence in launch

Performance Targets

Hopefully you can put together realistic, aggressive performance targets:

  • Response Time: how long you want your users to wait for pages to load
  • Sustained Load: how many users and page views you expect to serve in general
  • Peak Load: how many users and page views you expect to serve at peak
  • Uptime or Request Success rate: It’s naive to think that every request will succeed, so plan for failures realistically

Remember, you want these to be realistic, but also aggressive.  What is the very best case scenario for launch?  It’s better to plan for more traffic than you actually expect, than to find yourself short.  Even so, you might want to have standby capacity, or have some kind of scale contingency plan.

Given that our application incorporates data from an external data source (the Linkscape API), we can’t expect sub-second response times.  But page loads should not take more than two and a half seconds in general.

David, our ops guy, Scott in marketing, and I sat down and put together a simple launch model to predict our load.  We anticipate we’ll have significantly more traffic at launch than we will in the near future after launch.  We use a lot of data we already have:

  • analytics on past tool launches
  • analytics from our blog and site in general
  • estimates on partner promotion reach
  • generous conversion rate estimation (click-throughs from any promotion to the tool itself)

We expect:

  • 22,000 users on launch day
  • 10 page views per person
  • 200,000 reports run

Again, looking at our analytics we know that we can expect nearly 35% of those users and views between 6am and 9am Pacific.  That gives us throughput targets:

  • 8 requests per second over those three hours
  • 24 requests per second at peak
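The arithmetic behind those targets can be sketched roughly as follows. This is a back-of-the-envelope version, not our actual launch model: it assumes one page view equals one request, and the 3x peak factor is inferred from the ratio between the two targets rather than taken from our analytics. The plain calculation lands at about 7 requests per second, which is in the same range as the 8 above; rounding up to 8 and tripling gives 24.

```python
# Launch-day inputs taken from the projections above.
USERS = 22_000
PAGE_VIEWS_PER_USER = 10
MORNING_SHARE = 0.35          # "nearly 35%" between 6am and 9am Pacific
WINDOW_SECONDS = 3 * 60 * 60  # the three-hour morning window

# Assumption (not from the post): peak load is about 3x the window average,
# which matches the ratio between the 24 and 8 requests/second targets.
PEAK_FACTOR = 3


def sustained_rate(users, views_per_user, share, window_seconds):
    """Average requests/second if `share` of all page views land in the window."""
    return users * views_per_user * share / window_seconds


avg = sustained_rate(USERS, PAGE_VIEWS_PER_USER, MORNING_SHARE, WINDOW_SECONDS)
peak = avg * PEAK_FACTOR
print(f"sustained: {avg:.1f} req/s, peak: {peak:.1f} req/s")
```

Keeping the inputs as named constants makes it easy to rerun the model when marketing revises an estimate.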

Gather Performance Measurements

With realistic performance targets, it’s time to instrument and load the application.  This is a deeper topic than I’ll discuss here, but at a high level you want to:

  1. Put your system under load
  2. Collect relevant performance data

The diagram below gives a quick overview of the architecture of our performance test, including our load test client, Open Site Explorer (OSE), the Linkscape API (LSAPI), and what we’re measuring in each part of our system.  Although the specifics of your system may differ, you’ll want to collect roughly the same data.

There are lots of tools out there to load a system, and discussing them is out of the scope of this post (maybe something for the future).  I actually think they’re all missing something, so we’ve written our own load generation tool.  But we’ll assume you’ve got a good methodology for loading your application.  And we’ll assume the behavior of your tool mirrors real-world user behavior, or at least something close to it.

Eventually you’ll want to load your whole load balanced system (if you’re using a load balancer).  But I always like to start with understanding the performance of just one node.  That gives you a baseline and a best-case, linear scaling projection for throughput.  From there you can see whether, and by how much, your system scales sub-linearly (e.g. because of a database bottleneck).

Caching is another important factor.  You’ll want to test both cold and warm cache scenarios.  And make sure your test mix includes enough diversity to reflect user behavior!  Your users are not going to reload the same page, with the same inputs, ten thousand times in a row.  They are likely to provide a very long-tailed load of work: a few inputs will be requested often, and most input permutations will be very infrequent.
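One way to build a long-tailed test mix is to sample inputs with Zipf-like weights, as in the sketch below. This is only an illustration, not how our own load tool works: the 1/rank weighting, the function name `long_tail_mix`, and the made-up site names are all assumptions. In practice you would want to derive the weights from your own analytics.

```python
import random
from collections import Counter


def long_tail_mix(inputs, n_requests, seed=0):
    """Sample test inputs with Zipf-like weights: the input at rank r gets weight 1/r."""
    weights = [1.0 / rank for rank in range(1, len(inputs) + 1)]
    rng = random.Random(seed)  # seeded so a test run can be repeated
    return rng.choices(inputs, weights=weights, k=n_requests)


# Hypothetical inputs: 1,000 made-up site names standing in for real queries.
sites = [f"example-{i}.com" for i in range(1000)]
workload = long_tail_mix(sites, n_requests=10_000)

counts = Counter(workload)
top_site, top_hits = counts.most_common(1)[0]
rare = sum(1 for c in counts.values() if c <= 5)
print(f"distinct inputs hit: {len(counts)}")
print(f"most popular input: {top_site} ({top_hits} hits)")
print(f"inputs requested 5 times or fewer: {rare}")
```

Replaying the same seeded workload against a cold cache and then a warm one gives two runs that can be compared directly.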

Response time (or latency) and throughput usually interact.  Typically to maximize one, you’ll have to trade off the other: sure, one node can push 100 requests/second if each request can take 10 seconds to respond.  And vice versa, you might be able to get sub-second response time if you process your requests one-at-a-time.  Neither of these is desirable.

You’ll want to vary your test to run through different latency-throughput trade-offs.  I’ve found that this really boils down to the concurrency of your testing.  For one node (server), a concurrency of 1-5 is unrealistic, but should give you as good response time as you can expect.  Concurrency of 10-30 per node should give you some more realistic load.  Concurrency of 50-100 per node should give you a good idea of what a heavily loaded system looks like.  Of course, this all depends on your hardware, configuration, and your application.
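The link between concurrency, response time, and throughput can be sketched with Little's Law (requests in flight = throughput × response time). This is a simplification that assumes a steady state and clients that issue their next request immediately, with no think time. The 30-client example is an illustration, not a measurement.

```python
def throughput(concurrency, response_time_s):
    """Little's Law rearranged: requests/second = requests in flight / seconds per request."""
    return concurrency / response_time_s


def required_concurrency(target_rps, response_time_s):
    """Requests in flight needed to sustain target_rps at a given response time."""
    return target_rps * response_time_s


# The example from the text: 100 requests/second at 10 seconds each
# means 1,000 requests in flight at once.
print(required_concurrency(100, 10))   # 1000

# Assumed illustration: 30 concurrent clients, each waiting 2.5 s per page.
print(throughput(30, 2.5))             # 12.0
```

If measured throughput falls well below what this predicts for a given concurrency, requests are probably queueing somewhere or failing.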

The measurements you collect while testing are very important.  What you’re looking for are:

  • System Characteristics and Bottlenecks (cpu, disk, memory, network)
  • Client-Side performance and responses (response time, throughput, HTTP status codes)
  • Server-Side errors (application and web server logs)

A deeper discussion of all these factors is valuable, but out of scope for this post.  But keep an eye out for a deeper dive in the near future.

Analyze Results

Now that you’ve loaded your application, and gathered measurements around system characteristics, client-side performance, and server-side errors, you should have enough data to understand how your system performs.  Assuming you don’t have significant error rates (you’re probably looking for 99.95%+ successful requests for a reasonable mix of input), the two most important conclusions you’ll get are the simplest to analyze:

  • Median Response time
  • Throughput (# of successful requests / test run time)

The first is important because you know that at least 50% of your requests will be faster than this.  Of course 50% are slower too, so make sure that response times fall off gracefully from 50% to 90%.  The second is also important because it’s going to tell you if you have enough capacity, and how your application will scale.  If one node can serve 20 requests/second, then hopefully two nodes can serve close to 40 requests/second, and so on.  Of course linear scaling is an ideal which you’ll want to verify with more testing of your load balanced app.
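A minimal sketch of that analysis, assuming your load tool can give you a list of (response time, HTTP status) pairs per run. The `summarize` function and the sample data are made up for illustration, and counting any 2xx or 3xx status as a success is an assumption you may want to tighten.

```python
import statistics


def summarize(samples, test_seconds):
    """samples: list of (response_time_seconds, http_status) pairs from one test run."""
    ok_times = sorted(t for t, status in samples if 200 <= status < 400)
    if len(ok_times) < 2:
        raise ValueError("need at least two successful requests")
    # quantiles(n=10) returns 9 cut points; index 8 is the 90th percentile.
    p90 = statistics.quantiles(ok_times, n=10)[8]
    return {
        "success_rate": len(ok_times) / len(samples),
        "median_s": statistics.median(ok_times),
        "p90_s": p90,
        "throughput_rps": len(ok_times) / test_seconds,
    }


# Made-up sample data: nine successful requests and one 503 over a 5 second run.
samples = [(1.0, 200), (1.2, 200), (1.4, 200), (1.6, 200), (1.8, 200),
           (2.0, 200), (2.2, 200), (2.4, 200), (4.0, 200), (9.0, 503)]
report = summarize(samples, test_seconds=5)
print(report)
```

Comparing `median_s` with `p90_s` across runs is a quick way to see whether response times fall off gracefully or drop off a cliff.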

If you’ve tried a few different levels of concurrency, you should have a pretty good idea of the trade-off between response time and throughput.  And you’ll have an idea of the min/max values on throughput and response time your system should be able to put out.

In our case we immediately found response time and throughput issues with concurrency above 10.  This turned out to be a problem with the PassengerMaxPoolSize which defaults to 6 (!!).  A little more testing and we discovered incorrect throttling settings in LSAPI which caused our app to only serve a handful of requests before returning 503 errors.  Once we sorted all that out, our database max connections were quickly all used up (another default config to blame).

We also discovered (from per-process CPU usage and disk utilization) that our database is working pretty hard compared to our web servers.  This doesn’t impact our launch criteria (those perf targets above).  But this is something we’ll want to investigate in the future since scaling the database is non-trivial.

In the end we discovered:

  • Significant configuration errors (thread pool sizes, memory usage issues, load balancer configs, throttling, etc.)
  • Several application errors (API error response handling, rare corner cases, minor perf problems)
  • Median response time of 2.5 seconds under moderate load
  • Median response time of 4 seconds under heavy load
  • 15-30 pages per second on a single node, depending on caching
  • 25-40 pages per second in a two node, load balanced configuration (this is sub-linear scaling)

Those first two bullets are great to catch before launch.  Finding these now is why we test at all.  The rest put us well within our aggressive performance targets.  At this point perf testing (i.e. me) can sign off for launch.

There’s a lot of other analysis you can do from the data you’ve collected.  But response time and throughput are the most important factors for launch.
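One example of that further analysis is putting a number on the sub-linear scaling we saw. The sketch below compares the two-node throughput with an ideal doubling of the single-node figure. Pairing the low ends and the high ends of the two ranges is an assumption, since the ranges above do not say which runs correspond.

```python
def scaling_efficiency(single_node_rps, cluster_rps, nodes):
    """Fraction of ideal linear scaling achieved by an N-node cluster."""
    return cluster_rps / (single_node_rps * nodes)


# Numbers from the results above. Pairing low end with low end and high end
# with high end is an assumption, not something we measured directly.
for label, one_node, two_nodes in [("low end", 15, 25), ("high end", 30, 40)]:
    eff = scaling_efficiency(one_node, two_nodes, nodes=2)
    print(f"{label}: {two_nodes} of an ideal {one_node * 2} pages/s = {eff:.0%}")
```

Under that pairing the second node delivers roughly 67% to 83% of ideal linear scaling, which still clears the 24 requests/second peak target.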

Closing the Loop

After collecting all that data and matching it up to our targets, we feel confident that we can handle launch.  And indeed, launch went smoothly for Open Site Explorer.  The final, and very important thing we have to do is to close the loop on measurement and projection with actual results.  Within 24 hours of launch we had:

  • No significant errors or downtime
  • No performance related customer complaints
  • 31 thousand users
  • 100 thousand reports

Those first two bullets are exactly what I love to see.  The bugs we found in testing would have had significant user impact.  Fixing them saved the engineering team the stress of live hotfixes, saved customer support a flood of angry complaints, and saved the company embarrassment.

The other bullets speak to the success of our projections.  Although we saw more users than the 22,000 we projected, total reports came in at about half of the 200,000 we planned for, so our aggressive projections were indeed aggressive.  That’s exactly what you want.  We were prepared for more traffic than we had.  This is much better than the alternative.

These results, and the customer feedback we collected, tell me we have a compelling product with no significant performance issues.

  1. #1 by Chris Arkwright on January 25th, 2010

    This is an awesome in-depth look at the complexity and quality of testing done to ensure a successful launch of OSE… which, by the way, is amazing. Nice write-up Nick. My compliments to the chef! :P


  2. #2 by Calin Brancus on February 18th, 2010

    Can you provide a relationship between the throughput (pages/second) and the pageviews/second – metric that is widely used for web performance.
    Also what would be the throughput expectation – how many pages/second should a web site handle ?

  3. #3 by Nick Gerner on February 18th, 2010

    All the measurements I’m presenting here are about page views/second for the most expensive page we’ve got (the first page of a report). The expectation should be driven by marketing/product concerns: how many visitors are you expecting to get? What is their behavior? I discuss this a little bit in the “Performance Targets” section.

    From the engineering side, this is probably driven by cost. What’s the cost of a page-view? Hopefully you’ve got a scalable architecture so you can say something like:

    It’ll cost us $300/month to push 20 pages/second, plus ops concerns. If we want to push 40 pages/second, that’ll cost us $600/month because we’ll need to add another box.

    I don’t think I’ve got a rule of thumb in general. Performance varies from application to application so much that it’s hard to say. OSE is pretty data intensive: one report issues five or six API requests for a few thousand rows of data, hits the database a handful of times, and queues up other work for background processing. On the other hand, there are plenty of benchmarks that suggest a naive ruby app that does basically nothing can do ~500 page-views/second on low-end commodity hardware.

    I guess there you go: per box you should get at least 10 or 20 page-views/second at the low-end. But it’s unrealistic to expect 500 or more page-views/second from any real app on the kind of hardware you’re likely to see in production.
