
The aim of this post is not specifically to shed more light on what went wrong before the launch of SCN, though I do promise to get there. In fact, I hope to do much more: inspire some thinking about the making of load-tests for any big, complex system. I've been through a few of these, and many of you have been through this as well, I guess.

One thing that's always noticeable to me, either when I present my findings or when I read of others' experiences, is this aura of "OMG SCIENCE AT WORK! HYPOTHESES, ISOLATING VARIABLES, STATISTICS...FEAR ALL YE PRODUCT PEOPLE!". I do admit it's kinda satisfying as an engineer to bask in that light. However, when you look at the details, there appears an intricate layer of reasoning, switchbacks, convenient omissions and the like which makes it read more like a novel (and a bad one at times). Why does that happen?

The Unknown

One major reason, I think, is the sheer amount of unknowns you're facing. Even when you have the legacy of an existing system such as the old SDN, numerous questions appear. Here are but a few - some are easy, some are hard.

  • The old system had of course the concept of replies to discussions, and now we added Likes and Shares on top - so how do you estimate the number of such actions? How do these actions affect the number of replies? A safe bet here, which we actually took, is to leave the rate of replies as it is but add likes and shares at a factor of 4x. Why 4x? Because these are much easier actions for the user to take than replying, and so we needed to have SOME factor there. Of course, one could argue, likes and shares would also increase the number of replies - because people would reply just for the sake of being liked. Here you can hopefully see the endless loop of discussion which may surround every little detail. So, you decide, this corner of the system really doesn't matter all that much; you go for some nice factor, telling yourself that it's all fine because you added a lot of load for SOME OTHER FEATURE.
  • And here's another one: looking at the logs from the old system, we were quite surprised at the number of requests for RSS feeds. I'm not an avid fan of the technology myself, and I had to wonder how many of these registrations were actually "active" - how many users registered once to a blog/discussion and have since forgotten about it? In other words, how many would bother to register again in the new system, given the fact that the new system has more modern (and arguably better) functionality which serves a similar purpose: Followed Activity? For this corner of the system, we actually decreased the number of RSS requests compared to the old system, while making sure to set a really high rate for the All/Followed Activity page, which we already knew was quite heavy at times. In this process of bargaining, we always made sure the total number of page-views per hour would sum up to what we calculated as representative of a "busy hour in a busy day" - in the old system (there's a small sketch of this arithmetic below, right after the list). Nice, but is this old hourly rate enough anyway? Maybe not, but then you have to set your baseline SOMEWHERE and start loading from there.
  • The long-tail of content: Some content, such as Oliver Kohl's blog, is more popular than others 😉 and so you look at your first draft of the test and think: maybe I'm actually making it too hard on that poor system...I mean, I'm randomly requesting threads that no one would look for anymore, when actually it might be that 20% of threads are viewed by 80% of the users, and maybe it's even closer to 5% consumed by 95%! If that is indeed the case, my system is gonna smoke that test with its in-memory caches! Results would be wonderful...but if you're like me, you just don't take that path of glory. And you know you're putting some non-realistic load here for the worse, because it evens out somewhere else, where you unknowingly made life way too easy on the system.
  • Product people don't know any better than you: there are many respects in which they actually do know better (as painful as this is to admit), but surprisingly not in this case. I've seen this time and time again: you can't go to someone who's an expert on functionality and demand some numbers. It's hard to even get a realistic usage scenario from them. They just don't think that way, they don't have that info, and those nice little user-stories are totally made-up anyway - and they know it even better than you do. In the rare case they get TOO INTERESTED, however, you face the possibility of someone actually questioning your basic axioms all over again. Of course, you're open to that, but as George Smiley once said in "Tinker Tailor Soldier Spy":
    "TO A POINT".

I could go on in a similar vein forever here, but a pattern does emerge, I guess: there's just a lot you don't know, and nobody's gonna help you. So, you try to strike a balance which FEELS right - to you.
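
To make that bargaining a bit more concrete, here's a minimal sketch of the kind of arithmetic involved. To be clear, the action names and numbers below are invented for illustration - only the method is the one described above: nudge individual actions up or down based on your guesses, then normalize so the hourly total still matches the old system's busy hour.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ScenarioMix {
    public static void main(String[] args) {
        // Requests per busy hour as measured on the old system (made-up numbers).
        Map<String, Double> oldBusyHour = new LinkedHashMap<>();
        oldBusyHour.put("viewThread", 60_000.0);
        oldBusyHour.put("postReply", 4_000.0);
        oldBusyHour.put("rssFeed", 20_000.0);
        oldBusyHour.put("activityPage", 6_000.0);

        // Guesses for the new system: replies unchanged, likes/shares at 4x the
        // reply rate, RSS cut down, Followed Activity bumped up because it's heavy.
        Map<String, Double> newMix = new LinkedHashMap<>(oldBusyHour);
        newMix.put("likeOrShare", newMix.get("postReply") * 4);      // the "4x" guess
        newMix.put("rssFeed", newMix.get("rssFeed") * 0.25);         // assume most feeds are dead
        newMix.put("activityPage", newMix.get("activityPage") * 3);  // load the heavy page hard

        // Normalize so the grand total still equals the old busy-hour total - the baseline.
        double oldTotal = oldBusyHour.values().stream().mapToDouble(Double::doubleValue).sum();
        double newTotal = newMix.values().stream().mapToDouble(Double::doubleValue).sum();
        double scale = oldTotal / newTotal;
        newMix.replaceAll((action, rate) -> rate * scale);

        newMix.forEach((action, rate) ->
                System.out.printf("%-14s %10.0f requests/hour%n", action, rate));
    }
}
```

Every number in there is a judgment call, which is exactly the point: the structure looks rigorous, but the inputs are educated guesses.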

You Don't Have Enough Time - And This Will Never Change

An excuse? For sure, but also kind of a given for load-testing, because of the golden rule of load-tests: "By the time the system is mature & stable enough to test, it's time to deliver already". You can pathetically try to stress a work-in-progress, but you would be hammered on all sides by the bugs, and you cannot compare the previous week's results to this week's anyway. In all probability, you are also busy building that system (someone has to do it), while hoping that your solution does scale as planned - but you don't really know, except for some synthetic micro-benchmarks, which a lot of people won't even bother to run.

This also relates to a rant about optimization: too many people seem to think that optimization is the choice between ArrayList and LinkedList when their list is 5 items long. They don't realize that performance is usually born of architecture. If the system is well-built, then it can already scale, or can be fixed to become so. As for our context, this probably means that during the happy development phase, many people won't know how to code for it or what to test anyway when it comes to this dirty issue of performance.
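
To illustrate the kind of "optimization" I mean, here's a toy timing loop over a 5-item list. It is deliberately naive - no warm-up, no proper benchmark harness like JMH - because the point is precisely that this level of tuning is not where the battle is won:

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class TinyListDemo {
    // Walk a 5-element list a million times and return a dummy sum,
    // so the loop isn't trivially optimized away.
    static long walk(List<Integer> list) {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            for (int value : list) {
                sum += value;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> array = new ArrayList<>(List.of(1, 2, 3, 4, 5));
        List<Integer> linked = new LinkedList<>(List.of(1, 2, 3, 4, 5));

        long t0 = System.nanoTime();
        walk(array);
        long t1 = System.nanoTime();
        walk(linked);
        long t2 = System.nanoTime();

        System.out.printf("ArrayList:  %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("LinkedList: %d ms%n", (t2 - t1) / 1_000_000);
    }
}
```

Whatever the difference turns out to be on your machine, it is noise next to the queries, caches and network hops surrounding that code - which is where the architecture decides whether you scale.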

My & Your Tests are Static, Reality is Not

Given the previous rule about the inherent lack of time, here is one thing that we tend to miss - and I think we missed it here.

You think you came up with some use-cases which describe "a day in the life" of your system. In reality, however, there is always change. Sounds like New-Age talk? Well, you'd better believe it. We based our scenarios on users that pretty much know how to get to stuff - as many did in the latter SDN days. On our launch, however, our users did not know how - for various justified reasons which have nothing to do with performance per se. And so they went to the content browser and clicked on just about every possible combination, with or without search terms, trying to FIND THAT CONTENT ALREADY. And so, a usability issue (which would both need better tooling on our side and might become less of an issue as users grow more accustomed to the current navigation methods) became a performance issue - because it just happened to generate lots and lots of unique queries that are really hard to cache at any level. Granted, this was not the only feature whose usage patterns we did not account for, but it's enough to have just one which hurts - and then it doesn't really matter that you found ten other such biggies just before launch (why just before launch?? see again: "You don't have time" etc.).
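
A rough way to picture why this hurt: a cache only helps when requests repeat. The following sketch is entirely hypothetical - the URL pattern and numbers are made up - but it shows how random combinations of filters, free-text terms and paging drive the best-case hit rate toward zero:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class CacheHitSketch {
    public static void main(String[] args) {
        // Hypothetical content-browser request: every combination of filter,
        // sort order, search term and page number becomes its own cache key.
        List<String> types = List.of("blog", "discussion", "document", "all");
        List<String> sortOrders = List.of("recent", "popular", "relevance");

        Random rnd = new Random(42);
        Set<String> seenKeys = new HashSet<>();
        int requests = 10_000, repeats = 0;

        for (int i = 0; i < requests; i++) {
            String key = "/content?type=" + types.get(rnd.nextInt(types.size()))
                    + "&sort=" + sortOrders.get(rnd.nextInt(sortOrders.size()))
                    + "&q=term" + rnd.nextInt(5_000)     // free text: almost every phrasing is unique
                    + "&page=" + (1 + rnd.nextInt(20));  // users paging around, hunting for content
            if (!seenKeys.add(key)) {
                repeats++;  // only a repeated key could ever be served from a cache
            }
        }
        System.out.printf("best-case cache hit rate: %.2f%%%n", 100.0 * repeats / requests);
    }
}
```

Swap the free-text term for a handful of well-known navigation paths and the hit rate jumps - which is roughly the difference between the users we modelled and the users we got.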

You (and I) are Doing it Wrong

The final point, for today at least, is that even if you came up with a brilliant test set, you probably use your tools in a way that doesn't REALLY match the real world. One case in point: AJAX requests, such as the "More Like This" query on content pages which brought the system to its knees on its first day live. Unless you have a grid of computers at your disposal just aching to act as your test clients, it is much more feasible to just fetch the HTML content of pages instead of running a real browser - which has this nice feature where it loads and runs not just static resources but all the JavaScript as well. Instead, you look at the page and see what the "important" AJAX calls are, so you can mimic these directly. When you miss one "important" call, as was the case here, it can blow up in your face.
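
In practice, that mimicking can be as simple as firing the extra request yourself right after fetching the page. Here's a minimal sketch using the JDK's HTTP client - the host, the /morelikethis path and the parameter are placeholders I made up, not the real SCN endpoints:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PageWithAjax {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String base = "https://example.com";   // stand-in host
        String contentId = "12345";            // hypothetical content id

        // 1) The page itself - this is where a naive HTML-only load test stops.
        HttpResponse<String> page = client.send(
                HttpRequest.newBuilder(URI.create(base + "/thread/" + contentId)).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        // 2) The AJAX call a real browser fires after rendering the page.
        //    Forget to script this and it never gets load-tested - until launch day.
        HttpResponse<String> moreLikeThis = client.send(
                HttpRequest.newBuilder(URI.create(base + "/morelikethis?contentId=" + contentId)).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        System.out.println("page: " + page.statusCode() + ", moreLikeThis: " + moreLikeThis.statusCode());
    }
}
```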

By the way, I don't think there's a magic bullet here: I've twice heard, in the context of Cucumber/Capybara-based automated testing, that one should use HtmlUnit instead of the default Firefox driver. You get a "real" browser core with JS, but don't have to suffer the burden of a full browser - and everything is super-fast and fine. In both instances, when the browser implementation was changed from HtmlUnit to Firefox just as a demonstration for me, the tests failed immediately, at which point the response of all involved was: "So....anyway....".
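
The appeal is easy to see. Something like the sketch below (HtmlUnit, a pure-Java headless browser; the URL is a placeholder and the package name depends on your HtmlUnit version) runs fast, needs no display, and does execute JavaScript - just not with the engine your users' browsers run, which is exactly where the surprises come from:

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HeadlessCheck {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(true);            // run the page's JS, unlike a raw HTML fetch
            webClient.getOptions().setThrowExceptionOnScriptError(false); // real-world pages often trip its JS engine

            HtmlPage page = webClient.getPage("https://example.com/thread/12345");  // placeholder URL
            System.out.println(page.getTitleText());
            // A green run here says little about Firefox or Chrome - which is exactly the trap.
        }
    }
}
```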

Before you get all depressed, I have to say that the picture is not that bad: if you work your way against all odds, you manage to somehow get to production with most of the big pain points already squashed. In a major site like SCN, where people actually care (and I appreciate that every day!), you get the rejects, live with them and fix the roadblocks as fast as you can, so we can argue about features again (which is the best possible state, I guess). Everybody knows that for internal deployments, as messy as they are, users just have to live with it until the situation is fixed...(and so it's sometimes never fixed). For us here, this is of course not an option.

Now, let's see if the patches and fixes give all of you the experience you expect. The error and response-time numbers show me an increasingly better picture, but as we now know... well, these are just numbers.
