Load User Status… What’s your Limit?


Clients come to us with an array of problems that need innovative solutions. Recently, one of our clients came to us with an issue: their website was not configured to auto scale. For those who are not familiar, auto scaling ensures that your platform runs the right number of server instances for the traffic or load it is receiving. Because of the industry our client is in, breaking news and events can occur at any moment and drive a sudden surge of traffic to their website. The site could be scaled manually, but that retroactive intervention isn't always fast enough. We set out to build a solution that could absorb sporadic spikes in traffic while successfully maintaining platform uptime.

To do this, we used a multi-server Locust setup to run load tests against the site, which allowed us to exercise a range of slow and fast test scenarios. These tests showed us how much further we could scale instances while maintaining acceptable uptime than we ever could before. Once we had decided on the types of tests we wanted to run, it was time to set our "standard" for how many users to test at once. We ultimately decided to test 100,000 concurrent users, to account for the traffic we see hitting our CDN during high-traffic events. We then collected a list of our 2,500 most requested queries for our test users to issue during the tests.
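To give a concrete sense of what that harness can look like, here is a minimal locustfile sketch. The host, paths, and pacing are placeholders standing in for the 2,500 popular queries mentioned above, not our client's real data; in a distributed run, one Locust master coordinates several workers.

```python
# locustfile.py: a minimal sketch, not our exact test configuration.
# Run standalone with `locust -f locustfile.py --host=https://example.com`,
# or distributed with `locust --master` plus `locust --worker` per generator.
import random

from locust import FastHttpUser, between, task

# Placeholder paths; the real list held ~2,500 of the most requested
# queries, pulled from CDN logs.
POPULAR_PATHS = [
    "/",
    "/news/latest",
    "/search?q=breaking",
]


class SiteVisitor(FastHttpUser):
    # Each simulated user pauses between requests, roughly matching the
    # "one request every 10 seconds per user" pacing we assumed.
    wait_time = between(5, 15)

    @task
    def view_popular_page(self):
        self.client.get(random.choice(POPULAR_PATHS))
```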

We experimented with the slow test first, gradually adding users to simulate a predictable, normal ramp-up. Servers were added as needed, and there was no noticeable impact on how quickly content was served. Overall, pretty dull… which was wonderful! 100k users a minute, and every statistic showed a happy, healthy site.
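For readers wondering how a gradual ramp like that can be scripted rather than driven by hand, here is a sketch using Locust's LoadTestShape hook; the ramp length, hold time, spawn rate, and target user count are illustrative assumptions, not the exact numbers we used.

```python
# slow_ramp.py: a sketch of a gradual ramp profile using Locust's
# LoadTestShape; the numbers here are illustrative, not our real settings.
from locust import LoadTestShape


class SlowRamp(LoadTestShape):
    target_users = 100_000   # where the test ends up
    ramp_seconds = 60 * 60   # stretch the ramp out over an hour
    hold_seconds = 15 * 60   # then hold at the target for a while
    spawn_rate = 50          # users started per second while ramping

    def tick(self):
        run_time = self.get_run_time()
        if run_time > self.ramp_seconds + self.hold_seconds:
            return None  # returning None tells Locust to stop the test
        if run_time > self.ramp_seconds:
            return (self.target_users, self.spawn_rate)
        # Grow the target user count linearly with elapsed time.
        users = int(self.target_users * run_time / self.ramp_seconds)
        return (max(users, 1), self.spawn_rate)
```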

The fast test pushed thousands upon thousands of users onto the site within a few minutes: we went from 0 to 100k users in roughly six minutes. This caused the site to serve 500s while servers were being added. Most of those 500s were absorbed at our cache layer, which served stale content until the requests could be fulfilled. Our findings showed that ramping users up this quickly put a tremendous amount of stress on the servers. Our auto scaling group added two servers every three minutes until we reached nine servers, which prior testing suggested was the appropriate number to handle the load. Once we reached around five servers, the 500s disappeared. After we reached nine servers, the request queue cleared up and latency between our cache and application layers looked nominal. All in all, it took about 20 minutes for the site to return to "normal" at 100k users per minute.
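A simple way to see exactly when those 500s tail off as new servers come online is to bucket Locust's failed requests by minute and line that timeline up against the auto scaling events. The sketch below assumes Locust 2.x's request event and its default behavior of recording 5xx responses as failures; the one-minute bucket size is our own choice.

```python
# failure_timeline.py: a sketch that tallies failed requests per minute so
# the error curve can be compared against when instances were added.
# Assumes Locust 2.x, where the request event passes exception=None on success.
import time
from collections import Counter

from locust import events

failures_per_minute = Counter()


@events.request.add_listener
def track_failures(request_type, name, response_time, response_length,
                   exception, **kwargs):
    # A request that failed outright, or came back as a 5xx under Locust's
    # default handling, arrives here with an exception set.
    if exception is not None:
        failures_per_minute[int(time.time() // 60)] += 1
```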

During the peak of the fast test, when the servers were under the most pressure, we decided to do the unthinkable and clear our Drupal site cache. I know, it sounds like a great idea, but we love a challenge. We pressed the clear-cache button, waited 10 to 15 minutes, and to our surprise nothing happened… the drama we were anticipating never played out. The application saw a slight jump in latency, and the statistics rose or fell by about 5-10% for about a minute before returning to normal. That was it… no fireworks, only the lingering taste of sweet, sweet success! This is due in large part to the site serving anonymous traffic, but cache policies still require frequent invalidation.

Overall, both tests at 100k users are above and beyond the highest traffic we have seen in any hour-long window, thanks to the CDN layer. We were able to successfully complete the 100k-users-a-minute test, with each user making roughly one request every 10 seconds, which works out to something on the order of 10,000 requests per second in aggregate. We found the results of each test relevant because of the scale at which our client can now operate; they can go from a site with minimal traffic to 50x that amount in a short span of time. Problem solved!