Our world has changed in a very short space of time due to the coronavirus pandemic and so has our everyday behaviour. Instead of going to offices to work, pubs to socialise, and shops to buy products, people are doing these activities from home. We have been forced to use technology more than ever to enable these activities to continue and this has led to an increase in demand for online services such as video conferencing, shopping and entertainment.
This is a double-edged sword for companies offering online services that are experiencing increased demand. On one hand those services are more popular than ever and are providing value to their customers like never before. On the other hand the increase in demand can be difficult to deal with and may result in a reduced quality of service or even a complete service outage. Those responsible for such online services face an unfavourable situation: the service runs the highest risk of disappointing customers precisely when it is most in demand. Or to look at it another way, the best service is delivered when it is least in demand. Increased demand is a good problem to have.
What strategies can businesses use to cope with unexpected peaks of demand and, more importantly, what can they learn from this period so that they are better placed to cope with future events?
So how are companies coping? What real-world examples can we see of online services tackling the Good Problem To Have?
In the early stage of the UK coronavirus lockdown, Ocado gathered press attention after deploying a queueing system for their e-commerce site, making all customers wait in a virtual line in order to access the website. The technique was then used by several other online retailers, including B&Q, Next and Boots.
Netflix and Amazon have reduced the bit rate of their streaming video content, potentially affecting image quality.
Not exactly the ideal way to “cope” with an increase in demand but suffering a service outage is unfortunately a common occurrence for the unprepared. Zoom and Microsoft Teams have both suffered several service outages since the onset of the coronavirus pandemic.
But it’s not all bad. Some companies are making the most of this and delighting their customers. Google made some of the enterprise features of its Google Meet video conferencing software available to all G Suite customers. Sainsbury’s rolled out a feature to allow vulnerable customers priority access to delivery slots. The UK Government’s Universal Credit platform experienced a 500% increase in weekly applications, yet managed to cope with this surge in demand.
But why did these online services do these things in response to increased demand? What effect did they have? To answer these questions we have to discuss the concept of “capacity”. Every online service (even one that can “auto-scale”) has a maximum demand that it can cope with; that is its maximum capacity. If demand exceeds that maximum capacity then Bad Things can happen! In the best case, users may experience a slowdown, but if the trend continues the slowdown can turn into run-away behaviour and the entire system can lock up, presenting users with a full service outage.
To explore this concept a little further, I’d like to introduce a business simulation game I used to play called “Lemonade Tycoon”. In this simple game the objective is to turn a profit by making and selling lemonade. You have to buy stock (lemons, cups, sugar), set up a stall, create a recipe, set a price etc. If you get everything right then you create a demand for your delicious lemonade. The customers come flocking to your stand, eager to get a taste of your refreshing beverage. Great stuff, we’re in business! But wait, you can only serve lemonade so fast; it takes time to make the lemonade, dispense it and take payment. So what if you have created so much demand that customers arrive at a faster rate than you can serve them? You end up with the dreaded queue.
Back to our earlier concept, the queue is a Good Problem To Have, but it is a problem nonetheless. Customers are impatient and if you make them wait too long, they will utter an expletive and wander off. This illustrates the issue of disappointing our customers when we are offering them the most value. Wouldn’t it be great to be able to fulfil all of the demand that comes our way whilst not only avoiding disappointed customers but in fact delighting our customers?
When a queue forms in Lemonade Tycoon I don’t have to worry about suffering a service outage; there is no run-away behaviour that can cause my stall to lock up and become unavailable, except maybe if I just can’t take the stress any more and pull down the shutters! As we’ve seen, in the tech world this run-away behaviour is a very real threat, so a queue is used to throttle demand to manageable levels and keep the system operating at a known safe level: keep demand below capacity. This is what Ocado and B&Q have been doing. They have forced all customers to sit in a queue before being allowed into their online shops. This is a great solution for the system itself, keeping it protected from a surge that might destroy it, but it is not great for their customers.
This is similar to what happened with the pandemic lockdown. By staying at home, we are reducing the spread of coronavirus, thereby reducing the number of infections and reducing demand on hospitals. “Flattening the curve” is a phrase we’ve all become familiar with and a queue does exactly that for demand for an online shop. And just like in Lemonade Tycoon, if you make your customers wait too long they will find a different online shop at which to spend their money. This is the unfulfilled demand, as illustrated in diagram 1.
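To make the idea concrete, here is a minimal sketch of a virtual queue sitting in front of a service. It is a toy simulation rather than a description of how any of the retailers above built theirs: the function name, the tick-based timing and the "give up after waiting too long" rule are all assumptions made for illustration.

```python
from collections import deque


def simulate_queue(arrivals, capacity_per_tick, max_wait_ticks):
    """Simulate a virtual queue in front of a service (illustrative model only).

    arrivals          -- number of customers arriving in each tick
    capacity_per_tick -- customers the service can safely admit per tick
    max_wait_ticks    -- ticks a customer will wait before giving up (assumed)

    Returns (admitted_per_tick, unfulfilled_total).
    """
    queue = deque()  # each entry is the tick at which a customer joined
    admitted_per_tick = []
    unfulfilled = 0

    for tick, arriving in enumerate(arrivals):
        queue.extend([tick] * arriving)

        # Impatient customers at the front of the line wander off.
        while queue and tick - queue[0] > max_wait_ticks:
            queue.popleft()
            unfulfilled += 1

        # Never admit more than the service can safely handle.
        admitted = min(capacity_per_tick, len(queue))
        for _ in range(admitted):
            queue.popleft()
        admitted_per_tick.append(admitted)

    # Anyone still waiting when the simulation ends also counts as unfulfilled.
    unfulfilled += len(queue)
    return admitted_per_tick, unfulfilled


if __name__ == "__main__":
    admitted, lost = simulate_queue(
        arrivals=[2, 10, 2, 0, 0, 0], capacity_per_tick=3, max_wait_ticks=2
    )
    print(admitted)  # [2, 3, 3, 3, 2, 0]
    print(lost)      # 1
```

The surge of ten arrivals never reaches the service as a surge; it is spread over several ticks at no more than three per tick. The price is the one customer who waited too long and left.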
So how do we avoid making our customers queue and actually fulfil all our demand? Surely there are ways of protecting system capacity other than a queue?
If a queue forms when demand exceeds capacity then one way to avoid a queue is to ensure that capacity always exceeds demand. In other words: increase capacity. This sounds simple but often is anything but! I will present three ways in which capacity can be increased and look at the pros and cons of each.
Partial service > No service.
If I’m running a lemonade stall and I just can’t serve people fast enough then I could reduce the quality of my service. Instead of freshly squeezed lemonade I could instead serve pre-made lemonade that has sat in the fridge for a few days, or even pre-canned lemonade.
It is now much quicker to serve drinks but I’m still delivering on my promise of cold refreshing lemonade. Aren’t I? Yet again I face the problem of disappointing my customers by delivering them an off-the-shelf commodity rather than a hand-crafted artisanal delight.
In the digital world we can employ a similar strategy of reducing quality. Netflix and Amazon explicitly reduced the quality of their video streams to avoid exceeding capacity. Their press releases explained that this was to protect broadband connections, preserving valuable bandwidth when it is most needed, rather than protecting their own systems. This may well be the case but it may also be that they are trying to protect the capacity of their own systems. Regardless of which explanation is true, the point still holds that a reduction in quality can avoid bursting capacity.
Many online services obey an 80-20 rule insofar as the majority of the value of the service is provided by a relatively small number of its features. There is then a long list of more minor features that aren’t as valuable. Different features also have a different load profile on the underlying infrastructure. Some features consume a lot of compute resources and others less so. The lower value features that also consume a lot of resources are features that could be proactively disabled in order to save capacity as demonstrated in diagram 2. These features typically build upon the core offering by providing a more personalised experience. For an e-commerce website these could be things like recommendations, favourites, shopping history etc. None are required in order to place an order but make the overall experience better for the customer.
An online system that has such features could install “kill switches” to proactively disable these features when the system is approaching capacity; it’s better to provide a partial service than no service at all.
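A kill-switch policy of this kind can be sketched in a few lines. Everything here is hypothetical: the `Feature` fields, the value and load scores, and the 80% threshold are assumptions chosen for illustration, and a real system would take its numbers from monitoring.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    name: str
    value: int          # assumed business-value score, must be greater than 0
    load_cost: int      # assumed load the feature adds, in arbitrary units
    core: bool = False  # core features are never switched off


def features_to_disable(features, current_load, capacity, threshold=0.8):
    """Return the names of optional features to switch off, worst offenders
    first, until the projected load is back under threshold * capacity."""
    safe_load = threshold * capacity
    # Heaviest load per unit of value goes first.
    candidates = sorted(
        (f for f in features if not f.core),
        key=lambda f: f.load_cost / f.value,
        reverse=True,
    )
    disabled = []
    for feature in candidates:
        if current_load <= safe_load:
            break
        disabled.append(feature.name)
        current_load -= feature.load_cost
    return disabled


if __name__ == "__main__":
    shop = [
        Feature("checkout", value=10, load_cost=30, core=True),
        Feature("recommendations", value=2, load_cost=25),
        Feature("favourites", value=1, load_cost=5),
        Feature("shopping_history", value=2, load_cost=15),
    ]
    print(features_to_disable(shop, current_load=95, capacity=100))
    # ['recommendations']
```

Ordering by load per unit of value is just one possible policy. The important design choice is that the list of switchable features is decided calmly in advance, not in the middle of an incident.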
If, at my lemonade stand, I can serve each customer more quickly, then I can serve more customers in any given time. This should in turn reduce the size of the queues I get, assuming demand stays the same. In lemonade terms this amounts to buying a bigger and better juicer!
Other things I could do would be to invest in a contactless payments system to avoid wasting time sorting cash and counting change; contactless payments are much quicker.
So if bigger juicers and contactless payments speed things up at my lemonade stall, how do these things translate to tech? Speeding up an online system essentially amounts to making the back-end of the system run faster. There is often some low hanging fruit (please excuse the pun) to be had here such as improving database performance or hardware upgrades, but in order to obtain a step change in performance it may be necessary to re-write large parts of the system’s code.
Improving back-end performance can bring other benefits beyond increasing system capacity. Firstly a faster back-end will likely be perceived as a faster service in general, improving the user experience. One key technique for achieving a faster system is to simplify things. Reducing complexity, if done properly, results in a system which is easier and cheaper to maintain and operate.
The cumulative effect of these benefits can be transformative, making this a compelling strategy – so what’s the catch?
Rewriting the entire back-end is likely to be an expensive, lengthy and risky endeavour and requires a high degree of technical skill. Without the right engineering thought driving the process there is no guarantee that it will work; it could even end up making things slower! There is also a risk to stability: the old code was battle hardened by years of bug fixes, whereas the new code is not, so it will likely suffer problems for a while until it beds in. This is also the kind of strategy that suffers from diminishing returns; once rebuilt and simplified there is nowhere left to go. Speed improvements cannot be made ad infinitum.
At my lemonade stall I can scale out my operations by getting a bigger stall and hiring some staff. If I have two people serving lemonade at the same rate then I’ve doubled my capacity! I can serve twice as much lemonade in the same time as before. In theory I can just keep on doing this, making the stall bigger, or starting up new stalls, hiring more staff until I’ve increased my lemonade serving capacity to a satisfactory level.
Back in the world of tech there is a very similar concept of scaling, usually done at the infrastructure level. A well designed infrastructure platform should be able to scale out simply by adding more resources (e.g. CPU, memory, storage, network). A highly scalable platform should be able to get close to linear scaling, i.e. if we double the number of resources, we double capacity. Reality is never quite that simple, so a perfectly linear relationship is never actually achievable, but you can get close. If a platform does not scale in this manner then things get more difficult: it will require some rework to enable it to scale out, so this has a similar profile to the work in the previous section, i.e. it can be lengthy, costly, and risky but brings great benefits.
Once an infrastructure platform is inherently scalable, the question of how and when it should be scaled comes down to two factors. The first is the nature of the demand itself and the second is the make-up of the underlying infrastructure.
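One well-known way to reason about why scaling falls short of linear is Amdahl's law, borrowed from parallel computing. It is not something this article relies on, and it is only a rough model for an online platform. It assumes some fixed fraction of the work (a shared database, say) cannot be spread across nodes. The function name and the 5% figure below are illustrative assumptions.

```python
def amdahl_speedup(nodes, serial_fraction):
    """Capacity multiplier predicted by Amdahl's law for `nodes` identical
    nodes, where `serial_fraction` of the work cannot be spread across them."""
    if nodes < 1 or not 0 <= serial_fraction <= 1:
        raise ValueError("need nodes >= 1 and 0 <= serial_fraction <= 1")
    return 1 / (serial_fraction + (1 - serial_fraction) / nodes)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, round(amdahl_speedup(n, serial_fraction=0.05), 2))
```

With a serial fraction of 5%, doubling from one node to two gives roughly 1.9 times the capacity rather than 2, and the gap widens as more nodes are added.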
Shape of demand
The demand on most online systems follows the “double hump” pattern whereby demand surges in the morning, tails off a little over lunch time and then surges again in the afternoon or evening. The humps will be slightly different for consumer or business focused services but the main feature of these humps is that they ramp up and down very gradually and it is usually easy to predict when they will occur and the levels they will rise to, simply by studying what has happened in the past.
Some online services are subject to more “spikey” surges in demand. These surges are typically driven by events and are by their nature unpredictable and sharp. For example there may be a huge news story that has just broken, a significant goal during a football match or even an advert for the service hits prime time TV. These events push users to the service all at around the same time, meaning the demand surge can be several times greater than normal and it can peak within as little as 20 seconds of the event occurring.
The infrastructure underlying the online service will either be physical infrastructure managed by the organisation themselves or it will be running on virtual infrastructure in the cloud. If it’s running in the cloud then there are two main subcategories that I’ll term “classic” and “serverless”.
With physical infrastructure everything needs to be scaled in advance so that there is enough capacity to meet the peak of demand, as shown in diagram 3. The main downside of this is the time it takes to respond to increases in demand. New infrastructure has to be procured and provisioned, by which time it may be too late. Additionally if the increase in demand is temporary, as it may well be for some online services during the pandemic, then the extra kit becomes a sunk cost; once it is no longer required the cost can’t be recouped.
In the “classic” cloud model it is possible to configure the infrastructure to automatically scale up and down to meet demand, and because of the pay-as-you-go cost model of cloud services there is nothing to pay for unused capacity once the system has scaled back down. This kind of “auto-scaling” works really well for the “hump” style surges because the auto-scaling can respond in time. However for the “spikey” surges the auto-scaling usually cannot respond in time, so the spikes will exceed capacity, which at best will disappoint some customers and at worst may trigger a full service outage. To deal with the rapid spikes in demand the general approach is to fall back on the physical strategy and pre-scale when spikes are likely to occur.
In the “serverless” model the concept of infrastructure disappears almost entirely and in turn the scaling of it becomes a trivial issue. A well designed serverless system will just cope with whatever is thrown at it. The whys and wherefores of scaling have been outsourced to the cloud vendor. That said, even serverless technologies sometimes have a “brake” which limits how fast they can scale up, which may be too slow for the fastest demand spikes.
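The difference between humps and spikes can be seen in a toy simulation of a reactive auto-scaler. This is a sketch under stated assumptions, not a model of any particular cloud vendor's scaler: the fixed `step`, the `lag_ticks` provisioning delay and the 20% `headroom` are invented for illustration, and scaling back down is left out for brevity.

```python
def simulate_autoscaling(demand, start_capacity, step, lag_ticks, headroom=1.2):
    """Toy reactive auto-scaler (scale-up only).

    Whenever demand * headroom exceeds the capacity already running or on
    order, `step` more capacity is ordered and arrives `lag_ticks` later.
    Returns the demand that could not be served in each tick.
    """
    capacity = start_capacity
    pending = {}  # tick at which ordered capacity arrives -> amount
    unserved = []

    for tick, load in enumerate(demand):
        capacity += pending.pop(tick, 0)  # ordered capacity comes online
        unserved.append(max(0, load - capacity))

        in_flight = sum(pending.values())
        if load * headroom > capacity + in_flight:
            arrival = tick + lag_ticks
            pending[arrival] = pending.get(arrival, 0) + step

    return unserved


if __name__ == "__main__":
    hump = [100, 110, 120, 130, 140, 150, 160]
    spike = [100, 100, 400, 400, 400, 100, 100]

    print(simulate_autoscaling(hump, start_capacity=130, step=50, lag_ticks=2))
    # [0, 0, 0, 0, 0, 0, 0]
    print(simulate_autoscaling(spike, start_capacity=130, step=50, lag_ticks=2))
    # [0, 0, 270, 270, 220, 0, 0]
    print(simulate_autoscaling(spike, start_capacity=400, step=50, lag_ticks=2))
    # [0, 0, 0, 0, 0, 0, 0]  (pre-scaled ahead of the spike)
```

The gradual hump is absorbed because new capacity arrives before it is needed. The spike is over before the scaler has caught up, and only pre-scaling avoids the unserved demand.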
How much capacity to add?
In order to know how much additional capacity is needed to meet an increase in demand, there needs to be a known level of current capacity, a baseline, and the only way to determine that is to measure it. Evidence is key here. In theory an infrastructure platform may automatically scale itself to meet demand, but does it work in practice? And how high can it go? Maybe there is a bottleneck somewhere which limits how high it can go? The only way to be sure of how much demand can be met is to look historically at the highest level that has previously been met: essentially the “high water mark” of the system. Even then this is no guarantee; online systems typically undergo a lot of change, so a system that evidently had plenty of capacity last month may have unknowingly lost capacity since then.
Once current capacity is known, the next step is to predict what future capacity is required. There are a number of ways to do this but typically it would involve creating a capacity model based on historical usage and then combining this with commercial models and forecasts for increased customer demand. Once combined this should provide a target capacity; the capacity that needs to be provided in order to hit the demand forecasts.
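As a rough illustration, the arithmetic behind a target capacity might look like the sketch below. The formula is an assumption, not a standard: it scales the historical high water mark by a forecast growth figure and then adds a safety margin. The function names, the 25% default headroom and the example numbers are all hypothetical.

```python
def target_capacity(high_water_mark, forecast_growth, headroom=0.25):
    """Capacity to aim for, in the same units as high_water_mark.

    high_water_mark -- highest demand the system has demonstrably served
    forecast_growth -- expected growth from commercial forecasts (0.5 = +50%)
    headroom        -- assumed safety margin on top of the forecast
    """
    return high_water_mark * (1 + forecast_growth) * (1 + headroom)


def additional_capacity_needed(baseline_capacity, target):
    """How much must be added to the measured baseline to hit the target."""
    return max(0.0, target - baseline_capacity)


if __name__ == "__main__":
    target = target_capacity(high_water_mark=2000, forecast_growth=0.5)
    print(target)                                    # 3750.0
    print(additional_capacity_needed(2500, target))  # 1250.0
```

A real capacity model would be richer than this, for example by forecasting each hump or event separately, but the inputs are the same: a measured baseline, historical usage and a demand forecast.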
Once the additional capacity has been provisioned (by whatever means, automated or not), then the next step is to test it, if certainty is required. This is typically done with a “load test” tool. These tools simulate user behaviour in order to create artificial demand. Some tech teams maintain an identical copy of their system specifically for testing purposes such as this. If so, then running the load test tool against such a pre-production environment can give great confidence as to the real capacity of the live system.
If there is no pre-production platform available then there is a choice: either hope that enough capacity is available and that the theory and calculations were right, or run the load test tools against the live system. Testing against live services is a controversial topic in some tech circles but we have observed this technique working well with several of our clients. Load testing against live services brings with it the risk of causing a service outage, and again the possibility of disappointing customers who are using the service at the time. This risk can be mitigated by running the test at a quiet time (usually in the early hours) but the key thing to consider is that the load can be instantly turned off and the system can recover, unlike a real demand surge. So it really comes down to a choice between finding out the capacity limit when real customers are loading up the system, or when artificial customers, that can be instantly turned off, are loading it up. Think of this a bit like a fire drill: practising the procedure at a safe time.
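The core of a load test tool is small enough to sketch. The version below is a deliberately minimal illustration, not a substitute for a real tool: `run_load_test` and its parameters are hypothetical names, and `target` stands in for whatever issues one simulated user request (here a plain Python callable, so that the sketch runs without a network).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(target, total_requests, concurrency):
    """Call `target` total_requests times from `concurrency` worker threads.

    Returns success and failure counts plus the 95th percentile latency in
    seconds. A call counts as a failure if it raises an exception.
    """
    def one_request(_):
        started = time.perf_counter()
        try:
            target()
            ok = True
        except Exception:
            ok = False
        return ok, time.perf_counter() - started

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_request, range(total_requests)))

    latencies = sorted(latency for _, latency in results)
    successes = sum(1 for ok, _ in results if ok)
    return {
        "successes": successes,
        "failures": total_requests - successes,
        "p95_latency": latencies[int(0.95 * (len(latencies) - 1))],
    }


if __name__ == "__main__":
    def fake_request():
        time.sleep(0.01)  # stand-in for a real call to the system under test

    print(run_load_test(fake_request, total_requests=50, concurrency=5))
```

In practice the test is repeated at increasing levels of concurrency until failures appear or latency becomes unacceptable. The level reached just before that point is the measured capacity.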
The tech vacuum
We have concentrated here on the technology behind online systems, but the technology does not exist in a vacuum, so what else needs to be scaled up? For example, for a retail business it might be fantastic to be able to withstand any number of customers hitting the website and placing orders, but what about fulfilment? Supply chain? Shipping? Contact centre? These more “physical” aspects of a business need to be scaled up too in order to meet the increased demand. If the website is taking a million orders per day but only 100,000 orders can be fulfilled per day then again we’re looking at disappointed customers, who are then going to bombard the contact centre with complaints.
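Put another way, the capacity of the whole business is set by its slowest stage. The sketch below uses the website and fulfilment figures from the example above; the shipping and contact centre numbers are invented purely for illustration.

```python
def end_to_end_capacity(stage_capacities):
    """Return (bottleneck_stage, capacity) for a chain of stages, where each
    value is the number of orders that stage can handle per day."""
    bottleneck = min(stage_capacities, key=stage_capacities.get)
    return bottleneck, stage_capacities[bottleneck]


if __name__ == "__main__":
    stages = {
        "website": 1_000_000,
        "fulfilment": 100_000,
        "shipping": 150_000,        # hypothetical figure
        "contact_centre": 200_000,  # hypothetical figure
    }
    stage, capacity = end_to_end_capacity(stages)
    print(stage, capacity)               # fulfilment 100000
    print(stages["website"] - capacity)  # 900000 orders a day left unfulfilled
```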
A blended strategy
We have discussed many ways to address the issue of surges in online demand, but what is the best approach? The answer will be different depending on the nature of the online service and the infrastructure that underpins it, but what is universally true is that a blended strategy is the best; that is one that takes several lines of defence. A good strategy will use many of the techniques discussed in this article in order to protect the service under different circumstances.
No matter how well a system is tested and no matter how much capacity is available, there is always the possibility that the system can be overwhelmed. In these cases there should be some emergency mechanisms that can be quickly put into place:
- At its most basic, and the option of last resort, is a holding page that says “sorry we’re offline”. This is better than users seeing an error message, or nothing at all.
- A queue system similar to those we examined earlier. This can throttle demand to keep it within safe limits.
- Kill-switches to disable the heavyweight, low-value features.
Like an emergency shut-off button, one would hope to never have to use these measures, but everyone will be grateful that these defences exist if they ever are required.
While emergency defences can help if a service is unprepared for the demand surge, it is much better to be prepared for it. Capacity planning is an integral part of any strategy and encompasses some of the techniques previously discussed: baselining capacity, forecasting demand and scaling out infrastructure or enabling the automatic scaling. Depending on the underlying platform it may take some time to increase capacity, particularly if physical servers have to be procured and provisioned. Knowing the limits of the current platform and having a plan to increase its capacity means work to scale out can progress quickly as soon as capacity starts to run out.
Issues inherent in scaling infrastructure are much reduced, or even eliminated, by modernising and simplifying the underpinning technology. Running online services in the cloud makes the availability of resources a non-issue but doesn’t necessarily mean the infrastructure can scale well: this still needs to be proven. Going a step further and adopting serverless technologies means even the scaling aspects can become a non-issue, so a purely serverless design maximises scalability with the minimum of operational overhead.
Besides using the best the cloud has to offer, it is worthwhile investing in detailed monitoring so that there is visibility into what the system is doing and how it is coping with demand. As well as a view of current system health, good monitoring also stores historical data which is essential for performing historical analysis in order to try and predict future demand.
Simplifying a system is another key strategy for surviving demand surges. By reducing the number of moving parts within a system it becomes easier to fix and easier to scale. After all, a component that has been removed can no longer fail or be overwhelmed by load, and the components that remain are fewer to worry about when scaling out.
No time like the present
Demand surges have become a very real phenomenon and if the coronavirus pandemic has shown us anything it is that the unexpected can, and does, happen, and that the demand surges will occur whether the online services are ready or not. Being able to meet the immense demands of these surges is no simple task and requires a strategy that addresses long term concerns as well as the ability to implement immediate emergency measures. If you are responsible for an online service ask yourself these questions:
- Do you know the highest level of demand that has hit the service?
- Do you know what the maximum capacity of the service is?
- Can you react quickly in an emergency situation to protect the service?
- How would you scale out the system should it be required? Do you know the bottlenecks of the system?
- Do you have a plan to migrate the system to serverless technologies?
- Have you played Lemonade Tycoon?
The main conclusion is that preparedness is key, and that it’s not too late to put a better strategy in place. The pandemic has caught most of us completely unaware, and a future emergency can and will take a completely different shape.
The best time to plant a tree was 20 years ago; the next best time is today. Gather all the data that you can from the last two months, as well as insights from all of your teams, before it all fades into the past. Putting a robust future strategy together is the only way to learn from this experience and to use this knowledge in a proactive way.
Don’t wait for the next pandemic to catch you unawares!