We received an increasing number of reports of slow compression over the past weeks. The service has been steadily growing since we launched in July 2012 and we have been able to deal with the additional server load so far by scaling up the size and number of servers hosted on Amazon Web Services (AWS).
We fully expected that scaling up the servers would be enough to solve the performance issues. However, we did not take into account a limit that prevents more than a certain number of compressions from running simultaneously on a single server. This limit exists to protect the servers from becoming unresponsive. Scaling up the servers alone proved not to be effective enough: we discovered that this protection mechanism still caused the service to be slow during peak hours.
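To illustrate, a protection mechanism like this is often implemented as a semaphore that caps concurrent work per server. The sketch below is hypothetical (the limit, the `compress` stand-in, and the timeout are illustrative, not our actual code), but it shows why adding CPU alone doesn't help once the cap is hit:

```python
import threading
import zlib

# Hypothetical per-server cap; the real value in our service differs.
MAX_CONCURRENT = 4
slots = threading.BoundedSemaphore(MAX_CONCURRENT)

def compress(data: bytes) -> bytes:
    # Stand-in for the real image compression work.
    return zlib.compress(data)

def handle_request(data: bytes):
    # Refuse work beyond the cap instead of overloading the host.
    if not slots.acquire(timeout=0.5):
        return None  # caller would see a "server busy" response
    try:
        return compress(data)
    finally:
        slots.release()
```

With a cap like this, extra requests during peak hours wait or are rejected rather than degrading the whole server, which shows up to users as slowness even when the machine itself has headroom.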
In addition to that, we recently made a switch from AWS Elastic Load Balancer to AWS Application Load Balancer. This new type of load balancer allows us to support upcoming features on our website.
This required us to slightly modify the network setup. Unfortunately we misconfigured it, causing one of the IP addresses used by the load balancer to be unresponsive. The HTTP clients we used to test the new load balancer implement DNS failover and gracefully switched to the other, responsive IP addresses, which caused us to miss the problem during pre-production testing.
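For readers unfamiliar with DNS failover: a client resolves the load balancer's hostname to several IP addresses and, if one doesn't respond, simply tries the next. A minimal sketch of that behavior (the hostname and timeout are illustrative, not our production configuration):

```python
import socket

def connect_with_failover(host: str, port: int, timeout: float = 3.0) -> socket.socket:
    # Resolve every address the name points to; a load balancer
    # typically publishes several A/AAAA records.
    addresses = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    last_error = None
    for family, socktype, proto, _, sockaddr in addresses:
        try:
            # First address that accepts the connection wins.
            return socket.create_connection(sockaddr[:2], timeout=timeout)
        except OSError as err:
            last_error = err  # this IP is unresponsive; try the next one
    raise last_error or OSError(f"no addresses resolved for {host}")
```

A client like this masks a single dead IP completely, which is exactly why our test clients never surfaced the misconfiguration.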
To make things worse, we only noticed the second issue after fixing the first. This led us to announce that the performance issues had been resolved and that the API was stable again, while in fact for many users it wasn't.
We learned that while focusing on one specific problem we overlooked another issue that we did not anticipate. We are now working on ways to make the status of the various subsystems immediately obvious, both to us and to our users.
Thanks for reading. Feel free to comment or ask questions about the recent service disruptions!