On Thursday 10th and Friday 11th January, database upgrade work was scheduled outside working hours, early each morning. The upgrade was needed to ensure we run a current version that is stable and regularly receives security patches. It also brought improvements that will allow us to manage our database services better.
Although this upgrade was a major version update, we routinely upgrade our databases without any downtime or interruption to service. The work required here was much like a routine upgrade, with some extra planned checkpoints and configuration to make sure data flowed correctly between our servers while the upgrade was in progress. Our expectation was therefore the same: that the upgrade would complete without any impact on service.
As the load on our servers increased during Thursday morning, it became apparent that the database was not coping well.
After upgrade work, a database server needs some time to ‘warm up’: rebuilding its caches and re-learning, through fresh statistics, how best to cope with the demands on it. We concluded that this was the problem here and began running operations to speed up that learning process.
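As a miniature illustration only (the table names and the use of SQLite here are hypothetical, not our actual schema or database engine), the idea behind these warm-up operations is to refresh the optimizer's statistics and replay representative queries so caches are repopulated before real traffic arrives:

```python
import sqlite3

# Illustrative sketch with hypothetical tables, using SQLite purely
# because it ships with Python. The principle is the same on any
# database: after an upgrade, refresh planner statistics and replay a
# representative workload so caches are warm before peak load.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leads (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO leads (status) VALUES (?)",
                 [("new",), ("open",), ("closed",)] * 100)
conn.execute("CREATE INDEX idx_leads_status ON leads (status)")

# Step 1: refresh optimizer statistics after the upgrade.
conn.execute("ANALYZE")

# Step 2: replay a representative workload to warm caches.
warmup_queries = [
    "SELECT COUNT(*) FROM leads WHERE status = 'open'",
    "SELECT * FROM leads ORDER BY id LIMIT 10",
]
for q in warmup_queries:
    conn.execute(q).fetchall()

# After ANALYZE, SQLite's statistics table exists and covers our index.
rows = conn.execute("SELECT tbl FROM sqlite_stat1").fetchall()
print(rows)
```

On a production server the equivalent steps run against the live schema with a workload captured from real traffic, rather than the toy data above.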
With the metrics looking better early on Friday morning, and with a revised plan for warm-up operations after the upgrade, we completed the remainder of the work.
Although things ran more smoothly on Friday, we again saw high load and bottlenecks on our database servers, which caused problems using the application.
During Friday it became clearer that the issues were less about warming up the servers, and more about the upgraded version executing some queries quite differently from how the previous version had. Over the weekend we made the changes needed to fix those problems, and by Saturday the database was back to normal operation.
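The kind of fix involved can be sketched in miniature (again with hypothetical tables and SQLite standing in for the real engine): when a new version picks a slower plan for a query, inspecting the plan and giving the planner a better access path, such as an index, restores performance:

```python
import sqlite3

# Illustrative sketch with hypothetical tables: after a version change,
# a planner may choose a different (slower) plan for the same query.
# Inspecting the plan and adding a suitable index is one typical fix.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (id INTEGER PRIMARY KEY, lead_id INTEGER)")
conn.executemany("INSERT INTO calls (lead_id) VALUES (?)",
                 [(i % 50,) for i in range(500)])

query = "SELECT COUNT(*) FROM calls WHERE lead_id = 7"

# Without an index, the only available plan is a full table scan.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[3]

# Adding an index gives the planner a cheaper access path.
conn.execute("CREATE INDEX idx_calls_lead ON calls (lead_id)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[3]

print(plan_before)  # a SCAN of the whole table
print(plan_after)   # a SEARCH using the new index
```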
Our preparations for the upgrade were thorough; we carefully tested the application against the new version and checked the change log for compatibility. However, our plan fell short in one respect: we did not benchmark the new version under real-world traffic loads.
We assumed that a major new database version would handle exactly the same traffic load at least as well as, and most likely better than, the older version. This was clearly a mistake. For future major version upgrades, we will benchmark representative loads on the new version before going ahead with the upgrade.
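A hypothetical sketch of the kind of pre-upgrade check we now plan to run: replay a captured, representative workload against a candidate server and compare its timings with the current version's baseline before committing to the upgrade. Everything below (the schema, the workload, SQLite itself) is illustrative, not our actual setup:

```python
import sqlite3
import statistics
import time

# Hypothetical benchmark harness: replay a captured workload and
# report the median wall-clock time, so the same workload can be
# compared between the current and candidate database versions.
def replay(conn, workload, repeats=50):
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for q in workload:
            conn.execute(q).fetchall()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leads (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO leads (status) VALUES (?)",
                 [("open",), ("closed",)] * 200)

workload = ["SELECT COUNT(*) FROM leads WHERE status = 'open'"]
baseline = replay(conn, workload)
print(f"median workload time: {baseline:.6f}s")
```

In practice the same harness would run twice, once against each version, with a pass/fail threshold on the regression between the two medians.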
We know you not only rely on FLG being constantly available, but also need it to be fast and responsive. The work you do in FLG is often time-sensitive, frequently carried out during a live phone call.
We’re genuinely sorry we didn’t live up to your expectations here. We will not only put the remedy above in place, but also continue to review all of our infrastructure, tooling and procedures to ensure we deliver the level of service you rightly expect.