Slower Access

Incident Report for FLG

Postmortem

What happened?

On Thursday 10th and Friday 11th January, database upgrade work was scheduled outside working hours, early each morning. The upgrade was needed to ensure we run a current version that is stable and receives regular security patches. It also brought some improvements that will allow us to manage our database services better.

Although this was a major version upgrade, we routinely update our databases without any downtime or interruption to services. The work required was similar to a routine upgrade, with some extra planned checkpoints and configuration to make sure that data flowed properly between our servers while the upgrade was midway through. Our expectation was therefore the same here: that the upgrade would complete without any impact on service.

As the load on our servers increased during Thursday morning, it became apparent that the database was not coping well.

After upgrade work, a database server needs some time to optimise its workload and ‘re-learn’ how to cope with the demands on it (warming up). We concluded that this was the problem here and began to run some operations that would allow this learning process to happen faster.

With the metrics looking better early on Friday morning, and with a revised plan for better warming up operations after the upgrade, the remainder of the work was completed.

Although things ran more smoothly on Friday, we again saw higher load and bottlenecks on our database servers, which caused problems using the application.

During Friday it became clearer that the issues were less about the warming up of the servers and more about how the upgraded server was treating some queries quite differently from the previous version. Over the weekend, we made the necessary changes to fix those problems. By Saturday, the database was back to normal operation.

What can we learn?

Our preparations for the upgrade were thorough, with the application carefully tested against the new version and the change log checked for compatibility. However, our plan fell short of benchmarking with real-world traffic loads.

Our assumption was that a new major database version would handle exactly the same traffic load at least as well as, and most likely better than, the older version. This was clearly a mistake. For future major version upgrades, we will benchmark real-world traffic loads on the new version before going through with the upgrade.
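As an illustration only, a benchmark comparison of this kind might be sketched as below. It is a minimal sketch, not the tooling used for FLG: the function names, the choice of the 95th percentile, the 1.2x tolerance, and the sample query names and timings are all hypothetical. It assumes latency samples (in milliseconds) have already been collected for each query by replaying the same traffic against the old and new database versions.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: float) -> float:
    """Return the nearest-rank percentile (0 < pct <= 100) of the samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]


def find_regressions(
    old_ms: Dict[str, List[float]],
    new_ms: Dict[str, List[float]],
    pct: float = 95.0,        # assumed percentile of interest
    tolerance: float = 1.2,   # assumed acceptable slowdown factor
) -> Dict[str, float]:
    """Map each query that got slower than the tolerance to its slowdown ratio."""
    regressions: Dict[str, float] = {}
    for name, old_samples in old_ms.items():
        if name not in new_ms:
            continue  # query was not replayed against the new version
        old_p = percentile(old_samples, pct)
        new_p = percentile(new_ms[name], pct)
        if old_p > 0 and new_p > old_p * tolerance:
            regressions[name] = new_p / old_p
    return regressions


if __name__ == "__main__":
    # Made-up timings purely to show the call shape.
    old = {"lead_history": [10.0, 12.0, 11.0, 13.0]}
    new = {"lead_history": [40.0, 45.0, 42.0, 50.0]}
    for query, ratio in find_regressions(old, new).items():
        print(f"{query}: {ratio:.1f}x slower on the new version")
```

Comparing a high percentile rather than the average is a design choice here: a query that is only occasionally much slower can still cause the kind of bottleneck described above, and an average would tend to hide it.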

Your expectations of us

We know you not only rely on FLG being constantly available, but also need it to be quick and responsive. The work you do on FLG is often time sensitive because it takes place during a live phone call.

We’re genuinely sorry we didn’t live up to your expectations here. We will not only put in place the remedy above, but also continue to review all of our infrastructure tooling and procedures to ensure we deliver the level of service you rightly expect.

Posted Jan 17, 2019 - 14:08 GMT

Resolved

This incident is confirmed as resolved. A postmortem will be posted to this status page in the coming week. Our apologies to you and your team if this impacted your work in FLG.

Posted Jan 12, 2019 - 16:30 GMT

Update

To update you on this incident: upgrade work is complete, but its knock-on effects are still having an impact on performance. Specifically:

- During certain times, all pages load slower than usual (you may see the spinning loading indicator).
- At other times, the application will generally perform well, with only some requests failing, particularly loading history against a lead or running reports.

Some measures have been taken to mitigate the impact on performance:

- Some types of request that typically take longer to run, such as reports run within the browser rather than queued, may not complete within their usual time.
- At times, we will place the service into a state where we restrict some types of request, such as viewing leads over a timeframe of longer than 1 month. Note that if we do this, you can still search any timeframe with a keyword search. Also note that you can change the default period that search leads uses (to 'this month', for example) in preferences, so that you won't see the message each time you go to search leads.

We will provide a post mortem for the incident once it is concluded. The issues are a side effect of maintenance work which was carried out well out of working hours.

The incident will remain open until the application is working 100% normally again. There may be further periods where the service slows down today.

We are very aware that you rely on FLG to be fast and available at all times. Rest assured we're working hard behind the scenes to resolve this issue, and thank you enormously for your patience.

Posted Jan 11, 2019 - 12:44 GMT

Monitoring

Although performance is broadly normal at the moment, there is continuing upgrade work over the next 24 hours to improve the stability and security of FLG's databases.

We will therefore leave this incident open until the work is complete. We are conscious of busy times and peak periods, and the work is planned around them; we intend to keep impact to an absolute minimum. However, we can't guarantee there won't be further periods where performance drops until we close this incident.

Thanks for your patience and understanding.

Posted Jan 10, 2019 - 13:19 GMT

Identified

The application is performing slower than usual at the moment as we carry out an operation on our database. We will let you know as soon as this is complete.

Posted Jan 10, 2019 - 10:43 GMT