# Incident Postmortem
_Outage: 06/07/2022_
# Incident summary
At approximately 2:10pm on July 6th, 2022, the FLG application stopped loading and our internal monitoring systems alerted us that the entire FLG application was down.
The outage occurred because a member of our development team ran a maintenance process incorrectly. As a result, a script was run against the live production database that caused most of the database tables to be dropped.
This action had the direct effect of causing the application to stop responding.
A critical incident was declared, and our team invoked the disaster recovery procedure.
Following initial investigation, it was determined that the correct recovery action was to restore the database from a previous backup.
The FLG database is hosted by AWS and uses point-in-time backups in 5-minute increments, as well as a daily snapshot of the database taken at 3am.
The first action was to attempt to restore the database from the 2:00pm backup, but this failed.
The team then worked backwards in 5-minute increments, attempting to find a backup that would restore successfully, to no avail.
Following the failed restore attempts, the team contacted AWS support directly. AWS informed us that there was an issue with the backups we had been attempting to restore and confirmed that further attempts would not work.
We therefore decided to restore the database from the snapshot taken at 3:00am on the 6th of July, so that the FLG system would be available to the customer base.
This brought the FLG system back online at 9:09pm.
Between the 6th of July and the 11th of July, several further attempts were made to restore the database using the point-in-time backups.
Unfortunately, all subsequent attempts, in conjunction with AWS support, were unsuccessful.
In parallel, we also carried out work on the original database and successfully retrieved the lead and transaction tables, which has enabled us to provide this information to customers when requested.
# Impact
Between 14:00 and 21:09, all users were unable to access the FLG application, access any data, or receive leads.
Any leads that would have been received between 14:00 and 21:09 could not be added to the system.
Users have been able to access leads and partner transactions from between 03:00 and 14:00 in the form of a CSV file, which has been provided on request.
All emails, calls, notes, and other activity associated with these leads were lost.
# Detection
At the point of outage, automated alerts were received from our monitoring services.
The FLG support team immediately noticed the system was down.
Customers notified the FLG support team by both telephone and email.
# Response
Once the notifications arrived, the incident team carried out an immediate initial investigation.
As a result of that initial investigation, a critical incident was declared and the matter was escalated to invoke the Disaster Recovery Procedure.
# Recovery
In line with the procedure, the first action was to recover the database using the most recent AWS point-in-time backup.
This attempt failed, and further attempts were made to check each backup sequentially to find a working backup.
With support from AWS and other group resources, it was determined that this action would not be successful, and the decision was taken to recover to the snapshot created at 3:00am.
Over the following days, we continued trying to restore the point-in-time backups. These attempts also failed, and we concluded that the backups were not a viable restore point.
We then completed a separate exercise to export the data that was left in the original production database and provided the leads and partner transaction data to customers who requested them.
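The sequential search described above, with the daily snapshot as the fallback, can be sketched roughly as follows. This is an illustration only, not the tooling used during the incident: the function names and the `try_restore` callback are hypothetical, and the choice to stop searching just after the snapshot time is an assumption.

```python
from datetime import datetime, timedelta
from typing import Callable, Iterator, Optional, Tuple


def candidate_restore_points(
    start: datetime, floor: datetime, step: timedelta = timedelta(minutes=5)
) -> Iterator[datetime]:
    """Yield restore targets from `start` back to `floor`, newest first."""
    current = start
    while current >= floor:
        yield current
        current -= step


def find_restorable_point(
    start: datetime, floor: datetime, try_restore: Callable[[datetime], bool]
) -> Optional[datetime]:
    """Return the newest point that restores successfully, or None."""
    for point in candidate_restore_points(start, floor):
        # try_restore is a hypothetical callback that attempts one restore
        # and reports whether it succeeded.
        if try_restore(point):
            return point
    return None


def choose_recovery_target(
    start: datetime, snapshot_time: datetime, try_restore: Callable[[datetime], bool]
) -> Tuple[str, datetime]:
    """Prefer a point-in-time restore; fall back to the daily snapshot."""
    # Assumption: points at or before the snapshot offer nothing extra,
    # so the search stops one step after the snapshot time.
    floor = snapshot_time + timedelta(minutes=5)
    point = find_restorable_point(start, floor, try_restore)
    if point is not None:
        return ("point_in_time", point)
    return ("snapshot", snapshot_time)
```

When every point-in-time attempt fails, as happened here, `choose_recovery_target` returns the snapshot time, which matches the decision taken at 20:30.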
# Remedial actions
Since the incident we have implemented the following measures:
* Stricter access permissions for the accounts used to access the primary database.
* Increased the frequency of regular restore tests of the point-in-time backups to identify any issues going forward. These restore tests are now carried out every 2 weeks.
* An enhanced approval stage, so that every script is checked by another member of the team before it is submitted for processing.
* After further liaison with AWS, we have tested the latest point-in-time backup configuration and files against the recovered version of the database and are satisfied that they are restoring successfully.
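As an illustration of what a regular restore test might check, the sketch below verifies that a restored copy contains the expected tables and that none of them are empty, and decides whether the next two-weekly test is due. It uses an in-memory SQLite database purely as a stand-in so the example is self-contained. The table names, the function names, and the pass criteria are assumptions, not a description of the actual FLG test.

```python
import sqlite3
from datetime import datetime, timedelta
from typing import Iterable, List


def restore_test_due(
    last_run: datetime, now: datetime, interval: timedelta = timedelta(days=14)
) -> bool:
    """Return True when at least `interval` has passed since the last test."""
    return now - last_run >= interval


def verify_restore(conn: sqlite3.Connection, expected_tables: Iterable[str]) -> List[str]:
    """Return a list of problems found in a restored copy (empty means pass)."""
    present = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }
    problems = []
    for table in expected_tables:
        if table not in present:
            problems.append(f"missing table: {table}")
            continue
        # Table names come from our own fixed list, not from user input.
        count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        if count == 0:
            problems.append(f"empty table: {table}")
    return problems


# Stand-in for a restored copy; the table names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leads (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY)")
conn.execute("INSERT INTO leads (id) VALUES (1)")

problems = verify_restore(conn, ["leads", "transactions", "notes"])
```

A check of this kind only confirms that a restore produces a usable database. It does not replace actually performing the restore from the backup files.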
Measures we are going to implement:
* We are looking into additional backup solutions to give FLG additional failover points and reduce reliance on AWS infrastructure.
* Reduce manual actions and adopt more automated, tried-and-tested actions when running routines against the production database.
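One way the approval stage and the move away from manual actions could be automated is a pre-run check like the sketch below. It is a minimal example under stated assumptions: the environment names, the `allow_destructive` flag, and the simple pattern matching are all hypothetical, and a real check would need a proper SQL parser.

```python
import re
from typing import List, Optional

# Statements treated as destructive in this sketch; a real list would be broader.
DESTRUCTIVE = re.compile(r"^\s*(DROP|TRUNCATE)\b", re.IGNORECASE)
DELETE_NO_WHERE = re.compile(r"^\s*DELETE\s+FROM\s+\S+\s*$", re.IGNORECASE)


def destructive_statements(script: str) -> List[str]:
    """Return statements that drop, truncate, or delete every row of a table."""
    found = []
    # Naive split on ';' (does not handle semicolons inside string literals).
    for statement in script.split(";"):
        stmt = statement.strip()
        if DESTRUCTIVE.match(stmt) or DELETE_NO_WHERE.match(stmt):
            found.append(stmt)
    return found


def may_run(
    script: str,
    environment: str,
    author: str,
    approver: Optional[str] = None,
    allow_destructive: bool = False,
) -> bool:
    """Decide whether a script may run against the given environment."""
    if environment != "production":
        return True
    # Every production script needs a reviewer other than its author.
    if approver is None or approver == author:
        return False
    # Destructive statements additionally need an explicit opt-in.
    if destructive_statements(script) and not allow_destructive:
        return False
    return True
```

For example, `may_run("DROP TABLE leads", "production", "alice", approver="bob")` returns `False` until the destructive opt-in is also given, so a script that drops tables cannot reach production on a second person's approval alone.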
# Timeline
All times are BST. Some times are approximate.
* 14:00 - Maintenance process incorrectly run against the production database.
* 14:01 - Automated notifications start to arrive in the technical mailbox.
* 14:10 - Customers start reporting issues.
* 14:12 - Development team members start investigating the issue.
* 14:19 - It is discovered that most of the database tables have been dropped.
* 14:20 - Critical Incident declared and DRP invoked.
* 14:30 to 20:30 - Several attempts to restore the database from point-in-time backups are made, including backups from different times. Each of these fails.
* 17:28 - Support ticket opened with AWS support. AWS investigates why backups fail but is unable to provide us with a solution.
* 20:30 - Decision made to use a snapshot from 3am so we could get the system back online.
* 21:09 - Access restored to the system.
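The key intervals implied by the timeline can be worked out as below. This is only arithmetic on the times listed above, some of which are approximate. The variable names are illustrative, and the "data loss window" is the gap between the 3am snapshot and the incident, part of which was later recovered from the original database (leads and partner transactions).

```python
from datetime import datetime, timedelta

DAY = "2022-07-06"


def at(hhmm: str) -> datetime:
    """Parse a timeline entry (BST, same day) into a datetime."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


snapshot_taken = at("03:00")
script_run = at("14:00")
first_alert = at("14:01")
incident_declared = at("14:20")
access_restored = at("21:09")

time_to_detect = first_alert - script_run          # 0:01:00
time_to_declare = incident_declared - script_run   # 0:20:00
downtime = access_restored - script_run            # 7:09:00
data_loss_window = script_run - snapshot_taken     # 11:00:00
intended_window = timedelta(minutes=5)             # point-in-time increment

print("Time to detect:  ", time_to_detect)
print("Time to declare: ", time_to_declare)
print("Downtime:        ", downtime)
print("Data loss window:", data_loss_window, "(intended:", intended_window, ")")
```

The gap between the intended 5-minute window and the 11 hours actually rolled back is the cost of the point-in-time backups not being restorable.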