FLG - The main platform is currently down – Incident details

All systems operational

The main platform is currently down

Resolved
Major outage
Started almost 3 years ago
Lasted about 7 hours

Affected

Applications & API

Operational from 1:14 PM to 1:14 PM, Major outage from 1:14 PM to 8:09 PM

Main Platform & API

Operational from 1:14 PM to 1:14 PM, Major outage from 1:14 PM to 8:09 PM

Updates
  • Update

# Incident Postmortem

_Outage: 06/07/2022_

# Incident summary

At approximately 2:10pm on July 6th, 2022, the FLG application stopped loading and our internal monitoring systems alerted us that the entire FLG application was down.

The outage occurred because a maintenance process was incorrectly run by a member of our development team: a script was run against the live production database, which caused most of the database tables to be dropped. This had the direct effect of causing the application to stop responding.

A critical incident was declared, and our team invoked the disaster recovery procedure. Following initial investigation, it was determined that the correct recovery action was to restore the database from a previous backup. The FLG database is hosted by AWS and uses 5-minute point-in-time backup increments as well as a daily snapshot of the database, which is taken at 3am.

The first action was to attempt to restore the database from the 2:00pm backup, but this failed. The team then worked backwards in 5-minute increments, attempting to achieve a successful restore, to no avail. Following the failed restore attempts, the team reached out directly to AWS support, who informed us that there was an issue with the backups we had been attempting to restore and confirmed that further attempts would not work.

We therefore made the decision to restore the database from the snapshot taken at 3:00am on the 6th of July, so that the FLG system could be made available to the customer base. This brought the FLG system back online at 9:09pm.

Between the 6th and the 11th of July, several further attempts to restore the database using the point-in-time backups were made. Unfortunately, all of these attempts, made in conjunction with AWS support, were unsuccessful.

In parallel, we also carried out work on the original database and were able to retrieve the lead and transaction tables, which has since enabled us to provide this information to customers on request.

# Impact

Between 14:00 and 21:09 all users were unable to access the FLG application, access any data or receive leads. Any leads that would have been received between 14:00 and 21:09 could not be added to the system.

Leads and partner transactions from between 03:00 and 14:00 have been made available to users as a CSV file, provided on request. All emails, calls, notes, and other activity associated with these leads were lost.

# Detection

At the point of outage, automated alerts were received from our monitoring services. The FLG support team immediately noticed the system was down. Customers notified the FLG support team by both telephone and email.

# Response

Once the notifications arrived, the incident team carried out an immediate initial investigation. As a result of that investigation, a critical incident was declared and the matter was escalated to invoke the Disaster Recovery Procedure.

# Recovery

In line with the procedure, the first action was to recover the database using the most recent AWS point-in-time backup. This attempt failed, and further attempts were made to check each backup sequentially to find a working one. With support from AWS and other group resources, it was determined that this approach would not succeed, and the decision was taken to recover to the snapshot created at 3:00am.

We continued our attempts over the following days to see whether there was any way to restore the point-in-time backups. These attempts also failed, and we concluded that the backups were not a viable restore point.

We then completed a separate exercise to export the data that was left in the original production database, and provided the leads and partner transaction data to customers who requested them.

# Remedial actions

Since the incident we have implemented the following measures:

* Stricter access permissions for the accounts used to access the primary database.
* Increased frequency of regular restore tests of the point-in-time backups, to identify any issues going forward. These restore tests are now carried out every 2 weeks.
* An enhanced approval stage, to ensure all scripts are checked by another member of the team before being submitted for processing.
* After further liaison with AWS, we have tested the latest point-in-time backup configuration and files against the recovered version of the database and are satisfied that they restore successfully.

Measures we are going to implement:

* We are looking into additional backup solutions to give FLG additional failover points and reduce reliance on AWS infrastructure.
* Reduce manual actions, and adopt more automated, tried and tested actions when running routines against the production database.

# Timeline

All times are BST. Some times are approximate.

* 14:00 - Maintenance process incorrectly run against the production database.
* 14:01 - Automated notifications start to arrive in the technical mailbox.
* 14:10 - Customers start reporting issues.
* 14:12 - Development team members start investigating the issue.
* 14:19 - It is discovered that most of the database tables have been dropped.
* 14:20 - Critical incident declared and DRP invoked.
* 14:30 to 20:30 - Several attempts are made to restore the database from point-in-time backups, including backups from different times. Each of these fails.
* 17:28 - Support ticket opened with AWS support. AWS investigates why the backups fail but is unable to provide us with a solution.
* 20:30 - Decision made to use the snapshot from 3am so we can get the system back online.
* 21:09 - Access restored to the system.
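
The figures in the timeline can be cross-checked with some simple date arithmetic. The sketch below is illustrative only and makes no calls to AWS: it takes the times reported above (BST, some approximate), computes the downtime and the gap between the 3:00am snapshot and the incident, and counts the 5-minute point-in-time targets that lie between the two. The names `restore_candidates`, `SNAPSHOT`, `INCIDENT` and `RESTORED` are hypothetical, and stepping all the way back to the snapshot is an assumption made for the example, not a record of which restores were actually attempted.

```python
from datetime import datetime, timedelta

# Times taken from the timeline above (BST, some approximate).
SNAPSHOT = datetime(2022, 7, 6, 3, 0)    # daily snapshot
INCIDENT = datetime(2022, 7, 6, 14, 0)   # maintenance process run
RESTORED = datetime(2022, 7, 6, 21, 9)   # access restored


def restore_candidates(first_target, floor, step=timedelta(minutes=5)):
    """Yield restore targets from first_target back to floor, inclusive.

    Hypothetical helper: it only generates timestamps and does not
    attempt any restore.
    """
    current = first_target
    while current >= floor:
        yield current
        current -= step


downtime = RESTORED - INCIDENT   # how long the application was unavailable
data_gap = INCIDENT - SNAPSHOT   # data written after the snapshot was taken
candidates = list(restore_candidates(INCIDENT, SNAPSHOT))

print(f"Downtime: {downtime}")
print(f"Gap between snapshot and incident: {data_gap}")
print(f"5-minute restore targets in that gap: {len(candidates)}")
```

With the times above, this reports a downtime of 7 hours 9 minutes and an 11-hour gap between the snapshot and the incident, which corresponds to the 03:00 to 14:00 window described under Impact.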

  • Resolved

    We can confirm that the system is now available following this afternoon's outage. We are sorry for any inconvenience this may have caused. We will be reaching out with further details about this incident once our investigation is complete.

  • Update

    We're still exploring options to best minimise the impact that this issue has had on existing and new data. Full details will be provided as soon as possible.

  • Update

    We are continuing to work on a fix for this issue.

  • Update

    The restoration from the most recent backup encountered some issues so we are working on a restore from previous versions. Full details will be provided as soon as possible.

  • Update

    This is still being worked on. An update will be provided before 5 PM.

  • Update

    The backup database instance is still being prepared. We don't currently have a reliable way to provide an estimated time for when this will be functional but we're working to have this ready as soon as possible. Another update will be provided within the next 30 minutes.

  • Update

    The problem has been identified as a configuration issue on the primary database. A backup database instance is being prepared for use. An update will be provided within the next 30 minutes.

  • Identified

    The issue has been identified. We are now working on getting a solution in place as soon as possible.

  • Update

    We are looking into this issue as a top priority and will report back here as soon as possible. Sorry for any problems this is causing.

  • Investigating

    We are currently investigating this issue.