Ensuring the Reliability and Durability of the Knowledge Commons Infrastructure

Summary

We tested the backup and restore processes of our digital infrastructure. The results were good and we can recover! We still have some work to do, though, to achieve full resilience.

How do you know, for any infrastructure, that your backups and disaster-recovery procedures work? You can have the most extensive backup procedures in the world, but if you have never tested them, you will not know, with absolute certainty, that you can resurrect things in a disaster.

[Image: a sign for traffic lights, flooded and half underwater, representing disaster.]

Lots of the infrastructure that we use at Knowledge Commons is ephemeral. This means that, as a cloud-native system, we can simply re-create many bits of our infrastructure. For example, our networking stack, security policies, user permissions, and so on are all handled in AWS and could be recreated as equivalents on any other cloud provider. The EC2 boxes that are created as part of the provisioning of our containerised infrastructure are completely disposable hosts for Docker containers.

Obviously, though, we have data that we need to store. In particular, we have three RDS systems; several filesystems in both EFS and EBS systems; and some snapshot backups in S3 buckets. Today, Dimitrios Tzouris (Infrastructure Lead) and I (Martin Paul Eve, Technical Lead) took it upon ourselves thoroughly to test the resilience of our system. The hope was that the majority of backups would operate just fine, but that we would gain useful insight on our points of weakness and how we can improve our reliability.
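
Verifying that these backups exist at all is scriptable. Here is a minimal sketch using boto3, the AWS SDK for Python, to list recent snapshots; the instance identifier is hypothetical rather than one of our real resource names:

    # Sketch: enumerate recent backups for the storage services described above.
    # Requires boto3 and credentials with read-only describe/list permissions.
    import boto3

    rds = boto3.client("rds")
    # Automated snapshots for one RDS instance (identifier is illustrative)
    snapshots = rds.describe_db_snapshots(
        DBInstanceIdentifier="works-postgres",
        SnapshotType="automated",
    )["DBSnapshots"]
    for snap in snapshots:
        print(snap["DBSnapshotIdentifier"], snap["SnapshotCreateTime"])

    ec2 = boto3.client("ec2")
    # EBS snapshots owned by this account
    for snap in ec2.describe_snapshots(OwnerIds=["self"])["Snapshots"]:
        print(snap["SnapshotId"], snap["StartTime"])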

We set several parameters for the test. We were not allowed access to production or dev sites, task definitions, or EFS volumes. We were not allowed access to existing RDS databases. Finally, access to S3 was allowed (but this was identified as a dependency; see below).

Persistent Storage Mechanisms Tested and Recovered:

We can confirm that we were able to restore the following persistent storage components from our backups (restore times in parentheses; a sketch of the kind of restore call involved follows the list):

  • Works Postgres RDS (9 mins)
  • WordPress MySQL RDS (12 mins)
  • Registry RDS (snapshots confirmed)
  • Search snapshot
  • Registry EBS
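
As promised above, here is a sketch of the kind of restore call involved for the RDS databases. It is illustrative only, assuming boto3 and hypothetical identifiers, not a record of our exact procedure:

    # Sketch: restore an RDS instance from its most recent automated snapshot.
    # The instance identifiers and instance class are placeholders.
    import boto3

    rds = boto3.client("rds")

    snapshots = rds.describe_db_snapshots(
        DBInstanceIdentifier="works-postgres",
        SnapshotType="automated",
    )["DBSnapshots"]
    latest = max(snapshots, key=lambda s: s["SnapshotCreateTime"])

    # Restores into a new instance so that any existing one is untouched
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier="works-postgres-restore-test",
        DBSnapshotIdentifier=latest["DBSnapshotIdentifier"],
        DBInstanceClass="db.t3.medium",  # illustrative
    )

The restored instance still needs verifying, by connecting to it and inspecting the data, before the restore can be declared a success.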

Persistent Storage Mechanisms Unable to Be Recovered (All Now Resolved):

We did, though, hit one problem. Although we could see that we had taken EFS snapshots throughout, giving us a robust backup solution, we were initially unable to restore the filesystem of uploads and media from the main Commons site. While this filesystem is crucial, the failure turned out to need only a simple permissions tweak: once we heard back from AWS and applied the small fix, the restore went through just fine. A sketch of the kind of restore call involved follows the list below.

  • EFS dynamic content folder (snapshotted daily)
    • We did not have permission to restore an EFS snapshot in AWS
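
For reference, EFS snapshots of this kind are restored through AWS Backup. The sketch below shows roughly what that looks like with boto3; the ARNs, role, and filesystem ID are placeholders, and the Metadata keys should be checked against the current AWS Backup documentation for EFS restores. The permissions problem described above amounted to the restoring principal lacking the rights to perform this operation:

    # Sketch: restore an EFS recovery point into a brand-new filesystem via
    # AWS Backup. All identifiers below are placeholders.
    import uuid

    import boto3

    backup = boto3.client("backup")

    backup.start_restore_job(
        RecoveryPointArn="arn:aws:backup:...",  # placeholder recovery point ARN
        IamRoleArn="arn:aws:iam::123456789012:role/RestoreRole",  # placeholder
        Metadata={
            "file-system-id": "fs-0123456789abcdef0",  # source filesystem
            "newFileSystem": "true",  # restore into a fresh filesystem
            "Encrypted": "false",
            "CreationToken": str(uuid.uuid4()),
            "PerformanceMode": "generalPurpose",
        },
    )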

Identified Dependencies and Weaknesses:

We identified several dependencies and weaknesses in our backup and restore systems. The most notable of these is a total reliance on AWS: we use their backup services and systems, and if AWS goes down, our infrastructure is currently in difficulty. We will be fixing this in the very near future. Of course, if AWS disappeared off the face of the earth tomorrow, with no warning, we would all have a much bigger problem! Nonetheless, this is something we wish to address, and I have initiated a discussion internally about ensuring that all backups have an offsite counterpart with a different provider.

  • Backups for Works are handled by S3 versioning
    • In case of total catastrophic failure, worldwide, of S3, we would not have backups
    • Very unlikely scenario
  • Backups for RDS are handled by the RDS system
    • Failure of AWS RDS systems would compromise backups
    • There is also a “restore from S3” scenario
    • Very unlikely scenario
  • Backups for EBS root filesystems of EC2 ECS hosts not taken
    • Not needed. Nothing is stored on the root fs
    • No risk
  • Backups for EFS are all stored in AWS snapshots
    • See above for problems during restore, now resolved
  • We do not have backups of task definition JSONs
    • This is not catastrophic, as we can recreate them, but it would radically slow down a restore (see the export sketch after this list)
    • Moderate risk / delay factor
  • We do not have backups of secrets
    • Not catastrophic, as these are all stored securely elsewhere, but they could take a long time to recreate, especially as many secrets are used for environment setup
    • Moderate risk / delay factor
  • Users and security groups (IAM stuff) are not backed up
    • Can all be recreated but will take time
    • Can be addressed by thorough documentation
    • Would not help in the case of a migration to a new platform for restore, anyway
    • Moderate risk / delay factor
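
On the task definitions point, exporting them is straightforward to script, and this is the kind of thing we have in mind. A minimal sketch, assuming boto3; the filename handling is illustrative, and the resulting files would, of course, need to be stored somewhere other than AWS to count as an offsite backup:

    # Sketch: dump every active ECS task definition to a JSON file so that a
    # restore does not depend on recreating them by hand.
    import json

    import boto3

    ecs = boto3.client("ecs")

    paginator = ecs.get_paginator("list_task_definitions")
    for page in paginator.paginate(status="ACTIVE"):
        for arn in page["taskDefinitionArns"]:
            td = ecs.describe_task_definition(taskDefinition=arn)["taskDefinition"]
            filename = f'{td["family"]}-{td["revision"]}.json'  # e.g. works-42.json
            with open(filename, "w") as fh:
                json.dump(td, fh, indent=2, default=str)  # default=str for datetimes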

Conclusions

With huge thanks to Dimitrios for his work setting everything up, our backup processes at Knowledge Commons appear to be robust. We take regular snapshots and recovery backups of all services that store persistent data. There were aspects of this that we still needed to test, and the EFS filesystem recovery is now confirmed as working. We also need to get more of our ephemeral infrastructure documented so that any recovery isn’t delayed by having to experiment to get the infrastructure running (and so that crucial knowledge isn’t held in a single place). Most crucially, we need to break our total dependence on AWS so that, if something catastrophic happened to our access or to their services, recovery would be much easier. We will be discussing this shortly at Knowledge Commons so that we can address that dependency.
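
As an illustration of what breaking that dependence could look like: many non-AWS object stores speak the S3 protocol, so a periodic copy job can reuse the same SDK pointed at a different endpoint. The following is a minimal sketch, assuming a hypothetical S3-compatible provider; the endpoint, credentials, and bucket names are all placeholders:

    # Sketch: copy objects from an AWS S3 bucket to a bucket at a different,
    # S3-compatible provider. All identifiers are placeholders.
    import boto3

    aws_s3 = boto3.client("s3")
    offsite = boto3.client(
        "s3",
        endpoint_url="https://s3.example-provider.com",  # hypothetical provider
        aws_access_key_id="OFFSITE_KEY_ID",              # placeholder credentials
        aws_secret_access_key="OFFSITE_SECRET",
    )

    paginator = aws_s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="kc-backups"):  # illustrative bucket
        for obj in page.get("Contents", []):
            body = aws_s3.get_object(Bucket="kc-backups", Key=obj["Key"])["Body"]
            offsite.upload_fileobj(body, "kc-backups-offsite", obj["Key"])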