Our quest toward modern application architecture

Over the last 18 months, the Commons tech team has been on a quest to modernize our development process and software architecture. Most significantly, we have moved from a workflow where development and testing happen remotely and code updates are made to the live site on the fly, to one where development happens locally and code is deployed in automatically built containers.

What does that mean, and why should Commons users care? The goal of this post is to explain it in terms anyone can understand, and hopefully to give you a bit more insight into the work we’re doing to improve the Commons, even when that work is hard to see from the user’s perspective.

Knowledge Commons runs “on the cloud”. Specifically, we run on AWS (Amazon Web Services). What that means is that we rent computers from Amazon. Somewhere in Virginia there is a room full of computers, and we pay Amazon to use them. We also pay Amazon for a variety of services, such as RDS, their managed database service, and Route 53, their domain name service. Here’s a diagram of our current setup:

Diagram of Knowledge Commons AWS services, including Elastic Container Registry (ECR), Elastic Compute Cloud (EC2), Relational Database service (RDS), and Route 53.

The nested rectangles at the center of the diagram represent layers of the main WordPress application. All of this exists on an “EC2 instance”—a computer we’re renting from AWS. This central application connects to a bunch of AWS services (colored squares), and to other parts of our application such as our Identity Management System. The left part of the diagram shows how our application code, hosted on GitHub, gets packaged into containers and eventually deployed to our main WordPress service. At the bottom of the diagram is the user’s web browser, which interacts with the application after first passing through our firewall (WAF), which does its best to filter out malicious and disruptive traffic, and our application load balancer (ALB), which routes requests to different parts of the system as needed. The ALB also serves the Commons’ SSL certificate and encrypts traffic, which ensures a secure connection between our application and your browser.

This diagram looks complicated, but compared to many modern web applications it’s extremely simple. Many sites have dozens to hundreds (maybe thousands?) of “micro” services that dynamically scale up and down and are coordinated by complex “orchestration” software such as Kubernetes. Our site is still an old-fashioned “monolith”: it mostly runs on a single computer, though it depends on a dozenish other services. In the future, if Knowledge Commons needs to scale much more, we might need to follow a similar path, but for now this simpler architecture still works for us.

At this level of detail, our site one year ago didn’t look that different from today. We use mostly the same services today as we did then and our site still runs on an EC2 instance. The two main differences are (1) how code runs within the instance and (2) how changes to that code get deployed to the site.

Same as above diagram, with highlights for (1) code deployment from GitHub and (2) container structure within EC2

If we zoom in on (1), there are three “containers” running on the EC2 instance:

Zoomed in view of the EC2 instance, showing "nginx container", "cron container", and "app container" as boxes. There is a "shared volume" within the app container that points to the other containers.

The “nginx container”, the “cron container”, and the “app container” correspond to three components of the WordPress site. Nginx is our web server. It receives requests from the browser through the application load balancer and forwards them to the app container, or responds immediately for static assets like images. The app container is the main WordPress application. The cron container is a duplicate of the app container that doesn’t respond directly to requests but instead periodically runs maintenance tasks like sending out email digests or publishing scheduled posts.

Internally, we refer to this re-architecting of the WordPress application as the “containerization project,” after these containers. Previously, our site’s code ran on the “bare” EC2 instance, much as you might run an application on your own computer. Containers create an isolated environment for each service, so that the service has exactly the configuration it needs to run. This is a big advantage of containers: you explicitly specify what your application needs, and you can be confident that it will run in the same environment whether it’s on a developer’s local computer, our staging environment, or the live site.
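To make that a bit more concrete, here’s a rough sketch of how a setup like ours might be described in a Docker Compose file. This is an illustration only, not our actual configuration: the image names, registry, database endpoint, and cron script are all placeholders.

```yaml
# Illustrative sketch only -- not the Commons' actual configuration.
# Image names, the registry, the database endpoint, and the cron script are placeholders.
services:
  app:
    image: example-registry/commons-app:latest    # hypothetical WordPress + PHP-FPM image
    environment:
      WORDPRESS_DB_HOST: example-db.rds.amazonaws.com   # placeholder RDS endpoint
    volumes:
      - wp-files:/var/www/html                    # shared volume holding the WordPress files

  nginx:
    image: example-registry/commons-nginx:latest  # hypothetical web server image
    ports:
      - "8080:80"                                 # receives traffic forwarded by the load balancer
    volumes:
      - wp-files:/var/www/html:ro                 # read-only view of the same files, for static assets
    depends_on:
      - app

  cron:
    image: example-registry/commons-app:latest    # same image as the app container
    command: ["/usr/local/bin/run-wp-cron.sh"]    # placeholder script for scheduled maintenance tasks
    volumes:
      - wp-files:/var/www/html
    depends_on:
      - app

volumes:
  wp-files:                                       # the "shared volume" from the diagram
```

The key point is that each container declares exactly which image and settings it needs, so the same definitions behave the same way on a developer’s laptop, on staging, and on the live site.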

Section (2) depicts not so much a structure as a process:

Depiction of process where code starts on GitHub, is put into containers according to Dockerfiles, and pushed to Elastic Container Registry.

In this diagram, our application code, hosted on GitHub, is packaged into Docker images and pushed to AWS’s Elastic Container Registry. This happens every time we commit changes to our “main” branch, which runs on our staging site, or our “production” branch, which runs on our live site. These images are what run inside the nginx, app, and cron containers discussed above.
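For readers curious what that automation looks like, here is a generic sketch of a GitHub Actions workflow that builds an image and pushes it to Elastic Container Registry. It is not our actual workflow: the registry address, repository name, and Dockerfile name are placeholders, and it assumes AWS credentials have already been configured for the job.

```yaml
# Illustrative sketch of a build-and-push workflow -- not the Commons' actual pipeline.
name: Build and push images
on:
  push:
    branches: [main, production]    # staging and live branches, as described above

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Log in to Amazon ECR
        uses: aws-actions/amazon-ecr-login@v2   # assumes AWS credentials are already configured for the job

      - name: Build and push the app image
        run: |
          # The registry address, repository name, and Dockerfile are placeholders.
          IMAGE=123456789012.dkr.ecr.us-east-1.amazonaws.com/commons-app:${GITHUB_SHA}
          docker build -t "$IMAGE" -f Dockerfile.app .
          docker push "$IMAGE"
```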

Before we transitioned to this containerized architecture, our testers would frequently encounter what we called “dev/prod issues”: problems that appeared during testing but that we knew, or suspected, wouldn’t occur on production. Conversely, issues could appear on production that hadn’t shown up during testing. To some extent this kind of issue is inevitable: testing environments never perfectly replicate production environments. But we now have far better correspondence between our testing and production environments, and these issues occur much less frequently. That means we can improve the site with more confidence and spend less time tracking down spurious issues.

Containerization also facilitates scaling: dynamically increasing our resources to handle spikes in traffic. Many sites, ours included, have struggled in recent months with a deluge of bots (programs that “scrape”, or download, every page of a website) that are used to gather training data for large language models like GPT. Because the Commons hosts a large amount of user-generated content, these bots hit us especially hard. We have measures such as firewall rules to mitigate the problem, but inevitably some new bot comes along and floods us with requests. Currently there is a hard limit on how much traffic we can handle, but we are working toward dynamically scaling by adding instances as traffic exceeds certain thresholds. That scaling is made possible by the containerized nature of the application, since the same containers can be deployed to multiple EC2 instances running in parallel.
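As a rough illustration of what that future setup might look like, here is a CloudFormation-style sketch of a target-tracking scaling policy that adds EC2 instances when average CPU stays above a threshold. This is not our configuration (we haven’t built this yet); the AMI, subnet, instance type, and scaling limits are placeholders.

```yaml
# Illustrative CloudFormation-style sketch of target-tracking scaling -- not our current setup.
Resources:
  WebLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        ImageId: ami-00000000000000000    # placeholder AMI for a container host
        InstanceType: t3.medium           # placeholder instance size

  WebAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "1"
      MaxSize: "4"                        # placeholder scaling limits
      VPCZoneIdentifier:
        - subnet-00000000000000000        # placeholder subnet
      LaunchTemplate:
        LaunchTemplateId: !Ref WebLaunchTemplate
        Version: !GetAtt WebLaunchTemplate.LatestVersionNumber

  ScaleOnCpu:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref WebAutoScalingGroup
      PolicyType: TargetTrackingScaling
      TargetTrackingConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ASGAverageCPUUtilization
        TargetValue: 60                   # add instances when average CPU stays above ~60%
```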

It has taken us the better part of a year to get the containerization project out the door. I won’t bore you with the details of why it was so complex, but trust me, it was hard! We’re excited about the developments this will make possible in the future. Ultimately our goal is to improve the Commons, both in the features we offer and in the reliability and performance of the site, and this project is essential to both of those aims. I hope this window into our development process has been (a little bit?) enlightening and helps all of you better understand what we’re up to.

If you have questions or comments, please leave them below, or you can find me on Mastodon at https://hcommons.social/@mikethicke.