Testing the New ID Management System Rollout

Today, we present a blog post from the engineering deck here at Knowledge Commons, exposing the infrastructure underside of our operations…

We are about to make significant changes to the IDMS infrastructure at Knowledge Commons. The IDMS is the ID Management System that our users rely on every day to authenticate and authorize with the Commons. This is a complex technical operation that involves reworking some of the core behind-the-scenes structures that allow our more than fifty thousand members to log in securely.

First, it’s worth briefly outlining why we are doing this. Our current IDMS relies on us maintaining many different login systems manually, which is not sustainable in terms of technical labour. The system to which we are moving, CILogon, gives users the choice of thousands of institutional login points, and also supports Google, ORCID, GitHub, and other third-party services that are open to everyone. This will reduce the number of login systems we have to maintain to just one, while giving all our users thousands of choices.

Making such a radical change to our platform requires careful planning and structure. This is why, on Wednesday, the core Knowledge Commons technical team of Ian Scott, Dimitrios Tzouris, and me (Martin Paul Eve) gathered for a day of test migration and installation. We wanted to simulate exactly what the process of migration would look like: moving from a test platform (which Dimitrios had built) that mirrored our production servers to the new setup.

There are two key stages to this process. The first is to migrate the data out of the old structures and into our new Postgres database, which gives us centralized control over user profiles and login methods. The European team gathered at 0900 hours GMT to begin this stage, and it went remarkably well. Our export and import procedures work exactly as we need them to, and there were no unexpected hitches. It just takes some time. We estimate that the migration takes about three hours, and for that period the site needs to be totally inaccessible (so that the database doesn’t change).

The second part of the test deployment was a little bumpier than we would have liked. We brought up the test containers and quickly found that we had some problems. However, most of the problems that we encountered were caused by subtle inconsistencies between production and the test environment that we were using. That is, we spent our time fixing problems with the test infrastructure itself.

An example: we have a centralized login proxy that handles all incoming callbacks and directs them to the correct application on our stack. However, I had omitted to whitelist the new test infrastructure, so the proxy would not forward to it. (Forwarding happens only to allowed domains, to prevent token exfiltration.) This was an easy oversight to make, but it caused us a significant delay. It would not happen in production, as the main site is already whitelisted.
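As a rough illustration of the allowlist check described above, here is a minimal sketch in Python. It is not our proxy's actual code: the host names, the `ALLOWED_FORWARD_HOSTS` set, and the `is_forward_allowed` function are all hypothetical names invented for this example. The idea is simply that a callback is forwarded only when its target host matches an entry exactly.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist for illustration; the real proxy's configuration
# is not shown in this post.
ALLOWED_FORWARD_HOSTS = {"example.org", "works.example.org"}


def is_forward_allowed(target_url, allowed_hosts=ALLOWED_FORWARD_HOSTS):
    """Return True only for HTTPS targets whose host is exactly on the allowlist."""
    parts = urlsplit(target_url)
    host = (parts.hostname or "").lower()
    # Exact matching (rather than suffix or substring matching) means that
    # look-alike hosts such as example.org.evil.test are rejected.
    return parts.scheme == "https" and host in allowed_hosts
```

In a sketch like this, a new test host is refused until someone adds it to the set, which is the behaviour we ran into on the test rig.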

Another example: I had forgotten that the choice of Django configuration file (development, production, local, GitHub, tests, etc.) is made through a secret environment variable and not as part of the build process. On the test infrastructure, this had been set, incorrectly, to the development configuration instead of mirroring production. Something this small cost us over an hour of debugging before we looked in the most obvious place.
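One way to make this kind of mismatch fail loudly is to check the selected configuration at startup. The sketch below is an assumption-laden illustration, not our real setup: the variable name `APP_CONFIG`, the module paths, and the `resolve_settings_module` helper are all hypothetical.

```python
import os

# Hypothetical variable name and module paths, for illustration only.
ENV_VAR = "APP_CONFIG"
SETTINGS_MODULES = {
    "production": "config.settings.production",
    "development": "config.settings.development",
    "local": "config.settings.local",
    "tests": "config.settings.tests",
}


def resolve_settings_module(environ=os.environ, expected=None):
    """Map the environment variable to a settings module, failing loudly on surprises."""
    name = environ.get(ENV_VAR)
    if name is None:
        raise RuntimeError(f"{ENV_VAR} is not set")
    if name not in SETTINGS_MODULES:
        raise RuntimeError(f"{ENV_VAR}={name!r} is not a known configuration")
    # Optional guard: a rig that is meant to mirror production can pass
    # expected="production" so that a stray "development" value is caught early.
    if expected is not None and name != expected:
        raise RuntimeError(f"{ENV_VAR}={name!r}, but this deployment expects {expected!r}")
    return SETTINGS_MODULES[name]
```

A deployment script could call `resolve_settings_module(expected="production")` before starting the application, so that the wrong value stops the deployment instead of surfacing as an hour of confusing behaviour.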

One final example: obviously we have to put the system into maintenance mode so that users see a message saying, “Sorry we’re down, come back later”. Our systems rely on WordPress and Works being able to reach the API of the IDMS. We had neglected to consider that in the setup we had on the test rig, instead of seeing the API, these systems saw the maintenance mode page! This meant that we could not test the systems without making them publicly available, which of course we were not ready to do at this stage! Ian tracked this down and Dimitrios fixed it extremely quickly. However, this was something we had not anticipated and that we now need to plan for, in advance, when we do the real deployment.
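To illustrate the general shape of the problem, here is a minimal sketch of a maintenance-mode wrapper that lets trusted internal services keep reaching an API while everyone else sees the holding page. This is not the fix that Dimitrios applied, which is not described here. The `/api/` prefix, the `10.0.0.0/8` range, and the class and function names are hypothetical, and the sketch uses a plain WSGI wrapper so that it stays self-contained.

```python
import ipaddress

# Hypothetical values for illustration only.
API_PREFIX = "/api/"
TRUSTED_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]


def bypasses_maintenance(path, remote_addr):
    """True when an internal service is calling the API during maintenance."""
    if not path.startswith(API_PREFIX):
        return False
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # a missing or malformed address is never trusted
    return any(addr in net for net in TRUSTED_NETWORKS)


class MaintenanceMiddleware:
    """WSGI wrapper that serves a 503 page unless the request may bypass it."""

    def __init__(self, app, enabled):
        self.app = app
        self.enabled = enabled

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        remote_addr = environ.get("REMOTE_ADDR", "")
        if self.enabled and not bypasses_maintenance(path, remote_addr):
            body = b"Sorry, we're down. Come back later."
            start_response("503 Service Unavailable", [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return self.app(environ, start_response)
```

Requiring both the API path and a trusted source address keeps the API closed to the public during the maintenance window. Behind a reverse proxy, `REMOTE_ADDR` may be the proxy's own address, so a real deployment would need to decide carefully where the client address comes from.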

On Wednesday, we hit the end of our time allowance and, had it been the real deal, we would have had to roll the system back to its previous state. The main problem we were having was with communication between services: getting KC Works and WordPress to “talk properly” to the IDMS.

This morning, having been out yesterday, I sat down at my desk, changed a few environment variables, pressed the button to redeploy, and the system sprang to life in a working state. That is, had we had fifteen minutes more on Wednesday, we would have succeeded in the deployment. Of course, it may also be that, first thing in the morning at my desk, I was thinking clearly and able to see things that had seemed impossible late on Wednesday afternoon after a long day. Contrary to the myth that AI will save everything, there is very much a human factor of transmitted knowledge and thinking involved in this infrastructural process. Getting it right requires care, thought, communication, and shared expertise.

So where do we go from here? We have produced a write-up of everything we did, everything that went wrong, and everything that went to plan. Next, we will meet as a team to discuss whether we wish to do one more test run or whether we feel we have enough information and competence to proceed to the real thing. It’s a balance between having 100% certainty in a test space that everything will go to plan vs. feeling we know enough to get it right and shipping the new system. My feeling is: it’s actually impossible to get 100% certainty. It just doesn’t exist. It will always be the case that unexpected, unanticipated gremlins can lurk. We just need to feel that, if they surface, we are equipped to deal with them.

The timing of the real thing is also critical. We need to have clear space for the development team immediately after the deployment in case our real users find something that our internal testers did not. We will, after all, be moving from a relatively small, though by no means tiny, group of testers to thousands of users logging in per day. Many of our users depend upon Knowledge Commons for their teaching, for public workshops, for funding documents and more. It is almost inevitable that there will be edge-case bugs that we could not catch in our test rig. So if there are problems, we need to fix them. Fast.

In any case, do please keep your eyes peeled for our rollout for real of this new system. A new login system does not sound exciting. However, this is a huge step forward for us and a really critical piece of the Knowledge Commons infrastructural architecture. So, in a very geeky way, I am extremely excited about this rollout and where we go next.

Featured image by Alain Pham on Unsplash
