On the morning of Wednesday, November 16, 2022, the CILogon team initiated a routine Docker software update on our production infrastructure, after first testing the update in our development and test infrastructure. Unexpectedly, the update caused our production Docker Swarm to lose "quorum," which required us to rebuild the Docker Swarm cluster. For unknown reasons, the cluster rebuild resulted in internal network routing errors for the LDAP virtual network segment, along with other minor issues in some CILogon service containers. Diagnosing the localized Docker Swarm networking problem, in particular, significantly delayed our recovery efforts. Once we had diagnosed the problem, we rebuilt the LDAP virtual network segment and returned CILogon to full service.
The CILogon team regrets any outage and especially regrets an outage of such long duration. We have been planning to migrate from Docker Swarm to Amazon Elastic Kubernetes Service (EKS), Amazon's managed Kubernetes service, which we believe will provide a more reliable and sustainable platform for CILogon operations going forward. The November 16 outage provides additional motivation for us to proceed with that migration.