CILogon Production cluster outage

Incident Report for CILogon

Postmortem

On the morning of Wednesday, November 16, 2022, the CILogon team initiated a routine Docker software update for our production infrastructure, which we had previously tested in our development and test infrastructure. Unexpectedly, the software update caused our production Docker Swarm to lose "quorum", which required us to rebuild the Docker Swarm cluster. For unknown reasons, the cluster rebuild resulted in internal network routing errors for the LDAP virtual network segment and other minor issues with some CILogon service containers. Diagnosing the localized Docker Swarm networking problem, in particular, significantly delayed our recovery efforts. Once it was diagnosed, we rebuilt the LDAP virtual network segment and returned CILogon to full service.
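For readers unfamiliar with the term, Docker Swarm manager nodes coordinate through Raft consensus, and the swarm can only be managed while a majority of its managers can reach one another. The sketch below illustrates that majority arithmetic. The function names and the manager counts in the usage note are hypothetical, chosen for illustration; this report does not state how many manager nodes the production swarm had.

```python
def quorum_size(managers: int) -> int:
    """Smallest majority of a Raft group with `managers` members."""
    if managers < 1:
        raise ValueError("a swarm needs at least one manager")
    return managers // 2 + 1


def tolerated_failures(managers: int) -> int:
    """How many managers can be unreachable before quorum is lost."""
    return managers - quorum_size(managers)


def has_quorum(managers: int, reachable: int) -> bool:
    """True if the reachable managers still form a majority."""
    return reachable >= quorum_size(managers)


if __name__ == "__main__":
    # Hypothetical manager counts, not CILogon's actual topology.
    for n in (1, 3, 5, 7):
        print(f"{n} managers: quorum={quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Under this arithmetic, a hypothetical three-manager swarm needs two reachable managers, so an update that takes two managers offline at the same moment would leave it without quorum.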

The CILogon team regrets any outage and especially regrets an outage of such long duration. We have been planning to migrate from Docker Swarm to Amazon Elastic Kubernetes Service (EKS), Amazon's managed Kubernetes service, which we believe will provide a more reliable and sustainable platform for CILogon operations going forward. The November 16 outage provides additional motivation for us to proceed with that migration.

Posted Nov 17, 2022 - 10:55 CST

Resolved

All services, including LDAP, are operational. Please contact help@cilogon.org if you experience any authentication issues.

Posted Nov 16, 2022 - 13:51 CST

Monitoring

All non-LDAP services have been restored.

Posted Nov 16, 2022 - 12:16 CST

Update

We are in the process of restarting the services on the cluster. Most services are up and running.

Posted Nov 16, 2022 - 11:30 CST

Update

We are continuing to investigate this issue.

Posted Nov 16, 2022 - 10:53 CST

Investigating

The CILogon cluster in AWS is experiencing a partial outage. We are investigating.

Posted Nov 16, 2022 - 10:42 CST

This incident affected: cilogon.org and registry.cilogon.org.