Kubernetes Outage
Incident Report for Quick Think of Something Witty
Postmortem

Timeline

While cleaning up unused resources in the Kubernetes cluster, I deleted the ‘local’ namespace. Kubernetes immediately began reporting errors (for example, kubectl get nodes returned 0 nodes).

After unsuccessful attempts to recover the cluster, I made the unfortunate decision to destroy and recreate it. The new cluster was hooked back up to Flux, which required some handholding to get into a good state. By around 8 PM on Tuesday the 30th, everything was restored except for Longhorn and the services that depend on it. After Longhorn was installed via a Flux HelmRelease, it would not schedule storage.

After I opened an issue, it was discovered that storageOverProvisioningPercentage was set to 0 instead of the desired value of 100. Setting the value correctly caused Longhorn to operate normally. Backed-up volumes were then restored and, after some of the PersistentVolumeClaims were modified so that Helm could take ownership of them, everything except the Nextcloud volume (and the Nextcloud application) has been restored. The Nextcloud restoration is expected to finish tomorrow morning.
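To illustrate why a value of 0 stops all scheduling, here is a rough Python sketch of the over-provisioning check. It is a simplified model rather than Longhorn's actual code: the function names and the example disk sizes are made up for illustration, and Longhorn applies additional checks (such as a minimum available space threshold) that are left out here.

```python
GIB = 1024 ** 3


def schedulable_limit(storage_maximum, storage_reserved, over_provisioning_pct):
    """Total replica size (bytes) allowed on a disk in this simplified model."""
    usable = storage_maximum - storage_reserved
    return usable * over_provisioning_pct // 100


def can_schedule(replica_size, storage_scheduled, storage_maximum,
                 storage_reserved, over_provisioning_pct):
    """True if a new replica fits under the over-provisioning limit."""
    limit = schedulable_limit(storage_maximum, storage_reserved,
                              over_provisioning_pct)
    return storage_scheduled + replica_size <= limit


# Hypothetical disk: 100 GiB total, 30 GiB reserved, nothing scheduled yet.
disk = dict(storage_scheduled=0, storage_maximum=100 * GIB,
            storage_reserved=30 * GIB)

# With the percentage at 0 the limit is 0 bytes, so no replica ever fits.
blocked = can_schedule(10 * GIB, over_provisioning_pct=0, **disk)

# With the percentage at 100 the limit equals the usable 70 GiB.
allowed = can_schedule(10 * GIB, over_provisioning_pct=100, **disk)

print(blocked, allowed)
```

Under this model, a percentage of 0 multiplies the usable capacity down to nothing, which matches the symptom of Longhorn refusing to schedule any storage.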

Lessons Learned

There were many failures and missteps that led to this event.

First, a cleanup was performed without adequate knowledge of what created the resource in the first place.
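One way to reduce this risk is to look at a resource's metadata for hints about what created it before deleting it. The Python sketch below assumes the resource has been fetched as JSON (for example with kubectl get namespace local -o json) and parsed into a dict. The function name and the sample namespace are hypothetical, and the key list is illustrative rather than exhaustive.

```python
# Metadata keys that commonly indicate which tool manages a resource.
# Illustrative only; other tools use other keys.
PROVENANCE_KEYS = (
    "app.kubernetes.io/managed-by",      # Kubernetes recommended label
    "meta.helm.sh/release-name",         # annotation set by Helm
    "kustomize.toolkit.fluxcd.io/name",  # label set by Flux kustomize-controller
)


def provenance_hints(resource):
    """Return human-readable hints about what may have created a resource."""
    meta = resource.get("metadata", {})
    hints = []
    for field in ("labels", "annotations"):
        for key, value in (meta.get(field) or {}).items():
            if key in PROVENANCE_KEYS:
                hints.append(f"{field}: {key}={value}")
    for owner in meta.get("ownerReferences") or []:
        hints.append(f"owner: {owner.get('kind')}/{owner.get('name')}")
    return hints


# Hypothetical namespace object, shaped like kubectl's JSON output.
sample = {
    "kind": "Namespace",
    "metadata": {
        "name": "example",
        "labels": {"app.kubernetes.io/managed-by": "Helm"},
        "annotations": {"meta.helm.sh/release-name": "example-release"},
    },
}

hints = provenance_hints(sample)
for hint in hints:
    print(hint)
```

An empty result does not mean a resource is safe to delete. It only means none of the listed markers were present, so further investigation is still needed.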

Second, although great pains were taken to create resources declaratively, various resources had previously been created manually, so their declarative definitions had to be written on the fly and applied during recovery.

Third, while setting up Longhorn I opted to change how it was installed (a Flux HelmRelease as opposed to manually applying a Helm chart). As a result, I misconfigured Longhorn, which delayed resolution more than anything else.

Fortunate Circumstances

There were many fortunate circumstances that made a full recovery from the event possible.

First, most of the cluster was created declaratively already. This massively reduced TTR (and had it not been the case, full recovery might not have been possible).

Second, all important data was backed up to S3. Although it’s in a format only Longhorn can natively restore, no irreplaceable data was lost.

Posted Apr 01, 2021 - 01:39 EDT

Resolved
The Kubernetes namespace 'local' was deleted and as a result the cluster is in a bad state. All services on the cluster are affected.
Posted Mar 30, 2021 - 11:00 EDT