While cleaning up unused resources in the Kubernetes cluster, I deleted the ‘local’ namespace. Kubernetes immediately started showing errors; for example, kubectl get nodes returned 0 nodes.
After attempts to recover the cluster failed, I made the unfortunate decision to destroy and recreate it. The new cluster was hooked back up to Flux, which required some handholding to get into a good state. By around 8 PM on Tuesday the 30th, everything was restored except Longhorn and the services that depend on it. After being installed via a Flux HelmRelease, Longhorn would not schedule storage.
After I opened an issue, it was discovered that storageOverProvisioningPercentage was set to 0 instead of the desired value of 100. Setting this value correctly caused Longhorn to operate normally. The backed-up volumes were then restored and, after some of the PersistentVolumeClaims were modified so that Helm could take ownership of them, everything except the Nextcloud volume (and the Nextcloud application) has been restored. That last restore is expected to finish tomorrow morning.
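For reference, the kind of PersistentVolumeClaim change that lets Helm take ownership of an existing resource can be sketched as follows. Helm (3.2 and later) adopts a resource that carries the managed-by label and the two release annotations shown here. The claim name, namespace, release name, size, and storage class are placeholders for illustration, not the values used in this cluster.

```yaml
# Hypothetical PVC: every name below is a placeholder, not taken from the incident.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data
  namespace: example
  labels:
    # Tells Helm that this resource is managed by a Helm release.
    app.kubernetes.io/managed-by: Helm
  annotations:
    # Must match the name and namespace of the release that should own the claim.
    meta.helm.sh/release-name: example-release
    meta.helm.sh/release-namespace: example
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn   # assumed storage class name
  resources:
    requests:
      storage: 10Gi            # assumed size
```

Without the label and annotations, Helm typically refuses to install a release over a resource that already exists, which is why restored claims need this metadata before the chart is applied.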
First, a cleanup was performed without adequate knowledge of what created the resource in the first place.
Second, although great pains were taken to create resources declaratively, several resources had previously been created manually, so their manifests had to be written on the fly and applied during recovery.
Third, while setting up Longhorn I opted to change how it was installed (a Flux HelmRelease as opposed to manually applying a Helm chart). As a result, I misconfigured Longhorn, and that caused the longest single delay in the recovery.
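As an illustration of the Longhorn setting involved, a Flux HelmRelease that pins the over-provisioning percentage might look roughly like the sketch below. It assumes the Longhorn chart's defaultSettings values block and a HelmRepository named longhorn in the flux-system namespace; the interval, names, and namespaces are assumptions, and the API version may differ depending on the Flux release in use.

```yaml
# Sketch only: repository name, namespaces, and interval are assumptions.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: longhorn
  namespace: longhorn-system
spec:
  interval: 10m
  chart:
    spec:
      chart: longhorn
      sourceRef:
        kind: HelmRepository
        name: longhorn          # assumed HelmRepository pointing at the Longhorn chart repo
        namespace: flux-system
  values:
    defaultSettings:
      # The value that was 0 during the incident; 100 was the desired setting.
      storageOverProvisioningPercentage: 100
```

Depending on the Longhorn version, values under defaultSettings may only take effect on a fresh install, so on an existing deployment the setting may also need to be changed through Longhorn itself.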
First, most of the cluster had already been created declaratively. This massively reduced time to recovery (TTR); had that not been the case, full recovery might not have been possible.
Second, all important data was backed up to S3. Although it’s in a format only Longhorn can natively restore, no irreplaceable data was lost.