SQR-108

Rebuilding GKE Kubernetes clusters#

Abstract

A runbook for destroying and recreating GKE Kubernetes clusters with no data loss and minimal downtime.

Sometimes we need to make a change to a GKE Kubernetes cluster in Google Cloud that requires us to destroy and rebuild the cluster from scratch. One such change is enabling GKE Dataplane V2. This technote describes a way to destroy and recreate the cluster with no data loss and minimal downtime.

Prerequisites#

Backup for GKE#

Make sure that the Backup for GKE API is enabled. This command will exit non-zero if it is not:

$ PROJECT=roundtable-dev-abe2
$ gcloud services list --enabled --project "$PROJECT" | grep gkebackup.googleapis.com
gkebackup.googleapis.com                Backup for GKE API

Make sure that the Backup for GKE cluster addon is enabled in the cluster you are rebuilding. This command will exit non-zero if it is not:

$ CLUSTER_NAME=roundtable-dev
$ PROJECT_ID=roundtable-dev-abe2
$ LOCATION=us-central1
$ gcloud container clusters describe "$CLUSTER_NAME" \
    --project "$PROJECT_ID" \
    --location "$LOCATION" \
    --format json \
    | jq -e .addonsConfig.gkeBackupAgentConfig.enabled

If either of these is not enabled, the best way to enable it is to make a PR to the idf_deploy Terraform config repo and merge it. That way, we won’t forget to enable it if the environment needs to be recreated.

Static IPs#

Every GKE cluster has an instance of ingress-nginx running in it. The Helm chart that installs it creates a LoadBalancer Service that provisions a Google Cloud load balancer to receive traffic from outside the cluster. This load balancer has an IP address attached to it, and we create a DNS record (in AWS Route53 for now) that points to it. This IP address is set in Phalanx as the loadBalancerIP value in the ingress-nginx app. It is important not to change this IP address when we recreate the cluster, or else we would also have to update the DNS records.

By default, this is an ephemeral IP address, which means that it will disappear when the Google Cloud load balancer gets destroyed. The load balancer might get destroyed when we destroy the cluster to rebuild it. This IP address should instead be a Google Cloud static external IP address. The best way to ensure that it is static and not ephemeral is to make sure it is provisioned in our idf_deploy Terraform config repo.

There may be other static IP addresses that we depend on:

  • External Kafka broker access

  • External InfluxDB access

You should ensure these are also static, and not ephemeral, IP addresses in Google Cloud.

If some external IPs are ephemeral, and not static, in Google Cloud, you will probably have to update a DNS A record when the new IP gets provisioned. You should then add config to the idf_deploy repo for that IP address and import it into the Terraform state so it doesn’t get destroyed next time.
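One way to check whether an address is static is to look for it among the project's reserved addresses, since ephemeral addresses do not appear in that list. The following is a minimal sketch, not an established part of this runbook: the helper name is_static_ip is hypothetical, the project ID is the example one used elsewhere in this technote, and 203.0.113.10 is a placeholder documentation address.

```shell
# Example project ID, as used elsewhere in this technote.
PROJECT_ID=roundtable-dev-abe2

# Hypothetical helper: succeeds (exit 0) if the given IP address is a
# reserved (static) address in the project, and fails otherwise.
# Ephemeral addresses are not listed by "gcloud compute addresses list".
is_static_ip() {
  gcloud compute addresses list \
    --project "$PROJECT_ID" \
    --format "value(address)" \
    | grep -qxF "$1"
}

# Usage (203.0.113.10 is a placeholder, not a real cluster address):
#   is_static_ip 203.0.113.10 && echo static || echo "ephemeral or unknown"
```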

Runbook#

Make your change in idf_deploy#

All changes should be made in the idf_deploy repo, so the first step is to make a PR to idf_deploy that contains your change, and make sure the Terraform plan looks good.

Announce downtime and data loss#

You’re about to create a complete backup of the persistent volumes and Kubernetes objects in the cluster. Any changes to data on persistent volumes or Kubernetes objects in the cluster after the backup is made will not be restored into the new cluster. Make sure you notify all of the necessary people of this fact.

Create a backup of the cluster#

Take an on-demand backup of the cluster using Backup for GKE. Every cluster should have a backup plan with the same name as the cluster. Create an on-demand backup using that plan through the web console UI, or by using a command like this:
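A minimal sketch of such a command, assuming the backup plan is named after the cluster (as described above) and lives in the same location as the cluster. The function name backup_cluster and the backup name are hypothetical, and the command group may be available without the beta prefix depending on your gcloud version.

```shell
# Example values, as used elsewhere in this technote.
CLUSTER_NAME=roundtable-dev
PROJECT_ID=roundtable-dev-abe2
LOCATION=us-central1
# Hypothetical backup name; any name that is unique within the plan will do.
BACKUP_NAME="pre-rebuild-$(date +%Y%m%d-%H%M%S)"

# Hypothetical wrapper: create an on-demand backup using the backup plan
# that shares the cluster's name, and wait until the backup finishes.
backup_cluster() {
  gcloud beta container backup-restore backups create "$BACKUP_NAME" \
    --project "$PROJECT_ID" \
    --location "$LOCATION" \
    --backup-plan "$CLUSTER_NAME" \
    --wait-for-completion
}

# Run it with:
#   backup_cluster
```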

This will take several minutes. If you don’t want to have the command wait, omit the --wait-for-completion option. See the on-demand backup docs for more options.

Manually delete the existing cluster#

The Terraform module we’re using tries to create the new cluster before destroying the old cluster. This won’t work because we want the new cluster to have the same name as the old one, so we need to delete the old cluster manually first.

  1. Manually delete all of the Services of type LoadBalancer in the cluster. The GKE cluster deletion docs recommend doing this to ensure that the associated Google Cloud load balancer instances are deleted.

  2. Delete the cluster through the Google Cloud web console UI or with this command:

    $ CLUSTER_NAME=roundtable-dev
    $ PROJECT_ID=roundtable-dev-abe2
    $ LOCATION=us-central1
    $ gcloud container clusters delete "$CLUSTER_NAME" --project "$PROJECT_ID" --location "$LOCATION"
    

    This will take several minutes.
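Step 1 above, which must happen before the cluster is deleted, could be scripted roughly as follows. This is a sketch under assumptions: the function names are hypothetical, and it assumes your kubectl context for the old cluster is named after the cluster, as in the other examples in this technote. Review the output of list_lb_services before deleting anything.

```shell
# Assumed kubectl context name for the cluster being rebuilt.
CONTEXT=roundtable-dev

# Hypothetical helper: print one "namespace name" line for every Service
# of type LoadBalancer in the cluster.
list_lb_services() {
  kubectl --context "$CONTEXT" get services --all-namespaces \
    -o jsonpath='{range .items[?(@.spec.type=="LoadBalancer")]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}'
}

# Hypothetical helper: delete every Service printed by list_lb_services.
delete_lb_services() {
  list_lb_services | while read -r namespace name; do
    kubectl --context "$CONTEXT" --namespace "$namespace" delete service "$name"
  done
}

# Usage:
#   list_lb_services      # inspect what would be deleted
#   delete_lb_services    # delete them
```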

Merge the idf_deploy PR#

This will create a new cluster, including the changes that you made that required destroying and recreating the cluster. This will take several minutes.

Restore the backup#

Create a restore of the on-demand backup of the cluster using Backup for GKE. Every cluster should have a restore plan with the same name as the cluster. Create a restore using that plan through the web console UI, or by using a command like this:
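A minimal sketch of such a command, assuming the restore plan and backup plan are both named after the cluster and live in the same location as the cluster. The function name restore_cluster, the restore name, and the value of BACKUP_NAME are hypothetical; BACKUP_NAME must be replaced with the name of the on-demand backup you actually took. The command group may be available without the beta prefix depending on your gcloud version.

```shell
# Example values, as used elsewhere in this technote.
CLUSTER_NAME=roundtable-dev
PROJECT_ID=roundtable-dev-abe2
LOCATION=us-central1
# Placeholder: replace with the name of the on-demand backup taken earlier.
BACKUP_NAME=pre-rebuild-backup
# Hypothetical restore name; any name that is unique within the plan will do.
RESTORE_NAME="rebuild-$(date +%Y%m%d-%H%M%S)"

# Hypothetical wrapper: restore the backup into the new cluster using the
# restore plan that shares the cluster's name, and wait for it to finish.
restore_cluster() {
  gcloud beta container backup-restore restores create "$RESTORE_NAME" \
    --project "$PROJECT_ID" \
    --location "$LOCATION" \
    --restore-plan "$CLUSTER_NAME" \
    --backup "projects/$PROJECT_ID/locations/$LOCATION/backupPlans/$CLUSTER_NAME/backups/$BACKUP_NAME" \
    --wait-for-completion
}

# Run it with:
#   restore_cluster
```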

This will take several minutes. You can view the progress of the restore in the Google Cloud web console UI.

When the backup is completely restored, you should be able to access the Argo CD instance for the new cluster.

Regenerate local Kubernetes API creds#

Follow the directions in the Phalanx environments page for this cluster to regenerate local API credentials so you can run commands with kubectl. Something like this:
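A sketch of what those directions typically amount to; the Phalanx environments page remains the authoritative source. The function name fetch_credentials is hypothetical, and renaming the generated context to the cluster name is an assumption made only so that the context matches the CONTEXT value used in the commands in this technote.

```shell
# Example values, as used elsewhere in this technote.
CLUSTER_NAME=roundtable-dev
PROJECT_ID=roundtable-dev-abe2
LOCATION=us-central1

# Hypothetical wrapper: write credentials for the new cluster into the local
# kubeconfig, then rename the generated context (gke_<project>_<location>_<cluster>)
# to the short cluster name. The rename is an assumption, not a Phalanx requirement.
fetch_credentials() {
  gcloud container clusters get-credentials "$CLUSTER_NAME" \
    --project "$PROJECT_ID" \
    --location "$LOCATION" \
  && kubectl config rename-context \
    "gke_${PROJECT_ID}_${LOCATION}_${CLUSTER_NAME}" "$CLUSTER_NAME"
}

# Run it with:
#   fetch_credentials
```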

Fix Sasquatch#

When a Kafka cluster is created with a Strimzi Kafka CRD, it gets assigned a random ID. The data in our backed-up persistent volumes will contain a different ID, and the Kafka broker pods will not be able to start because of this. You need to manually change the Strimzi Kafka cluster ID to match the cluster ID in the persistent volume. See this Strimzi discussion about the ID mismatch, and the Strimzi docs for pausing reconciliation for more information.

  1. Look in the logs of the failing Kafka pods. There should be a message that says something like: Exception in thread "main" java.lang.RuntimeException: Invalid cluster.id in: /var/lib/kafka/data-0/kafka-log0/meta.properties. Expected <new-cluster-ID>, but read <old-cluster-ID>. Note the old cluster ID in that message.

  2. Pause the Strimzi reconciliation of the Kafka object by adding an annotation:

    $ CONTEXT=roundtable-dev
    $ kubectl --context "$CONTEXT" --namespace sasquatch \
         annotate Kafka sasquatch strimzi.io/pause-reconciliation="true"
    
  3. Edit the clusterId in the status of the Sasquatch KafkaNodePool controller resource, replacing old-cluster-id with the old cluster ID that you noted in step 1:

    $ CONTEXT=roundtable-dev
    $ kubectl --context "$CONTEXT" --namespace sasquatch \
         patch KafkaNodePool controller \
         --type=merge --subresource status --patch 'status: {clusterId: old-cluster-id}'
    
  4. Resume Strimzi reconciliation by removing the pause annotation:

    $ CONTEXT=roundtable-dev
    $ kubectl --context "$CONTEXT" --namespace sasquatch \
        annotate Kafka sasquatch strimzi.io/pause-reconciliation-
    
  5. Wait for all resources in the sasquatch app to stabilize

  6. Restart any Kafka-dependent workloads in other namespaces if necessary
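Steps 1 and 3 above can be partly scripted. The following is a sketch, not a tested part of this runbook: the function names are hypothetical, it assumes the log message has the shape quoted in step 1, and the pod name in the usage comment is a placeholder. The patch should still only be applied while reconciliation is paused (between steps 2 and 4).

```shell
# Assumed kubectl context name, as in the commands above.
CONTEXT=roundtable-dev

# Hypothetical helper: read Kafka pod logs on stdin and print the old
# cluster ID from any "Invalid cluster.id ... Expected X, but read Y" line.
extract_old_cluster_id() {
  sed -n 's/.*Invalid cluster\.id in: .*Expected [^,]*, but read \([^ ]*\).*/\1/p'
}

# Hypothetical helper: set the clusterId in the status of the KafkaNodePool
# named "controller" to the ID given as the first argument. This sends the
# same merge patch as step 3, written as JSON so the ID is always a string.
patch_cluster_id() {
  kubectl --context "$CONTEXT" --namespace sasquatch \
    patch KafkaNodePool controller \
    --type=merge --subresource status \
    --patch "{\"status\":{\"clusterId\":\"$1\"}}"
}

# Usage (<failing-kafka-pod> is a placeholder for a real pod name):
#   OLD_ID=$(kubectl --context "$CONTEXT" --namespace sasquatch \
#     logs <failing-kafka-pod> | extract_old_cluster_id | head -n 1)
#   patch_cluster_id "$OLD_ID"
```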