
Automating Kubernetes Cost Reductions

Rosie Responding to Actions

If you have not seen parts 1 and 2, I suggest you start there to read about the concept and architecture for this project, as well as how the reminder service works. We are building a Slack bot that will suspend Kubernetes resources outside of business hours, inspired by Cyral’s Just In Time (JIT) access feature.

Another aspect that makes Cyral’s JIT access integration with Slack so convenient is that an admin can approve or deny the data access request directly from the Slack message. No need to open a browser or log in. To create a similar experience in our Kubernetes resource suspension tool, the reminders sent on Slack will include buttons for snooze, suspend and delete.

Reacting to a button press from Slack is straightforward. When you configure your Slack app, you can provide a webhook endpoint for Slack to notify you of the interaction. Because we need to take actions in response to the interactions, we will run the listener inside the same cluster, next to the reminder job.

Validating requests

The endpoint needs to be public so that we can receive messages from Slack’s servers, which means we need to validate that each message really did come from Slack. Each time Slack calls your webhook endpoint, it signs the request using a signing secret generated when you create your app. When the listener service receives a request, it simply needs to compute the signature itself using the same signing secret and make sure it matches the signature provided with the request. So long as you keep the signing secret safe, you can then be sure the request came from your Slack app.
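
As a rough illustration of that check, here is a minimal sketch using only the Python standard library. Slack sends the signature in the X-Slack-Signature header and the timestamp in X-Slack-Request-Timestamp, and signs the string "v0:" + timestamp + ":" + raw request body with HMAC-SHA256. The function names and the five-minute freshness window are my own choices for this example, not part of our listener's actual code.

```python
import hashlib
import hmac
import time

# Reject requests older than five minutes to limit replay attacks.
MAX_REQUEST_AGE_SECONDS = 60 * 5


def compute_slack_signature(signing_secret: str, timestamp: str, body: bytes) -> str:
    """Return the signature expected for this timestamp and raw request body."""
    base_string = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base_string, hashlib.sha256).hexdigest()
    return "v0=" + digest


def is_valid_slack_request(signing_secret, timestamp, body, signature, now=None):
    """Check that the timestamp is fresh and the signature matches."""
    now = time.time() if now is None else now
    try:
        request_time = int(timestamp)
    except (TypeError, ValueError):
        return False
    if abs(now - request_time) > MAX_REQUEST_AGE_SECONDS:
        return False
    expected = compute_slack_signature(signing_secret, timestamp, body)
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected, signature or "")
```

The body must be the raw bytes exactly as received, before any form or JSON parsing, otherwise the signatures will not match.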

Taking action

First we need to know which namespace the request relates to. When we sent the messages in part two, we also included some metadata with the namespace. Now we can retrieve that from the Slack message to know what this action relates to:

def getNamespace(message):
  return message.metadata["namespace"]
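
To make that lookup a little more concrete: Slack delivers interaction requests as a form-encoded body with a single payload field containing JSON. The sketch below decodes it and reads the namespace. It assumes the metadata we attached to the reminder is echoed back under message.metadata.event_payload, which is worth confirming against a real payload from your own app. The function names and the sample body are hypothetical.

```python
import json
from urllib.parse import parse_qs, urlencode


def parse_interaction(body: bytes) -> dict:
    """Decode the form-encoded body of an interaction request into a dict."""
    form = parse_qs(body.decode("utf-8"))
    return json.loads(form["payload"][0])


def namespace_from_payload(payload: dict) -> str:
    # Assumption: the metadata posted with the reminder comes back under
    # message.metadata.event_payload in the interaction payload.
    return payload["message"]["metadata"]["event_payload"]["namespace"]


# A hand-built sample body, shaped like the assumption above.
sample_payload = {
    "type": "block_actions",
    "message": {"metadata": {"event_type": "reminder", "event_payload": {"namespace": "team-a"}}},
}
sample_body = urlencode({"payload": json.dumps(sample_payload)}).encode()
```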

There are only a few actions we need to handle. Let’s take a look at each one:

Snooze:

A user may want to snooze the request because they are still using those resources for a bit longer, or they are running a test overnight and do not want them to shut down. The reminder message gives the option to snooze for 1 hour, 3 hours, 1 day or 3 days. When we receive this action, we need to clear any other annotations (so we don’t accidentally send more reminders or perform a suspend action), then add an annotation for when the snooze expires.

def snooze_action(message):
  namespace = getNamespace(message)
  snoozeDuration = message.metadata["duration"]
  configmaps[namespace] = {"reminderTime": now() + snoozeDuration}
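
The expiry calculation can be sketched as follows. The keys in SNOOZE_OPTIONS are hypothetical button values standing in for the four choices in the reminder message, and the timestamp is stored as an ISO 8601 string because ConfigMap values must be strings.

```python
from datetime import datetime, timedelta, timezone

# The four choices offered in the reminder message; the keys are
# hypothetical button values chosen for this example.
SNOOZE_OPTIONS = {
    "1h": timedelta(hours=1),
    "3h": timedelta(hours=3),
    "1d": timedelta(days=1),
    "3d": timedelta(days=3),
}


def snoozed_state(choice: str, now: datetime) -> dict:
    """Build the replacement ConfigMap data for a snoozed namespace."""
    if choice not in SNOOZE_OPTIONS:
        raise ValueError(f"unknown snooze duration: {choice!r}")
    expires = now + SNOOZE_OPTIONS[choice]
    # Returning a fresh dict drops any earlier reminder or suspend entries.
    return {"reminderTime": expires.isoformat()}
```

Rejecting unknown values rather than falling back to a default means a malformed request cannot silently snooze a namespace for an unexpected length of time.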

Suspend:

If the user has finished for the night and forgot to remove their resources, they can choose to suspend them now, which saves costs and prevents any more reminders. We already have the suspend function from part 2. We just need to delete the configmap afterwards so that we know to start again next time we see any resources. We will not send any more reminders, because the first check in the reminder service is whether any pods are running (and none are, as we just suspended them).

def suspend_action(message):
  namespace = getNamespace(message)
  suspend(namespace)
  del configmaps[namespace]

Delete:

Here we just need to delete the namespace, and all resources will be removed. We will clean up the configmaps shortly. In our version I also implemented a confirmation message to warn people that deletion is irreversible in case of a mis-click, but I’ve omitted that here.

def delete_action(message):
  namespace = getNamespace(message)
  delete(namespace)
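
For readers who want a guard against mis-clicks, one option is the confirmation dialog that Slack's Block Kit supports on buttons: Slack shows the dialog and only sends the interaction if the user confirms. This is a sketch of that approach, not necessarily how our version implements it, and the action_id and wording are hypothetical.

```python
def delete_button(namespace: str) -> dict:
    """Build a Block Kit button that asks for confirmation before firing."""
    return {
        "type": "button",
        "action_id": "delete_namespace",  # hypothetical identifier
        "style": "danger",
        "text": {"type": "plain_text", "text": "Delete"},
        "value": namespace,
        "confirm": {
            "title": {"type": "plain_text", "text": "Delete namespace?"},
            "text": {
                "type": "plain_text",
                "text": f"Deleting {namespace} is irreversible.",
            },
            "confirm": {"type": "plain_text", "text": "Delete"},
            "deny": {"type": "plain_text", "text": "Cancel"},
        },
    }
```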

Resume:

This is a critical feature of this solution. If we keep suspending resources to save costs but people cannot easily resume them, they will keep snoozing forever, and we will annoy people without generating any cost savings.

We want to keep all of the interactions within Slack, so we need a way to issue a resume action too. Previously we had a function sendSuspendConfirmation, which tells the user their namespace has been suspended. We can add a resume button to this message so that they can easily turn their resources back on whenever they need them again.

def sendSuspendConfirmation(namespace):
  email = namespace.annotations.email
  metadata = {
    "namespace": namespace,
    "callback_url": CALLBACK_URL
  }
  slack.post(email, SUSPEND_MESSAGE, metadata)

The resume function works similarly to the suspend function, checking for a previous-replicas annotation and updating the resource to match. This process can take a few minutes if there are lots of resources, so we will send a confirmation message when finished.

def resume(namespace):
  for resource in (namespace.deployments + namespace.statefulsets):
    if "previous-replicas" in resource.annotations:
      resource.replicas = resource.annotations["previous-replicas"]
  sendResumeConfirmation(namespace)
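
One detail the pseudocode glosses over is that Kubernetes annotation values are always strings, so the saved count has to be converted back to an integer. Here is a minimal sketch of the restore step that models each resource as a plain dict rather than using a real Kubernetes client; the dict layout and function names are assumptions made for illustration.

```python
ANNOTATION = "previous-replicas"


def resume_resource(resource: dict) -> bool:
    """Restore the saved replica count; return True if the resource changed."""
    saved = resource.get("annotations", {}).get(ANNOTATION)
    if saved is None:
        return False  # no saved count, so leave the resource alone
    resource["replicas"] = int(saved)  # annotation values are strings
    return True


def resume_all(resources: list) -> int:
    """Resume every annotated resource and return how many were restored."""
    return sum(1 for resource in resources if resume_resource(resource))
```

Resources without the annotation are skipped, so anything a user scaled to zero on purpose stays at zero.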

Now we have a working Slack bot which can remind, suspend, delete and resume resources in our Kubernetes cluster to save costs. Next we will look at one of the challenges with the Slack webhooks and how we can work around it to expand this solution to work with multiple Kubernetes clusters.
