Rosie’s Reminders
If you have not seen part 1, I suggest you start there to read about the concept and architecture for this project. We are building a Slack bot that will reduce Kubernetes costs by suspending resources outside of local business hours, inspired by Cyral’s Just In Time (JIT) access feature.
One aspect that makes Cyral’s JIT access integration with Slack so usable is how clear the access request messages are. The admin can clearly see who is requesting access, what resources they need, how long, and the reason, all from one spot.
I want to replicate this in our reminder service. Each time a developer gets a Slack message, it should be clear which resources are being proposed for suspension, and the developer should be able to respond quickly from within the message.
When to send reminders
The first challenge is determining which resources belong to which developer. Each instance is deployed in a dedicated namespace. We use a script to make the deployment easy and repeatable, so I added an extra step that adds an annotation called `email` to the namespace, set to the email address of the person who runs the creation script.
One of the key problems with our old solution is that it uses only a single time zone, rather than being aware of where the developers are located. Using the Slack APIs, we can look up the local time zone for each developer so we know when to message them.
The following pseudocode checks whether a user is inside their working hours based on their Slack time zone:
def inWorkingHours(email):
    # Slack reports tz_offset as a number of seconds relative to UTC
    tz_offset = slack.users[email]['tz_offset']
    local_time = datetime.datetime.utcnow() + datetime.timedelta(seconds=tz_offset)
    return WORK_HOURS_START <= local_time.hour < WORK_HOURS_END
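To make the check above easy to try out, here is a small runnable sketch. It assumes the offset has already been fetched from Slack and is passed in as seconds, and the 09:00 to 17:00 window is a placeholder I picked for illustration. The function name and the `now_utc` parameter are my own additions so the logic can be exercised with a fixed time.

```python
from datetime import datetime, timedelta, timezone

WORK_HOURS_START = 9   # assumed: working day starts at 09:00 local time
WORK_HOURS_END = 17    # assumed: working day ends at 17:00 local time


def in_working_hours(tz_offset_seconds, now_utc=None):
    """Return True if the user's local hour falls inside working hours."""
    if now_utc is None:
        now_utc = datetime.now(timezone.utc)
    # Shift the UTC time by the user's offset to get their wall-clock time
    local_time = now_utc + timedelta(seconds=tz_offset_seconds)
    return WORK_HOURS_START <= local_time.hour < WORK_HOURS_END


if __name__ == "__main__":
    noon_utc = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
    print(in_working_hours(0, noon_utc))          # user on UTC
    print(in_working_hours(-8 * 3600, noon_utc))  # user eight hours behind UTC
```

Note that this only looks at the hour of the day. Weekends and holidays would need an extra check.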
What does suspending look like?
Given that each independent instance of the Cyral product is deployed into a unique namespace, we can consider an instance suspended if all deployments and statefulsets in its namespace have been scaled to 0 replicas. We can use the Kubernetes API to list all deployments and statefulsets in a namespace, then set the replicas of each to 0. Kubernetes will then do the heavy lifting of terminating the pods.
To support a resume function (more details in part 3), we need to know how many replicas there were before we scaled down. We want to keep the state within the existing resources as much as possible, so before the scale down, we will add an annotation to each deployment/statefulset called previous-replicas. When it comes time to resume, we can just scale each deployment/statefulset to its own previous-replicas annotation value.
After suspending, we want to send a Slack message to the user so that they are aware that the resources have been shut down. Let’s turn this into pseudocode:
def suspend(namespace):
    for resource in (namespace.deployments + namespace.statefulsets):
        if resource.replicas > 0:
            # Remember the current size so that resume can restore it
            resource.annotations['previous-replicas'] = resource.replicas
            resource.replicas = 0
    sendSuspendConfirmation(namespace)
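The annotation round trip is easier to see with a runnable toy version. The sketch below uses an in-memory `Workload` class as a hypothetical stand-in for a deployment or statefulset, so it does not talk to a real cluster. The `resume` function is only my guess at the shape of the resume step described above, and removing the annotation after resuming is an assumption.

```python
from dataclasses import dataclass, field

PREVIOUS_REPLICAS = "previous-replicas"


@dataclass
class Workload:
    """Hypothetical stand-in for a deployment or statefulset."""
    name: str
    replicas: int
    annotations: dict = field(default_factory=dict)


def suspend(workloads):
    """Scale every running workload to 0, recording its previous size."""
    for w in workloads:
        if w.replicas > 0:
            # Kubernetes annotation values are strings, so store it as one
            w.annotations[PREVIOUS_REPLICAS] = str(w.replicas)
            w.replicas = 0


def resume(workloads):
    """Restore each workload to the size recorded by suspend()."""
    for w in workloads:
        previous = w.annotations.pop(PREVIOUS_REPLICAS, None)
        if previous is not None:
            w.replicas = int(previous)


if __name__ == "__main__":
    workloads = [Workload("api", 3), Workload("db", 1)]
    suspend(workloads)
    print([(w.name, w.replicas, w.annotations) for w in workloads])
    resume(workloads)
    print([(w.name, w.replicas, w.annotations) for w in workloads])
```

The `replicas > 0` guard matters here: if suspend runs twice, the second pass skips workloads that are already at 0, so the saved value is not overwritten.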
Keeping track of interactions:
Finally, we don’t want to send a reminder if there are no pods running in the namespace. This can be done with a simple check. We also need to know if we have already sent a reminder recently so that we don’t spam anyone. We can do this by storing the state in a data object called a configmap with the same name as the namespace we are processing. This will allow us to track when we have sent reminders, and if they have been ignored.
Let’s turn this into pseudocode:
for namespace in namespaces:
    if len(namespace.pods) == 0:
        continue  # There are no pods to suspend
    if inWorkingHours(namespace.annotations.email):
        continue  # Don't interrupt while people are working
    if not configmaps[namespace]:
        sendReminder(namespace)  # No reminder has been sent
    elif configmaps[namespace].suspendTime <= now():
        suspend(namespace)  # No response, so suspend the resources
    elif configmaps[namespace].finalReminderTime <= now():
        sendFinalReminder(namespace)  # No response, so send a final reminder
    elif configmaps[namespace].reminderTime <= now():
        sendReminder(namespace)  # Send another reminder
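The branching above can be pulled out into a pure function, which makes it simple to test without Slack or Kubernetes. This is a sketch under a few assumptions: the stored state is a plain dict of timestamps (or `None` when no configmap exists), an action is due once its timestamp is at or before the current time, and the action names returned as strings are placeholders of mine.

```python
from datetime import datetime, timedelta


def next_action(state, now):
    """Pick the step to take for one namespace outside working hours.

    `state` is the dict of timestamps stored for the namespace, or None
    if no reminder has been sent yet. Returns an action name or None.
    """
    if not state:
        return "reminder"
    # Check the latest stage first so that an overdue suspend wins
    stages = (
        ("suspendTime", "suspend"),
        ("finalReminderTime", "final_reminder"),
        ("reminderTime", "reminder"),
    )
    for key, action in stages:
        due = state.get(key)
        if due is not None and due <= now:
            return action
    return None  # Nothing is due yet


if __name__ == "__main__":
    now = datetime(2024, 1, 10, 22, 0)
    print(next_action(None, now))
    print(next_action({"finalReminderTime": now + timedelta(minutes=45)}, now))
    print(next_action({"finalReminderTime": now - timedelta(minutes=1)}, now))
```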
We can flesh out the sendReminder code as well. The only trick is that we need to update the configmap with the time of the next reminder or suspend action so that it gets carried out when it is due. sendFinalReminder looks the same, but sets the suspendTime key instead. We also know that when the user later interacts with our reminder, we will need to know which namespace it relates to and which cluster the namespace is in. We will add both of these attributes as metadata on the Slack message. We will see how they are used shortly.
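def sendReminder(namespace):
    email = namespace.annotations.email
    # The metadata lets the interaction handler find the namespace and cluster
    metadata = {
        "namespace": namespace,
        "callback_url": CALLBACK_URL
    }
    slack.post(email, REMINDER_MESSAGE, metadata)
    # Schedule the final reminder for 45 minutes from now
    configmaps[namespace].finalReminderTime = now() + timedelta(minutes=45)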
Creating the message body is easy. The Slack Block Kit Builder takes care of most of the hard work and provides an easy UI. Now we have a service that will remind users to suspend or delete resources in Kubernetes outside their business hours, and automatically suspend them if the user ignores the two reminders.
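For a rough idea of what the message could look like on the wire, here is a sketch that builds a request body for Slack's `chat.postMessage` method using Block Kit blocks and message metadata. The wording, the `action_id` values, the `suspend_reminder` event type, and the function name are all placeholders I made up for illustration. The real message came out of the Block Kit Builder and may look quite different.

```python
import json


def build_reminder_payload(channel, namespace, callback_url):
    """Build a chat.postMessage body with buttons and message metadata."""
    def button(action_id, label, style=None):
        element = {
            "type": "button",
            "action_id": action_id,  # hypothetical action names
            "text": {"type": "plain_text", "text": label},
            "value": namespace,
        }
        if style:
            element["style"] = style
        return element

    return {
        "channel": channel,
        # Plain-text fallback used in notifications
        "text": f"Resources in {namespace} are still running.",
        "blocks": [
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"Resources in *{namespace}* are still running "
                            "outside your working hours.",
                },
            },
            {
                "type": "actions",
                "elements": [
                    button("snooze", "Snooze"),
                    button("suspend", "Suspend", style="primary"),
                    button("delete", "Delete", style="danger"),
                ],
            },
        ],
        # Metadata travels with the message and comes back on interaction
        "metadata": {
            "event_type": "suspend_reminder",  # hypothetical event type
            "event_payload": {
                "namespace": namespace,
                "callback_url": callback_url,
            },
        },
    }


if __name__ == "__main__":
    payload = build_reminder_payload("U123", "dev-rosie", "https://example.invalid/cb")
    print(json.dumps(payload, indent=2))
```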
Cleanup:
There are a variety of ways that people may delete their resources and namespaces, and this Slack bot is only one of them. As a result, we need a reliable way to ensure that any configmaps that reference non-existent namespaces are deleted. All we need to do is find any configmap names that do not have a corresponding namespace:
def removeOrphanConfigmaps():
    # Iterate over a copy so that we can delete entries as we go
    for configmap in list(configmaps):
        if configmap not in namespaces:
            del configmaps[configmap]
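Since the cleanup is really a set difference, it can also be sketched that way. The version below works on plain Python collections rather than live cluster objects, and the function names are my own. It returns the names it removed, which is handy for logging.

```python
def find_orphan_configmaps(configmap_names, namespace_names):
    """Return configmap names that have no matching namespace."""
    return sorted(set(configmap_names) - set(namespace_names))


def remove_orphan_configmaps(configmaps, namespace_names):
    """Delete orphaned entries from `configmaps` and return their names."""
    orphans = find_orphan_configmaps(configmaps, namespace_names)
    for name in orphans:
        del configmaps[name]
    return orphans


if __name__ == "__main__":
    configmaps = {"dev-rosie": {}, "dev-old": {}, "dev-sam": {}}
    namespaces = ["dev-rosie", "dev-sam", "dev-new"]
    print(remove_orphan_configmaps(configmaps, namespaces))
    print(sorted(configmaps))
```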
Check back for part 3, where we will handle interactions and allow the user to snooze the reminder or suspend/delete their resources, all from within Slack!