On-Call Access Management + Incident Response
On-call access management, also called incident access management, is the set of policies and procedures that give employees who are on call or responding to an incident the access they need. For companies that follow the Principle of Least Privilege (PoLP), managing access for on-call teams presents a challenge: the on-call rotation needs fast, privileged access to handle incidents, but PoLP demands that each employee hold only minimal rights by default. Today’s API-driven environments offer a way to resolve this tension: companies can use Security as Code and Policy as Code to grant access during an emergency without leaving latent excess permissions in place for all employees.
Incidents and On-Call Rotation
In today’s DevOps-focused environments, a key practice for high-performing teams is an incident and on-call rotation for all service owners, including the developers and engineers responsible for code changes in each service. The DevOps model was born out of the limitations of the centralized model of IT administration. Under that model, the IT team handled all aspects of production infrastructure but also maintained many technical and organizational silos, which made change difficult. Many teams today focus first on developing resilient systems that have very few major incidents and even fewer outages. These teams have also built robust procedures to reduce metrics such as Mean Time to Recovery (MTTR), which requires service owners to be part of the initial incident response team. Services like PagerDuty have been built to support this approach, providing an API-first model that lets teams automate their on-call schedules and generate notifications to bring in incident responders as soon as possible.
Principle of Least Privilege
On a day-to-day basis, the rule of thumb is that all employees should operate under the Principle of Least Privilege: employees should be granted only the privileges essential to their everyday job. This best practice limits the risks of privileged access, which affect both the stability and the security of the environment. Users with excessive privileges may accidentally cause system outages or data loss through human error, such as performing the wrong action in a production environment. If a user account is compromised, the attacker can use that single user’s privileged access to reach data. Today, many targeted attacks aim to compromise the accounts of privileged employees, using techniques such as “watering hole attacks” to lure these users.
Privileged Access With Policy as Code
With a focus on Security as Code, and specifically Policy as Code, companies can uphold the Principle of Least Privilege and automate on-call access management. By using services that support API-driven development, teams can implement access control schemes that grant privileged access to an engineer only while that engineer is on call. For example, teams can integrate with a service such as Open Policy Agent (OPA) and use the Rego policy language to write a policy that grants access only while the user is on call.
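package cyral.authz_policy

# Written in pre-1.0 Rego syntax; OPA 1.0 and later expect the `if` keyword before rule bodies.

# Deny by default. The ttl tells the caller how long the decision may be reused.
default decision = {"allow": false, "ttl": "60s"}

decision = {"allow": true, "ttl": "60s"} {
    is_user_on_call(input.userID)
}

# Minimum escalation level required; falls back to 1 when no policy parameter is set.
default min_escalation_level = 1

min_escalation_level = l {
    l := data.policyparams.min_escalation_level
}

is_user_on_call(user) {
    id := get_user_id(user)
    is_userid_on_call(id)
}

# Look up the PagerDuty user ID for the given email address.
# PagerDuty reads list filters from the query string, so the filter is
# URL-encoded into the URL instead of being sent as a GET body.
get_user_id(user) = id {
    users := http.send({
        "url": sprintf("https://api.pagerduty.com/users?query=%s", [urlquery.encode(user)]),
        "method": "GET",
        "headers": {
            "Content-Type": "application/json",
            "Authorization": sprintf("Token token=%s", [data.policyparams.api_token])
        }
    }).body.users
    users[0].email == user
    id := users[0].id
}

# True when the user has an on-call entry at or above the minimum escalation level.
is_userid_on_call(id) {
    oncalls := http.send({
        "url": sprintf("https://api.pagerduty.com/oncalls?limit=100&user_ids[]=%s", [urlquery.encode(id)]),
        "method": "GET",
        "headers": {
            "Content-Type": "application/json",
            "Authorization": sprintf("Token token=%s", [data.policyparams.api_token])
        }
    }).body.oncalls
    count(oncalls) > 0
    oncalls[_].escalation_level >= min_escalation_level
}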
The sample code above looks up the user in PagerDuty, checks whether that user is currently on call, and then returns a policy decision that determines whether the engineer can perform the requested action. This type of policy decision is completely codified and eliminates the need for elevated-access requests and approvals. For many companies operating according to the Principle of Least Privilege, this removes the overuse of the “break glass” workaround that customarily gave engineers emergency access, which was often too broad and often left in place long after the emergency was over. With access management fully automated through Policy as Code, these inherently risky workarounds can be eliminated.
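To show how such a decision might be consumed, here is a minimal Python sketch of a caller that asks OPA for the `decision` rule and reuses the answer until its `ttl` expires. It is an illustration under stated assumptions, not part of the original sample: it assumes an OPA server with the policy loaded at `localhost:8181`, queried through OPA's Data API, and it assumes `input.userID` is the engineer's email address, since the policy compares it with the PagerDuty email. The names `query_opa`, `interpret_decision`, and `DecisionCache` are hypothetical.

```python
import json
import time
import urllib.request
from typing import Callable, Dict, Tuple

# Assumption: OPA is running locally with the cyral.authz_policy package loaded.
OPA_URL = "http://localhost:8181/v1/data/cyral/authz_policy/decision"


def parse_ttl_seconds(ttl: str) -> int:
    """Parse a TTL such as "60s" into seconds (only the "s" suffix is handled)."""
    if not ttl.endswith("s"):
        raise ValueError(f"unsupported ttl format: {ttl!r}")
    return int(ttl[:-1])


def interpret_decision(response_body: dict) -> Tuple[bool, int]:
    """Turn an OPA Data API response into (allow, ttl_seconds), denying by default."""
    result = response_body.get("result") or {}
    allow = result.get("allow") is True
    ttl = parse_ttl_seconds(result.get("ttl", "0s"))
    return allow, ttl


def query_opa(user_id: str, url: str = OPA_URL) -> dict:
    """POST the input document to OPA and return the decoded JSON response."""
    payload = json.dumps({"input": {"userID": user_id}}).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)


class DecisionCache:
    """Reuses each user's decision until its TTL expires, then asks OPA again."""

    def __init__(
        self,
        fetch: Callable[[str], dict] = query_opa,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._clock = clock
        self._entries: Dict[str, Tuple[bool, float]] = {}  # user -> (allow, expiry)

    def is_allowed(self, user_id: str) -> bool:
        now = self._clock()
        cached = self._entries.get(user_id)
        if cached is not None and now < cached[1]:
            return cached[0]
        allow, ttl = interpret_decision(self._fetch(user_id))
        self._entries[user_id] = (allow, now + ttl)
        return allow
```

A caller would create one `DecisionCache()` and call `is_allowed("engineer@example.com")` before each privileged action. Because the policy returns a short `ttl`, access granted during an on-call shift lapses soon after the shift ends, without anyone having to revoke it.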