Custom classifier for Automatic Data Map
Cyral's Automatic Data Map feature relies on the Repo Crawler to classify data according to a set of predefined data labels and custom data labels. Below, we explain how to set up and use custom data labels.
To use a custom data label for classification, you must set up the following in Cyral:
- A custom data label, for example FIRST_NAME or DOB
- A classifier attached to that data label
- The classifier's status set to enabled
The classifier (or classification rule) is a special piece of Rego code that the Repo Crawler uses to classify data under a specific data label. During the data sampling process, the Repo Crawler evaluates the classifier for each data label that you have defined and enabled. It passes two pieces of information to the classifier code to classify the sampled data:
- the column name (key), and
- the sampled column value (val).
The data label's classifier can then act on these two input parameters and return true or false, indicating whether the data should be classified with the data label.
While the actual classification logic is arbitrary, the classifier code must follow a specific format, which is outlined below.
Caution: This is an advanced configuration, and you must use the Cyral API to set it up. If you need assistance, contact Cyral support.
Classifier Rego
The classifier code must conform to the following Rego structure:
package classifier

output := {k: v |
    v := classify(k, input[k])
}

# Replace <LABEL NAME> with the actual data label name
classify(_, _) = "<LABEL NAME>" {
    # Classifier code to operate on the input params goes here.
    # Must ultimately return a boolean (true/false) value.
    # Params:
    #   * key = col name
    #   * val = sampled col value
} else = "UNLABELED" {
    true
}
The input variable passed to the Rego classifier is a JSON object in the following form:
{
  "columnName1": "columnVal1",
  "columnName2": "columnVal2",
  ...
}
The output of the Rego classifier is also a JSON object in the following form:
{
  "output": {
    "columnName1": "<LABEL NAME>",
    "columnName2": "UNLABELED",
    ...
  }
}
This represents the results of applying the classifier: each column is mapped either to the data label name or to UNLABELED if the classifier does not match the sample passed to it.
Match the above template exactly; otherwise, the classifier will not work.
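The mapping performed by the template above can be pictured with a rough Python analogue. This is only an illustration of the input and output shapes, not what the Repo Crawler actually runs. The function names `classify` and `apply_template`, the `EMAIL` label, and the "contains @" rule are all hypothetical.

```python
UNLABELED = "UNLABELED"


def classify(key: str, val: str, label: str = "EMAIL") -> str:
    """Hypothetical rule: label a column when its sampled value contains '@'."""
    # Plays the role of the Rego rule body plus the `else = "UNLABELED"` branch.
    return label if "@" in val else UNLABELED


def apply_template(sample: dict) -> dict:
    """Python analogue of: output := {k: v | v := classify(k, input[k])}"""
    return {"output": {key: classify(key, val) for key, val in sample.items()}}


# One sampled row: column name -> sampled column value (assumed data)
result = apply_template({"contact": "jane@example.com", "city": "Lisbon"})
print(result)
# {'output': {'contact': 'EMAIL', 'city': 'UNLABELED'}}
```

Every input column appears in the output exactly once, which is why the `else` branch returning UNLABELED is part of the required template.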
Example classifier: SSN
Here’s an example Rego classifier for the SSN data label. It applies a regex to the sampled value (the val variable), using Rego’s built-in regex.match function with the Golang/RE2 regex syntax, to determine whether a column should be classified under the SSN data label:
package classifier

output := {k: v |
    v := classify(k, input[k])
}

classify(_, val) = "SSN" {
    regex.match(
        `\A((((66[0-57-9])|(6[0-57-7]\d)|(00[1-9])|(0[1-9]\d)|([1-578]\d{2}))-((0[1-9])|([1-9]\d))-((000[1-9])|(00[1-9]\d)|(0[1-9]\d{2})|([1-9]\d{3})))|(((66[0-57-9])|(6[0-57-7]\d)|(00[1-9])|(0[1-9]\d)|([1-578]\d{2}))+((0[1-9])|([1-9]\d))+((000[1-9])|(00[1-9]\d)|(0[1-9]\d{2})|([1-9]\d{3})))|(((66[0-57-9])|(6[0-57-7]\d)|(00[1-9])|(0[1-9]\d)|([1-578]\d{2}))\.((0[1-9])|([1-9]\d))\.((000[1-9])|(00[1-9]\d)|(0[1-9]\d{2})|([1-9]\d{3})))|(((66[0-57-9])|(6[0-57-7]\d)|(00[1-9])|(0[1-9]\d)|([1-578]\d{2}))((0[1-9])|([1-9]\d))((000[1-9])|(00[1-9]\d)|(0[1-9]\d{2})|([1-9]\d{3}))))\z`,
        val,
    )
} else = "UNLABELED" {
    true
}
Example classifier: ADDRESS
Here’s another example for the ADDRESS data label, where the classifier instead evaluates the column name (key) to make a decision:
package classifier

import future.keywords.in

output := {k: v |
    v := classify(k, input[k])
}

classify(key, _) = "ADDRESS" {
    true in [
        lower(key) == "state",
        regex.match(`\A.*address.*\z`, lower(key)),
        lower(key) == "zip",
        lower(key) == "zipcode",
        regex.match(`\Astreet.*\z`, lower(key)),
    ]
} else = "UNLABELED" {
    true
}
Testing Custom Rego Classifiers
The Rego code for custom classifiers is best tested using OPA directly and the utilities it provides. Rego classifiers can be invoked directly using the OPA command line (opa eval).
Save the ADDRESS classifier example above as classifier.rego and the following JSON as input.json:
{
  "address_line_1": "123 Example St.",
  "address_line_2": "Apt. 987",
  "state_of_matter": "liquid",
  "street_name": "123 Example St.",
  "zip": "12345"
}
Then run the opa eval command:
opa eval "data.classifier" --input input.json --data classifier.rego --format pretty
This yields the following output:
{
  "output": {
    "address_line_1": "ADDRESS",
    "address_line_2": "ADDRESS",
    "state_of_matter": "UNLABELED",
    "street_name": "ADDRESS",
    "zip": "ADDRESS"
  }
}
Additionally, OPA provides a test framework for Rego (opa test), which can also be used to test classifiers. For more details, see the OPA docs.
Finally, once a classifier is created and attached to a custom data label, you can verify it manually by running the Repo Crawler against known data and checking that the results match what you expect.
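If you want to repeat the opa eval check automatically, a small wrapper script is one option. The following is a minimal sketch, assuming the opa binary is on your PATH and that `opa eval --format json` wraps the value in a `result[0].expressions[0].value` envelope. The helper names (`extract_labels`, `run_classifier`, `mismatches`) are hypothetical and not part of Cyral or OPA.

```python
import json
import subprocess


def extract_labels(opa_json: str) -> dict:
    """Pull the column -> label map out of `opa eval --format json` output."""
    doc = json.loads(opa_json)
    results = doc.get("result") or []
    if not results:
        # OPA prints an empty object when the query is undefined.
        raise ValueError("opa eval returned no result; check the package and rule names")
    return results[0]["expressions"][0]["value"]


def run_classifier(rego_path: str, input_path: str) -> dict:
    """Evaluate data.classifier.output with the opa CLI (requires opa installed)."""
    proc = subprocess.run(
        ["opa", "eval", "data.classifier.output",
         "--input", input_path, "--data", rego_path, "--format", "json"],
        capture_output=True, text=True, check=True,
    )
    return extract_labels(proc.stdout)


def mismatches(actual: dict, expected: dict) -> dict:
    """Return {column: (expected_label, actual_label)} for every disagreement."""
    return {
        col: (want, actual.get(col))
        for col, want in expected.items()
        if actual.get(col) != want
    }
```

With the files from the ADDRESS example, `mismatches(run_classifier("classifier.rego", "input.json"), expected)` returns an empty dict when every column in `expected` received the label you listed for it.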
API Example
In this example, we create a custom data label called SSN_CUSTOM, using a slightly modified version of the SSN Rego classifier shown above. You would use the same API call to update a data label as well.
See the data labels section of the Cyral API Reference for information about the endpoint we use below.
curl -X "PUT" "https://<control plane host>:<control plane port>/v1/datalabels/SSN_CUSTOM" \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <access token>' \
  -d $'{
"type": "CUSTOM",
"tags": [ "tag1", "tag2" ],
"description": "Social Security Number (Custom)",
"classificationRule": {
"status": "ENABLED",
"ruleType": "REGO",
"ruleCode": "package classifier\\n\\noutput := {k: v |\\n v := classify(k, input[k])\\n}\\n\\nclassify(_, val) = \\"SSN_CUSTOM\\" {\\n regex.match(\\n `\\\\A((((66[0-57-9])|(6[0-57-7]\\\\d)|(00[1-9])|(0[1-9]\\\\d)|([1-578]\\\\d{2}))-((0[1-9])|([1-9]\\\\d))-((000[1-9])|(00[1-9]\\\\d)|(0[1-9]\\\\d{2})|([1-9]\\\\d{3})))|(((66[0-57-9])|(6[0-57-7]\\\\d)|(00[1-9])|(0[1-9]\\\\d)|([1-578]\\\\d{2}))+((0[1-9])|([1-9]\\\\d))+((000[1-9])|(00[1-9]\\\\d)|(0[1-9]\\\\d{2})|([1-9]\\\\d{3})))|(((66[0-57-9])|(6[0-57-7]\\\\d)|(00[1-9])|(0[1-9]\\\\d)|([1-578]\\\\d{2}))\\\\.((0[1-9])|([1-9]\\\\d))\\\\.((000[1-9])|(00[1-9]\\\\d)|(0[1-9]\\\\d{2})|([1-9]\\\\d{3})))|(((66[0-57-9])|(6[0-57-7]\\\\d)|(00[1-9])|(0[1-9]\\\\d)|([1-578]\\\\d{2}))((0[1-9])|([1-9]\\\\d))((000[1-9])|(00[1-9]\\\\d)|(0[1-9]\\\\d{2})|([1-9]\\\\d{3}))))\\\\z`,\\n val\\n )\\n} else = \\"UNLABELED\\" {\\n true\\n}\\n"
}
}'
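The classifier Rego code goes in the ruleCode property of the classificationRule object. Because the ruleCode property is just a JSON string, it must be properly escaped (quotes, newlines, and so on) to work with cURL. The example above also passes the request body in a shell $'...' string, which is why every backslash is doubled once more.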
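Escaping the Rego by hand is error-prone, so one option is to let a JSON library do it. The sketch below builds the request body with Python's standard json module, using the same fields as the API example above. The function name `build_label_payload` is hypothetical, and the short regex in the sample Rego is a simplified placeholder, not the SSN pattern shown earlier.

```python
import json


def build_label_payload(rego_source: str, description: str, tags: list) -> str:
    """Return the JSON request body, with the Rego code escaped as a JSON string."""
    body = {
        "type": "CUSTOM",
        "tags": tags,
        "description": description,
        "classificationRule": {
            "status": "ENABLED",
            "ruleType": "REGO",
            "ruleCode": rego_source,  # json.dumps escapes quotes, newlines, backslashes
        },
    }
    return json.dumps(body, indent=2)


# Stand-in for the contents of a classifier file (placeholder regex, assumed).
rego_source = r'''package classifier

output := {k: v |
    v := classify(k, input[k])
}

classify(_, val) = "SSN_CUSTOM" {
    regex.match(`\A\d{3}-\d{2}-\d{4}\z`, val)
} else = "UNLABELED" {
    true
}
'''

payload = build_label_payload(rego_source, "Social Security Number (Custom)", ["tag1", "tag2"])
print(payload)
```

Writing `payload` to a file such as payload.json and sending it with `curl --data-binary @payload.json` avoids the extra layer of shell quoting, because the shell never interprets the body.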