Version: v4.17

Custom classifier for Automatic Data Map

Cyral's Automatic Data Map feature relies on the Repo Crawler to classify data according to a set of predefined data labels and custom data labels. Below, we explain how to set up and use custom data labels.

To use a custom data label for classification, you must set up the following in Cyral:

  • A custom data label, for example, FIRST_NAME or DOB
  • A classifier attached to that data label
  • The classifier's status set to enabled

The classifier (or classification rule) is a special piece of Rego code that the Repo Crawler uses to classify data under a specific data label. During the data sampling process, the Repo Crawler evaluates the classifier for each data label that you have defined and enabled. It passes two pieces of information to the classifier code to classify the sampled data:

  • the column name (key) and
  • the sampled column value (val).

The data label's classifier can then act on these two inputs and return true or false, indicating whether the data should be classified with the data label.

While the actual classification logic is arbitrary, the classifier code must follow a specific format, which is outlined below.

caution

This is an advanced configuration, and you must use the Cyral API to set it up. If you need assistance, contact Cyral support.

Classifier Rego

The classifier code must conform to the following Rego structure:

package classifier

output := {k: v |
v := classify(k, input[k])
}

# Replace <LABEL NAME> with the actual data label name
classify(_, _) = "<LABEL NAME>" {
# Classifier code that operates on the input params goes here.
# Must ultimately evaluate to a boolean (true/false) value.
# Params (replace each _ above with a name, e.g. key or val, to use it):
# * key = col name
# * val = sampled col value
} else = "UNLABELED" {
true
}

The input variable passed to the Rego classifier is a JSON object in the following form:

{
"columnName1": "columnVal1",
"columnName2": "columnVal2",
...
}

The output of the Rego classifier is also a JSON object in the following form:

{
"output": {
"columnName1": "<LABEL NAME>",
"columnName2": "UNLABELED",
...
}
}

This represents the result of applying the classifier: each column is mapped to the data label name, or to UNLABELED if the classifier does not match the sample passed to it.

Try on the Rego Playground.

Keep the structure of the template above intact, changing only the label name and the classify rule's parameters and body; otherwise, the classifier will not work.
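The input and output shapes described above can be pictured with a small Python analogue. This is only an illustration of the contract, not how the Repo Crawler evaluates classifiers (those are written in Rego). The names run_classifier and is_email_column, and the EMAIL label, are hypothetical.

```python
from typing import Callable, Dict

UNLABELED = "UNLABELED"


def run_classifier(
    sample: Dict[str, str],
    label: str,
    matches: Callable[[str, str], bool],
) -> Dict[str, Dict[str, str]]:
    """Mimic the template: map every column to `label` or to UNLABELED."""
    output = {
        key: (label if matches(key, val) else UNLABELED)
        for key, val in sample.items()
    }
    return {"output": output}


# Hypothetical predicate: label a column when its name contains "email".
def is_email_column(key: str, val: str) -> bool:
    return "email" in key.lower()


result = run_classifier(
    {"user_email": "a@example.com", "age": "42"},
    "EMAIL",
    is_email_column,
)
print(result)
# {'output': {'user_email': 'EMAIL', 'age': 'UNLABELED'}}
```

As in the Rego template, every input column appears in the output, and the predicate receives both the column name and the sampled value even if it uses only one of them.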

Example classifier: SSN

Here’s an example Rego classifier for the SSN data label. It applies a regex to the sampled value (the val variable) to determine whether a column should be classified under the SSN data label. The regex is evaluated with Rego’s built-in regex.match function, which uses the Golang/RE2 regex syntax:

package classifier

output := {k: v |
v := classify(k, input[k])
}

classify(_, val) = "SSN" {
regex.match(
`\A((((66[0-57-9])|(6[0-57-7]\d)|(00[1-9])|(0[1-9]\d)|([1-578]\d{2}))-((0[1-9])|([1-9]\d))-((000[1-9])|(00[1-9]\d)|(0[1-9]\d{2})|([1-9]\d{3})))|(((66[0-57-9])|(6[0-57-7]\d)|(00[1-9])|(0[1-9]\d)|([1-578]\d{2}))+((0[1-9])|([1-9]\d))+((000[1-9])|(00[1-9]\d)|(0[1-9]\d{2})|([1-9]\d{3})))|(((66[0-57-9])|(6[0-57-7]\d)|(00[1-9])|(0[1-9]\d)|([1-578]\d{2}))\.((0[1-9])|([1-9]\d))\.((000[1-9])|(00[1-9]\d)|(0[1-9]\d{2})|([1-9]\d{3})))|(((66[0-57-9])|(6[0-57-7]\d)|(00[1-9])|(0[1-9]\d)|([1-578]\d{2}))((0[1-9])|([1-9]\d))((000[1-9])|(00[1-9]\d)|(0[1-9]\d{2})|([1-9]\d{3}))))\z`,
val,
)
} else = "UNLABELED" {
true
}

Try on the Rego Playground.

Example classifier: ADDRESS

Here’s another example for the ADDRESS data label, where the classifier instead evaluates the column name (key) to make a decision:

package classifier

import future.keywords.in

output := {k: v |
v := classify(k, input[k])
}

classify(key, _) = "ADDRESS" {
true in [
lower(key) == "state",
regex.match(`\A.*address.*\z`, lower(key)),
lower(key) == "zip",
lower(key) == "zipcode",
regex.match(`\Astreet.*\z`, lower(key)),
]
} else = "UNLABELED" {
true
}

Try on the Rego Playground.

Testing Custom Rego Classifiers

The Rego code for custom classifiers is best tested using OPA directly and the utilities it provides. Rego classifiers can be invoked directly using the OPA command line (opa eval). For example, save the ADDRESS classifier above as classifier.rego and the following JSON as input.json:

{
"address_line_1": "123 Example St.",
"address_line_2": "Apt. 987",
"state_of_matter": "liquid",
"street_name": "123 Example St.",
"zip": "12345"
}

Then run the opa eval command:

opa eval "data.classifier" --input input.json --data classifier.rego --format pretty

This yields the following output:

{
"output": {
"address_line_1": "ADDRESS",
"address_line_2": "ADDRESS",
"state_of_matter": "UNLABELED",
"street_name": "ADDRESS",
"zip": "ADDRESS"
}
}
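To check output like the above automatically rather than by eye, a short script can compare it with the labels you expect. The following is a minimal sketch, assuming the JSON printed by opa eval with --format pretty has been captured as a string; the function name label_mismatches and the expected labels are hypothetical.

```python
import json
from typing import Dict, Optional, Tuple


def label_mismatches(
    opa_output: str, expected: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {column: (expected, actual)} for every column whose label differs."""
    actual = json.loads(opa_output)["output"]
    columns = set(expected) | set(actual)
    return {
        col: (expected.get(col), actual.get(col))
        for col in sorted(columns)
        if expected.get(col) != actual.get(col)
    }


# Output captured from `opa eval ... --format pretty` (shortened for the example).
opa_output = """
{
  "output": {
    "address_line_1": "ADDRESS",
    "state_of_matter": "UNLABELED",
    "zip": "ADDRESS"
  }
}
"""

# Hypothetical expectation that wrongly assumes state_of_matter is an address.
expected = {
    "address_line_1": "ADDRESS",
    "state_of_matter": "ADDRESS",
    "zip": "ADDRESS",
}

mismatches = label_mismatches(opa_output, expected)
print(mismatches)
# {'state_of_matter': ('ADDRESS', 'UNLABELED')}
```

An empty result means every column received the expected label. Columns missing from either side are reported with None in the corresponding position.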

Additionally, OPA provides a test framework for Rego (the opa test command), which can also be used to test classifiers. For more details, see the OPA docs.

Finally, once a classifier is created and attached to a custom data label, you can verify it manually by running the Repo Crawler against known data and checking that the results match what you expect.

API Example

In this example, we create a custom data label called SSN_CUSTOM, using a slightly modified version of the SSN Rego classifier shown above. The same API call also updates an existing data label.

See the data labels section of the Cyral API Reference for information about the endpoint we use below.

curl -X "PUT" "https://<control plane host>:<control plane port>/v1/datalabels/SSN_CUSTOM" \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <access token>' \
-d $'{
"type": "CUSTOM",
"tags": [ "tag1", "tag2" ],
"description": "Social Security Number (Custom)",
"classificationRule": {
"status": "ENABLED",
"ruleType": "REGO",
"ruleCode": "package classifier\\n\\noutput := {k: v |\\n v := classify(k, input[k])\\n}\\n\\nclassify(_, val) = \\"SSN_CUSTOM\\" {\\n regex.match(\\n `\\\\A((((66[0-57-9])|(6[0-57-7]\\\\d)|(00[1-9])|(0[1-9]\\\\d)|([1-578]\\\\d{2}))-((0[1-9])|([1-9]\\\\d))-((000[1-9])|(00[1-9]\\\\d)|(0[1-9]\\\\d{2})|([1-9]\\\\d{3})))|(((66[0-57-9])|(6[0-57-7]\\\\d)|(00[1-9])|(0[1-9]\\\\d)|([1-578]\\\\d{2}))+((0[1-9])|([1-9]\\\\d))+((000[1-9])|(00[1-9]\\\\d)|(0[1-9]\\\\d{2})|([1-9]\\\\d{3})))|(((66[0-57-9])|(6[0-57-7]\\\\d)|(00[1-9])|(0[1-9]\\\\d)|([1-578]\\\\d{2}))\\\\.((0[1-9])|([1-9]\\\\d))\\\\.((000[1-9])|(00[1-9]\\\\d)|(0[1-9]\\\\d{2})|([1-9]\\\\d{3})))|(((66[0-57-9])|(6[0-57-7]\\\\d)|(00[1-9])|(0[1-9]\\\\d)|([1-578]\\\\d{2}))((0[1-9])|([1-9]\\\\d))((000[1-9])|(00[1-9]\\\\d)|(0[1-9]\\\\d{2})|([1-9]\\\\d{3}))))\\\\z`,\\n val\\n )\\n} else = \\"UNLABELED\\" {\\n true\\n}\\n"
}
}'

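The classifier Rego code goes in the ruleCode property of the classificationRule object. Because ruleCode is a JSON string, the Rego code must be escaped for JSON (quotes, newlines, backslashes, and so on). In the cURL example above, the request body is also wrapped in the shell's $'...' quoting, which consumes one level of backslashes, so every backslash is doubled again.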
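Escaping the Rego code by hand is error-prone, so it can help to let a JSON library do it. The following is a minimal sketch, assuming the same request body fields as the cURL example above; the function name build_datalabel_payload, the ZIP_CUSTOM label, and its classifier are hypothetical and used only for illustration.

```python
import json
from typing import List


def build_datalabel_payload(rule_code: str, description: str, tags: List[str]) -> str:
    """Build the request body; json.dumps escapes quotes, newlines, and backslashes."""
    payload = {
        "type": "CUSTOM",
        "tags": tags,
        "description": description,
        "classificationRule": {
            "status": "ENABLED",
            "ruleType": "REGO",
            "ruleCode": rule_code,
        },
    }
    return json.dumps(payload, indent=2)


# Hypothetical classifier written as plain, unescaped Rego (raw string).
rego = r'''package classifier

output := {k: v |
  v := classify(k, input[k])
}

classify(key, _) = "ZIP_CUSTOM" {
  regex.match(`\Azip.*\z`, lower(key))
} else = "UNLABELED" {
  true
}
'''

body = build_datalabel_payload(rego, "ZIP code (Custom)", ["tag1", "tag2"])
print(body)
```

Saving the printed body to a file (for example, payload.json) and sending it with curl's --data-binary @payload.json option avoids the shell quoting layer entirely, so only the JSON escaping produced by json.dumps is needed.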