Poor Dave. He’s just trying to get his work done! Yet for some reason, it seems that an automated system has decided that he is performing “dangerous SQL queries,” and that his queries need to be re-written to be more innocuous.
Or, is he really trying to get work done? Perhaps he is somebody that has stolen Dave’s credentials, and is trying to exfiltrate the data. Or, perhaps, it really is Dave, but HE is trying to exfiltrate the data. All that the system can determine is that his queries fit a profile that it deems “dangerous”, and hence his queries are being rewritten.
Welcome to the world of automated query rewriting! Traditionally, automated query rewriting has been used primarily for improving query performance. Since SQL queries can be rather complex, and since their performance behavior is often opaque to mere humans, query-rewrite tools can intervene to improve performance.
Yet, with data moving to the cloud, where traditional gate-keeping tactics become less useful, there is a strong temptation to apply query rewriting operations to improve privacy and security. Depending on how you apply query rewriting operations, you will have varying degrees of success.
Choosing a Query Rewrite Strategy
There are three main types of rewrite strategies, allowing us to decide on which operations to use: specifically, a rule-based strategy, a cost-based strategy, a rule-based, and a machine-learning based strategy. It is fairly common as well to have a mixed-mode set of strategies where you might first apply a rule-based strategy, and then a cost-based strategy, which is in turn complimented by a machine-learning strategy.
Rule Driven Strategy
The simplest strategy to understand and to implement is the rule-based strategy. The rules-based strategy will use rewrite-rules that are pretty much guaranteed to provide better performance or solve a specific kind of problem.
The rule-driven strategies need only consider syntactic information from the query, with no knowledge of the underlying table sizes, or any statistical information about the columns involved in the query. You can think of this as following the algorithm specified in this python-like pseudo-code:
RuleDrivenQueryRewrite(query): rewrittenQuery = query queryRewriteDone = false while(!queryRewriteDone): queryRewriteDone = true for each rule R: if QueryMatchesPattern(rewrittenQuery, R.GetPattern()): rewrittenQuery = Rewrite(rewrittenQuery, R) queryRewriteDone = false
In other words, keep looking for patterns to apply until there are no more patterns. Care must be taken to make sure you don’t have contradictory patterns, but this is still the overall flow.
A cost-driven strategy will contain rewrite rules that not only consider the syntax of the SQL queries involved, but also, some measure of the expected costs to invoke the queries. The cost-driven strategy then is to apply a set of rules that minimize the cost.
For each query, you can assume that you have a “cost model” for the query that involves knowing how much space (RAM and/or disk) and how much time it will take for the query to complete. This is based on knowing certain statistics of the tables involved: Number of rows of a table, amount of memory per row, selectivity of a particular attribute, presence (or absence) of indices for the tables, etc. These are typically gathered via database statistics, and can be directly queried. The cost-driven strategy then will consider a sequence of transformations to the query, and keep track of the best-costing query encountered so far.
If you attempt to implement a cost-based strategy, I recommend that you be aware that the problem is not completely solvable in every situation. When the number of tables or objects involved is small, then brute-force tactics can work, but the number of possible rewrites you might need to consider can grow exponentially with the number of tables involved. Nevertheless, cost-based strategies are extremely useful in improving query performance
Machine Learning (AI)-Driven Strategy
In a machine learning strategy, the input is not only the query itself and the cost model, but also, the history of queries that have occurred before. In particular, for determining costs, consider the sum of all costs for queries involved in the history – not just the cost of a single query. In addition, you might not be purely looking into costs, but also trying to pinpoint potentially troublesome behavior.
The machine learning for query rewriting typically includes two phases: the learning phase, and then the deployment phase. Often, you might have an “online learning” system wherein there is no separate training phase; instead, the system is always trying to learn and adapt the best available model. At certain intervals, it might decide to apply what it has learned, but it will do this separately from the main query path.
The learning phase is where you would have the machine learning system learn about certain patterns of behavior. This might include repeated use of certain sub-queries, or it might include “exploratory” behavior of someone trying to gain as much knowledge as they can about a system. In the deployment phase, you put your models to work: You consider the history of queries and determine “Should action X take place?”
If you are looking for people (like Dave) trying to exfiltrate your data, the action might be to redirect Dave with a rewritten query. If instead, you are solely focused on performance, during the learning phase (either online or offline) you might create a new materialized view, and during the “deployment phase”, you might direct the query to use the materialized view instead of just using the query as-originally written.
When Does Query Rewriting Work Well?
There is a long history of query rewrites giving nice advantages when it comes to performance optimization. However, much of the “query rewrite” processes are already built into the database systems. It would be a waste to try to duplicate this behavior. For this reason, we will skip the possible rewrite strategies of “predicate pushdown” and “join reordering.” These have been covered in research extensively. If you are not familiar with these techniques, then I highly recommend that you look these up! Note that these terms also apply in non-traditional databases like Hive and database-like engines including Spark.
The best results come when you define a rewrite strategy that is not already employed by the database engine. We can consider one such case for working with partitioned databases (often referred to as “sharded” databases). On a sharded database, you might have a table split across multiple files or even distinct disks. You will see this with Google’s BigQuery, with Hive, and with Teradata and Netezza.
Consider a query that joins two rather large tables: say a SalesTransaction table with a Customer table based on SalesTrasanction.customer_id = Customer.id. But now, let’s say that the SalesTransaction.customer_id attribute is “skewed”: i.e., it has at least one value that shows up an inordinate amount of time. For example, suppose that half of the customer_id values are null. You can easily imagine this: not all customers want to provide information about themselves, so you perform a SalesTransaction without information about the actual customer.
In a sharded database, the goal is to split the data so that it can be performed by multiple machines (or processors in the case of Teradata/Netezza). To perform the join between the two tables, you need to co-locate all of the data that needs to be joined. If you have 100 workers, you split your Customers table based on the id values, and then you stream the SalesTransaction data — based on customer_id — to the same location. And then you do the join.
The problem is: what do you do with the skewed values? All those sales transactions having the skewed values will want to go to the same location.
The trick then is to split the query into multiple subqueries: one subquery to handle the non-skewed values, and one other subquery for each of the skewed values. (i.e., if you have 3 skewed values, you would have 4 subqueries total).
With this trick, assuming you can identify the skewed values, you can turn a query that never completes into a query that might take mere seconds.
Another kind of optimization benefit can occur from the introduction of materialized views. This would typically happen in a machine-learning scenario with a “live” environment.
Suppose that when querying our SalesTransaction table from the Query Partitioning example that there is a particular filter on the SalesTransaction table: WHERE st.transactionYear = 2019.
If this filter seems to show up with regularity in your queries, it would be helpful to make a “pre-filtered” materialized view of the SalesTransaction table that includes just the data from the year 2019. Now, this has a cost for us, as we will consume extra disk space in the process. However, this might be outweighed by the time savings. The overall cost model must consider both the cost in disk resources and the time costs.
When Does Query Rewriting Work Somewhat Well?
With differential privacy, the idea is to allow certain kinds of potentially sensitive information to be shared, as long as the data shared has a sufficient amount of “noise” involved. For example, when querying data that has age demographics, it might be disastrous to share a person’s actual age, as it might reveal who that person is. However, revealing a person’s age +/- 10% of their age might be okay. By adding a sufficient amount of “noise”, you can effectively anonymize the data.
A problem here can still occur if you don’t add “enough” noise. As the amount of data involved gets smaller, you need more and more “noise” in the system to prevent successful attacks. A simple query rewrite would not suffice unless the rewriting process could also effectively estimate the amount of data to return.
Nevertheless, this can work rather well if reasonable safeguards are in place.
When does query rewriting not work at all?
AppSec Example (Identity-Based Rewrites)
One possible consideration is to use a rule-based rewrite system that automatically modifies a query based on the identity of a user. Basically, allow a user to see “only their data.” There are a few problems with this idea: First, as the number of tables in the repository increases, the complexity of deciding “which is the person’s data” becomes more difficult to establish. In effect, such systems become fragile.
Moreover, if you are dependent on a rewrite system to secure the content, it implies that you are either doing redundant work in the application, or you are allowing the application to be “lazy” about using queries that are already pre-filtered based on user identity.
If you are doing redundant work, the problem becomes that you now have two versions of the truth: the information can quickly get out of sync, and in effect, you have zero versions of the truth.
If instead, you are allowing the application to send insecure queries – expecting the rewrite system to intervene – then you are begging for trouble. In particular, once any part of the application needs content that is not directly related to the user, such as seeing aggregate data, then you would effectively be blocking the application from doing it’s actual work.
Finally, this does not actually solve the “AppSec” problem. You still need to check for software vulnerabilities. Sure, if the query rewrite system is filtering data, they can only see a limited view of data, but in such case, they might also have access to multiple users. That’s why data exfiltration on a per-user basis is still possible.
Insider Threat Example
Assuming that we want Dave to be able to complete his work implies that we cannot expect any sort of query rewrite process to block him. In the case of identity-based rewrites, he would be unable to do his job if all of his queries were filtered, so this would imply that he has alternative access to the data. If he has alternative access, he cannot be effectively stopped from exfiltrating whatever data he desires.
In addition, if trying to determine his behavior via machine learning, and determining if he is doing something harmful, it might still not be the best practice to shut down his queries. In most cases, he would already know about the counter-measures, and he would have a way to work around them. Moreover, such systems often generate false-positives, so this might be terribly counterproductive.
Instead, the best way to handle the possible insider threats is to have separate querying and logging capabilities. The people who need access to the data can get it. At the same time, there would be separate people with access to the logs data. As long as sensitive queries are logged, machine learning can be applied to determine if Dave is doing something that might need to be stopped. But in this case, don’t force Hal to make that decision: Hal might decide to close the pod bay doors, leaving Dave out in the cold. Instead, let Hal give us the warning, and get humans involved to verify.