Observability Metrics for Troubleshooting Database Performance
In this blog post, we show how Cyral’s observability metrics can be used by DevOps and SRE teams for tracking usage of and diagnosing performance issues with their SaaS BI tools, ETL jobs, and DBaaS endpoints.
The observability challenge
Logs, metrics, and traces (oh my!) are often referred to as the three pillars of observability—artifacts you can use to better understand and more easily manage your systems. What makes metrics especially powerful for observability is the ability to aggregate them along a number of dimensions to provide wide-ranging perspectives of your systems. Moreover, the numerical nature of metrics makes them conducive to mathematical transformations such as sampling, summarization, and correlation. As such, metrics are suitable for health status reporting, alerting, inference, and more. For example, response time metrics measured by an application performance management (APM) tool can be used to identify and optimize application code paths that are taking too long to execute due to suboptimal use of storage, network, or database resources.
While there are many observability tools on the market, several interesting database performance and usage insights remain difficult to track, and, in heterogeneous database environments, they cannot be derived reliably using native database logs and metrics:
- Which users are responsible for the lion’s share of request execution time?
- How does the number of requests from each accessing entity change over time, and can we identify bottlenecks?
- How do settings like the connection pool size affect service performance?
There are two main reasons why native database metrics don’t provide the observability you need to answer questions like those listed above:
- Most databases do not report metrics at the request level, instead reporting general aggregate metrics. This is especially true for high-throughput transactional databases.
- Many services including third-party SaaS BI tools typically use shared credentials, resulting in a lack of real identity attribution from the database’s perspective.
How Cyral can help
To help with this problem, Cyral publishes observability metrics collected at the request level and complements them with labels that provide high-resolution context of the data activity. For example, some labels describe the entities from which the data activity originates, distinguishing between human and application accessors as well as humans accessing data through applications. Such labels, along with many others, can be leveraged to observe your data activity through myriad lenses, as we’ll show in the following sections. These metrics can be fed into visualization tools like Grafana or Datadog to produce dashboards that translate the raw data into actionable insights.
Grafana dashboard for monitoring database performance
The following sections describe examples of how Cyral’s data layer metrics enable better understanding of your data activity.
Monitoring Looker Query Performance
With many BI tools such as Looker, all database accesses are done through a single service user, and there’s no way to attribute requests to the individuals responsible for them. As we mentioned earlier, labels associated with Cyral metrics can identify the user responsible for each data access event, both for direct queries and for access through an application. Using these labels, we can attribute data activity with greater fidelity, even when it’s all aggregated through shared repository credentials. The following chart is composed of execution time metrics filtered to include only Looker users, and illustrates the query response time experienced by each user.
The chart shows a clear spike in response time at a point in time, and since the data is provided on a per-user basis, we know that it’s specifically attributed to our teammate Randy Chang (email@example.com). This spike in execution time indicates a long-running query, which can ultimately impact the performance of the underlying data repository. Monitoring tools such as Grafana allow us to trigger alerts based on metrics like this, minimizing the time to resolution, and consequently minimizing ripple effects felt by other consumers of the data repository.
Monitoring Weekly Snowflake Usage
Snowflake usage is often a metric of interest. In any given billing period, Snowflake’s usage metric tracks closely with the cumulative execution time of the queries your team runs against Snowflake repositories. Leveraging metric labels describing request destination, we can filter our view of query activity to show only time spent on queries to Snowflake, and we can group the totals by user, creating a chart like the one shown below. This chart shows Snowflake usage over a period of time for the five users contributing the most usage, highlighting their total usage in the current cycle.
In this specific example, our friend Nancy Drew (firstname.lastname@example.org) seems to have a disproportionate amount of Snowflake usage attributed to her, compared with the rest of the highly active users. Based on this, we can conclude either that Nancy is working substantially more than her colleagues, or that she’s executing unusually long-running queries. Digging deeper, we examine a chart showing query execution times for the same users.
The chart shows that Nancy consistently runs queries which take an order of magnitude longer to execute, compared with those of her peers. Thus, we conclude that Nancy’s queries should be examined for inefficiencies, and we’re thankfully assured that she hasn’t been working absurdly long hours!
Diagnosing ETL Throughput Issues
ETL pipelines are a critical data engineering service, and every ETL process needs to execute on time and without problems. With the right chart in Grafana, we can quickly spot possible problems in our ETL jobs. The first step in creating the chart is to filter our statistics to show only those jobs we care about. We do this by using labels on the metrics data to indicate the source entity for each. In this case, we first filtered out all human-generated actions, and then, from the set of application-based actions, we chose to show only the metrics for our ETL service, called “emr-ingest”. Below, we show two such charts, request execution time and request throughput:
These charts show an oscillating pattern since ETL services spin up intermittently. Once each job is done, the service goes to sleep until the next job. As such, the periods showing non-zero metric values show us when an ETL job is running. In these charts, we can see that there’s a point in time when our ETL jobs start showing much longer runtimes than before, indicated by the longer periods of uptime. In the Service Request Throughput graph shown in the figure above, we can see that each slowdown corresponds with consistently decreased request throughput.
To investigate the cause behind this, we examine another graph for the same period, this time showing the perceived configuration of the ETL service, namely the connection pool size and batch size.
This chart shows that the increased runtime corresponds with a decrease in connection pool size and no change in the batch size of each job. This explains the decreased throughput we saw in the earlier graph. We can infer that once the connection pool shrinks, our ETL job is able to push fewer requests per minute through the lowered number of connections available to it.
Because the connection pool size is consistently lower across all ETL jobs after a certain point in time, it’s likely it was caused by a change in the configuration or code of the “emr-ingest” ETL service. In any case, by looking at the performance metrics over time, we’ve been able to identify a likely cause for our problem: a decrease in connections available to the ETL job.
Observability metrics are a powerful tool for understanding how large systems behave. When diagnosing problems, timely metrics are an invaluable source of information, helping answer questions that lead to fast solutions. Giving your organization higher resolution data activity metrics is one of many ways Cyral helps answer questions about the data layer, like understanding where your cloud data usage is highest or finding the root cause of an ETL slowdown, as we explored above. If there are other interesting problems you want to share with us, please feel free to drop us a note here or sign up to see a demo!
The Security Digest: #36
Hello and welcome to TSD, your weekly blog post with top of mind security issues. TSD began as an internal newsletter that our Security Lead, …
Attribute Based Access Control in Databases
Introduction Databases are one of the most security-sensitive components of the modern day information infrastructure of any organization. This is because they are the custodians …