Debugging an Unexpectedly Slow SQL Query Powering our Dashboards

March 8, 2021

Abnormal Security has a data-driven culture that permeates the entire organization, from the engineering team to product, sales, and beyond. We make decisions by analyzing data monitoring relevant metrics.

For online analytics, we use a great tool called Grafana, an open-source analytics and monitoring solution that can connect to a variety of data sources. It looks something like this.

Image for post
Grafana dashboard example (www.grafana.com)

Different teams build their dashboards to serve their unique needs. The types of charts we have range from systems stability monitoring to product performance metrics to business metrics. The data that feeds into these disparate dashboards comes from a wide range of sources, including traditional SQL databases and time-series databases like Prometheus and more. You can see the types of data sources available for integration here.

Image for post

A conventional SQL database is central to our organization and is one of the most important and used data sources for our Grafana charts.

The Problem with Our Dashboards

A hyper-growth company naturally has growing pains. Some of the dashboards set up in the early days that were initially snappy were becoming slower and slower. For a long time, we decided to kick the can down the road and put up with long load times for some dashboards. Like many startups, we always have more to do than we can take on, and this seemed like a lower priority item.

While the fifteen minutes it took to open these dashboards was just an annoyance, it had a distinct perk of providing a great coffee break in the morning. Finally, one day, some of them started to timeout entirely, which prevented users from viewing the data. Not having access to metrics can cause a lot of anxiety and without metrics, we were essentially flying blind. We still needed them fixed ASAP. Therefore, we started investigating the issue and found a short-term solution that will give us 6 to 12 months of runway.

Initial Investigation

In the beginning, we were not sure what the underlying issue was. Many issues can lead to slow queries, but the two most likely are that (1) the compute instance runs Grafana and (2) the MySQL database itself. The possibility that the Grafana instance was the culprit was quickly ruled out after a cursory glance at the instance’s systems metrics. The CPU, memory, and disk IO on the instance did not look significantly different when we tried to open these dashboards that were timing out.

That led to the next candidate: MySQL. One thing that immediately caught my eye when looking at the dashboards was that they were all utilizing this one particular view. Since a view in MySQL is essentially a stored query (not a result), the first idea that came to mind was that we could make this a materialized view. Unfortunately, MySQL doesn’t support materialized views, so achieving the equivalent of a PostgreSQL materialized view on MySQL would require a bit more work.

Before jumping to the conclusion and doing the work to create something like a materialized view, I wanted to dig deeper. We had reports from users that exploding out the view would make a chart load much faster (that is manually running parts of the query). Because MySQL views are stored queries, it is surprising that exploding a view out in a query would result in a performance increase; the difference was most likely due to the queries themselves. However, it did signal something important—the queries involving the views were more inefficient and could be rewritten to be more efficient.

A Closer Look at the MySQL View

The above conclusion, that the view query was the culprit, surprised me because the view contained a fairly straightforward query. It was not clear why it would be slower than the manually exploded versions of parts of the query.

We will use the following tables over “Author” and “Book” as an illustrative example:

Image for post

Based on a brief look at the query, one might expect this query to run quite fast since the WHERE clause’s column is indexed. But when I ran the query, it timed out. To figure out why we could use the SQL keyword EXPLAIN and look at the optimizer’s plan.

Image for post

From the output of the EXPLAIN query, it was evident that the expected index, datetime_idx, was not used and author_idx was chosen instead. This was surprising. Why was the optimizer choosing an index that doesn’t use the column in the WHERE clause? The author_id in the author_idx might have helped with the join, but I failed to see how it could help the query run faster than selecting the datetime_idx. “using where” in the extra column was telling us that the date filtering was done by scanning through all the rows post JOIN without using the index. Surely this couldn’t be optimal.

After figuring out that the optimizer was choosing the wrong index, the impulse might have been to add a FORCE_INDEX to the SQL query. This would help but is this what you should do? The answer is no. The reason is that the MySQL query optimizer should pick a reasonable plan most of the time—that is what it is designed for, after all. If we forced it to use an index, we could get into a situation where we do not allow the optimizer to choose the best plan possible when the query is sufficiently complex. To figure out why the query optimizer chose the suboptimal plan, we have to understand why a plan is selected.

MySQL Query Optimizer

The query optimizer uses a cost-based approach in determining the execution plan to use and derives the cost from table statistics. For InnoDB, the storage engine we use, these statistics are an estimate and not the true statistics. InnoDB samples random pages from each index on a table to calculate the index’s cardinality to update the statistics. With that in mind, let’s look at what the index statistics looked like.

Image for post

Again, the result here was unexpected. The number of rows in the Author table was roughly 200, but here, the estimated number was 20000, which was off by a factor of two! We knew joining Book with Author would produce the number of rows equal to the total number of rows in the Book table, but the query optimizer failed to recognize this because of the wrong statistics. The query optimizer first did a table scan on Author and saw only 200 rows, or 1% of the estimated number of Authors. Based on the inaccurate cardinality, it inferred the number of output rows when joining 200 authors to books would yield a much smaller number of rows in the Book table and thus, joining by using author_idx was the best choice.

So why was the estimation so far from the actual number? The first suspicion was that the statistics hadn’t been updated for quite a while. I first tried to fix it by running ANALYZE TABLE, but the statistics remained pretty much the same after the successful execution. It turned out that there was an option that one could tweak named innodb_stats_sample_pages, which had a default of 20.

To illustrate why this might be a problem, consider the following example: if there are 200 unique ids in 20 pages and supposed 20 of them only showed up once. For a total of 200,000 pages, the estimated number will be something like 20180. The actual estimation algorithm is probably a lot more complicated than this, but the same idea applies. A solution to this problem is to increase the number of pages sampled when calculating the statistics.

This, however, has some downsides. First, we have to manually examine the statistics periodically and readjust the number of pages sampled. The statistics are updated automatically whenever the table grows by 10%. However, as the table grows, we will likely need to sample more pages and ANALYZE TABLE will become slower and slower over time. While ANALYZE TABLE is not blocking, it can potentially lead to some downtime.

Our Solution to Effectively Update our Dashboards

To solve the problem, we decided to use STRAIGHT_JOIN instead of INNER JOIN. STRAIGHT_JOIN forces the join order of the table. This way, the execution scans the Book table in the outer loop, and the query optimizer will choose the right index to filter it down first. This solution was certainly not the panacea and is not optimal in all cases.

We picked it here because the view was a straightforward query, and most of the charts derived from this view were not complicated queries that might benefit from joins in a different order. Also, for the time being, we were anticipating the number of rows in the Author table to be much smaller than the cardinality of the datetime_idx. After the change, the charts went from timing out (5 mins) to running in ~10 seconds. Unfortunately, the team no longer has time for a coffee break, but fortunately, they no longer are smashing their monitors in frustration.

In the end, the solution was relatively simple, but the investigation was an enriching experience. I gained much more in-depth knowledge about how the MySQL query optimizer works and how to interpret the output from EXPLAIN to optimize queries. The fix is only short-term, however. As the company grows rapidly, we will need to rework how we compute metrics, as making ad-hoc queries won’t be a good idea when the data grows sufficiently large.

This year, one of the infrastructure and platform team goals is to scale the system to handle 10x the current traffic and beyond. Aside from metrics, there are also challenges around data processing, storage solutions, cost and scaling the engineers’ effectiveness, etc. If the challenges are interesting for you, we’re hiring!

Previous
Blog earth space
We are starting by eradicating advanced social engineering attacks through email. These attacks bypass traditional email security solutions and are a major problem for customers. Business email compromise (BEC) is the most dangerous cybercrime...
Read More
Next
Blog rock build
At Abnormal Security, we’re constantly exploring opportunities to improve our customer’s user experience. In this blog post, we’d like to share Abnormal’s process to design a framework to identify gaps and improve customer’s first-time user experience to onboard our platform.
Read More

Related Posts

Blog hiring cybersecurity leaders
As with every equation, there are always two sides and while it can be easy to blame users when they fall victim to scams and attacks, we also need to examine how we build and staff security teams.
Read More
Cover automated ato
With an increase in threat actor attention toward compromising accounts, Abnormal is focused on protecting our customers from this potentially high-profile threat. We are pleased to announce that our new Automated Account Takeover (ATO) Remediation functionality is available.
Read More
Email spoofing cover
Email spoofing is a common form of phishing attack designed to make the recipient believe that the message originates from a trusted source. A spoofed email is more than just a nuisance—it’s a malicious communication that poses a significant security threat.
Read More
Cover cybersecurity month kickoff
It’s time to turn the page on the calendar, and we are finally in October—the one month of the year when the spooky becomes reality. October is a unique juncture in the year as most companies are making the mad dash to year-end...
Read More
Ices announcement cover
Abnormal ICES offers all-in-one email security, delivering a precise approach to combat the full spectrum of email-borne threats. Powered by behavioral AI technology and deeply integrated with Microsoft 365...
Read More
Account takeover cover
Account takeovers are one of the biggest threats facing organizations of all sizes. They happen when cybercriminals gain legitimate login credentials and then use those credentials to send more attacks, acting like the person...
Read More
Blog podcast green cover
Many companies aspire to be customer-centric, but few find a way to operationalize customer-centricity into their team’s culture. As a 3x SaaS startup founder, most recently at Orum, and a veteran of Facebook and Palantir, Ayush Sood...
Read More
Blog attack atlassian cover
Credential phishing links are most commonly sent by email, and they typically lead to a website that is designed to look like common applications—most notably Microsoft Office 365, Google, Amazon, or other well-known...
Read More
Blog podcast purple cover
Working at hyper-growth startups usually means that unreasonable expectations will be thrust on individuals and teams. Demanding timelines, goals, and expectations can lead to high pressure, stress, accountability, and ultimately, extraordinary growth and achievements.
Read More
Blog yellow skyline
No one wants to receive an email from human resources that they aren’t expecting. After all, that usually means bad news. And when we think there may be bad news, cybersecurity training tends to fall by the wayside. Threat actors know this, and they’re taking advantage of human emotions.
Read More
Blog rising building
There is little doubt that business email compromise and other advanced email threats are causing significant damage–both financial and reputational—to organizations worldwide. Because these never-before-seen attacks contain few indicators of compromise, they evade secure email gateways and other traditional email infrastructure...
Read More
Blog purple person outline
Identity theft is not a joke, impacting more than 14 million people each year in the United States alone. Over the course of their lifetime, nearly one-third of all people will become victims of identity theft—often as a result of a corporate data breach. Once attackers have access to identifying information like your full name, address, date of birth, and/or social security number...
Read More