Abnormal Security’s James Yeh wrote this medium post here. Read the full story below.
Abnormal has a data-driven culture that permeates the entire organization, from the engineering team to product, sales, and beyond. We make decisions by analyzing data monitoring relevant metrics.
For online analytics we use a great tool call Grafana, an open-source analytics and monitoring solution that can connect to a variety of data sources.
Different teams build their dashboards to serve their unique needs. The types of charts we have range from systems stability monitoring to product performance metrics to business metrics. The data that feed into these disparate dashboards come from a wide range of sources, including traditional SQL databases and time-series databases like Prometheus and more.
A conventional SQL database is central to our organization and is one of the most important and used data sources for our Grafana charts.
A hyper-growth company naturally has growing pains. Some of the dashboards set up in the early days that were initially snappy were becoming slower and slower. For a long time, we decided to kick the can down the road and just put up with long load times for some dashboards. Like many startups, we always have more to do than we can take on and this seemed like a lower priority item. While the fifteen minutes it took to open these dashboards was just an annoyance, it had a distinct perk of providing a great coffee break in the morning. Finally, one day, some of them started to timeout entirely, which prevented users from viewing the data. Not having access to metrics can cause a lot of anxiety. Without metrics, we were essentially flying blind. We still needed them fixed ASAP. Therefore, we started investigating the issue and found a short-term solution that will give us 6 to 12 months of runway.
In the beginning, we were not sure what the underlying issue was. Many issues can lead to slow queries, but the two most likely are that the (1) the compute instance runs Grafana and (2) the MySQL database itself. The possibility that the Grafana instance was the culprit was quickly ruled out after a cursory glance at the instance’s systems metrics. The CPU, memory, and disk IO on the instance did not look significantly different when we tried to open these dashboards that were timing out.
That led to the next candidate: MySQL. One thing that immediately caught my eye when looking at the dashboards was that they were all utilizing this one particular view. Since a view in MySQL is essentially a stored query (not a result), the first idea that came to mind that we could make this a materialized view. Unfortunately, MySQL doesn’t support materialized views, so achieving the equivalent of a PostgreSQL materialized view on MySQL would require a bit more work.
Before jumping to the conclusion and doing the work to create something like a materialized view, I wanted to dig deeper. We had reports from users that exploding out the view would make a chart load much faster (that is manually running parts of the query). Because MySQL views are stored queries, it is surprising that exploding a view out in a query would result in a performance increase; the difference was most likely due to the queries themselves. However, it did signal something important: the queries involving the views were more inefficient and could be rewritten to be more efficient.
The above conclusion, that the view query was the culprit, surprised me because the view contained a fairly straightforward query. It was not clear why it would be slower than the manually exploded versions of parts of the query.
We will use the following tables over “Author” and “Book” as an illustrative example:
Based on a brief look at the query, one might expect this query to run quite fast since the WHERE clause’s column is indexed. But when I ran the query, it timed out. To figure out why we could use the SQL keyword EXPLAIN and look at the optimizer’s plan.
From the output of the EXPLAIN query, it was evident that the expected index, datetime_idx, was not used and author_idx was chosen instead. This was surprising. Why was the optimizer choosing an index that doesn’t use the column in the WHERE clause? The author_id in the author_idx might have helped with the join, but I failed to see how it could help the query run faster than selecting the datetime_idx. “using where” in the extra column was telling us that the date filtering was done by scanning through all the rows post JOIN without using the index. Surely this couldn’t be optimal.
After figuring out that the optimizer was choosing the wrong index, the impulse might have been to add a FORCE_INDEX to the SQL query. This would help but is this what you should do? The answer is no. The reason being that the MySQL query optimizer should pick a reasonable plan most of the time; that is what it is designed for, after all. If we forced it to use an index, we could get into a situation where we do not allow the optimizer to choose the best plan possible when the query is sufficiently complex. To figure out why the query optimizer chose the suboptimal plan, we have to understand why a plan is selected.
The query optimizer uses a cost-based approach in determining the execution plan to use and derives the cost from table statistics. For InnoDB, the storage engine we use, these statistics are an estimate and not the true statistics. InnoDB samples random pages from each index on a table to calculate the index’s cardinality to update the statistics. With that in mind, let’s look at what the index statistics looked like.
Again, the result here was unexpected. The number of rows in the Author table was roughly 200, but here, the estimated number was 20000, which was off by a factor of two! We knew joining Book with Author would produce the number of rows equal to the total number of rows in the Book table, but the query optimizer failed to recognize this because of the wrong statistics. The query optimizer first did a table scan on Author and saw only 200 rows (1% of the estimated number of Authors). Based on the inaccurate cardinality, it inferred the number of output rows when joining 200 authors to books would yield a much smaller number of rows in the Book table and thus, joining by using author_idx was the best choice.
So why was the estimation so far from the actual number? The first suspicion was that the statistics hadn’t been updated for quite a while. I first tried to fix it by running ANALYZE TABLE, but the statistics remained pretty much the same after the successful execution. It turned out that there was an option that one could tweak named innodb_stats_sample_pages, which had a default of 20.
To illustrate why this might be a problem, consider the following example: if there are 200 unique ids in 20 pages and supposed 20 of them only showed up once. For a total of 200,000 pages, the estimated number will be something like 20180. The actual estimation algorithm is probably a lot more complicated than this, but the same idea applies. A solution to this problem is to increase the number of pages sampled when calculating the statistics. This, however, has some downsides. First, we have to examine the statistics periodically manually and readjust the number of pages sampled periodically. The statistics are updated automatically whenever the table grows by 10%. However, as the table grows, we will likely need to sample more pages and ANALYZE TABLE will become slower and slower over time. While ANALYZE TABLE is not blocking, it can potentially lead to some downtime (link).
What we ended up doing was using STRAIGHT_JOIN instead of INNER JOIN. STRAIGHT_JOIN forces the join order of the table. This way, the execution scans the Book table in the outer loop, and the query optimizer will choose the right index to filter it down first. This solution was certainly not the panacea and is not optimal in all cases. We picked it here because the view was a straightforward query, and most of the charts derived from this view were not complicated queries that might benefit from joins in a different order. Also, for the time being, we were anticipating the number of rows in the Author table to be much smaller than the cardinality of the datetime_idx. After the change, the charts went from timing out (5 mins) to running in ~10 seconds. Unfortunately, the team no longer has time for a coffee break, but fortunately, they no longer are smashing their monitors in frustration”
In the end, the solution was relatively simple, but the investigation was an enriching experience. I gained much more in-depth knowledge about how the MySQL query optimizer works and how to interpret the output from EXPLAIN to optimize queries. The fix is onlyshort-term, however. As the company grows rapidly, at some point, we will need to rework how we compute metrics as making ad-hoc queries won’t be a good idea when the data grows sufficiently large.
This year, one of the infrastructure and platform team goals is to scale the system to handle 10x the current traffic and beyond. Aside from metrics, there are also challenges around data processing, storage solutions, cost and scaling the engineers’ effectiveness, etc. If the challenges are interesting for you, yes, we’re hiring!
Abnormal is the email security company that stands for trust.
© 2021 Abnormal Security Corporation.
All rights reserved.