Optimising a complex query in MySQL: avoid derived tables!

Recently I had to optimise a query that was part of some slow-running ETL: it was taking longer than its allocated load window, and when run it would bring the source systems to their knees. Here’s how I solved it- along the way I found some interesting limitations to what MySQL supports and how it is not a fan of derived tables…

Several sets of data were joined together in one rather large MySQL query that would often take a couple of hours to run. The first necessary step for any kind of SQL optimisation is to spend time with the query to work out what it is supposed to do- this is made much easier if complex queries are commented. Once familiar with its purpose and how it got there, I could turn my hand to trying make it run a little more like a finely tuned sportscar than the clapped out old banger that it was.

One of the things that complicated the query was that several full outer joins were necessary, and MySQL doesn’t support those. Each full outer join had to be emulated by unioning together two very similar queries, each returning thousands of rows. This couldn’t be avoided so this was an area that we couldn’t improve futher. I hope this is fixed in later versions of MySQL- it is something I’d rather not have to jump through hoops to achieve.

Each large data set (including simulated full outer joins) was being treated as a derived table, and subsequently joined to other derived tables, e.g.
FROM (SELECT key, things FROM sometables) as DerivedTable
INNER JOIN (SELECT key, things FROM sometables) as AnotherDerivedTable
ON DerivedTable.key = AnotherDerivedtable.key

I googled somewhere that in MySQL, views perform better than derived tables, so my first attempt at optimisation replaced each derived table with a named view.

INNER JOIN AnotherView
ON AView.key = AnotherView.key

Initially we thought that it had worked- the query time was reduced to less than half an hour. However when I came back in the next day, the query was back to its old, creaking performance. Clearly the cache had played its part and views weren’t any better after all.

The next method of attack was to look at the joins- neither the derived tables or the views could be indexed on the columns we were joining on.

The solution was to break the query down into a few steps:

  1. Load each derived table into a (temporary) table,
  2. Add indexes to each (temporary) table.
  3. Execute the main query, which joined the temporary tables together on the indexed columns.

This gave great results: the query took 5 minutes to run, including loading the temporary tables and adding the indexes.

Conclusion: MySQL doesn’t perform well when joining together large derived tables or views, and a better solution is to use indexed temporary tables.

This entry was posted in Solutions and tagged , , , , . Bookmark the permalink.

2 Responses to Optimising a complex query in MySQL: avoid derived tables!

  1. Chareth Cutestory says:

    I don’t know if your conclusion is supported by your argument: I’m assuming when you say “…neither the derived tables or the views could be indexed on the columns we were joining on”, you mean that because derived tables and views aren’t physical tables you can’t create an index on them. What you could try though is to create an index on the key column in the “sometables” tables referenced in the FROM clause in the derived table expressions, and then retry your original query — I would think that you would get a similar speed-up as you do using the temp tables.

    In other words, it’s not really a question of using temporary tables vs. a derived tables vs. views; it’s a question of joining on indexed columns vs. joining on non-indexed columns. If you have the ability to clone the database for testing purposes, you may want to try creating indexes on the source tables’ join key columns (“sometables.key”) in the cloned instance and run the original query, to see if you get the same speed up. I also suspect that if you reviewed the query plans of the non-indexed join in the original query vs. the indexed join in the last query, you’ll see very different plans (likely a hash join on the non-indexed join vs. a merge join on the indexed join).

    • rjback says:

      It may well be that MySQL has moved on since I wrote this post. I can assure you that the indexes on the underlying tables were already in place or were infeasible in the circumstances ( they weren’t a single column, and I can’t recall the issue)- I can’t remember which I’m afraid. Of course, indexing is the first thing you’d look for, but it didn’t help us here.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s