Recently I had to optimise a query that was part of some slow-running ETL: it was taking longer than its allocated load window, and when run it would bring the source systems to their knees. Here’s how I solved it- along the way I found some interesting limitations to what MySQL supports and how it is not a fan of derived tables…
Several sets of data were joined together in one rather large MySQL query that would often take a couple of hours to run. The first necessary step for any kind of SQL optimisation is to spend time with the query to work out what it is supposed to do- this is made much easier if complex queries are commented. Once familiar with its purpose and how it got there, I could turn my hand to trying make it run a little more like a finely tuned sportscar than the clapped out old banger that it was.
One of the things that complicated the query was that several full outer joins were necessary, and MySQL doesn’t support those. Each full outer join had to be emulated by unioning together two very similar queries, each returning thousands of rows. This couldn’t be avoided so this was an area that we couldn’t improve futher. I hope this is fixed in later versions of MySQL- it is something I’d rather not have to jump through hoops to achieve.
Each large data set (including simulated full outer joins) was being treated as a derived table, and subsequently joined to other derived tables, e.g.
FROM (SELECT key, things FROM sometables) as DerivedTable
INNER JOIN (SELECT key, things FROM sometables) as AnotherDerivedTable
ON DerivedTable.key = AnotherDerivedtable.key
I googled somewhere that in MySQL, views perform better than derived tables, so my first attempt at optimisation replaced each derived table with a named view.
INNER JOIN AnotherView
ON AView.key = AnotherView.key
Initially we thought that it had worked- the query time was reduced to less than half an hour. However when I came back in the next day, the query was back to its old, creaking performance. Clearly the cache had played its part and views weren’t any better after all.
The next method of attack was to look at the joins- neither the derived tables or the views could be indexed on the columns we were joining on.
The solution was to break the query down into a few steps:
- Load each derived table into a (temporary) table,
- Add indexes to each (temporary) table.
- Execute the main query, which joined the temporary tables together on the indexed columns.
This gave great results: the query took 5 minutes to run, including loading the temporary tables and adding the indexes.
Conclusion: MySQL doesn’t perform well when joining together large derived tables or views, and a better solution is to use indexed temporary tables.