domingo, 8 de enero de 2017

Referenced Queries And Caching In Power BI And Power Query

Last week, Maxim Zelensky (whose blog is well worth checking out) tweeted about a very interesting answer he had received to a question he posted on the Power Query MSDN forum, on the subject of caching and referenced queries in Power Query. You can read the thread here:
…but since this is such important information – it’s certainly something I’ve wondered about myself – I though I would share Maxim’s question and the response from Ehren of the Power Query dev team here in full so it gets the wider visibility it deserves. I’m very grateful to Maxim for letting me share this and to Ehren for writing such a detailed response.
First, Maxim’s original question:
There are two different scenarios I am working with:
1) Query1 connects to the data source (flat file) and make basic cleaning and transformations.
Then Query2 and Query3 reference to Query1, performing other transformations needed.
Query3 also take some data from Query2.
Query2 and Query3 then exported to sheet, Query1 – connection only.
As far as I can understand, PQ can define refresh chain like this: evaluate Query1, then evaluate Query2, then evaluate Query3 (as it need the results from Query2). The question is: When PQ performs calculation of Query3, will it recalculate Query1? Or, as it was evaluated in the chain before, Query3 will use cached results of previous Query1 calculation (performed when Query2 was evaluated)?
2) I have a set of flat files, and I take data from them with one Query1. Query1 also performs some transformations. Then I have two independent Query2 and Query3, both connected to Query1 performing different transformations of source data. Results of Query2 and Query3 evaluations then used in Query4, which exports its results to the sheets, Query1, Query2 and Query3 – connection only
The second question is: performing "Refresh" on Query4, how much times will be Query1 evaluated – one, two or three ? Or there also will be chain: calculate Q1, caching, then Q2 or Q3, calculate next using cached results of Q1 evaluation, and then – Q4?
3) Is there is a difference with connection to database?
4) Is there any rules of evaluation chain (like each expression/query will be calculated once in the evaluation chain)?
And here’s Ehren’s reply:
There’s a lot involved in answering your question, so let me back up and explain a few things first.
Caching
Power Query (in both Excel and Power BI Desktop) utilizes a "persistent cache", stored on disk, when refreshing queries. But what exactly does that mean?
First, let’s look at what gets cached. The persistent cache does not store the results of your M Queries (Q1, Q2, etc. in your example). Instead, it stores the results of the behind-the-scenes requests sent to data sources.
So if Q1 queries a SQL Server database called "MyServer/MyDatabase" and returns a single unfiltered table called "MyTable", the query sent to the server might be "select [Col1] from [MyTable]". In this case, the persistent cache will now know the result of sending "select [Col1] from [MyTable]" to "MyServer/MyDatabase". If another M query (whether through referencing Q1, or by querying the same table directly) needs the same result, the persistent cache can provide it, and the result won’t have to be fetched a second time from the SQL Server.
"Great," you might say. "So if I’m pulling from a flat file in Q1, and in a few places in Q2 I need to do Table.RowCount(Q1), the file should only be read from disk once, right?" And the answer would be…no. This is because not all data sources are cached. Specifically, the results of calls to File.Contents are not stored in the persistent cache. Why not? Well, the cache is stored on disk, and caching local files (which are already on disk) elsewhere on disk doesn’t really make sense. (Using Table.Buffer in this context may help…see more on Table.Buffer below.)
"Okay", you might say. "But if Q1 is pulling from a SQL table, and in a few places in Q2 I reference Q1, that should hit the persistent cache, right?" Maybe. It depends on how Q2 is using Q1, since doing additional operations on Q1 (such as filtering or merging) might cause the M engine to compute a different SQL query, resulting in the server being hit again.
Next, let’s look at the scope of caching. The scope of caching differs depending on what you’re doing, as well as what tool you’re using.
Previewing
If you’ve opened the Power Query editor in Excel or Power BI Desktop, you might have seen warnings like "This preview may be up to 3 days old". This is because there is a persistent cache used for interactive previewing of query results. As you can imagine from the fact that we have warnings about preview results being days old, this cache is long-lived and is intended to make the experience of working in the editor faster.
Loading to Excel
If you load/refresh three queries in Excel, each of them gets their own persistent cache. So the fact that a SQL result is cached during the load of Q2 won’t benefit the loading of Q3, even if it needs the same result.
Loading to Power BI Desktop
If you load/refresh three queries in PBI Desktop, they all share a single persistent cache. When you refresh multiple times, each refresh operation gets its own cache (shared by all the queries being refreshed at that particular time). This means that if SQL result is cached during the load of Q2, it will still be cached during the loading of Q3 (assuming they’re both being loaded at the same time).
What about Table.Buffer?
Table.Buffer can be useful if you want to store an intermediate result in memory and avoid pulling content from disk, a remote file share, a SQL Server, a website, or any other data source multiple times during an evaluation.  Think of Table.Buffer as, "load this table into memory, and stop folding subsequent operations back to the data source".
However, because buffering happens in memory and is not persisted on disk, buffering during the load of one query does not affect the load of another query. If Q1 is buffered when Q2 is loaded, Q1 will be re-buffered when Q3 is loaded.
And now to answer your question…
Now let’s take a look at your example (Q4 references Q2 and Q3, and Q2 and Q3 both reference Q1).
Since you’re pulling from a flat file, and File.Contents results aren’t cached, the flat file will be read each time Q1 is referenced (twice in Q4, once in Q3, Q2, and Q1). If you buffered the result of Q1, then Q4 would only read the file once. But when Q1, Q2, and Q3 are loaded (even in PBI Desktop), they will still each also read the file.
What about immutability?
You asked about the fact that M values are supposed to be immutable. This is true for the "pure" parts of the language, but breaks down when you introduce external data sources and folding. (In fact, you could think of Table.Buffer as transferring a table from the fuzzy unpredictable world of folding to the immutable world of pure M values.) You can see this in action by doing the following test, using a query called "MyTextFileQuery" that pulls from a local file on disk.
Reads the file five times
= Table.RowCount(MyTextFileQuery) + Table.RowCount(MyTextFileQuery) + Table.RowCount(MyTextFileQuery) + Table.RowCount(MyTextFileQuery) + Table.RowCount(MyTextFileQuery)
Reads the file once
= let rowCount = Table.RowCount(MyTextFileQuery) in rowCount + rowCount + rowCount + rowCount + rowCount
I’ve read this response several times and it’s still sinking in, but clearly there are some important implications here for anyone doing more advanced data loading work in Power Query and Power BI. I’m sure it will be the inspiration for many future blog posts on tuning Power Query query performance
Chris Webb BI Blog