Why would I want to do this?
Tableau Server resource is a valuable commodity and one of the activities which use this, like no other, is the rebuilding/refreshing of Data Extracts (TDEs). At Atheon, our data (usually) refreshes each day, with many hundreds of extracts to refresh. This uses a huge amount of resource which impacts our users, where their models are likely to slow down while this process is happening (…we schedule our refreshes overnight to reduce this impact, but this is restrictive). The Backgrounder is the process which performs this task and its resource can’t be limited (without buying an extra license and moving, that process, to a second machine). This means it will take as much resource as it feels necessary, at the expense of all others. However there is another way…use the SDK to rebuild the TDEs (outside Tableau) and then push the refreshed TDE to Tableau Server.
In this post, I want to direct you to an open source wrapper for Tableau SDK (in Python) to create and publish TDEs (built by our resident genius Vathsala), as well as touch on a few differences when using these SDK-built TDEs, within a workbook.
Open source wrapper for Tableau SDK in Python to create and publish .tde
We have tested this against EXAsol (in-memory database) but not much else, as we don’t have a need to (currently). Initially, this library was created to support MSSQL and all the code is in place (in the library) to connect to MSSQL (but it has not been tested yet). We welcome anyone to give this a try and give us any feedback, and so contribute to making this as general purpose and stable as possible.
Using SDK built TDEs in Workbooks
The first thing we need to cover here is the murky world of where Tableau calculated fields are stored/written. This is a huge, and not (that) well-documented subject…I will just give a generalised* overview. For a more in-depth look at this topic, I’ve found no better investigation than (Tableau Zen and general genius) Tamás Földi’s post
*Like most software, there are many nuances in Tableau, so the behaviour described below is “what usually happens”, but there will be some situation when this is not the case
Where do Tableau Calculations get written/stored?
So let’s start at the beginning!…
We connect to a data source (say an MSSQL view), via Tableau Desktop, and extract the data to a (Local) TDE. At this point, both the TDE and original view from MSSQL contain the same information (just the raw data from the view). I’m going to use a retail EPoS (Electronic Point of Sale/Till) example, and the raw data looks like the below;
So first I want to create an aggregated calculated field to return the Average Selling Price, which I do like this
and then I create the following (non-aggregate/row level) calculation to count the number of day/product combinations without any stock (abbreviated to OOS, Out of Stock, in the trade), which looks like
At this point, both calculations are written into the workbook (.twb). I can verify this by opening up the TDE in a new workbook, and I don’t see the calculations. However, if I optimise my extract…
These calculations get written into the TDE. Again I can verify this by opening the TDE in a new workbook and see that my calculations are there (this is a new workbook, so it can’t have got them from anywhere else, but the TDE).
This is all very interesting, but what has it got to do with SDK created TDEs?
Well, when we publish a TDE to Tableau Server, it optimises the TDE (writing the calculations to the TDE) in the publish process…and we CAN NOT (at this time) create these calculated fields (definitions) in the SDK-created TDE. So the SDK created TDE will only ever be the fields (columns) that are in the data source (original) view.
This affects things in 2 ways…the way we work and potential performance issues. Let’s look at working practices. Previously we would create a Tableau dashboard (or set of dashboards) in a workbook, from the (Local) TDE . So we’d add in any calculated fields we need, build the sheets/dashboards…etc. and then we’d publish the TDE to the server (now using the “Update workbook to use the published data source”…btw THANK YOU TABLEAU for this tick-box…it means a lot!)
and then we’d publish the workbook.
Now recall that our calculations are now written into the (server-side) TDE, and not (necessarily*) the workbook. So this means if we use our SDK to rebuild this data source and publish it over the version we (previously) published, the workbook will break. The workbook is relying on the TDE to tell it how to compute any calculated fields and the (server-side-SDK-created) TDE no longer has that information.
So what we need to do is create the TDE using the SDK, and get the SDK to publish the TDE to Tableau Server. We then connect our (fresh/empty) workbook to the server-side-SDK-created TDE, than start developing…adding in any calculations (as we would do normally). As we can’t write to a server-side data source, any calculations we create are written into the workbook (this why when you try to edit a calculation from a server-side TDE you only get the Edit Copy option…it’s writing a copy into the workbook). Once our workbook is ready, we publish the workbook to Tableau Server. Now, when we rebuild the TDE (using the SDK) and publish (over the old) TDE, the workbook won’t break as it now only needs the TDE for the raw data (and not the calculation definitions).
It’s important to note here that we are copying over the previous TDE (same name and location). The SDK isn’t doing anything to the workbook. This means when the workbook is opened, it is none-the-wiser that the TDE is actually a refreshed/rebuilt version. It looks for the TDE (with that name in that location) to get the data, and finds it. This is akin to how a Live connection works (but the data source is the TDE)…The calculated field instructions are in the workbook, referencing the raw data from the Live view/table. The advantages over just using a direct live connection is that TDEs are (generally) faster than most (non-column-store) databases, it has access to all Tableau’s functions (eg DATEPARSE), and the database isn’t constantly being hit with queries (generated from Tableau as users use the workbook)…big shout-out to my friend (and Server guru) Toby Erkson for helping come up with the Live analogy!
So what about these potential performance issues?…Well, there is another thing that the optimisation process does…it can actually write (non-aggregated) row-level calculations into the TDE definition. So rather than run these calculations, on the fly, each time they are used…it materialises the values when it rebuilds the TDE. This means that these fields can run much faster**
**You can try this yourself, although you’ll need a reasonable (5M+ rows) size data source to perceive the difference! Create a row level (ideally muti-nested IF statement) calculated field and bring it into the canvas…it may take 2-3 seconds to compute/render. Close the workbook down (to clear the cache…no cheating!), now re-open it, create the same calculated field, optimise the TDE, and now try…milliseconds! Tamás Földi’s Blog goes into this in much more detail (NB. Tableau doesn’t do this for every row level calculation, such as one relying on a parameter)
For the aggregate calculation, we created (Average Selling Price)
it doesn’t matter where it sits (TDE or workbook), as it always has to be run on the fly…as the VizLoD (viz Level of Detail) control how it’s calculated (if that doesn’t make sense, read this), whereas the result of each row level calculation is the same regardless of the VizLoD, filtering…etc.
In many cases this won’t be an issue…we’ve done some tests using both a TDE materialised row level calculation, and a workbook (on the fly) version and noticed no difference. However these were only on data <250k rows, and I can see situations where it would have a material impact. But there is a solution…any row level calculation (such as the OOS one we created here) could easily be written into the data source view definition, and that way the data source (MSSQL in our example) is materialising the result of that calculation, so Tableau gets a materialised field (as it would do if it had done the materialising). Again this comes down to working practices….
So the final work flow is…create the TDE (locally) do some experimental development, so we know which fields we’ll want to be materialised, ask our DBA (nicely!) if they’ll add these to the view definition…get the SDK to create the TDE from this view, and publish it to the server. We then connect to the server-side-SDK-created TDE (from Tableau Desktop), then develop the final model and publish.
Hope people find this useful, and can free their server resource to allow Tableau Server to do what it does best…process data queries and return the results in beautiful renderings!
Please give it try and any feedback is always welcome.