When you’re designing a dimensional model for a data warehouse, it’s very important to look long and hard at your dimensions. How will you track changes to their properties over time? Should we be using Type 1, Type 2 or Type 3 Slowly Changing Dimension (SCD) handling? Should some fields be Type 2, and others in the same dimension be Type 1?
The main factor that helps you make up your mind is how the users will want to query that dimension and the facts it relates to. Most of the time, with enough analysis of the user requirements and some caffeine to focus the mind, you can come to a conclusion that will cater for all but the rarest of queries.
Sometimes, though, you can’t come to that kind of decision. It’s a sad fact of life that often you don’t have enough access to users or subject area experts to know how a dimension will be used, or worse, they might want to use it any which way they choose. Increasingly, self-service BI based on OLAP cubes or semantic layers will encourage you to provide a flexible reporting capability, perhaps at the expense of simplicity and elegance. The ‘Type 6 SCD’ hybrid approach is one way of doing this.
SCD handling choices
Type 1 simply overwrites the old values with the new- this is great if you want to query facts based on the current dimension attributes, and want to include facts that occurred in the past, when dimension attributes may have been different. It doesn’t let you look at the history of a dimension, and it doesn’t let you query by how things were at the time.
Type 2 handling creates a new dimension record when values change, leaving the old record in place. This is perfect for querying facts based on the dimension values in force at the time, and it lets you keep track of the history of how the dimension has changed. If you want to query by the current state of the dimension, that is fine, but you won’t pick up the facts that took place in the earlier life of the dimension.
Type 3 handling suggests you create new fields to track the new way of looking at a dimension. For example, if an organisation has a brand new way of organising its stores into regions, and you don’t yet want to abandon the old way, you’d add a ‘NewRegion’ field to your Dim_Store table. You could then query by ‘Region’ or ‘NewRegion’.
What if you want to be able to do the kinds of querying that Type 1 provides, but in other cases you want to query in a Type 2 kind of a way? Or what if you need to provide the users with the flexibility to do both in a semantic layer or OLAP cube?
Introducing Type 6 SCD handling
Type 6 is a hybrid: a combination of all three techniques, hence its name: 1+2+3. Clever, eh?
Actually there are a number of ways to achieve hybrid SCD handling, but I’ll just describe the one that has worked well for us at IPL . The links at the bottom of the post provide a few interesting alternatives.
The approach we have used is to have a standard Type 2 dimension, but to have a Type 1 outrigger that you can use when you want to query in a Type 1 way. The Type 1 table will have a primary key of the business key of the dimension (or another durable key, maybe one you make up yourself), and will be overwritten when values change. The fields inside this table should be named differently to distinguish between them and the Type 2 fields: E.G. Region in the Type 2 table would be CurrentRegion in the Type 1 table.
This gives you a great deal of flexibility- if you want to query through history on the current state of your dimensions, use the Type 1 table, and if you’re interested in querying based on the values at the time, use the Type 2 dimension.
A good default approach would be to create the Type 1 table as a view of the Type 2 table’s current records, rather than explicitly building it. This will be a relatively cheap view, but will leave you the option of building it for real if performance is found to be an issue.
What about your facts? You can make the most of your Type 6 SCD by including the business key of the dimension in the fact along with the Type 2 dimension’s surrogate key. This allows you for even more flexibility- you can even query your facts based on dimensional values that applied at a point in time- for more information check out the second link below.
We’ve used this Type 6 approach to provide users with flexibility when requirements have been hard to come by, or when flexibility itself is a definite requirement. You should use it with caution, since in most cases you can eventually decide between Types 1, 2 and 3- this would be preferable in terms of simplicity. Type 6 adds overhead and complexity to your ETL. The model becomes harder for the end user to understand, which is something to avoid if you can.
Overall, this technique is a useful tool to have stashed away in your dimensional modelling toolbelt. If you’ve implemented this or something like it, please add to the comments!
If you’d like to read more about Type 6 dimensions, or other ways of combining SCD techniques, check out these links: