How Should We Implement A Slowly Changing Dimension?
Currently, our data warehouse has only Type 1 Slowly Changing
Dimensions (SCD). That is to say we overwrite the dimension record
with every update. The problem with that is when data changes, it
changes for all history while this is valid for data entry
corrections, it may not be valid for all data. An acceptable example
could be Customer Date of Birth. If the date of birth was changed,
chances are the reason was that their data was incorrect.
However, if the Customer address were changed, this may and probably
does mean the customer moved. If we simply overwrite the address then
all sales for that customer will belong to the new address. Suppose
the customer moved from Florida to Ohio. If we were trying to track
sales patterns by region, all of the customer's purchase that were
made in Florida would now appear to have been made in Ohio.
Type 1 Slowly Changing Dimension
Customer Dimension
ID CustKey Name DOB City State
1001 BS001 Bob Smith 6/8/1961 Jacksonville FL
1002 LJ004 Lisa Jones 10/15/1954 St. Augustine FL
Customer Dimension After Edits
ID CustKey Name DOB City State
1001 BS001 Bob Smith 6/8/1961 Dayton OH
1002 LJ004 Lisa Jones 10/15/1954 St. Augustine FL
In the example above, the DOB change doesn't affect any dimensional
reporting facts. However, the City, State change would have an
affect. Now all sales for Bob Smith would appear to come from Dayton,
Ohio rather than from Jacksonville, Florida.
The solution we have chosen for solving this problem is to implement
a Type 2 slowly changing dimension. A Type 2 SCD records a separate
row each time a value is changed in the dimension. In our case, we
are declaring that we will only create a new dimension record when
certain columns are changed. In the example above, we would not
record a new record for the DOB change but we would for the address
change.
Type 2 Slowly Changing Dimension
Customer Dimension
ID CustKey Name DOB City State Current Effective Date
1001 BS001 Bob Smith 6/8/1961 Jacksonville FL Y 5/1/2004
1002 LJ004 Lisa Jones 10/15/1954 St. Augustine FL Y 5/2/2004
Customer Dimension After Edits
ID CustKey Name DOB City State Current Effective Date
1001 BS001 Bob Smith 6/8/1961 Jacksonville FL N 5/1/2004
1002 LJ004 Lisa Jones 10/15/1954 St. Augustine FL Y 5/2/2004
1003 BS001 Bob Smith 6/8/1961 Dayton OH Y 5/27/2004
As you can see, there are two dimension records for Bob Smith now.
They both have the same CustKey values, but the have different ID
values. All future fact table rows will use the new ID to link to the
Customer dimension. This is accomplished by the use of the Current
Flag. The ETL process looks only at the current flag when recording
new orders. However, in the case of an update to an order the
Effective Date must be used to determine which customer the update
applies to.
The primary issue with Type 2 SCD is the volume of data grows
exponentially as more changes are tracked. This can impact
performance in a star schema. The principle behind the star schema
design is that while facts are few columns, they have many rows but
they only have to perform single level joins to resolve their
dimensions. The assumption is that the dimensions have lots of
columns but relatively few rows. This allows for very fast joining of
data.
Conforming Dimensions
For the purposes of this discussion conforming dimensions only need a
brief definition. Conforming dimensions are a feature of star schemas
that allow facts to share dimensional data. A conforming dimension
occurs when two dimensions share the same keys. Often they have
different attributes. The goal is to ensure that any fact table can
link to the conforming dimension and consume its data so long as the
dimension is relevant.
Conforming Dimension
Customer Dimension
ID CustKey Name DOB City State
1001 BS001 Bob Smith 6/8/1961 Jacksonville FL
1002 LJ004 Lisa Jones 10/15/1954 St. Augustine FL
Billing Dimension
ID Bill2Ky Name Account Type Credit Limit CustKey
1001 9211 Bob Smith Credit $ 10,000 BS001
1002 23421 Lisa Jones Cash $ 100 LJ004
In the example above, we could use the ID from the Customer dimension
in a fact and in the future a link to the Billing dimension could be
established without having to reload the data.
We are considering a slight modification to the standard Type 2 SCD.
The idea is to maintain two dimensions one as a Type 1 and one as a
Type 2. The problem with this is we lose the ability to use
conforming dimensions.
Type 2 and Type 1 Slowly Changing Dimension
Customer Dimension Type 1
ID CustKey Name DOB City State Current Effective Date
1001 BS001 Bob Smith 6/8/1961 Dayton OH Y 5/1/2004
1002 LJ004 Lisa Jones 10/15/1957 St. Augustine FL Y 5/2/2004
Customer Dimension Type 2
ID CustKey Name DOB City State Current Effective Date
1001 BS001 Bob Smith 6/8/1961 Jacksonville FL N 5/1/2004
1002 LJ004 Lisa Jones 10/15/1957 St. Augustine FL Y 5/2/2004
1003 BS001 Bob Smith 6/8/1961 Dayton OH Y 5/27/2004
As you can see, the current ID for Bob Smith in the Type 1 SCD is
1001, while it is 1003 in the Type 2 SCD. This is not conforming.
Our solution is to create a composite key for the Type 2 SCD.
Type 2 and Type 1 Slowly Changing Dimension
Customer Dimension Type 1
ID CustKey Name DOB City State
1001 BS001 Bob Smith 6/8/1961 Dayton OH
1002 LJ004 Lisa Jones 10/15/1957 St. Augustine FL
Customer Dimension Type 2
ID SubKey CustKey Name DOB City State Current Effective Date
1001 001 BS001 Bob Smith 6/8/1961 Jacksonville FL N 5/1/2004
1002 001 LJ004 Lisa Jones 10/15/1957 St. Augustine FL Y 5/2/2004
1001 002 BS001 Bob Smith 6/8/1961 Dayton OH Y 5/27/2004
In the example above, the Type 1 and the Type 2 dimensions conform on
the ID level. If a fact needs the historical data it will consume
both the ID and the SubKey.
>> Stay informed about: Alternative Type 2 Slowly Changing Dimension