Friday, September 18, 2009

Surrogate Key in Data Warehouse

Simply put, Surrogate Keys are substitution to original Primary Key of a Dimension Table. They still act as a Primary Key but are Data Warehouse generated unique Integer values.

One may argue if these are simply Substitution then why can't we simply use the original Primary Keys. Well, it has its own advantages:
  • What if the Source Primary Key Changes? (We agree this is an unlikely situation, but yes this can happen)
  • What if data is integrated from Multiple Sources? (How do we uniquely identify a row in this case. Because both system may have their own key, and the key can be repeated in both systems. Well, this is not possible if we use a GUID column, because as the name says it is Globally Unique Identifier. But we will come back to this later.)
Thus, as the Surrogate Key is generated and maintained by the Datawarehouse System and not the the Source/Transactional Sytem, we can easily overcome these issues. The Surrogate Key acts as a bridge or substitution over the Source System.

It also helps us in dealing with Dummy Values. Normally -1 or 0 is used to denote a Dummy Value. (e.g. If you receive Sales without Promotions you can put -1 as a value for "No Promotions" instead of NULL Values). While handling large queries joins between Integer values are more efficient than NULL Values.

Another important characteristic of Surrogate Keys are it should be meaningless. No Intelligence should be derived from these keys. A lot of source system has keys where there are hidden meanings in the Keys. (e.g. A source key value XXYYYZZZ1234 the first 2 characters may refer to country code, the next 3 department code and so on.) This may be good for a Transactional System but not for a Data Warehouse environment. If we come across such scenarios it is always good to bifurcate the meanings and store it into different columns using ETL. So that later they can be referred as individual columns in a SQL Query which will be more efficient.

Another great advantage of Surrogate Keys are it helps optmize space. These days a lot of Source System has their Primary Keys as AlphaNumeric Values or GUID Columns. A GUID column occupies 16 bytes and it can be anything for a AlphaNumeric value. Whereas an integer value generally uses 4 bytes. Thus, in comparison to GUID it saves 12 bytes. This may look small in isolation or when you simply see a Dimension Table. But Imagine a saving of 12 bytes over a 100 million rows Fact Table, which is pretty moderate. This means a saving of around 1 GB and we are talking only one column. What if the table has 1 Billion rows, it will save around 11 GB and if you have multiple GUID columns.. Huh!!!

Thus, we see the benefits of having a Surrogate Key over a normal Primary Key. It also goes by the following names in many environments.
  • Alternate Key
  • Non Natural Key
  • Meaningless Key
  • Integer Key
  • Artifical Key

No comments:

Post a Comment