Why Data Vault is not complicated

I have heard a few times, even from very skilled people, that Data Vault is “too complicated”.

I also thought that Data Vault was complicated, before getting to know Data Vault 2.0 better.
If you are unsure, or you too think DV is complicated… keep reading and you might change your mind!

I believe Data Vault is not complicated and offers a clean and powerful way to solve pretty complex and general problems that many companies have.

DV uses a small set of primitives that allow you to solve even the most complex use cases, always applying a fairly small set of well-tested best practices.

DV is definitely not trivial, and it takes more than a couple of hours of reading to get the hang of the mechanics and feel at ease with it. Still, the reduced set of primitives and the use of well-known best practices make it easy to grasp for experienced SQL and DWH practitioners, allowing them to apply it to the most complex use cases they run into.

After all, what was your first impression when you started to drive? What is your impression right now?
I would say that getting confident with DV is no harder than learning to drive with confidence.
Of course, learning F1 racing is another story…

If I got you interested, and you want to learn why I changed my mind… bear with me while I explain my 180-degree turn towards the Data Vault methodology.

Definitely not love at first sight with Data Vault 1.0

A few years ago, when I first looked at Data Vault, it was still around as version 1.0, and it had a few features I did not like, such as surrogate keys generated from sequences and the lack of hashes to spot changes.
Unimpressed, I did not dig more into it.

I also did not have one of the main problems DV targets: data integration from disparate sources.
Having worked a lot in online companies where most of the data was born internally, we did not have huge data integration problems, and whenever an integration with the external world was done, the way to “join” the new data with the existing data was baked into the integration process.

As a result I lived for a few years with the notion that DV was both missing key features and adding verbosity and complexity, since I did not have the problem DV was addressing.

Then, when I was re-introduced to Data Vault, now at version 2.0, I could see that it had adopted all the nice features I was already using. On top of that, it offered a lot of tried and tested solutions for problems I was starting to face, and for which I was still struggling to articulate a good, general solution. So I got really interested and dug deeper!

What attracted me to Data Vault 2.0

When I was introduced to DV 2.0 I was positively surprised that version 2 had adopted a few main features I was already keen on using, simply because I saw them as good ways to build a data warehouse:

  • natural keys instead of sequences
    While the exact way to use natural keys can differ (hashes, concatenation, multi-field keys), what I really love is being able to generate the DWH keys directly from the incoming data, instead of having to create and look up surrogate keys. This makes ingestion easier and faster (see the sketch after this list).

  • hashing to spot changes in rows
    Since I discovered that I could calculate a hash (actually, one hash for each change set I cared about) to spot changes… I never again wanted to be forced to compare more than one or two fields.
    Besides being quicker to write and faster to execute, it is also much less error-prone and easier to maintain.

  • set operations to implement Slowly Changing Dimensions
    Having already in mind, though never formalised, my own set-based patterns to implement SCDs, I was just happy to find a methodology that adopts and refines the same idea.

  • insert only to keep history
    Dealing with billions of rows, I was already trying to avoid updates and deletes, so I was happy that DV 2 was amended to allow insert-only history keeping, even if this was not initially part of DV 2.
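
To make these points concrete, here is a minimal Python sketch of the first two primitives and of the insert-only, set-based loading they enable. All table, field and function names, and the choice of MD5, are my illustrative assumptions, not prescriptions from the method:

```python
import hashlib

SEP = "||"  # separator keeps ("a", "bc") distinct from ("ab", "c")

def hash_key(*business_key_parts: str) -> str:
    """Derive a deterministic DWH key straight from the natural key,
    with no sequence to generate and no surrogate key to look up."""
    normalized = SEP.join(p.strip().upper() for p in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def hash_diff(row: dict, tracked_fields: list) -> str:
    """One hash per change set: comparing a single column is enough
    to spot a change in any of the tracked fields."""
    payload = SEP.join(str(row.get(f, "")) for f in tracked_fields)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

# Insert-only loading: a new history row is written only when the
# incoming hashdiff differs from the latest one already stored.
incoming = {"customer_id": "42", "email": "jo@example.com", "tier": "gold"}
hk = hash_key(incoming["customer_id"])
hd = hash_diff(incoming, ["email", "tier"])

latest_stored_hd = None  # in real life: looked up from the target table
if hd != latest_stored_hd:
    print(f"insert new row for key {hk} with hashdiff {hd}")
```

In a real warehouse the same comparison is done as a set operation in SQL, one anti-join between the staging table and the latest rows of the target, which is what keeps the pattern fast on billions of rows.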

These were already cornerstone ideas of how I thought a DWH should be built, so when I found them in DV 2.0 I thought it was worth investigating further and trying it hands-on.

What made me appreciate Data Vault 2.0

When I started to look at DV in more detail, I found that it also offered good solutions to problems I had been facing in recent projects, like the following:

  • stitching together events from micro-services - identity is key
    As a company I worked with moved to micro-services and an event sourcing model, the data warehouse started to receive many different, tiny, but related messages for each entity / concept the services were dealing with, like “user” or “payment”.
    To be able to relate all those events, the key feature was a robust way to implement and use identity.
    Data Vault is exceptional at handling identity, as it puts it at the center of everything. This was the kind of foundational thinking we were orbiting around, but not yet able to articulate well.
    We could have got there on our own, but getting a polished and tested solution reduces both risk and time.

  • keeping a consistent view across evolving messages - handling changes
    Once you start getting a lot of events… you will start receiving a lot of changes, especially while events are young, as fields are added and renamed, or the usage / meaning of fields shifts a bit (e.g. the meaning of null, or email becoming non-unique at some point…).
    We were looking for a pattern to keep this evolution in check and to provide business users with an uninterrupted, consistent view of the events… and I found that this is what DV excels at!
    DV provides great patterns to load ever-evolving data, for example by keeping every evolution stream isolated and then merging all the related ones back into the Business layer.
    Should your business users know, or even care, whether your registration events land in one, two or more tables? That you have multiple versions overlapping in time because you are at V3, but V2 is still being sent by clients that have not yet migrated? I think they should not care!
    DV makes it clear how to achieve that.

  • bridge source data thinking with business-desired outputs in a consistent way
    Again, I had long been thinking in terms of layers, with staging being source-like and the final, refined data entering the data mart being, hopefully, business-like.
    BUT. Where to do such transformations, what to store and when to apply business rules has more often than not been a company / team / individual decision rather than an explicitly agreed-upon recipe.
    Data Vault provides a clear way to move from single-source thinking to finally providing the information that fulfils the business needs the data warehouse is built for.
    The clear schema, and the meaningful separation between the two concerns of storing history as it happened in the source systems versus deriving the information needed by business users, is a simple yet very powerful tool to improve across all metrics: time to market, maintainability, ease of ingestion, transparency of business rules, agility in adding new sources or deriving new information.
    I think this very simple realisation is the single one most worth keeping as a great lesson.

  • master data management
    One important application of the previous point is seen very clearly in action in the case of master data management. You want, in fact, to store the incoming data as your source systems see it, but you also want your users to be able to speak the company “master data” language, and to see and manipulate the information in its “master data” form, not the source one.
    Data Vault has clear tools, like same-as links, to deal with this situation, as with any other business rule, whether simple or complex (see the first sketch after this list).

  • handling bi-temporality
    Do you like tongue twisters? Well, bi-temporality can become kind of a mind twister.
    Dealing with when an event occurred and when you got to know about it does not seem so complicated at first, but once you start dealing with its real-world effects… it quickly becomes complex.
    To understand the problem, think about producing a report… with the best information you have now (easy), and also with the best information you had at some point in time (e.g. when you filed your tax declaration). Add in the fact that information arrives when it arrives, not always in order, and that you can get changes / corrections at any time… and you start to see that it is not so simple anymore (see the second sketch after this list).
    This is typical for compliance departments, and Data Vault helps you store the information in a suitable way, providing all you need to produce the reports you have to deliver.
    So you do not have to invent and test a new approach; you can rely on the experience embedded in the method.
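
To illustrate the master data point, here is a minimal sketch of the idea behind a same-as link: a separate structure that maps keys identified as duplicates onto a master key, leaving the source-aligned data untouched. The key names and the direction of the mapping are illustrative assumptions:

```python
# Hub keys as seen from two source systems; the same-as link records
# that they refer to the same real-world customer.
same_as_link = {
    "hk_crm_42": "hk_master_7",   # duplicate key -> master key
    "hk_shop_a1": "hk_master_7",
}

def master_key(hub_key: str) -> str:
    """Resolve any hub key to the master key the business speaks about;
    keys with no mapping are their own master."""
    return same_as_link.get(hub_key, hub_key)

assert master_key("hk_crm_42") == "hk_master_7"    # CRM view, resolved
assert master_key("hk_shop_a1") == "hk_master_7"   # shop view, resolved
assert master_key("hk_master_7") == "hk_master_7"  # already the master
```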
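
And here is a minimal sketch of the bi-temporal idea itself, with made-up data: every row carries both the time the event occurred and the time the warehouse learned about it, and corrections arrive as new rows, so any past report can be reproduced by filtering on the load date:

```python
from datetime import date

# Two timelines per row: event_date (when it happened) and
# load_date (when we learned about it). The May row corrects the
# March one without updating or deleting anything.
payments = [
    {"event_date": date(2023, 3, 10), "load_date": date(2023, 3, 11), "amount": 100},
    {"event_date": date(2023, 3, 10), "load_date": date(2023, 5, 2),  "amount": 120},
]

def as_of(rows, report_date):
    """Best knowledge available on report_date: per event, keep the
    latest row loaded on or before that date."""
    visible = [r for r in rows if r["load_date"] <= report_date]
    best = {}
    for r in sorted(visible, key=lambda r: r["load_date"]):
        best[r["event_date"]] = r  # later loads win
    return list(best.values())

print(as_of(payments, date(2023, 4, 1)))  # what you knew in April: amount 100
print(as_of(payments, date(2023, 6, 1)))  # corrected view in June: amount 120
```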

Finding reasonable, well-thought-out solutions to the kinds of problems I had been facing, together with a great group of professionals, made me realise that Data Vault has a lot of experience, wisdom if you want, to share beyond the basic technical layer of hubs, satellites and links.

Data Vault is not a magic wand

You have probably started to realise that I have come to like Data Vault quite a lot, and you are right, I do.

Yet I have to openly tell you that there is at least one area that is complicated and will remain complicated: Data Vault can help, but it cannot remove the core of the problem.

I am talking about the business vault or, if you prefer, the implementation of the business rules.
Data Vault definitely helps by providing a rock-solid foundation with all the history at your disposal, plus a couple of constructs, Bridge and Point In Time (PIT) tables, that help you query your data according to bi-temporality (a sketch of the PIT idea follows).
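
As a rough illustration of what a PIT table buys you, the following sketch (table and key names are made up) precomputes, for each hub key and snapshot date, which version of each satellite was current at that moment, so downstream queries become simple equi-joins instead of per-satellite point-in-time lookups:

```python
from datetime import date

# Per hub key, the load dates at which each satellite got a new version.
sat_customer_details = {"hk1": [date(2023, 1, 5), date(2023, 3, 2)]}
sat_customer_scores  = {"hk1": [date(2023, 2, 1)]}

def pit_row(hub_key, snapshot_date, *satellites):
    """Record, for one hub key and snapshot date, the latest load date
    of each satellite: the version that was current at the snapshot."""
    def latest(loads):
        eligible = [d for d in loads if d <= snapshot_date]
        return max(eligible) if eligible else None
    return {
        "hub_key": hub_key,
        "snapshot_date": snapshot_date,
        "sat_versions": [latest(s.get(hub_key, [])) for s in satellites],
    }

print(pit_row("hk1", date(2023, 2, 15), sat_customer_details, sat_customer_scores))
# -> details version of Jan 5, scores version of Feb 1
```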

This is welcome help, but capturing, coding and evolving the application of the business rules remains one of the main challenges in building a data platform.

The good news is that now you can concentrate more of your effort there, as the remaining work gets a huge boost in simplification and productivity from the DV best practices.

Conclusion

In exchange for using a few simple primitives, which you can learn in days, and best practices that I already agreed with, I found that DV provides ready-made, well-thought-out solutions to complex problems that we would otherwise have had to design and test ourselves.

Data warehousing is not a simple thing, and building a data platform will never become a trivial task, but exactly because of this high complexity I have welcomed a methodology that helps me and my teams remove or reduce it as much as possible.

Do you need to use Data Vault to benefit from these patterns and this body of knowledge?
Definitely not, but it is a great way to put them into practice, and if you are not keen on re-inventing wheels, chances are you are better off using DV and reserving your creativity for the complexity that remains!

For my part, I am happy to apply my brain and experience to the complexity that remains, and to keep doing one of the most interesting jobs around.

I hope that you found at least a bit of inspiration to learn more about Data Vault.
If you want to know more or you have questions, I will be happy to get in touch!


Roberto Zagni
Principal Consultant @ Kaito Insights