Consolidated vs. Distributed approach to Insight

Most of the large enterprises we work with have invested a lot of money in building a consolidated data platform. What is the reasoning behind this trend, and how can we make sense of it? In this text I will try to highlight a couple of reasons from my perspective.

There is a need for combining information between separate systems

In the modern world, the idea that everything happening in a business could be handled in one monolithic system is not feasible. The number of systems and sources is growing fast, and there is no going back.

Because of this, getting a clear picture of a business environment often requires gathering information from several sources. Or, from a different point of view, the most interesting and valuable insights often require varied data from different sources. Ice cream sales numbers make a lot more sense if you combine them with open weather data.

So, making decisions based on a rich variety of data from different sources can be crucial for standing out from your competitors. The first step towards this is collecting these data sources together and combining them in some way. Knowledge is power, or so they say?

Combining data from different sources is never as simple as it first seems

In the beginning there was a database ;) An operative system or application needs to follow its own logic, and that determines the structure of the data serving that purpose. A second system needs to do the same. Already these two might differ from each other in form, logic, or syntax, but they still need to be put together to get an insight.

So, let’s get our hands dirty and get to work:

The first application in this example stores data about media happenings.

The first subset of its data is presented in this way when called from the API it provides:
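The original payload isn’t reproduced here, but a nested JSON response of this kind might look roughly like the sketch below (the field names and values are purely illustrative, not the real data):

```json
{
  "events": [
    {
      "article": {
        "id": "A-123",
        "title": "Consolidated vs. Distributed approach to Insight"
      },
      "channel": "blog",
      "published": "2024-04-23"
    }
  ]
}
```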

The second application in this example provides a slightly different kind of data, in a slightly different form, but it certainly has a relation to this first example.

This application stores (among other things) exact timestamps of readers opening these blog texts (for reading purposes, hopefully), and one timestamp from this particular example looks like this:

The second application runs on a UNIX machine and therefore stores timestamps in Unix timestamp format, which is currently defined as the number of non-leap seconds that have passed since 00:00:00 UTC on Thursday, 1 January 1970. Knowing that, we can see that this article was opened at 2024-04-24 10:44:15.
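As a quick sanity check, here is a minimal Python sketch of that conversion. The value 1713955455 is simply the Unix timestamp that corresponds to the time above, not a value copied from the second application:

```python
from datetime import datetime, timezone

# 2024-04-24 10:44:15 UTC expressed as non-leap seconds since 1970-01-01 00:00:00 UTC
unix_ts = 1713955455

opened_at = datetime.fromtimestamp(unix_ts, tz=timezone.utc)
print(opened_at.strftime("%Y-%m-%d %H:%M:%S"))   # -> 2024-04-24 10:44:15
```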

Another thing to notice is that it offers the data in Parquet format, so we need to handle the columnar representation somehow to map it into the nested JSON format of the first example.
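One way to do that, sketched here with pyarrow and hypothetical file and column names, is to read the columnar file and turn it into row-oriented records:

```python
import pyarrow.parquet as pq

# Hypothetical file name - the columnar Parquet export from the second application.
table = pq.read_table("article_opens.parquet")

# Turn the columnar data into row-oriented dictionaries that are easier to line up
# with the nested JSON records coming from the first application's API.
rows = table.to_pylist()
```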

Unfortunately, the identification code used for the article in this second application seems to differ from the id in the first application, regardless of the fact that it is the very same article. Therefore we also need a mapping between these ids:
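Such a mapping can be as simple as a small lookup table; the ids below are made up for illustration:

```python
# Hypothetical ids - in practice this mapping would live in a maintained reference table.
article_id_map = {
    "7f3c9": "A-123",   # id used by the second application -> id used by the first application
}
```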

Luckily there are many convenient ways to do all these conversions and transformations. For example, with Snowflake we can use the LATERAL FLATTEN function to get all the necessary data from the nested JSON format into the same record. With Python’s datetime library it’s easy to convert Unix timestamps into other date formats. We can use, for example, dbt to load, map, and transform these data together. And there are many ways to build your own orchestration if you want to do all this with a programming language of your choice, directly with API calls, hubs, or gateways.
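Pulling the pieces together in plain Python (reusing the hypothetical names from the sketches above) could look roughly like this - a minimal sketch, not a production pipeline; in Snowflake the flattening would be done with LATERAL FLATTEN, and the orchestration with dbt or a tool of your choice:

```python
import pandas as pd

# Purely illustrative payloads and names; the real data lives in the two applications.
api_response = {
    "events": [
        {"article": {"id": "A-123", "title": "Consolidated vs. Distributed approach to Insight"},
         "channel": "blog"}
    ]
}
article_id_map = {"7f3c9": "A-123"}   # second application's id -> first application's id

# Flatten the nested JSON from the first application into one record per event.
media_events = pd.json_normalize(api_response["events"])     # columns: channel, article.id, article.title

# Read the columnar Parquet export from the second application (hypothetical file and columns).
article_opens = pd.read_parquet("article_opens.parquet")     # columns: article_id, opened_ts

# Convert Unix timestamps to datetimes and map the ids so the two sources can be joined.
article_opens["opened_at"] = pd.to_datetime(article_opens["opened_ts"], unit="s", utc=True)
article_opens["article.id"] = article_opens["article_id"].map(article_id_map)

combined = article_opens.merge(media_events, on="article.id", how="left")
```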

If combining the data is laborious, we should engineer the work smartly

But isn’t all this conversion work still a rather laborious thing to do? It can also occasionally be hard to find people willing and capable of doing this kind of work for you. And in most cases these people are nasty enough to ask some € for it. Achieving a competitive edge with the minimum amount of integration work put into it should be visible in your EBITDA, shouldn’t it?

Let’s mark this integration work with this kind of notation

And let’s keep in mind that drawing this kind of arrow costs some € before diving into the next scenario.

Scenario: Several data producers provide data in different forms, and these different pieces of data form a puzzle that needs to be put together.

Step 1: Two systems need to share some data with each other

Step 2: Third system is added into scenario

Step 3: Fourth system is added into scenario

And so on …

Notice how those arrows keep adding up.

Actually, these arrows keep adding up according to this quadratic formula:

dp = number of data producers
i = integrations required to connect all data producers to each other

i = dp² - dp
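For example, with dp = 4 data producers that is already 4² - 4 = 12 arrows, and with dp = 10 it is 10² - 10 = 90 arrows.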

Let’s try another way of connecting these data producers to each other:

We can consolidate all data sources needed to get an insight by using a Data Platform as a connection point.

Step 1: Two systems need to share some data with each other

Step 2: Third system is added into scenario

Step 3: Fourth system is added into scenario

And so on…

Arrows still keep adding up, but this time the growth is linear rather than quadratic:

dp = number of data producers
i = integrations required to connect all data producers to each other

i = dp
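If you want to see the difference concretely, a tiny Python sketch that simply evaluates the two formulas side by side makes the point:

```python
def point_to_point(dp: int) -> int:
    """Integrations when every data producer connects directly to every other one."""
    return dp * dp - dp              # i = dp^2 - dp

def via_platform(dp: int) -> int:
    """Integrations when every producer connects only to the data platform."""
    return dp                        # i = dp

for dp in (2, 4, 10, 50):
    print(f"{dp} producers: {point_to_point(dp)} vs {via_platform(dp)} integrations")
```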

In the real world there is usually no need to connect every data source to every other one, and integrations are not equally expensive or time-consuming to implement. Still, there are many more reasons* to build a Data Platform and consolidate the data than just the fact that it is cost- and time-effective to do so. On the other hand, there can be a reason or two NOT to collect all data into one connection point, technology-wise:

Imagine a situation where some industrial manufacturer wants to reduce the maintenance breaks in their factories to the very minimum, to the point of

“No standstills anymore!”

if that’s possible. To achieve this goal, the manufacturer starts using machine learning on the sensor data from their manufacturing lines. This data is dull in the sense that it consists of only timestamps and some sensor values. However, by following these sensor cycles over time, the manufacturer can see patterns that forecast the need for future maintenance. Should we collect all these sensor values in the same data platform where we already collect all our other data sources?

Short answer: it depends.

If this sensor data can be combined with the other data sources to produce some meaningful insight, it might be worth the development cost that usually accrues when working with so-called Big Data (data that grows or changes so fast that it becomes hard to handle).

If the only value of this kind of dull data is this machine learning case for following the maintenance cycle, the same result can be achieved - maybe even with less development cost - by letting your data scientists show their R, MATLAB, Python, Julia, Go or

<place your favorite programming language for data science here>

-skills directly on the files found in your Data Lake or data stream.
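As a rough illustration of what that can mean in practice, here is a minimal Python sketch (the file path, column names, and “health signal” logic are all made up for the example) that reads sensor files straight from a data lake and flags suspicious machines without any data platform in between:

```python
import pandas as pd

# Hypothetical path and columns: timestamp, machine_id, sensor_value.
readings = pd.read_parquet("data-lake/sensors/line_1/")

# A very rough health signal: a rolling average of the sensor value per machine.
readings = readings.sort_values("timestamp")
readings["rolling_mean"] = (
    readings.groupby("machine_id")["sensor_value"]
    .transform(lambda s: s.rolling(window=100, min_periods=1).mean())
)

# Flag machines drifting far from the overall behaviour as maintenance candidates.
threshold = readings["sensor_value"].mean() + 3 * readings["sensor_value"].std()
maintenance_candidates = readings[readings["rolling_mean"] > threshold]
print(maintenance_candidates["machine_id"].unique())
```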

It is also possible that very rapidly growing stream data starts consuming the compute power needed by other producers or consumers of your data, or simply grows so big that joining it to other pieces of information becomes nearly impossible or requires too much effort. To avoid this, you need to carefully consider how to cluster your compute power and how to arrange the rapidly growing data into hot/cold storage. This obviously adds some complexity to your environment.

Earlier, the Big Data case was quite clear: the systems were often separate and expensive. Now many of our customers use Snowflake for these kinds of use cases, and modern data platforms are flexible enough to handle even these more extreme ones. Some things are still best done on-site or at the “edge”.

Nevertheless, here are some reasons to build a Data Platform besides the fact that it saves money*:

Analytical tools that offer drill-down ability work best when extracting data from a data warehouse.

Most data platforms are modeled, built, and optimized for read access, and that means fast report generation and response times.

A data platform makes role-based access control (RBAC) easy by giving access to specific data to qualified end users, while excluding others.

Data stored appropriately in a data platform provides a complete audit trail of when data was loaded and from which data sources.

A data platform can merge disparate data sources with capabilities to preserve history, while operative systems can focus on current matters.

Storing business logic in the data platform minimizes the chances of having multiple versions of the truth.

Photo: chivozol