Tuomas Nevanranta Tuomas Nevanranta is a senior consultant in Big Data Solutions. He masters Snowflake, Birst, AWS, Azure and various ELT tools, his expertises lay on data modelling and architecture.

What is Data Integration?

Data Integration on the Mechanism of Metal Gears.Integration and ELT/ETL are traditionally separated processes. Meaning that integration consists traditionally about whole systems to be integrated with other systems. ELT/ETL then on the other hand traditionally consists only chosen parts of the system to be integrated with other systems. In this writing we consider integration to include both of these worlds.

Basically, integration is moving data from systems to a combined system, so that the data is usable in the system where the data ends up.

Streams, batches and events

Data can be loaded in streams, batches or events. Stream means a continuous queue of data which is then handled and ingested in the target system. Streams usually are the basis of a real-time or near-real-time systems. There is also a buffer in most of the streams, so that error handling is a bit easier and data doesn’t get lost that easily.

Data loaded in batches means that the frequency of the loads is not continuous. Data is gathered in batches and then loaded into the target system from the accumulated heap of data. In batch load the data usually cannot be that real-time as in event- or stream-based loading. In traditional databases batch loads have been run at night time when there are no consumers for the data, due to the high consumption of data processing powers.

Event-based loading refers to a system where there are triggers to start actions when there is a change in the data and these triggers cause data to be loaded in the target system(s). Event-based loads are quite similar to the stream-based system from data ingestion point-of-view. Considering the source system end though, streams and events are very different. In the stream-based system, there is an application forming the stream at the source system end. And in event-based load there is only the trigger and some actions that are invoked by the trigger.

Amount of data

In the era of Big Data, IoT and Data Lakes, the amount of data to be integrated has exploded. The basics of handling data has remained pretty much the same. You will have to understand the data you own, so that you are able to ask relevant questions and then try to find the answers from the heaps of data. Also, in the current world it usually costs money to move the data around. If you have at least some hint what data is important and what is not, then you don’t have to integrate all of the data, but you can only choose the parts you really need.

Cloud-based systems make it a lot easier and cheaper to handle big quantities of data. Storing is usually cheaper than in on-premise systems. Integration from various systems is also easier, when there are ready made connectors in the cloud-systems – you don’t have figure out all of it yourself. Scaling for data storage in the cloud happens instantly or with little effort compared to the storage increase needs in on-premise systems.

 business man drawing a business plan on screen over a white background

Data cleansing, joining, master data, data modelling, etc.

Machine learning is the thing every data analytics company wants to say they are doing. What they are actually doing (for the most of the time) is data cleansing and re-organizing. Machine learning is not that useful if the data it is learning from is very bad. Usually the data that is integrated needs to be cleansed and re-organized to meet the target system standards.

When you are integrating multiple systems into one, there is also a need for commonality. Data structures need to be made similar, so that they can be used in for example reporting together. If different systems have different data types for the same entities, or different date formats then it will be very difficult to join these together and to make comparison between them. If data is not integrated in the actual data level, then those different systems will remain in their own silos.

Master data management is usually used to map different data entities to mean the same thing. This can mean that different account numbers in different systems are mapped using master data management system to mean the same accounts or event the metric system values can be equalized with imperial system values (yards to meters, Fahrenheit to Celsius, etc.).

When you have done this many changes and integrated multiple systems into one. How are you ever going to remember what has been done? The answer is data modelling. Data Modelling should be done simultaneously with the integrations, so that the integrated system is always up-to-date. A good data model documents the integrated system. It is the basis for conversations between business owners and the IT. Data model should include some information of the data flow (source systems to target tables), data dictionary (natural language explanation on the data entity) in addition to the actual physical data model.

Real-time integrations

There is an increased need to react immediately to threats and/or to opportunities. The thing should ask yourself, what does “real-time” mean to you? Does it mean instantaneous, seconds, minutes or even hours? With enough money and effort put to the integration, any of these options are possible. The quicker you want the data to be consumable, the more it is going to cost and the more effort you have to put to build the integration system and to monitor the operations.

Which ever real-time solution you choose, the best place for the integration platform is the cloud. The solutions work with a fraction of the cost that they used to. Also the time used for the design implementation is a lot less nowadays than when working with on-premise platforms. We have build terabyte size solutions to the cloud that work so fast that you could never had imagined it in the recent past.