Computer science loves abstraction, and now, as it turns out, so does data management.
Abstraction means reducing something complex to something simpler that elegantly delivers its essence.
Applications all over the world become more robust and easier to maintain and evolve when a simple interface is put in front of a complex service. The division of labor is clear:
- The consumer states what it wants in as simple a way as possible.
- The service, under the hood, does whatever is required to provide what the consumer wants.
This is a lot simpler than allowing the consumer to reach directly under the hood and mess with the engine. In that model, if you want to change the engine, you have to redo all the code connected to the parts of the engine that are changing.
How Abstraction Saves Us in the Distributed World
In the realm of data management, increasing abstraction means changing programs so that they no longer directly interact with data repositories. The repositories are the complex engines under the hood.
Instead, applications should ask a data catalog: Where is the data and how should I access it? The same result can be achieved by using an abstract data access method that can have any number of repositories behind it. Whichever approach is used, the result is a data access abstraction layer.
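To make the idea concrete, here is a minimal Python sketch of a catalog sitting between an application and its repositories. Every name in it (Repository, InMemoryRepository, DataCatalog, the "orders" dataset) is hypothetical and chosen only for illustration; it is not drawn from any particular product.

```python
from typing import Protocol


class Repository(Protocol):
    """Hypothetical interface: anything that can return records for a dataset."""

    def read(self, dataset: str) -> list[dict]: ...


class InMemoryRepository:
    """Stand-in for a real store such as a database or an object store."""

    def __init__(self, data: dict[str, list[dict]]) -> None:
        self._data = data

    def read(self, dataset: str) -> list[dict]:
        return list(self._data[dataset])


class DataCatalog:
    """Maps each dataset name to the repository that currently holds it."""

    def __init__(self) -> None:
        self._locations: dict[str, Repository] = {}

    def register(self, dataset: str, repository: Repository) -> None:
        self._locations[dataset] = repository

    def read(self, dataset: str) -> list[dict]:
        # The application never names a repository; the catalog resolves it.
        return self._locations[dataset].read(dataset)

    def datasets(self) -> list[str]:
        # The list of registered datasets doubles as a catalog of available data.
        return sorted(self._locations)


catalog = DataCatalog()
catalog.register("orders", InMemoryRepository({"orders": [{"id": 1, "total": 40}]}))

# Application code asks for data by name only.
orders = catalog.read("orders")
```

The application depends on the dataset name and the read method, not on where or how the data is stored.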
Once you do that, a whole new world of data management opens up, one that solves a number of problems we face when building applications in the modern edge-to-cloud landscape.
The first benefit is that the data abstraction layer provides a way to bring order to the massively distributed data landscape we now face. As you start putting more and more data behind the data abstraction layer, applications and workloads start accessing data the same way. A catalog of available data is created. A method for introducing new data to the population of data consumers is born.
Remember: this data abstraction layer is not the same as the “one repository to rule them all” vision that some implementations of data warehouses and data lakes sought. The data abstraction layer assumes there will always be multiple repositories and tries to bring order to that world with abstraction.
The second benefit is that the storage methods underneath the data abstraction layer can be changed without breaking the applications accessing the data. This opens the door to a whole new world of capabilities we will discuss next.
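As a rough illustration of this second benefit, the Python sketch below rebinds a dataset to a different storage backend while the application function stays untouched. The DataAccessLayer class, both reader functions, and the sample rows are assumptions made up for this example.

```python
from typing import Callable

# A reader is any zero-argument function that returns the rows of a dataset.
Reader = Callable[[], list[dict]]


class DataAccessLayer:
    """Hypothetical abstraction layer: dataset name -> current reader."""

    def __init__(self) -> None:
        self._readers: dict[str, Reader] = {}

    def bind(self, dataset: str, reader: Reader) -> None:
        # Rebinding a name is how the storage underneath gets swapped.
        self._readers[dataset] = reader

    def read(self, dataset: str) -> list[dict]:
        return self._readers[dataset]()


def total_revenue(layer: DataAccessLayer) -> int:
    """Application code: knows the dataset name, not the storage behind it."""
    return sum(row["total"] for row in layer.read("orders"))


ROWS = [{"id": 1, "total": 40}, {"id": 2, "total": 60}]


def read_from_legacy_store() -> list[dict]:
    # Stand-in for the original repository.
    return list(ROWS)


def read_from_new_store() -> list[dict]:
    # Stand-in for a replacement repository holding the same data.
    return [dict(row) for row in ROWS]


layer = DataAccessLayer()
layer.bind("orders", read_from_legacy_store)
before = total_revenue(layer)

layer.bind("orders", read_from_new_store)  # storage swapped under the hood
after = total_revenue(layer)
```

The function total_revenue is never edited, yet it reads from a different backend after the second bind call.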
New Capabilities Enabled by Abstract Data Access
Here are just a few things that are possible once a data abstraction layer is in place.
Policy-Based Data Management: A data abstraction layer allows a user to set policies with regard to tradeoffs between the cost of the data repository and its performance. For batch applications, it may make sense to optimize for reducing the cost of storage. For higher performance workloads, a faster and more responsive repository may make sense. Or the system could observe the workload and make a recommendation. The repository can be changed under the hood. The users don’t need to be involved or disturbed.
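One simple way to picture such a policy is "pick the cheapest repository that still meets the workload's latency requirement." The Python sketch below assumes that rule; the repository names, costs, and latencies are invented for illustration and do not describe any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepositoryProfile:
    name: str
    cost_per_gb: float  # illustrative monthly cost, arbitrary units
    latency_ms: float   # illustrative typical read latency


# Made-up profiles for three hypothetical storage tiers.
PROFILES = [
    RepositoryProfile("archive", cost_per_gb=0.01, latency_ms=5000),
    RepositoryProfile("object_store", cost_per_gb=0.02, latency_ms=200),
    RepositoryProfile("ssd", cost_per_gb=0.10, latency_ms=5),
]


def choose_repository(
    profiles: list[RepositoryProfile], max_latency_ms: float
) -> RepositoryProfile:
    """Return the cheapest repository whose latency satisfies the policy."""
    eligible = [p for p in profiles if p.latency_ms <= max_latency_ms]
    if not eligible:
        raise ValueError("no repository satisfies the latency policy")
    return min(eligible, key=lambda p: p.cost_per_gb)


# A batch workload tolerates slow reads, so cost wins.
batch_choice = choose_repository(PROFILES, max_latency_ms=10_000)

# An interactive workload needs fast reads, so only the fast tier qualifies.
interactive_choice = choose_repository(PROFILES, max_latency_ms=10)
```

Because applications reach data only through the abstraction layer, the outcome of a policy decision like this can be applied without touching them.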
Automated Metadata Enhancement: Data often has clues in it that can be used to create a more complete, usable set of metadata. If there is a geolocation coordinate, the postal code, city, state and country can be added. If a product code is available, product information can be added. The source of the data can also provide clues.
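A minimal Python sketch of this kind of enrichment follows. The lookup tables stand in for a real reverse-geocoding service and a real product catalog; their contents, the field names, and the enricher functions are all assumptions for illustration.

```python
# Hypothetical lookup tables; a real system would call a reverse-geocoding
# service and a product master instead.
KNOWN_PLACES = {
    # Key: (latitude * 100, longitude * 100), rounded to integers.
    (4071, -7401): {"city": "New York", "state": "NY", "country": "US"},
}
PRODUCTS = {
    "SKU-100": {"product_name": "Widget", "category": "Hardware"},
}


def add_location(record: dict) -> dict:
    """Derive place metadata when the record carries a coordinate."""
    if "lat" not in record or "lon" not in record:
        return {}
    key = (round(record["lat"] * 100), round(record["lon"] * 100))
    return dict(KNOWN_PLACES.get(key, {}))


def add_product(record: dict) -> dict:
    """Derive product metadata when the record carries a product code."""
    return dict(PRODUCTS.get(record.get("product_code"), {}))


def enrich(record: dict, enrichers=(add_location, add_product)) -> dict:
    """Return a copy of the record with derived metadata attached."""
    metadata: dict = {}
    for enricher in enrichers:
        metadata.update(enricher(record))
    return {**record, "metadata": metadata}


raw = {"lat": 40.7128, "lon": -74.0060, "product_code": "SKU-100"}
enriched = enrich(raw)
```

Records that lack a clue, or whose clue is not in a lookup table, simply pass through with less metadata rather than failing.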
Data Transformation Zones: Many transformations recur across data pipelines and can be standardized. Instead of having dozens of data pipelines each implement their own deduplication and merging routines, place the data in a deduplication zone that ensures all data is deduped. The same approach can be applied to storing the data in standardized forms or with complete lineage.
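A deduplication zone could be sketched in Python as a write target that silently drops records it has already seen. The DeduplicationZone class, the choice of key fields, and the sample records are assumptions for illustration; real zones would also need to handle merging and persistence.

```python
class DeduplicationZone:
    """Hypothetical zone that keeps one record per unique key."""

    def __init__(self, key_fields: list[str]) -> None:
        self._key_fields = tuple(key_fields)
        # Maps key -> first record seen with that key, in arrival order.
        self._records: dict[tuple, dict] = {}

    def write(self, records: list[dict]) -> int:
        """Store unseen records and return how many were actually added."""
        added = 0
        for record in records:
            key = tuple(record.get(field) for field in self._key_fields)
            if key not in self._records:
                self._records[key] = record
                added += 1
        return added

    def read(self) -> list[dict]:
        return list(self._records.values())


zone = DeduplicationZone(key_fields=["customer_id", "order_id"])

a = {"customer_id": 1, "order_id": 10, "total": 40}
b = {"customer_id": 1, "order_id": 11, "total": 60}
c = {"customer_id": 2, "order_id": 10, "total": 15}

# Two pipelines write overlapping data; neither implements its own dedup.
first_added = zone.write([a, b, a])
second_added = zone.write([b, c])
deduped = zone.read()
```

Every pipeline that writes through the zone gets the same deduplication behavior, which is the point of standardizing it in one place.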
We are just at the beginning of this journey. The Lumada software portfolio for DataOps is an example of technology providing an effective data abstraction layer that has many of the properties just discussed. This paper explains the vision.
Perhaps the most exciting part of this vision is that we can look forward to a day when the arrival of new data is met with delight because we know it will add as much value as possible. Right now, too often, more data means more heavy lifting to make it usable. That’s a world we need to move away from as fast as possible.