What is Data Virtualization?

Data virtualization is an approach to data management that addresses the challenge of working with data stored in many physical locations. It creates a virtualized logical data layer that integrates data sources of multiple types, wherever they are located, without first copying the data into an additional data store, as a data warehouse does. This approach eliminates the need for error-prone data replication or data migration, which can lead to data corruption. For end users, data is presented in a single unified view, often with advanced visualizations.

Data virtualization was born as a solution to the challenges of data lakes. When data lakes began to fill up with output from high-volume data sources such as social media, IoT-enabled devices, and mobile devices, organizations began searching for a way to integrate this data, often in unstructured formats, with their traditional structured business data.

Data warehouses take raw data, including data drawn from data lakes, and structure it through an extract, transform, and load (ETL) process so that it can be analyzed later. The process requires extra data replication and storage. Because numerous ETL processes are necessary and repositories must be kept in sync, it also carries a high potential for data inconsistencies. Data replication further raises security and governance concerns, notably the question of where users' sensitive data may be stored. Data virtualization addresses these challenges by letting data consumers view, access, and analyze disparate data sources, regardless of location, without moving the data.
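The synchronization problem can be illustrated with a minimal Python sketch. It is not how any particular product works: `source_orders`, `etl_load`, and `virtual_view` are hypothetical names invented for this example, and an in-memory list stands in for a remote system. The ETL-style function stores a transformed copy, while the virtual view applies the same transformation only when it is read.

```python
# Hypothetical source system: a list of order records that changes over time.
source_orders = [
    {"id": 1, "amount": 120.0},
    {"id": 2, "amount": 80.0},
]

def to_cents(record):
    """Shared transformation: express the amount in integer cents."""
    return {"id": record["id"], "amount_cents": int(round(record["amount"] * 100))}

def etl_load(source):
    """ETL style: extract and transform once, then keep a separate copy."""
    return [to_cents(r) for r in source]

def virtual_view(source):
    """Virtualization style: transform lazily, each time the view is read."""
    for r in source:
        yield to_cents(r)

warehouse_copy = etl_load(source_orders)  # replicated snapshot

# The source changes after the ETL run.
source_orders.append({"id": 3, "amount": 45.5})

stale_total = sum(r["amount_cents"] for r in warehouse_copy)
live_total = sum(r["amount_cents"] for r in virtual_view(source_orders))
print(stale_total, live_total)  # the copy misses order 3; the view includes it
```

Until another ETL run refreshes it, the replicated copy disagrees with the source, whereas the view has no copy to fall out of sync.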

Data virtualization vs. data federation

Data federation is a subset of data virtualization. Conceptually, a data federation is like a virtual data warehouse: it logically maps remote data sources and then runs queries against those sources to draw data into a structured schema. This strict data model is created within a virtual environment, unlike a data warehouse, which replicates data within its own physical stores.
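As a rough illustration of this mapping, the sketch below uses Python's built-in `sqlite3` module. Two attached in-memory databases stand in for remote sources, which is an assumption made for the example; a real federation engine would reach those systems over the network. The database, table, and column names (`crm`, `billing`, `customer_revenue`) are invented here.

```python
import sqlite3

# One session plays the role of the federation engine.
conn = sqlite3.connect(":memory:")

# Two separate databases stand in for two remote source systems.
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS billing")

conn.execute("CREATE TABLE crm.customers (cust_id INTEGER, full_name TEXT)")
conn.execute("CREATE TABLE billing.invoices (customer INTEGER, total REAL)")
conn.executemany("INSERT INTO crm.customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace")])
conn.executemany("INSERT INTO billing.invoices VALUES (?, ?)",
                 [(1, 100.0), (1, 50.0), (2, 75.0)])

# The federated schema: a TEMP view that maps both sources into one
# structured model. No rows are copied; the join runs at query time.
conn.execute("""
    CREATE TEMP VIEW customer_revenue AS
    SELECT c.cust_id AS customer_id,
           c.full_name AS name,
           SUM(i.total) AS revenue
    FROM crm.customers AS c
    JOIN billing.invoices AS i ON i.customer = c.cust_id
    GROUP BY c.cust_id, c.full_name
""")

rows = conn.execute(
    "SELECT customer_id, name, revenue FROM customer_revenue ORDER BY customer_id"
).fetchall()
print(rows)
```

The view is the only thing the consumer queries; each source keeps its own table and column names.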

Why is data virtualization important?

Two factors are pressuring enterprise data strategies: the massive volume of data generated by devices and systems used in day-to-day operations, and the inability of on-premises infrastructure to keep pace with those demands cost-effectively. While this has cornered some organizations that have stalled in their efforts to migrate to the cloud, others have turned to data virtualization to corral disparate data sources and incorporate them into a unified real-time data view.

The main features of data virtualization middleware are:

  • The ability to abstract data using a virtualized layer.
  • The ability to integrate unstructured and structured data from disparate sources.
  • The ability to facilitate the retrieval and manipulation of that data.

How does data virtualization work?

Data virtualization requires software that connects to data sources and creates the abstracted data layer, where data scientists can then design data views and perform rich analytics. With many of the large cloud providers, such as Azure, IBM, and AWS, setting up data virtualization can be as easy as a three-step process that amounts to turning on the data virtualization service and making some configuration choices. These cloud providers offer data virtualization as an extension to their existing services.

Companies that run on their own systems rather than in the cloud may need to install server and client applications that perform the virtualization, and then make a few configuration choices, such as adding and selecting data sources.
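The idea of connecting sources and then querying through one layer can be sketched in a few lines of Python. This is a toy model under stated assumptions, not any vendor's API: the `VirtualLayer` class, its `add_source` and `query` methods, and the sample CSV and JSON data are all hypothetical.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterator, List

Record = Dict[str, object]
Reader = Callable[[], Iterator[Record]]

class VirtualLayer:
    """Hypothetical abstraction layer: sources are registered, never copied."""

    def __init__(self) -> None:
        self._sources: Dict[str, Reader] = {}

    def add_source(self, name: str, reader: Reader) -> None:
        # Configuration step: adding a data source to the layer.
        self._sources[name] = reader

    def query(self, name: str,
              predicate: Callable[[Record], bool] = lambda r: True) -> List[Record]:
        # Data is pulled from the selected source only when a query runs.
        return [r for r in self._sources[name]() if predicate(r)]

# Structured source: CSV text standing in for a relational export.
CSV_TEXT = "sku,price\nA1,9.99\nB2,24.50\n"

def read_products() -> Iterator[Record]:
    for row in csv.DictReader(io.StringIO(CSV_TEXT)):
        yield {"sku": row["sku"], "price": float(row["price"])}

# Semi-structured source: JSON lines standing in for device events.
JSON_LINES = '{"sku": "A1", "event": "view"}\n{"sku": "B2", "event": "purchase"}\n'

def read_events() -> Iterator[Record]:
    for line in JSON_LINES.splitlines():
        yield json.loads(line)

layer = VirtualLayer()
layer.add_source("products", read_products)
layer.add_source("events", read_events)

# Combine both sources through the layer: products that were purchased.
purchased = {e["sku"] for e in layer.query("events", lambda r: r["event"] == "purchase")}
result = layer.query("products", lambda r: r["sku"] in purchased)
print(result)
```

The consumer works only with source names and records; the format and location of each source stay hidden behind its reader function.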

Advantages of data virtualization

Deploying a data virtualization architecture offers organizations several advantages.

  • Real-time Access — Data virtualization provides real-time access to disparate, geographically separate data sources.
  • Reduced Errors / Increased Data Accuracy — AI and automation, together with the elimination of data replication, improve data accuracy.
  • Less Resource Consumption / Cost Savings — Data virtualization removes steps that traditional data warehouses must go through, such as ETL, which directly reduces costs.
  • Lower Disk Requirements — Because data remains at the source, no duplicate disk space is needed to store processed copies.
  • Enforce Data Governance — Data virtualization (DV) tools can integrate with data governance tools and create a stable endpoint for sensitive data stores so that data governance remains intact.
  • Logical Abstraction and Decoupling of Data — Data virtualization allows heterogeneous data sources to interact.
  • Incorporates Structured and Unstructured Data — Data virtualization bridges semantic gaps between different types of structured and unstructured data.
  • Share Data Easily / Eliminate Data Silos — Data virtualization can overcome departmental data silos.

Logical data warehouse vs. traditional data warehouse

Logical data warehouses (LDW) and traditional data warehouses (or classical data warehouses) are complementary concepts. A traditional data warehouse is a physical set of servers and storage that draws data in, transforms it to fit its schema, and then analyzes and stores it for consumption. To meet the challenges of using disparate data sources alongside a data warehouse, while avoiding the extra effort and capital costs of integrating a data lake, organizations can use a logical data warehouse.

Logical data warehouses are virtualization architectures that sit atop a traditional data warehouse and extend its reach into other data repositories, such as enterprise data lakes, other data warehouses, Hadoop clusters, or NoSQL databases. LDWs provide a holistic view of an organization’s data without the need to format or transform consolidated data. Essentially, LDWs give data warehouses the advantages of data virtualization, including the ability to access unstructured data and combine it with warehouse data without moving it.
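A small Python sketch can make the arrangement concrete. It assumes an in-memory SQLite table as the "warehouse" and a list of raw JSON strings as the "data lake"; the names `sales`, `LAKE_OBJECTS`, and `logical_view` are illustrative only. The lake records are read and combined at query time and are never loaded into the warehouse.

```python
import json
import sqlite3

# The physical warehouse: structured rows that already fit a schema.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)",
                      [("EU", 1000.0), ("US", 1500.0)])

# Raw records as they might sit in a data lake (hypothetical customer feedback).
LAKE_OBJECTS = [
    '{"region": "EU", "text": "late delivery", "sentiment": -0.5}',
    '{"region": "US", "text": "great service", "sentiment": 0.75}',
    '{"region": "EU", "text": "fine overall", "sentiment": 0.25}',
]

def logical_view():
    """Join warehouse rows with lake records at read time."""
    scores_by_region = {}
    for raw in LAKE_OBJECTS:
        record = json.loads(raw)
        scores_by_region.setdefault(record["region"], []).append(record["sentiment"])

    query = "SELECT region, revenue FROM sales ORDER BY region"
    for region, revenue in warehouse.execute(query):
        scores = scores_by_region.get(region, [])
        average = sum(scores) / len(scores) if scores else None
        yield {"region": region, "revenue": revenue, "avg_sentiment": average}

report = list(logical_view())
print(report)

# The warehouse itself is unchanged: it still holds only the sales table.
tables = [name for (name,) in warehouse.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
```

The report combines both repositories, yet the warehouse schema and contents are exactly what they were before the query.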

Data lake vs. data warehouse

Data warehouses and data lakes are two elements within a larger data storage ecosystem. Both offer massive storage capacity, but each stores data differently, which gives each its own advantages. Data lakes store data in unstructured, native, raw formats. Data warehouses hold data in schema-structured formats, which requires processing the data first to align it with the schema.
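The difference between the two storage styles can be sketched as follows. The sample events, the two-field `SCHEMA`, and the `conform` function are assumptions made for illustration: the lake keeps every raw record as it arrives, while the warehouse accepts only records that can be fitted to its schema.

```python
import json

RAW_EVENTS = [
    '{"user": "u1", "amount": "19.90", "note": "gift"}',
    '{"user": "u2", "amount": "5"}',
    '{"user": "u3"}',  # no amount: cannot be fitted to the schema below
]

# Data lake: store everything in its raw, native form.
lake = list(RAW_EVENTS)

# Data warehouse: a fixed schema that every stored row must satisfy.
SCHEMA = {"user": str, "amount": float}

def conform(raw):
    """Parse a raw record and coerce it to SCHEMA, or raise if it does not fit."""
    record = json.loads(raw)
    return {field: cast(record[field]) for field, cast in SCHEMA.items()}

warehouse, rejected = [], []
for raw in RAW_EVENTS:
    try:
        warehouse.append(conform(raw))
    except (KeyError, ValueError):
        rejected.append(raw)

print(len(lake), len(warehouse), len(rejected))
```

The extra `note` field is dropped and the incomplete record is rejected, which is the processing cost the warehouse pays up front in exchange for uniform rows.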

Data lakes are optimized to let data accumulate. Data is now produced in such volumes that it is analyzed only when resources allow, or on demand when a need arises, so it must be able to pile up easily in the meantime. In this way, data is never disposed of, just saved until needed. Data lakes use object storage, which treats large bodies of unstructured data as individual objects, unlike file systems, which treat data as files. Object storage facilitates massive data storage, usually in the petabyte range.
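To give a feel for the object model, here is a toy in-memory store in Python. It is only a sketch: the `ObjectStore` class and its methods are hypothetical, and deriving the key from a content hash is an illustrative choice (real object storage services typically let the caller choose the key). The point is the shape of the data: a flat namespace in which each key maps to one object made of bytes plus descriptive metadata.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class StoredObject:
    data: bytes
    metadata: Dict[str, str] = field(default_factory=dict)

class ObjectStore:
    """Toy flat-namespace store: one key maps to one object."""

    def __init__(self) -> None:
        self._objects: Dict[str, StoredObject] = {}

    def put(self, data: bytes, **metadata: str) -> str:
        # Content-derived key; chosen here only to keep the example short.
        key = hashlib.sha256(data).hexdigest()
        self._objects[key] = StoredObject(data, dict(metadata))
        return key

    def get(self, key: str) -> StoredObject:
        return self._objects[key]

    def find(self, **criteria: str) -> List[str]:
        # Objects are located by metadata, not by walking a directory tree.
        return [
            key for key, obj in self._objects.items()
            if all(obj.metadata.get(name) == value for name, value in criteria.items())
        ]

store = ObjectStore()
sensor_key = store.put(b'{"temp": 21}', source="iot", kind="json")
photo_key = store.put(b"binary-image-bytes", source="mobile", kind="image")

print(store.find(source="iot") == [sensor_key])
print(store.get(photo_key).metadata["kind"])
```

Because nothing depends on a directory hierarchy, any kind of payload can be added without planning a layout first.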

Data warehouses, on the other hand, can store huge amounts of data, but only after it has been fitted into the schema that structures the warehouse. Because data warehouses are more selective about their data, they store much less of it than data lakes do. This data manipulation requires resources and often adds a necessary extra step before insights can be drawn from the data. Because the function of the data warehouse is migrating to the cloud, where it is being replaced by data virtualization and data federation in combination with AI and data lakes, the use of on-premises data warehouses is expected to decline sharply in the coming years.

Data virtualization use cases

Generally, data virtualization, like other forms of virtualization such as application virtualization and server virtualization, provides IT departments with exceptional flexibility in designing and architecting solutions to larger business needs. Virtualization provides an abstracted separation between the users of services and those responsible for ensuring the services are available. With data virtualization, data consumers can draw upon data analytics without concern for where and how the underlying data is stored, while admins can reorganize how they provide that service without disrupting its use.
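This separation can be shown with a short Python sketch. Everything in it is hypothetical: `DataService` plays the stable endpoint, and `legacy_inventory` and `cloud_inventory` stand in for an on-premises system and its cloud replacement. The consumer function is written once and keeps working when an admin rebinds the dataset to a different backend.

```python
from typing import Callable, Dict, List

Rows = List[Dict[str, object]]

class DataService:
    """Hypothetical stable endpoint: consumers call get(), admins call bind()."""

    def __init__(self) -> None:
        self._backends: Dict[str, Callable[[], Rows]] = {}

    def bind(self, dataset: str, backend: Callable[[], Rows]) -> None:
        self._backends[dataset] = backend

    def get(self, dataset: str) -> Rows:
        return self._backends[dataset]()

served_by: List[str] = []  # records which backend answered each request

def legacy_inventory() -> Rows:
    served_by.append("legacy")
    return [{"sku": "A1", "qty": 4}, {"sku": "B2", "qty": 6}]

def cloud_inventory() -> Rows:
    served_by.append("cloud")
    return [{"sku": "A1", "qty": 4}, {"sku": "B2", "qty": 6}]

def total_quantity(service: DataService) -> int:
    # Consumer code: knows the dataset name, not where the data lives.
    return sum(int(row["qty"]) for row in service.get("inventory"))

service = DataService()
service.bind("inventory", legacy_inventory)
before = total_quantity(service)

# An admin moves the dataset to a new backend; consumer code is untouched.
service.bind("inventory", cloud_inventory)
after = total_quantity(service)

print(before, after, served_by)
```

The same pattern underlies the migration and prototyping use cases: the backend changes while the name that consumers query stays fixed.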

Data virtualization supports several use cases. The following are ideal uses for data virtualization.

  • Cloud Data Sharing — Data virtualization is at home in the cloud where virtualization makes sharing data easier by connecting diverse data sources and networks.
  • Data Access / Semantic Layer — Data virtualization is vendor-agnostic, overcoming vendor-specific semantic layers that can cause challenging integrations.
  • Data Hub Enablement — Data hubs connect data producers with data consumers, and can be logically organized into any number and type of domain: for example, geographically, application-focused, or process-focused.
  • Data Preparation — The agility virtualization provides lets engineers prepare data sets together with data users, even as data sources multiply and data types widen.
  • Legacy System Migration — Before data virtualization, migrating to the cloud carried tremendous costs and risks. Data virtualization provides a path for companies to test the waters before committing to a full move to the cloud.
  • Logical Data Lakes (LDL) — Data lakes growing to astronomical sizes can be organized logically using data virtualization.
  • Logical Data Warehouses (LDW) — LDWs extend the strength of on-premises repositories with data virtualization.
  • Physical Integration Prototyping — Since physical integrations take extensive time to implement, data virtualization can be used to prototype candidate models and clarify issues before implementation.
  • Virtual Operational Data Stores (VODS) — VODS are virtualized versions of operational data stores (ODS), used to coordinate data from multiple disparate real-time business transaction systems.

Data virtualization tools

Data virtualization tools all aim to create a unified view for users through a logical collection of separate data sources. The following are identified by Gartner and G2 as the leaders in data virtualization tools.

  • AWS Glue — A fully managed solution that relieves clients of data virtualization administration.
  • Denodo Platform — Highly touted as user friendly with a quick learning curve.
  • Oracle Virtualization — A well-known brand, built for a Linux Kernel-based Virtual Machine (KVM) environment.
  • SAP HANA — SAP HANA promotes itself as a single in-memory architecture platform, and is highly rated for disaster recovery, access control, and authentication.
  • TIBCO Data Virtualization — TIBCO acquired Cisco's data virtualization package in 2017 and touts one of the widest ranges of data source connectivity.