Enter the World of Automated Data Management and Governance with Hitachi’s Lumada Data Catalog

Madhup Mishra
Director, DataOps Product Marketing, Hitachi Vantara

August 31, 2021

The era of manual data management and governance is rapidly coming to a close. The size of the trove of data at nearly every company has become so enormous that it cannot be maintained using manual cleaning, cataloging, governance and search methods. The release of Lumada Data Catalog 6.1 breaks new ground in automating data management, cleaning and governance processes, making it easier to find data and grant access to those who need it.

The downside of the data explosion is that enterprises are left with an ever-expanding, unruly data mess. Understanding and organizing data to derive value from it is not a one-time challenge; it is constant. The more data and records you have across your datasphere, the larger your digital footprint. Multiple, duplicated records and datasets are both data management and data protection burdens. The more data you have, the more you need to protect. Lumada Data Catalog 6.1 can help you reduce duplication to shrink the potential surface area that’s vulnerable to breaches.

Layered on top of the data explosion is the proliferation of data privacy and management regulations. The pressure to meet increasingly stringent regulatory requirements fuels the need for better data governance. For example, New York’s SHIELD Act requires a procedure to dispose of private information within a reasonable amount of time if it is no longer needed for business purposes. So how does your organization figure out what is considered private information and then determine a reasonable amount of time? The artificial intelligence (AI) driven data-fingerprinting capability in Lumada Data Catalog 6.1 can help you figure that out.

Data clutter causes a multitude of problems. Maintaining, storing and protecting extra copies of data weighs down the data processing elements; the more data you have, the more processing horsepower you need. Data duplication can negatively affect a company’s brand, too, not just around breaches but also in how that data is used for communication and marketing campaigns. How can you know your customers when there are 13 versions of each of them? Lumada Data Catalog 6.1 can help you reduce the data clutter.

Search swamping is another problem that arises from having tens or hundreds of duplicate records. It happens when analysts search for data that has many duplicates: the duplicates lengthen the search for the correct data, so analysts keep hunting and pecking through the excess. Deleting duplicate data is the easy part. The hard part is finding it. Nobody knows how much of the data is duplicated because the copies are not identical matches; they are 80% or 90% the same. Lumada Data Catalog 6.1 addresses all these pain points, helping you deliver more intelligent data operations.

What’s New in Lumada Data Catalog 6.1?

The new release of Lumada Data Catalog 6.1 is ideal for data governance customers looking for cutting-edge data cataloging improvements to accelerate data discovery, govern and secure sensitive information, and prevent data clutter.

The new version of Data Catalog is designed to address three challenge areas:

  • Accelerating data discovery.
  • Improving data governance.
  • Reducing data clutter.

Data Catalog drives intelligent data operations, equipping your enterprise to govern its data so that it can be discovered faster, kept secure, and used as a single source of truth once data clutter is removed.

Features that form the bedrock of Lumada Data Catalog 6.1 include:

  • Data rationalization, which gives you the ability to tame your duplicate data and lower operations cost and risk.
  • Powerful search and export features, which enable flexible data discovery.

Note that important support for Collibra is targeted for an upcoming Lumada Data Catalog release:

  • Collibra integration will let you accelerate self-service by combining Collibra’s governed business terminology with Lumada’s AI-driven data discovery.

Better Data Governance

Data governance has two functions: quality and compliance. Every organization wants to ensure they are using the best quality data they can. Maintaining good data hygiene ensures you have the correct data, that it is of good quality, and that it is easily accessible to the people who need it.

Regulatory compliance entails passing security audits and meeting GDPR or CCPA requirements. To remain compliant, organizations need to know where their data is and whether the sensitive data within the record is secure. The data must be masked based on its level of confidentiality so it can’t be inadvertently shared, shown, viewed or used. That’s a pretty big component of governance.
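To make the idea of masking by confidentiality level concrete, here is a minimal Python sketch. It is not Lumada Data Catalog’s API or masking logic. The confidentiality levels, the field classifications and the `mask_record` helper are all hypothetical, chosen only to illustrate the principle that a viewer sees a field in the clear only when their clearance covers it.

```python
# Hypothetical confidentiality levels, ordered from least to most sensitive.
LEVELS = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}

# Hypothetical field classifications for a customer record.
FIELD_LEVELS = {
    "customer_id": "internal",
    "email": "confidential",
    "ssn": "restricted",
    "country": "public",
}


def mask_value(value: str, keep_last: int = 4) -> str:
    """Replace all but the last few characters with asterisks."""
    if len(value) <= keep_last:
        return "*" * len(value)
    return "*" * (len(value) - keep_last) + value[-keep_last:]


def mask_record(record: dict, clearance: str) -> dict:
    """Mask every field classified above the viewer's clearance level.

    Unclassified fields are treated as restricted, so they are masked by default.
    """
    allowed = LEVELS[clearance]
    masked = {}
    for field, value in record.items():
        level = LEVELS[FIELD_LEVELS.get(field, "restricted")]
        masked[field] = value if level <= allowed else mask_value(str(value))
    return masked


# Made-up sample record for illustration.
record = {
    "customer_id": "C-1042",
    "email": "ana@example.com",
    "ssn": "123-45-6789",
    "country": "US",
}
print(mask_record(record, clearance="internal"))
```

Treating unclassified fields as restricted is a deliberate choice in this sketch: a field nobody has classified yet is masked rather than exposed.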

For many organizations, data clutter is a significant obstacle to achieving good data governance, and it affects data quality and compliance. For example, an enterprise can maintain up to 13 copies of any given data object. Often, these 13 copies aren’t identical; they might be 85% to 90% the same, which causes problems.

First, storing multiple copies of data objects is expensive. Keeping all that duplicate data — clutter — in the cloud adds unnecessary costs. Ideally, before moving that data to the cloud, you want to eliminate the trash — those duplicates. Second, holding on to 13 copies of nearly identical data increases the risk that your organization will end up with multiple versions of the truth.

Let’s say someone updates only one of the 13 copies of a data record, not knowing that 12 other copies remain. Then, another person uses the second, third, or 12th copy. Suddenly, you have people not working from the same set of data, which can contribute to wrong decisions. Nobody knows which copy to update, and, therefore, there is no single source of truth. Without a single version of the truth, your organization ends up working with incomplete data, all because of data clutter. Lumada Data Catalog 6.1 minimizes data clutter, dramatically improving data quality and reducing regulatory risk.

Data Rationalization

Worldwide data is expected to hit 175ZB by 2025, representing a 61% CAGR. In addition, every duplicate data record increases the surface area for potential data breaches, adding to the data management and protection burden. That’s why data rationalization is a critical update to Lumada Data Catalog 6.1.

Data rationalization gives enterprises a view of their entire data estate through the lens of data duplication. The new Data Catalog release helps measure the degrees of duplication between files, clustering copies into groups for investigation and rationalization. It leverages AI to remove duplicate copies of data (partial copies included), reducing the cost of data storage and the risk of working from the wrong copy, one that received only partial updates.

Data rationalization uses the AI technology that Lumada Data Catalog excels at to find commonalities between different datasets. Data Catalog looks inside your data files to see what is the same, identifying duplicate data in a granular way that helps identify the minor differences in records. In the process, users gain visibility of the records, so they are better positioned to get rid of duplicates. As a result, Data Catalog helps you to lower both cost and risk, including risk associated with decisions made from bad data and regulatory risk.
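To picture what “measuring the degrees of duplication” and “clustering copies into groups” can look like, here is a minimal Python sketch. It is not how Lumada Data Catalog works internally. It assumes a simple Jaccard similarity over sets of records, a hypothetical 0.8 threshold, and illustrative names (`jaccard`, `cluster_near_duplicates`) and sample data that do not come from the product.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Share of records two datasets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_near_duplicates(datasets: dict, threshold: float = 0.8) -> list:
    """Group dataset names whose record overlap meets the threshold.

    Uses union-find so that A~B and B~C end up in one cluster.
    """
    parent = {name: name for name in datasets}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for left, right in combinations(datasets, 2):
        if jaccard(datasets[left], datasets[right]) >= threshold:
            parent[find(left)] = find(right)

    clusters = {}
    for name in datasets:
        clusters.setdefault(find(name), []).append(name)
    # Only clusters with more than one member are duplication candidates.
    return sorted(sorted(c) for c in clusters.values() if len(c) > 1)


# Hypothetical sample data: each dataset is a set of record rows.
base = {("c%03d" % i, "customer %d" % i) for i in range(10)}
datasets = {
    "crm_export.csv": base,
    "crm_export_copy.csv": base | {("c999", "late addition")},
    "marketing_list.csv": {("m001", "lead 1"), ("m002", "lead 2")},
}

print(cluster_near_duplicates(datasets))
```

In this toy example the two CRM files share 10 of 11 distinct records, so they land in the same cluster even though they are not identical, while the marketing list stays out. Comparing every pair is fine for a handful of files, but a real data estate would need a more scalable technique.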

Collibra Integration

Lumada Data Catalog’s connector feature for Collibra is targeted for an upcoming release to give organizations using Collibra Data Governance bi-directional connectivity with Collibra’s data governance capabilities. This ability enables Collibra users to tap into the business terms from Collibra and use Data Catalog’s AI to find those tags across large swaths of data. Once that is complete, they can easily re-import that cataloged data and business rules back into the Collibra environment to be used by various business teams.

Think of Lumada Data Catalog as something that can sit below Collibra, from a governance standpoint, where it can find the key business terms that an organization has defined there. Collibra then activates Data Catalog’s AI tagging to find that term anywhere in the datasphere. Many organizations use more than one data catalog, including Collibra. Their data is in different places and tagged in different ways, making it impossible to find the right data in time to use it, even after having the right tools and catalogs.

Using the connector enables Data Catalog to work closely, “hand in glove,” with the Collibra setup. First, Lumada Data Catalog interprets the business language, and then its AI finds all those tags within the online data. So now, you can not only define your business terms, but also get more value out of your data by being able to find, see, move and use it. This integration accelerates self-service by combining Collibra’s governed business terminology with Data Catalog’s AI-driven data discovery. Data Catalog is excellent at data discovery using AI. By bringing these two together, organizations get the best of both worlds.

Collibra enriches Lumada Data Catalog by providing a dictionary of consensus nomenclature to label data. Tag recognition is trained using known data assets related to a term, an approach known as “transfer learning.” Data Catalog enriches Collibra by “connecting the car to the road.” In other words, it contributes discovered data assets back to Collibra, where they can be governed.

Improving Search and Export

Finally, improved search and export gives enterprises the ability to use regular expressions and advanced search capabilities to filter datasets very flexibly. These capabilities include CSV-based metadata export, which other systems can consume. As a result, business analysts, data scientists and businesspeople can conduct data discovery much faster. What’s speeding things up is the use of an AI-driven catalog that quickly tags all the information.
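As a rough illustration of regular-expression filtering followed by CSV metadata export, here is a minimal Python sketch. It does not use Lumada Data Catalog’s search syntax or export format. The catalog entries, field names and helper functions (`search`, `export_csv`) are hypothetical, and it relies only on Python’s standard `re` and `csv` modules.

```python
import csv
import io
import re

# Hypothetical catalog entries; field names are illustrative only.
CATALOG = [
    {"name": "customers_2021.csv", "tags": "customer;pii", "owner": "sales"},
    {"name": "customers_2020_backup.csv", "tags": "customer;pii", "owner": "sales"},
    {"name": "web_logs.parquet", "tags": "clickstream", "owner": "marketing"},
]


def search(entries, pattern, field="name"):
    """Return entries whose chosen field matches the regular expression."""
    regex = re.compile(pattern)
    return [e for e in entries if regex.search(e[field])]


def export_csv(entries):
    """Serialize metadata entries as CSV text for use by other systems."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["name", "tags", "owner"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


# Match yearly customer files but not their backups.
matches = search(CATALOG, r"^customers_\d{4}\.csv$")
print(export_csv(matches))
```

The anchored pattern is what keeps the backup file out of the results. The same `search` helper can filter on another field, for example `search(CATALOG, r"\bpii\b", field="tags")`.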

Clean Up the Data Mess With Lumada Data Catalog 6.1

Lumada Data Catalog 6.1 can help your organization clean up the data mess by helping to find what data is there, reduce it, clear it and remove duplicate records. Once that’s done, the new Data Catalog makes the data better by improving its quality and making it more searchable and more governable. As a result, organizations can derive value from their data faster, reduce risk caused by data clutter, and improve data governance relating to data quality and regulatory compliance.

See the Lumada Data Catalog page for details.

Madhup Mishra is Director of Product Marketing, Lumada DataOps Suite at Hitachi Vantara.

Madhup Mishra

Madhup drives product marketing for Hitachi Vantara's DataOps portfolio business. He has 20+ years of enterprise software experience and covers a range of topics including data operations, big data analytics, data governance and the Internet of Things (IoT).