Pentaho Releases “Filling the Data Lake” Big Data Blueprint for Hadoop

New Reference Architecture Accelerates Data Integration and Analytics Pipeline for Hadoop at Scale

June 28, 2016, HADOOP SUMMIT, SAN JOSE, CA - Pentaho, a Hitachi Group Company, today announced “Filling the Data Lake”, a blueprint that helps organizations architect a modern, flexible, scalable, and repeatable data onboarding process for ingesting big data into Hadoop data lakes. Data management professionals can now offload the drudgery of data preparation and spend more time on higher-value projects.

According to Ventana Research, big data projects require organizations to spend 46 percent of their time preparing data and 52 percent of their time checking for data quality and consistency. By following Pentaho’s “Filling the Data Lake” blueprint, organizations can manage a changing array of data sources, establish repeatable processes at scale and maintain control and governance along the way. With this capability, developers can easily scale ingestion processes and automate every step of the data pipeline.

“With disparate sources of data numbering in the thousands, hand coding transformations for each source is time consuming and extremely difficult to manage and maintain,” said Chuck Yarbrough, Senior Director of Solutions Marketing at Pentaho, a Hitachi Group Company. “Developers and data analysts need the ability to create one process that can support many different data sources by detecting metadata on the fly and using it to dynamically generate instructions that drive transformation logic in an automated fashion.”
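The approach Yarbrough describes, one process that detects metadata at run time and uses it to generate transformation instructions, can be illustrated with a minimal Python sketch. This is not Pentaho's implementation or API. The function names (detect_metadata, build_plan, ingest), the CSV input, and the three-type inference rule are all assumptions chosen for illustration, and the sketch assumes well-formed rows with one value per header column.

```python
import csv
import io

# Hypothetical helpers for illustration only; not part of any Pentaho product.
_ORDER = {"int": 0, "float": 1, "str": 2}
_CASTS = {"int": int, "float": float, "str": str}


def _type_of(value):
    """Return the narrowest type name that can parse the raw string."""
    for name in ("int", "float"):
        try:
            _CASTS[name](value)
            return name
        except ValueError:
            pass
    return "str"


def detect_metadata(text):
    """Read the header and scan the rows to infer one type per column."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    types = ["int"] * len(header)
    for row in reader:
        for col, value in enumerate(row[:len(header)]):
            observed = _type_of(value)
            # Widen the column type if this value needs a broader one.
            if _ORDER[observed] > _ORDER[types[col]]:
                types[col] = observed
    return list(zip(header, types))


def build_plan(metadata):
    """Turn detected metadata into instructions: rename and cast each column."""
    return [
        {
            "source": name,
            "target": name.strip().lower().replace(" ", "_"),
            "cast": type_name,
        }
        for name, type_name in metadata
    ]


def ingest(text):
    """Run the same generic process against any delimited source."""
    plan = build_plan(detect_metadata(text))
    rows = csv.DictReader(io.StringIO(text))
    return [
        {step["target"]: _CASTS[step["cast"]](row[step["source"]]) for step in plan}
        for row in rows
    ]


# Two differently shaped sources handled by one process, with no per-source code.
orders = "Order ID,Amount\n1,9.50\n2,12\n"
sensors = "Sensor,Reading\na1,3\n"
order_rows = ingest(orders)
sensor_rows = ingest(sensors)
```

The point of the sketch is that adding a new source requires no new code: the plan is derived from the data itself, so only the generic process needs to be maintained.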

According to a Forrester Consulting report commissioned by Pentaho, 52% of firms blend 50 or more distinct data sources to enable analytics capabilities, about a third (34%) blend 100 or more, and 12% blend 1,000 or more. While many organizations use Python or other scripting languages to code their way through these data sources, the “Filling the Data Lake” architecture reduces dependence on hard-coded data ingestion procedures, which can improve operational efficiency, increase cost savings, and greatly ease the maintenance burden.

“A major challenge in today’s world of big data is filling Hadoop data lakes in a simple, automated way. Our team was passionate about identifying repeatable ways to accelerate the big data analytics pipeline and has developed an approach to drive more agile and automated big data analytics at scale,” added Yarbrough.

Pentaho has created four other blueprints to help enterprises quickly tackle and optimize their big data projects: Optimize Data Warehouse, Monetize My Data, Streamlined Data Refinery and Customer 360-Degree View.


About Pentaho, a Hitachi Group company
Pentaho, a Hitachi Group company, is a leading data integration and business analytics company with an enterprise-class, open source-based platform for diverse big data deployments. Pentaho’s unified data integration and analytics platform is comprehensive, completely embeddable and delivers governed data to power any analytics in any environment. Pentaho’s mission is to help organizations across multiple industries harness the value from all their data, including big data and IoT, enabling them to find new revenue streams, operate more efficiently, deliver outstanding service and minimize risk. Pentaho has over 15,000 product deployments and 1,500 commercial customers today including ABN-AMRO Clearing, BT, EMC, NASDAQ and Sears Holdings Corporation. For more information visit
