DataOps is an umbrella term for an organization's collection of technical practices, workflows, cultural norms, and architectural patterns that govern how it integrates data assets into its business goals and processes. Each organization's data pipelines are therefore likely to be configured differently. In general, however, DataOps efforts aim to enable four capabilities within the company:
1. Rapid innovation, experimentation, and testing, to deliver data insights to users and customers.
2. Heightened data quality with extremely low error rates.
3. Synchronized collaboration across teams of people, environments, and technologies.
4. Orchestrated monitoring of data pipelines to ensure clear measurements and transparent results.
Data pipelines are a key concept in DataOps. They encompass all the data processing steps needed to rapidly deliver comprehensive, curated data to data consumers. By analogy, they are sometimes called data factories, a name that conveys a sense of tangibility and frames data as a raw material to be processed.
Data engineers may architect multiple data pipelines in their DataOps designs: pipelines between data providers and data preparers, and pipelines that serve data consumers. For example, one data pipeline may move application data into a data warehouse, another may move data from a data lake to an analytics platform, and others may feed back into their own source simply for processing, such as sales data flowing back into a sales platform.
Only over the last decade, however, have technological and business conditions set the stage for DataOps as a recognized, emerging set of practices that developed out of others such as DevOps. A formalized framework has yet to coalesce around its developing best practices. Many believe that DataOps is still passing through the early stages of the hype cycle, though the marketplace has already welcomed many vendor solutions, including end-to-end DataOps platforms.
DataOps has foundations stemming from many processes historically grounded in DevOps. Today three schools of thought comprise the main foundational principles of DataOps: Agile Development, DevOps, and Lean Manufacturing.
Agile Development — Agile development is a popular software development process. In DataOps, agile methods are founded on rapid development cycles and highly collaborative environments. Agile teams work in short cycles, called 'sprints', which may last from a day to several weeks. Quick cycle times allow the team to reevaluate its priorities often and make course corrections to its application as needed, based on the needs of users.
In practice, teams can hunt down bugs and roll out fixes faster (potentially within hours of discovery), as well as receive user feedback and deliver improved application features as they are imagined, instead of bundling updates and fixes to be released in a single version late to market.
The flexibility and quickness of agile development methods provide developers an ideal framework for rapid development around growing DataOps assets.
DevOps (Development and Operations) — DevOps, a building block of DataOps, relies on the automation of repetitive tasks, like testing code, to produce an environment of continuous development and accelerate the build lifecycle. DevOps practices allow teams to communicate and collaborate easily, release fixes and updates faster, and reduce time-to-resolution.
Although DataOps borrows liberally from the ideas of DevOps, DevOps is a process amongst technical roles on development teams, whereas DataOps serves not only technical but non-technical roles, like data consumers, inside and outside the organization.
Lean Manufacturing — Lean Manufacturing is a set of principles and methods that aim to reduce wastage while maintaining productivity in manufacturing settings. These principles have proved to be portable, and DataOps borrows methods that suit its needs well, like statistical process control (SPC), which has been applied to data factories to great effect.
To guard against data errors, SPC tests the data flowing through the DataOps pipeline to ensure its validity at every stage: input tests catch data supplier inconsistencies, in-process tests verify data integrity during processing, and output tests catch final data errors before the data passes downstream to data consumers.
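The stage-by-stage checks described above can be illustrated with a minimal Python sketch. It assumes a simple form of SPC in which control limits are set at three sample standard deviations around the mean of historical measurements. The function names, the row-count metric, and the sample figures are hypothetical and are not taken from any particular DataOps product.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Derive lower and upper control limits from historical measurements."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation
    return center - sigmas * spread, center + sigmas * spread


def check_stage(stage, value, limits):
    """Raise if a measurement falls outside the control limits for a stage."""
    lower, upper = limits
    if not lower <= value <= upper:
        raise ValueError(
            f"{stage}: value {value} is outside control limits "
            f"[{lower:.1f}, {upper:.1f}]"
        )
    return value


# Hypothetical history: daily row counts received from one data supplier.
baseline_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]
row_count_limits = control_limits(baseline_row_counts)

# The same kind of check can run at the input, processing, and output stages,
# each with limits derived from that stage's own history.
check_stage("input", 1015, row_count_limits)  # within limits, passes silently

try:
    check_stage("input", 400, row_count_limits)  # far too few rows
except ValueError as err:
    print(err)
```

In a real pipeline each stage would likely track several metrics (row counts, null rates, value ranges), and the limits would be recomputed as new history accumulates.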
To maximize data as an enterprise asset, DataOps takes a holistic approach, focused on improving communications, integrations, and automation. Fundamentally, DataOps is founded on people, processes, and technology.
On a small scale, companies can improve their DataOps processes and accuracy by using specialized data tools that improve their data pipelines. The overarching mission of DataOps, however, is to achieve an organization-wide culture change that puts data first and strives to maximize all data assets. The following DataOps framework elements help guide organizations in thinking holistically about their people, processes, and technology.
1. Enabling Technologies — Use technologies such as IT automation and data management tools.
2. Adaptive Architecture — Deploy systems that allow for continuous integration (CI) and continuous deployment (CD).
3. Intelligent Metadata — Technology that automatically enriches incoming data.
4. DataOps Methodology — Game plan for deploying analytics and data pipelines, and adhering to data governance policies.
5. Culture and People — Cultivation of organizational ethos that appreciates and utilizes data and aims to maximize data assets.
DataOps is not DevOps, but DataOps processes have benefited significantly from DevOps, one of its foundational methodologies. DevOps introduces two capabilities that enable Agile development within DataOps: continuous integration (CI) and continuous deployment (CD). Agile methods demand quick development times in the form of sprints. When tests and deployments are run manually, however, the process is slow and error-prone. With CI and CD, automation removes that obstacle to Agile working, namely the time-consuming and risky nature of manual workflows.
DataOps brings the CI and CD concepts into its workflow, enabling the same agile thinking in its data preparation and designs while automating processes à la DevOps, so that for data users the data factories seem to disappear. Common stages of a DataOps workflow are:
1. Sandbox Management — The process of creating an isolated environment for experimentation.
2. Development — The design and building of apps.
3. Orchestrate — Two data orchestration stages occur: the first orchestrates a representative copy of data for testing and development, and the second orchestrates the data factory itself.
4. Test — The testing stage targets code rather than data. However, in the orchestration stages, the testing of data is a primary task.
5. Deploy — Similar to DevOps, after successful code tests, the code is deployed to production.
6. Monitor — Monitoring occurs at all stages, with particular attention paid to end monitoring of the data factory so that data exits pristine and honest.
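As a rough illustration of how the orchestration, data testing, and monitoring stages listed above fit together, the following Python sketch runs a small pipeline in which every step is followed by a data test and a monitoring record. It is a simplified sketch under stated assumptions: the `run_pipeline` helper, the step names, and the sample sales rows are all hypothetical, and a production data factory would normally rely on a dedicated orchestration tool.

```python
import time


def run_pipeline(data, steps, monitor_log):
    """Run each step in order, test its output data, and record a metric."""
    for name, transform, data_test in steps:
        started = time.perf_counter()
        data = transform(data)
        passed = data_test(data)
        monitor_log.append(
            {"step": name, "passed": passed,
             "seconds": time.perf_counter() - started}
        )
        if not passed:
            # Stop before bad data reaches downstream consumers.
            raise RuntimeError(f"data test failed after step {name!r}")
    return data


def drop_incomplete(rows):
    """Remove rows that have no amount."""
    return [row for row in rows if row["amount"] is not None]


def total_by_region(rows):
    """Sum amounts per region."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


# Each step is (name, transform, data test on the transform's output).
steps = [
    ("clean", drop_incomplete,
     lambda rows: len(rows) > 0 and all(r["amount"] is not None for r in rows)),
    ("aggregate", total_by_region,
     lambda totals: all(total >= 0 for total in totals.values())),
]

# Hypothetical sales rows standing in for a representative copy of data.
sample_rows = [
    {"region": "east", "amount": 10.0},
    {"region": "west", "amount": None},
    {"region": "east", "amount": 5.0},
]

monitor_log = []
result = run_pipeline(sample_rows, steps, monitor_log)
print(result)
for entry in monitor_log:
    print(entry["step"], "passed" if entry["passed"] else "failed")
```

Keeping the data test next to each step means a failure is caught at the stage where it occurs, and the monitoring log gives a per-step record that can be reviewed after every run.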
A defining characteristic of DataOps is the number of roles that interact with and contribute to the accumulation, processing, and use of a company's data assets. At the extreme, companies whose data assets are their main value proposition have the most immediate need to understand the people engaging with proprietary information. These DataOps roles can be generally classified as data suppliers, data preparers, and data consumers.
Data Suppliers — Data suppliers are the end data owners, like database administrators, responsible for data management, processing, and user access control of a company’s DataOps.
Data Preparers — Due to the ever-increasing complexity of DataOps, a middle ground of roles is developing between data suppliers and data consumers. Data engineers build the pipelines that refine raw data into new usable, valuable, and monetizable data. The data curator is a developing role that starts from the needs of consumers and optimizes DataOps content accordingly, giving businesses the context needed to enhance final assets. Another role developing in response to heightened data governance requirements is the data steward, who is responsible for developing company data governance policies and ensuring compliance.
Data Consumers — Data consumers receive the final data output and are the largest group that interacts with DataOps assets. Many roles have emerged: data scientists apply data to solve business problems, data citizens are frontline workers in need of real-time information, and data developers need accurate DataOps as they build business applications that use those pipelines.