Welcome to episode 5 of the first season of "Your DataOps Advantage," a podcast series by Hitachi Vantara! In this episode, podcast host Bill Schmarzo (our CTO, IoT and Analytics) sits down with Mauro Damo (Senior Data Scientist) to highlight the data science 'gotchas' along the DataOps and Digital Value Enablement journey. Together, these two data science gurus discuss big surprises, unanticipated tradeoffs and words of wisdom. Cue it up folks – it's time to get geeky! (Note: the quality of audio in some parts of this podcast may not be great, but the content is, so we encourage you to listen!)
CTO, IoT and Analytics, Hitachi Vantara
Bill Schmarzo is regarded as one of the top Digital Transformation influencers on Big Data and Data Science. His career spans over 30 years in data warehousing, BI and advanced analytics. As the current CTO, Analytics and IoT for Hitachi Vantara, "The Dean of Big Data" guides the company's technology strategy and drives "co-creation" efforts with select customers to leverage IoT and analytics to power digital transformation.
Senior Data Scientist, Hitachi Vantara
For 19 years, Mauro has been developing analytical solutions for several industries, including telecommunications, media, e-commerce, finance, internet, retail, supply chain, health care and advertising, and in different functional areas such as marketing, finance and operations.
Hello. Welcome to the Hitachi Vantara DataOps Advantage podcast. My name is Bill Schmarzo and I'm the CTO, IoT and Analytics here at Hitachi Vantara. The DataOps Advantage podcast is going to track the trials and tribulations of organizations of different sizes across different industries as they wrestle with how to get value from their data. We're going to talk to these organizations about how they're leveraging DataOps to uncover value in their data and help them figure out how to drive return on their data investments.
Hello everybody and welcome to episode five of the Hitachi DataOps Advantage podcast series. This is a podcast series where we talk about the trials and tribulations of organizations in their DataOps journey, especially, in many cases, with respect to how they move forward on their data lakes. As you know from our previous podcasts, Hitachi Vantara started down a data lake strategy, didn't have much luck, and Renee, our CIO, decided to reset and restart it. Hopefully you've listened to the other podcasts that take you up to this point, because today we're going to go totally nerd. I've got Mauro Damo on the line here, our Senior Data Scientist. It's hard to find somebody in the organization who is more nerdy than Mauro. Mauro, thanks for joining us today.
Thank you for the opportunity to talk with you. Yeah. It's a pleasure to be here.
So Mauro, tell me. Let's start off by helping me understand: how do you view the role of data science in this data lake second surgery initiative we're doing at Vantara?
Yeah. Today, the data science role here is crucial, right? It's very important. In general, what we see in companies with data lakes and the EDP is that all the BI, business intelligence, work looks at the past. We have dashboards, and in general we look at the past. BI works at almost all companies, and it's very important because they can track and see all the historical progress of the company's KPI indicators. But data scientists, we work in the future, right? We look ahead. We use the past not just to explain but to try to predict the future. And I think that's the main role here, because after we went through the Digital Value Enablement, there was a workshop where all the stakeholders of the company were present: technology, the marketing team, finance. A lot of people were there, and we were there too. The idea is how we can extract value from the data. So in this case, data science is very important, and its role is crucial, right?
I love your explanation, Mauro, that you work in the future, right? You're kind of like Doc Brown in Back to the Future: you jump in your DeLorean and off you go into the future. It isn't sufficient for businesses to have reports that tell them what happened. They need to start transforming their mindset to thinking about the problems they can predict and the prescriptive actions they can take. And I laid out a term earlier; let me kind of explain it. I talk about this data lake second surgery problem, which is that a lot of organizations built these data lakes. They got some random technology, probably Hadoop, put some random data in it, hired some data scientists, and waited for magic to happen, and waited for magic to happen, and waited for magic to happen. And it didn't happen. Part of it had to do with that BI mindset that looks at building reports and dashboards about what happened. But I think what you highlighted here is the business-critical nature of being forward looking as you try to look at what's going to happen with the business. So, Mauro, as you jumped in your DeLorean and jetted off into the future, what were some of the gotchas that you hit along that path? The things that surprised you, both pleasant and unpleasant, right?
It's a very interesting point that you raise. I can tell you in general what is most unpleasant, right? We think that data science is fancy work, like a lot of algorithms, techniques, math and modelling, but 80% of our work is trying to extract the right data, understand the database, and learn how it works. The first time we saw the EDP, there were 600 different types of tables. It's a huge data lake. It's very important that we have this data lake, because we are still one step ahead of other companies; some of them don't have a lot of data. Getting the data is very important, and it's very difficult, right? Getting the right data from the systems takes a lot of time, and I think this is a step that often gets left behind. Right now, we are in good shape to create value from this data, but even with this data lake and all this data, you still have 600 tables. We need to understand which ones we're going to use for the use case. Which is the best one? We are building our data management, so we are building our data catalogue. IT is still building this with the business units. It's very important to have one, because it's going to make the work much, much simpler for our data engineers and data scientists. And of course this part is not always a pleasure, because the data is not always the way that we want it, right? It's not like in academia, where the data is clean. We have missing data, we have problems with the data, we have different types of problems that we need to solve, and we need to apply transformations and decide which one is better for the modelling. This kind of work is very intense and takes around 80% of the time for data scientists. I wouldn't say it's the most pleasurable part of the work, but we like to do it right, so it's not a problem, even if it's sometimes very painful. I think the way that we arrived at the solution was very good, in my humble opinion.
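Mauro's point about missing data and transformations can be made concrete with a minimal sketch. The records, field names, and values below are invented for illustration and are not from the project; the example shows one of the simplest cleaning steps he alludes to, filling missing values with the median of the observed ones:

```python
from statistics import median

def impute_median(rows, field):
    """Fill missing values in `field` with the median of the observed values."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = median(observed)
    return [{**r, field: r[field] if r[field] is not None else fill} for r in rows]

# Toy records with one missing monthly_spend value (hypothetical data).
records = [
    {"account": "a1", "monthly_spend": 120.0},
    {"account": "a2", "monthly_spend": None},
    {"account": "a3", "monthly_spend": 80.0},
    {"account": "a4", "monthly_spend": 100.0},
]

cleaned = impute_median(records, "monthly_spend")
print(cleaned[1]["monthly_spend"])  # a2 gets the median of 120, 80, 100 -> 100.0
```

In practice a real pipeline would choose the imputation strategy per column, which is exactly the "which transformation is better for the modelling" question Mauro describes.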
And so we reviewed the actual three models, right? We talked through all of them, model by model, not just among the data scientists, but with the whole IT team and the business units. We have a lot of people we talk with, so we had a lot of discussion about how to understand the data. Everybody, thank you for your help; without you it would not have been possible to build models like this. Going back to the models themselves, we did three types of models… These three models are very important. I think the way that we work together here is very important, and the outcomes that we are getting are amazing.
Mauro, were you surprised? Because I know I was surprised that when you built these models, the process you went through was very thorough, very interactive, very collaborative. And when you came out, you came up with, you know, a propensity-to-buy model with around 81% accuracy, which is really quite impressive. But what really struck me was you only needed three datasets, right? You think about how so many organizations with their data lakes focused on spending so much time getting data into the data lake, but do not spend enough time figuring out how to derive and drive new sources of value. So to me, I was totally shocked. I had always kind of hoped it was the case, or thought it might be the case, but it was amazing to me to see that we only needed three datasets to generate an 81% accuracy model. Is that normal, or is that the exception?
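As a purely illustrative sketch of what "only three datasets" can look like in practice, three small keyed sources can be joined into one modeling table. The dataset names (accounts, usage, support) and every field below are hypothetical, not Hitachi's actual data:

```python
# Three hypothetical datasets keyed by account id (all names are illustrative).
accounts = {"a1": {"segment": "enterprise"}, "a2": {"segment": "smb"}}
usage    = {"a1": {"logins_90d": 42},        "a2": {"logins_90d": 3}}
support  = {"a1": {"open_tickets": 0},       "a2": {"open_tickets": 5}}

def build_features(*sources):
    """Inner-join dict-keyed datasets on their shared keys into one row per key."""
    keys = set(sources[0]).intersection(*sources[1:])
    return {k: {f: v for src in sources for f, v in src[k].items()} for k in sorted(keys)}

table = build_features(accounts, usage, support)
print(table["a2"])  # one flat feature row per account, ready for modeling
```

The point mirrors Bill's observation: a handful of well-chosen, well-joined sources can be worth more than a lake full of unintegrated tables.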
We achieved this performance because at the beginning of the project we understood exactly what the business needs, right? I think that's very important because we can go straight to the point. We don't need to do a huge exploration where we put all the data together and try to figure out what the problem is. I think a lot of companies make that mistake: okay, let's put all the data together and let's see what we're going to get, right? But I think that is not the right way, not the best way, not the most intelligent way. The best way is: let's see what the problem is and what we need from the business side, and then start to understand what data we need. Then we go through all the data science processes, and the accuracy that we got is more than 81%; some models reach 90%. With only three datasets, it's very important to know exactly what you're doing and what data you're using. And this is the point: based on the process, we were able to reach this accuracy level. Of course, it's not a trivial process. The data helps a lot, and the way that we built the model helps too. We tried different types of modelling; just for propensity to buy, we tried 30 different types of models, considering hyperparameters and different scenarios and datasets. That's the way we try to extract as much value as we can from the data: we do exhaustive work to build the best one. So, Bill, I think the process was very important to help us with the guidelines and to set up expectations, because sometimes the business has high expectations about what they want. Setting up those expectations is very important. We can't see the future, right?
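Mauro's exhaustive search over models and hyperparameters can be illustrated with a minimal sketch: a tiny hand-rolled nearest-neighbour classifier whose single hyperparameter k is chosen by accuracy on a held-out validation set. All data and numbers here are toy values invented for the example, not the project's:

```python
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points (1-D)."""
    nearest = sorted(train, key=lambda xy: abs(xy[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy 1-D "propensity" data: (feature, label), 1 = bought, 0 = did not buy.
train = [(1.0, 0), (1.5, 0), (2.0, 0), (6.0, 1), (6.5, 1), (7.0, 1)]
valid = [(1.2, 0), (6.8, 1), (2.2, 0), (5.9, 1)]

# The "30 different models" loop, kept tiny: try each k, keep the best by
# validation accuracy.
best_k, best_acc = None, -1.0
for k in (1, 3, 5):
    acc = sum(knn_predict(train, x, k) == y for x, y in valid) / len(valid)
    if acc > best_acc:
        best_k, best_acc = k, acc

print(best_k, best_acc)
```

A real propensity model would compare different algorithm families and parameter grids on far more data, but the shape of the search, candidates scored on held-out data and the best kept, is the same.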
You see the future as probabilities.
Sometimes we are right, sometimes we're wrong. But I think we have a good model, good material to work with. And of course this does not finish here; I think it's a continuous process. We are going to hand over and transition to marketing. I think it's very important to continue this work, improve the model, and see the benefits you're going to achieve and what the performance in the field is.
Great. Well, thank you, Mauro, for your time. I think the last observation that you've made is that the Digital Value Enablement process highlights the importance of data science as a team sport, right? By bringing together the best of what we have from a data science, data engineering, IT and subject matter expert perspective, we're able to become laser focused on what we're trying to achieve. It helps us figure out what data sources we're going to need and what models we need to build, but equally important, what data sources we don't need and what models we don't need. So, Mauro, thank you very much for your time. Those were great observations. You laid out some nerdy terms there about hyperparameters; I'll have to go look those up now. Thank you very much. And for folks listening to our podcast, I hope you subscribe. We're going to be doing a lot more of these podcasts, not only to continue to follow the internal Hitachi Vantara project, which we code-named Project Champagne, but we're also going to be bringing other customers through this process as well. So, if you want to learn more about DataOps, learn more about data science, or learn more about how to resurrect and save those data lakes, please subscribe to the Hitachi Vantara DataOps Advantage.
I hope you enjoyed this podcast, and you'll certainly want to come back for the next one, as we talk again to more organizations about how they're leveraging DataOps to drive value out of their data. If you want to learn more about Hitachi Vantara, follow us on Twitter @HitachiVantara, or if you want to follow me, follow me @Schmarzo. I'm the only one on Twitter. Thanks for your time. Until next time, cheers.
Stay connected with updates from Hitachi.
© Hitachi Vantara LLC 2020. All Rights Reserved.