October 03, 2022
Second in a Series: As compute-heavy applications spread to all corners of the enterprise, the need to consider such workloads before architecting the infrastructure intensifies.
As I wrote in my story last week, selecting strategic storage technologies can make a big difference in the performance of the workloads – and the veracity of the outcomes – by providing on-demand data availability and accessibility. By contrast, an underpowered infrastructure can result in less timely, less rigorous outcomes.
For example, such solutions must be able to meet the demands of technologies such as graphics processing units (GPUs) throughout the entire data lifecycle. This goes beyond high throughput: throughput alone can diminish results from the workloads that require very high IOPS at low latency.
By far, however, the heaviest-hitting workloads in today's enterprise are AI-based, and they require storage solutions that can meet the data demands of model training at scale.
AI workloads can be classified into five types: machine learning (ML), deep learning (DL), recommender systems, natural language processing (NLP), and computer vision. Understanding each of these AI types and the data it interacts with is key to determining which kinds of storage fit best.
Machine Learning: ML workloads typically consume a tremendous amount of data, especially at the onset of training, when the decision tree is first built. The more data used to establish the ML baseline, the more accurate the models. These workloads require a storage platform that scales to meet the demand of the base data while also growing as more data is collected and sampled. ML is usually more IOPS-dependent than throughput-dependent: the individual data items are small, so the ability to sustain high IO rates matters more than the raw throughput of the storage system.
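The IOPS-versus-throughput distinction comes down to request size. A minimal back-of-the-envelope sketch, using purely illustrative numbers, shows why a system can post impressive IOPS yet modest throughput when requests are small:

```python
# Back-of-the-envelope sketch (hypothetical numbers): the same storage
# system looks IOPS-bound or throughput-bound depending on request size.
def throughput_mb_s(iops: float, request_size_kb: float) -> float:
    """Sustained throughput implied by an IOPS rate at a given request size."""
    return iops * request_size_kb / 1024  # KB/s -> MB/s

# Small random reads, typical of ML sample/feature access:
small = throughput_mb_s(iops=500_000, request_size_kb=4)
# Large sequential reads, typical of bulk ingest:
large = throughput_mb_s(iops=5_000, request_size_kb=1024)

print(f"500k IOPS @ 4 KB -> {small:.0f} MB/s")  # ~1953 MB/s
print(f"5k IOPS @ 1 MB   -> {large:.0f} MB/s")  # 5000 MB/s
```

Half a million small random reads per second moves less data than five thousand large sequential ones, which is why an ML platform can be IO-starved on a system with excellent headline bandwidth.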
Deep Learning: Modelled on how the human mind works, DL focuses on representing the world through a hierarchy of concepts. However, DL uses neural networks instead of the traditional decision trees of ML. These neural networks require graphics processing unit (GPU)-powered systems to provide the computing power needed to train the AI models. GPU-based systems require a tremendous amount of data to be effective, and therefore a storage system that scales continuously. Much like ML, these platforms depend far more on high IO rates than on high throughput. They also require that the storage subsystem sustain the large-scale parallelisation needed to feed the vast number of GPUs commonly employed to build neural network models. The ability to service IO at large scale, to a large number of systems, is therefore very important.
Recommender Systems: Recommender systems are often found in retail environments where, as their name suggests, they recommend products and services to users or consumers based on analysis of their consumption patterns. Recommendations are derived by combining behavioural analysis across similar users with the history of the specific user.
These platforms are generally built on millions of data points from an individual user's history plus an accumulation of data points across similar users. Much like ML and DL, these systems are very sensitive to high-IO workloads. Recommender systems are unique, however, in that most decisions are made on real-time information. The large data sets must be delivered as quickly as possible so that the GPUs powering the neural networks can observe the latest interaction, infer a decision from real-time and historical data, and recommend the best product or service for the consumer's need.
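The blending of a user's own history with similar users' behaviour can be sketched as a toy user-based collaborative filter. This is a minimal illustration, not the architecture of any production system; all user names, items, and ratings are invented:

```python
# Minimal sketch of a user-based recommender: score items for a target user
# by weighting other users' histories by their similarity to the target.
# All data below is illustrative toy data.
from math import sqrt

# user -> {item: rating} interaction histories
history = {
    "alice": {"shoes": 5, "hat": 3, "scarf": 4},
    "bob":   {"shoes": 4, "hat": 2, "gloves": 5},
    "carol": {"hat": 5, "gloves": 4, "scarf": 2},
}

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity over the items two users have in common."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

def recommend(target: str, k: int = 1) -> list[str]:
    """Rank items the target hasn't seen, weighted by user similarity."""
    scores: dict[str, float] = {}
    for other, items in history.items():
        if other == target:
            continue
        sim = cosine(history[target], items)
        for item, rating in items.items():
            if item not in history[target]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("alice"))  # ['gloves']
```

Every call touches the full interaction history of the target and of similar users, which is why the storage behind a real system at millions-of-users scale sees such intense small-read IO.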
Natural Language Processing: NLP allows computers to understand, interpret, and even manipulate human language. NLP is used to provide real-time translation between people, but it can also infer emotion in statements made orally, electronically, or otherwise. It does so by extracting the meaning of text, including the nature and structure of the text. Some of the largest AI projects in the world are currently focused on improving communication between people. Like the previous platforms we've discussed, NLP systems are sensitive to IO-driven workloads and massive parallelisation. Depending on the use case, they are also extremely latency-sensitive: the platform must deliver enough IO while keeping latency at the microsecond level or lower. NLP models, however, are traditionally not that large in capacity.
Computer Vision: Computer vision is the ability of an AI model to understand the world around us, interpret that data into something actionable, and then wrap an outcome around it (e.g., autonomous driving). These systems are by far the most significant in terms of the volume of data processed.
For example, an electric vehicle may generate 2PB of computer vision data per year. This data must then be processed continuously as the accuracy of the AI model improves. Government regulations are also starting to require that the AI models, and the data used to train them, be kept for the service life of the vehicle. That turns into a monumental archive requirement. The workload generated by a computer vision platform is also tricky: it starts out as a high-throughput workload, ensuring that data ingested into the system is read by the GPU platforms as quickly as possible.
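To put that retention requirement in rough numbers, here is a hypothetical back-of-the-envelope calculation using the article's 2PB-per-vehicle-per-year figure. The fleet size and service life are assumptions chosen for illustration, and a real programme would archive only a curated subset of the raw data:

```python
# Rough archive sizing using the 2 PB/vehicle/year figure from the text.
# Fleet size and service life below are hypothetical assumptions.
PB_PER_VEHICLE_YEAR = 2
fleet_size = 10_000        # assumed vehicles contributing training data
service_life_years = 10    # assumed regulatory retention period

total_pb = PB_PER_VEHICLE_YEAR * fleet_size * service_life_years
print(f"Raw archive requirement: {total_pb:,} PB ({total_pb / 1000:,.0f} EB)")
# Even archiving a curated 1% of the raw data would still be 2 EB.
```

Even under conservative assumptions the numbers land in exabyte territory, which is what makes the archive tier of a computer vision platform such a distinct storage problem.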
However, two things must happen in a traditional GPU-driven computer vision workload. First, for video, each frame is converted to an image: the video is read into the system and each frame is analysed as a single image. Second, large-resolution images must be sliced into smaller, more manageable pieces so the GPU can fit them in memory and process them.
The result is a combination of throughput-sensitive workloads as the data is first read, mixed with high-IO workloads for the actual analysis of the data through the slicing process. Computer vision requires a platform that is highly performant for both throughput and IOPS, and able to scale, all while maintaining a high level of parallelisation.
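The slicing step described above can be sketched in a few lines. This is a pure-Python stand-in for what production pipelines do with NumPy or CUDA kernels, and the image and tile sizes are illustrative:

```python
# Sketch of the tiling step: slice a large image into fixed-size tiles so
# each tile fits in GPU memory. Pure-Python stand-in; sizes are illustrative.
def tile_image(image: list[list[int]], tile: int) -> list[list[list[int]]]:
    """Split a 2-D pixel grid into tile x tile sub-images, row-major order."""
    h, w = len(image), len(image[0])
    tiles = []
    for top in range(0, h, tile):
        for left in range(0, w, tile):
            tiles.append([row[left:left + tile]
                          for row in image[top:top + tile]])
    return tiles

# A 4x4 "image" split into 2x2 tiles -> 4 tiles
img = [[r * 4 + c for c in range(4)] for r in range(4)]
tiles = tile_image(img, 2)
print(len(tiles))   # 4
print(tiles[0])     # [[0, 1], [4, 5]]
```

Each source image fans out into many small tiles, which is exactly how a single high-throughput sequential read turns into a swarm of small, IOPS-heavy accesses downstream.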
AI workloads vary in both size and complexity. Some platforms require hundreds of petabytes of storage, while others fit in a modest 10TB. The performance these platforms demand, however, should not be underestimated. AI workloads generate a tremendous amount of IO and, in many cases, require systems that can support both high IOPS and high throughput. Traditionally this is spread across multiple systems (scale-out), which can create the familiar data sprawl problem as well as a complex storage management problem. That's why selecting the most strategic data storage based on your workloads is the new normal in enterprise management.
This is the second in a series on the impact of the changing workload on the data center. The next story will discuss an interesting scale-out option that connects file system storage to object-based storage. Stay tuned!