Storing everything, without storing anything

The value in Delivering, not Persisting

Over the last decade, a wave of new technologies has emerged that dramatically changed how IT infrastructure is managed and truly revolutionized the way production and development environments get deployed, scaled and maintained on top of that infrastructure. Despite this technological progress, and unlike the few Internet giants, enterprise IT organizations have been slow to take advantage of the changes.

Is this because enterprise IT professionals are inherently conservative, or is there a fundamental underlying reason for their slow pace of adoption?

In this article we argue that intelligent, effective and scalable data management across infrastructure boundaries is the key ingredient missing today from the toolkit of IT organizations. This problem has been solved in an ad-hoc fashion by the Googles and Facebooks of the world, but it has not been solved for the general case of enterprise IT.

Who brought the change?

Two types of companies disrupted traditional enterprise IT:

  1. Hyper-scale internet companies, like Google, which needed to grow very fast, cut down development-to-deployment times, and support many geographically distributed teams accessing the same infrastructure. They started designing systems from the ground up with a new mindset of commoditization and large-scale, distributed architectures.
  2. Large-scale public cloud providers, like Amazon (AWS), which proved that certain types of infrastructure can actually run off-premises, for use cases such as test & dev, cloud-native applications, and even disaster recovery or archival. They introduced a more elastic and volatile model, where infrastructure is provided as a service.

What enabled the change?

Two major forces enabled this change and transformed the modern data centers of hyper-scale internet companies and public cloud providers:

  1. Commodity Hardware
    Access to cheap, very high-density multicore hardware for Compute, which one can easily acquire off the shelf from any standard OEM or, in the case of hyper-scale companies, even build on their own. Access to 10 Gigabit Ethernet hardware, which also became a commodity for interconnecting compute servers, and finally to new-generation storage devices that catapulted available storage capacities and performance to whole new levels.
  2. Compute & Network Virtualization
    First came Hypervisor and Container technologies for virtualizing the compute part. Each of these two abstraction layers comes with different pros and cons, and they can even coexist. Both changed the way software gets deployed and run on top of physical infrastructure, and the way operations teams handle that infrastructure (see the sketch after this list). Then came network virtualization in the form of Software-Defined Networking (SDN), which brought network programmability into the game and still pushes hard for further commoditization of networking hardware.
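
To make the shift concrete: with a container runtime in place, deploying a workload becomes a single API call rather than a provisioning ticket. The following is a minimal sketch using the Docker Engine's Python SDK; it assumes a local Docker daemon and the docker package are available, and the image name, port mapping and container name are purely illustrative.

    # Minimal sketch: deploying a workload as an API call against a container runtime.
    # Assumes a local Docker daemon and the `docker` Python SDK (pip install docker);
    # the image, port mapping and container name below are illustrative only.
    import docker

    client = docker.from_env()                   # connect to the local Docker Engine

    container = client.containers.run(
        "nginx:alpine",                          # illustrative image
        detach=True,
        ports={"80/tcp": 8080},                  # map container port 80 to host port 8080
        name="demo-web",
    )
    print(f"Started container {container.short_id}")

    # Tearing the workload down is just as programmatic.
    container.stop()
    container.remove()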

The combination of commodity hardware with virtualization software on top allowed IT professionals to scale out their infrastructure far beyond what was even imaginable a decade ago. Deployment and maintenance times were cut down and flexibility increased significantly.

Why is enterprise IT still stuck?

Although the IT infrastructure world has changed and most of the technology is there, enterprise IT has had a hard time catching up with the new era of infrastructure management. IT professionals view the multitude of new technologies with skepticism. New procedures and tools that blur the boundaries between development and operations are not adopted rapidly by teams that traditionally have a limited software development culture.

As a result, traditional IT organizations are falling behind the hyper-scale internet companies in terms of scale and operational efficiency, and have a hard time embracing the new model of a fully commoditized infrastructure, on- or off-prem: an infrastructure where all the hard work is done by software and intelligent software orchestrators, and all procedures are fully automated and can scale up, down and out on demand. Enterprise IT is still hesitant, even when all the individual technological ingredients seem to be in place:

  • Commodity Compute HW
  • Commercial or Open Source Hypervisors
  • Commercial or Open Source Cloud Management Platforms
  • Nested Virtualization options
  • Commercial or Open Source SDN components
  • Close-to-commodity Networking HW
  • Public SaaS, PaaS, IaaS providers
  • Open Source Containers
  • Container Management Platforms and Orchestrators
  • Configuration Management Systems

And while the IT organization of an enterprise agonizes over managing its infrastructure as it scales, the Lines of Business next door flee to public clouds for their superior agility and operational benefits. They see the IT org as an impediment rather than a partner. This trend further aggravates the predicament of modern enterprises, which now need to deal with a hybrid IT model (on- and off-prem), creating a nightmare of control, data protection and governance.

For the IT organization left behind, there is always the fear of making the wrong decision when faced with too many choices. Is it just that, though, or is there some underlying limitation of the current state of the art that makes it unsuitable for running more traditional, business-critical workloads? It seems that IT departments are left holding a hot potato, the asset that needs the most care, yet none of the above technologies can help them truly manage it.

Data, Data, Data

Data is the most important asset for any enterprise, the most valuable, the most sensitive, the most sticky, the one that ties everything else together, and the only one still missing from the puzzle.

Data is what counts for enterprises at the end of the day. What matters most to them is what they produce at the end of their pipeline, after everything has been acquired, computed, processed, transformed and analyzed. And in many cases, data is valuable to the enterprise at all the intermediate stages of the processing and analytics pipeline, and in all its forms.

Once data needs to be stored persistently and then be made available when and where it is required, all sorts of problems arise:

  • Where to store the data in terms of location?
  • On what kind of hardware?
  • How to scale out?
  • How to back it up?
  • How to recover from disaster?
  • On primary, secondary or archival storage?
  • Use local, distributed or no caching?
  • Use de-duplication or compression or both?
  • Who accesses what data, when and from where?
  • How is the data going to be made available to the different kinds of applications that need it?

These problems are far from solved if someone chooses to deploy on-prem, where all kinds of silos and islands of storage exist due to legacy hardware, special-purpose appliances, incompatible interconnects, different storage administration domains, and diverse application requirements. Despite improvements in infrastructure management, data management is still an art, and a heavily manual one at that.

The problem gets even more complicated when one decides to put a cloud provider in the mix alongside their on-prem locations, resulting in a hybrid cloud setup. Moving data in and out of the cloud provider, accessing data both in the cloud and on-prem, and keeping the two in sync all add to the burden.
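
As a rough illustration of that burden, even the simplest one-way copy of an on-prem directory into a cloud provider's object store usually ends up as hand-rolled glue code. The sketch below uses boto3 against Amazon S3; the bucket name and paths are hypothetical, and a real deployment would also have to handle credentials, retries, partial failures, bandwidth and keeping the two sides consistent over time.

    # Minimal sketch: one-way copy of an on-prem directory into an S3 bucket.
    # Assumes AWS credentials are already configured; the bucket and paths are hypothetical.
    import pathlib
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-enterprise-data"              # hypothetical bucket
    LOCAL_ROOT = pathlib.Path("/data/analytics")    # hypothetical on-prem mount

    for path in LOCAL_ROOT.rglob("*"):
        if path.is_file():
            key = str(path.relative_to(LOCAL_ROOT))
            s3.upload_file(str(path), BUCKET, key)  # no resumption, no verification
            print(f"uploaded {key}")

    # This covers one direction, one provider and one point in time. Syncing back,
    # detecting changes and repeating this across providers is where the real
    # operational burden lies.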

Let us not even go into what happens when one wants to run on multiple public cloud providers, with or without on-prem infrastructure, or decides to move applications and data from one public provider to another. Entire teams need to mobilize. Dedicated migration processes, run by expert teams, are required and may take ages to complete, if they are feasible in the first place.

Unfortunately, the evolution of storage during the past decade has not followed the breakthroughs that Hypervisors and Containers brought to the Compute part, or that SDN brought to the Network part. Still less has it kept pace with how the whole new breed of applications gets architected, developed and deployed.

Only lately has the enterprise storage industry started to move towards software-only, scale-out solutions that run on commodity hardware rather than big, expensive, proprietary SAN/NAS appliances. At best, enterprises are presented with, and evangelized on, the “hyper-converged” (HCI) model, where this storage software runs on the same off-the-shelf hardware as the Compute part. This approach is definitely a step forward, eliminating the operational silos of traditional IT environments. However, it does not address the fundamental problems: enabling data to become mobile across heterogeneous, geographically dispersed and administratively distinct infrastructure, or to adapt dynamically to application needs on demand. HCI is not the technological breakthrough needed to solve the Data problems described above and to enable enterprises to ride the next wave of innovation, easily adopting the new technologies and tools that are so generously offered to them.

Delivering, not Persisting

The holy grail of the IT industry is making enterprise data instantly available and accessible irrespective of location and administrative domain, independently of where it is actually persisted, and independently of the application that needs to consume it at any point in time. Data that is instantly transformable and adaptable to dynamic requirements and service-level objectives.

In a perfect world, it doesn’t matter where the physical data actually resides, or what the underlying storage platform in each location is.

If data can just be delivered to the right place, at the right time and in the right form, then it becomes almost irrelevant where and how it is actually persisted. It is by definition truly mobile.

From a business perspective, this changes everything. Enterprises will never have to worry about how data is stored again. They will stop planning around that, and instead start thinking about where it is best consumed, given their needs and budget at any time. Data will stop being the resource that pulls everything else close to it, dictating an enterprise’s technology, compute platform, tools and decisions, and will stay completely out of the way, with its sole purpose being to produce the value it is meant for.
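
To make the idea slightly more concrete, the sketch below imagines what consuming data through such a delivery layer could look like from an application's point of view. Everything in it is hypothetical: the DataDelivery interface, the backends and the names are illustrative only and do not correspond to any existing product's API.

    # Hypothetical sketch of a location-agnostic data delivery layer. None of these
    # classes exist in any real product; they only illustrate the idea that an
    # application asks for data by name, not by storage location.
    import abc
    import pathlib

    class DataSource(abc.ABC):
        """Wherever the data happens to be persisted right now."""
        @abc.abstractmethod
        def fetch(self, name: str) -> bytes: ...

    class LocalSource(DataSource):
        def __init__(self, root: str):
            self.root = pathlib.Path(root)
        def fetch(self, name: str) -> bytes:
            return (self.root / name).read_bytes()

    class S3Source(DataSource):
        def __init__(self, bucket: str):
            import boto3
            self.bucket, self.s3 = bucket, boto3.client("s3")
        def fetch(self, name: str) -> bytes:
            return self.s3.get_object(Bucket=self.bucket, Key=name)["Body"].read()

    class DataDelivery:
        """Delivers a named dataset from whichever source currently holds it."""
        def __init__(self, sources):
            self.sources = sources
        def get(self, name: str) -> bytes:
            for source in self.sources:
                try:
                    return source.fetch(name)   # first location that can deliver wins
                except Exception:
                    continue
            raise FileNotFoundError(name)

    # The application neither knows nor cares where the dataset is persisted.
    delivery = DataDelivery([LocalSource("/data"), S3Source("example-bucket")])
    blob = delivery.get("sales/2023.parquet")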

Of course, this is a lot easier said than done. A true paradigm shift is needed here. One needs to stop thinking about Storage, to solve a Data problem.

Update: Check out our Rok Data Management Platform and Enterprise Kubeflow solutions! Or, learn more about MLOps and our MLOps platform today.
