What is Data as Code

It’s been almost a decade since Marc Andreessen famously declared that “software is eating the world.” We’ve been talking about this for years, but recently, we’ve seen it with our own eyes.

Software is everything and everywhere. In almost every market, companies of all kinds and sizes are investing in IT and developing innovative software solutions that create value for their customers, help them enter new markets, and ultimately grow revenue.

But what is software without data? Why have we only had a revolution in software development while our data processes and handling remain largely unchanged?

A new data revolution

The DevOps revolution empowered developers and drove a “shift left” focused on problem prevention, spawning a new generation of tools like GitHub, Jenkins, CircleCI, Gerrit, and Gradle that let end users ship software. What comparable tooling do we have for data? What improved processes do we have?

Welcome to the era of Data as Code (DaC). 

Data as Code is an approach that gives data teams the ability to process, manage, consume, and share data in the same way we do for code during software development. Data as Code empowers end users to take control of their data to accelerate iterations and increase collaboration.

It’s based on many of the same capabilities that agile software development methodologies rely on, including:

  • Programmatic management
  • Continuous integration
  • Continuous deployment
  • Version control
  • Packaging
  • Cloning and branching
  • Comparing and merging
  • Traceability and lineage
  • Mobility and access anywhere
  • End-user managed
  • Distributed collaboration

Today, data is still largely kept in silos. Some of those silos are monolithic and some are distributed, but they’re still silos. While we are getting better at connecting systems through APIs, we have added entire DataOps teams whose job is to manage the data pipeline alongside the data user. As much as we try to “jazz it up,” we are still doing ETL (extract, transform, and load). Even with new-age, distributed data lakes in the cloud, it still feels a lot like an aging high-end array gathering dust in the data center: you just can’t move data off of it.

Putting DevOps into data

In our polyglot, microservice, cloud-native world we are developing and deploying distributed applications in containers, each with its own data store.

Through DevOps, we’re closing the gap between development and operations with great success. The advent of the cloud, and in particular cloud-native computing with Docker and Kubernetes, has accelerated the shift of control from infrastructure administrators to application owners, making them the rulers of their own fate.

Now, when an application needs to be deployed, a DevOps engineer simply deploys it via automated pipelines. When they need storage, they programmatically request it from the cloud provider and attach it to their application. When they need to expose application access across the network, they create a service endpoint and configure an ingress gateway.

But what happens when a developer or application owner needs data? They ask the DataOps team or hosting application owner for the data. What happens when they need to share that data with colleagues or move it between clouds? They wait for DevOps engineers to help them. What happens when they want to synchronize their datasets across lifecycles? They wait for DevOps engineers to help them.

The problem is acutely illustrated in data science and machine learning. Data scientists need to quickly and efficiently train and deploy models, but are hampered by inefficient data handling processes and technologies. How can they implement a CI/CD process for retraining their models as they collect updated data?
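To make the retraining problem concrete, here is a minimal sketch of what an automated retrain-on-new-data step might look like. Everything here is illustrative: the `train` and `evaluate` functions, the score threshold, and the toy “model” are stand-ins invented for this example, not part of any specific product or pipeline.

```python
# Illustrative CI/CD-style retraining trigger (all names hypothetical).
# When a freshly collected batch arrives, we score the current model on
# it; if quality has degraded, we fold the batch in and retrain.

def train(data):
    # Toy stand-in "model": predict the mean of the training targets.
    targets = [y for _, y in data]
    mean = sum(targets) / len(targets)
    return lambda x: mean

def evaluate(model, batch):
    # Mean absolute error, folded into a crude score in (0, 1].
    err = sum(abs(model(x) - y) for x, y in batch) / len(batch)
    return 1.0 / (1.0 + err)

THRESHOLD = 0.5
dataset = [(x, 2.0 * x) for x in range(10)]        # historical data
model = train(dataset)

new_batch = [(x, 2.0 * x) for x in range(10, 15)]  # newly collected data
if evaluate(model, new_batch) < THRESHOLD:
    dataset += new_batch      # merge the new data into the training set
    model = train(dataset)    # retrain, as an automated pipeline step would
```

In a real pipeline the merge and retrain steps would operate on versioned datasets rather than in-memory lists, but the control flow is the same.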

This is where we see that while software is indeed eating the world, data is the superfood, and no one has been watering the garden.

We’re constantly told that to be successful and outperform our competitors we must be data driven. That’s nice, but notice how rarely you see examples of HOW to be data driven.

Arrikto enables Data as Code

Arrikto is doing to data what DevOps did to software development. Remember that list of DevOps foundational capabilities we outlined above? We’re making this possible for data today.

Programmatic management

This is the core underlying capability that truly enables Data as Code: performing operations on data as though it were code, accessing and modifying it, moving and transforming it, all through automation and repeatability.

Continuous Integration and Continuous Deployment

Just as with code, data is relied upon by multiple developers, users, and applications. We collect, transform, consume, and update data constantly. As we integrate data pipelines with our applications and software development, we need a similar CI/CD model to bring these branches together with an automated, process-driven method.

Version Control

As we collect, transform, consume, and update our data we need to keep track of the multiple iterations and copies to ensure authenticity, enable collaboration, and guarantee reproducibility.
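One simple way to picture versioned data is content addressing, the same idea Git uses for code: a snapshot’s identity is a hash of its contents, so identical data always yields the same version ID and any change is immediately detectable. The sketch below is purely illustrative (the `snapshot` function and in-memory `history` store are invented for this example):

```python
# Minimal sketch of content-addressed data versioning (illustrative only).
import hashlib
import json

def snapshot(data):
    """Return a version ID derived solely from the data's contents."""
    canonical = json.dumps(data, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:12]

history = {}                                   # version ID -> data
records = [{"id": 1, "bp": 120}, {"id": 2, "bp": 135}]
v1 = snapshot(records)
history[v1] = list(records)

records.append({"id": 3, "bp": 128})           # the dataset evolves
v2 = snapshot(records)
history[v2] = list(records)

assert v1 != v2                                # new contents, new version
assert snapshot(history[v1]) == v1             # any copy can be re-verified
```

Because the ID is derived from the data itself, authenticity and reproducibility checks reduce to recomputing a hash.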

Packaging

Data doesn’t just live in a single location on its own. It needs mobility and portability across systems. Just as Docker containers provided a simple standardized format for packaging up software code and libraries, data needs a similar package format.
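As a rough analogy to a container image, a “data package” can be sketched as an archive bundling the data files with a manifest of checksums, so the package is portable and its integrity verifiable on arrival. The layout below (a gzipped tarball with a `manifest.json`) is invented for this sketch, not a real packaging standard:

```python
# Illustrative "data package": files plus a checksum manifest, bundled
# into one portable archive. The format here is made up for the example.
import hashlib
import io
import json
import os
import tarfile
import tempfile

def build_package(files, out_path):
    """Bundle {name: bytes} into a tar.gz with a checksum manifest."""
    manifest = {name: hashlib.sha256(blob).hexdigest()
                for name, blob in files.items()}
    with tarfile.open(out_path, "w:gz") as tar:
        for name, blob in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(blob)
            tar.addfile(info, io.BytesIO(blob))
        m = json.dumps(manifest).encode()
        info = tarfile.TarInfo("manifest.json")
        info.size = len(m)
        tar.addfile(info, io.BytesIO(m))
    return manifest

tmp = tempfile.mkdtemp()
pkg = os.path.join(tmp, "dataset.tar.gz")
manifest = build_package(
    {"train.csv": b"x,y\n1,2\n", "test.csv": b"x,y\n3,6\n"}, pkg)
```

A receiver can unpack the archive, recompute each file’s hash, and compare against the manifest before trusting the data.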

Cloning and Branching

We are familiar with, and rely on, the concept of multiple copies and branches of software code for the purposes of collaboration, innovation, and revisions. Data has similar requirements as we scale collaboration among peers and applications. This is especially needed as we adopt a CI/CD process to enable data enhancement.

Comparing and Merging

As systems evolve and we collect ever more data, we need a simple mechanism to enable merging of data across versions and branches. Whether we are debugging development vs. production issues, deploying updated applications and data packages, or enhancing data with newly updated segments, we need an automated, repeatable, and intelligent process for diff and consolidation.
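A toy sketch of what diffing and merging dataset branches might look like, keyed on a record ID. The `diff` and `merge` functions and the branch-wins merge policy are illustrative choices for this example, not a description of any particular tool:

```python
# Sketch of diffing and merging two dataset "branches" by record ID.
# Merge policy (branch additions and updates win) is illustrative.

def diff(base, branch):
    """Records added or changed in `branch` relative to `base`."""
    base_ix = {r["id"]: r for r in base}
    return [r for r in branch if base_ix.get(r["id"]) != r]

def merge(base, branch):
    """Apply the branch's additions and updates onto the base."""
    merged = {r["id"]: r for r in base}
    for r in diff(base, branch):
        merged[r["id"]] = r
    return sorted(merged.values(), key=lambda r: r["id"])

main = [{"id": 1, "v": 10}, {"id": 2, "v": 20}]
feature = [{"id": 1, "v": 10}, {"id": 2, "v": 25}, {"id": 3, "v": 30}]

changes = diff(main, feature)      # id 2 was updated, id 3 was added
combined = merge(main, feature)
```

A production system would also have to detect conflicting edits to the same record, but the diff-then-consolidate shape is the same.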

Traceability and Lineage

Data may evolve independently of code, but their relationship is still one of symbiosis. Provenance is required to ensure accuracy, consistency, and reproducibility of data and code. This is especially true within regulated environments where there are often frequent audits.
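Lineage can be pictured as a chain of provenance records: every derived dataset points back to its parent version and names the transformation that produced it, so an auditor can walk the chain back to the raw source. The structure below is a deliberately simplified illustration:

```python
# Illustrative lineage tracking: each derived dataset records its parent
# version and the transform that produced it (structure invented here).
import hashlib
import json

def version_id(data):
    canonical = json.dumps(data, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:12]

lineage = {}  # version ID -> provenance metadata

def derive(data, parent_id, transform_name, fn):
    out = fn(data)
    vid = version_id(out)
    lineage[vid] = {"parent": parent_id, "transform": transform_name}
    return vid, out

raw = [1, 5, -2, 7]
raw_id = version_id(raw)
lineage[raw_id] = {"parent": None, "transform": "ingest"}

clean_id, clean = derive(raw, raw_id, "drop_negatives",
                         lambda xs: [x for x in xs if x >= 0])

# Walk the chain back to the source, as an audit would.
chain, cur = [], clean_id
while cur is not None:
    chain.append(lineage[cur]["transform"])
    cur = lineage[cur]["parent"]
```

After the walk, `chain` lists every transformation from the derived dataset back to ingest, which is exactly the trail a regulated-environment audit needs.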

Mobility and Access Anywhere

As our world becomes smaller and our reach increases, we deploy further to the edge, requiring data to be mobile, portable, and nimble. Fast and simple data movement must overcome data gravity so updates can be shared quickly and easily, enabling deployment to any location and device.

End-User Managed

Just as DevOps empowered the developer and the cloud empowered the application owner, it is now time for application users to take control of their data. No longer reliant on administrators to facilitate access and movement, users can retrieve, access, and control their data themselves.

Distributed Collaboration

Applications, businesses, users, and teams are rarely all in one location. Increasingly we are coordinating geographically dispersed teams and partnering with other organizations. Enabling secure collaboration between these groups accelerates development and innovation.

Data Democratized

We’re democratizing data management even further up the application stack with the Rok data management platform. It’s great that DevOps engineers and site reliability engineers (SREs) no longer have to rely on request-and-wait, ITIL-style workflows with infrastructure administrators, but what would be even better is the actual user of the data taking control.

In the data science world, this is exactly what we’re doing. We’ve started with Kubeflow on Kubernetes as a platform for data science and machine learning pipelines, and we’re extending it to the entire Kubernetes data ecosystem.

Why Kubernetes? Because it’s the future of the application control plane.

Why Arrikto? Because we’re the future of the data control plane.

Giving data scientists and machine learning engineers the ability to manage data across any cloud, collaborate on branches of versioned datasets, and continuously retrain their models by merging differential sets as they gather more inputs shifts the data equation left, just as DevOps did for software development.

Enabling data scientists to rapidly prototype machine learning models, test them, and then deploy at scale is just the beginning.

These principles apply equally to all data-rich applications, and Arrikto is working to make this a reality.

If you would like to understand how Data as Code can help you, please reach out. We’d love to show you how you can benefit.

In part 2 we continue to explore how Data as Code drives innovation and highlight an example machine learning powered healthcare application monitoring heart attack patients.

MiniKF is the simplest way to get started with Kubeflow and Rok on any platform

Turbocharge your team’s Kubeflow and MLOps skills with a free workshop