Addressing the Technical Debt of MLOps: Part 2 – From Platform to Pipeline

April 19, 2022

In Part 1 of this series, we introduced the concept of Technical Debt as it pertains to MLOps and acknowledged that the first step in reducing this debt is choosing to move to a platform for your MLOps work. The Kubeflow platform is the Data Science obsessed MLOps solution for Enterprises, and it solves many of the day-one challenges that you will face on this journey of addressing technical debt. Choosing Kubeflow means that you are choosing to:

Reduce day-to-day friction / toil for Data Scientists and MLOps practitioners.
Accelerate time to production for trained and tested models.
Abstract away the challenging parts of Kubernetes.
Improve environment, image and package maintenance and standardization across users.

In practice this approach is summarized as:

Data Exploration and Feature Preparation: Data Scientists begin with data transformation and experimentation in IDEs, such as JupyterLab Notebooks, to review model quality metrics and identify models for further development or training.
Model Training: Data Scientists train their ideal models in development, perhaps using HP Tuning or AutoML, and prepare the model for production.
Model Serving: MLOps takes the model creation pipeline from the Data Scientist and deploys the creation and serving of the model in production.
Performance Monitoring: The system monitors the model, which is continuously tuned, retrained and redeployed.

This approach, once adopted, maximizes both the velocity of model development and the resilience and reliability of the overall system. The reduction in friction / toil helps organizations cut the hours committed to work that is manual, repetitive and of no intrinsic value to the organization. This is all made possible through the use of Kubeflow Pipelines, the focus of the rest of our discussion.

What Are Kubeflow Pipelines?

Model development life cycles are often addressed by multiple specialized services, for example: portable IDEs with Jupyter, training operators such as TensorFlow training, hyperparameter tuning with Katib, feature stores like Feast, and serving with KServe. Traditionally, MLOps teams and their wider organizations had to manage both these technology stacks and the communities that used them. This led to additional friction, complexity, and cost when developing ML architectures, thereby increasing the technical debt in the enterprise.

Kubeflow fuses these traditionally separate components and communities into a unified platform, with a unified vision of how an organization should support and leverage them. This is the marriage of the once separate machine learning and operations universes that we now refer to as MLOps. Kubeflow harnesses the API consolidation and orchestration powers of Kubernetes to host these service components in pods. Kubeflow also equips MLOps teams with solutions like KServe to provide “serverless inferencing on Kubernetes and performant, high abstraction interfaces for common machine learning (ML) frameworks like TensorFlow, XGBoost, scikit-learn, PyTorch, and ONNX to solve production model serving use cases.”

Kubeflow Pipelines take advantage of Kubeflow’s ecosystem. They also scale easily: regardless of pipeline depth, Kubeflow can allocate more CPUs or GPUs as necessary from the hosting Kubernetes cluster’s infrastructure. A Kubeflow Pipeline is defined by multiple pipeline components, referred to as steps, which are self-contained sets of user code packaged with a container image. Each step in the pipeline runs decoupled in a pod that has the code and references to the outputs of previous steps. The Kubeflow Pipelines that Data Scientists create are both portable and composable, allowing for easy migration from development to production.

Kubeflow facilitates greater partnership between the Data Scientists and MLOps teams that are collectively responsible for developing and serving machine learning models. By giving teams the ability to collaborate more efficiently, Kubeflow reduces new technical debt and pays off the accumulated debt that builds up in any technical system of scale. The freedom to focus on the work and the technology that supports it reduces the cognitive load that teams typically allocate to managing K8s. Teams are now empowered to focus on solving more complex business and data science problems, further unlocking the potential of the team and reducing future technical debt.
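
To make the step model described above more concrete, here is a minimal sketch in plain Python. It deliberately does not use the Kubeflow Pipelines SDK: the Step class, the run_pipeline helper, the toy functions and the image names are all hypothetical, and they only illustrate how self-contained steps can consume the outputs of previous steps by name. In a real pipeline each step would run from its own container image in a pod rather than inside one Python process.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Step:
    """Hypothetical stand-in for a pipeline component: user code plus an image tag."""
    name: str
    func: Callable[..., Any]
    image: str  # illustrative only; nothing is pulled or run in a container here
    inputs: List[str] = field(default_factory=list)  # names of upstream steps


def run_pipeline(steps: List[Step]) -> Dict[str, Any]:
    """Run steps in list order, passing each one the outputs of the steps it names."""
    outputs: Dict[str, Any] = {}
    for step in steps:
        # A KeyError here means a step was listed before one of its upstream steps.
        args = [outputs[name] for name in step.inputs]
        outputs[step.name] = step.func(*args)
    return outputs


def load_data() -> List[float]:
    return [1.0, 2.0, 3.0, 4.0]


def train(data: List[float]) -> float:
    # Toy "model": the mean of the data.
    return sum(data) / len(data)


def evaluate(data: List[float], model: float) -> float:
    # Mean absolute error of always predicting the model value.
    return sum(abs(x - model) for x in data) / len(data)


pipeline = [
    Step("load_data", load_data, image="example/load:0.1"),
    Step("train", train, image="example/train:0.1", inputs=["load_data"]),
    Step("evaluate", evaluate, image="example/eval:0.1",
         inputs=["load_data", "train"]),
]

results = run_pipeline(pipeline)
print(results["train"], results["evaluate"])
```

Because each step only knows the names of the outputs it depends on, a step can be swapped or reused in another pipeline without touching the others, which is the composability property discussed above.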

Kubeflow, Kubeflow Pipelines and CI / CD

Kubeflow Pipelines are a component of a larger ML solution built on Kubeflow, which is characterized as:

An abstracted methodology to interface with specialized services to get Model Development Lifecycle work done efficiently in a decoupled manner.
The ability to continuously build, train and improve models to ensure stability and accuracy during production.
An environment that continuously deploys both the continuous training pipeline and the model, and monitors the quality of the model in production.
A unified vision and automated Continuous Training process which cycles through the three points above to improve the model in response to ever-evolving business needs.
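
The Continuous Training cycle in the list above can be sketched as a small loop. This is only an illustration under stated assumptions, not how Kubeflow implements it: the toy model (a mean), the error metric, the max_error threshold and the function names are all hypothetical, and "deploying" is reduced to recording an event.

```python
from typing import List, Tuple


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def train(batch: List[float]) -> float:
    # Toy "model": predict the mean of the training batch.
    return mean(batch)


def error(model: float, batch: List[float]) -> float:
    # Mean absolute error of the constant prediction on new data.
    return mean([abs(x - model) for x in batch])


def continuous_training(
    batches: List[List[float]], max_error: float
) -> Tuple[float, List[Tuple[int, str]]]:
    """Train on the first batch, then monitor each new batch and retrain on drift."""
    model = train(batches[0])
    history = [(0, "deployed")]
    for i, batch in enumerate(batches[1:], start=1):
        if error(model, batch) > max_error:
            # Quality dropped below the (assumed) threshold: retrain and redeploy.
            model = train(batch)
            history.append((i, "retrained"))
        else:
            history.append((i, "kept"))
    return model, history


final_model, history = continuous_training(
    batches=[[1.0, 2.0, 3.0], [2.0, 2.0, 2.0], [10.0, 11.0, 12.0]],
    max_error=1.0,
)
print(final_model, history)
```

In this sketch the second batch is still well described by the deployed model, while the third has drifted far enough to trigger retraining.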

This portable, composable and resilient approach ensures consistency between development and production, which drives the velocity of iterations. Velocity can come at the cost of stability, so the harmony between the two is critical. That harmony is the byproduct of applying a CI / CD approach to a continuous training (CT) pipeline, and it is what reduces future technical debt!

Kubeflow Pipelines Can Be Complex!

We have established that Kubeflow and Kubeflow Pipelines will reduce future technical debt and the compounding nature of the associated interest. Now is the time to dive into the problem of addressing the existing technical debt of your system. The opportunity presents itself during the translation of existing workloads into Kubeflow Pipelines. Existing math needs to be converted to code, which is then turned into repeatable functions. These functions are then associated with steps, which are in turn represented as a pipeline that can be repeatedly executed. Additionally, the data that is fed into an algorithm will affect the model, so managing the relationship between data and code is imperative to successful MLOps deployments.

There are manual ways and more programmatic ways to approach this problem. We will first present the traditional, more manual approach because we want to make sure you are aware of the possible challenges. Then we will introduce the more automated approach so you are well equipped to facilitate a successful transition to Kubeflow Pipelines. Stay tuned for our next discussion, “Addressing the Technical Debt of MLOps – Part 3 – Pipelines the Hard Way”.
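
As a small illustration of the translation path described in this section (math, then code, then a repeatable function, then a step), the sketch below uses min-max scaling as a stand-in for the existing math. It also records a fingerprint of the code version and input data alongside the output, as one possible way to track the relationship between data and code. The function names, the code_version string and the fingerprint scheme are assumptions made for this example; they are not part of Kubeflow.

```python
import hashlib
import json
from typing import Any, Dict, List


# Math: min-max scaling, x' = (x - min) / (max - min)
# Code: the same formula as a repeatable function.
def min_max_scale(values: List[float]) -> List[float]:
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Step: wraps the function and records which code and data produced the output.
def scale_step(values: List[float], code_version: str) -> Dict[str, Any]:
    payload = json.dumps({"code": code_version, "data": values}, sort_keys=True)
    fingerprint = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"output": min_max_scale(values), "fingerprint": fingerprint}


run_a = scale_step([2.0, 4.0, 6.0], code_version="v1")
run_b = scale_step([2.0, 4.0, 6.0], code_version="v1")  # same code, same data
run_c = scale_step([2.0, 4.0, 8.0], code_version="v1")  # same code, new data
run_d = scale_step([2.0, 4.0, 6.0], code_version="v2")  # new code, same data

print(run_a["output"])
print(run_a["fingerprint"] == run_b["fingerprint"])
print(run_a["fingerprint"] != run_c["fingerprint"])
```

Identical code and data yield the same fingerprint, so a rerun is recognizable as a repeat, while a change to either one produces a new fingerprint and therefore a distinguishable result.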

What’s Coming In This Series

This series will continue to dive deeper into how to manage your MLOps technical debt and will feature the following additional articles over the next few weeks:

Addressing the Technical Debt of MLOps – Part 3 – Pipeline the Hard Way
Addressing the Technical Debt of MLOps – Part 4 – Pipeline the Easy Way
Addressing the Technical Debt of MLOps – Part 5 – Beyond The Pipeline

Additionally, we will follow up each blog post with a podcast from Chase Christensen (Solutions Architect @ Arrikto) and Ben Reutter (Multimedia Lead). During these podcasts they will have a freewheeling discussion about the topics presented in the prior posts. We hope to provide you with further technical deep dives, laughs, critical thoughts and more!

What You Should Do Next

Want to learn more about Kubeflow? Check out our free events and register! https://www.arrikto.com/kubeflow-mlops-events/
Advance your Kubeflow education – head over to our Academy portal, explore some courses and get hands-on experience: https://academy.arrikto.com/
Connect with us via LinkedIn, Twitter or the community Slack to keep up to date with our progress: https://www.linkedin.com/company/arrikto
Check out our fundamentals sessions and associated webinars as well as our meetups: https://www.arrikto.com/events/

About Arrikto

At Arrikto, we are active members of the Kubeflow community, having made significant contributions to the latest open source Kubeflow 1.4 and 1.5 releases.

Our projects/products include:

MiniKF, a production-ready Kubeflow deployment that installs in minutes and understands how to downscale your infrastructure.
Enterprise Kubeflow (EKF) is a complete machine learning operations platform that simplifies, accelerates, and secures the machine learning model development life cycle with Kubeflow.
Rok is a data management solution for Kubeflow. Rok’s built-in Kubeflow integration simplifies operations and increases performance, while enabling data versioning, packaging, and secure sharing across teams and cloud boundaries.
Kale, a workflow tool for Kubeflow, which orchestrates all of Kubeflow’s components seamlessly.