Addressing the Technical Debt of MLOps: Part 1 – From Nothing to Platform

Machine learning is becoming ubiquitous across companies of every size, from startups to global enterprises. Many teams have Data Scientists and ML researchers who build state-of-the-art models, yet their process for building and deploying those models is entirely manual. Google has published an important article that outlines six stages of a fully automated MLOps deployment:

  1. Development and experimentation: You iteratively try out new ML algorithms and new modeling where the experiment steps are orchestrated. The output of this stage is the source code of the ML pipeline steps that are then pushed to a source repository.
  2. Pipeline continuous integration: You build source code and run various tests. The outputs of this stage are pipeline components (packages, executables, and artifacts) to be deployed in a later stage.
  3. Pipeline continuous delivery: You deploy the artifacts produced by the CI stage to the target environment. The output of this stage is a deployed pipeline with the new implementation of the model.
  4. Automated triggering: The pipeline is automatically executed in production based on a schedule or in response to a trigger. The output of this stage is a trained model that is pushed to the model registry.
  5. Model continuous delivery: You serve the trained model as a prediction service for the predictions. The output of this stage is a deployed model prediction service.
  6. Monitoring: You collect statistics on the model performance based on live data. The output of this stage is a trigger to execute the pipeline or to execute a new experiment cycle.

[excerpt copied from:]
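
To make the hand-offs between the six stages above more concrete, here is a minimal sketch in plain Python. It is not taken from Google's article or from any MLOps library: every name in it (Registry, develop, pipeline_ci, pipeline_cd, automated_trigger, model_cd, monitor) is hypothetical, the "model" is just the mean of its training data, and the registry is an in-memory list. The only point is to show how the output of each stage becomes the input of the next, and how monitoring can close the loop by triggering a new pipeline run.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Registry:
    """Hypothetical in-memory stand-in for a model registry."""
    models: List[dict] = field(default_factory=list)


def develop() -> str:
    # Stage 1: experimentation yields pipeline source code (here, only a label).
    return "pipeline-source-v1"


def pipeline_ci(source: str) -> dict:
    # Stage 2: build and test the source; the output is deployable components.
    return {"source": source, "tests_passed": True}


def pipeline_cd(components: dict) -> Callable[[List[float]], dict]:
    # Stage 3: deploy the components; the output is a runnable training pipeline.
    if not components["tests_passed"]:
        raise RuntimeError("refusing to deploy components that failed CI")

    def train(data: List[float]) -> dict:
        # Toy "model" (an assumption for illustration): the mean of the data.
        return {"mean": sum(data) / len(data), "built_from": components["source"]}

    return train


def automated_trigger(train: Callable[[List[float]], dict],
                      data: List[float], registry: Registry) -> dict:
    # Stage 4: run the pipeline and push the trained model to the registry.
    model = train(data)
    registry.models.append(model)
    return model


def model_cd(model: dict) -> Callable[[], float]:
    # Stage 5: expose the trained model as a (toy) prediction service.
    return lambda: model["mean"]


def monitor(predict: Callable[[], float], live_data: List[float],
            tolerance: float) -> bool:
    # Stage 6: compare predictions with live data; True means "retrain".
    live_mean = sum(live_data) / len(live_data)
    return abs(predict() - live_mean) > tolerance


registry = Registry()
train = pipeline_cd(pipeline_ci(develop()))
model = automated_trigger(train, [1.0, 2.0, 3.0], registry)
predict = model_cd(model)

# Live data has drifted away from the training data, so monitoring fires
# and the pipeline is executed again on the newer data.
live_data = [7.0, 8.0, 9.0]
if monitor(predict, live_data, tolerance=1.0):
    model = automated_trigger(train, live_data, registry)
    predict = model_cd(model)
```

In a manual process, each of these hand-offs is a person copying files or running a notebook by hand. The design choice worth noticing is that every stage returns an explicit artifact, which is what makes the chain something a platform can automate.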

If you are reading this and the above sounds wonderful but unattainable, then you are in luck. This blog post and the subsequent articles in this series are designed to help you think critically about how to improve your MLOps functions and deployments. However, before we can dive into how to address these issues, we have to acknowledge the challenges we face. Fortunately for you, the problem solver, we can classify the pain points under one title: Technical Debt.


The Technical Debt Problem

Technical Debt in this context can be defined succinctly as “the price of innovation and speed.” A more elaborate definition might be “the cost of the additional rework needed to migrate from an existing legacy system, often manual, to a more modern platform that substantially improves on the existing solution.” For many enterprises with MLOps solutions, the technical debt results from either having developed a manual process or having decided earlier not to invest in an automation platform. Unfortunately, addressing technical debt is not as simple as installing a new platform or writing new code. The exercise comes with a monetary cost and can have a cascading impact on other systems or processes. Ignoring technical debt does not make it go away. On the contrary, technical debt accrues compounding interest: the longer you wait, the greater the debt and therefore the investment required to reduce it. Leaders and decision makers have to answer two questions at the same time: “When do we tackle our technical debt?” and “How do we minimize future technical debt?” Fortunately, in the case of MLOps there is an open source solution that can address the aforementioned challenges and also reduce future debt accumulation.


The Platform Solution

Kubeflow is a specialized ML platform that is built for Kubernetes and runs in Kubernetes clusters as a collection of Pods and Operators. It harnesses Kubernetes to orchestrate containerized environments, allowing enterprises to optimize the path from development to production. Kubeflow provides container images for running ML workloads and IDEs such as JupyterLab. It is a platform built around the Data Scientist: it abstracts away the Kubernetes (K8s) complexity so that Data Scientists can focus on data science, which improves the Model Development Lifecycle. Using Kubeflow reduces the friction that Data Scientists and MLOps professionals face day to day, allowing for greater collaboration and a shorter model time to production. Migrating to Kubeflow is a way of managing and paying down your technical debt: it introduces long-term stability, increases reusability, and decreases future maintenance. This first step is the most crucial because you are choosing to invest in your future technology stack. Once you’ve made this choice, you are on an exciting journey along with the open source community and the large number of enterprises that have selected Kubeflow as their MLOps platform. Of course there is much more to discuss, and we will do so in subsequent posts in this series. Stay tuned for our next discussion, “Addressing the Technical Debt of MLOps – Part 2 – From Platform to Pipeline”.


What’s Coming In This Series

This series will continue to dive deeper into how to manage your MLOps technical debt and will feature the following additional articles over the next few weeks:

  • Addressing the Technical Debt of MLOps – Part 2 – From Platform to Pipeline
  • Addressing the Technical Debt of MLOps – Part 3 – Pipeline the Hard Way
  • Addressing the Technical Debt of MLOps – Part 4 – Pipeline the Easy Way
  • Addressing the Technical Debt of MLOps – Part 5 – Beyond The Pipeline

Additionally, we will follow up each blog post with a podcast from Chase Christensen (Solutions Architect @ Arrikto) and Ben Reutter (Multimedia Lead). During these podcasts they will have a freewheeling discussion about the topics presented in the prior posts. We hope to provide you with further technical deep dives, laughs, critical thoughts and more!


About Arrikto

At Arrikto, we are active members of the Kubeflow community, having made significant contributions to the latest open source Kubeflow 1.4 and 1.5 releases.

Our projects/products include:

  • MiniKF, a production-ready, single-node Kubeflow deployment that installs in minutes and understands how to downscale your infrastructure.
  • Enterprise Kubeflow (EKF), a complete machine learning operations platform that simplifies, accelerates, and secures the machine learning model development life cycle with Kubeflow.
  • Rok, a data management solution for Kubeflow. Rok’s built-in Kubeflow integration simplifies operations and increases performance, while enabling data versioning, packaging, and secure sharing across teams and cloud boundaries.
  • Kale, a workflow tool for Kubeflow that orchestrates all of Kubeflow’s components seamlessly.


Free Technical Workshop

Turbocharge your team’s Kubeflow and MLOps skills with a free workshop.