Build An End-to-End ML Workflow: From Notebook to HP Tuning to Kubeflow Pipelines with Kale

Machine Learning is quickly being adopted by Data Scientists and Engineers as a great way to solve complex problems. Containers and Kubernetes are a natural fit for Machine Learning in that you can quickly and easily experiment, train and deploy models that scale as needed.

However, building all the tools to do this is still a daunting task – especially for the Data Scientist who isn’t an infrastructure expert. This is where Kubeflow comes in as a way to package and deploy all of your Machine Learning libraries, frameworks, and tools.

This is part three in our series. Previously in Part 1 we showed you how to get started with Kubeflow in minutes and run an ML pipeline using MiniKF. In Part 2 we presented an innovative way of simplifying your complex multi-step ML workflows, so that you save time and iterate faster with your teammates. Starting from a notebook, you can run your Python code as a Kubeflow Pipeline with the click of a button using MiniKF and Kale.

There is still more work to be done, however, and that is where this article comes in.

Today we focus on bringing together popular Kubeflow components to deploy a complete and easy-to-use workflow based on Jupyter Notebooks, MiniKF, Kale, and Katib.

You are probably already familiar with:

  • Jupyter Notebooks, a popular IDE for data scientists
  • Katib, a tool for hyperparameter (HP) optimization
  • Kubeflow Pipelines, a framework for building and deploying ML pipelines based on containers

In this tutorial, we will use Kale to unify the workflow across the above components and present a seamless process for creating ML pipelines for HP tuning, starting from your Jupyter Notebook. We will use Kale to convert a Jupyter Notebook to a Kubeflow Pipeline without any modification to the original Python code. Pipeline definition and deployment are handled through an intuitive GUI, provided by Kale’s JupyterLab extension.

As a next step, Kale will scale up the resulting pipeline to multiple parallel runs for hyperparameter tuning using Kubeflow Katib. Kale also integrates with Arrikto’s Rok to efficiently make the data available across Kubeflow components in a versioned way, and snapshot every step of each pipeline, making all pipelines completely reproducible.

This tutorial was presented as a workshop by Google & Arrikto during KubeCon Amsterdam 2020. Here are the Codelab, the slides, and the video of the workshop.

What you’ll build

In this tutorial, you will build a complex data science pipeline with hyperparameter tuning on Kubeflow Pipelines, without using any CLI commands or SDKs. You don’t need to have any Kubernetes or Docker knowledge. Upon completion, your infrastructure will contain:

  • A MiniKF (Mini Kubeflow) VM that automatically installs:
      • Kubernetes (using Minikube)
      • Kubeflow
      • Kale, a workflow tool for Kubeflow (GitHub)
      • Arrikto Rok for data versioning, data sharing and complete reproducibility

What you’ll learn

  • How to install Kubeflow with MiniKF
  • How to convert your Jupyter Notebooks to Kubeflow Pipelines without using any CLI commands or SDKs, but just an intuitive GUI
  • How to run Kubeflow Pipelines with hyperparameter tuning from inside a notebook with the click of a button
  • How to automatically version your data in a notebook and in every pipeline step

Install MiniKF

To install MiniKF on GCP read the step-by-step guide or follow the steps below:

  • Go to the MiniKF page on Google Cloud Marketplace.
  • Click the Launch on Compute Engine button.
  • In the Configure & Deploy window, choose a name, a GCP zone, a machine type, a boot disk, and an extra disk for your deployment. Then click Deploy.
  • When the VM is up, follow the suggested next steps under the Getting started with MiniKF section. It is important to follow these steps and make sure that you can log in to MiniKF successfully, before moving to the next step.

Note that you can always install MiniKF on your laptop/desktop via Vagrant.

Run a pipeline from inside your notebook

During this section, we will run the Dog Breed Identification example, a project in the Udacity AI Nanodegree. Given an image of a dog, the final model will provide an estimate of the dog’s breed.

Create a notebook server in your Kubeflow cluster

Navigate to the Notebooks link on the Kubeflow central dashboard.

Click on New Server.

Specify a name for your notebook server.

Make sure you have selected the following Docker image from the list (Note that the image tag may differ):

Add a new, empty data volume of size 5GB and name it data.

Click Launch to create the notebook server.

When the notebook server is available, click Connect to connect to it.

Download the data and notebook

A new tab will open up with the JupyterLab landing page. Create a new terminal in JupyterLab.

In the terminal window, run these commands to navigate to the data folder and download the notebook and the data that you will use for the remainder of the lab.

$ cd data/
$ git clone

This repository contains a series of curated examples with data and annotated notebooks. Navigate to the folder data/kale/examples/dog-breed-identification/ in the sidebar and open the notebook dog-breed.ipynb.

Explore the ML code of the Dog Breed Identification example

For the time being, don’t run the cells that download the datasets. We are going to use some smaller datasets that are included in the repository we just cloned. If you are running this example at your own pace from home, feel free to download the full datasets.

Run the imports cell to import all the necessary libraries. Note that the code fails because a library is missing. Normally, you would have to build a new Docker image that includes the newly installed libraries before you could run this notebook as a Kubeflow pipeline. Fortunately, Rok and Kale make sure that any libraries you install during development find their way into your pipeline: Rok snapshots the notebook’s volumes, and Kale mounts those snapshotted volumes into the pipeline steps.

Run the next cell to install the missing libraries.

Restart the notebook kernel by clicking on the Refresh icon.

Run the imports cell again with the correct libraries installed and watch it succeed.
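If you want to confirm which libraries are still missing before rerunning the imports, a quick check along these lines may help. It is a minimal sketch that relies only on Python’s standard importlib module. The function name and the module names in the example are placeholders, not the libraries this notebook actually needs.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found.

    Only top-level names are supported here; dotted names such as
    "package.submodule" would need their parent package to be importable.
    """
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Placeholder names: "json" ships with Python, the second one is made up.
print(missing_modules(["json", "no_such_module_xyz_123"]))
```

An empty list means every listed module can be located by the current kernel.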

Convert your notebook to a Kubeflow Pipeline

Enable Kale by clicking on the Kale icon in the left pane of the notebook.

Explore per-cell dependencies. See how multiple cells can be part of a single pipeline step, and how a pipeline step may depend on previous steps. For example, the image below shows multiple cells that are part of the same pipeline step. They have the same red color and they depend on a previous pipeline step.

To define a pipeline step, click the pencil button in the top right corner of a cell, give the step a name, and select its dependencies. A dependency can be any of the pipeline steps. Kale discovers all the defined pipeline steps automatically and presents them to the user in a dropdown list.
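Conceptually, named steps and their dependencies form a directed acyclic graph. The sketch below illustrates that idea with Python’s standard graphlib module (Python 3.9+). It is not how Kale builds pipelines internally. The "load-data" step and the dependency edges are assumptions made for illustration, and only the three cnn-* names correspond to steps in this example.

```python
from graphlib import TopologicalSorter

# Hypothetical step graph: each key is a pipeline step, each value is the
# set of steps it depends on. "load-data" and the edges are assumed.
steps = {
    "load-data": set(),
    "cnn-from-scratch": {"load-data"},
    "cnn-tf-vgg16": {"load-data"},
    "cnn-tf-resnet50": {"load-data"},
}


def execution_order(graph):
    """Return one valid order in which the steps could run.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(graph).static_order())


order = execution_order(steps)
print(order)
```

Any step listed after its dependencies is a valid schedule. Steps that share the same dependencies, like the three training steps here, have no ordering constraint between them.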

Click the Compile and Run button.

Kale now takes over and converts your notebook to a Kubeflow Pipelines (KFP) pipeline. Because Kale integrates with Rok, it also takes a snapshot of the notebook’s volumes, and you can watch the progress of the snapshot. Rok takes care of data versioning, allowing you to reproduce the whole environment as it was when you clicked the Compile and Run button. This way, you have a time machine for your data and code: an exact, versioned, and reproducible point from which your pipeline starts. Your pipeline will also run in an environment identical to the one in which you developed your code, without the need to build new Docker images.

The pipeline was compiled and uploaded to Kubeflow Pipelines. Now click the link to go to the Kubeflow Pipelines UI and view the run.

Wait for the run to finish.

Congratulations! You just ran an end-to-end pipeline in Kubeflow Pipelines, starting from your notebook!

Transfer learning with hyperparameter tuning

Examine the results

Note: these results are achieved when running with the big datasets. If you are running the pipeline with the small datasets included in the Kale repo, then the results will be far worse.

Take a look at the logs of the cnn-from-scratch step. This is the step where we trained a convolutional neural network (CNN) from scratch. Notice that the trained model has a very low accuracy and, on top of that, this step took a long time to complete.

Take a look at the logs of the cnn-tf-vgg16 step. In this step, we used transfer learning on the pre-trained VGG-16 model — a neural network trained by the Visual Geometry Group (VGG). The accuracy is much higher than the previous model, but we can still do better.

Now, take a look at the logs of the cnn-tf-resnet50 step. In this step, we used transfer learning on the pre-trained ResNet-50 model. The accuracy is higher still, so this is the model we will use for the rest of this Codelab.

Hyperparameter tuning

Go back to the notebook server in your Kubeflow UI, and open the notebook named dog-breed-katib.ipynb. You are going to run some hyperparameter tuning experiments on the ResNet-50 model, using Katib. Notice that you have one cell in the beginning of the notebook to declare parameters:

In the left pane of the notebook, enable HP Tuning with Katib to run hyperparameter tuning:

Then click on Set up Katib Job to configure Katib:

We see that Kale auto-detects the HP Tuning Parameters and their type from the Notebook, due to the way we defined the parameters cell in the Notebook. Define the search space for each parameter, and define a goal:
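To build intuition for how parameter names and types can be read from a cell of plain assignments, here is a minimal sketch using Python’s standard ast module. It is not Kale’s actual implementation. The cell contents, parameter names, and the detect_parameters helper are all hypothetical.

```python
import ast

# Hypothetical parameters cell; names and values are for illustration only.
PARAMETERS_CELL = """
learning_rate = 0.001
hidden_units = 256
optimizer = "adam"
"""


def detect_parameters(cell_source):
    """Map each simple `name = literal` assignment to (type name, default)."""
    detected = {}
    for node in ast.parse(cell_source).body:
        is_simple_assignment = (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
        )
        if not is_simple_assignment:
            continue
        try:
            value = ast.literal_eval(node.value)
        except (ValueError, TypeError):
            continue  # the right-hand side is not a plain literal
        detected[node.targets[0].id] = (type(value).__name__, value)
    return detected


params = detect_parameters(PARAMETERS_CELL)
print(params)
```

Keeping the parameters cell to simple literal assignments is what makes this kind of detection straightforward: each name has an obvious type and default value.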

Click the Compile and Run Katib Job button:

Watch the progress of the Katib experiment:

Click on View to see the Katib experiment:

Click on Done to see the runs in the Kubeflow Pipelines (KFP) Experiment:

In the Katib experiment page you will see the new trials:

And in the KFP UI you will see the new runs:

Let’s unpack what just happened. Previously, Kale produced a pipeline run from a notebook and now it is creating multiple pipeline runs, where each one is fed with a different combination of arguments.
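As a rough illustration of “a different combination of arguments”, the sketch below enumerates every combination in a small search space, which corresponds to a grid search. Katib offers other search algorithms as well, so treat this as one possible strategy only. The parameter names and values are hypothetical.

```python
from itertools import product

# Hypothetical search space: each parameter maps to its candidate values.
search_space = {
    "learning_rate": [0.001, 0.01],
    "hidden_units": [128, 256],
}


def argument_combinations(space):
    """Return one dict of arguments per point in the grid."""
    names = list(space)
    value_lists = [space[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


for arguments in argument_combinations(search_space):
    # In this workflow, each combination would feed one pipeline run.
    print(arguments)
```

With two candidate values for each of two parameters, this grid yields four combinations, and therefore four runs.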

Katib is Kubeflow’s component for running general-purpose hyperparameter tuning jobs. Katib does not know anything about the jobs that it is actually running (called trials in Katib jargon); all it cares about is the search space, the optimization algorithm, and the goal. Katib supports running simple Jobs (that is, Pods) as trials, but Kale implements a shim so that the trials actually run pipelines in Kubeflow Pipelines and then collect the metrics from the pipeline runs. This way we completely unify Katib with Kubeflow Pipelines, providing full visibility and reproducibility for each step of the HP tuning process via KFP.
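The separation described above can be sketched in a few lines of plain Python: a tuner that only knows the search space, a sampling strategy, and a goal, and that treats each trial as an opaque function returning a metric. This is a conceptual sketch, not Katib’s or Kale’s code. The run_experiment and fake_pipeline_run names, the random-search strategy, and the metric formula are all assumptions.

```python
import random


def run_experiment(search_space, run_trial, goal, max_trials, seed=0):
    """Random search over continuous ranges (max_trials must be >= 1).

    The tuner never inspects run_trial; it only sends parameters in and
    reads a metric back, stopping early once the goal is reached.
    """
    rng = random.Random(seed)
    history = []
    for _ in range(max_trials):
        params = {
            name: rng.uniform(low, high)
            for name, (low, high) in search_space.items()
        }
        metric = run_trial(params)
        history.append((params, metric))
        if metric >= goal:
            break
    best = max(history, key=lambda item: item[1])
    return best, history


def fake_pipeline_run(params):
    """Stand-in for launching a pipeline run and collecting its metric."""
    return 1.0 - abs(params["learning_rate"] - 0.01)


best, history = run_experiment(
    search_space={"learning_rate": (0.0001, 0.1)},
    run_trial=fake_pipeline_run,
    goal=0.999,
    max_trials=20,
)
print(len(history), best)
```

Because the tuner depends only on the metric that comes back, the trial function can be swapped for anything that produces one. That is the role the shim plays here: each trial becomes a full pipeline run.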

As the Katib experiment is producing trials, you will see more trials in the Katib UI:

And more runs in the KFP UI:

When the Katib experiment is completed, you can view all the trials in the Katib UI:

And all the runs in the KFP UI:

Congratulations, you have successfully run an end-to-end ML workflow all the way from a Notebook to a reproducible multi-step pipeline with hyperparameter tuning, using Kubeflow (MiniKF), Kale, Katib, KF Pipelines, and Rok!