An End-to-End ML Workflow: From Notebook to Kubeflow Pipelines with MiniKF & Kale

by Chris Pavlou | December 2019

16 min read

Kubeflow is the de facto standard for running Machine Learning workflows on Kubernetes. Jupyter Notebook is a very popular tool that data scientists use every day to write their ML code, experiment, and visualize the results. However, when it comes to converting a Notebook to a Kubeflow Pipeline, data scientists struggle a lot. It is a challenging, time-consuming task that usually requires the cooperation of several subject-matter experts: a Data Scientist, a Machine Learning Engineer, and a Data Engineer.
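
To make the contrast concrete, here is a rough sketch of the manual route this tutorial avoids: with the Kubeflow Pipelines SDK, every step has to be written as a component and wired into a pipeline definition by hand. The snippet below is illustrative only (a kfp v1-style API with placeholder step bodies), not the code behind this tutorial's pipeline.

# A minimal, hypothetical sketch of hand-writing a pipeline with the kfp v1 SDK.
# Real pipelines also need base images, volumes, and data passing configured explicitly.
import kfp
from kfp import dsl
from kfp.components import create_component_from_func

def load_data() -> str:
    # ...download and clean the Titanic data...
    return "path/to/features"

def train_model(features: str) -> float:
    # ...fit a model and return its score...
    return 0.0

load_data_op = create_component_from_func(load_data, base_image="python:3.8")
train_model_op = create_component_from_func(train_model, base_image="python:3.8")

@dsl.pipeline(name="titanic-manual", description="Hand-written pipeline definition")
def titanic_pipeline():
    load_task = load_data_op()
    train_model_op(load_task.output)

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(titanic_pipeline, "titanic_pipeline.yaml")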

This tutorial will guide you through a seamless workflow that enables data scientists to deploy a Jupyter Notebook as a Kubeflow pipeline with the click of a button. Moreover, we will showcase how a data scientist can reproduce a step of the pipeline run, debug it, and then re-run the pipeline without having to write a single line of code. We will focus on two essential aspects:

  1. Low barrier to entry: convert a Jupyter Notebook to a multi-step Kubeflow pipeline in the Cloud using only the GUI.
  2. Reproducibility: automatic data versioning to enable reproducibility and better collaboration between data scientists.

This tutorial was presented as a workshop by Google & Arrikto during KubeCon San Diego 2019. Here are the Codelab, the slides, and the video of the workshop.

What you’ll build

In this tutorial, you will build a complex, multi-step ML pipeline with Kubeflow Pipelines, without using any CLI commands or SDKs. You also won’t need to write any code for Pipeline components and Pipelines DSL, or build any Docker images. You don’t need to have any Kubernetes or Docker knowledge to complete this tutorial. Upon completion, your infrastructure will contain:

MiniKF (Mini Kubeflow) VM on GCP that automatically installs:

  • Kubernetes (using Minikube)
  • Kubeflow
  • Kale, a tool to convert general purpose Jupyter Notebooks to Kubeflow Pipelines workflows (GitHub)
  • Arrikto Rok for data versioning and reproducibility

What you’ll learn

In a nutshell, this tutorial will highlight the following benefits of using MiniKF, Kubeflow, Kale, and Rok:

  • How to install Kubeflow with MiniKF
  • How to convert your Jupyter Notebooks to Kubeflow Pipelines without using any CLI commands or SDKs
  • How to run Kubeflow Pipelines from inside a Notebook with the click of a button
  • How to automatically version your data in a Notebook and in every pipeline step
  • How to reproduce the whole state of a Pipeline step along with its data in a Notebook

Install MiniKF

To install MiniKF on GCP, read the step-by-step guide or follow the steps below:

  • Go to the MiniKF page on Google Cloud Marketplace.
  • Click the Launch on Compute Engine button.
  • In the Configure & Deploy window, choose a name, a GCP zone, a machine type, a boot disk, and an extra disk for your deployment. Then click Deploy.
  • When the VM is up, follow the suggested next steps under the Getting started with MiniKF section. It is important to follow these steps and make sure that you can log in to MiniKF successfully, before moving to the next step.

Run a Pipeline from inside your Notebook

During this section, you will run the Titanic example, based on the Kaggle competition to predict which passengers survived the Titanic shipwreck.

Create a Notebook Server

Navigate to the Notebook Servers link on the Kubeflow central dashboard.

Click on New Server.

Specify a name for your Notebook Server.

Make sure you have selected this image:

gcr.io/arrikto/jupyter-kale:v0.5.0-47-g2427cc9

Note that the image tag may differ.

Add a new, empty Data Volume of size 5GB and name it “data” (you can give it any name you like, but then you will have to modify some commands in later steps).

Click Launch to create the notebook server.

When the notebook server is available, click Connect to connect to it.

Download the data and notebook

A new tab will open up with the JupyterLab landing page. Create a new Terminal in JupyterLab.

In the Terminal window, run these commands to navigate to the data folder and download the notebook and the data that you will use for the remainder of the lab:

$ cd data/
$ git clone -b kubecon-workshop https://github.com/kubeflow-kale/examples

This repository contains a series of curated examples with data and annotated Notebooks. Navigate to the folder data/examples/titanic-ml-dataset/ in the sidebar and open the notebook titanic_dataset_ml.ipynb.


Explore the ML code of the Titanic challenge

Run the notebook step-by-step. Note that the code fails because a library is missing.

Go back to the Terminal and install the missing libraries. Please make sure you install the libraries in your home directory as shown in the following command:

$ cd examples/titanic-ml-dataset/
$ pip3 install --user -r requirements.txt

Restart the notebook kernel by clicking on the Refresh icon.

Run the cell again with the correct libraries installed and watch it succeed!

Convert your notebook to a Kubeflow Pipeline

Converting your Jupyter Notebook is very simple! Enable Kale by clicking on the Kale icon in the left pane.

Explore per-cell dependencies. See how multiple cells can be part of a single pipeline step, and how a pipeline step may depend on previous steps.
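
Under the hood, Kale records these groupings as plain cell metadata, so your Python code itself never changes. As a rough illustration only (the exact tag names and layout are managed by the Kale GUI and may differ between Kale versions), a tagged cell could carry metadata along these lines:

# Illustrative only: roughly how a cell's step annotations might look in the
# notebook's metadata. Kale's JupyterLab extension manages these tags for you,
# and the real format may differ from this sketch.
cell_metadata = {
    "tags": [
        "block:featureengineering",  # this cell belongs to the featureengineering step
        "prev:loaddata",             # ...which depends on a loaddata step
    ]
}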

Click the Compile and Run button.

Watch the progress of the snapshot.

Watch the progress of the Pipeline Run.

Click the link to go to the Kubeflow Pipelines UI and view the run.

Wait for it to complete.

Congratulations! You just ran an end-to-end Kubeflow Pipeline starting from your notebook! Note that we didn’t have to create a new Docker image, even though we installed new libraries. Rok took a snapshot of the whole Notebook, including the workspace volume that contains all of the user’s libraries, so the newly added dependencies were included automatically.
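
One way to see why this works: pip's --user flag installs packages under your home directory, and in this setup the home directory lives on the workspace volume that Rok snapshots. If you are curious, a quick sanity check from a notebook cell or the Terminal shows where the packages landed (the exact path depends on the image's Python version and user):

# Optional sanity check: where did `pip3 install --user` put the packages?
# The printed path should be under the home directory, i.e. on the workspace volume.
import site
import sys

print(site.USER_SITE)              # e.g. ~/.local/lib/python3.x/site-packages
print(site.USER_SITE in sys.path)  # True once the user-site packages are importable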

Reproducibility with Volume Snapshots

During this section, we will explore a pipeline step, reproduce its exact state just before it ran, and debug it.

Examine the results

Have a look at the logs for the second-to-last pipeline step, “results”. Notice that all the predictors show a score of 100%. An experienced data scientist would immediately find this suspicious. It is a good indication that our models are not generalizing: either we are overfitting on the training dataset, or there is some other mistake in the input features. This is likely caused by an issue with the data consumed by the models.
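
If you want to convince yourself that a leaked label column really does produce these too-good-to-be-true numbers, here is a small standalone sketch (using scikit-learn and synthetic data, not the tutorial's variables) that reproduces the symptom:

# Standalone illustration of label leakage, independent of the Titanic notebook:
# when the target column sneaks into the features, even cross-validated accuracy
# jumps to ~100% because the model can simply read off the answer.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))          # five uninformative features
y = rng.integers(0, 2, size=500)       # random binary labels

leaky_X = np.column_stack([X, y])      # the label accidentally included as a feature

clean = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
leaky = cross_val_score(RandomForestClassifier(random_state=0), leaky_X, y, cv=5).mean()

print(f"without leakage: {clean:.2f}")  # around 0.50, chance level on random labels
print(f"with leakage:    {leaky:.2f}")  # essentially 1.00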


Reproduce prior state

Fortunately, Rok takes care of data versioning and reproducing the whole environment as it was at the time you clicked the Compile and Run button. This way, you have a time machine for your data and code. So let’s resume the state of the pipeline before training one of the models and see what is going on. Take a look at the randomforest step, then click on Visualizations.


Follow the steps in the Markdown, i.e. view the snapshot in the Rok UI by clicking on the corresponding link.

Copy the Rok URL.

Navigate to Notebooks.

Click on New Server.

Specify a name for your notebook.

Paste the Rok URL you copied previously.

All the snapshot details, including the Notebook image and Volumes, will be retrieved automatically, and you will see a confirmation message.


Make sure you have selected this image:

gcr.io/arrikto/jupyter-kale:v0.5.0-47-g2427cc9

Note that the image tag may differ.

The Volume information should have been filled in automatically.

Click Launch to create the notebook server.

When the notebook server is available, click Connect to connect to it.

Note that the notebook opens at the exact cell of the pipeline step from which you spawned it.

In the background, Kale has resumed the Notebook’s state by importing all the libraries and loading the variables from the previous steps.
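
The exact mechanism is internal to Kale, but conceptually it boils down to re-importing the step's modules and loading back the variables that earlier steps saved to the versioned volume. The sketch below illustrates the idea only; the directory name and file layout are made up and are not Kale's actual format:

# Conceptual sketch of "resuming a step": earlier steps persist their variables
# to the data volume, and the resumed notebook loads them before the step runs.
# The paths and file layout here are hypothetical, not Kale's real implementation.
import pickle
from pathlib import Path

MARSHAL_DIR = Path(".marshal_example")  # hypothetical location on the snapshotted volume

def save_state(**variables):
    MARSHAL_DIR.mkdir(exist_ok=True)
    for name, value in variables.items():
        (MARSHAL_DIR / f"{name}.pkl").write_bytes(pickle.dumps(value))

def load_state():
    return {p.stem: pickle.loads(p.read_bytes()) for p in MARSHAL_DIR.glob("*.pkl")}

# A previous step would call, e.g.:  save_state(train_df=train_df, test_df=test_df)
# The resumed notebook can then do:  globals().update(load_state())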

Debug prior state

Add a print command to this cell:

print(acc_random_forest)

Run the active cell by pressing Shift + Return to retrain the random forest and print the score. It is 100.

Now it’s time to see if there is something strange in the training data. To explore and fix this issue, add a cell above the Random Forest markdown cell by selecting the previous cell and clicking the plus icon (+).


Add the following text and execute the cell to print the training set:

train_df

Oops! The column with the training labels (“Survived”) has mistakenly been included in the input features! The model has learned to focus on the “Survived” column and ignore the rest, polluting the input. Since this column exactly matches the model’s target and is not present at prediction time, it needs to be removed from the training dataset so the model can learn from the other features.

Add a bugfix

To remove this column, edit the cell to add this command:

train_df.drop('Survived', axis=1, inplace=True)
train_df
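
If you want a quick guard against reintroducing the leak later, you can also add an assertion before training. This is an optional extra, not part of the original notebook:

# Optional guard (not in the original notebook): fail fast if the label column
# ever ends up in the training features again.
assert "Survived" not in train_df.columns, "label column leaked into the features"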

The cell we added is, by default, a pipeline step cell type, and since we did not give it a name, it will be merged with the featureengineering step. Enable Kale and ensure that the cell that removes the Survived labels is part of the featureengineering pipeline step, that is, it has the same outline color.

Run the pipeline again by clicking on the Compile and Run button.

Click the link to go to the Kubeflow Pipelines UI and view the run.

Wait for the results step to complete and view the logs to see the final results. You now have realistic prediction scores!


Congratulations, you have successfully run an end-to-end ML workflow all the way from a Notebook to a reproducible multi-step pipeline, which you debugged on-the-fly using Kubeflow (MiniKF), Kale, and Rok!

Talk to us

Join the discussion on the #minikf Slack channel, ask questions, request features, and get support for MiniKF.

To join the Kubeflow Slack workspace, please request an invite.

Chris Pavlou

Chris is a Technical Marketing Engineer with a strong customer focus and a background in Machine Learning and cloud-native applications.
