An end-to-end ML pipeline on-prem: Notebooks & Kubeflow Pipelines on the new MiniKF

Update (August 14, 2020): This tutorial has been updated to showcase the Taxi Cab end-to-end example using the new MiniKF (v20200812.0.0)

Today, at Kubecon Europe 2019, Arrikto announced the release of the new MiniKF, which features Kubeflow v0.5. The new MiniKF enables data scientists to run end-to-end Kubeflow Pipelines locally, starting from their Notebook.

This is great news for data scientists, as until now, there was no easy way for one to run an end-to-end KFP example on-prem. One should have strong Kubernetes knowledge to be able to deal with some steps. Specifically, for Kubeflow’s standard Chicago Taxi (TFX) example, which will be the one we will presenting in this post, one should:

  • Understand K8s and be familiar with kubectl
  • Understand and compose YAML files
  • Manually create PVCs via K8s
  • Mount a PVC to a container to fill it up with initial data

Using MiniKF and Arrikto’s Rok data management platform (MiniKF comes with a free, single-node Rok license), we showcase how to streamline all these operations to reduce time, and provide a much more friendly user experience. A data scientist starts from a Notebook, builds the pipeline, and uses Rok to take a snapshot of the local data they prepare, with a click of a button. Then, they can seed the Kubeflow Pipeline with this snapshot using only the UIs of KFP and Rok.

MiniKF greatly enhances data science experience by simplifying users’ workflow and removing the need for even a hint of K8s knowledge. It also introduces the first steps towards a unified integration of Notebooks & Kubeflow Pipelines.

In a nutshell, this tutorial will highlight the following benefits of using MiniKF, Kubeflow, and Rok:

  • Easy execution of a local/on-prem Kubeflow Pipelines e2e example
  • Seamless Notebook and Kubeflow Pipelines integration with Rok
  • KFP workflow execution without K8s-specific knowledge

Kubeflow’s Chicago Taxi (TFX) example on-prem tutorial

Let’s put all the above together, and watch MiniKF, Kubeflow, and Rok in action.

One very popular data science example is the Taxi Cab (or Chicago Taxi) example that predicts trips that result in tips greater than 20% of the fare. This example is already ported to run as a Kubeflow Pipeline on GCP, and included in the corresponding KFP repository. We are going to showcase the Taxi Cab example running locally, using the new MiniKF, and demonstrate Rok’s integration as well. Follow the steps below and you will run an end-to-end Kubeflow Pipeline on your laptop!

Install MiniKF

Open a terminal and run:

vagrant init arrikto/minikf
vagrant up

Open a browser, go to 10.10.10.10 and follow the instructions to get Kubeflow and Rok up and running.

For more info about how to install MiniKF, visit the MiniKF page:

https://www.arrikto.com/minikf/

Or the official Kubeflow documentation:

https://www.kubeflow.org/docs/started/workstation/getting-started-minikf/

MiniKF is also available on the GCP Marketplace, where you can install it with the click of a button. You can find it here.

Create a Notebook Server

On the MiniKF landing page, click the “Connect” button next to Kubeflow to connect to the Kubeflow Dashboard:

MiniKF: Connect to Kubeflow

Log in using the credentials you see on your screen.

Image for post

Once at Kubeflow’s Dashboard, click on the “Notebooks” link on the left pane to go to the Notebook Manager UI:

Image for post

You are now at the Kubeflow Notebook Manager, showing the list of Notebook Servers, which is currently empty. Click on “New Server” to create a new Notebook Server:

Create a new Notebook Server

Enter a name for your new Notebook Server, and select Image, CPU, and RAM.

Make sure you have selected the following Docker image:

gcr.io/arrikto/jupyter-kale:v0.5.0-47-g2427cc9

Note that the image tag may differ.

Image for post

Add a new, empty Data Volume, for example of size 2GB, and name it “data” (you can give it any name you like, but then you will have to modify some commands in later steps):

Image for post

Once you select all options, click “Launch” to create the Notebook Server, and wait for it to get ready:

Image for post

Once the Server is ready, the “Connect” button will become active (blue color). Click the “Connect” button to connect to your new Notebook Server:

Image for post

A new tab will open up with the JupyterLab landing page:

Image for post

Bring in the Pipelines code and data

Create a new terminal in JupyterLab:

Image for post

Bring in a pre-filled Notebook to run all the necessary commands:

wget https://raw.githubusercontent.com/arrikto/kubeflow-examples/kubecon-demo/taxi-cab-on-prem/taxi-cab-pipeline-snap.ipynb

The Notebook will appear on the left pane of your JupyterLab:

Image for post

Double-click the notebook file to open it and run the cells one-by-one:

Image for post

Run the first cell to download the Arrikto’s pipeline code to run the Chicago Taxi Cab example on-prem:

Image for post

Run the second cell to ingest the data:

Image for post

Run the third cell to move the ingested data to the Data Volume of the Notebook. (Note that if you didn’t name your Data Volume “data”, then you have to slightly modify this command):

Image for post

Run the fourth cell to verify that data exist inside the Data Volume. Note that you should have the taxi-cab-classification directory under the data directory, not just the files. You now have a local Data Volume populated with the data the pipeline code needs (if you didn’t give the name “data” to your Data Volume, then you have to make sure that your Data Volume has these files in it):

Image for post

Run the fifth cell to compile the pipeline source code:

Image for post

Run the sixth cell to verify the output of the compiled pipeline:

Image for post

The file containing the compiled pipeline appears on the left pane of your JupyterLab:

Image for post

Snapshot the Data Volume

In later steps, the pipeline is going to need the data that we brought into the Data Volume previously. For this reason, we need a snapshot of the Data Volume. As a best practice, we will snapshot the whole JupyterLab, and not just the Data Volume, in case the user wants to go back and reproduce their work.

We will use Rok, which is already included in MiniKF, to snapshot the JupyterLab. On the Kubeflow left pane, click on the “Snapshots” link to go to the Rok UI. Alternatively, you can go to the MiniKF landing page and click the “Connect” button next to Rok:

Image for post

This is the Rok UI landing page:

Image for post

Create a new bucket to host the new snapshot. Click on the “+” button on the top left:

Image for post

A dialog will appear asking for a bucket name. Give it a name and click “Next”. We will keep the bucket “Local” for this demo:

Image for post

Clicking “Next” will result in a new, empty bucket appearing in the landing page. Click on the bucket, to go inside:

Image for post

Once inside the bucket, click on the Camera button to take a new snapshot:

Image for post

By clicking the Camera button, a dialog appears asking for the K8s resource that we want to snapshot. Choose the whole “JupyterLab” option, not just the single Data Volume (“Dataset”):

Image for post

Most fields will be pre-filled with values automatically by Rok, for convenience. Select your JupyterLab from the dropdown list:

Image for post

Provide a commit title and a commit message for this snapshot. This is to help you identify the snapshot version in the future, the same way you would do with your code commits in Git:

Image for post

Then, choose a name for your snapshot:

Image for post

Take the snapshot, by clicking the “Snapshot” button:

Image for post

Once the operation completes, you will have a snapshot of your whole JupyterLab. This means you have a snapshot of the Workspace Volume and a snapshot of the Data Volume, along with all the corresponding JupyterLab metadata to recreate the environment with a single click. The snapshot appears as a file inside your new bucket. Expanding the file will let you see the snapshot of the Workspace Volume and the snapshot of the Data Volume:

Image for post

Now that we have both the pipeline compiled and a snapshot of the Data Volume, let’s move on to run the pipeline and seed it with the data we prepared.

Upload the Pipeline to KFP

Before uploading the pipeline to KFP, we first need to download the compiled pipeline locally. Go back to your JupyterLab and download the compiled pipeline. To do so, right click on the file on JupyterLab’s left pane, and click “Download”:

Image for post

Once the file is downloaded to your laptop, go to Kubeflow Dashboard and open the Kubeflow Pipelines UI by clicking “Pipelines” on the left pane:

Image for post

This will take you to the Kubeflow Pipelines UI:

Image for post

Click “+ Upload pipeline”:

Image for post

Choose a name and a description for your pipeline. Then click the “Upload a file” option and choose the .tar.gz file you downloaded locally. Finally, click “Create”.

Image for post

The pipeline should get uploaded successfully and you should be seeing its graph.

Image for post

Create a new Experiment Run

Create a new Experiment by clicking “+ Create experiment”:

Image for post

Choose an Experiment name, and click “Next”:

Image for post

By clicking “Next”, the KFP UI sends you to the “Start a new run” page, where you are going to create a new Run for this Experiment. Enter a name for this Run (note that the Pipeline is already selected. If this is not the case, just select the uploaded Pipeline):

Image for post

Note that the Pipeline’s parameters show up:

Image for post

Seed the Pipeline with the Notebook’s Data Volume

On the “rok-url” parameter we need to specify the snapshot of the Notebook Server’s Data Volume, which we created in the previous step. Click on “Choose” to open the Rok file chooser and select the snapshot of this Data Volume:

Image for post

Enter inside the bucket you created previously:

Image for post

Expand the file inside the bucket:

Image for post

Find the snapshot of the Data Volume and click on it. Make sure that you have chosen the Data Volume and not the Workspace Volume:

Image for post

Then click on “Choose”:

Image for post

The Rok URL that corresponds to the Data Volume snapshot appears on the corresponding parameter field:

Image for post

Select a bucket to store the pipeline snapshot

On the “rok-register-url” parameter we need to choose where we are going to store the pipeline snapshot. For that, we need to specify a bucket and a name for the file (snapshot) that is going to be created. We can select an existing bucket and an existing file. This would create a new version of this file. Here, we will create a new bucket and a new file. First, we click on the “Choose” button to open the Rok file chooser:

Image for post

We create a new bucket:

Image for post

We give it a name and click “Next”:

Image for post

Then we enter the bucket:

Image for post

We create a new file:

Image for post

We give it a name:

Image for post

And click “Choose”:

Image for post

The Rok URL that corresponds to the pipeline snapshot appears on the corresponding parameter field:

Image for post

Run the Pipeline

Now that we have defined the data to seed the pipeline and the file to store the pipeline snapshot, we can run the pipeline. We are leaving all other parameters as is, and click “Start”:

Image for post

Now, click on the Run you created to view the progress of the Pipeline:

Image for post

As the Pipeline runs, we see the various steps running and completing successfully. The Pipeline is going to take 10 to 30 minutes to complete, depending on your laptop specs. During the first step, Rok will instantly clone the snapshot we provided, for the next steps to use it:

Image for post

Note: If the validation step fails with:

/mnt/taxi-cab-classification/column-names.json; 
No such file or directory

then you have to make sure that you have given the name “data” to your Data Volume. If not, then make sure that you amended the command of cell 3 accordingly, so that the data of the pipeline gets stored to your Data Volume.

Image for post

The training step is going to take a few minutes:

Image for post
Image for post
Image for post

Pipeline snapshot with Rok

As a last step, Rok will automatically take a snapshot of the pipeline, so that we can later attach it to a Notebook and do further exploration of the pipeline results. You can see this step here:

Image for post
Image for post

Update: in later versions of MiniKF, the `deploy` step will fail. This happens, because this step assumes the pipeline runs inside the `kubeflow` namespace.

If we go back to the Rok UI, we can see the snapshot that was taken during the pipeline run. We go to the Rok UI landing page:

Image for post

We find the bucket that we chose earlier to store the snapshot, in this case “taxi-cab-pipeline-snapshot”:

Image for post

And we see the pipeline snapshot that has the name that we chose earlier, in this case “pipeline-snapshot”:

Image for post

Explore pipeline results inside a Notebook

To explore the pipeline results, we can create a new Notebook and attach it the pipeline snapshot. The Notebook will use a clone of this immutable snapshot. This means that you can do further exploration and experimentation using this Data Volume, without losing the results that the pipeline run produced.

To do this, we create a new Notebook Server and add an existing Data Volume. Here, the volume will be the pipeline snapshot that lives inside Rok. We go to the Notebook Manager UI and click “New Server”:

Image for post

Then, we enter a name for the new Notebook Server, and select Image, CPU, and RAM:

Image for post

To add an existing Data Volume, click on “Add Volume”:

Image for post

Then change its type to “Existing”:

Image for post
Image for post

Now, we will use the Rok file chooser to select the snapshot that was taken during the pipeline run. This snapshot contains the pipeline results. The Notebook will use a clone of this immutable snapshot, so you can mess with the volume without losing any of your critical data.

We click on the Rok file chooser icon and open the file chooser:

Image for post
Image for post

We find the bucket that we chose earlier to store the snapshot, in this case “taxi-cab-pipeline-snapshot”:

Image for post

We click on the file (snapshot) that we created during the pipeline run, in this case “pipeline-snapshot”:

Image for post

And then click “Choose”:

Image for post

The Rok URL that corresponds to this snapshot appears in the “Rok URL” field:

Image for post

You can change the name of the volume if you wish. Here, we change it to “analysis”:

Image for post

Then, we click “Launch”:

Image for post

A new Notebook Server appears in the Notebook Manager UI. Click on “Connect” to go this JupyterLab:

Image for post

Inside the Notebook, you can see the Data Volume that we created by cloning the pipeline snapshot:

Image for post

Inside the Data Volume you can see the input data of the pipeline “taxi-cab-classification”, and the output data of the pipeline “taxi-cab-classification-s9qwx”. Note that you will see a different alphanumerical sequence in your Notebook. Open the second folder to view the pipeline results:

Image for post

You can see the results of every step of the pipeline:

Image for post

Let’s go inside the analysis folder to see how the model performed:

Image for post

Open the “output_display.html” file:

Image for post

Click on “Trust HTML” to be able to view the file:

Image for post

Here is how the model performed:

Image for post

Note: this page is using HTML imports, a feature of the web components spec which is now deprecated. Most probably you won’t be able to view the model performance.

More MiniKF tutorials

Talk to us

Join the discussion on the #minikf Slack channel, ask questions, request features, and get support for MiniKF.

To join the Kubeflow Slack workspace, please request an invite.

Watch the tutorial video

You can also watch the video of this tutorial: