In this blog post, we’re going to explore the Kaggle Titanic Disaster survivor prediction notebook and convert it into a Kubeflow pipeline with the click of a button using the Kale JupyterLab extension. Ok, let’s start.
First, What is Kubeflow?
Kubeflow is an open source, cloud-native MLOps platform originally developed by Google that aims to provide all the tooling that both data scientists and machine learning engineers need to run workflows in production. Features include model development, training, serving, AutoML, monitoring and artifact management. The latest 1.5 release features contributions from Google, Arrikto, IBM, Twitter and Rakuten. Want to try it for yourself? You can get started in minutes with a free trial of Kubeflow as a Service, no credit card required.
About Kaggle’s Titanic Disaster Survivor Prediction Competition
Kaggle is an online community of Data Scientists, ML Engineers and MLOps champions who come together to explore creating models and technical solutions to popular real-world problems. Kaggle competitions focus on finding solutions to these popular problems to advance the collective community’s knowledge and capabilities. On April 15, 1912, the RMS Titanic sank after colliding with an iceberg. There were not enough lifeboats for everyone on board, resulting in the death of 1,502 out of 2,224 passengers and crew. There was some element of luck involved in surviving, but some groups of people were more likely to survive than others. The notebook you will work with is based on a Kaggle project which uses Machine Learning to create a model that predicts which passengers survived the Titanic shipwreck. This course is a self-service exploration of this problem solved using a Jupyter Notebook and Kubeflow Pipelines.
More details on this project can be found here: https://www.kaggle.com/c/titanic
Prerequisites for Building the Kubeflow Pipeline
Step 1: Set Up Kubeflow as a Service
If you don’t already have Kubeflow up and running, we recommend signing up for a free trial of Kubeflow as a Service. Create a new Kubeflow Deployment.
Step 2: Launch a new Notebook Server
From the Notebooks UI inside of Kubeflow click New Notebook to create a new Notebook Server. Add a name for the new Notebook server and, for the purposes of this competition, you also need to create a new data volume with the following requirements:
- Type: Empty volume
- Name: data
- Size: 5 Gi
- Access mode: ReadWriteOnce
- Storage class: default
Then, click “Launch” to launch your notebook and once it is ready, click “Connect”.
Step 3: Clone the Project Repo to Your Notebook
Open up a terminal in the Notebook Server and download the notebook and the data files using the following commands:
If you are using Kubeflow 1.5, run this command in the terminal window to download the notebook file and the data that you will use for the remainder of this course:
git clone -b release-1.5 https://github.com/arrikto/examples
If you are using Kubeflow 1.4, run this command instead:
git clone -b release-1.4 https://github.com/arrikto/examples
If you’re using Arrikto’s Kubeflow as a Service, note that it currently runs Kubeflow 1.4.
If you’re not sure what version you’re using, you can see it on the bottom left of the Kubeflow Central Dashboard as it is shown below.
Step 4: Open the Notebook
Navigate to the folder examples/academy/titanic-ml-dataset in the sidebar and open the notebook titanic_dataset_ml.ipynb.
Step 5: Install Packages and Libraries for Notebook
If you try to run the first cell, you’ll notice that the code fails because a library is missing:
Go back to the terminal, install the missing library with the following command and restart the kernel by clicking on the Refresh icon:
pip3 install --user seaborn
Run the cell again with the correct libraries installed and watch it succeed.
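If you prefer to guard against missing libraries from inside the notebook itself, a common defensive pattern is to try the import and fall back to a `pip install --user` (the same flag used above). A minimal sketch; `ensure_installed` is an illustrative helper, not part of the notebook:

```python
import importlib
import subprocess
import sys

def ensure_installed(package):
    """Import a package, pip-installing it into the user site first if missing."""
    try:
        importlib.import_module(package)
    except ImportError:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "--user", package]
        )

# In the notebook, this would guard the failing import:
# ensure_installed("seaborn")
```

Note that after a user-site install you still need to restart the kernel, just as the terminal-based install requires.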
What’s the Easiest Way to Create a Kubeflow Pipeline?
To create a Kubeflow pipeline from your running Notebook, we highly recommend making use of the open source JupyterLab extension called Kale. Kale is built right into Kubeflow as a Service and provides a simple UI for defining Kubeflow Pipelines directly from your JupyterLab notebook, without the need to change a single line of code, build and push Docker images, create KFP components or write KFP DSL code to define the pipeline DAG. In this next example, we’ll show you just how easy it is.
Understanding Kale Tags
Kale tags give much better flexibility when it comes to converting a notebook to a Kubeflow pipeline. With Kale you annotate cells (which are logical groupings of code) inside your Jupyter Notebook with tags. These tags tell Kale how to interpret the code contained in the cell, what dependencies exist and what functionality is required to execute the cell.
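Under the hood, these annotations are stored as ordinary Jupyter cell metadata. The sketch below is illustrative of Kale's tagging convention (a `block:` tag names the pipeline step, `prev:` tags list its dependencies); treat the exact strings as an example rather than a specification:

```python
# Illustrative sketch of the metadata Kale writes into a tagged notebook cell.
# "block:" names the pipeline step this cell belongs to; "prev:" lists the
# steps it depends on.
cell_metadata = {
    "tags": [
        "block:datapreprocessing",  # this cell belongs to the datapreprocessing step
        "prev:loaddata",            # which runs after the loaddata step
    ]
}

# Kale parses these tags to build the pipeline DAG:
step_names = [t.split(":", 1)[1] for t in cell_metadata["tags"] if t.startswith("block:")]
deps = [t.split(":", 1)[1] for t in cell_metadata["tags"] if t.startswith("prev:")]
```

This is why Kale needs no changes to your code: the pipeline structure lives entirely in cell metadata, which the Kale panel edits for you.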
Step 1: Enable the Kale Deployment Panel and Annotate Notebook Cells with Kale Tags
The first step is to open up the Kale Deployment panel and click on the Enable switch button. Once you have it switched on, you should see the following information on the Kale Deployment panel.
After installing the required Python packages, the next step is to annotate the notebook with Kale tags.
There are six tags available for annotation:
- Imports
- Functions
- Pipeline Parameters
- Pipeline Metrics
- Pipeline Step
- Skip Cell
Our first annotation is for imports. We import the necessary packages for this example in the cell with the “Imports” tag. Keeping all your imports in a single cell will make your life easier when you transform this notebook into a Kubeflow pipeline using Kale.
The “Pipeline Parameters” tag is required for defining pipeline parameters and for running Hyperparameter Tuning experiments with Kale and Katib. The parameters are passed to pipeline steps that make use of them.
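A pipeline-parameters cell is just a cell of plain assignments; Kale lifts each one into a pipeline parameter that can be overridden per run or swept by Katib. A minimal sketch (the names and values here are illustrative, not taken from the Titanic notebook):

```python
# Cell tagged "Pipeline Parameters" (illustrative values).
# Kale turns each simple assignment into a pipeline parameter.
TEST_SIZE = 0.2       # fraction of the data held out for evaluation
RANDOM_STATE = 42     # seed for reproducible splits
N_ESTIMATORS = 100    # e.g. number of trees in a random forest
```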
The next cell is the “loaddata” cell which loads the data we will use with the various algorithms. In this cell, the “Pipeline Step” annotation will be used. The pipeline step is a set of code that implements the computation required to complete a step in your machine learning workflow. This tag is used to define the load data step. Because this is the first step in the pipeline, it has no dependencies. Also, each pipeline step can have its own GPU support, however for this task it is not enabled.
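The load-data step itself is typically a couple of `pandas.read_csv` calls. The Kaggle Titanic dataset ships as `train.csv` and `test.csv`; the paths below are an assumption about where the cloned repo keeps them, so adjust as needed:

```python
import pandas as pd

def load_data(train_path="data/train.csv", test_path="data/test.csv"):
    """Load the Kaggle Titanic CSVs.

    The default paths are assumptions for illustration; the notebook's
    actual paths may differ.
    """
    train_df = pd.read_csv(train_path)
    test_df = pd.read_csv(test_path)
    return train_df, test_df
```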
The “Skip Cell” annotation can be used to skip any section of the notebook that is not required for the pipeline compilation. There is no need to feed this cell into our pipeline step because the packages have already been installed in the notebook.
The next cells we’re going to explore are the data preprocessing cells. In this particular step, we combine the siblings and parents features into a single feature representing the number of relatives. These cells are also annotated with the Pipeline Step tag. As expected, this step depends on the loaddata step; if you hover over the green circle, you will see the dependency. Also, each pipeline step can have its own GPU support, although it is not enabled for this task.
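In the Titanic dataset, siblings/spouses and parents/children are the `SibSp` and `Parch` columns, so the combination described above can be sketched like this:

```python
import pandas as pd

def add_relatives(df):
    """Combine SibSp (siblings/spouses aboard) and Parch (parents/children
    aboard) into a single 'relatives' count, as the preprocessing step does."""
    df = df.copy()
    df["relatives"] = df["SibSp"] + df["Parch"]
    return df
```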
The next cells fill in missing data for the cabin, age and embarked features. Note that Kale automatically applies the same tag annotation to subsequent cells when no other tag is specified: in the screenshots, the next cell is highlighted with the same annotation even though no tag was set on it.
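The imputation itself usually pairs a strategy with each column type. The sketch below is illustrative; the notebook's concrete strategies for `Age`, `Embarked` and `Cabin` may differ:

```python
import pandas as pd

def fill_missing(df):
    """Illustrative imputation for the Titanic columns mentioned in the text;
    the notebook's actual choices may differ."""
    df = df.copy()
    df["Age"] = df["Age"].fillna(df["Age"].median())                   # numeric: median
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])   # categorical: most frequent
    df["Cabin"] = df["Cabin"].fillna("U")                              # "U" for unknown cabin
    return df
```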
The next group of cells we’re going to see is the cells in regards to feature engineering. Let’s recall that a feature is an individual measurable property that in most cases is numeric but also can be a string or a graph. Algorithms sometimes require that certain features have certain characteristics to work properly. In this example, we’re going to convert the Sex of a passenger into a binary format. Specifically, we convert “Males” into “0” and “Females” into “1”. Let’s note that the feature engineering step depends on the datapreprocessing step.
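With pandas this binary conversion is a one-line `map` (in the raw Kaggle data the column values are lowercase `male`/`female`):

```python
import pandas as pd

def encode_sex(df):
    """Map the Sex column to a binary feature: male -> 0, female -> 1."""
    df = df.copy()
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    return df
```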
Now we’re moving to the machine learning section. Here we see all the algorithms we train in order to find out which one performs best. Then, the Results section displays all the experiment results.
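The compare-several-models pattern looks roughly like the sketch below. It uses synthetic data as a stand-in for the Titanic features so it runs without the Kaggle CSVs, and the particular classifiers are illustrative (the notebook may use a different set):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the Titanic feature matrix, for a runnable sketch.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    "KNN": KNeighborsClassifier(),
}

# Fit each model and score it on the held-out split.
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
```

In the pipeline, each of these fits becomes its own step, so they can even run in parallel.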
Step 2: Run the Kubeflow Pipeline
Now that we’ve tagged all the notebook cells, let’s go back to the Kale panel and click the “Compile and Run” button. Kale will perform the following tasks for you:
- Validate the notebook
- Take a snapshot, so the whole environment is versioned
- Compile the notebook to a Kubeflow pipeline
- Upload the pipeline
- Run the pipeline
In the “Running pipeline” output, click on the “View” hyperlink. This will take you directly to the runtime execution graph where you can watch your pipeline execute and update in real time.
Congratulations! You just ran an end-to-end Kubeflow pipeline starting from your notebook! Note that we didn’t have to create a new Docker image, although we installed new libraries. Rok took a snapshot of the whole notebook, including the workspace volume that contains all the imported libraries. Thus, all the newly added dependencies were included. We will explore Rok and Snapshotting further in the next section.
When the run is complete, click on the Results step and go to the Visualizations tab. You’ll notice that all the predictors show a score of 100%. An experienced data scientist should immediately find this suspicious: the models are not generalizing, either because we are overfitting the training dataset or because there is some other mistake in the input features. Most likely, there is an issue with the data the models consume.
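One classic way to get a "perfect" score like this is target leakage: a feature that encodes the label itself. The toy demonstration below is not the notebook's actual bug, just an illustration of why a 100% score should raise eyebrows:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)          # random binary labels
noise = rng.normal(size=(200, 3))         # genuinely uninformative features
# Leak the label in as a "feature" - the model now looks unbeatable.
X = np.column_stack([noise, y])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
acc = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
# The tree simply splits on the leaked column, scoring 100% on unseen data.
```

Whenever every model agrees at 100%, auditing the features for this kind of leak is a good first move.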
Stay tuned for “Kubeflow Pipelines: Kaggle’s Titanic Disaster Survivor Prediction – Part 2” where we’re going to use Rok Snapshots to resolve the above problem!
- Get started with Kubeflow in just minutes, for free. No credit card required!
- Try out the Titanic Disaster use case in Arrikto Academy.
- Try your hand at converting a Kaggle competition into a Kubeflow Pipeline.
- Sign Up for an Instructor-Led Overview of the Kaggle Competition and the Notebook.
Are you ready to put what you’ve learned into practice with hands-on labs? Then check out Arrikto Academy! On this site you’ll find a variety of FREE skills-building labs and tutorials including:
- Kubeflow Use Cases: Kaggle OpenVaccine, Kaggle Titanic Disaster, Kaggle Blue Book for Bulldozers, Dog Breed Classification, Distributed Training, Kaggle Digit Recognizer Competition
- Kubeflow Functionality – Kale, Katib
- Enterprise Kubeflow Skills – Kale SDK, Rok Registry