Training and AutoML Summit Recap – Part 1

Did you miss the AutoML and Training working groups’ summit back in July? If so, all the talks from the event have been uploaded to YouTube.

A reminder: if you attended the Summit, the organizers kindly ask you to complete this survey. Your answers will help the Kubeflow contributors!

In part one of this two-part blog series, we’ll give you an executive summary of the first batch of the day’s talks.

If you are new to Kubeflow and AutoML

The Kubeflow project is organized into working groups with associated GitHub repositories that focus on specific pieces of the ML platform. These include:

  • AutoML
  • Deployment
  • Manifests
  • Notebooks
  • Pipelines
  • Serving
  • Training

As the name suggests, the goal of automated machine learning (AutoML) is to automate as many of the tasks associated with machine learning as possible. In a perfect world, AutoML allows people who are not data science experts to make use of machine learning models and techniques and apply them to their problems. Aside from making machine learning more accessible to non-experts, AutoML also has the advantage of creating solutions that are easier to understand, can be designed quickly, and are pre-optimized compared to those that are “hand-rolled” from scratch.

The tasks AutoML seeks to dramatically simplify include:

  • Data pre-processing
  • Feature engineering
  • Feature extraction
  • Feature selection
  • Algorithm selection
  • Hyperparameter tuning

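To make the last of these tasks concrete, here is a minimal, self-contained sketch of hyperparameter tuning via random search. The objective function is synthetic (a stand-in for training a model and measuring validation accuracy), and the parameter names and ranges are illustrative, not taken from any Kubeflow component:

```python
import random

def train_and_evaluate(learning_rate, num_layers):
    """Toy stand-in for a real training run: returns a synthetic
    'validation accuracy' that peaks near lr=0.1 and num_layers=3."""
    return 1.0 - abs(learning_rate - 0.1) - 0.05 * abs(num_layers - 3)

def random_search(num_trials=50, seed=42):
    """Sample hyperparameters at random and keep the best-scoring trial."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(num_trials):
        params = {
            "learning_rate": rng.uniform(0.001, 0.5),
            "num_layers": rng.randint(1, 8),
        }
        score = train_and_evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = random_search()
print(best_params, round(best_score, 3))
```

Tools like Katib automate exactly this loop at scale, swapping the toy objective for real training jobs and the random sampler for smarter search algorithms such as Bayesian optimization.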
As you can imagine, AutoML is “kind of a big deal” in the context of making Kubeflow accessible to experts and non-experts alike. This is where having a dedicated AutoML working group comes in. The working group’s chairs include:

  • Andrey Velichkevich, Cisco
  • Ce Gao, Caicloud
  • Johnu George, Nutanix

The co-organizers of the Summit were the folks from the Training working group. This group covers developing, deploying, and operating training jobs on Kubeflow. The working group’s chairs include:

  • Ce Gao, Caicloud
  • Johnu George, Nutanix
  • Yuan Tang, Ant Group

OK, let’s look at a few previews of the first batch of talks!

Paddle Operator & EDL Introduction

In this talk, Ti Zhou of Baidu introduced the PaddlePaddle project. He explained why Baidu started using Kubeflow as the foundation for their platform and introduced many of the details concerning the implementation of the Paddle operator.

Talk Highlights

  • Since 2012, Baidu has been leveraging deep learning and developing their platform
  • An overview of PaddlePaddle (tools & components, development kits, models and the core framework)
  • An overview of some of the more than 270 NLP, CV, speech and recommendation models that are supported
  • How distributed training works in PaddlePaddle
  • A look at the Paddle Operator and EDL architecture
  • Highlights of the recent releases
  • Benchmarks and integrations

DGL Operator and Graph Training

In this talk, Xiaoyu Zhai of Qihoo 360’s AI infrastructure team talked about the background of the DGL (Deep Graph Library) framework, and the philosophy of native DGL distributed training. He then went on to illustrate some of the challenges and limitations of going to production and offered some solutions that included Kubernetes and the DGL Operator. He wrapped things up with an overview of the implementation details of the DGL Operator.

Talk Highlights

  • Explanation of a variety of terms used in the context of the DGL framework
  • DGL’s origins at Amazon
  • What is GNN? What is DGL?
  • How DGL distributed training works
  • The native way of running DGL distributed training and its challenges
  • How to solve the challenges
  • Overview of DGL Operator
  • The implementation of the DGL Operator (data loading, partitioning, workflows)
  • Examples of DGL in action

Building Real Time Image Classification with Kubeflow Orchestrator

In this talk, Aniruddha Choudhury of Publicis Sapient showed how to build a pipeline for real-time image classification using AutoML and Katib integration, exposing the endpoints with KFServing and Minio.

Talk Highlights

  • Teach “A” use cases
  • Architecture overview
  • Structuring the Kubeflow pipeline end-to-end training component
  • Building the AutoML Bayesian Framework
  • Building the KFServing layer
  • Setting the Kafka and Minio connector with a Kafka source event
  • Building a production pipeline
  • Serving the endpoint with a real-time image in Minio
  • Monitoring with Grafana

Katib User Journey

In this talk, Johnu George of Nutanix walked us through the creation of a model and then tuning the model’s hyperparameters using Katib. He then talked about the internal architecture and various configuration options for the experiment.

Talk Highlights

  • What is hyperparameter tuning and why is it hard?
  • Intro to the Katib hyperparameter tuner
  • Understanding experiments and trial workers
  • System architecture
  • A sample experiment and trial
  • Demo!
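To give a flavor of what configuring a Katib experiment involves, here is an illustrative Experiment spec, written as a Python dict for readability (in practice this would be a YAML manifest applied to the cluster). The field names follow the `kubeflow.org/v1beta1` Experiment CRD, but the metric name, parameter names, and ranges below are placeholders of our own, not details from the talk:

```python
# Illustrative Katib Experiment: random search over two hyperparameters,
# maximizing a validation-accuracy metric reported by each trial.
experiment = {
    "apiVersion": "kubeflow.org/v1beta1",
    "kind": "Experiment",
    "metadata": {"name": "random-search-example"},
    "spec": {
        # What to optimize, and when to stop early if the goal is met.
        "objective": {
            "type": "maximize",
            "goal": 0.95,
            "objectiveMetricName": "validation-accuracy",
        },
        # Which search strategy the suggestion service should use.
        "algorithm": {"algorithmName": "random"},
        "parallelTrialCount": 3,
        "maxTrialCount": 12,
        # The hyperparameter search space explored across trials.
        "parameters": [
            {
                "name": "lr",
                "parameterType": "double",
                "feasibleSpace": {"min": "0.001", "max": "0.1"},
            },
            {
                "name": "num-layers",
                "parameterType": "int",
                "feasibleSpace": {"min": "2", "max": "5"},
            },
        ],
        # The trialTemplate (the training job each Trial actually runs)
        # is omitted here for brevity.
    },
}
print(experiment["spec"]["algorithm"]["algorithmName"])
```

Katib then creates a Trial for each suggested parameter combination, runs the training job defined in the trial template, and collects the reported metric to drive the search.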

Tour of New Katib UI

In this talk, Kimonas Sotirchos of Arrikto took us through the inner workings of the new Katib UI and the workflows it enables. He showed us how we can create and track an Experiment, as well as its underlying Trials, via the UI, and also gave us a quick roadmap update.

Talk Highlights

  • Overview and rationale behind the new UI
  • Demo showing hyperparameter tuning!
  • Inspecting and navigating experiment detail charts
  • What’s missing, being worked on and what’s next

Stay tuned for Part 2 of this series next week!

Book a FREE Kubeflow and MLOps workshop

This FREE virtual workshop is designed with data scientists, machine learning developers, DevOps engineers and infrastructure operators in mind. The workshop covers basic and advanced topics related to Kubeflow, MiniKF, Rok, Katib and KFServing. In the workshop you’ll gain a solid understanding of how these components can work together to help you bring machine learning models to production faster. Click to schedule a workshop for your team.

About Arrikto

At Arrikto, we are active members of the Kubeflow community, having made significant contributions to the latest 1.3 release. Our projects/products include:

  • MiniKF, a production-ready, local Kubeflow deployment that installs in minutes, and understands how to downscale your infrastructure 
  • Enterprise Kubeflow (EKF), a complete machine learning operations platform that simplifies, accelerates, and secures the machine learning model development life cycle with Kubeflow
  • Rok is a data management solution for Kubeflow. Rok’s built-in Kubeflow integration simplifies operations and increases performance, while enabling data versioning, packaging, and secure sharing across teams and cloud boundaries.
  • Kale, a workflow tool for Kubeflow, which orchestrates all of Kubeflow’s components seamlessly.

MiniKF is the simplest way to get started with Kubeflow and Rok on any platform

Turbocharge your team’s Kubeflow and MLOps skills with a free workshop