Training and AutoML Summit Recap – Part 1

Did you miss the AutoML and Training working groups’ summit back in July? If so, all the talks from the event have been uploaded to YouTube.

A reminder: if you attended the Summit, the organizers kindly ask you to complete this survey. Your answers will help the Kubeflow contributors!

In part one of this two-part blog series, we’ll give you an executive summary of the first batch of the day’s talks.

If you are new to Kubeflow and AutoML

The Kubeflow project is organized into working groups with associated GitHub repositories that focus on specific pieces of the ML platform. These include:

  • AutoML
  • Deployment
  • Manifests
  • Notebooks
  • Pipelines
  • Serving
  • Training

As the name suggests, the goal of automated machine learning (AutoML) is to automate as many of the tasks associated with machine learning as possible. In a perfect world, AutoML allows people who are not data science experts to make use of machine learning models and techniques and apply them to their problems. Aside from making machine learning more accessible to non-experts, AutoML also has the advantage of creating solutions that are easier to understand, can be designed quickly, and are pre-optimized compared to those that are “hand-rolled” from scratch.

The tasks AutoML seeks to dramatically simplify include:

  • Data pre-processing
  • Feature engineering
  • Feature extraction
  • Feature selection
  • Algorithm selection
  • Hyperparameter tuning

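To make the last of these tasks concrete, here is a minimal, self-contained sketch of hyperparameter tuning via random search. The objective function is synthetic (a stand-in for training a model and measuring validation accuracy), and the parameter names and ranges are illustrative, not taken from any Kubeflow component:

```python
import random

def train_and_evaluate(learning_rate, num_layers):
    """Toy stand-in for a real training run: returns a synthetic
    'validation accuracy' that peaks near lr=0.1 and num_layers=3."""
    return 1.0 - abs(learning_rate - 0.1) - 0.05 * abs(num_layers - 3)

def random_search(num_trials=50, seed=42):
    """Sample hyperparameters at random and keep the best-scoring trial."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(num_trials):
        params = {
            "learning_rate": rng.uniform(0.001, 0.5),
            "num_layers": rng.randint(1, 8),
        }
        score = train_and_evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = random_search()
print(best_params, round(best_score, 3))
```

Tools like Katib automate exactly this loop at scale, swapping the toy objective for real training jobs and the random sampler for smarter search algorithms such as Bayesian optimization.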
As you can imagine, AutoML is “kind of a big deal” in the context of making Kubeflow accessible to experts and non-experts alike. This is where having a dedicated AutoML working group comes in. The working group’s chairs include:

  • Andrey Velichkevich, Cisco
  • Ce Gao, Caicloud
  • Johnu George, Nutanix

The co-organizers of the Summit were the folks from the Training working group. This group covers developing, deploying, and operating training jobs on Kubeflow. The working group’s chairs include:

  • Ce Gao, Caicloud
  • Johnu George, Nutanix
  • Yuan Tang, Ant Group

OK, let’s look at a few previews of the first batch of talks!

Paddle Operator & EDL Introduction

In this talk, Ti Zhou of Baidu introduced the PaddlePaddle project. He explained why Baidu started using Kubeflow as the foundation for their platform and introduced many of the details concerning the implementation of the Paddle operator.

Talk Highlights

  • Since 2012, Baidu has been leveraging deep learning and developing their platform
  • An overview of PaddlePaddle (tools & components, development kits, models and the core framework)
  • An overview of some of the more than 270 NLP, CV, speech and recommendation models that are supported
  • How distributed training works in PaddlePaddle
  • A look at the Paddle Operator and EDL architecture
  • Highlights of the recent releases
  • Benchmarks and integrations

DGL Operator and Graph Training

In this talk, Xiaoyu Zhai of Qihoo 360’s AI infrastructure team talked about the background of the DGL (Deep Graph Library) framework, and the philosophy of native DGL distributed training. He then went on to illustrate some of the challenges and limitations of going to production and offered some solutions that included Kubernetes and the DGL Operator. He wrapped things up with an overview of the implementation details of the DGL Operator.

Talk Highlights

  • Explanation of a variety of terms used in the context of the DGL framework
  • DGL’s origins at Amazon
  • What is GNN? What is DGL?
  • How DGL distributed training works
  • The native way of running DGL distributed training and its challenges
  • How to solve the challenges
  • Overview of DGL Operator
  • The implementation of the DGL Operator (data loading, partitioning, workflows)
  • Examples of DGL in action

Building Real Time Image Classification with Kubeflow Orchestrator

In this talk, Aniruddha Choudhury of Publicis Sapient showed how to build a pipeline for real-time image classification using AutoML and Katib integration, exposing the endpoints with KFServing and Minio.

Talk Highlights

  • Teach “A” use cases
  • Architecture overview
  • Structuring the Kubeflow pipeline end-to-end training component
  • Building the AutoML Bayesian Framework
  • Building the KFServing layer
  • Setting the Kafka and Minio connector with a Kafka source event
  • Building a production pipeline
  • Serving the endpoint with a real-time image in Minio
  • Monitoring with Grafana

Katib User Journey

In this talk, Johnu George of Nutanix walked us through the creation of a model and then tuning the model’s hyperparameters using Katib. He then talked about the internal architecture and various configuration options for the experiment.

Talk Highlights

  • What is hyperparameter tuning and why is it hard?
  • Intro to the Katib hyperparameter tuner
  • Understanding experiments and trial workers
  • System architecture
  • A sample experiment and trial
  • Demo!
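To give a flavor of what configuring a Katib experiment involves, here is an illustrative Experiment spec, written as a Python dict for readability (in practice this would be a YAML manifest applied to the cluster). The field names follow the `kubeflow.org/v1beta1` Experiment CRD, but the metric name, parameter names, and ranges below are placeholders of our own, not details from the talk:

```python
# Illustrative Katib Experiment: random search over two hyperparameters,
# maximizing a validation-accuracy metric reported by each trial.
experiment = {
    "apiVersion": "kubeflow.org/v1beta1",
    "kind": "Experiment",
    "metadata": {"name": "random-search-example"},
    "spec": {
        # What to optimize, and when to stop early if the goal is met.
        "objective": {
            "type": "maximize",
            "goal": 0.95,
            "objectiveMetricName": "validation-accuracy",
        },
        # Which search strategy the suggestion service should use.
        "algorithm": {"algorithmName": "random"},
        "parallelTrialCount": 3,
        "maxTrialCount": 12,
        # The hyperparameter search space explored across trials.
        "parameters": [
            {
                "name": "lr",
                "parameterType": "double",
                "feasibleSpace": {"min": "0.001", "max": "0.1"},
            },
            {
                "name": "num-layers",
                "parameterType": "int",
                "feasibleSpace": {"min": "2", "max": "5"},
            },
        ],
        # The trialTemplate (the training job each Trial actually runs)
        # is omitted here for brevity.
    },
}
print(experiment["spec"]["algorithm"]["algorithmName"])
```

Katib then creates a Trial for each suggested parameter combination, runs the training job defined in the trial template, and collects the reported metric to drive the search.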

Tour of New Katib UI

In this talk, Kimonas Sotirchos of Arrikto took us through the inner workings of the new Katib UI and the workflows it enables. He showed us how we can create and track an Experiment, as well as its underlying Trials, via the UI, and also gave us a quick roadmap update.

Talk Highlights

  • Overview and rationale behind the new UI
  • Demo showing hyperparameter tuning!
  • Inspecting and navigating experiment detail charts
  • What’s missing, being worked on and what’s next

Stay tuned for Part 2 of this series next week!

Book a FREE Kubeflow and MLOps workshop

This FREE virtual workshop is designed with data scientists, machine learning developers, DevOps engineers and infrastructure operators in mind. The workshop covers basic and advanced topics related to Kubeflow, MiniKF, Rok, Katib and KFServing. In the workshop you’ll gain a solid understanding of how these components can work together to help you bring machine learning models to production faster. Click to schedule a workshop for your team.

About Arrikto

At Arrikto, we are active members of the Kubeflow community, having made significant contributions to the latest 1.3 release. Our projects/products include:

  • MiniKF, a production-ready, local Kubeflow deployment that installs in minutes, and understands how to downscale your infrastructure 
  • Enterprise Kubeflow (EKF), a complete machine learning operations platform that simplifies, accelerates, and secures the machine learning model development life cycle with Kubeflow
  • Rok is a data management solution for Kubeflow. Rok’s built-in Kubeflow integration simplifies operations and increases performance, while enabling data versioning, packaging, and secure sharing across teams and cloud boundaries.
  • Kale, a workflow tool for Kubeflow, which orchestrates all of Kubeflow’s components seamlessly.

MiniKF is the simplest way to get started with Kubeflow and Rok on any platform

Turbocharge your team’s Kubeflow and MLOps skills with a free workshop