In-person + Virtual
October 11-15

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2021 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Friday, October 15 • 5:25pm - 6:00pm
Scaling Kubeflow for Multi-tenancy at Spotify - Keshi Dai & Jonathan Jin, Spotify

Spotify began offering a centralized Kubeflow Pipelines product to its machine learning teams around two years ago. Since then, adoption has skyrocketed, with more teams training more models and running increasingly complex experiments. These increased demands on our system come with more stringent demands on us, the Kubeflow team at Spotify, to ensure not just cluster reliability, but cluster equitability. Our job is to be not just cluster maintainers but cluster stewards, ensuring equitable and reliable access to cluster resources and keeping users from stepping on each other's toes. In this talk, we'll discuss our streamlined tooling to maintain, deploy, and monitor Spotify's distribution of Kubeflow. We'll illustrate the challenges we face as we scale to increased user load and increasingly distinct and demanding pipelines, and outline our approach to addressing those challenges with "multi-cluster" Kubeflow. Finally, we'll give a preview of our future plans for the platform.

Speakers
Keshi Dai

ML Infra Engineer, Spotify
Keshi Dai is a Senior ML Engineer on the Spotify Machine Learning platform team. He has been building and managing a centralized Kubeflow platform to help machine learning engineers at Spotify adopt Kubernetes. Recently, he has also been leading the effort to evaluate managed...
Jonathan Jin

Senior ML Infrastructure Engineer, Spotify
Jonathan Jin is a senior engineer at Spotify working on machine learning platform and infrastructure. Previously, he worked on AI infrastructure at NVIDIA and Twitter, and on observability infrastructure at Uber.


Friday October 15, 2021 5:25pm - 6:00pm PDT
411 Theater + Online