Thursday, March 30 • 11:30 - 12:05
Building Distributed TensorFlow Using Both GPU and CPU on Kubernetes [I] - Zeyu Zheng & Huizhi Zhao, Caicloud

Big Data and Machine Learning have become extremely hot topics in recent years. Google has announced its AI-centric strategy and released the deep learning toolkit TensorFlow, which soon became the most popular open source toolkit for deep learning applications. However, training a large deep learning model on a single machine without a GPU may take years. To accelerate the training process, we built a distributed TensorFlow system on Kubernetes that supports both CPUs and GPUs.

In this presentation, I’d like to share our experience building this distributed TensorFlow system on Kubernetes. First, I'll briefly introduce TensorFlow and how it supports distributed model training. However, the original distribution mechanism lacks many components, such as scheduling, monitoring, and life cycle management, that are needed to make it suitable for production use.

In the rest of the presentation, I'll focus on how to leverage Kubernetes to solve those problems. The solution involves three components. First, I'll introduce how to schedule TensorFlow jobs in a cluster with both CPUs and GPUs. Then I'll share our experience in managing the life cycle of a distributed TensorFlow job. Finally, I'll describe our efforts to lower the bar for using distributed TensorFlow.
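
As background for the distribution mechanism mentioned above: distributed TensorFlow describes a cluster as named jobs (conventionally "ps" for parameter servers and "worker" for training tasks), each with a list of "host:port" task addresses. The sketch below only builds such a dictionary in plain Python, in the shape that `tf.train.ClusterSpec` accepts. It is not the system presented in the talk; the helper names, the `tf-ps-N` and `tf-worker-N` hostnames (imagined here as per-task Kubernetes Service names), and the port are all assumptions made for illustration.

```python
def build_cluster_spec(num_ps, num_workers, port=2222):
    """Return a dict mapping each job name to its list of task addresses.

    The hostnames are hypothetical; in a Kubernetes deployment they might be
    the DNS names of one Service per task.
    """
    if num_ps < 0 or num_workers < 1:
        raise ValueError("need at least one worker and a non-negative ps count")
    return {
        "ps": ["tf-ps-%d:%d" % (i, port) for i in range(num_ps)],
        "worker": ["tf-worker-%d:%d" % (i, port) for i in range(num_workers)],
    }


def task_address(cluster, job_name, task_index):
    """Look up the address of one task, validating the job name and index."""
    if job_name not in cluster:
        raise KeyError("unknown job: %s" % job_name)
    tasks = cluster[job_name]
    if not 0 <= task_index < len(tasks):
        raise IndexError("task index out of range for job %s" % job_name)
    return tasks[task_index]


if __name__ == "__main__":
    cluster = build_cluster_spec(num_ps=1, num_workers=2)
    for job, tasks in sorted(cluster.items()):
        print(job, tasks)
```

With TensorFlow installed, each task would typically pass a dictionary like this to `tf.train.ClusterSpec` and start a `tf.train.Server` with its own job name and task index; every task needs the same cluster description, which is one reason an orchestrator is useful for generating it.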


Huizhi Zhao

Software Engineer, Caicloud

Zeyu Zheng

Chief Data Scientist, Caicloud
Zeyu is chief data scientist and co-founder at Caicloud, which provides Cloud and Big Data related services. He leads the efforts to build reliable and scalable data analysis and machine learning platforms like Hadoop, Spark and TensorFlow on Kubernetes. His team has developed Machine...

Thursday March 30, 2017 11:30 - 12:05 CEST
B 07 - B 08 Berlin Congress Center, Alexanderstraße 11, 10178 Berlin, Germany