Toward Practical Federated Learning
Mosharaf Chowdhury, Morris Wellman Assistant Professor of CSE at the University of Michigan
Abstract: Although theoretical federated learning research is growing exponentially, we are far from putting those theories into practice. In this talk, I will share our ventures into building practical systems for two extremes of federated learning. Sol is a cross-silo federated learning and analytics system that tackles the network latency and bandwidth challenges faced by distributed computation between far-apart data sites. Oort, in contrast, is a cross-device federated learning system that enables training and testing on representative data distributions despite unpredictable device availability. Both deal with systems and network characteristics in the wild that are hard to account for in analytical models. I’ll then share the challenges in systematically evaluating federated learning systems that have led to a disconnect between theoretical conclusions and performance in the wild. I’ll conclude this talk by introducing FedScale, an extensible framework for evaluation and benchmarking in realistic settings that aims to democratize practical federated learning for researchers and practitioners alike. All these systems are open-source and available at https://github.com/symbioticlab.
Biography: Mosharaf Chowdhury is the Morris Wellman Assistant Professor of CSE at the University of Michigan, Ann Arbor, where he leads the SymbioticLab. His work improves application performance and system efficiency of machine learning and big data workloads. He is also building software solutions to allow users to monitor and optimize the impact of machine learning systems on energy consumption and user privacy. His group developed Infiniswap, the first scalable software solution for memory disaggregation; Salus, the first software-only GPU sharing system for deep learning; Sol, the fastest multi-cloud data processing engine; and FedScale, the largest federated learning benchmark with an accompanying runtime. In the past, Mosharaf did seminal work on coflows and virtual network embedding, and he was a co-creator of Apache Spark. He has received many individual awards and fellowships, thanks to his stellar students and collaborators. His work has received seven paper awards from top venues, including NSDI, OSDI, and ATC, and over 22,000 citations. He received his Ph.D. from UC Berkeley in 2015.
Pushing the Limits of Learning-Augmented Adaptation in Networked Systems
Junchen Jiang, Assistant Professor of Computer Science at The University of Chicago
Abstract: ML-inspired techniques are transforming many classic problems in the networking and systems communities by formulating them as standard learning (often reinforcement learning) problems and solving them as such. However, despite much interest from industry, these advances are sometimes met with lukewarm adoption in real-world systems. To understand this gap, this talk discusses our recent efforts in applying ML/RL to two systems problems (congestion control and cloud resource reservation). Our experience shows that the ML literature is ripe enough that, by carefully choosing suitable formulations and techniques, we can design more efficient and practical solutions for real systems. In particular, better solutions often result from using the right formulation to best capture well-studied structures of the targeted systems or to harness non-ML, domain-specific solutions developed over the decades. Yet, these changes cannot be brought about without joint work among ML researchers, systems researchers, and operators.
Biography: Junchen Jiang is an Assistant Professor of Computer Science at the University of Chicago. He received his PhD degree from CMU in 2017 and his bachelor’s degree from Tsinghua in 2011. His research interests are networked systems, multimedia systems, and their intersections with machine learning. He is a recipient of a Google Faculty Research Award, NSF CAREER Award, and CMU Computer Science Doctoral Dissertation Award.
Optimizing Contributions to Distributed, Networked Learning
Carlee Joe-Wong, Assistant Professor of Electrical and Computer Engineering at Carnegie Mellon University
Abstract: The rapid expansion of Internet-connected, compute-equipped “things” has greatly expanded the amount of data that can be collected about many types of systems, from smart cities to mobile applications to personal health. Making use of this data, however, requires effectively leveraging computing resources to run data analysis algorithms (e.g., machine learning inference or training). Unfortunately, the “things” at which all of this data is collected are often resource-constrained, e.g., with limited power budgets, unreliable network connectivity, and/or limited computing capabilities. Distributed learning algorithms such as federated learning aim to address these challenges, but they are generally not optimized to run on networks of devices with limited, heterogeneous, and unreliable computing and communication resources. In this talk, I will present new variants of federated learning algorithms that provide theoretical convergence guarantees and good empirical performance in the presence of such resource limitations. By carefully designing algorithms for each stage in the distributed machine learning pipeline (data collection, data analysis, and communication across devices), we can realize significant improvements in the accuracy of our trained models.
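To make the setting concrete, the toy round below performs vanilla federated averaging while modeling the unreliability the abstract describes: only a sample of clients is contacted, some drop out mid-round, and the surviving updates are weighted by local data size. This is a minimal illustrative sketch under assumed sampling and failure models, not the convergence-guaranteed variants presented in the talk; all names and parameters are hypothetical.

```python
import random

def local_update(weights, data, lr=0.1):
    """One gradient-descent step on a client's local data.
    Toy 1-D least-squares model: loss = mean((w*x - y)^2)."""
    grad = sum(2 * (weights * x - y) * x for x, y in data) / len(data)
    return weights - lr * grad

def fedavg_round(global_w, clients, sample_frac=0.5, drop_prob=0.3):
    """One round of federated averaging with unreliable participation.
    `clients` maps client id -> local dataset of (x, y) pairs.
    Dropped clients model devices that lose power or connectivity."""
    selected = random.sample(list(clients), max(1, int(sample_frac * len(clients))))
    updates = []
    for cid in selected:
        if random.random() < drop_prob:   # device went offline this round
            continue
        updates.append((len(clients[cid]), local_update(global_w, clients[cid])))
    if not updates:                       # nobody reported back; keep old model
        return global_w
    total = sum(n for n, _ in updates)
    return sum(n * u for n, u in updates) / total  # weight by local data size

# Usage: every client's data satisfies y = 2x, so w should approach 2
# despite sampling and dropouts; clients differ in data volume.
random.seed(0)
clients = {i: [(1, 2), (2, 4)] * (i + 1) for i in range(10)}
w = 0.0
for _ in range(50):
    w = fedavg_round(w, clients)
print(round(w, 2))  # prints 2.0
```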
Biography: Carlee Joe-Wong is the Robert E. Doherty Associate Professor of Electrical and Computer Engineering at Carnegie Mellon University. She received her A.B. degree (magna cum laude) in Mathematics, and M.A. and Ph.D. degrees in Applied and Computational Mathematics, from Princeton University in 2011, 2013, and 2016, respectively. Dr. Joe-Wong’s research is in optimizing networked systems, particularly on applying machine learning and pricing to resource allocation in data and computing networks. From 2013 to 2014, she was the Director of Advanced Research at DataMi, a startup she co-founded from her Ph.D. research on mobile data pricing. Her research has received several awards, including the NSF CAREER Award in 2018.
The Hyper-Converged Programmable Gateway in Alibaba Edge Cloud
Hongqiang Liu, Director of R&D at Alibaba Group
Abstract: Edge cloud provides significant performance and cost advantages for emerging applications such as cloud gaming, video conferencing, and AR/VR. However, unlike central clouds, edge clouds face tremendous challenges due to limited resources, demands for high performance, and hardware heterogeneity. Alibaba solves these problems by introducing a hyper-converged gateway platform "SNA" that provides the cloud network stack and network functions within the network rather than on the hosts. SNA is a heterogeneous computing platform that merges network switching, network virtualization, and various network functions on top of programmable network ASICs, FPGAs, and CPUs. It has been deployed to support several multi-million-user products in Alibaba's edge cloud. The key technical enabler of the rapid and safe deployment of the hyper-converged gateways running on SNA is our programmable network development platform "TaiX," which provides novel and practical programming abstractions, compilers, debuggers, testers, orchestrators, and operation tools.
Biography: Hongqiang “Harry” Liu is a Director of Network Research and Edge Network Infrastructure Engineering in Alibaba Cloud and Alibaba DAMO Academy. He received his Ph.D. degree from the Department of Computer Science at Yale University in 2014. His research focuses on data center networks, network transports, and programmable networks. He has published more than 20 papers in top-tier academic conferences, such as ACM SIGCOMM, ACM SOSP, and USENIX NSDI. He also serves on the technical program committees of SIGCOMM and NSDI. He is the recipient of the prestigious ACM SIGCOMM Doctoral Dissertation Award – Honorable Mention in 2015.
Untangling Interconnection in the Mobile Ecosystem
Andra Lutu, Senior Researcher at Telefónica Research
Abstract: The IP eXchange (IPX) Network interconnects about 800 Mobile Network Operators (MNOs) worldwide and a range of other service providers (such as cloud and content providers) to form the core that enables global data roaming. Global roaming now supports the fast growth of the Internet of Things, as well as responds to the insatiable demand coming from digital nomads, who adhere to a lifestyle where they connect from anywhere in the world.
In this talk, we’ll take a first look into this so-far opaque mobile ecosystem and present a first-of-its-kind in-depth analysis of an operational IPX Provider (IPX-P). The IPX Network is a private network formed by a small set of tightly interconnected IPX-Ps. We analyze an operational dataset from a large IPX-P that includes BGP data as well as signaling statistics. We shed light on the structure of the IPX Network as well as on the temporal, structural, and geographic features of the IPX traffic. Our results are a first step toward fully understanding the global mobile Internet, especially since it now plays a pivotal role in connecting IoT devices and digital nomads all over the world.
Biography: Andra Lutu is a Senior Researcher at Telefonica Research, in Madrid, Spain. Her main research interests lie in the areas of network measurements, interdomain routing and mobile networks. As part of Telefonica Research, Andra has been the recipient of an H2020 Marie Curie Individual Fellowship grant funding her work on Dynamic Interconnections for the Cellular Ecosystem (DICE), which is partly included in this talk.
Machine Learning for Sketches and Sketches for Machine Learning
Tong Yang, Associate Professor at Peking University
Abstract: Sketches, a class of probabilistic algorithms, have been widely accepted as the most promising solution for network measurement, and a series of papers about sketches has been published in SIGCOMM, SIGKDD, SIGMOD, and NSDI. On the one hand, sketches can be used to encode and compress gradients, significantly reducing bandwidth usage. On the other hand, the error of sketches can be learned and reduced by machine learning.
ML2Sketch: This talk first presents the idea of employing machine learning to reduce the dependence of sketch accuracy on network traffic characteristics, and presents a generalized machine learning framework that significantly increases the accuracy of sketches. We further present three case studies in which we apply our framework to sketches measuring three well-known flow-level network metrics. Experimental results show that machine learning helps decrease the error rates of existing sketches by up to 202 times.
Sketch2ML: This talk then presents two sketches (MinMaxSketch and Cluster-Reduce) that compress the gradients transferred through the network in distributed ML. MinMaxSketch builds a set of hash tables and resolves hash collisions with a MinMax strategy. The key technique of Cluster-Reduce is to cluster adjacent counters with similar values in the sketch, which significantly improves accuracy. Extensive experimental results show that Cluster-Reduce can achieve up to 60 times smaller error than prior work.
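The snippet below is not the MinMaxSketch or Cluster-Reduce algorithms from the talk; it is a generic count sketch, the standard structure such work builds on, illustrating how gradient coordinates can be hashed into a small table of signed counters and approximately recovered. The hash scheme, parameters, and demo gradient are all illustrative assumptions.

```python
import random
import zlib

class CountSketch:
    """A rows-by-width count sketch with random signs: values that
    collide in a counter partially cancel, and the median estimate
    across rows keeps the per-key error small."""
    def __init__(self, rows=5, width=64):
        self.rows, self.width = rows, width
        self.table = [[0.0] * width for _ in range(rows)]

    def _slot(self, r, key):
        # Two deterministic hashes per row: one for the column, one for the sign.
        j = zlib.crc32(f"pos:{r}:{key}".encode()) % self.width
        s = 1 if zlib.crc32(f"sign:{r}:{key}".encode()) % 2 == 0 else -1
        return j, s

    def add(self, key, value):
        for r in range(self.rows):
            j, s = self._slot(r, key)
            self.table[r][j] += s * value

    def estimate(self, key):
        ests = sorted(s * self.table[r][j]
                      for r in range(self.rows)
                      for j, s in [self._slot(r, key)])
        return ests[len(ests) // 2]  # median across rows

# Usage: compress a 500-dimensional gradient into 5 * 64 = 320 counters;
# large coordinates survive compression, small noise mostly cancels.
random.seed(1)
grad = {i: random.gauss(0, 0.01) for i in range(500)}
grad[7] = 1.0    # two heavy coordinates
grad[42] = -2.0
cs = CountSketch()
for k, v in grad.items():
    cs.add(k, v)
print(round(cs.estimate(7), 2), round(cs.estimate(42), 2))
```

Real gradient-compression sketches add machinery on top of this skeleton, such as smarter collision handling (as in MinMaxSketch) or counter clustering (as in Cluster-Reduce), precisely because plain collisions are the dominant error source.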
Biography: Tong Yang received his Ph.D. degree in Computer Science from Tsinghua University in 2013. He visited the Institute of Computing Technology, Chinese Academy of Sciences (CAS), China, from July 2013 to July 2014. He is now an associate professor in the Department of Computer Science and Technology, Peking University. His research interests focus on networking algorithms, such as sketches, IP lookups, and Bloom filters. He has published nearly 20 papers in top conferences (SIGCOMM, SIGKDD, SIGMOD, and NSDI).