Cornell Tech

ECE 5290/7290 & ORIE 5290: Distributed Optimization for Machine Learning and AI

Instructor: Tianyi Chen  |  tianyi.chen@cornell.edu  |  Semester: Fall 2025

Class time: MoWe 1:25 PM – 2:40 PM  |  Location: Cornell Tech, Bloomberg Center 161

Overview

This is a graduate-level course about theory, algorithms, and applications of distributed optimization and machine learning. The course covers the basics of distributed optimization and learning algorithms, and their performance analyses when they are used to solve large-scale distributed problems arising in AI, machine learning, signal processing, communication networks, and power systems.

Public resources: The lecture slides and assignments will be posted online as the course progresses. We are happy for anyone to learn from these resources, but we cannot grade the work of unenrolled students.

Cornell students: Students should ask all course-related questions on Ed Discussion, submit homework on Gradescope, and find all the announcements on Canvas.

Instructor, TAs, and Graders

Tianyi Chen (Instructor)

Yuheng Wang (Teaching Assistant)

Zhaoxian Wu (Student Volunteer)

Modules

Course surveys: To help tailor the course and improve your learning experience, please complete the surveys when they are posted.

Week 1 Aug 25 – Aug 27 2 sessions
  • Mon, Aug 25
    Introduction & Motivation
    Overview of course goals, applications of distributed optimization in modern AI and ML systems, and motivating examples from large-scale model training and federated learning.
    Slides | Welcome to the course! We’ll set the stage for why distributed optimization is now central to AI at scale.
  • Wed, Aug 27
    ML fundamentals for optimization (ERM, supervised learning)
    Review of empirical risk minimization (ERM), supervised learning frameworks, convexity basics, and gradient-based learning methods.
Week 2 Sep 01 – Sep 03 2 sessions
  • Mon, Sep 01
    No class - Labor Day
  • Wed, Sep 03
    ML Fundamentals Continued: Optimization Viewpoint
    From linear and nonlinear regression to general loss functions; connecting optimization to machine learning tasks and understanding challenges in non-convex learning.
Week 3 Sep 08 – Sep 10 2 sessions
  • Mon, Sep 08
    Optimization Basics I: Gradient Descent on Quadratic Problems
    Derivation and intuition of gradient descent, convergence on quadratic objectives, and the geometry of step sizes and conditioning.
  • Wed, Sep 10
    Optimization Basics II: Convergence and Complexity
    Convergence analysis for smooth and strongly convex functions; understanding sublinear and linear rates of convergence for gradient descent (a short numerical sketch follows this week’s entries).
    Slides | HW 1 due (9/12)
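
A minimal numerical sketch of this week’s material, assuming an illustrative two-dimensional quadratic objective f(x) = 0.5 x^T A x - b^T x (the matrix A, vector b, step size, and iteration count below are examples, not values from the lecture): gradient descent with step size 1/L contracts the error at a linear rate governed by the condition number L/mu.

    import numpy as np

    # Illustrative quadratic: f(x) = 0.5 * x^T A x - b^T x, with A symmetric positive definite.
    A = np.array([[3.0, 0.5],
                  [0.5, 1.0]])
    b = np.array([1.0, -2.0])

    L = np.linalg.eigvalsh(A).max()    # smoothness constant (largest eigenvalue)
    mu = np.linalg.eigvalsh(A).min()   # strong convexity constant (smallest eigenvalue)
    x_star = np.linalg.solve(A, b)     # closed-form minimizer

    x = np.zeros(2)
    eta = 1.0 / L                      # classic 1/L step size
    for k in range(50):
        grad = A @ x - b               # gradient of the quadratic
        x = x - eta * grad

    # The error contracts roughly like (1 - mu/L)^k, i.e. at a linear rate.
    print("condition number   :", L / mu)
    print("distance to optimum:", np.linalg.norm(x - x_star))
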
Week 4 Sep 15 – Sep 17 2 sessions
  • Mon, Sep 15
    Beyond Convexity: Non-Convex Landscapes and Smoothness
    Extending gradient-based methods to non-convex settings; smoothness assumptions, saddle points, and convergence guarantees.
  • Wed, Sep 17
    Gradient Methods for Constrained Optimization
    Projected gradient methods, constraint handling in distributed settings, and convergence rates under constraints.
Week 5 Sep 22 – Sep 24 2 sessions
  • Mon, Sep 22
    Stochastic Optimization: SGD, Minibatching, and Convergence
    Fundamentals of stochastic gradient descent, convergence properties under noise, and trade-offs between batch size and computation (see the sketch after this week’s entries).
  • Wed, Sep 24
    Variance Reduction and Momentum
    Modern SGD enhancements — variance-reduced methods (SVRG) and momentum techniques for accelerating convergence.
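
As a companion to this week’s sessions, here is a minimal minibatch SGD sketch on a synthetic least-squares problem; the data, batch size, and step size are illustrative choices rather than values from the lecture.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic least-squares problem: minimize (1/n) * sum_i 0.5 * (a_i^T x - y_i)^2.
    n, d = 1000, 10
    A = rng.normal(size=(n, d))
    x_true = rng.normal(size=d)
    y = A @ x_true + 0.1 * rng.normal(size=n)

    x = np.zeros(d)
    batch_size = 32
    eta = 0.01                           # constant step size (illustrative)

    for step in range(2000):
        idx = rng.choice(n, size=batch_size, replace=False)
        # Stochastic gradient estimated on the sampled minibatch.
        grad = A[idx].T @ (A[idx] @ x - y[idx]) / batch_size
        x = x - eta * grad

    print("final training loss:", 0.5 * np.mean((A @ x - y) ** 2))
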
Week 6 Sep 29 – Oct 01 2 sessions
  • Mon, Sep 29
    Variance Reduction and Momentum in Practice
    Finite-sum optimization, adaptive methods (Adam, AdaGrad), and a deeper look at the variance–bias trade-off in stochastic learning.
    Slides | HW 2 due (9/26)
  • Wed, Oct 01
    Consensus and Spectrum of Graphs
    Fundamentals of consensus algorithms, properties of doubly-stochastic matrices, and spectral connectivity measures in networks (see the sketch below).
    Slides | Project announced
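
A minimal sketch of average consensus on a ring graph, assuming an illustrative doubly stochastic mixing matrix with uniform weights: each node repeatedly averages with its two neighbors, and the consensus error decays at a rate set by the second-largest eigenvalue modulus of the mixing matrix.

    import numpy as np

    n = 8                                  # number of nodes on a ring (illustrative)
    W = np.zeros((n, n))
    for i in range(n):
        # Each node mixes equally with itself and its two ring neighbors.
        W[i, i] = 1 / 3
        W[i, (i - 1) % n] = 1 / 3
        W[i, (i + 1) % n] = 1 / 3
    # W is symmetric and doubly stochastic: every row and column sums to one.

    x = np.arange(n, dtype=float)          # initial local values
    avg = x.mean()                         # the value all nodes should agree on

    slem = np.sort(np.abs(np.linalg.eigvalsh(W)))[-2]
    print("second-largest eigenvalue modulus:", slem)

    for _ in range(100):
        x = W @ x                          # one round of synchronous averaging
    print("consensus error after 100 rounds:", np.max(np.abs(x - avg)))
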
Week 7 Oct 06 – Oct 08 2 sessions
  • Mon, Oct 06
    Gossip and Random Walks
    Randomized gossip algorithms, asynchronous communication, and relationships between random walks and averaging in networks.
  • Wed, Oct 08
    Data and Model Parallelism in Distributed Training
    Local SGD, synchronization intervals, and communication–computation balancing in data- and model-parallel training (see the sketch below).
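
A rough illustration of data-parallel local SGD with periodic model averaging; the number of simulated workers, local steps per round, step size, and synthetic local datasets are all illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    num_workers, d = 4, 5
    local_steps, rounds, eta = 10, 20, 0.1   # communicate every `local_steps` iterations

    # Each worker holds its own least-squares data, simulating a data-parallel split.
    data = []
    for _ in range(num_workers):
        A = rng.normal(size=(100, d))
        y = A @ np.ones(d) + 0.1 * rng.normal(size=100)
        data.append((A, y))

    x_global = np.zeros(d)
    for r in range(rounds):
        local_models = []
        for A, y in data:
            x = x_global.copy()
            for _ in range(local_steps):
                i = rng.integers(len(y))          # sample one local data point
                grad = (A[i] @ x - y[i]) * A[i]   # stochastic gradient
                x = x - eta * grad
            local_models.append(x)
        # Communication happens only here: average the locally updated models.
        x_global = np.mean(local_models, axis=0)

    print("distance to the shared true model:", np.linalg.norm(x_global - np.ones(d)))
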
Week 8 Oct 13 – Oct 15 2 sessions
  • Mon, Oct 13
    No class - Fall break
  • Wed, Oct 15
    Communication-Efficient Distributed Methods
    Quantization and local updates; techniques to reduce bandwidth while preserving convergence in distributed learning (see the sketch below).
    Slides | Practice problems announced
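
To make the bandwidth-saving idea concrete, here is a sketch of simple unbiased stochastic quantization of a gradient vector before it is communicated; the number of quantization levels and the quantizer itself are illustrative and not necessarily the specific schemes covered in class.

    import numpy as np

    rng = np.random.default_rng(2)

    def stochastic_quantize(v, levels=4):
        """Unbiased stochastic quantization of v onto `levels` uniform levels per coordinate.

        Each coordinate is scaled by the vector norm and rounded up or down at random
        so that the quantized vector equals v in expectation.
        """
        norm = np.linalg.norm(v)
        if norm == 0:
            return v
        scaled = np.abs(v) / norm * levels
        lower = np.floor(scaled)
        prob_up = scaled - lower                     # probability of rounding up
        rounded = lower + (rng.random(v.shape) < prob_up)
        return np.sign(v) * rounded * norm / levels

    g = rng.normal(size=10)                          # a "gradient" to be communicated
    q = stochastic_quantize(g)
    print("relative quantization error:", np.linalg.norm(q - g) / np.linalg.norm(g))
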
Week 9 Oct 20 – Oct 22 2 sessions
  • Mon, Oct 20
    Communication-Efficient Distributed Methods
    Sparsification, and worker selections; techniques to reduce bandwidth while preserving convergence in distributed learning.
    Slides | HW 3 due (10/20)
  • Wed, Oct 22
    Decentralized Algorithms: Consensus GD and Its Convergence
    Gradient tracking algorithms and convergence under directed and time-varying graphs (a consensus-GD sketch follows below).
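
A small sketch of decentralized (consensus) gradient descent over a ring, assuming illustrative quadratic local objectives and a fixed doubly stochastic mixing matrix; with a constant step size the local iterates settle only in a neighborhood of the optimum, which is one motivation for gradient tracking.

    import numpy as np

    rng = np.random.default_rng(3)
    n, d = 6, 3                              # 6 agents, 3-dimensional decision variable

    # Local objectives f_i(x) = 0.5 * ||x - c_i||^2; the global optimum is the mean of the c_i.
    C = rng.normal(size=(n, d))
    x_star = C.mean(axis=0)

    # Ring mixing matrix (symmetric, doubly stochastic).
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3

    X = np.zeros((n, d))                     # row i is agent i's local iterate
    eta = 0.05                               # constant step size (illustrative)
    for _ in range(500):
        grads = X - C                        # local gradients, one row per agent
        X = W @ X - eta * grads              # mix with neighbors, then take a local step

    print("error of the network-average iterate:", np.linalg.norm(X.mean(axis=0) - x_star))
    print("consensus spread across agents      :", np.linalg.norm(X - X.mean(axis=0)))
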
Week 10 Oct 27 – Oct 29 2 sessions
  • Mon, Oct 27
    No class - Asynchronous office Q&A on Ed Discussion and email
    No Class | Q&A welcome on HWs 1-3 and the practice problems
  • Wed, Oct 29
    In-person Exam
    Exam | Project idea due (11/1)
Week 11 Nov 03 – Nov 05 2 sessions
  • Mon, Nov 03
    Robust distributed optimization: adversaries, attacks and defenses
    Robustness against adversarial clients, Byzantine attacks, and noisy updates; algorithmic defenses and aggregation strategies (a small aggregation sketch follows this week’s entries).
    Slides | Please make every effort to attend (attendance bonus)
  • Wed, Nov 05
    The curse of data heterogeneity in distributed learning
    Personalization techniques and handling heterogeneous data distributions in federated settings.
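
To make the aggregation-defense idea concrete (see Monday’s entry), here is a sketch comparing plain averaging with a coordinate-wise median when a few workers send corrupted gradients; the number of workers, the corruption model, and the noise levels are illustrative.

    import numpy as np

    rng = np.random.default_rng(4)
    num_workers, d, num_byzantine = 10, 5, 2

    true_grad = rng.normal(size=d)
    # Honest workers report noisy versions of the true gradient.
    grads = true_grad + 0.1 * rng.normal(size=(num_workers, d))
    # Byzantine workers report arbitrary (here, hugely scaled) vectors.
    grads[:num_byzantine] = 100.0 * rng.normal(size=(num_byzantine, d))

    mean_agg = grads.mean(axis=0)            # plain averaging: easily corrupted
    median_agg = np.median(grads, axis=0)    # coordinate-wise median: far more robust

    print("error of the mean  :", np.linalg.norm(mean_agg - true_grad))
    print("error of the median:", np.linalg.norm(median_agg - true_grad))
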
Week 12 Nov 10 – Nov 12 2 sessions
  • Mon, Nov 10
    Transformers - Architecture, Parameters, and Memories
    Transformer architectures, parameter scaling laws, and memory components relevant to efficient and distributed training.
  • Wed, Nov 12
    Memory footprint of GPT and mixed precision training
    Memory bottlenecks in GPT training, covering parameters, activations, optimizer states, and how mixed-precision methods enable efficient distributed optimization (see the sketch below).
    Slides | HW 4 due (11/14)
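
As a back-of-the-envelope companion to Wednesday’s session, here is the standard per-parameter memory accounting for mixed-precision training with Adam (fp16 weights and gradients plus fp32 master weights and two fp32 optimizer moments, about 16 bytes per parameter, activations excluded); the 7B-parameter model size below is just an example.

    # Rough per-parameter memory accounting for mixed-precision Adam training.
    # Activations, communication buffers, and framework overhead are ignored.

    num_params = 7e9                 # example model size: 7B parameters (illustrative)

    bytes_per_param = (
        2    # fp16 model weights
        + 2  # fp16 gradients
        + 4  # fp32 master copy of the weights
        + 4  # fp32 Adam first moment
        + 4  # fp32 Adam second moment
    )                                # = 16 bytes per parameter

    total_gb = num_params * bytes_per_param / 1e9
    print(f"approximate training-state memory: {total_gb:.0f} GB")   # ~112 GB for 7B params
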
Week 13 Nov 17 – Nov 19 2 sessions
  • Mon, Nov 17
    Analog computing for energy-efficient AI: Part I
    Principles of analog computing for neural network inference, focusing on in-memory computation, and hardware–algorithm co-design.
  • Wed, Nov 19
    Analog computing for energy-efficient AI: Part II
    Analog computing techniques for neural network training, highlighting gradient computation and hardware–algorithm co-design under device nonidealities.
Week 14 Nov 24 – Nov 26 2 sessions
  • Mon, Nov 24
    Pre-training and fine-tuning LLMs
    Optimization methods for pre-training and fine-tuning LLMs, including data scaling, parameter-efficient adaptation, and distributed training tradeoffs.
  • Wed, Nov 26
    No class - Thanksgiving
    No Class | HW 5 due (12/1)
Week 15 Dec 01 – Dec 03 2 sessions
  • Mon, Dec 01
    Project presentation - Part I
    ECE/ORIE 5290 student projects: educational presentations highlighting key papers, methods, and open challenges from the recent literature in distributed optimization.
  • Wed, Dec 03
    Project presentation - Part II
    ECE/ORIE 5290 and ECE 7290 student project presentations: showcasing research topics, key findings, and proposed extensions in distributed optimization.
Week 16 Dec 08 1 session
  • Mon, Dec 08
    Project presentation - Part III
    ECE 7290 student project presentations: showcasing research topics, key findings, and proposed extensions in distributed optimization.
    Project Report due (12/13)

Assignments & Project


Synergy with other ECE Courses

Due to space limitations, the current list highlights selected ECE courses offered at Cornell Tech that are open to Master’s students and have strong thematic connections with ECE 5290.

Suggested Readings

These papers offer background and provide deeper insights into distributed optimization and its applications in machine learning.

Advances and Open Problems in Federated Learning

P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, et al.; arXiv preprint, 2021

The definitive survey of federated learning—covering foundations, system challenges, privacy, personalization, and open research directions. Essential background for advanced topics in this course.

DeepSpeed: System Optimizations Enable Training Beyond 100 Billion Parameters

J. Rasley, S. Rajbhandari, O. Ruwase, Y. He; Proceedings of the International Conference for High Performance Computing (SC), 2020

System-level innovations enabling efficient large-scale model training—essential reading for project work.

Local SGD Converges Fast and Communicates Little

S. U. Stich; ICLR, 2019

Formal analysis of local SGD showing near-linear speedup with limited communication.

LAG: Lazily Aggregated Gradient for Communication-Efficient Distributed Learning

T. Chen, G. Giannakis, T. Sun, W. Yin; NeurIPS, 2018

Introduces LAG, which adaptively skips redundant gradient updates to reduce communication cost without harming convergence.

Communication Compression for Decentralized Training

H. Tang, X. Lian, M. Yan, C. Zhang, J. Liu; NeurIPS, 2018

Shows how gradient compression techniques accelerate decentralized training while maintaining convergence.

Achieving Geometric Convergence for Distributed Optimization Over Time-Varying Graphs

A. Nedić, A. Olshevsky, W. Shi; SIAM Journal on Optimization, 2017

A pioneering paper introducing gradient tracking, enabling linear convergence over time-varying graphs.

Communication-Efficient Learning of Deep Networks from Decentralized Data

H. B. McMahan, E. Moore, D. Ramage, S. Hampson, B. A. Arcas; AISTATS, 2017

Introduces FedAvg, the foundational algorithm for federated learning.

On the Convergence of Decentralized Gradient Descent

K. Yuan, Q. Ling, W. Yin; SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, 2016

A seminal theoretical study of decentralized gradient descent (DGD), providing a convergence rate analysis for both diminishing and fixed step sizes over static networks.

EXTRA: An Exact First-Order Algorithm for Decentralized Consensus Optimization

W. Shi, Q. Ling, K. Yuan, G. Wu, W. Yin; SIAM Journal on Optimization, 2015

A breakthrough decentralized algorithm achieving exact convergence to the global optimum using local updates.

Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers

S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein; Foundations and Trends in Machine Learning, 2011

The classic monograph on ADMM—still the most cited reference for distributed convex optimization.


Resources


Acknowledgement


Course Policies

Academic integrity, late submission, and collaboration policies will follow Cornell Tech standards.