Overview

Today, computer systems operate at unprecedented scale, running on billions of devices worldwide and underpinning personal computing, cloud infrastructure, industrial platforms, and critical services. As these systems grow in complexity, traditional design and analysis techniques are increasingly strained, creating new challenges in performance, reliability, and security.

This course explores how machine learning techniques can be used to address fundamental challenges in computer systems. We will study state-of-the-art research that applies ML to systems problems such as performance optimization, bug detection, reliability analysis, and security enforcement. Through in-depth discussion of research papers and hands-on projects, students will learn how to frame systems challenges as learning problems, critically evaluate ML-based system designs, and develop practical ML-driven solutions for real-world systems.

Prerequisites: COMP 530/730 (Operating Systems), COMP 734 (Distributed Systems), COMP 562 (Introduction to Machine Learning), or equivalent background.



Course Info



Grading


Paper presentation and review

Each student will present and lead the discussion for one or two research papers during the semester. In addition, each student will complete written reviews for five research papers. Paper assignments will be finalized by Week 2.

Each presentation should be approximately 30 minutes long, followed by a short Q&A and discussion. Presenters are expected to actively engage with the audience, respond to questions, and help guide the discussion. Students should create their own slides; copying slides directly from the paper's authors is not allowed and will lower the grade. Reusing figures or diagrams from the paper or the authors' talk is acceptable.

Students will also write reviews for assigned papers. Reviews will be submitted through a Google Form, and detailed guidelines and evaluation criteria will be provided in advance. Reviews must be written independently by the student. The use of AI tools to generate or substantially draft reviews is strictly prohibited and will be treated as a violation of academic integrity.


Class participation

Participation is central to the course. Students are required to attend all classes. Up to two absences are allowed without prior notice; additional absences must be reported and approved.

Everyone is expected to engage actively during discussions by asking questions, offering thoughts, or responding to others. The classroom should remain inclusive and respectful.


Research project

The course includes a semester-long team project on a topic related to system reliability or security. Projects are done in teams of three or four. Students who wish to work individually must request approval from the instructor. Teams should be formed by Week 2; the instructor will help with team formation if needed.

Each team will choose a topic, either from a list of suggestions (provided during the lecture) or based on their own ideas. All topics must be approved to ensure they are suitable in scope and relevance.


Project Milestones



Course Schedule

| Date | Topic | Detail |
|---|---|---|
| 01/08 | Lecture | Introduction |
| 01/13 | Lecture | Use machine learning to address kernel concurrency bugs |
| 01/15 | Lecture | Network performance |
| | Deadline | Team registration (by 01/20) |
| 01/20 | Paper presentation | - Computers Can Learn from the Heuristic Designs and Master Internet Congestion Control, SIGCOMM'23 |
| | | - Achieving Fairness Generalizability for Learning-based Congestion Control with Jury, EuroSys'25 |
| 01/22 | NO CLASS | Hacker Day |
| 01/27 | Paper presentation | - LiteFlow: towards high-performance adaptive neural networks for kernel datapath, SIGCOMM'22 |
| | | - Caravan: Practical Online Learning of In-Network ML Models with Labeling Agents, OSDI'24 |
| 01/29 | Lecture | Resource management |
| 02/03 | Paper presentation | - SmartOS: Towards Automated Learning and User-Adaptive Resource Allocation in Operating Systems, APSys'21 |
| | | - ALPS: An Adaptive Learning, Priority OS Scheduler for Serverless Functions, ATC'24 |
| 02/05 | Paper presentation | - SelfTune: Learning-based Cluster Managers, NSDI'23 |
| | | - Towards VM Rescheduling Optimization Through Deep Reinforcement Learning, EuroSys'25 |
| 02/10 | Project Proposal | |
| | Deadline | Project proposal report (by 02/19) |
| 02/12 | Project Proposal | |
| 02/17 | NO CLASS | Hacker Day |
| 02/19 | Lecture | Data structure |
| 02/24 | Paper presentation | - The Case for Learned Index Structures, SIGMOD'18 |
| | | - ALEX: An Updatable Adaptive Learned Index, SIGMOD'20 |
| 02/26 | Paper presentation | - Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions, OSDI'20 |
| | | - LOFT: A Lock-free and Adaptive Learned Index with High Scalability for Dynamic Workloads, EuroSys'25 |
| 03/03 | Lecture | Bug detection |
| 03/05 | Paper presentation | - Unearthing Semantic Checks for Cloud Infrastructure-as-Code Programs, SOSP'24 |
| | | - If At First You Don't Succeed, Try, Try, Again ...? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems, SOSP'24 |
| 03/10 | Mid-semester Presentation | |
| 03/12 | Mid-semester Presentation | |
| 03/17 | NO CLASS | Spring Break |
| 03/19 | NO CLASS | Spring Break |
| 03/24 | Paper presentation | - SyzVegas: Beating Kernel Fuzzing Odds with Reinforcement Learning, Security'21 |
| | | - KNighter: Transforming Static Analysis with LLM-Synthesized Checkers, SOSP'25 |
| 03/26 | Lecture | Bug diagnosis |
| 03/31 | Paper presentation | - Automatic Root Cause Analysis via Large Language Models for Cloud Incidents, EuroSys'24 |
| | | - Sleuth: A Trace-Based Root Cause Analysis System for Large-Scale Microservices with Graph Neural Networks, ASPLOS'24 |
| 04/02 | NO CLASS | Well-Being Day |
| 04/07 | Paper presentation | - Murphy: Performance Diagnosis of Distributed Cloud Applications, SIGCOMM'23 |
| | | - Sage: Practical & Scalable ML-Driven Performance Debugging in Microservices, ASPLOS'21 |
| 04/09 | Lecture | ML integration |
| 04/14 | Paper presentation | - ChameleonAPI: Automatic and Efficient Customization of Neural Networks for ML Applications, OSDI'24 |
| | | - Parrot: Efficient Serving of LLM-based Applications with Semantic Variable, OSDI'24 |
| 04/16 | Paper presentation | - SuperFE: A Scalable and Flexible Feature Extractor for ML-based Traffic Analysis Applications, EuroSys'25 |
| | | - Towards a Machine Learning-Assisted Kernel with LAKE, ASPLOS'23 |
| 04/21 | Final Presentation | |
| 04/23 | Final Presentation | |
| | Deadline | Project final report (by 04/28) |