Overview
Today, computer systems operate at unprecedented scale, running on billions of devices worldwide and underpinning personal computing, cloud infrastructure, industrial platforms, and critical services. As these systems grow in complexity, traditional design and analysis techniques are increasingly strained, creating new challenges in performance, reliability, and security.
This course explores how machine learning techniques can be used to address fundamental challenges in computer systems. We will study state-of-the-art research that applies ML to systems problems such as performance optimization, bug detection, reliability analysis, and security enforcement. Through in-depth discussion of research papers and hands-on projects, students will learn how to frame systems challenges as learning problems, critically evaluate ML-based system designs, and develop practical ML-driven solutions for real-world systems.
Prerequisites: COMP 530/730 (operating systems), COMP 734 (distributed systems), COMP 562 (Intro Machine Learning), or equivalent background.
Course Info
- Time: Tuesday and Thursday, 5:00-6:15 PM
- Room: SN115
- Instructor: Sishuai Gong
- Email: sishuai@cs.unc.edu
- Office hours: 30 minutes after class or by appointment
Grading
- Paper presentation and review: 40%
- Class participation: 30%
- Research project: 30%
Paper presentation and review
Each student will present and lead the discussion for one or two research papers during the semester. In addition, each student will complete written reviews for five research papers. Paper assignments will be finalized by Week 2.
Each presentation should be approximately 30 minutes long, followed by a short Q&A and discussion. Presenters are expected to actively engage with the audience, respond to questions, and help guide the discussion. Students should create their own slides. Copying slides directly from the paper authors is not allowed and will affect the grade; reusing individual figures or diagrams from the paper or talk is acceptable.
Students will also write reviews for assigned papers. Reviews will be submitted through a Google Form, and detailed guidelines and evaluation criteria will be provided in advance. Reviews must be written independently by the student. The use of AI tools to generate or substantially draft reviews is strictly prohibited and will be treated as a violation of academic integrity.
Class participation
Participation is central to the course. Students are required to attend all classes. Up to two absences are allowed without prior notice; any additional absence must be reported to and approved by the instructor.
Everyone is expected to engage actively during discussions by asking questions, offering thoughts, or responding to others. The classroom should remain inclusive and respectful.
Research project
The course includes a semester-long team project on a topic related to system reliability or security. Projects are done in teams of three or four. Students who wish to work individually must request approval from the instructor. Teams should be formed by Week 2; the instructor will help with team formation if needed.
Each team will choose a topic, either from a list of suggestions (provided during the lecture) or based on their own ideas. All topics must be approved to ensure they are suitable in scope and relevance.
Project Milestones
- Team Formation: Teams of three or four must be in place by Week 2.
- Project Proposal: Teams will first give a short in-class presentation to introduce their project idea. If the proposal is approved by the instructor, the team should upload a 2-page PDF proposal within one week of the presentation, outlining the project’s goals, background, and related work. If the proposal is not approved, the team should meet with the instructor during office hours to refine and finalize the topic before submission.
- Midterm Review: Each team will give a brief mid-semester review, sharing early results and discussing any challenges.
- Final Report and Presentation: The final deliverables include a 6-page PDF report, along with an in-class presentation.
| Date | Topic | Detail |
|---|---|---|
| 01/08 | Lecture: Introduction | |
| 01/13 | Lecture: Use machine learning to address kernel concurrency bugs | |
| 01/15 | Lecture: Network performance | |
| by 01/20 | Deadline: team registration | |
| 01/20 | Paper presentation | - Computers Can Learn from the Heuristic Designs and Master Internet Congestion Control, SIGCOMM'23 - Achieving Fairness Generalizability for Learning-based Congestion Control with Jury, EuroSys'25 |
| 01/22 | NO CLASS | Hacker Day |
| 01/27 | Paper presentation | - LiteFlow: towards high-performance adaptive neural networks for kernel datapath, SIGCOMM'22 - Caravan: Practical Online Learning of In-Network ML Models with Labeling Agents, OSDI'24 |
| 01/29 | Lecture: Resource management | |
| 02/03 | Paper presentation | - SmartOS: Towards Automated Learning and User-Adaptive Resource Allocation in Operating Systems, APSys'21 - ALPS: An Adaptive Learning, Priority OS Scheduler for Serverless Functions, ATC'24 |
| 02/05 | Paper presentation | - SelfTune: Learning-based Cluster Managers, NSDI'23 - Towards VM Rescheduling Optimization Through Deep Reinforcement Learning, EuroSys'25 |
| 02/10 | Project Proposal | |
| by 02/19 | Deadline: project proposal report | |
| 02/12 | Project Proposal | |
| 02/17 | NO CLASS | Hacker Day |
| 02/19 | Lecture: Data structure | |
| 02/24 | Paper presentation | - The Case for Learned Index Structures, SIGMOD'18 - ALEX: An Updatable Adaptive Learned Index, SIGMOD'20 |
| 02/26 | Paper presentation | - Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions, OSDI'20 - LOFT: A Lock-free and Adaptive Learned Index with High Scalability for Dynamic Workloads, EuroSys'25 |
| 03/03 | Lecture: Bug detection | |
| 03/05 | Paper presentation | - Unearthing Semantic Checks for Cloud Infrastructure-as-Code Programs, SOSP'24 - If At First You Don't Succeed, Try, Try, Again ...? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems, SOSP'24 |
| 03/10 | Mid-semester Presentation | |
| 03/12 | Mid-semester Presentation | |
| 03/17 | NO CLASS | Spring Break |
| 03/19 | NO CLASS | Spring Break |
| 03/24 | Paper presentation | - SyzVegas: Beating Kernel Fuzzing Odds with Reinforcement Learning, Security'21 - KNighter: Transforming Static Analysis with LLM-Synthesized Checkers, SOSP'25 |
| 03/26 | Lecture: Bug diagnosis | |
| 03/31 | Paper presentation | - Automatic Root Cause Analysis via Large Language Models for Cloud Incidents, EuroSys'24 - Sleuth: A Trace-Based Root Cause Analysis System for Large-Scale Microservices with Graph Neural Networks, ASPLOS'24 |
| 04/02 | NO CLASS | Well-Being Day |
| 04/07 | Paper presentation | - Murphy: Performance Diagnosis of Distributed Cloud Applications, SIGCOMM'23 - Sage: Practical & Scalable ML-Driven Performance Debugging in Microservices, ASPLOS'21 |
| 04/09 | Lecture: ML integration | |
| 04/14 | Paper presentation | - ChameleonAPI: Automatic and Efficient Customization of Neural Networks for ML Applications, OSDI'24 - Parrot: Efficient Serving of LLM-based Applications with Semantic Variable, OSDI'24 |
| 04/16 | Paper presentation | - SuperFE: A Scalable and Flexible Feature Extractor for ML-based Traffic Analysis Applications, EuroSys'25 - Towards a Machine Learning-Assisted Kernel with LAKE, ASPLOS'23 |
| 04/21 | Final Presentation | |
| 04/23 | Final Presentation | |
| by 04/28 | Deadline: project final report | |