MLSys DAG Scheduling Challenge
Designing optimal execution schedulers for computational graphs on memory-constrained AI accelerators.
Problem Statement
Status: Active
Challenge: Execute massive computational graphs on hardware with limited on-chip memory by orchestrating complex data movement strategies.
Design a scheduler that analyzes a Directed Acyclic Graph (DAG) of operations and generates an execution strategy that minimizes total latency while respecting all memory capacity limits.
Core Concepts
Memory Hierarchy
- Slow Memory: Infinite capacity, limited bandwidth - all inputs start here, all outputs must end here
- Fast Memory: High-speed scratchpad with finite capacity (e.g., 50KB) - compute cores can only access data here
- Ephemeral Data: When operations are grouped, intermediate data passes directly between ops without touching fast memory (zero capacity cost)
Execution Granularity
Control how the hardware slices computation using a 3D configuration [w, h, k]:
- w, h: Spatial dimensions - the size of each output slice (w × h)
- k: Reduction depth - for MatMul operations, the step size of the dot-product accumulation
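To make the granularity concrete, the following sketch counts how many execution steps a given [w, h, k] implies for one operation. The function name and signature are illustrative assumptions, not part of the challenge API:

```python
import math

def count_tiles(out_w, out_h, red_dim, w, h, k):
    """Number of execution steps for an out_w x out_h output with
    reduction depth red_dim under granularity [w, h, k].
    (Illustrative helper; not the official harness API.)"""
    return math.ceil(out_w / w) * math.ceil(out_h / h) * math.ceil(red_dim / k)

# A 128x128 MatMul (reduction depth 128) at granularity [64, 64, 32]
# is sliced into 2 x 2 spatial tiles, each with 4 reduction steps:
print(count_tiles(128, 128, 128, 64, 64, 32))  # -> 16
```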
Key Constraints
- Working Set Limit: Sum of input slices + output slices must fit in fast memory
- Native Granularity: Hardware has minimum execution size (e.g., 128×128) - smaller sizes incur padding overhead
- Roofline Model: Latency = max(compute_time, memory_transfer_time)
Optimization Strategies
1. Subgraph Grouping
Group connected operations to make intermediate data ephemeral, eliminating memory transfers and capacity consumption.
2. Granularity Tuning
Balance between memory footprint and compute efficiency by adjusting tile sizes. Smaller tiles fit in memory but may incur padding overhead.
3. Data Residency Management
Decide which tensors to keep in fast memory vs. spill to slow memory. Trade-off between recomputation and memory bandwidth.
4. Traversal Order Optimization
Process tiles in "zig-zag" or "snake" patterns to maximize data reuse and minimize revisits to slow memory.
5. Split-K Pipelining
For chained MatMuls, use small reduction depths to stream data and minimize intermediate tensor size in fast memory.
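Strategy 4 (traversal order) is easy to sketch concretely. The generator below enumerates tile coordinates in a snake pattern so that consecutive tiles are always adjacent, which is what lets resident data be reused; the function name is an assumption for illustration:

```python
def snake_order(tiles_w, tiles_h):
    """Enumerate tile coordinates in a snake ("zig-zag") pattern:
    even rows left-to-right, odd rows right-to-left, so each step
    moves to an adjacent tile and resident operands can be reused."""
    order = []
    for row in range(tiles_h):
        cols = range(tiles_w) if row % 2 == 0 else range(tiles_w - 1, -1, -1)
        for col in cols:
            order.append((row, col))
    return order

# The 2x2 tiling from Example 3 below a raster order would revisit
# column 0 after a full row; the snake order never jumps:
print(snake_order(2, 2))  # -> [(0, 0), (0, 1), (1, 1), (1, 0)]
```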
Input/Output Format
Input Specification (JSON)
The problem is specified in a JSON file containing graph topology and hardware specs:
- widths, heights: Dimensions of each tensor
- inputs: List of tensor indexes consumed by each operation
- outputs: List of tensor IDs produced by each operation
- base_costs: Compute cost for each operation
- op_types: Operation type (MatMul or Pointwise)
- fast_memory_capacity: Available fast memory (e.g., 25000)
- slow_memory_bandwidth: Transfer rate (e.g., 10)
- native_granularity: Hardware native execution size [w, h]
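A minimal loader for this format can be sketched with the standard json module. The field names follow the specification above; the helper names and the strictness of the check are assumptions, not the official harness:

```python
import json

# Fields the input specification above requires.
REQUIRED = ["widths", "heights", "inputs", "outputs", "base_costs",
            "op_types", "fast_memory_capacity", "slow_memory_bandwidth",
            "native_granularity"]

def validate_problem(spec):
    """Check that all required fields are present (sketch, not official)."""
    missing = [key for key in REQUIRED if key not in spec]
    if missing:
        raise ValueError(f"input spec missing fields: {missing}")
    return spec

def load_problem(path):
    """Load and validate a problem specification from a JSON file."""
    with open(path) as f:
        return validate_problem(json.load(f))
```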
Output Specification (JSON)
Must provide a complete execution schedule with:
- subgraphs: List of operation groups to execute
- granularities: [w, h, k] configuration for each subgraph
- tensors_to_retain: Which tensors to keep in fast memory after each step
- traversal_orders: Optional custom tile execution order (e.g., "snake" pattern)
- subgraph_latencies: Calculated latency for each execution step
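Emitting a schedule in this shape is straightforward; the sketch below uses the field names listed above, but the exact schema accepted by the checker may differ, so treat it as an assumption:

```python
import json

def write_schedule(path, subgraphs, granularities, tensors_to_retain,
                   subgraph_latencies, traversal_orders=None):
    """Serialize an execution schedule using the output fields above.
    traversal_orders is optional, matching the specification."""
    schedule = {
        "subgraphs": subgraphs,
        "granularities": granularities,
        "tensors_to_retain": tensors_to_retain,
        "subgraph_latencies": subgraph_latencies,
    }
    if traversal_orders is not None:
        schedule["traversal_orders"] = traversal_orders
    with open(path, "w") as f:
        json.dump(schedule, f, indent=2)
```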
Key Trade-offs
Spilling vs. Recomputation
Spilling: Save intermediate results to slow memory (costs bandwidth, saves compute)
Recomputation: Discard intermediates and recompute when needed (costs compute, saves bandwidth)
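A first-order cost model makes the trade-off concrete. This assumes sizes in elements and bandwidth in elements per unit time, and that a spilled tensor is stored once and reloaded once; the function names are illustrative:

```python
def spill_cost(tensor_size, bandwidth):
    """Spilling pays two transfers: store to slow memory now, reload later."""
    return 2 * tensor_size / bandwidth

def recompute_cost(producer_compute_time, extra_input_traffic, bandwidth):
    """Recomputation pays the producer's compute again plus re-fetching its
    inputs; under the roofline model the larger term dominates."""
    return max(producer_compute_time, extra_input_traffic / bandwidth)

# A 128x128 intermediate (16384 elements) at bandwidth 10:
print(spill_cost(16384, 10))              # -> 3276.8
print(recompute_cost(1000.0, 16384, 10))  # -> 1638.4 (recompute wins here)
```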
Granularity Selection
Large tiles: Efficient compute (matches native granularity) but high memory footprint
Small tiles: Fits in memory but incurs padding overhead and more execution steps
Grouping Strategy
Mega-grouping: Combine many ops to eliminate intermediate transfers
Fine-grained: Execute ops separately for more flexibility in memory management
Example Scenarios
Example 1: Baseline (128×128 Pointwise Chain)
Two sequential pointwise operations. Strategy comparison:
- Always Spill: 6,553.6 latency (memory bound, 2× slower)
- Mega-Group (128×128): 3,276.8 latency (memory bound, optimal)
- Mega-Group (64×64): 4,400 latency (compute bound, padding overhead)
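The two memory-bound figures above can be reproduced from the roofline model, assuming 128×128 tensors (16,384 elements), bandwidth 10, and one unit of traffic per element; the compute-bound 64×64 case additionally depends on the padded base costs, which this sketch does not model:

```python
def transfer_time(elements, bandwidth=10):
    """Time to move a tensor through slow memory at the given bandwidth."""
    return elements / bandwidth

TENSOR = 128 * 128  # 16384 elements

# Always Spill: load input, store the intermediate, reload it, store output.
always_spill = 4 * transfer_time(TENSOR)

# Mega-Group: the intermediate is ephemeral; only load input, store output.
mega_group = 2 * transfer_time(TENSOR)

print(always_spill)  # -> 6553.6
print(mega_group)    # -> 3276.8
```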
Example 2: Diamond Graph (Skip Connection)
Intermediate tensor needed by two downstream branches:
- Spilling (Cache): 11,468.8 latency (save to slow memory, reload later)
- Recomputation (Flash): 6,276.8 latency (45% faster, recompute when needed)
- Selective Residency (Hybrid): 4,638.4 latency (60% faster, keep in fast memory)
Example 3: MatMul with Revisit Optimization
128×128 MatMul with 64×64 granularity creates 4 tiles:
- Naive Raster Order: 8,192 latency (reload data every tile)
- Zig-Zag Traversal: 6,548 latency (20% faster, reuse resident data)
Example 4: Chained MatMul with Split-K
Computing (A @ B) @ C with tight memory:
- Full Materialization (k=128): OOM error (intermediate too large)
- Split-K Pipeline (k=32): 6,915.2 latency (streams data, fits in memory)
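The OOM-versus-fits outcome in this example can be sanity-checked with a simple working-set count. The accounting below (a w×k input slice, a k×h input slice, and a w×h accumulator, in elements, against the 25,000-element capacity from the input example) is an illustrative assumption, not the challenge's exact memory model:

```python
def matmul_working_set(w, h, k):
    """Per-step fast-memory footprint of one MatMul stage under
    granularity [w, h, k]: two input slices plus the output accumulator.
    (Illustrative accounting; units are elements.)"""
    return w * k + k * h + w * h

capacity = 25000
full = matmul_working_set(128, 128, 128)  # full materialization, k = 128
split = matmul_working_set(128, 128, 32)  # split-K pipeline, k = 32

print(full, full <= capacity)    # -> 49152 False (OOM, as in the example)
print(split, split <= capacity)  # -> 24576 True  (streams and fits)
```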
Implementation Approach
- Languages: Python for scheduler logic, C++ for performance-critical paths
- Graph Analysis: NetworkX for DAG traversal and dependency analysis
- Optimization: Dynamic programming, greedy heuristics, genetic algorithms
- Simulation: Custom roofline model simulator for latency calculation
- Validation: JSON schema validation, memory constraint checking
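The dependency-analysis step can also be sketched without NetworkX, using the standard library's graphlib. The producer/consumer encoding below mirrors the inputs/outputs fields from the input specification; the function name and dict shapes are assumptions for illustration:

```python
from graphlib import TopologicalSorter

def schedule_order(op_inputs, op_outputs):
    """Return ops in a valid dependency order: op B depends on op A when
    B consumes a tensor that A produces. (Sketch with stdlib graphlib;
    a real scheduler would also weigh memory pressure and latency.)"""
    producer = {t: op for op, outs in op_outputs.items() for t in outs}
    deps = {op: {producer[t] for t in ins if t in producer}
            for op, ins in op_inputs.items()}
    return list(TopologicalSorter(deps).static_order())

# Diamond graph: op0 feeds op1 and op2, both of which feed op3.
inputs = {"op0": [0], "op1": [1], "op2": [1], "op3": [2, 3]}
outputs = {"op0": [1], "op1": [2], "op2": [3], "op3": [4]}
print(schedule_order(inputs, outputs))  # op0 first, op3 last
```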
Progress & Updates
Current Status
Phase: Algorithm development and baseline implementation
Focus: Implementing core scheduling algorithms and testing on example cases
Updates on competition progress, experimental results, and optimization discoveries will be posted here as work progresses.