MLSys DAG Scheduling Challenge
Designing optimal execution schedulers for computational graphs on memory-constrained AI accelerators.
Problem Statement
Status: Active
Challenge: Execute massive computational graphs on hardware with limited on-chip memory by orchestrating complex data movement strategies.
Design a scheduler that analyzes a Directed Acyclic Graph (DAG) of operations and generates an execution strategy that minimizes total latency while respecting all memory capacity limits.
Core Concepts
Memory Hierarchy
- Slow Memory: Infinite capacity, limited bandwidth - all inputs start here, all outputs must end here
- Fast Memory: High-speed scratchpad with finite capacity (e.g., 50KB) - compute cores can only access data here
- Ephemeral Data: When operations are grouped, intermediate data passes directly between ops without touching fast memory (zero capacity cost)
Execution Granularity
Control how the hardware slices computation using a 3D configuration [w, h, k]:
- w, h: Spatial dimensions - the size of each output slice (w × h)
- k: Reduction depth - for MatMul operations, the step size of the dot-product accumulation
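To make the granularity concrete, the following sketch counts how many execution steps a given [w, h, k] implies for one operation. The function name and signature are illustrative assumptions, not part of the challenge API:

```python
import math

def count_tiles(out_w, out_h, red_dim, w, h, k):
    """Number of execution steps for an out_w x out_h output with
    reduction depth red_dim under granularity [w, h, k].
    (Illustrative helper; not the official harness API.)"""
    return math.ceil(out_w / w) * math.ceil(out_h / h) * math.ceil(red_dim / k)

# A 128x128 MatMul (reduction depth 128) at granularity [64, 64, 32]
# is sliced into 2 x 2 spatial tiles, each with 4 reduction steps:
print(count_tiles(128, 128, 128, 64, 64, 32))  # -> 16
```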
Key Constraints
- Working Set Limit: Sum of input slices + output slices must fit in fast memory
- Native Granularity: Hardware has minimum execution size (e.g., 128×128) - smaller sizes incur padding overhead
- Roofline Model: Latency = max(compute_time, memory_transfer_time)
Optimization Strategies
1. Subgraph Grouping
Group connected operations to make intermediate data ephemeral, eliminating memory transfers and capacity consumption.
2. Granularity Tuning
Balance between memory footprint and compute efficiency by adjusting tile sizes. Smaller tiles fit in memory but may incur padding overhead.
3. Data Residency Management
Decide which tensors to keep in fast memory vs. spill to slow memory. Trade-off between recomputation and memory bandwidth.
4. Traversal Order Optimization
Process tiles in "zig-zag" or "snake" patterns to maximize data reuse and minimize revisits to slow memory.
5. Split-K Pipelining
For chained MatMuls, use small reduction depths to stream data and minimize intermediate tensor size in fast memory.
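Strategy 4 (traversal order) is easy to sketch concretely. The generator below enumerates tile coordinates in a snake pattern so that consecutive tiles are always adjacent, which is what lets resident data be reused; the function name is an assumption for illustration:

```python
def snake_order(tiles_w, tiles_h):
    """Enumerate tile coordinates in a snake ("zig-zag") pattern:
    even rows left-to-right, odd rows right-to-left, so each step
    moves to an adjacent tile and resident operands can be reused."""
    order = []
    for row in range(tiles_h):
        cols = range(tiles_w) if row % 2 == 0 else range(tiles_w - 1, -1, -1)
        for col in cols:
            order.append((row, col))
    return order

# The 2x2 tiling from Example 3 below a raster order would revisit
# column 0 after a full row; the snake order never jumps:
print(snake_order(2, 2))  # -> [(0, 0), (0, 1), (1, 1), (1, 0)]
```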
Input/Output Format
Input Specification (JSON)
The problem is specified in a JSON file containing graph topology and hardware specs:
- widths, heights: Dimensions of each tensor
- inputs: List of tensor indexes consumed by each operation
- outputs: List of tensor IDs produced by each operation
- base_costs: Compute cost for each operation
- op_types: Operation type (MatMul or Pointwise)
- fast_memory_capacity: Available fast memory (e.g., 25000)
- slow_memory_bandwidth: Transfer rate (e.g., 10)
- native_granularity: Hardware native execution size [w, h]
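A minimal loader for this format can be sketched with the standard json module. The field names follow the specification above; the helper names and the strictness of the check are assumptions, not the official harness:

```python
import json

# Fields the input specification above requires.
REQUIRED = ["widths", "heights", "inputs", "outputs", "base_costs",
            "op_types", "fast_memory_capacity", "slow_memory_bandwidth",
            "native_granularity"]

def validate_problem(spec):
    """Check that all required fields are present (sketch, not official)."""
    missing = [key for key in REQUIRED if key not in spec]
    if missing:
        raise ValueError(f"input spec missing fields: {missing}")
    return spec

def load_problem(path):
    """Load and validate a problem specification from a JSON file."""
    with open(path) as f:
        return validate_problem(json.load(f))
```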
Output Specification (JSON)
Must provide a complete execution schedule with:
- subgraphs: List of operation groups to execute
- granularities: [w, h, k] configuration for each subgraph
- tensors_to_retain: Which tensors to keep in fast memory after each step
- traversal_orders: Optional custom tile execution order (e.g., "snake" pattern)
- subgraph_latencies: Calculated latency for each execution step
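Emitting a schedule in this shape is straightforward; the sketch below uses the field names listed above, but the exact schema accepted by the checker may differ, so treat it as an assumption:

```python
import json

def write_schedule(path, subgraphs, granularities, tensors_to_retain,
                   subgraph_latencies, traversal_orders=None):
    """Serialize an execution schedule using the output fields above.
    traversal_orders is optional, matching the specification."""
    schedule = {
        "subgraphs": subgraphs,
        "granularities": granularities,
        "tensors_to_retain": tensors_to_retain,
        "subgraph_latencies": subgraph_latencies,
    }
    if traversal_orders is not None:
        schedule["traversal_orders"] = traversal_orders
    with open(path, "w") as f:
        json.dump(schedule, f, indent=2)
```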
Key Trade-offs
Spilling vs. Recomputation
Spilling: Save intermediate results to slow memory (costs bandwidth, saves compute)
Recomputation: Discard intermediates and recompute when needed (costs compute, saves bandwidth)
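A first-order cost model makes the trade-off concrete. This assumes sizes in elements and bandwidth in elements per unit time, and that a spilled tensor is stored once and reloaded once; the function names are illustrative:

```python
def spill_cost(tensor_size, bandwidth):
    """Spilling pays two transfers: store to slow memory now, reload later."""
    return 2 * tensor_size / bandwidth

def recompute_cost(producer_compute_time, extra_input_traffic, bandwidth):
    """Recomputation pays the producer's compute again plus re-fetching its
    inputs; under the roofline model the larger term dominates."""
    return max(producer_compute_time, extra_input_traffic / bandwidth)

# A 128x128 intermediate (16384 elements) at bandwidth 10:
print(spill_cost(16384, 10))              # -> 3276.8
print(recompute_cost(1000.0, 16384, 10))  # -> 1638.4 (recompute wins here)
```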
Granularity Selection
Large tiles: Efficient compute (matches native granularity) but high memory footprint
Small tiles: Fits in memory but incurs padding overhead and more execution steps
Grouping Strategy
Mega-grouping: Combine many ops to eliminate intermediate transfers
Fine-grained: Execute ops separately for more flexibility in memory management
Example Scenarios
Example 1: Baseline (128×128 Pointwise Chain)
Two sequential pointwise operations. Strategy comparison:
- Always Spill: 6,553.6 latency (memory bound, 2× slower)
- Mega-Group (128×128): 3,276.8 latency (memory bound, optimal)
- Mega-Group (64×64): 4,400 latency (compute bound, padding overhead)
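The two memory-bound figures above can be reproduced from the roofline model, assuming 128×128 tensors (16,384 elements), bandwidth 10, and one unit of traffic per element; the compute-bound 64×64 case additionally depends on the padded base costs, which this sketch does not model:

```python
def transfer_time(elements, bandwidth=10):
    """Time to move a tensor through slow memory at the given bandwidth."""
    return elements / bandwidth

TENSOR = 128 * 128  # 16384 elements

# Always Spill: load input, store the intermediate, reload it, store output.
always_spill = 4 * transfer_time(TENSOR)

# Mega-Group: the intermediate is ephemeral; only load input, store output.
mega_group = 2 * transfer_time(TENSOR)

print(always_spill)  # -> 6553.6
print(mega_group)    # -> 3276.8
```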
Example 2: Diamond Graph (Skip Connection)
Intermediate tensor needed by two downstream branches:
- Spilling (Cache): 11,468.8 latency (save to slow memory, reload later)
- Recomputation (Flash): 6,276.8 latency (45% faster, recompute when needed)
- Selective Residency (Hybrid): 4,638.4 latency (60% faster, keep in fast memory)
Example 3: MatMul with Revisit Optimization
128×128 MatMul with 64×64 granularity creates 4 tiles:
- Naive Raster Order: 8,192 latency (reload data every tile)
- Zig-Zag Traversal: 6,548 latency (20% faster, reuse resident data)
Example 4: Chained MatMul with Split-K
Computing (A @ B) @ C with tight memory:
- Full Materialization (k=128): OOM error (intermediate too large)
- Split-K Pipeline (k=32): 6,915.2 latency (streams data, fits in memory)
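The OOM-versus-fits outcome in this example can be sanity-checked with a simple working-set count. The accounting below (a w×k input slice, a k×h input slice, and a w×h accumulator, in elements, against the 25,000-element capacity from the input example) is an illustrative assumption, not the challenge's exact memory model:

```python
def matmul_working_set(w, h, k):
    """Per-step fast-memory footprint of one MatMul stage under
    granularity [w, h, k]: two input slices plus the output accumulator.
    (Illustrative accounting; units are elements.)"""
    return w * k + k * h + w * h

capacity = 25000
full = matmul_working_set(128, 128, 128)  # full materialization, k = 128
split = matmul_working_set(128, 128, 32)  # split-K pipeline, k = 32

print(full, full <= capacity)    # -> 49152 False (OOM, as in the example)
print(split, split <= capacity)  # -> 24576 True  (streams and fits)
```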
Implementation Approach
- Languages: Python for scheduler logic, C++ for performance-critical paths
- Graph Analysis: NetworkX for DAG traversal and dependency analysis
- Optimization: Dynamic programming, greedy heuristics, genetic algorithms
- Simulation: Custom roofline model simulator for latency calculation
- Validation: JSON schema validation, memory constraint checking
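The dependency-analysis step can also be sketched without NetworkX, using the standard library's graphlib. The producer/consumer encoding below mirrors the inputs/outputs fields from the input specification; the function name and dict shapes are assumptions for illustration:

```python
from graphlib import TopologicalSorter

def schedule_order(op_inputs, op_outputs):
    """Return ops in a valid dependency order: op B depends on op A when
    B consumes a tensor that A produces. (Sketch with stdlib graphlib;
    a real scheduler would also weigh memory pressure and latency.)"""
    producer = {t: op for op, outs in op_outputs.items() for t in outs}
    deps = {op: {producer[t] for t in ins if t in producer}
            for op, ins in op_inputs.items()}
    return list(TopologicalSorter(deps).static_order())

# Diamond graph: op0 feeds op1 and op2, both of which feed op3.
inputs = {"op0": [0], "op1": [1], "op2": [1], "op3": [2, 3]}
outputs = {"op0": [1], "op1": [2], "op2": [3], "op3": [4]}
print(schedule_order(inputs, outputs))  # op0 first, op3 last
```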
Progress & Updates
Current Status
Phase: Algorithm development and baseline implementation
Focus: Implementing core scheduling algorithms and testing on example cases
Updates on competition progress, experimental results, and optimization discoveries will be posted here as work progresses.