Phase 1: Simulation Engine (Complete)
The research prototype achieves up to 80% cost reduction relative to on-demand pricing for GPU workloads on volatile spot instances, combining provably optimal migration with checkpoint recovery. 2,204 lines of Rust, 41 tests passing.
Key Findings (200 tasks, 72-hour simulation)
- Optimal Kuhn-Munkres migration provides 7-46% cost reduction vs naive first-fit (policy-dependent)
- Greedy-Optimal policy achieves 79.9% cost savings vs on-demand baseline
- Checkpoint recovery successfully handles preemption events
- 45% fewer preemptions with optimal migration than with naive first-fit (Greedy policy)
Interactive Visualizations
Explore the benchmark results through interactive Plotly charts. Hover for details, zoom, and pan.
📊 Benchmark Comparison
Cost comparison across all policies. Shows the 200-task, 72-hour simulation results with optimal vs naive migration strategies.
View Benchmark →
🎯 Naive vs Optimal
Direct comparison demonstrating the superiority of Kuhn-Munkres optimal migration over naive first-fit assignment.
View Comparison →
📈 Spot Price Behavior
Realistic spot instance market simulation using an Ornstein-Uhlenbeck stochastic process. Shows price volatility and preemption risk.
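To make the market model concrete, here is a minimal sketch of a mean-reverting Ornstein-Uhlenbeck price path, discretized with a simple Euler-Maruyama step. It is not the project's implementation: the names `OuParams`, `Rng`, and `simulate_prices`, the parameter values (`theta`, `sigma`, `floor`), and the hand-rolled xorshift generator are all assumptions chosen so the example runs without external crates.

```rust
/// Tiny deterministic xorshift64* generator (an assumption, used only to avoid external crates).
/// The seed must be nonzero.
struct Rng(u64);

impl Rng {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.0 = x;
        x.wrapping_mul(0x2545F4914F6CDD1D)
    }

    /// Uniform sample that is never zero, so ln() below stays finite.
    fn uniform(&mut self) -> f64 {
        ((self.next_u64() >> 11) as f64 + 0.5) / (1u64 << 53) as f64
    }

    /// Standard normal sample via the Box-Muller transform.
    fn normal(&mut self) -> f64 {
        let u1 = self.uniform();
        let u2 = self.uniform();
        (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos()
    }
}

/// Hypothetical parameters: long-run mean price, reversion speed, volatility, price floor.
struct OuParams {
    mean: f64,
    theta: f64,
    sigma: f64,
    floor: f64,
}

/// Simulates one price per hour: dX = theta * (mean - X) dt + sigma dW.
fn simulate_prices(p: &OuParams, start: f64, hours: usize, rng: &mut Rng) -> Vec<f64> {
    let dt: f64 = 1.0; // one step per simulated hour
    let mut price = start;
    let mut path = Vec::with_capacity(hours + 1);
    path.push(price);
    for _ in 0..hours {
        let drift = p.theta * (p.mean - price) * dt;
        let shock = p.sigma * dt.sqrt() * rng.normal();
        // Clamp so a large negative shock cannot produce a non-positive price.
        price = (price + drift + shock).max(p.floor);
        path.push(price);
    }
    path
}

fn main() {
    // Mean of $0.30/hr matches the benchmark spot price; the other values are illustrative.
    let params = OuParams { mean: 0.30, theta: 0.2, sigma: 0.02, floor: 0.01 };
    let mut rng = Rng(0x9E3779B97F4A7C15);
    let path = simulate_prices(&params, 0.30, 72, &mut rng);
    let avg = path.iter().sum::<f64>() / path.len() as f64;
    println!("72-hour average spot price: ${:.3}/hr", avg);
}
```

With `sigma` set to zero the path decays geometrically toward `mean`, which is a quick way to check the drift term in isolation.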
View Market Simulation →
Try It Yourself
Run the simulation on your own machine. Requires Rust 1.91+. No GPU is needed; the simulation runs entirely on the CPU.
Technical Highlights
🧮 Kuhn-Munkres Migration
Provably optimal task-to-instance assignment using the Hungarian (Kuhn-Munkres) algorithm. 443 lines, 11 tests.
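As an illustration of the difference between optimal and first-fit assignment, the sketch below implements the standard O(n²m) potentials form of the Hungarian algorithm alongside a naive first-fit baseline. It is a compact textbook version, not the project's 443-line module; the function names `hungarian` and `first_fit`, the reading of "first-fit" as "take the first free instance in order", and the cost matrix are assumptions.

```rust
/// Minimum-cost assignment of rows (tasks) to columns (instances); requires rows <= columns.
/// Returns (total_cost, assignment) where assignment[i] is the column chosen for row i.
fn hungarian(cost: &[Vec<f64>]) -> (f64, Vec<usize>) {
    let n = cost.len();
    if n == 0 {
        return (0.0, Vec::new());
    }
    let m = cost[0].len();
    assert!(n <= m, "need at least as many instances as tasks");
    let mut u = vec![0.0_f64; n + 1]; // row potentials (1-indexed)
    let mut v = vec![0.0_f64; m + 1]; // column potentials (1-indexed)
    let mut p = vec![0usize; m + 1]; // p[j] = row matched to column j, 0 = unmatched
    let mut way = vec![0usize; m + 1]; // previous column on the augmenting path
    for i in 1..=n {
        p[0] = i;
        let mut j0 = 0usize;
        let mut minv = vec![f64::INFINITY; m + 1];
        let mut used = vec![false; m + 1];
        loop {
            used[j0] = true;
            let i0 = p[j0];
            let mut delta = f64::INFINITY;
            let mut j1 = 0usize;
            for j in 1..=m {
                if !used[j] {
                    let cur = cost[i0 - 1][j - 1] - u[i0] - v[j];
                    if cur < minv[j] {
                        minv[j] = cur;
                        way[j] = j0;
                    }
                    if minv[j] < delta {
                        delta = minv[j];
                        j1 = j;
                    }
                }
            }
            // Shift potentials so at least one new zero-reduced-cost edge appears.
            for j in 0..=m {
                if used[j] {
                    u[p[j]] += delta;
                    v[j] -= delta;
                } else {
                    minv[j] -= delta;
                }
            }
            j0 = j1;
            if p[j0] == 0 {
                break;
            }
        }
        // Flip the matching along the augmenting path back to the virtual column 0.
        loop {
            let j1 = way[j0];
            p[j0] = p[j1];
            j0 = j1;
            if j0 == 0 {
                break;
            }
        }
    }
    let mut assignment = vec![0usize; n];
    let mut total = 0.0_f64;
    for j in 1..=m {
        if p[j] != 0 {
            assignment[p[j] - 1] = j - 1;
            total += cost[p[j] - 1][j - 1];
        }
    }
    (total, assignment)
}

/// Naive baseline (assumed meaning of "first-fit"): each task takes the first free instance.
fn first_fit(cost: &[Vec<f64>]) -> (f64, Vec<usize>) {
    let m = cost.first().map_or(0, |row| row.len());
    let mut taken = vec![false; m];
    let mut assignment = Vec::with_capacity(cost.len());
    let mut total = 0.0_f64;
    for row in cost {
        let j = taken.iter().position(|t| !*t).expect("more tasks than instances");
        taken[j] = true;
        assignment.push(j);
        total += row[j];
    }
    (total, assignment)
}

fn main() {
    // Hypothetical migration costs: rows are tasks, columns are candidate instances.
    let cost = vec![vec![4.0, 1.0, 3.0], vec![2.0, 0.0, 5.0], vec![3.0, 2.0, 2.0]];
    let (naive_total, _) = first_fit(&cost);
    let (best_total, best) = hungarian(&cost);
    println!("first-fit: {naive_total}, optimal: {best_total}, assignment: {best:?}");
}
```

On this small matrix first-fit pays 6 while the optimal assignment pays 5 by sending task 0 to instance 1 and task 1 to instance 0. The gap on real workloads depends on the cost model, which is not shown here.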
⏱️ Checkpoint Recovery
Intelligent exploitation of AWS's 120-second grace period. Full/Partial/Restart decision logic. 382 lines, 9 tests.
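One plausible shape for the Full/Partial/Restart decision is a comparison of transfer time against the grace period. The sketch below is a guess at that logic, not the project's code: it assumes "Full" means the complete checkpoint fits in the window, "Partial" means only a smaller incremental delta fits, and "Restart" means neither does. The names `RecoveryDecision`, `transfer_secs`, `decide`, and the safety margin parameter are hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum RecoveryDecision {
    Full,    // the whole checkpoint can be written before termination
    Partial, // only an incremental delta fits (assumed meaning)
    Restart, // nothing useful fits; rerun the task from its last saved state
}

/// Grace period between the interruption notice and termination, as stated above.
const GRACE_PERIOD_SECS: f64 = 120.0;

/// Seconds needed to move `bytes` over a link of `bandwidth_gbps` gigabits per second.
fn transfer_secs(bytes: f64, bandwidth_gbps: f64) -> f64 {
    let bytes_per_sec = bandwidth_gbps * 1e9 / 8.0;
    bytes / bytes_per_sec
}

/// Picks the most complete recovery option that fits in the usable window.
/// `safety_margin_secs` is a hypothetical reserve for connection setup and flushing.
fn decide(
    full_bytes: f64,
    delta_bytes: f64,
    bandwidth_gbps: f64,
    safety_margin_secs: f64,
) -> RecoveryDecision {
    let budget = GRACE_PERIOD_SECS - safety_margin_secs;
    if transfer_secs(full_bytes, bandwidth_gbps) <= budget {
        RecoveryDecision::Full
    } else if transfer_secs(delta_bytes, bandwidth_gbps) <= budget {
        RecoveryDecision::Partial
    } else {
        RecoveryDecision::Restart
    }
}

fn main() {
    // Illustrative sizes in bytes, at the benchmark's 10 Gbps with a 10-second margin.
    let cases = [(100e9, 10e9), (200e9, 20e9), (200e9, 150e9)];
    for (full, delta) in cases {
        println!(
            "full = {:.0} GB, delta = {:.0} GB -> {:?}",
            full / 1e9,
            delta / 1e9,
            decide(full, delta, 10.0, 10.0)
        );
    }
}
```

At 10 Gbps the link moves 1.25 GB per second, so at most 150 GB can leave the instance in 120 seconds before any margin is subtracted.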
🔄 Domain-Agnostic Design
Clean separation between orchestration logic and workload type via pluggable scheduling policies.
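A pluggable policy layer of this kind is commonly expressed as a trait that the orchestrator calls without knowing the workload. The sketch below shows that pattern under assumptions: `SchedulingPolicy`, `Task`, `Instance`, and the two example policies `CheapestFirst` and `OnDemandOnly` are hypothetical and are not the Greedy or Greedy-Optimal policies named in the findings.

```rust
struct Task {
    id: u32,
    remaining_hours: f64,
}

struct Instance {
    id: u32,
    price_per_hour: f64,
    is_spot: bool,
}

/// The only interface the orchestrator depends on.
trait SchedulingPolicy {
    fn name(&self) -> &'static str;
    /// Returns the index of the instance chosen for `task`, if any is acceptable.
    fn choose(&self, task: &Task, instances: &[Instance]) -> Option<usize>;
}

/// Example policy: always pick the lowest hourly price.
struct CheapestFirst;

impl SchedulingPolicy for CheapestFirst {
    fn name(&self) -> &'static str {
        "cheapest-first"
    }
    fn choose(&self, _task: &Task, instances: &[Instance]) -> Option<usize> {
        instances
            .iter()
            .enumerate()
            .min_by(|a, b| a.1.price_per_hour.total_cmp(&b.1.price_per_hour))
            .map(|(index, _)| index)
    }
}

/// Example policy: never use spot capacity.
struct OnDemandOnly;

impl SchedulingPolicy for OnDemandOnly {
    fn name(&self) -> &'static str {
        "on-demand-only"
    }
    fn choose(&self, _task: &Task, instances: &[Instance]) -> Option<usize> {
        instances.iter().position(|inst| !inst.is_spot)
    }
}

/// Orchestration logic: sums the estimated cost of the placements a policy makes.
fn estimate_cost(policy: &dyn SchedulingPolicy, tasks: &[Task], instances: &[Instance]) -> f64 {
    tasks
        .iter()
        .filter_map(|task| {
            policy
                .choose(task, instances)
                .map(|index| task.remaining_hours * instances[index].price_per_hour)
        })
        .sum()
}

fn main() {
    let tasks = vec![
        Task { id: 1, remaining_hours: 2.0 },
        Task { id: 2, remaining_hours: 3.0 },
    ];
    let instances = vec![
        Instance { id: 10, price_per_hour: 0.30, is_spot: true },
        Instance { id: 11, price_per_hour: 1.00, is_spot: false },
    ];
    let policies: Vec<Box<dyn SchedulingPolicy>> = vec![Box::new(CheapestFirst), Box::new(OnDemandOnly)];
    for policy in &policies {
        for task in &tasks {
            if let Some(index) = policy.choose(task, &instances) {
                println!("{}: task {} -> instance {}", policy.name(), task.id, instances[index].id);
            }
        }
        println!("{}: estimated cost ${:.2}", policy.name(), estimate_cost(policy.as_ref(), &tasks, &instances));
    }
}
```

Because `estimate_cost` only sees the trait, a new policy can be added without touching the orchestration code.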
Benchmark Configuration
Workload: 200 inference tasks (Llama-2-70B equivalent)
Duration: 72 hours continuous simulation
Spot Price: $0.30/hr (30% of on-demand)
On-Demand Price: $1.00/hr (baseline)
Preemption Rate: 5% per hour (realistic AWS rate)
Network Bandwidth: 10 Gbps (typical AWS instance)
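The configuration above can be captured in a small struct, which also makes a few derived quantities easy to check. This is a sketch only: the `BenchmarkConfig` type and its method names are hypothetical, and the preemption figures assume independent preemption draws each hour, which the benchmark may or may not do.

```rust
/// Hypothetical container for the published benchmark parameters.
struct BenchmarkConfig {
    tasks: u32,
    duration_hours: u32,
    spot_price_per_hour: f64,
    on_demand_price_per_hour: f64,
    preemption_rate_per_hour: f64,
    network_bandwidth_gbps: f64,
}

impl BenchmarkConfig {
    /// Values copied from the configuration list above.
    fn published() -> Self {
        BenchmarkConfig {
            tasks: 200,
            duration_hours: 72,
            spot_price_per_hour: 0.30,
            on_demand_price_per_hour: 1.00,
            preemption_rate_per_hour: 0.05,
            network_bandwidth_gbps: 10.0,
        }
    }

    /// Spot price as a fraction of the on-demand price.
    fn spot_to_on_demand_ratio(&self) -> f64 {
        self.spot_price_per_hour / self.on_demand_price_per_hour
    }

    /// Expected preemptions for one instance over the run (assumes independent hourly draws).
    fn expected_preemptions(&self) -> f64 {
        self.preemption_rate_per_hour * self.duration_hours as f64
    }

    /// Probability that one instance survives the whole run without preemption
    /// (same independence assumption).
    fn survival_probability(&self) -> f64 {
        (1.0 - self.preemption_rate_per_hour).powi(self.duration_hours as i32)
    }

    /// Network throughput in bytes per second.
    fn bytes_per_second(&self) -> f64 {
        self.network_bandwidth_gbps * 1e9 / 8.0
    }
}

fn main() {
    let cfg = BenchmarkConfig::published();
    println!("tasks: {}, duration: {} h", cfg.tasks, cfg.duration_hours);
    println!("spot / on-demand price ratio: {:.2}", cfg.spot_to_on_demand_ratio());
    println!("expected preemptions per instance: {:.1}", cfg.expected_preemptions());
    println!("probability of no preemption: {:.4}", cfg.survival_probability());
    println!("throughput: {:.2} GB/s", cfg.bytes_per_second() / 1e9);
}
```

Under the independence assumption an instance expects about 3.6 preemptions over 72 hours, so an uninterrupted run is rare and recovery behavior dominates the outcome.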