
We compare and contrast two advanced 2025 memory management and scheduling techniques for optimizing Large Language Model (LLM) serving throughput and latency: **vAttention** and **Strata**.
**vAttention** improves on the popular PagedAttention approach by leveraging CUDA Virtual Memory Management (VMM) APIs to keep the KV cache virtually contiguous. Because attention kernels see a single contiguous buffer rather than a paged layout, this simplifies **attention kernel portability** and reduces the performance overheads associated with non-contiguous memory access.

**Strata** is a hierarchical context caching framework that boosts throughput through **GPU-assisted I/O and cache-aware scheduling**. It efficiently manages and transfers KV cache data between CPU and GPU memory, mitigating the "delay hit" phenomenon (a cache hit whose data still has to be loaded from a slower tier, so the request stalls anyway) and enabling on-the-fly data layout transformations during the transfer itself.

Both systems target the efficiency challenges inherent in LLM inference, particularly during the resource-intensive prefill and decode phases, with Strata reporting substantial throughput gains over existing hierarchical caching solutions. Ultimately, vAttention and Strata represent different, yet potentially complementary, approaches to the **memory fragmentation and I/O bottlenecks** that limit LLM serving performance.
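To make the vAttention idea concrete, here is a minimal sketch (not the paper's actual code) of how a serving system might use the CUDA driver's VMM APIs (`cuMemAddressReserve`, `cuMemCreate`, `cuMemMap`, `cuMemSetAccess`) to keep a request's KV cache virtually contiguous while committing physical memory only as the sequence grows. The `KVCacheRegion` struct, the chunk-growth policy, and the omission of driver initialization, teardown, and error recovery are illustrative assumptions.

```cpp
// Sketch: reserve a large contiguous virtual address range for one request's
// KV cache, then map physical chunks on demand as tokens are generated.
// Assumes cuInit() and a current CUDA context have already been set up.
#include <cuda.h>
#include <cassert>
#include <vector>

#define CHECK(call) do { CUresult r = (call); assert(r == CUDA_SUCCESS); } while (0)

struct KVCacheRegion {
    CUdeviceptr base = 0;                               // virtually contiguous base address
    size_t reserved = 0;                                // total reserved VA size
    size_t mapped = 0;                                  // bytes currently backed by physical memory
    size_t chunk = 0;                                   // physical allocation granularity (e.g. 2 MiB)
    std::vector<CUmemGenericAllocationHandle> handles;  // physical chunks backing the range
};

// Reserve virtual address space for the maximum context length up front.
// No physical GPU memory is consumed at this point.
void kv_reserve(KVCacheRegion& kv, size_t max_bytes, int device) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device;
    CHECK(cuMemGetAllocationGranularity(&kv.chunk, &prop,
                                        CU_MEM_ALLOC_GRANULARITY_MINIMUM));
    kv.reserved = ((max_bytes + kv.chunk - 1) / kv.chunk) * kv.chunk;
    CHECK(cuMemAddressReserve(&kv.base, kv.reserved, 0, 0, 0));
}

// Grow the physically mapped prefix of the range. Attention kernels keep
// seeing one contiguous pointer (kv.base), so no paging logic is needed
// inside the kernel itself.
void kv_grow(KVCacheRegion& kv, size_t needed_bytes, int device) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device;
    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;

    while (kv.mapped < needed_bytes && kv.mapped < kv.reserved) {
        CUmemGenericAllocationHandle h;
        CHECK(cuMemCreate(&h, kv.chunk, &prop, 0));               // allocate one physical chunk
        CHECK(cuMemMap(kv.base + kv.mapped, kv.chunk, 0, h, 0));  // map it at the next offset
        CHECK(cuMemSetAccess(kv.base + kv.mapped, kv.chunk, &access, 1));
        kv.handles.push_back(h);
        kv.mapped += kv.chunk;
    }
}
```

When a request finishes, the chunks can be unmapped and returned to a pool (`cuMemUnmap` / `cuMemRelease`), which is what keeps physical memory from fragmenting even though every request sees a contiguous address range.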
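For Strata's side, the sketch below illustrates the general idea of GPU-assisted I/O rather than Strata's actual kernels: a GPU kernel pulls KV-cache pages directly out of pinned, host-mapped memory and writes them into the device-side layout the attention kernel expects, so the layout transformation happens during the transfer itself. The kernel name, page layout, and parameters are assumptions for illustration.

```cpp
// Illustrative GPU-assisted I/O kernel (not Strata's implementation).
// host_pages must be pinned and device-accessible, e.g. allocated with
// cudaHostAlloc(..., cudaHostAllocMapped).
#include <cuda_runtime.h>

__global__ void gather_kv_pages(const float* __restrict__ host_pages,    // pinned host memory (zero-copy)
                                const int*   __restrict__ page_ids,      // which host page backs each device slot
                                float*       __restrict__ device_cache,  // destination KV layout on the GPU
                                int tokens_per_page,
                                int elems_per_token) {
    int page_slot = blockIdx.x;  // one thread block per KV page to fetch
    const float* src = host_pages +
        (size_t)page_ids[page_slot] * tokens_per_page * elems_per_token;
    float* dst = device_cache +
        (size_t)page_slot * tokens_per_page * elems_per_token;

    // Coalesced copy across the page; a real system could also reorder,
    // transpose, or dequantize here, which is what makes on-the-fly
    // layout transformation possible.
    for (int i = threadIdx.x; i < tokens_per_page * elems_per_token; i += blockDim.x) {
        dst[i] = src[i];
    }
}
```

Launched on a side stream, such a kernel overlaps CPU-to-GPU KV movement with ongoing decode work; a cache-aware scheduler then only admits a request once its pages are resident (or scheduled to be), which is how delay hits are accounted for instead of silently stalling the GPU.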
Sources:
- vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention (January 29, 2025). https://arxiv.org/pdf/2405.04437
- Strata: Hierarchical Context Caching for Long Context Language Model Serving (August 26, 2025). https://arxiv.org/html/2508.18572v1