
The May 21, 2024 paper introduces **Vidur**, a high-fidelity simulation framework for optimizing the deployment of Large Language Model (LLM) inference. The authors argue that optimizing a deployment experimentally is **prohibitively expensive**: it requires exploring a vast configuration space of system knobs such as parallelization strategies, batching techniques, and scheduling policies, which can consume tens of thousands of GPU hours and hundreds of thousands of dollars for a single model. Vidur sidesteps this by combining **experimental profiling of LLM operators with predictive modeling** to estimate end-to-end performance metrics, achieving less than 9% error in latency estimation.

Complementing the simulator is **Vidur-Search**, a configuration search tool that uses Vidur to automatically identify the most cost-effective deployment configuration satisfying application performance constraints, cutting optimization time from tens of thousands of GPU hours to roughly one hour on a CPU machine. The authors stress that the **optimal configuration depends on both the LLM and the specific workload trace**, which is precisely why a rapid simulation tool like Vidur is needed.
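To make the profiling-plus-prediction idea concrete, here is a minimal, hypothetical sketch (not Vidur's actual code): time a few stand-in operators at several token counts, fit a simple runtime model per operator, and compose the fits into an end-to-end latency estimate. Vidur itself profiles real GPU kernels and uses learned runtime predictors; the operator names and the polynomial fit below are assumptions for illustration.

```python
# Sketch of profiling-plus-prediction, in the spirit of Vidur's approach
# (not its actual code): time each operator at a few token counts, fit a
# simple model, then compose an end-to-end latency estimate.
# All names here (profile_op, attention_op, mlp_op) are hypothetical.
import time
import numpy as np

def attention_op(n_tokens):          # stand-in for a real GPU kernel
    a = np.random.rand(n_tokens, 256)
    return a @ a.T                   # O(n^2) work, like attention scores

def mlp_op(n_tokens):                # stand-in; roughly linear in tokens
    a = np.random.rand(n_tokens, 1024)
    return a @ np.random.rand(1024, 1024)

def profile_op(op, sizes, reps=5):
    """Measure mean wall-clock runtime of `op` at each input size."""
    times = []
    for n in sizes:
        t0 = time.perf_counter()
        for _ in range(reps):
            op(n)
        times.append((time.perf_counter() - t0) / reps)
    return np.array(times)

sizes = np.array([128, 256, 512, 1024, 2048])
models = {}
for name, op in [("attention", attention_op), ("mlp", mlp_op)]:
    runtimes = profile_op(op, sizes)
    # Fit runtime as a quadratic in token count; Vidur uses learned
    # regressors, so this polynomial fit is purely an assumption.
    models[name] = np.polyfit(sizes, runtimes, deg=2)

def predict_latency(n_tokens):
    """Estimate per-layer latency by summing predicted operator times."""
    return sum(np.polyval(coeffs, n_tokens) for coeffs in models.values())

print(f"predicted latency @ 1536 tokens: {predict_latency(1536) * 1e3:.2f} ms")
```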
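The search side can be sketched the same way: enumerate candidate configurations, ask the simulator for each one's latency, discard those that violate the application's latency constraint, and keep the cheapest survivor. The `simulate` stub, the cost model, and the knob values below are all hypothetical stand-ins; Vidur-Search explores a far larger space against real simulator output.

```python
# Hypothetical sketch of a Vidur-Search-style sweep: enumerate deployment
# configurations, estimate latency and cost for each via a simulator stub,
# and keep the cheapest one that satisfies the latency constraint.
from itertools import product

GPU_COST_PER_HOUR = 2.0  # assumed $/GPU-hour, illustrative only

def simulate(tensor_parallel, batch_size):
    """Toy latency model: bigger batches raise latency, more GPUs lower it."""
    return 20.0 * batch_size / tensor_parallel  # ms per request (made up)

best = None
for tp, bs in product([1, 2, 4, 8], [8, 16, 32, 64]):
    latency_ms = simulate(tp, bs)
    if latency_ms > 100.0:           # application latency constraint (SLO)
        continue
    # Cost per 1K requests: tp GPUs held for latency_ms per batch of bs.
    cost = tp * GPU_COST_PER_HOUR * (latency_ms / 3.6e6) * (1000 / bs)
    if best is None or cost < best[0]:
        best = (cost, tp, bs, latency_ms)

cost, tp, bs, latency_ms = best
print(f"cheapest config meeting SLO: TP={tp}, batch={bs}, "
      f"{latency_ms:.0f} ms, ${cost:.4f} per 1K requests")
```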
Source:
May 21, 2024
Vidur: A Large-Scale Simulation Framework for LLM Inference
https://arxiv.org/pdf/2405.05465