MLPerf Inference Benchmarks, MLCommons Association, 2024 - Defines standard metrics and methodologies for benchmarking machine learning inference, including scenarios and metrics directly relevant to LLM serving.
Efficient Memory Management for Large Language Model Serving with PagedAttention, Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023, Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP '23) (ACM), DOI: 10.1145/3600006.3613165 - Presents the vLLM serving system with detailed performance analysis, demonstrating advanced benchmarking practices for throughput and latency.
TensorRT-LLM Documentation, NVIDIA Corporation, 2023 - Official documentation for NVIDIA's open-source library for optimizing LLM inference on NVIDIA GPUs, detailing performance considerations and deployment strategies.