Latest News 🔥
- [2025-07] Release of our first three well-lit paths in v0.2: intelligent inference scheduling, simple disaggregated serving, and wide expert-parallelism.
- [2025-05] CoreWeave, Google, IBM Research, NVIDIA, and Red Hat launched the llm-d community. Check out our blog post and press release.
llm-d is a Kubernetes-native distributed inference serving stack, providing well-lit paths for anyone to serve large generative AI models at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators.
Our well-lit paths provide tested and benchmarked recipes and Helm charts to start serving quickly with best practices common to production deployments. They are extensible and customizable for particulars of your models and use cases, using popular open source components like Kubernetes, Envoy proxy, NIXL, and vLLM. Our intent is to eliminate the heavy lifting common in deploying inference at scale so users can focus on building.
We currently offer three tested and benchmarked paths to help you deploy large models:
- Intelligent Inference Scheduling - Deploy vLLM behind the Inference Gateway (IGW) to decrease latency and increase throughput via precise prefix-cache aware routing and customizable scheduling policies.
- Prefill/Decode Disaggregation - Reduce time to first token (TTFT) and get more predictable time per output token (TPOT) by splitting inference into prefill servers handling prompts and decode servers handling responses, primarily on large models such as Llama-70B and when processing very long prompts (see the measurement sketch after this list).
- Wide Expert-Parallelism - Deploy very large Mixture-of-Experts (MoE) models like DeepSeek-R1 and significantly reduce end-to-end latency and increase throughput by scaling up with Data Parallelism and Expert Parallelism over fast accelerator networks.
See the path descriptions for more details about the accelerators, networks, and configurations tested and our roadmap for what is coming next.
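These paths are evaluated primarily on latency and throughput metrics such as TTFT and TPOT. As a rough way to observe those metrics against your own deployment, the sketch below streams a completion from the OpenAI-compatible API served behind the gateway and estimates both. The gateway URL and model name are placeholders, and it assumes roughly one token per streamed chunk; this is an illustration, not the project's benchmarking tooling.

```python
# Rough TTFT/TPOT estimate against an OpenAI-compatible streaming endpoint.
# The gateway URL and model name are placeholders for your own deployment,
# and each streamed chunk is assumed to carry roughly one token.
import json
import time

import requests

GATEWAY_URL = "http://llm-d-gateway.example.local/v1/completions"  # placeholder
MODEL = "meta-llama/Llama-3.1-70B-Instruct"                        # placeholder

payload = {
    "model": MODEL,
    "prompt": "Explain prefix caching in one sentence.",
    "max_tokens": 64,
    "stream": True,
}

start = time.perf_counter()
chunk_times = []
with requests.post(GATEWAY_URL, json=payload, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data:"):
            continue
        data = line[len(b"data:"):].strip()
        if data == b"[DONE]":
            break
        json.loads(data)                      # one streamed completion chunk
        chunk_times.append(time.perf_counter())

ttft = chunk_times[0] - start                 # time to first token
tpot = (chunk_times[-1] - chunk_times[0]) / max(len(chunk_times) - 1, 1)
print(f"TTFT: {ttft * 1000:.1f} ms, TPOT: {tpot * 1000:.1f} ms/token")
```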
llm-d currently targets improving the production serving experience around:
- Generative models running in PyTorch or JAX
- Large language models (LLMs) with 1 billion or more parameters
- Using most or all of the capacity of one or more hardware accelerators
- On recent-generation datacenter-class accelerators
  - NVIDIA H100 or newer for larger models; L4 or A100 for smaller models
  - AMD MI250 or newer
  - Google TPU v5e, v6e, and newer
- With extremely fast accelerator interconnect and datacenter networking
  - 600-16,000 Gbps per accelerator NVLINK on host or across narrow domains like NVL72
  - 1,600-5,000 Gbps per chip TPU OCS links within TPU pods
  - 100-1,600 Gbps per host datacenter networking across broad (>128 host) domains
- Kubernetes 1.29+ running
  - in large (100-100k node) reserved cloud capacity or datacenters, overlapping with AI batch and training
  - in medium (10-1k node) cloud deployments with a mix of reserved, on-demand, or spot capacity
  - in small (1-10 node) test and qualification environments with a static footprint, often time-shared
Our upstream projects – particularly vLLM and Kubernetes – support a broader array of models, accelerators, and networks that may also benefit from our work, but we concentrate on optimizing and standardizing the operational and automation challenges of leading-edge inference workloads.
llm-d accelerates distributed inference by integrating industry-standard open technologies: vLLM as model server and engine, Inference Gateway as request scheduler and balancer, and Kubernetes as infrastructure orchestrator and workload control plane.
Key features of llm-d include:
- vLLM-Optimized Inference Scheduler: llm-d builds on IGW's pattern for customizable “smart” load-balancing via the Endpoint Picker Protocol (EPP) to define vLLM-optimized scheduling. Leveraging operational telemetry, the Inference Scheduler implements the filtering and scoring algorithms needed to make decisions with P/D-, KV-cache-, SLA-, and load-awareness. Advanced teams can implement their own scorers to customize further, while still benefiting from other IGW features like flow control and latency-aware balancing. See our Northstar design. (A toy sketch of this scoring idea follows this list.)
- Disaggregated Serving with vLLM: llm-d leverages vLLM's support for disaggregated serving to run prefill and decode on independent instances, using high-performance transport libraries like NIXL. In llm-d, we plan to support a latency-optimized implementation using fast interconnects (IB, RDMA, ICI) and a throughput-optimized implementation using datacenter networking. See our Northstar design.
- Disaggregated Prefix Caching with vLLM: llm-d uses vLLM's KVConnector to provide a pluggable KV cache hierarchy, including offloading KVs to host memory, remote storage, and systems like LMCache. We plan to support two KV caching schemes (see our Northstar design):
  - Independent (N/S) caching with offloading to local memory and disk, providing a zero-operational-cost mechanism for offloading.
  - Shared (E/W) caching with KV transfer between instances and shared storage with global indexing, offering potentially higher performance at the cost of a more operationally complex system.
- Variant Autoscaling over Hardware, Workload, and Traffic (🚧): We plan to implement a traffic- and hardware-aware autoscaler that (a) measures the capacity of each model server instance, (b) derives a load function that takes into account different request shapes and QoS, and (c) assesses the recent traffic mix (QPS, QoS, and shapes) to calculate the optimal mix of instances to handle prefill, decode, and latency-tolerant requests, enabling use of HPA for SLO-level efficiency. See our Northstar design. (A toy sketch of such a load function also follows this list.)
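To make the scheduler's decision concrete, here is a toy Python sketch of prefix-cache- and load-aware scoring. The production scheduler is implemented as Endpoint Picker (EPP) plugins in the Inference Gateway, not in Python; the fields, weights, and numbers below are invented purely for illustration.

```python
# Toy illustration of prefix-cache- and load-aware endpoint scoring.
# The real scheduler lives in the Inference Gateway's Endpoint Picker (EPP)
# plugins; this sketch only shows the shape of the decision.
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    cached_prefix_tokens: int  # longest prompt prefix already in this replica's KV cache
    queue_depth: int           # requests currently queued on this replica


def score(ep: Endpoint, prompt_tokens: int,
          prefix_weight: float = 1.0, load_weight: float = 0.5) -> float:
    """Higher is better: reward KV-cache reuse, penalize queueing."""
    prefix_hit = ep.cached_prefix_tokens / max(prompt_tokens, 1)
    return prefix_weight * prefix_hit - load_weight * ep.queue_depth


def pick(endpoints: list[Endpoint], prompt_tokens: int) -> Endpoint:
    return max(endpoints, key=lambda ep: score(ep, prompt_tokens))


if __name__ == "__main__":
    pool = [
        Endpoint("vllm-0", cached_prefix_tokens=900, queue_depth=1),
        Endpoint("vllm-1", cached_prefix_tokens=0, queue_depth=0),
    ]
    # Picks vllm-0: the prefix-cache hit outweighs its modest queue.
    print(pick(pool, prompt_tokens=1000).name)
```

In the real system, scoring also reflects SLA targets and prefill/decode roles, and scorers are composable plugins rather than a single function.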
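In the same spirit, the planned autoscaler's load function can be illustrated with a toy sizing calculation: given hypothetical per-replica prefill and decode capacities and a recent traffic mix of request shapes, estimate how many replicas of each role are needed. All capacities, names, and numbers here are invented; the actual design is described in the Northstar document.

```python
# Toy illustration of translating a traffic mix into prefill/decode replica counts.
# Capacities and the traffic mix are hypothetical benchmark-style numbers.
from dataclasses import dataclass
from math import ceil


@dataclass
class RequestShape:
    qps: float           # observed queries per second for this shape
    prompt_tokens: int   # average prompt length (drives prefill work)
    output_tokens: int   # average output length (drives decode work)


# Hypothetical measured per-replica capacities.
PREFILL_TOKENS_PER_SEC_PER_REPLICA = 50_000
DECODE_TOKENS_PER_SEC_PER_REPLICA = 4_000


def required_replicas(mix: list[RequestShape]) -> tuple[int, int]:
    """Return (prefill_replicas, decode_replicas) for the given traffic mix."""
    prefill_load = sum(s.qps * s.prompt_tokens for s in mix)  # prompt tokens/s
    decode_load = sum(s.qps * s.output_tokens for s in mix)   # output tokens/s
    return (
        ceil(prefill_load / PREFILL_TOKENS_PER_SEC_PER_REPLICA),
        ceil(decode_load / DECODE_TOKENS_PER_SEC_PER_REPLICA),
    )


if __name__ == "__main__":
    mix = [
        RequestShape(qps=5, prompt_tokens=2_000, output_tokens=200),  # long prompts
        RequestShape(qps=20, prompt_tokens=300, output_tokens=400),   # chat traffic
    ]
    print(required_replicas(mix))  # (1, 3) with the numbers above
```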
For more see the project proposal.
llm-d can be installed as a full solution, as a customized selection of its features, or through its individual components for experimentation.
llm-d requires a Kubernetes 1.29+ cluster and accelerators capable of running the large models supported by vLLM. Our well-lit paths focus on datacenter-class accelerators and networks; issues encountered outside these configurations may not receive the same level of attention.
llm-d provides Helm charts that deploy the inference scheduler and an optional modelservice that accelerates deploying vLLM in a number of different configurations.
We bundle these together for our well-lit paths as quickstart examples with usage guidance, benchmarks, and recommended configuration.
We suggest the inference scheduling quickstart if you need a simple, production-ready deployment of vLLM with optimized load balancing.
Tip: For a guided introduction to the whole system, try our step-by-step quickstart.
llm-d is composed of multiple component repositories and derives from both the vLLM and Inference Gateway upstreams. Please see the individual repositories for more guidance on development.
Visit our GitHub Releases page and review the release notes to stay updated with the latest releases.
Check out our roadmap for upcoming releases.
- See our project overview for more details on our development process and governance.
- Review our contributing guidelines for detailed information on how to contribute to the project.
- Join one of our Special Interest Groups (SIGs) to contribute to specific areas of the project and collaborate with domain experts.
- We use Slack to discuss development across organizations. Please join: Slack
- We host a weekly standup for contributors on Wednesdays at 12:30 PM ET, as well as meetings for various SIGs. You can find them in the shared llm-d calendar.
- We use Google Groups to share architecture diagrams and other content. Please join: Google Group
This project is licensed under Apache License 2.0. See the LICENSE file for details.