vLLM

An example of using vLLM with LWS

In this example, we will use LeaderWorkerSet to deploy a distributed inference service with vLLM on accelerators (GPUs or TPUs). vLLM supports distributed tensor-parallel inference and serving; currently, it implements Megatron-LM's tensor parallel algorithm and manages the distributed runtime with Ray. See the vLLM documentation on Distributed Inference and Serving for more details.
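
For orientation only (this is not taken from the example's manifests), the hedged sketch below shows how a vLLM OpenAI-compatible server is typically launched with tensor parallelism on a single node; the exact entrypoint and flags depend on your vLLM version, and the model name is a placeholder.

# Minimal single-node sketch: shard a placeholder model across 8 local GPUs.
vllm serve <model> --tensor-parallel-size 8 --port 8080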

Deploy LeaderWorkerSet of vLLM

We use LeaderWorkerSet to deploy two vLLM model replicas. There are two flavors of the deployment; a sketch of how these settings map onto the server's parallelism flags follows the list:

  • GPU: Each vLLM replica has 2 pods (pipeline_parallel_size=2) and 8 GPUs per pod (tensor_parallel_size=8).
  • TPU: The example assumes that you have a GKE cluster with two TPU v5e-16 slices. You can view how to create a cluster with multiple TPU slices here. Each TPU slice has 4 hosts, and each host has 4 TPUs. The vLLM server is deployed on the TPU slice with pipeline_parallel_size=2 and tensor_parallel_size=16.
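
As referenced above, here is a rough, hedged sketch of how the group topology maps onto vLLM's parallelism flags. It is not the contents of the manifests; the authoritative arguments are in docs/examples/vllm/GPU/lws.yaml and docs/examples/vllm/TPU/lws.yaml, and the model name is a placeholder.

# GPU flavor: 2 pods per replica, 8 GPUs per pod
#   pipeline_parallel_size (2) x tensor_parallel_size (8) = 16 GPUs per replica
vllm serve <model> --pipeline-parallel-size 2 --tensor-parallel-size 8

# TPU flavor: 2 TPU v5e-16 slices, each with 4 hosts x 4 chips = 32 chips per replica
#   pipeline_parallel_size (2) x tensor_parallel_size (16) = 32 chips per replica
vllm serve <model> --pipeline-parallel-size 2 --tensor-parallel-size 16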

In both examples, Ray uses the leader pod as the head node and the worker pods as the worker nodes. The leader pod runs the vLLM server, with a ClusterIP Service exposing the port.
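
Conceptually, the wiring looks like the hedged sketch below. In the actual example this is handled by the container startup defined in lws.yaml, so you do not run these commands by hand; the leader address placeholder is illustrative.

# On the leader pod (Ray head):
ray start --head --port=6379
# ...then the leader launches the vLLM server, as sketched above.

# On each worker pod (Ray worker), join the cluster at the leader's address:
ray start --address=<leader-address>:6379 --block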

Apply the manifest that matches your accelerator:

# GPU deployment
kubectl apply -f docs/examples/vllm/GPU/lws.yaml

# TPU deployment
kubectl apply -f docs/examples/vllm/TPU/lws.yaml
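
Before inspecting the pods, you can check the LeaderWorkerSet resource itself (assuming the lws short name is registered by the CRD in your cluster):

kubectl get lws vllm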

Verify the status of the vLLM pods

kubectl get pods

You should get output similar to this:

NAME       READY   STATUS    RESTARTS   AGE
vllm-0     1/1     Running   0          2s
vllm-0-1   1/1     Running   0          2s
vllm-1     1/1     Running   0          2s
vllm-1-1   1/1     Running   0          2s
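
Pod names follow the LeaderWorkerSet convention: vllm-0 and vllm-1 are the leaders of the two replicas, and vllm-0-1 and vllm-1-1 are their workers. If other workloads share the namespace, you can filter on the LWS name label (assuming the standard leaderworkerset.sigs.k8s.io pod labels are present):

kubectl get pods -l leaderworkerset.sigs.k8s.io/name=vllm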

Verify that the distributed tensor-parallel inference works

kubectl logs vllm-0 | grep -i "Loading model weights took"

You should get output similar to this:

INFO 05-08 03:20:24 model_runner.py:173] Loading model weights took 0.1189 GB
(RayWorkerWrapper pid=169, ip=10.20.0.197) INFO 05-08 03:20:28 model_runner.py:173] Loading model weights took 0.1189 GB
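
Optionally, assuming the ray CLI is available in the leader image, you can also confirm that all workers joined the Ray cluster:

# Expect one head node plus the worker pods, with all accelerators registered.
kubectl exec vllm-0 -- ray status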

Access ClusterIP Service

Use kubectl port-forward to forward local port 8080 to the vllm-leader Service.

# Listen on port 8080 locally, forwarding to the targetPort of the service's port 8080 in a pod selected by the service
kubectl port-forward svc/vllm-leader 8080:8080

The output should be similar to the following

Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
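
Port-forwarding is convenient for local testing; clients running inside the cluster can instead reach the server directly through the vllm-leader Service. For example, assuming your namespace has network access to the Service, a throwaway curl pod can list the served models:

kubectl run curl-test --rm -it --image=curlimages/curl --restart=Never -- \
  curl http://vllm-leader:8080/v1/models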

Serve the Model

Open another terminal and send a request

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
}'

The output should be similar to the following

{
  "id": "cmpl-1bb34faba88b43f9862cfbfb2200949d",
  "object": "text_completion",
  "created": 1715138766,
  "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": " top destination for foodies, with",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 12,
    "completion_tokens": 7
  }
}
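
Because the server exposes the OpenAI-compatible API, you can also use the chat endpoint. This is a hedged example, assuming the served model provides a chat template (as instruct models typically do):

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "messages": [{"role": "user", "content": "Name one famous landmark in San Francisco."}],
    "max_tokens": 32,
    "temperature": 0
}'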