Saxml
In this example, we use LeaderWorkerSet to deploy a multi-host inference instance on TPUs with Saxml. You can use the steps here to set up your TPU clusters, node pools, and GCS bucket, and to configure workload access.
Conceptual View
Deploy ConfigMap with model configuration
The ConfigMap specifies which model to serve and which checkpoint to load. If the model information in the ConfigMap is updated, the HTTP server unpublishes the model that was loaded and publishes a model that reflects the new information.
Apply the configmap.yaml manifest:
kubectl apply -f docs/examples/saxml/configmap.yaml
Deploy LeaderWorkerSet Deployment
We use LeaderWorkerSet to deploy two Saxml model replicas on two TPU multi-host pod slices. In each replica, the leader pod runs the Sax admin server and the HTTP server, while the worker pods run the Sax model servers. Additionally, a LoadBalancer Service exposes the leader's HTTP server.
Replace GCS_BUCKET with the name of your GCS bucket, then apply the lws.yaml manifest:
kubectl apply -f docs/examples/saxml/lws.yaml
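The placeholder substitution can also be scripted instead of editing the file by hand. The following is a minimal sketch, assuming the placeholder appears verbatim as the string GCS_BUCKET in lws.yaml; the function name `render_manifest` and the bucket name in the example are hypothetical.

```shell
# Substitute the literal GCS_BUCKET placeholder in a manifest and print the
# result to stdout. The file on disk is left unchanged.
# Assumption: the placeholder appears verbatim as "GCS_BUCKET" in the manifest.
render_manifest() {
  local manifest="$1" bucket="$2"
  if [ -z "$bucket" ]; then
    echo "bucket name must not be empty" >&2
    return 1
  fi
  # "|" is used as the sed delimiter so dots and dashes in the name are safe.
  sed "s|GCS_BUCKET|${bucket}|g" "$manifest"
}

# Example (requires cluster access; "my-sax-bucket" is a hypothetical name):
#   render_manifest docs/examples/saxml/lws.yaml my-sax-bucket | kubectl apply -f -
```

Piping the rendered manifest into `kubectl apply -f -` keeps the checked-in file untouched, which avoids accidentally committing a bucket name.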
Verify the status of the SaxML Deployment
kubectl get pods
You should see output similar to this:
NAME READY STATUS RESTARTS AGE
saxml-multi-host-0 3/3 Running 0 3m12s
saxml-multi-host-0-1 1/1 Running 0 3m12s
saxml-multi-host-0-2 1/1 Running 0 3m12s
saxml-multi-host-0-3 1/1 Running 0 3m12s
saxml-multi-host-0-4 1/1 Running 0 3m12s
saxml-multi-host-0-5 1/1 Running 0 3m12s
saxml-multi-host-0-6 1/1 Running 0 3m12s
saxml-multi-host-0-7 1/1 Running 0 3m12s
saxml-multi-host-1 3/3 Running 0 3m12s
saxml-multi-host-1-1 1/1 Running 0 3m12s
saxml-multi-host-1-2 1/1 Running 0 3m12s
saxml-multi-host-1-3 1/1 Running 0 3m12s
saxml-multi-host-1-4 1/1 Running 0 3m12s
saxml-multi-host-1-5 1/1 Running 0 3m12s
saxml-multi-host-1-6 1/1 Running 0 3m12s
saxml-multi-host-1-7 1/1 Running 0 3m12s
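Checking sixteen rows by eye is error-prone, so the same check can be automated. The sketch below parses the table printed by `kubectl get pods` and succeeds only when every pod is Running with all containers ready. The function name `all_pods_ready` is hypothetical, and the parsing assumes the default column order shown above (NAME, READY, STATUS).

```shell
# Read `kubectl get pods` table output on stdin. Exit 0 only if at least one
# pod is listed and every pod has STATUS "Running" with all containers ready.
all_pods_ready() {
  awk '
    NR == 1 { next }                 # skip the header row
    NF == 0 { next }                 # ignore blank lines
    {
      seen++
      split($2, r, "/")              # READY column has the form ready/total
      if ($3 != "Running" || r[1] != r[2]) {
        print "not ready: " $1 > "/dev/stderr"
        bad++
      }
    }
    END { exit ((seen == 0 || bad > 0) ? 1 : 0) }
  '
}

# Example (requires cluster access):
#   kubectl get pods | all_pods_ready && echo "all pods are ready"
```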
Use SaxML
Access LoadBalancer
Wait for the Service to be assigned an external IP address:
kubectl get svc
The output should be similar to the following:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
saxml LoadBalancer 10.68.56.41 10.182.0.187 8888:31876/TCP 56s
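Because the external IP can take a while to appear, the wait can be expressed as a polling loop. This is a rough sketch rather than part of the example manifests: the function name `wait_for_lb_ip`, the retry count, and the delay are all assumptions, and the Service name is passed as an argument so that it can match whatever name `kubectl get svc` reports.

```shell
# Poll a Service until its load balancer reports an IP address, then print it.
# Arguments: service name, max attempts (default 30), delay in seconds (default 10).
wait_for_lb_ip() {
  local svc="$1" attempts="${2:-30}" delay="${3:-10}" ip="" i=0
  while [ "$i" -lt "$attempts" ]; do
    # An empty result means no ingress IP has been assigned yet.
    ip=$(kubectl get svc "$svc" -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null)
    if [ -n "$ip" ]; then
      printf '%s\n' "$ip"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for an external IP on service ${svc}" >&2
  return 1
}

# Example (requires cluster access):
#   LB_IP=$(wait_for_lb_ip saxml) || exit 1
```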
Serve the Model
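Retrieve the load balancer IP address for SaxML. The command below queries a Service named sax-http-lb; if kubectl get svc lists the Service under a different name, use that name instead: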
LB_IP=$(kubectl get svc sax-http-lb -o jsonpath='{.status.loadBalancer.ingress[*].ip}')
PORT="8888"
Send a request:
curl --request POST \
  --header "Content-type: application/json" \
  -s ${LB_IP}:${PORT}/generate --data \
  '{
    "model": "/sax/cell/lmcloudspmd175b32test",
    "query": "How many days are in a week?"
  }'
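When the query comes from a variable rather than a hard-coded string, quotes inside it can break the JSON body. The sketch below builds the same two-field payload used in the request above and escapes backslashes and double quotes. The helper names `json_escape` and `build_payload` are hypothetical, and control characters such as newlines are deliberately not handled, so a dedicated JSON tool is a better fit for arbitrary input.

```shell
# Escape backslashes and double quotes so a value can sit inside a JSON string.
# Limitation: control characters such as newlines and tabs are NOT escaped.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the request body with the "model" and "query" fields.
build_payload() {
  local model="$1" query="$2"
  printf '{"model": "%s", "query": "%s"}' \
    "$(json_escape "$model")" "$(json_escape "$query")"
}

# Example (requires the LB_IP and PORT variables set earlier):
#   curl --request POST --header "Content-type: application/json" \
#     -s "${LB_IP}:${PORT}/generate" \
#     --data "$(build_payload "/sax/cell/lmcloudspmd175b32test" "How many days are in a week?")"
```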