llama.cpp

An example of using llama.cpp with LWS

Deploy Distributed Inference Service with llama.cpp

In this example, we will use LeaderWorkerSet to deploy a distributed inference service using llama.cpp. llama.cpp began as a project to support CPU-only inference on a single node, but has since expanded to support accelerators and distributed inference.

Deploy LeaderWorkerSet of llama.cpp

We use LeaderWorkerSet to deploy a llama.cpp leader and four llama.cpp workers. The leader pod loads the model and distributes layers to the workers; the workers perform the majority of the computation.
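The manifest applied below (k8s/lws.yaml) wires this up with the LeaderWorkerSet API. As a rough sketch of its shape only: the names, image, ports, and worker-address wiring here are illustrative, not the actual manifest:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llamacpp-llama3-8b-instruct-bartowski-q5km
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 5                        # 1 leader + 4 workers per group
    leaderTemplate:
      spec:
        containers:
        - name: leader
          image: llamacpp:latest   # illustrative; the example builds llama.cpp from source
          # llama-server loads the model and offloads layers to the
          # workers' rpc-server endpoints listed via --rpc.
          command:
          - llama-server
          - --model
          - /models/model.gguf
          - --host
          - "0.0.0.0"
          - --port
          - "8080"
          - --rpc
          - $(WORKER_ADDRESSES)    # hypothetical stand-in; the real manifest
                                   # assembles the comma-separated worker list
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: llamacpp:latest   # illustrative
          command:
          - rpc-server
          - --host
          - "0.0.0.0"
          - --port
          - "50052"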

Because the default configuration uses CPU-only inference, this example can run on a kind cluster.

Create a kind cluster:

kind create cluster
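If kind is your only cluster, kubectl should now point at it; you can confirm with:

kubectl cluster-info --context kind-kind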

Install LeaderWorkerSet; full details are in the installation guide, but running the e2e test suite against the existing cluster is an easy way to get started in development:

USE_EXISTING_CLUSTER=true KIND_CLUSTER_NAME=kind make test-e2e
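Alternatively, outside of a development workflow, you can install a released version directly. The exact command and current version number are in the installation guide; this is the general shape:

LWS_VERSION=v0.6.0  # substitute the current release
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/$LWS_VERSION/manifests.yaml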

Finally, deploy the llama.cpp example to kind with this script:

./docs/examples/dev/tasks/run-in-kind

After a lot of downloading and building (we build llama.cpp from source and download a model), you should see our example clichat pod running; it communicates with the llama.cpp Service to give a basic LLM chat experience:

...
+ kubectl apply --server-side -f k8s/lws.yaml
leaderworkerset.leaderworkerset.x-k8s.io/llamacpp-llama3-8b-instruct-bartowski-q5km serverside-applied
service/llamacpp serverside-applied
+ kubectl run clichat --image=clichat:latest --rm=true -it --image-pull-policy=IfNotPresent --env=LLM_ENDPOINT=http://llamacpp:80
If you don't see a command prompt, try pressing enter.
What is the capital of France?
        The capital of France is Paris.

(Feel free to send your own chat messages, not just "What is the capital of France?")
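The clichat pod is just a thin client: llama.cpp's server exposes an OpenAI-compatible HTTP API, so you can also talk to the Service directly. For example, assuming the default /v1/chat/completions endpoint of llama-server, port-forward and curl:

kubectl port-forward service/llamacpp 8080:80

curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages": [{"role": "user", "content": "What is the capital of France?"}]}'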

If you want to see the pods:

> kubectl get pods
NAME                                             READY   STATUS    RESTARTS   AGE
llamacpp-llama3-8b-instruct-bartowski-q5km-0     1/1     Running   0          2m42s
llamacpp-llama3-8b-instruct-bartowski-q5km-0-1   1/1     Running   0          2m42s
llamacpp-llama3-8b-instruct-bartowski-q5km-0-2   1/1     Running   0          2m42s
llamacpp-llama3-8b-instruct-bartowski-q5km-0-3   1/1     Running   0          2m42s
llamacpp-llama3-8b-instruct-bartowski-q5km-0-4   1/1     Running   0          2m42s
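The pod ending in -0 is the leader; the pods ending in -0-1 through -0-4 are the four workers (the trailing number is the worker index within group 0). LWS also records these roles as pod labels, so you can list them explicitly (assuming the label keys current LWS releases apply):

kubectl get pods -L leaderworkerset.sigs.k8s.io/worker-index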

To see the endpoints for the services:

> kubectl get endpoints
NAME                                         ENDPOINTS                                         AGE
kubernetes                                   192.168.8.3:6443                                  16m
llamacpp                                     10.244.0.90:8080                                  4m
llamacpp-llama3-8b-instruct-bartowski-q5km   10.244.0.90,10.244.0.91,10.244.0.92 + 2 more...   3m59s

The service llamacpp-llama3-8b-instruct-bartowski-q5km is the headless service that backs the StatefulSet created by the LeaderWorkerSet; it gives each pod a stable network identity.

The service llamacpp is the one clients should use; it targets the leader pod only. The leader then distributes work to the workers.
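You can confirm this by matching IPs: the single llamacpp endpoint above (10.244.0.90:8080) should be the leader pod's IP, while the headless service lists all five pods:

kubectl get pods -o wide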
