SkyPilot
In this guide, we'll show you how to deploy the NOS inference server using SkyPilot on any of the popular Cloud Service Providers (CSPs) such as AWS, GCP, or Azure. We'll use GCP as an example, but the steps are similar for other CSPs.
What is SkyPilot?
"SkyPilot is a framework for running LLMs, AI, and batch jobs on any cloud, offering maximum cost savings, highest GPU availability, and managed execution." - SkyPilot Documentation.
👩‍💻 Prerequisites
You'll first need to install SkyPilot in your virtual environment or conda environment. Before getting started, we recommend you go through the SkyPilot quickstart to familiarize yourself with the tool.
If you're installing SkyPilot for use with other cloud providers, you may install any of the relevant extras, e.g. skypilot[aws,gcp,azure].
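For example, to install SkyPilot with GCP support:

# Install SkyPilot with the GCP extra
$ pip install "skypilot[gcp]"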
[OPTIONAL] Configure cloud credentials
Run sky check to verify that your cloud credentials are set up correctly, and refer to the SkyPilot documentation for more details and installation instructions.
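For GCP, a typical credential setup looks like the following sketch (this assumes you already have the gcloud CLI installed; consult the SkyPilot cloud setup docs for the authoritative steps):

# Authenticate with GCP and set up application-default credentials
$ gcloud init
$ gcloud auth application-default login

# Verify which clouds SkyPilot can access
$ sky check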
📦 Deploying NOS on GCP
1. Define your SkyPilot deployment YAML
First, let's create a sky.yaml file with the following configuration.
# NOS GPU server deployment on T4
# Usage: sky launch -c nos-server sky.yaml
name: nos-server

resources:
  accelerators: T4:1
  cloud: gcp
  ports:
    - 8000
    - 50051

setup: |
  # Setup conda environment
  conda init bash
  conda create -n nos python=3.10 -y
  conda activate nos

  # Install docker compose plugin
  sudo apt-get install -y docker-compose-plugin

  # Install torch-nos
  pip install torch-nos

run: |
  # Activate conda environment
  conda activate nos

  # Run the server (gRPC + HTTP)
  nos serve up --http
Here, we are going to provision a single GPU server on GCP with an NVIDIA T4 GPU and expose ports 8000 (REST) and 50051 (gRPC) for the NOS server.
🚀 2. Launch your NOS server
Now, we can launch our NOS server on GCP with the following command:
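$ sky launch -c nos-server sky.yaml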
That's it! You should see the following output:
$ sky launch -c nos-server sky.yaml --cloud gcp
Task from YAML spec: sky.yaml
I 01-16 09:41:18 optimizer.py:694] == Optimizer ==
I 01-16 09:41:18 optimizer.py:705] Target: minimizing cost
I 01-16 09:41:18 optimizer.py:717] Estimated cost: $0.6 / hour
I 01-16 09:41:18 optimizer.py:717]
I 01-16 09:41:18 optimizer.py:840] Considered resources (1 node):
I 01-16 09:41:18 optimizer.py:910] ---------------------------------------------------------------------------------------------
I 01-16 09:41:18 optimizer.py:910] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 01-16 09:41:18 optimizer.py:910] ---------------------------------------------------------------------------------------------
I 01-16 09:41:18 optimizer.py:910] GCP n1-highmem-4 4 26 T4:1 us-central1-a 0.59 ✔
I 01-16 09:41:18 optimizer.py:910] ---------------------------------------------------------------------------------------------
I 01-16 09:41:18 optimizer.py:910]
Launching a new cluster 'nos-server'. Proceed? [Y/n]: y
I 01-16 09:41:25 cloud_vm_ray_backend.py:4508] Creating a new cluster: 'nos-server' [1x GCP(n1-highmem-4, {'T4': 1}, ports=['8000', '50051'])].
I 01-16 09:41:25 cloud_vm_ray_backend.py:4508] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 01-16 09:41:26 cloud_vm_ray_backend.py:1474] To view detailed progress: tail -n100 -f /home/spillai/sky_logs/sky-2024-01-16-09-41-16-157615/provision.log
I 01-16 09:41:29 cloud_vm_ray_backend.py:1912] Launching on GCP us-central1 (us-central1-a)
I 01-16 09:44:36 log_utils.py:45] Head node is up.
I 01-16 09:45:43 cloud_vm_ray_backend.py:1717] Successfully provisioned or found existing VM.
I 01-16 09:46:00 cloud_vm_ray_backend.py:4558] Processing file mounts.
I 01-16 09:46:00 cloud_vm_ray_backend.py:4590] To view detailed progress: tail -n100 -f ~/sky_logs/sky-2024-01-16-09-41-16-157615/file_mounts.log
I 01-16 09:46:00 backend_utils.py:1459] Syncing (to 1 node): ./app -> ~/.sky/file_mounts/app
I 01-16 09:46:05 cloud_vm_ray_backend.py:3315] Running setup on 1 node.
...
...
...
(nos-server, pid=12112) Status: Downloaded newer image for autonomi/nos:0.1.4-gpu
(nos-server, pid=12112) docker.io/autonomi/nos:0.1.4-gpu
(nos-server, pid=12112) 2024-01-16 17:49:09.415 | INFO | nos.server:_pull_image:235 - Pulled new server image: autonomi/nos:0.1.4-gpu
(nos-server, pid=12112) ✓ Successfully generated docker-compose file
(nos-server, pid=12112) (filename=docker-compose.sky_workdir.yml).
(nos-server, pid=12112) ✓ Launching docker compose with command: docker compose -f
(nos-server, pid=12112) /home/gcpuser/.nos/tmp/serve/docker-compose.sky_workdir.yml up
(nos-server, pid=12112) Container serve-nos-server-1 Creating
(nos-server, pid=12112) Container serve-nos-server-1 Created
(nos-server, pid=12112) Container serve-nos-http-gateway-1 Creating
(nos-server, pid=12112) Container serve-nos-http-gateway-1 Created
(nos-server, pid=12112) Attaching to serve-nos-http-gateway-1, serve-nos-server-1
(nos-server, pid=12112) serve-nos-server-1 | Starting server with OMP_NUM_THREADS=4...
(nos-server, pid=12112) serve-nos-http-gateway-1 | WARNING: Current configuration will not reload as not all conditions are met, please refer to documentation.
(nos-server, pid=12112) serve-nos-server-1 | ✓ InferenceExecutor :: Connected to backend.
(nos-server, pid=12112) serve-nos-server-1 | ✓ Starting gRPC server on [::]:50051
(nos-server, pid=12112) serve-nos-server-1 | ✓ InferenceService :: Deployment complete (elapsed=0.0s)
(nos-server, pid=12112) serve-nos-http-gateway-1 | INFO: Started server process [1]
(nos-server, pid=12112) serve-nos-http-gateway-1 | INFO: Waiting for application startup.
(nos-server, pid=12112) serve-nos-http-gateway-1 | INFO: Application startup complete.
(nos-server, pid=12112) serve-nos-http-gateway-1 | INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
🔋 3. Check the status of your NOS server
You can check the status of your NOS server with the following command:
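$ sky status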
You should see the following output:
NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
nos-server 1 min ago 1x GCP(n1-highmem-4, {'T4': 1}, ports=[8000, 50051]) UP - sky launch -c nos-server-...
Congratulations! You've successfully deployed your NOS server on GCP. You can now access the NOS server from your local machine at <ip>:8000 or <ip>:50051. In a new terminal, let's check the health of our NOS server with the following command:
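# Query the HTTP gateway's health endpoint (assuming the default /v1/health route)
$ curl http://$(sky status --ip nos-server):8000/v1/health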
If the server is healthy, the endpoint will return a successful response.
💬 4. Chat with your hosted LLM endpoint
You can now chat with your hosted LLM endpoint. Since NOS exposes an OpenAI-compatible API via its /v1/chat/completions route, you can use any OpenAI-compatible client to chat with your hosted LLM endpoint.
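For instance, a raw HTTP request against the endpoint might look like the following sketch (assuming the standard OpenAI chat completions request schema):

# Send a chat completion request directly to the NOS HTTP gateway
$ curl http://$(sky status --ip nos-server):8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", "messages": [{"role": "user", "content": "Tell me a joke."}]}'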
Below, we show how you can use any OpenAI API-compatible client to chat with your hosted LLM endpoint. First, we'll use the popular llm CLI tool from Simon Willison.
# Install the llm CLI tool
$ pip install llm
# Install the llm-nosrun plugin to talk to your service
$ llm install llm-nosrun
# List the models
$ llm models list
# Chat with your endpoint
$ NOSRUN_API_BASE=http://$(sky status --ip nos-server):8000/v1 llm -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 "Tell me a joke in 300 words."
Note: You can also change the NOSRUN_API_BASE to http://localhost:8000/v1 to talk to your local NOS server.
Below, we show how you can use the OpenAI Python Client to chat with your hosted LLM endpoint.
import subprocess
import openai
# Get the output of `sky status --ip nos-server` with subprocess
address = subprocess.check_output(["sky", "status", "--ip", "nos-server"]).decode("utf-8").strip()
print(f"Using address: {address}")
# Create a stream and print the output
client = openai.OpenAI(api_key="no-key-required", base_url=f"http://{address}:8000/v1")
stream = client.chat.completions.create(
model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
messages=[{"role": "user", "content": "Tell me a joke in 300 words"}],
stream=True,
)
for chunk in stream:
print(chunk.choices[0].delta.content or "", end="")
On the first call to the server, the server will download the model from Hugging Face, cache it locally, and load it onto the GPU. Subsequent calls will not incur this overhead, as the GPU memory for the model will be pinned.
🛑 5. Stop / Terminate your NOS server
Once you're done using your server, you can stop it with the following command:
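$ sky stop nos-server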
Alternatively, you can terminate your server with the following command:
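$ sky down nos-server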
This will terminate the server and all associated resources (VMs, disks, etc.).