# 📦 NOS Tutorials
The following tutorials give a brief overview of how to use NOS to serve models.
- `01-serving-custom-models`: Serve a custom GPU model with NOS.
- `02-serving-multiple-methods`: Expose several custom methods of a model for serving.
- `03-llm-streaming-chat`: Serve an LLM with streaming support (`TinyLlama/TinyLlama-1.1B-Chat-v0.1`).
- `04-serving-multiple-models`: Serve multiple models such as `TinyLlama/TinyLlama-1.1B-Chat-v0.1` and `distil-whisper/distil-small.en` on the same GPU with custom model resources, enabling multi-modal applications like audio transcription + summarization on the same device.
- `05-serving-with-docker`: Use NOS in a production environment with Docker and Docker Compose.
## 🏃‍♂️ Running the examples
For each of the examples, you can run the following command to serve the model (in one of your terminals):
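A minimal sketch, assuming the `serve up` subcommand and a per-tutorial `serve.yaml` specification as in the NOS quickstart (adjust to the spec file each tutorial ships):

```bash
# Start the NOS server with the tutorial's serve specification
nos serve up -c serve.yaml
```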
You can then run the tests in the `tests` directory to verify that the model is served correctly:
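For example, assuming the tests are written with pytest (a hypothetical invocation; substitute whichever runner the tutorial actually uses):

```bash
# Run the tutorial's tests against the running server
pytest -sv tests
```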
For HTTP tests, you'll need to add the `--http` flag to the `nos serve` command:
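With the same assumed `serve.yaml`, that looks like:

```bash
# Start the server with the HTTP gateway enabled alongside gRPC
nos serve up -c serve.yaml --http
```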