# Serving LLMs with streaming support
This tutorial shows how to serve an LLM with streaming support.
## Serve the model
The `serve.yaml` file specifies the custom image used to build the Docker runtime image and serve the model inside that custom runtime. You can serve the model via the NOS CLI, as sketched below.
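The exact contents of `serve.yaml` depend on the model being served; the sketch below only illustrates the general shape of such a specification. The image name, pip dependencies, model id, and file paths here are assumptions rather than values from this example, so replace them with the ones in the repository.

```yaml
# serve.yaml -- illustrative sketch; the names and values below are
# assumptions and should be replaced with the example's actual spec.
images:
  llm-gpu:                          # custom runtime image
    base: autonomi/nos:latest-gpu   # NOS GPU base image
    pip:
      - transformers
      - accelerate

models:
  llm-streaming-chat:               # id the model is registered under
    model_cls: StreamingChat        # hypothetical model class
    model_path: models/llm.py       # hypothetical path to the model code
    default_method: chat
    runtime_env: llm-gpu            # run inside the custom runtime above
```

With the specification in place, bring up the server (the `--http` flag is assumed here to also expose the REST gateway used in the later sections):

```bash
nos serve up -c serve.yaml --http
```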
## Run the tests (via the gRPC client)
You can now run the tests to check that the model is served correctly; a minimal streaming call via the gRPC client is sketched below.
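If you want to exercise the endpoint directly rather than through the test suite, a streaming call through the NOS Python gRPC client looks roughly like this sketch. The server address is the default gRPC port, while the model id, the `chat` method, and the `_stream` flag are assumptions to be checked against the example's client code.

```python
from nos.client import Client

# Connect to the NOS gRPC server (default address assumed).
client = Client("[::]:50051")
client.WaitForServer()

# Handle to the served LLM; the model id is illustrative.
model = client.Module("llm-streaming-chat")

# Stream tokens as they are generated; the method name, its parameters,
# and the streaming flag are assumptions based on the tutorial's description.
for token in model.chat(message="What is the capital of France?", _stream=True):
    print(token, end="", flush=True)
```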
## Run the tests (via the REST/HTTP client)
You can also run the tests to check that the model is served correctly via the REST API; an equivalent streaming request from Python is sketched below.
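The sketch below shows the same check from Python over HTTP. It assumes the gateway listens on `localhost:8000` and that the `/chat/completions` route streams OpenAI-style server-sent events; the host, port, and model id are assumptions to adjust for your deployment.

```python
import json

import requests

# Address of the NOS HTTP gateway; host, port, and route are assumptions.
url = "http://localhost:8000/chat/completions"

payload = {
    "model": "llm-streaming-chat",  # illustrative model id
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "stream": True,
}

# Read the streamed response line by line (OpenAI-style SSE chunks).
with requests.post(url, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for raw_line in resp.iter_lines():
        if not raw_line:
            continue
        line = raw_line.decode("utf-8")
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content", "")
        print(delta, end="", flush=True)
```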
## Use cURL to call the model (via the REST API)
NOS also exposes an OpenAI API-compatible endpoint for these custom LLM models. You can call the model via the `/chat/completions` route, as in the cURL sketch below.
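A streaming request against that route might look like the following; the host, port, and model name are assumptions and should match your deployment (`-N` disables cURL's output buffering so tokens appear as they stream in).

```bash
curl -N http://localhost:8000/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llm-streaming-chat",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "stream": true
      }'
```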