vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving. Equipped with PagedAttention, vLLM sets a new state of the art in LLM serving: it delivers up to 24x higher throughput than HuggingFace Transformers without requiring any model architecture changes.
As a bonus, vLLM is fully open source and released under the Apache 2.0 License.
For this tutorial, we will use vLLM to host the Mistral AI 7B model, Mistral-7B-v0.1. We will also show two different ways to start the server: one uses the pre-built vLLM Docker image, the other runs a Python script directly.
Step 0 (optional). Run locally
It is usually good practice to run code locally before deploying it to the cloud, because local iteration makes development and debugging faster. To run vLLM locally (this requires an NVIDIA GPU and the NVIDIA Container Toolkit), use the following command.
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model mistralai/Mistral-7B-v0.1
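Alternatively, if you prefer the Python-script route mentioned above, vLLM exposes the same OpenAI-compatible server as a module entry point. A minimal sketch, assuming vllm is installed in your Python environment:
# Install vLLM (pulls in PyTorch and CUDA dependencies).
pip install vllm
# Start the OpenAI-compatible server on port 8000, serving the same model.
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-v0.1 \
--port 8000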
Once the server is up and running, we can query the model using the following command.
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-v0.1",
"prompt": "My favourite condiment is",
"max_tokens": 25
}'
vLLM provides an OpenAI-compatible API server. If you are interested, you can read the full API specification on the OpenAI website.
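As a quick sanity check that the server is up and serving the expected model, you can also list the available models through the OpenAI-compatible models endpoint:
curl http://localhost:8000/v1/models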
Step 1. Create a project
Now we will deploy the vLLM framework to EverlyAI.
Visit the Projects page and click the Create Project button. On the next page, enter the following configuration:
- docker image: vllm/vllm-openai:latest
- docker command: --model mistralai/Mistral-7B-v0.1
- docker port: 8000
- docker shared memory: 1G
and click the Create button.
Step 2. Query the model
After the project is created, wait until the instance status becomes RUNNING. We can then reuse the same command from Step 0: simply replace http://localhost:8000 with https://everlyai.xyz/endpoint and add your project API key to the request headers.
-curl http://localhost:8000/v1/completions \
+curl https://everlyai.xyz/endpoint/v1/completions \
 -H "Content-Type: application/json" \
+ -H 'Authorization: Bearer <API_KEY>' \
 -d '{
 "model": "mistralai/Mistral-7B-v0.1",
 "prompt": "My favourite condiment is",
 "max_tokens": 25
 }'
An example request is shown below. The API key in the example has been revoked, so please replace it with your own.
curl https://everlyai.xyz/endpoint/v1/completions \
-H "Content-Type: application/json" \
-H 'Authorization: Bearer epk_e04ARV9sF8iXuCDID1nBQUVWnieD2n8R' \
-d '{
"model": "mistralai/Mistral-7B-v0.1",
"prompt": "My favourite condiment is",
"max_tokens": 25
}'
{"id":"cmpl-cba820140e554176ad15c2b9ea54dcfe","object":"text_completion","created":546,"model":"mistralai/Mistral-7B-v0.1","choices":[{"index":0,"text":" the salad cream – quite like mayonnaise, but nicer. Or a the other way round. Depends","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":6,"total_tokens":31,"completion_tokens":25}}