
Huggingface Text Generation Inference

Huggingface TGI is a Rust, Python and gRPC server for text generation inference. It is used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints. In this tutorial, we will deploy a model on EverlyAI dedicated instances using the Huggingface TGI framework and query the hosted model. We will follow the example shown in the TGI Get Started section.

⚠️

If you plan to use TGI in production, check its license first to make sure your use case complies with it.

Step 0. (Optional) Run locally

It is usually good practice to run code locally before deploying it to the cloud, because local development and debugging are faster. To run TGI locally, use the following command.

docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference --model-id facebook/opt-1.3b

After the container is up and running, we can query the model using the following command.

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
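The same request can be made from Python. Below is a minimal sketch using only the standard library; the URL and parameters mirror the curl call above, and it assumes the container from the previous step is already running:

```python
import json
import urllib.request

def build_payload(prompt, max_new_tokens=20):
    """Build the JSON body expected by TGI's /generate endpoint."""
    return json.dumps({
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    })

def generate(base_url, prompt, max_new_tokens=20):
    """POST a generation request to a running TGI server and parse the reply."""
    req = urllib.request.Request(
        base_url + "/generate",
        data=build_payload(prompt, max_new_tokens).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Requires the local container to be up:
# print(generate("http://127.0.0.1:8080", "What is Deep Learning?"))
```

The response is a JSON object whose `generated_text` field holds the model output.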

Step 1. Create a project

Now we will deploy the TGI framework to EverlyAI.

Visit the [Projects](https://everlyai.xyz/projects) page and click the Create Project button. In the configuration section, enter the following.

  • docker image: ghcr.io/huggingface/text-generation-inference
  • docker command: --model-id facebook/opt-1.3b
  • docker shared memory size: 1G

and click the Create button.

Step 2. Query the model

After the project is created, wait until the instance's Status becomes RUNNING. We can then reuse the command from Step 0: replace 127.0.0.1:8080 with https://everlyai.xyz/endpoint and add your project API key to the request header.

curl https://everlyai.xyz/endpoint/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <API_KEY>'
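For scripted access, the same pattern works from Python with an Authorization header. A stdlib-only sketch (replace `<API_KEY>` with your own project key, as in the curl example above):

```python
import json
import urllib.request

ENDPOINT_URL = "https://everlyai.xyz/endpoint/generate"

def build_request(prompt, api_key, max_new_tokens=20):
    """Build an authenticated POST request for the hosted /generate endpoint."""
    body = json.dumps({
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
    )

# req = build_request("What is Deep Learning?", "<API_KEY>")
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["generated_text"])
```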

An example request is shown below. The API key in the example has been revoked; replace it with your own API key.

curl https://everlyai.xyz/endpoint/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer sk_FEr_n6Av3AYeuPN726afdb247be4421a3b5c97cff05ff51USGX0'
 
{"generated_text":"\n\nDeep learning is a branch of artificial intelligence that uses deep neural networks to learn from data."}