Managed Serving
Overview
EverlyAI Managed Serving simplifies the transition from development to production. It allows you to serve any model with minimal setup, from transformers and diffusers to PyTorch and TensorFlow to XGBoost and scikit-learn.
To maximize performance, safety, and flexibility, we employ an innovative serving stack. When you create a project for serving, we will bring up two components:
- An L4 Load Balancer for extremely fast and efficient load balancing.
- A set of machines/servers to run your code.
Your client talks directly to the servers that run your code, without hopping through any of our server stack, as shown in the figure below.
Key Features
1. Use any framework or HTTP endpoint
We do not force you to use a particular framework or endpoint. Instead, you can use any framework and define any endpoint. In addition, you can
- Serve multiple models together.
- Load models or model adapters on demand when serving user requests.
- Run your backend logic, AI agent logic, and model serving logic inside a single server to simplify your backend stack and save cost (see the sketch after this list).
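As an illustration of the patterns above, here is a minimal sketch of a single FastAPI server that lazy-loads several models on demand. This is not part of EverlyAI's API; the transformers dependency and the model-loading code are assumptions you would replace with your own frameworks and models.
from functools import lru_cache
from fastapi import FastAPI
from transformers import pipeline  # Assumed dependency; swap in your own loader.

app = FastAPI()

@lru_cache(maxsize=None)
def get_model(name: str):
    # Load a model the first time it is requested, then keep it cached in memory.
    return pipeline("text-classification", model=name)

@app.post("/classify/{model_name}")
def classify(model_name: str, text: str):
    # Serve multiple models behind one endpoint, loading each one on demand.
    return get_model(model_name)(text)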
2. Automatic failover
We will monitor the health of the servers and automatically fail over to another machine or data center. After that, we will update the load balancer configuration for you.
3. Auto scaling
You can configure the project's auto scaling settings to scale up when user traffic increases and scale down when it decreases. We support scaling down to zero to further optimize cost.
4. Low latency
Traffic flows from clients directly to the servers that run your code; there is no additional hop. In addition, we use an L4 load balancer to further minimize latency!
5. Secure by default
All communications are encrypted by default with HTTPS. The TLS certificates are fully managed for you.
6. Built-in privacy
Your users' requests do not flow through any of our internal stack. We do not have access to any of your user traffic; your data belongs entirely to you.
Get Started
As a first example, let's create a toy application with FastAPI. It will have two endpoints:
- A GET endpoint at /test
- A POST endpoint at /la/la
Step 1, implement server code
To do so, we will first create a file server.py with the following content.
from fastapi import FastAPI
app = FastAPI()
# Defines a GET endpoint.
@app.get('/test')
def test():
return "hayyy"
# Defines a POST endpoint.
@app.post('/la/la')
def lala():
return "You want to post?"
We will also need to create another file, everlyai_entrypoint.sh, to tell EverlyAI how to start the server. Note that your server must run on port 8000.
# Install the dependencies.
pip install fastapi
pip install uvicorn
# Start the server. EverlyAI requires it to listen on port 8000.
uvicorn server:app --port 8000 --host 0.0.0.0
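Before packaging, you can optionally sanity-check the server on your own machine. The snippet below is just a local smoke test, assuming the entrypoint script is already running locally and the requests package is installed:
import requests

# Assumes `bash everlyai_entrypoint.sh` is running in another terminal.
print(requests.get("http://localhost:8000/test").json())    # -> "hayyy"
print(requests.post("http://localhost:8000/la/la").json())  # -> "You want to post?"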
Now we have all the code ready. Run the command below to package it.
zip code.zip server.py everlyai_entrypoint.sh
Step 2, create an EverlyAI project
Create a project with the following configuration.
- Job type: Model Serving
- Code: Local code
- Upload zip file: select the zip created in step 1
- Enable project api validation: OFF
Step 3, use the endpoint
Wait until the instance transitions to the RUNNING state. Go to https://<public-domain>/docs. You will find the Swagger UI for the endpoints you just defined.
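Because project API validation is off, you can also call the endpoints directly without any key. A minimal sketch, assuming the requests package and using <public-domain> as a placeholder for your project's public domain:
import requests

# No API key is needed yet since project API validation is OFF.
resp = requests.get("https://<public-domain>/test")
print(resp.json())  # -> "hayyy"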
(Optional) Step 4, enable project API validation
If your code already has user management or access control support, you do not need to enable project API validation. Otherwise, you can turn it on so that only requests carrying your project API key are served.
Repeat steps 1 to 3, but in step 2, turn on Enable project api validation. If you go to the Swagger UI again, it will fail with Invalid API key.
To send requests to it, you need to embed your project API key in the request header. A curl example is shown below.
curl https://swjxhbcvot.everlyai.xyz/test \
  -H "Authorization: Bearer epk_p96Q209LZAjP8fI3jLdu9X611yNxIAzh"
"hayyy"