Auto Scaling
EverlyAI automatically adjusts the number of machines based on load. This helps you:
- Increase the number of machines when there is a traffic spike.
- Reduce the number of machines when traffic volume drops, to save cost.
This feature is only available for model serving, not for model training.
Horizontal scaling
At its most basic, EverlyAI scales on the ratio between the current metric value and the desired metric value:
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
For example, if the current metric value is 20 and the desired value is 10, the number of replicas will be doubled, since 20 / 10 == 2.0. If the current value is instead 5, the number of replicas will be halved, since 5 / 10 == 0.5. Because the result is rounded up, a fractional result always yields the next whole replica.
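The formula above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not EverlyAI's implementation; the function name `desired_replicas` and its parameter names are assumptions made for this example.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     desired_metric: float) -> int:
    """Hypothetical helper: ceil(currentReplicas * current / desired)."""
    if desired_metric <= 0:
        raise ValueError("desired_metric must be positive")
    return math.ceil(current_replicas * (current_metric / desired_metric))


# Metric at twice the target: replicas double.
print(desired_replicas(4, 20, 10))  # 8
# Metric at half the target: replicas halve.
print(desired_replicas(4, 5, 10))   # 2
# Fractional results are rounded up: 3 * 0.5 = 1.5 becomes 2.
print(desired_replicas(3, 5, 10))   # 2
```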
EverlyAI supports a few metrics, such as:
- CPU utilization
- Memory utilization
- GPU utilization
The currentMetricValue is computed by taking the average of the given metric across all instances in the project.
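As a rough sketch of how the averaged metric feeds the scaling formula, the snippet below averages per-instance readings and then computes a replica count. The sample readings, the target of 50, and the helper name `current_metric_value` are all assumptions made for illustration; they do not come from EverlyAI.

```python
import math
from statistics import mean


def current_metric_value(per_instance_values):
    """Hypothetical helper: average one metric across all instances."""
    if not per_instance_values:
        raise ValueError("need at least one instance reading")
    return mean(per_instance_values)


# Assumed CPU utilization readings (percent) for three instances.
cpu_readings = [90.0, 70.0, 80.0]
average = current_metric_value(cpu_readings)
print(average)  # 80.0

# Assumed desired CPU utilization of 50 percent.
desired = 50.0
replicas = math.ceil(len(cpu_readings) * (average / desired))
print(replicas)  # 5, since 3 * 1.6 = 4.8 is rounded up
```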
Scale to zero
Set the minimum number of machines to zero to have EverlyAI scale down to zero instances once there has been no user traffic for the number of seconds given by the scale to zero grace period.
You should set the scale to zero grace period to a relatively large value to reduce interruptions.
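The scale-to-zero rule can be pictured as a simple check. The sketch below is an assumption about the decision logic, not EverlyAI's actual code: the function name, its parameters, and the choice of `>=` for the comparison are all hypothetical.

```python
def should_scale_to_zero(min_machines: int,
                         idle_seconds: float,
                         grace_period_seconds: float) -> bool:
    """Hypothetical check: scale to zero only if the minimum is zero
    and there has been no traffic for the whole grace period."""
    return min_machines == 0 and idle_seconds >= grace_period_seconds


# Idle for 10 minutes with a 5 minute grace period: scale to zero.
print(should_scale_to_zero(0, 600, 300))  # True
# Idle for only 1 minute: keep the instances running.
print(should_scale_to_zero(0, 60, 300))   # False
# Minimum of one machine: never scale to zero.
print(should_scale_to_zero(1, 600, 300))  # False
```

Under this reading, a larger grace period means a brief lull in traffic is less likely to shut down every instance.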