Fault Tolerance

Instances may crash at any time. It is important to make your jobs fault tolerant so that no progress is lost when this happens.

Training Job

For training jobs, we highly recommend periodically saving checkpoints to EverlyAI File Storage and restoring from the latest checkpoint when the job restarts. For example, if you are using the Hugging Face Transformers Trainer library, you can follow the instructions at https://github.com/huggingface/transformers/blob/main/examples/pytorch/README.md#resuming-training to resume training; a sketch is shown below.
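
The sketch below shows one way this can look with the Transformers Trainer: checkpoints are written periodically to a directory on persistent storage, and on restart the script resumes from the most recent one if it exists. The mount path /mnt/everlyai, the model, and the dataset are illustrative assumptions, not EverlyAI defaults.

```python
# Minimal sketch: periodic checkpoints to persistent storage, resume on restart.
# The mount path below is a hypothetical EverlyAI File Storage location.
import os

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
from transformers.trainer_utils import get_last_checkpoint

CHECKPOINT_DIR = "/mnt/everlyai/checkpoints/my-run"  # assumed mount point

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Small illustrative dataset; replace with your own data pipeline.
dataset = load_dataset("imdb", split="train[:1%]")
train_dataset = dataset.map(
    lambda batch: tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=128
    ),
    batched=True,
)

args = TrainingArguments(
    output_dir=CHECKPOINT_DIR,
    save_strategy="steps",   # write a checkpoint every `save_steps` steps
    save_steps=500,
    save_total_limit=2,      # keep only the two most recent checkpoints
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)

# If the instance crashed and was restarted, pick up from the last checkpoint;
# otherwise start training from scratch.
last_checkpoint = (
    get_last_checkpoint(CHECKPOINT_DIR) if os.path.isdir(CHECKPOINT_DIR) else None
)
trainer.train(resume_from_checkpoint=last_checkpoint)
```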

Serving Job

No action is needed on your side. When a replacement instance starts, it runs your code again, which re-downloads the model weights. This download is free regardless of whether the weights are stored in EverlyAI File Storage or in an external storage solution.
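
As an illustration only, the sketch below shows a serving entrypoint that loads its weights at startup, so a restarted instance simply downloads them again before accepting traffic. The use of FastAPI, a Hugging Face pipeline, and the gpt2 model are assumptions; EverlyAI does not require a particular serving framework.

```python
# Minimal sketch of a serving entrypoint (assumed stack: FastAPI + Transformers).
# Every time the instance starts, this script runs from the top and the model
# weights are downloaded (or re-downloaded) before the server accepts traffic.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Weights are fetched at startup; no checkpointing or recovery logic is needed.
generator = pipeline("text-generation", model="gpt2")


class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 64


@app.post("/generate")
def generate(req: GenerateRequest):
    output = generator(req.prompt, max_new_tokens=req.max_new_tokens)
    return {"text": output[0]["generated_text"]}
```

Assuming the file is named server.py, it could be started with, for example, `uvicorn server:app --host 0.0.0.0 --port 8000`.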