Workers are the containerized environments that run your code on Runpod Serverless. After creating and testing your handler function, you need to package it into a Docker image and deploy it to an endpoint. This page provides an overview of the worker deployment process.

Create a Dockerfile

To deploy your worker to Runpod, you need to create a Dockerfile that packages your handler function and all its dependencies.
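A minimal Dockerfile for a Python worker might look like the following sketch. It assumes your handler lives in handler.py and your dependencies are listed in requirements.txt; adjust the base image, file names, and any CUDA or system dependencies to match your project.

```dockerfile
# Sketch only: assumes handler.py and requirements.txt at the project root.
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so this layer is reused when only code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the handler code
COPY handler.py .

# Start the handler; -u disables output buffering so logs stream immediately
CMD ["python", "-u", "handler.py"]
```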

Package and deploy a worker

Once you’ve created your Dockerfile, you can deploy the worker image to a Serverless endpoint using one of the following methods:

Deploy from Docker Hub

Build your Docker image locally, push it to Docker Hub (or another container registry), and deploy it to Runpod. This gives you full control over the build process and allows you to test the image locally before deployment.
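For example, assuming a Docker Hub account named yourusername and an image named my-worker (both hypothetical), a typical build-and-push flow looks like this. Building for linux/amd64 is a common precaution when building on ARM machines such as Apple Silicon, so the image matches the architecture of Runpod's hosts.

```bash
# Build the image for the linux/amd64 platform (image name is hypothetical)
docker build --platform linux/amd64 -t yourusername/my-worker:v1.0.0 .

# Push it to Docker Hub (or another container registry)
docker push yourusername/my-worker:v1.0.0
```

You can then reference yourusername/my-worker:v1.0.0 as the container image when creating your endpoint.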

Deploy from GitHub

Connect your GitHub repository to Runpod and deploy directly from your code. Runpod automatically builds the Docker image from your repository and deploys it to an endpoint. This streamlines the deployment process and enables continuous deployment workflows.

Worker storage options

Serverless offers two storage options that workers can use to store and access data while processing requests:
  • Container volumes: Temporary storage that exists only while a worker is running and is lost completely when the worker is stopped or scaled down.
  • Network volumes: Persistent storage that can be attached to different workers and even shared between multiple workers.
See Storage options for more details, and the sketch below for how a handler might use each.
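As a rough sketch of the difference inside a handler: temporary files can go anywhere on the container filesystem, while data that must survive worker restarts goes on the attached network volume. The /runpod-volume mount path below is the conventional location for a network volume on Serverless, but verify it against your endpoint's configuration; the rest of the example is hypothetical.

```python
import os

# Container volume: ordinary paths inside the container.
# Anything written here disappears when the worker stops.
SCRATCH_DIR = "/tmp/scratch"

# Network volume: persistent storage shared across workers.
# Assumption: the volume is mounted at /runpod-volume (check your endpoint settings).
VOLUME_DIR = "/runpod-volume"

def handler(job):
    os.makedirs(SCRATCH_DIR, exist_ok=True)

    # Intermediate files go to temporary container storage...
    with open(os.path.join(SCRATCH_DIR, "work.tmp"), "w") as f:
        f.write("intermediate data")

    # ...while anything that must persist goes to the network volume.
    if os.path.isdir(VOLUME_DIR):
        with open(os.path.join(VOLUME_DIR, "results.log"), "a") as f:
            f.write(f"processed job {job.get('id')}\n")

    return {"status": "ok"}
```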

Model deployment

To deploy your workers with AI/ML models, follow this order of preference:
  1. Use cached models: If your model is available on Hugging Face (public or gated), this is the recommended approach. Cached models provide the fastest cold starts, eliminate download costs, and persist across worker restarts.
  2. Bake the model into your Docker image: If your model is private and not available on Hugging Face, embed it directly in your worker’s container image using COPY or RUN wget (see the sketch after this list). This ensures the model is always available, but it increases image size and build time.
  3. Use network volumes: You can use network volumes to store models and other files that need to persist between workers. Models loaded from network storage are slower than cached or baked models, so only use this option when the preceding approaches don’t fit your needs.
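For option 2, a sketch of baking a model into the image might look like the following. The model URL and file names are placeholders, and wget must be installed explicitly because slim base images don't include it.

```dockerfile
# Sketch only: the model URL and file names below are placeholders.
FROM python:3.11-slim

WORKDIR /app

# Install wget, which the slim base image does not include
RUN apt-get update && apt-get install -y --no-install-recommends wget && \
    rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Download the model at build time so workers never fetch it at runtime.
# Alternatively, COPY a local checkpoint into the image:
#   COPY models/my-model.safetensors /app/models/
RUN mkdir -p /app/models && \
    wget -O /app/models/my-model.safetensors https://example.com/my-model.safetensors

COPY handler.py .
CMD ["python", "-u", "handler.py"]
```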

Worker configuration

When deploying workers, you can configure:
  • GPU/CPU types: Select specific GPU models for your workload.
  • GPU count: Set the number of GPUs for each worker.
  • Max workers: Set the maximum number of workers for your endpoint.
  • Container disk size: Allocate temporary storage for your worker. See Storage options.
  • Environment variables: Pass configuration values to your worker, as shown in the example below. See Environment variables for usage details.
  • Model caching: Pre-load models to reduce cold start times. See Cached models.
These settings are configured when you create or edit an endpoint.
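For example, a handler can read endpoint-level environment variables with standard library calls. The variable names here are hypothetical:

```python
import os

# Hypothetical variables set in the endpoint's configuration
MODEL_NAME = os.environ.get("MODEL_NAME", "default-model")
MAX_BATCH_SIZE = int(os.environ.get("MAX_BATCH_SIZE", "1"))

def handler(job):
    prompt = job["input"].get("prompt", "")
    return {
        "model": MODEL_NAME,
        "batch_size": MAX_BATCH_SIZE,
        "echo": prompt,
    }
```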

Active vs. flex workers

You can deploy workers in two modes:
  • Active workers: “Always on” workers that eliminate cold start delays. They never scale down, so you are charged as long as they are active, but they receive a discount (up to 30%) compared to flex workers. (Default: 0).
  • Flex workers: “Sometimes on” workers that scale during traffic surges. They transition to idle after completing jobs. (Default: max_workers - active_workers = 3).
The system may also add extra workers during traffic spikes on host servers where your Docker image is already cached. (Default: 2).

Worker states

Workers move through different states as they handle requests and respond to changes in traffic patterns. Understanding these states helps you monitor and troubleshoot your workers effectively.
  • Initializing: The worker starts up while the system downloads and prepares the Docker image. The container starts and loads your code.
  • Idle: The worker is ready but not processing requests. No charges apply while idle.
  • Running: The worker actively processes requests. Billing occurs per second.
  • Throttled: The worker is ready but temporarily unable to run due to host machine resource constraints.
  • Outdated: The system marks the worker for replacement after endpoint updates. It continues processing current jobs during rolling updates (10% of max workers at a time).
  • Unhealthy: The worker has crashed due to Docker image issues, incorrect start commands, or machine problems. The system automatically retries with exponential backoff for up to 7 days.
You can view the state of your workers using the Workers tab of the Serverless endpoint details page in the Runpod console. This page provides real-time information about each worker’s current state, resource utilization, and job processing history, allowing you to monitor performance and troubleshoot issues effectively.

Max worker limit

By default, each Runpod account can allocate a maximum of 5 workers (flex + active combined) across all endpoints. Maintaining a higher account balance raises this limit:
  • $100 balance: 10 max workers
  • $200 balance: 20 max workers
  • $300 balance: 30 max workers
  • $500 balance: 40 max workers
  • $700 balance: 50 max workers
  • $900 balance: 60 max workers
If your workload requires additional capacity beyond 60 workers, contact our support team.

Best practices

Follow these best practices when deploying workers:
  • Optimize image size: Smaller images download faster and reduce cold start times. See Create a Dockerfile for optimization techniques.
  • Use model caching: Pre-load models to avoid downloading them on every cold start. See Cached models.
  • Test locally first: Always test your handler locally before deploying. See Local testing.
  • Handle errors gracefully: Implement proper error handling to prevent worker crashes (see the sketch after this list). See Error handling.
  • Debug using logs and SSH: Use logs and SSH to debug and optimize your workers. See Monitor logs and SSH into workers.
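As a sketch of the error-handling practice, a handler can validate its input and catch exceptions, returning an error payload instead of letting the worker process crash. Returning a dictionary with an "error" key is the common convention for marking a job as failed; do_inference is a hypothetical stand-in for your model call.

```python
import runpod

def do_inference(prompt: str) -> str:
    # Placeholder for your actual model call
    return prompt.upper()

def handler(job):
    try:
        job_input = job["input"]
        if "prompt" not in job_input:
            # Validation failure: report it without crashing the worker
            return {"error": "Missing required field: 'prompt'"}

        return {"output": do_inference(job_input["prompt"])}
    except Exception as e:
        # Unexpected failure: surface the message in the job result
        return {"error": f"Unhandled exception: {e}"}

runpod.serverless.start({"handler": handler})
```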

Next steps