Inference & Serving

Valohai offers two distinct approaches to running inference, each suited to different use cases and latency requirements.

Which path fits your needs?

Batch Inference via Executions

Run inference jobs on datasets using Valohai's standard execution system.

Use this when:

  • You're processing large datasets or file batches

  • Predictions can take minutes or hours

  • You need to run scheduled or triggered inference jobs

What you'll build:

  • Python inference scripts

  • Standard Valohai executions with inputs and outputs

  • Scheduled or API-triggered batch jobs
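A minimal sketch of such an inference script, assuming a pickled model and a CSV dataset declared as step inputs named `model` and `data` (the names are illustrative). Valohai downloads declared inputs under `/valohai/inputs/<input-name>/` and uploads anything written to `/valohai/outputs/` as execution outputs:

```python
import glob
import json
import pickle

import pandas as pd

# Valohai places each declared input under /valohai/inputs/<input-name>/.
model_path = glob.glob("/valohai/inputs/model/*")[0]
data_path = glob.glob("/valohai/inputs/data/*")[0]

with open(model_path, "rb") as f:
    model = pickle.load(f)

# Score the whole batch in one pass.
data = pd.read_csv(data_path)
predictions = model.predict(data)

# Files written to /valohai/outputs/ are stored as execution outputs.
with open("/valohai/outputs/predictions.json", "w") as f:
    json.dump({"predictions": predictions.tolist()}, f)

print(f"Wrote {len(predictions)} predictions")
```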

Where to start: Run Batch Inference
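For the API-triggered variant, creating an execution through Valohai's REST API looks roughly like the sketch below. The token, project ID, commit, and step name are placeholders, and the exact payload fields should be confirmed against the Valohai API reference:

```python
import requests

response = requests.post(
    "https://app.valohai.com/api/v0/executions/",
    headers={"Authorization": "Token <your-api-token>"},
    json={
        "project": "<project-id>",
        "commit": "<commit-identifier>",
        "step": "batch-inference",  # step name from your valohai.yaml (illustrative)
    },
)
response.raise_for_status()
print("Execution created:", response.json())
```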


Real-Time Endpoints

Deploy models as RESTful APIs on Kubernetes for low-latency predictions.

Use this when:

  • You need predictions in milliseconds or seconds

  • Your application requires synchronous responses

  • You're building user-facing features or interactive systems

What you'll build:

  • FastAPI or Flask inference servers

  • Auto-scaling Kubernetes endpoints

  • Version-controlled deployment aliases
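A minimal FastAPI sketch of such an inference server; the model file name and the flat feature-list schema are assumptions made for illustration:

```python
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load the model once at startup so each request only pays for inference.
with open("model.pkl", "rb") as f:  # bundled with the deployment (illustrative name)
    model = pickle.load(f)


class PredictionRequest(BaseModel):
    features: list[float]


@app.post("/predict")
def predict(request: PredictionRequest):
    # Synchronous, low-latency prediction for a single payload.
    result = model.predict([request.features])
    return {"prediction": result.tolist()[0]}
```

Locally you can serve this with `uvicorn predict:app` (assuming the file is named predict.py); in a Valohai deployment the same app runs behind the auto-scaling Kubernetes endpoint.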

Where to start: Deploy a Real-Time Endpoint


Key Differences

| Factor | Real-Time Endpoints | Batch Executions |
| --- | --- | --- |
| Latency | Milliseconds to seconds | Minutes to hours |
| Infrastructure | Kubernetes cluster (you configure) | Valohai-managed VMs |
| API | You build RESTful APIs | Valohai REST API triggers jobs |
| Scaling | Kubernetes auto-scaling | VM-based execution queues |
| Code Requirements | Web framework (FastAPI, Flask) | Standard Python scripts |


Need Help Deciding?

Ask yourself: "Does my user need an answer right now?"

  • Yes → Real-Time Endpoints

  • No → Batch Executions

Still unsure? Check out our video overview or reach out to your Valohai representative.
