Inference & Serving
Valohai offers two distinct approaches for running inference, each designed for different use cases and latency requirements.
Which path fits your needs?
Batch Inference via Executions
Run inference jobs on datasets using Valohai's standard execution system.
Use this when:
You're processing large datasets or file batches
Predictions can take minutes or hours
You need to run scheduled or triggered inference jobs
What you'll build:
Python inference scripts (see the sketch after this section)
Standard Valohai executions with inputs and outputs
Scheduled or API-triggered batch jobs
Where to start: Run Batch Inference
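To make the batch path concrete, here is a minimal sketch of an inference script as it might run inside a Valohai execution. It assumes the standard Valohai convention of reading inputs from /valohai/inputs/<input-name>/ and writing results to /valohai/outputs/ so they are versioned as artifacts; the input names ("model", "data"), the joblib/scikit-learn model format, and the CSV layout are illustrative assumptions, not requirements.

```python
"""Batch inference over a directory of CSV files in a Valohai execution."""
import json
from pathlib import Path

import joblib  # assumed model format; swap for your framework's loader

# Valohai mounts execution inputs under /valohai/inputs/<input-name>/ and
# uploads anything written to /valohai/outputs/ as versioned artifacts.
MODEL_PATH = Path("/valohai/inputs/model/model.pkl")  # "model" is a hypothetical input name
DATA_DIR = Path("/valohai/inputs/data")               # "data" is a hypothetical input name
OUTPUT_PATH = Path("/valohai/outputs/predictions.jsonl")


def main():
    model = joblib.load(MODEL_PATH)
    with OUTPUT_PATH.open("w") as out:
        for csv_file in sorted(DATA_DIR.glob("*.csv")):
            # Assumes one comma-separated feature row per line; adapt to your data.
            rows = [
                [float(value) for value in line.split(",")]
                for line in csv_file.read_text().splitlines()
                if line.strip()
            ]
            predictions = model.predict(rows)
            out.write(json.dumps({"file": csv_file.name, "predictions": predictions.tolist()}) + "\n")


if __name__ == "__main__":
    main()
```

Because the script only reads from the inputs directory and writes to the outputs directory, the same code works whether the execution is launched manually, on a schedule, or via an API trigger.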
Real-Time Endpoints
Deploy models as RESTful APIs on Kubernetes for low-latency predictions.
Use this when:
You need predictions in milliseconds or seconds
Your application requires synchronous responses
You're building user-facing features or interactive systems
What you'll build:
FastAPI or Flask inference servers (see the sketch below)
Auto-scaling Kubernetes endpoints
Version-controlled deployment aliases
Where to start: Deploy a Real-Time Endpoint
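For the real-time path, the deployable unit is a small web server. Below is a minimal FastAPI sketch, assuming a scikit-learn style model loaded with joblib from a path bundled into the Docker image; the model path, request schema, and endpoint routes are illustrative assumptions.

```python
"""Minimal real-time inference server; run with: uvicorn server:app --host 0.0.0.0 --port 8000"""
import joblib  # assumed model format; swap for your framework's loader
from fastapi import FastAPI
from pydantic import BaseModel

MODEL_PATH = "model.pkl"  # hypothetical path, e.g. a model baked into the Docker image

# Load the model once at startup rather than on every request.
model = joblib.load(MODEL_PATH)

app = FastAPI()


class PredictRequest(BaseModel):
    features: list[float]


@app.post("/predict")
def predict(request: PredictRequest):
    prediction = model.predict([request.features])[0]
    return {"prediction": float(prediction)}


@app.get("/healthz")
def healthz():
    # Lightweight liveness/readiness check for the Kubernetes endpoint.
    return {"status": "ok"}
```

Loading the model at startup keeps per-request latency in the millisecond range, and the health route gives Kubernetes something cheap to probe when scaling replicas.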
Key Differences
| | Real-Time Endpoints | Batch Executions |
| --- | --- | --- |
| Latency | Milliseconds to seconds | Minutes to hours |
| Infrastructure | Kubernetes cluster (you configure) | Valohai-managed VMs |
| API | You build RESTful APIs | Valohai REST API triggers jobs |
| Scaling | Kubernetes auto-scaling | VM-based execution queues |
| Code requirements | Web framework (FastAPI, Flask) | Standard Python scripts |
Need Help Deciding?
Ask yourself: "Does my user need an answer right now?"
Yes → Real-Time Endpoints
No → Batch Executions
Still unsure? Check out our video overview or reach out to your Valohai representative.