Valohai inputs are data files, or collections of files, accessible during an execution, fetched from your cloud storage or public sources (HTTP/HTTPS)
Supported sources include AWS S3, Azure Storage, GCP Cloud Storage, on-prem S3, or any public link.
Valohai simplifies data handling:
- Manages authentication with your cloud storage.
- Handles downloading, uploading, and caching data.
- Eliminates the need to manage keys or authentication in your code.
Download location
- All datasets are downloaded to a local directory on Valohai machines (e.g.,
/valohai/inputs/
). - Each input has its own directory structure within this location (e.g.,
/valohai/inputs/images/
and/valohai/inputs/model/
). - Your code can always expect the data to be predownloaded on the machine, so you can treat it as local data.
Define an input
- In your
valohai.yaml
, each step can define one or multiple inputs, each containing one or more files. - You can set default values for inputs, which can be overridden when executing a new task. For instance, you can change the set of images for batch inference.
Syntax
You can use your object data stores address to point to a file.
- Azure Blob Storage:
azure://{account_name}/{container_name}/{blob_name}
- Google Storage:
gs://{bucket}/{key}
- Amazon S3:
s3://{bucket}/{key}
- OpenStack Swift:
swift://{project}/{container}/{key}
Wildcards
This syntax also has supported wildcard syntax to download multiple files:
s3://my-bucket/dataset/images/*.jpg
for all .jpg (JPEG) files \s3://my-bucket/dataset/image-sets/**.jpg
for recursing subdirectories for all .jpg (JPEG) files
Parameters in an URL
You can also interpolate execution parameters into input URIs:
s3://my-bucket/dataset/images/{parameter:user-id}/*.jpeg
would replace {parameter:user-id}
with the value of the parameter user-id
during an execution.
Download a folder
Use wildcard to download a whole folder to your inputs directory.
- step:
name: train
image: tensorflow/tensorflow:2.6.0
command: python train.py
inputs:
- name: model
default: s3://mybucket/models/model.pb
- name: images
default: s3://mybucket/images/*.jpg
keep-directories: suffix
keep-directories
keep-directories is used to define what folder structure should Valohai use in the inputs folder.
- none: (default) all files are downloaded to
/valohai/inputs/myinput
- full: keeps the full path from the storage root. For example
s3://special-bucket/foo/bar/**.jpg
could end up as/valohai/inputs/myinput/foo/bar/dataset1/a.jpg
- suffix: keeps the suffix from the “wildcard root”. For example
s3://special-bucket/foo/bar/*
the special-bucket/foo/bar/ would be removed, but any relative path after it would be kept, and you might end up with/valohai/inputs/myinput/dataset1/a.jpg
Code examples
Python
# Get the path to your individual inputs file
# e.g. /valohai/inputs/<input-name>/<filename.ext<
path_to_file = '/valohai/inputs/myinput/mydata.csv'
pd.read_csv(path_to_file)
Create a valohai.yaml configuration files and define your step in it:
- step:
name: train
image: tensorflow/tensorflow:2.6.1-gpu
command: python myfile.py
inputs:
- name: myinput
default: s3://bucket/mydata.csv
R
# Get the location of Valohai inputs directory
vh_inputs_dir <- Sys.getenv("VH_INPUTS_DIR", unset = ".inputs")
# Get the path to your individual inputs file
# e.g. /valohai/inputs/<input-name>/<filename.ext>
path_to_file <- file.path(vh_inputs_dir, "myinput/mydata.csv")
import_df <- read.csv(path_to_file, stringsAsFactors = F)
- step:
name: train
image: tensorflow/tensorflow:2.6.1-gpu
command: python myfile.py
inputs:
- name: myinput
default: s3://bucket/mydata.csv
Python with valohai-utils
import valohai
# Define inputs available for this step and their default location
# The default location can be overridden when you create a new execution (UI, API or CLI)
default_inputs = {
'myinput': 's3://bucket/mydata.csv'
}
# Create a step 'train' in valohai.yaml with a set of inputs
valohai.prepare(step="train", image="tensorflow/tensorflow:2.6.1-gpu", default_inputs=default_inputs)
# Open the CSV file from Valohai inputs
with open(valohai.inputs("myinput").path()) as csv_file:
reader = csv.reader(csv_file, delimiter=',')
Generate or update your existing valohai.yaml file with:
vh yaml step myfile.py