Inputs: Access Your Data

Valohai handles secure access to your files in object storage services such as AWS S3, Azure Blob Storage, and Google Cloud Storage.

💡 Your data stays where it is. Valohai downloads files only when needed and manages caching automatically.

How Inputs Work

  1. Configure data store access once (project or organization level)

  2. Define inputs in valohai.yaml with cloud URLs

  3. Access files locally under /valohai/inputs/; no download code needed

Valohai handles authentication, parallel downloads, and caching behind the scenes.

Quick Example

Define in valohai.yaml

- step:
    name: train-model
    image: tensorflow/tensorflow:2.6.0
    command:
        - python train_model.py
    inputs:
        - name: images
          default:
          - s3://mybucket/factories/images/*.png
          keep-directories: suffix
        - name: model
          default: datum://production-latest
          filename: model.pkl

The inputs will be downloaded to /valohai/inputs/.

In the above case, you’ll find your:

  • images in /valohai/inputs/images/, with the folder structure from your object store kept intact (thanks to keep-directories: suffix).

  • model in /valohai/inputs/model/, always saved as model.pkl (thanks to filename: model.pkl).

Use in Python

In Python you’ll access these files like any other file, as they’ll be available locally on the machine.

import os
from PIL import Image

# Path to downloaded model
model_path = '/valohai/inputs/model/model.pkl'

# Path to the directory where the images are downloaded
images_directory = '/valohai/inputs/images/'

# Loop through images in the directory
for root, dirs, files in os.walk(images_directory):
    for filename in files:
        image_path = os.path.join(root, filename)

        # Open and fully load the image into memory
        image = Image.open(image_path)
        image.load()

That's it. No boto3, no credentials, no download loops.

Optional: Use the valohai-utils Python helper tool

The valohai-utils helper library offers a simpler syntax:

import valohai
import os
from PIL import Image

# Path to downloaded model
model_path = valohai.inputs("model").path()

# Path to the directory where the images are downloaded
images_directory = valohai.inputs("images").dir_path()

# Loop through images in the directory
for root, dirs, files in os.walk(images_directory):
    for filename in files:
        image_path = os.path.join(root, filename)

        # Open and fully load the image into memory
        image = Image.open(image_path)
        image.load()
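
valohai-utils can also declare the step and its inputs directly in code. Below is a minimal sketch mirroring the earlier example with valohai.prepare; valohai-cli can then generate the matching valohai.yaml entry (e.g. with vh yaml step train_model.py):

import valohai

# Declare the step, image, and default inputs in code instead of YAML
valohai.prepare(
    step='train-model',
    image='tensorflow/tensorflow:2.6.0',
    default_inputs={
        'images': ['s3://mybucket/factories/images/*.png'],
        'model': 'datum://production-latest',
    },
)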

Common Patterns

Multiple Files with Wildcards

inputs:
    - name: images
      default:
        - s3://mybucket/train/images/*.jpg
        - s3://mybucket/train/images/*.png

All matching files download to /valohai/inputs/images/.
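
Inside the execution you can then collect everything from the single input directory; a minimal sketch, assuming the input name images from above:

import glob

# Both .jpg and .png files from all default URLs land in the same directory
image_paths = sorted(
    glob.glob('/valohai/inputs/images/*.jpg')
    + glob.glob('/valohai/inputs/images/*.png')
)
print(f'Found {len(image_paths)} images')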

Multiple Cloud Sources

inputs:
    - name: data
      default:
        - s3://aws-bucket/data/*.parquet
        - azure://mycontainer/data/*.parquet
        - gs://gcs-bucket/data/*.parquet

Mix and match storage providers in one input. All the files will be downloaded under /valohai/inputs/data/.

💡 Files defined under the same input are downloaded to the same directory. If their names are not unique, they will overwrite each other and only one of them will be available in the execution.
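
For example, you could read every downloaded Parquet file into a single DataFrame; a minimal sketch, assuming pandas and a Parquet engine such as pyarrow are installed in the execution image:

import glob
import pandas as pd

# Files from all three providers are downloaded into the same input directory
parquet_files = sorted(glob.glob('/valohai/inputs/data/*.parquet'))
df = pd.concat((pd.read_parquet(p) for p in parquet_files), ignore_index=True)
print(df.shape)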

Single File with Rename

inputs:
    - name: pretrained
      default: s3://models/bert-base.h5
      filename: model.h5  # Always save as this name

Access at /valohai/inputs/pretrained/model.h5.
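
Because the file is always saved under the same name, your code can rely on a stable path. For example, if the file is a Keras model (a sketch; assumes TensorFlow is available in the execution image):

from tensorflow import keras

# The filename option keeps this path stable even if the source file changes
model = keras.models.load_model('/valohai/inputs/pretrained/model.h5')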

Keep Directory Structure

keep-directories defines which part of the source folder structure Valohai preserves inside the input directory.

  • none: (default) all files are downloaded directly to /valohai/inputs/myinput, without any subdirectories.

  • full: keeps the full path from the storage root. For example, s3://special-bucket/foo/bar/**.jpg could end up as /valohai/inputs/myinput/foo/bar/dataset1/a.jpg.

  • suffix: keeps the path after the “wildcard root”. For example, with s3://special-bucket/foo/bar/*, the special-bucket/foo/bar/ prefix is removed but any relative path after it is kept, so you might end up with /valohai/inputs/myinput/dataset1/a.jpg.

inputs:
    - name: dataset
      default: s3://bucket/project/**/*.json
      keep-directories: suffix  # Preserves folder structure
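
With suffix, the relative paths under the wildcard root are reproduced inside the input directory, so your code can rely on them; a minimal sketch for the dataset input above:

import os

dataset_root = '/valohai/inputs/dataset/'

for root, dirs, files in os.walk(dataset_root):
    for filename in files:
        full_path = os.path.join(root, filename)
        # The relative path mirrors the folder structure after the wildcard root,
        # e.g. 'subfolder/records-0001.json'
        relative_path = os.path.relpath(full_path, dataset_root)
        print(relative_path)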

Override Inputs at Runtime

Default inputs are just starting points. Override any input by its name when creating a run:

# CLI
vh execution run train-model \
  --images=s3://different-bucket/experiment-images/*.png

# Or use the web UI to browse and select different files
# Or pass different URLs via API

Quick Reference

Define inputs in valohai.yaml

inputs:
    - name: mydata
      default: s3://bucket/*.csv
# Access at: /valohai/inputs/mydata/file.csv

Use as local files

Inputs are available under /valohai/inputs/{input-name}/:

import pandas as pd
import glob

# Files are already downloaded to /valohai/inputs/mydata/
csv_files = glob.glob('/valohai/inputs/mydata/*.csv')
df = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)

Dynamic File Selection

Don't hardcode paths in YAML. Pass them at runtime:

vh execution run train \
  --images=s3://bucket/client-xyz/images/*.jpg

# Inputs can also be overridden in the web UI or via the API

Options

  • filename: newname.ext — Rename single input file on download

  • keep-directories: suffix — Preserve folder structure
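
For example, both options used in one step; a sketch where the input names and URLs are illustrative:

inputs:
    - name: pretrained
      default: s3://mybucket/models/bert-base.h5
      filename: model.h5            # rename the single file on download
    - name: dataset
      default: s3://mybucket/dataset/**/*.json
      keep-directories: suffix      # preserve folders under the wildcard root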

