Input and output data

See also: This how-to is part of our Bring your existing projects to Valohai series.

A short introduction to inputs

  • Valohai will download data files from your private cloud storage. Data can come, for example, from AWS S3, Azure Storage, GCP Cloud Storage, or a public source (HTTP/HTTPS).

  • Valohai handles authentication with your cloud storage as well as downloading, uploading, and caching data.

    • This means you don’t need to manage keys, handle authentication, or use tools like boto3, gsutil, or BlobClient in your code. Instead, you can always treat the data as local files.

  • All Valohai machines have a local directory, /valohai/inputs/, where all your datasets are downloaded. Each input will have its own directory, for example /valohai/inputs/images/ and /valohai/inputs/model/.

  • Each step in your valohai.yaml can contain one or more input definitions, and each input can contain one or more files. For example, a batch inference step could have a trained model file and a set of images you want to run the inference on (see the sketch after this list).

  • Each input in valohai.yaml can have a default value. These values can be overridden any time you run a new execution, for example to change the set of images you want to run batch inference on.
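
For instance, here is a minimal sketch of a batch inference step reading one model file and several image files from the inputs directory. The input names model and images, and the local fallback directory, are illustrative only; they exist only once you define them in your valohai.yaml.

import glob
import os

# Location of the Valohai inputs directory (/valohai/inputs/ on Valohai machines)
inputs_dir = os.getenv('VH_INPUTS_DIR', '.inputs')

# Single-file input: the trained model lives in its own directory
model_path = glob.glob(os.path.join(inputs_dir, 'model', '*'))[0]

# Multi-file input: every image downloaded under /valohai/inputs/images/
for image_path in glob.glob(os.path.join(inputs_dir, 'images', '*')):
    print('Running inference on', image_path)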

Read files from /valohai/inputs/

Start by configuring the inputs of your step in valohai.yaml and updating your code to read the data from Valohai inputs rather than directly from your cloud storage.

import csv
import valohai

# Define the inputs available for this step and their default location
# The default location can be overridden when you create a new execution (UI, API, or CLI)
default_inputs = {
    'myinput': 's3://bucket/mydata.csv'
}

# Create a step 'train' in valohai.yaml with a set of inputs
valohai.prepare(step="train", image="tensorflow/tensorflow:2.6.1-gpu", default_inputs=default_inputs)

# Open the CSV file from Valohai inputs
with open(valohai.inputs("myinput").path()) as csv_file:
    reader = csv.reader(csv_file, delimiter=',')

Generate or update your existing YAML file by running:

vh yaml step myfile.py

If you’re not using valohai-utils, read the input path from the VH_INPUTS_DIR environment variable instead:

import os
import pandas as pd

# Get the location of the Valohai inputs directory
VH_INPUTS_DIR = os.getenv('VH_INPUTS_DIR', '.inputs')

# Get the path to your individual inputs file
# e.g. /valohai/inputs/<input-name>/<filename.ext>
path_to_file = os.path.join(VH_INPUTS_DIR, 'myinput/mydata.csv')

df = pd.read_csv(path_to_file)

The same approach works in R:

# Get the location of Valohai inputs directory
vh_inputs_dir <- Sys.getenv("VH_INPUTS_DIR", unset = ".inputs")

# Get the path to your individual inputs file
# e.g. /valohai/inputs/<input-name>/<filename.ext>
path_to_file <- file.path(vh_inputs_dir, "myinput/mydata.csv")

import_df <- read.csv(path_to_file, stringsAsFactors = FALSE)

Create a valohai.yaml configuration file and define your step in it (if you use valohai-utils, the vh yaml step command above generates this for you):

- step:
    name: train
    image: tensorflow/tensorflow:2.6.1-gpu
    command: python myfile.py
    inputs:
      - name: myinput
        default: s3://bucket/mydata.csv

Accessing databases and data warehouses

You can also query data from sources like BigQuery, MongoDB, Redshift, Snowflake, Bigtable, and other databases and data warehouses. These are not accessed through Valohai inputs; instead, run your existing query code on Valohai to fetch data from these sources.

As database contents change over time, we recommend saving the query result as a file in Valohai outputs. This keeps a snapshot of the result, so you can later reproduce your jobs with the exact same data.
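
For example, here is a minimal sketch of snapshotting a query result, assuming your existing code already fetches a pandas DataFrame through SQLAlchemy; the connection string, query, and file name below are placeholders:

import os

import pandas as pd
import sqlalchemy

# Query the database exactly as you do today; Valohai is not involved in the connection
engine = sqlalchemy.create_engine(os.environ['DB_CONNECTION_STRING'])  # placeholder secret
df = pd.read_sql('SELECT * FROM events WHERE event_date = CURRENT_DATE', engine)  # placeholder query

# Save the result to Valohai outputs so this exact snapshot is versioned and reusable
out_dir = os.getenv('VH_OUTPUTS_DIR', '.outputs')
df.to_csv(os.path.join(out_dir, 'query_snapshot.csv'), index=False)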

Save files to /valohai/outputs/

A short introduction to outputs

  • Any file(s) that you want to save, version, track, and access after the execution should be saved as Valohai outputs.

  • Valohai will upload all output files to your private cloud storage and version them.

  • Each output will be available under the execution’s Outputs tab and in the project’s Data tab. From there you can download the file or copy a link to it.

  • When creating another execution, you can pass in the datum:// address of an output file or use a cloud-specific address (e.g. s3://, gs://, azure://); see the sketch at the end of this section.

Update your code to write any files you want to keep into the Valohai outputs directory:

import valohai

# Get a path in the Valohai outputs directory and save the file there
out_path = valohai.outputs().path('mydata.csv')
df.to_csv(out_path)

If you’re not using valohai-utils:

import os

# Get the location of the Valohai outputs directory
VH_OUTPUTS_DIR = os.getenv('VH_OUTPUTS_DIR', '.outputs')

# Define a filepath in the Valohai outputs directory
# e.g. /valohai/outputs/<filename.ext>
out_path = os.path.join(VH_OUTPUTS_DIR, 'mydata.csv')
df.to_csv(out_path)

The same approach works in R:

# Get the location of the Valohai outputs directory
vh_outputs_path <- Sys.getenv("VH_OUTPUTS_DIR", unset = ".outputs")

# Define a filepath in the Valohai outputs directory
# e.g. /valohai/outputs/<filename.ext>
out_path <- file.path(vh_outputs_path, "mydata.csv")
write.csv(output, file = out_path)
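
Once a file has been saved as an output, a later execution can read it back through an input. Here is a minimal sketch using valohai-utils; the step name, input name, and datum ID are placeholders you would replace with values from your own project:

import valohai

# The datum:// address is a placeholder; copy the real one from the execution's
# Outputs tab or the project's Data tab (an s3://, gs://, or azure:// URL also works)
default_inputs = {
    'model': 'datum://01234567-89ab-cdef-0123-456789abcdef'
}

valohai.prepare(step="batch-inference", image="tensorflow/tensorflow:2.6.1-gpu", default_inputs=default_inputs)

model_path = valohai.inputs("model").path()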