Input and output data

See also: This how-to is part of our Bring your existing projects to Valohai series.

Valohai downloads data from your cloud storage as execution inputs and:

  1. Manages authentication with your cloud storage (AWS S3, Azure Blob Storage, Google Cloud Storage)

  2. Downloads all the input data and caches it on the machine

  3. Keeps track of which datasets were used in which execution

In the same way, Valohai uploads all execution outputs to your cloud storage, versions them, and tracks whether they're being used in other executions.

Read files from /valohai/inputs/

  • All files are downloaded to the Valohai inputs directory on the machine (i.e. /valohai/inputs)

  • Valohai will create a directory for each input inside that path (e.g. /valohai/inputs/myinput)

  • Each input can have one or multiple files; every file provided for an input is downloaded to the corresponding path (e.g. /valohai/inputs/myinput/mydata.csv). See the sketch below for iterating over a multi-file input.
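
If an input holds more than one file, you can iterate over everything that was downloaded for it. A minimal sketch with valohai-utils, assuming an input named myinput and its .paths() accessor:

import valohai

# .paths() yields the local path of each file downloaded for this input
for file_path in valohai.inputs("myinput").paths():
    print(file_path)  # e.g. /valohai/inputs/myinput/mydata.csv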

With valohai-utils:

import csv

import valohai

# Define inputs available for this step and their default location
# The default location can be overridden when you create a new execution (UI, API or CLI)
default_inputs = {
    'myinput': 's3://bucket/mydata.csv'
}

# Create a step 'Train Model' in valohai.yaml with a set of inputs
valohai.prepare(step="Train Model", default_inputs=default_inputs)

# Open the CSV file from Valohai inputs
with open(valohai.inputs("myinput").path()) as csv_file:
    reader = csv.reader(csv_file, delimiter=',')

Generate or update your existing valohai.yaml file by running:

vh yaml step myfile.py

If you're not using valohai-utils, you can read inputs in plain Python through the VH_INPUTS_DIR environment variable:

import os

import pandas as pd

# Get the location of the Valohai inputs directory
VH_INPUTS_DIR = os.getenv('VH_INPUTS_DIR', '.inputs')

# Get the path to your individual input file
# e.g. /valohai/inputs/<input-name>/<filename.ext>
path_to_file = os.path.join(VH_INPUTS_DIR, 'myinput/mydata.csv')

df = pd.read_csv(path_to_file)

The same approach works in R:

# Get the location of Valohai inputs directory
vh_inputs_dir <- Sys.getenv("VH_INPUTS_DIR", unset = ".inputs")

# Get the path to your individual inputs file
# e.g. /valohai/inputs/<input-name>/<filename.ext>
path_to_file <- file.path(vh_inputs_dir, "myinput/mydata.csv")

import_df <- read.csv(path_to_file, stringsAsFactors = FALSE)

Create a valohai.yaml configuration file and define your step in it (this is also what vh yaml step generates for the valohai-utils example above):

- step:
    name: Train Model
    image: tensorflow/tensorflow:1.13.1
    command: python myfile.py
    inputs:
      - name: myinput
        default: s3://bucket/mydata.csv
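
The default input URL can then be overridden for each execution. As a sketch of the CLI form, with valohai-cli a step's inputs appear as options named after the input (check vh execution run --help for the exact option names):

vh execution run "Train Model" --myinput=s3://bucket/otherdata.csv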

Save files to /valohai/outputs/

All files that are saved to /valohai/outputs/ will automatically get versioned, tracked and uploaded to your cloud storage.

With valohai-utils:

import valohai

# Resolve a path under /valohai/outputs/ and save the DataFrame there
out_path = valohai.outputs().path('mydata.csv')
df.to_csv(out_path)

Without valohai-utils, use the VH_OUTPUTS_DIR environment variable:

import os

# Get the location of the Valohai outputs directory
VH_OUTPUTS_DIR = os.getenv('VH_OUTPUTS_DIR', '.outputs')

# Define a filepath in the Valohai outputs directory
# e.g. /valohai/outputs/<filename.ext>
out_path = os.path.join(VH_OUTPUTS_DIR, 'mydata.csv')
df.to_csv(out_path)

And the same in R:

# Get the location of the Valohai outputs directory
vh_outputs_path <- Sys.getenv("VH_OUTPUTS_DIR", unset = ".outputs")

# Define a filepath in the Valohai outputs directory
# e.g. /valohai/outputs/<filename.ext>
out_path <- file.path(vh_outputs_path, "mydata.csv")

# output is the data frame you want to save
write.csv(output, file = out_path)
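
If a step produces several files, you can group them under a subdirectory of /valohai/outputs/. A minimal sketch with valohai-utils, assuming that passing a name to outputs() creates that subdirectory, and using model as a placeholder for an object you trained earlier:

import pickle

import valohai

# Files resolved via outputs("models") land under /valohai/outputs/models/
model_path = valohai.outputs("models").path("model.pkl")

with open(model_path, "wb") as f:
    pickle.dump(model, f)  # model is a placeholder for your trained object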