Download input data

Note

This tutorial is a part of our Valohai Fundamentals series.

By defining inputs you can easily download data from a public address or your private cloud storage.

In this section you will learn:

  • How to define Valohai inputs

  • How to change inputs between executions both in the CLI and in the UI

A short introduction to inputs

  • Valohai inputs can be from a public location (HTTP/HTTPS) or from your private cloud storage (AWS S3, Azure Storage, GCP Cloud Storage, OpenStack Swift)

  • The input values you define in your code are default values. You can replace any defined input file(s) when creating an execution from the UI, command line, or API.

  • All inputs are downloaded and available during an execution at /valohai/inputs/<input-name>/

  • An input value can be a single file (e.g. myimages.tar.gz) or you can use a wildcard to download multiple files from a private cloud storage (e.g. s3://mybucket/images/*.jpeg)

  • You can interpolate parameter values into input URIs with the syntax s3://mybucket/images/{parameter:myparam}/**.jpeg. This is of particular use in tasks, where you can now easily run your execution on multiple variable datasets.

  • Valohai inputs are cached on the virtual machine.
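
The conventions above can be sketched in Python. This is a hedged illustration, not part of the tutorial's train.py: it assumes an input named mnist (defined later in this tutorial) and a hypothetical wildcard input named images.

```python
import os

# On a Valohai worker, VH_INPUTS_DIR points to /valohai/inputs;
# fall back to a local directory when running outside Valohai.
inputs_dir = os.getenv('VH_INPUTS_DIR', './inputs')

# A single-file input named "mnist" appears as <inputs-dir>/mnist/<filename>.
mnist_path = os.path.join(inputs_dir, 'mnist', 'mnist.npz')
print(mnist_path)

# A wildcard default (e.g. s3://mybucket/images/*.jpeg) downloads every
# matched file into the same input directory, so you can iterate over them.
images_dir = os.path.join(inputs_dir, 'images')
if os.path.isdir(images_dir):
    for filename in sorted(os.listdir(images_dir)):
        print(os.path.join(images_dir, filename))
```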

Let’s start by defining an input in our valohai.yaml. Uncomment the sample input and update the name and default address.

---

- step:
    name: train-model
    image: tensorflow/tensorflow:2.1.0-py3
    command: python train.py {parameters}
    inputs:
      - name: mnist
        default: s3://onboard-sample/tf-sample/mnist.npz
    parameters:
      - name: epoch
        type: integer
        default: 5

Update train.py to add inputs:

  • Get the path to the Valohai inputs folder from the environment variable VH_INPUTS_DIR

  • Update the mnist_file_path to point to a single file in the Valohai inputs.

  • Remove the line mnist = tf.keras.datasets.mnist

You should also remove the mnist.npz from your local machine.

import os

import tensorflow as tf
import numpy
import argparse

VH_OUTPUTS_DIR = os.getenv('VH_OUTPUTS_DIR', '.outputs/')
VH_INPUTS_DIR = os.getenv('VH_INPUTS_DIR', '.inputs/')

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--epoch', type=int, default=10)
    return parser.parse_args()

args = parse_args()

mnist_file_path = os.path.join(VH_INPUTS_DIR, 'mnist/mnist.npz')

with numpy.load(mnist_file_path, allow_pickle=True) as f:
    x_train, y_train = f['x_train'], f['y_train']
    x_test, y_test = f['x_test'], f['y_test']

x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10)
])

# Predictions for the first training example, as raw logits
predictions = model(x_train[:1]).numpy()

# Convert the logits to probabilities
tf.nn.softmax(predictions).numpy()

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

loss_fn(y_train[:1], predictions).numpy()

model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=args.epoch)

save_path = os.path.join(VH_OUTPUTS_DIR, 'model.h5')
model.save(save_path)

Run in Valohai

Finally, run a new Valohai execution:

vh exec run train-model --adhoc

Rerun an execution with different input data

  • Open your project on app.valohai.com

  • Open the latest execution

  • Click Copy

  • Scroll down to the Inputs section and remove the current input.

  • You can now either pass in a new URI or select an input from the Data list (for example, if you’ve uploaded a file)

  • Click Create execution

Tip

You can also run a new execution with a different input value from the command line:

vh exec run train-model --adhoc --mnist=https://myurl.com/differentfile.npz