Download input data

Note

This tutorial is part of our Valohai fundamentals series.

By defining inputs you can easily download data from a public address or your private cloud storage.

In this section you will learn:

  • How to define Valohai inputs

  • How to change inputs between executions both in the CLI and in the UI

A short introduction to inputs

  • Valohai downloads data files from your private cloud storage (for example AWS S3, Azure Storage, or Google Cloud Storage) or from a public source (HTTP/HTTPS).

  • Valohai handles both authentication with your cloud storage and downloading, uploading, and caching data.

    • This means that you don’t need to manage keys, handle authentication, or use tools like boto3, gsutil, or BlobClient in your code. Instead, you can always treat the data as local files.

  • All Valohai machines have a local directory /valohai/inputs/ where all your datasets are downloaded. Each input has its own subdirectory, for example /valohai/inputs/images/ and /valohai/inputs/model/.

  • Each step in your valohai.yaml can contain one or more input definitions, and each input can contain one or more files. For example, in a batch inference step you could have a trained model file and a set of images you want to run the inference on.

  • Each input in valohai.yaml can have a default value. These values can be overridden any time you run a new execution, for example to change the set of images you want to run batch inference on.
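The path convention above can be sketched in a few lines of Python. Note that the `input_path` helper and the local `./.inputs` fallback are illustrative assumptions for this sketch, not part of the Valohai API:

```python
import os

# On a Valohai machine, each input is downloaded under
# /valohai/inputs/<input-name>/, and VH_INPUTS_DIR points at that root.
# The './.inputs' fallback is only for running the same script locally.
inputs_dir = os.getenv('VH_INPUTS_DIR', './.inputs')

def input_path(input_name, filename):
    # Build the local path of a file that Valohai downloaded for a given input.
    return os.path.join(inputs_dir, input_name, filename)

print(input_path('mnist', 'mnist.npz'))
```

Your code then opens these paths like any other local file, regardless of where the data originally came from.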

Let’s start by defining an input in our valohai.yaml. Uncomment the sample input and update the name and default address.

---

- step:
    name: train-model
    image: tensorflow/tensorflow:2.1.0-py3
    command: python train.py {parameters}
    inputs:
      - name: mnist
        default: s3://onboard-sample/tf-sample/mnist.npz
    parameters:
      - name: epoch
        type: integer
        default: 5

Update train.py to add inputs:

  • Get the path to the Valohai inputs folder from the environment variable VH_INPUTS_DIR

  • Update the mnist_file_path to point to a single file in the Valohai inputs.

  • Remove the line mnist = tf.keras.datasets.mnist

You should also remove the mnist.npz from your local machine.
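That cleanup step might look like the following, assuming the file sits in your current working directory:

```shell
# Delete the locally downloaded dataset so the script can only read it
# from the Valohai inputs directory (-f: no error if the file is absent).
rm -f mnist.npz
```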

import argparse
import os

import numpy
import tensorflow as tf

VH_OUTPUTS_DIR = os.getenv('VH_OUTPUTS_DIR', '.outputs/')
VH_INPUTS_DIR = os.getenv('VH_INPUTS_DIR', '.inputs/')

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--epoch', type=int, default=10)
    return parser.parse_args()

args = parse_args()

mnist_file_path = os.path.join(VH_INPUTS_DIR, 'mnist/mnist.npz')

with numpy.load(mnist_file_path, allow_pickle=True) as f:
    x_train, y_train = f['x_train'], f['y_train']
    x_test, y_test = f['x_test'], f['y_test']

x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10)
])

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=args.epoch)

save_path = os.path.join(VH_OUTPUTS_DIR, 'model.h5')
model.save(save_path)

Run in Valohai

Finally, run a new Valohai execution:

vh exec run train-model --adhoc

Rerun an execution with different input data

  • Open your project on app.valohai.com

  • Open the latest execution

  • Click Copy

  • Scroll down to the Inputs section and remove the current input.

  • You can now either pass in a new URI or select an input from the Data list (for example, if you’ve uploaded a file)

  • Click Create execution

Tip

You can also run a new execution with a different input value from the command line:

vh exec run train-model --adhoc --mnist=https://myurl.com/differentfile.npz

Next: Collect and view metrics