# Load Files in Jobs

Valohai inputs are data files, or collections of files, accessible during an execution and fetched from your cloud storage or from public sources (HTTP/HTTPS).

Supported sources include AWS S3, Azure Storage, GCP Cloud Storage, Oracle buckets, on-premises S3-compatible stores, and any public link.

Valohai simplifies data handling:

* Manages authentication with your cloud storage.
* Handles downloading, unpacking and caching.
* Eliminates the need to manage keys or authentication in your code.

## Define an input

In your `valohai.yaml`, each step can have one or multiple inputs, each resulting in one or more files.

You can set default values for inputs, which can be overridden every time you create an execution.\
For instance, you can change the set of images for batch inference.

```yaml
- step:
    name: image_processor
    environment: ec2-instance
    image: python:3.10
    command:
      - python process_image.py
    inputs:
      ## Results in one file
      ## e.g. /valohai/inputs/image/<image-name>
      - name: image
        default: datum://01234567-89ab-cdef-0123-456789abcdef

      ## Results in a group of files with .jpg extension
      ## e.g.
      ## /valohai/inputs/image_set/image_1.jpg
      ## /valohai/inputs/image_set/image_2.jpg
      ## ...
      - name: image_set
        default: s3://mybucket/unprocessed/*.jpg

      ## As is, this will result in no files being downloaded
      ## Later on you can override this input so it resolves to one or more files
      - name: additional_data
        optional: true
```

### Select a single file <a href="#id-1-download-location" id="id-1-download-location"></a>

To select a single file as an input, you can use:

* **Object store URL**
  * **Amazon S3**: `s3://{bucket}/{key}`
  * **Azure Blob Storage**: `azure://{account_name}/{container_name}/{blob_name}`
  * **Google Storage**: `gs://{bucket}/{key}`
  * **OpenStack Swift**: `swift://{project}/{container}/{key}`
* **Datum URI**
  * `datum://01234567-89ab-cdef-0123-456789abcdef`
* **Any public HTTP/HTTPS URL**
  * e.g. `https://somewebsite.com/some_image`
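
Any of these forms can be used as an input `default` in `valohai.yaml`. A sketch with hypothetical bucket, account, and file names:

```yaml
- step:
    name: single_file_examples
    image: python:3.10
    command:
      - python train.py
    inputs:
      - name: model_weights
        default: s3://my-bucket/models/weights.h5           # Amazon S3
      - name: labels
        default: azure://myaccount/mycontainer/labels.csv   # Azure Blob Storage
      - name: config
        default: gs://my-bucket/configs/config.json         # Google Storage
      - name: readme
        default: https://example.com/readme.txt             # public HTTPS URL
```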

### Select collection of files

#### Wildcards with object store URLs <a href="#id-4-wildcards" id="id-4-wildcards"></a>

* `s3://my-bucket/dataset/images/*.jpg` for all `.jpg` (JPEG) files
* `s3://my-bucket/dataset/image-sets/**.jpg` to recurse into subdirectories and match all `.jpg` (JPEG) files

  For example, the expression above will match:

  * `s3://my-bucket/dataset/image-sets/cats/big/cat-1.jpg`
  * `s3://my-bucket/dataset/image-sets/cats/small/cat-9.jpg`
  * `s3://my-bucket/dataset/image-sets/dog/dog-5.jpg`
  * `...`

> :bulb: **Parameter interpolation**
>
> You can also interpolate [execution parameters](/migration-strategy/migrate-job-parameters.md) into input URIs:
>
> `s3://my-bucket/dataset/images/{parameter:user-id}/*.jpeg` would replace `{parameter:user-id}` with the value of the parameter `user-id` during an execution.
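
For instance, a step like the following (the parameter name, bucket, and default value are illustrative) would resolve the input URL per execution:

```yaml
- step:
    name: fetch_user_images
    image: python:3.10
    command:
      - python process.py
    parameters:
      # The parameter value replaces {parameter:user-id} in the input URL
      - name: user-id
        type: string
        default: "42"
    inputs:
      - name: images
        default: s3://my-bucket/dataset/images/{parameter:user-id}/*.jpeg
```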

#### Datasets <a href="#id-5-parameters-in-an-url" id="id-5-parameters-in-an-url"></a>

Since [datasets](/data/datasets.md) are actually just pointers to collections of files, using a dataset version URL as an input will result in every file included in that version being downloaded.

An example of dataset version URL: `dataset://boats/semi-processed`

> :bulb: **Latest version**
>
> You can always use `latest` as a version, for any dataset, to point to the latest dataset version.
>
> e.g: `dataset://boats/latest`

#### Datum queries

Using the UI, you can select the files to be downloaded by running a simple query on datum [properties](/data/data-versioning/metadata-overview/custom-properties.md).

<figure><img src="/files/IWvAiPdDwTHuI1FqqeeP" alt=""><figcaption></figcaption></figure>

The query defined in the example above will match all datums that:

* have an `index` property greater than 10, **and**
* have a `processed_date` property assigned (regardless of its value), **and**
* have a `type` property equal to "dog".

The optional `Limit` field determines the maximum number of files the query will return.

**Supported operators**:

* `>`, `>=`, `<`, `<=`: Intended for use with numerals.
* `==`, `!=`: Applicable to all data types.
* `contains`: Used for strings.
* `exists`: Checks for property existence, e.g., `property_name exists true` or `property_name exists false`.

Numerical values (e.g., 2, 2.1, 1.332) are treated as **numbers**.\
Values: `true`, `false`, `True`, and `False` are treated as **booleans.**\
Everything else is treated as a **string**.
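
These typing rules can be illustrated with a small sketch (a rough illustration of the rules above, not Valohai's actual query parser):

```python
def classify_query_value(raw: str):
    """Classify a raw query value the way the rules above describe."""
    # The four boolean literals are treated as booleans
    if raw in ("true", "false", "True", "False"):
        return ("boolean", raw.lower() == "true")
    # Anything parseable as a number is treated as a number
    try:
        return ("number", float(raw))
    except ValueError:
        # Everything else falls back to a string
        return ("string", raw)

print(classify_query_value("2.1"))    # a number
print(classify_query_value("False"))  # a boolean
print(classify_query_value("dog"))    # a string
```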

> :exclamation: If a property value of a datum is set to the string `"False"` (or `"True"`, a boolean passed as a string), e.g. in `valohai.metadata.jsonl`:
>
> ```
> {..., "metadata": { "property_one": "False", ... }}
> ```
>
> such a datum will not be matched by the query `property_one == False`: the value passed in the query is treated as a **boolean**, while the one actually attached to the datum is of the **string** type.

Click `Add metadata filter block` to add multiple metadata query blocks. The final results will be the union of each block's results.

> :warning: **Non-reproducible**
>
> Be aware that the results of these queries are recalculated each time you create an execution.\
> If new files with matching properties are added between executions, the query results **will not be the same**: the more recent execution will include the newly added files as well!

#### Directory structure

Without any additional configuration, the examples above will result in all files being downloaded into a single directory under `/valohai/inputs/<input-name>/`. Of the original storage path (e.g. `s3://bucket-name/dir/subdir/file.jpg`), only the file name (basename) will be kept (i.e. `file.jpg`).

To preserve the directory structure found in the store, use the `keep-directories` input property.

**keep-directories** can take the following values:

* **none** (default): all files are downloaded to `/valohai/inputs/<input-name>`; only the file name is kept from the actual storage path
* **full**
  * When selecting files using a datum URL or dataset version URL, the entire directory structure shown under the `Data` tab for each datum will be maintained. This structure is derived either from subdirectories in `/valohai/outputs/` at upload time (for [execution output files](/data/data-versioning/save-files-from-jobs.md)) or from the `Upload path` if the file was [manually uploaded](/data/data-versioning/upload-files-via-web-ui.md).
  * When files are selected using a store URL or URL wildcards, the full path from the storage root is kept and recreated in `/valohai/inputs/<input-name>`. \
    For example, `s3://special-bucket/foo/bar/**.jpg` could end up as \
    `/valohai/inputs/<input-name>/foo/bar/dataset1/a.jpg`
* **suffix**
  * When selecting files using a datum URL or dataset version URL, the behavior is the same as with the **full** option.
  * When files are selected using a store URL, only the filename is preserved. \
    For example, `s3://valohai-data/data/output-1389/subdir/file.txt` will result in \
    `/valohai/inputs/<input-name>/file.txt`
  * When files are selected using URL wildcards, the directory structure after the wildcard character is preserved. \
    For example, with `s3://special-bucket/foo/bar/*`, the `special-bucket/foo/bar/` prefix is removed, but any relative path after it is kept, so you might end up with \
    `/valohai/inputs/<input-name>/dataset1/a.jpg`

Example:

```yaml
- step:
    name: image_processor
    environment: ec2-instance
    image: python:3.10
    command:
      - python process_image.py
    inputs:
      - name: image_set
        keep-directories: suffix
        default: s3://mybucket/unprocessed/*
```
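
For store-URL and wildcard selections, the three modes can be sketched as a path-mapping function (an illustration of the rules above, not Valohai's implementation; datum and dataset behavior is not covered here):

```python
import posixpath

def local_input_path(input_name: str, key: str, mode: str = "none",
                     wildcard_prefix: str = "") -> str:
    """Map an object-store key to its download path under /valohai/inputs.

    key:             object key inside the bucket, e.g. "foo/bar/dataset1/a.jpg"
    wildcard_prefix: the fixed part of the key before the wildcard,
                     e.g. "foo/bar/" (used only by the "suffix" mode)
    """
    base = f"/valohai/inputs/{input_name}"
    if mode == "none":
        # keep only the file name (basename)
        return posixpath.join(base, posixpath.basename(key))
    if mode == "full":
        # keep the full path from the storage root
        return posixpath.join(base, key)
    if mode == "suffix":
        # keep only the relative path after the wildcard prefix
        return posixpath.join(base, key[len(wildcard_prefix):])
    raise ValueError(f"unknown keep-directories mode: {mode}")

key = "foo/bar/dataset1/a.jpg"
print(local_input_path("image_set", key, "none"))
# /valohai/inputs/image_set/a.jpg
print(local_input_path("image_set", key, "full"))
# /valohai/inputs/image_set/foo/bar/dataset1/a.jpg
print(local_input_path("image_set", key, "suffix", wildcard_prefix="foo/bar/"))
# /valohai/inputs/image_set/dataset1/a.jpg
```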

## Access files <a href="#id-1-download-location" id="id-1-download-location"></a>

### Download location

All requested data is downloaded to the local directory `/valohai/inputs` on the machine running the execution; whether that is a cloud instance or a physical on-premises machine, the behavior is the same.

Each input has its own directory within this location. From the [example above](#define-an-input), three directories will be created:

* `/valohai/inputs/image/`
* `/valohai/inputs/image_set/`
* `/valohai/inputs/additional_data/` \
  :bulb: Created only if a value is provided when the execution is created; otherwise the directory is not created, as the input is marked `optional`

### Using valohai-utils

The `valohai-utils` library provides a convenient way of iterating over the downloaded files of an input just by specifying its name, e.g.:

```python
import valohai

input_name = "training_data"

for file_path in valohai.inputs(input_name).paths():
    with open(file_path, "r") as f:
        content = f.read()
        ## Do some processing
```

### Using system environment variables

To avoid additional libraries, the `VH_INPUTS_DIR` environment variable is provided in every execution environment, pointing to `/valohai/inputs`. You can use this path as follows:

```python
import os

input_name = "training_data"
inputs_dir = os.getenv("VH_INPUTS_DIR", default="./inputs")

inputs_path = os.path.join(inputs_dir, input_name)
for file_name in os.listdir(inputs_path):
    file_path = os.path.join(inputs_path, file_name)
    with open(file_path, "r") as f:
        content = f.read()
        ## Do some processing
```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.valohai.com/data/data-versioning/load-files-in-jobs.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
