Import Existing Cloud Files

Import files that already exist in your cloud storage into Valohai's data catalog. This creates datum:// links for external files without moving or re-uploading data.


When to Use This

Import existing cloud files when you need to:

  • Track legacy data — Bring pre-existing datasets into Valohai's lineage system

  • Use external datasets — Access data uploaded by other teams or processes

  • Avoid re-uploading — Create datum links for large files already in your cloud storage

  • Migrate to Valohai — Import historical data when adopting Valohai for existing projects

  • Share team data — Make files uploaded directly to cloud storage discoverable in Valohai


What This Does

"Adopting" files creates Valohai datum records pointing to your existing cloud storage files.

💡 Important: Files remain in your cloud storage. Nothing is copied or moved. Valohai creates tracking metadata so you can use these files like any other Valohai data.

After importing, you can:

  • Use files as inputs in executions (see the valohai.yaml sketch after this list)

  • View and search files in the Valohai UI

  • Add tags, aliases, and properties

  • Track lineage and usage

  • Create datasets from imported files
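
For example, once a file is adopted you can reference its datum ID as an execution input in valohai.yaml. A minimal sketch: the step name, image, command, and input name are illustrative placeholders, and the datum ID shown is the one from the success-response example later on this page:

- step:
    name: train-model
    image: python:3.10
    command: python train.py
    inputs:
      - name: training-data
        default: datum://017a515f-30a4-d0f1-d37a-53ffc38e90c7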


Requirements

Before importing files:

Files Must Be Accessible

The files must exist in a data store that's already configured in your Valohai project. Valohai uses the store's credentials to verify file existence.

Learn more: Configure Data Stores

You Need Project Permissions

You must have permission to add data to the project.

Files Must Use Correct URL Format

Use your cloud provider's native URL format:

  • AWS S3: s3://bucket-name/path/to/file.ext

  • Google Cloud Storage: gs://bucket-name/path/to/file.ext

  • Azure Blob Storage: azure://account/container/path/to/file.ext

  • OpenStack Swift: swift://project/container/path/to/file.ext
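
If you assemble URL lists programmatically, a quick client-side check of the scheme can catch malformed entries before you submit them. A minimal sketch:

VALID_PREFIXES = ("s3://", "gs://", "azure://", "swift://")

def invalid_urls(urls):
    # Return entries that don't use a supported cloud-native scheme
    return [u for u in urls if not u.startswith(VALID_PREFIXES)]

print(invalid_urls(["s3://my-bucket/data.csv", "https://example.com/data.csv"]))
# ['https://example.com/data.csv']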


Import via Web UI

  1. Open your project

  2. Navigate to the Data tab

  3. Click the Adopt tab

  4. Select the Destination store from the dropdown menu

  5. Enter file URLs to import (one per line)

  6. Click Adopt selected files

Valohai verifies each file exists and creates datum records.

⚠️ Performance warning: Importing many files (50+) via the web UI can be slow and may be interrupted by network issues or browser timeouts. For bulk imports, use the API with retry logic instead.

Example input:

s3://my-bucket/datasets/training-data-v1.csv
s3://my-bucket/datasets/training-data-v2.csv
s3://my-bucket/models/baseline-model.pkl

Import via API

Use the Valohai API for bulk imports or automated workflows.

Prerequisites

  1. Get your API token from Valohai account settings

  2. Get your project ID from Project → Settings in the Valohai UI

  3. Get your data store ID:

import os
import requests

response = requests.get(
    'https://app.valohai.com/api/v0/stores/',
    headers={'Authorization': 'Token ' + os.getenv('VH_TOKEN')},
    timeout=30
)
response.raise_for_status()

# The listing is paginated; each entry in 'results' includes the store's ID and name
for store in response.json()['results']:
    print(store['id'], store['name'])

Import Single File

import os
import requests

store_id = "your-store-id-here"
project_id = "your-project-id-here"

payload = {
    "url": "s3://my-bucket/datasets/training-data.csv",
    "root_path": "s3://my-bucket/",
    "project": project_id
}

response = requests.post(
    f'https://app.valohai.com/api/v0/stores/{store_id}/adopt/',
    json=payload,
    headers={
        'Authorization': 'Token ' + os.getenv('VH_TOKEN'),
        'Content-Type': 'application/json'
    },
    timeout=30
)

# Handle response
if response.ok:
    result = response.json()
    if result.get('ok'):
        # 'created' maps each adopted URL to its new datum ID
        for file_url, datum_id in result['created'].items():
            print(f"Success! {file_url} -> datum://{datum_id}")
    else:
        print(f"Error: {result.get('message')} (code: {result.get('code')})")
else:
    print(f"HTTP Error: {response.status_code}")
    print(response.text)

Import Multiple Files with Error Handling

import os
import time

import requests

token = os.getenv('VH_TOKEN')

store_id = "your-store-id-here"
project_id = "your-project-id-here"
files_to_import = [
    "s3://my-bucket/datasets/train.csv",
    "s3://my-bucket/datasets/validation.csv",
    "s3://my-bucket/models/baseline.pkl"
]
root_path = "s3://my-bucket/"

def adopt_files(urls, root_path, store_id, project_id, token: str, retries: int = 3):
    headers = {"Authorization": f"Token {token}"}

    payload = {
        "urls": urls,
        "root_path": root_path,
        "project": project_id
    }

    api_url = f"https://app.valohai.com/api/v0/stores/{store_id}/adopt/"

    for attempt in range(retries):
        try:
            res = requests.post(api_url, json=payload, headers=headers, timeout=30)

            result = res.json()
            if res.ok and result.get("ok"):
                # 'created' maps each adopted URL to its new datum ID
                return {"success": True, "datums": list(result["created"].values())}
            # The API answered but rejected the request; retrying won't help
            return {
                "success": False,
                "error": result.get("message", res.text),
                "code": result.get("code")
            }
        except requests.exceptions.RequestException as e:
            if attempt < retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s...
                print(f"Retry {attempt + 1}/{retries} after {wait_time}s...")
                time.sleep(wait_time)
            else:
                return {"success": False, "error": f"Request failed: {e}", "code": -1}

adopt_result = adopt_files(
    urls=files_to_import,
    root_path=root_path,
    store_id=store_id,
    project_id=project_id,
    token=token
)

if adopt_result['success']:
    print(f"  ✓ Success: {adopt_result['datums']}")
else:
    print(f"  ✗ Error: {adopt_result['error']}")
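
For very large imports, you can wrap adopt_files in a simple batching loop so one failed request doesn't abort the whole run. A minimal sketch; the batch size of 50 is an arbitrary choice, not an API limit:

def adopt_in_batches(urls, batch_size=50, **adopt_kwargs):
    # Adopt URLs in fixed-size batches and collect each batch's result
    results = []
    for i in range(0, len(urls), batch_size):
        results.append(adopt_files(urls=urls[i:i + batch_size], **adopt_kwargs))
    return results

batch_results = adopt_in_batches(
    files_to_import,
    root_path=root_path,
    store_id=store_id,
    project_id=project_id,
    token=token
)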

The root_path Parameter

Although root_path resembles the Upload path used for manual file uploads, it works in the opposite direction: it is an optional common prefix that is stripped from every URL to produce the path under which each file is imported.

Example:

urls = [
    "s3://some-bucket/dir_1/dir_2/file_1.txt",
    "s3://some-bucket/dir_1/dir_2/file_2.txt",
    "s3://some-bucket/dir_1/dir_3/file_2.txt"
]

┌─ root_path = s3://some-bucket
│  └─→ dir_1/dir_2/file_1.txt
│  └─→ dir_1/dir_2/file_2.txt
│  └─→ dir_1/dir_3/file_2.txt

┌─ root_path = s3://some-bucket/dir_1
│  └─→ dir_2/file_1.txt
│  └─→ dir_2/file_2.txt
│  └─→ dir_3/file_2.txt

┌─ root_path = s3://some-bucket/dir_1/dir_2
│  └─→ Error: {'message': 'All adopted URLs must start with the given root path', 'code': 'invalid_root_path'}
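
The behavior above amounts to stripping root_path from the front of each URL. This small helper mirrors that logic locally; it is an illustration, not the server's implementation:

def import_path(url: str, root_path: str) -> str:
    # Mirror the server-side root_path stripping for a single URL
    prefix = root_path.rstrip("/") + "/"
    if not url.startswith(prefix):
        raise ValueError("All adopted URLs must start with the given root path")
    return url[len(prefix):]

print(import_path("s3://some-bucket/dir_1/dir_2/file_1.txt", "s3://some-bucket/dir_1"))
# dir_2/file_1.txt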


API Responses

Success

File imported successfully:

{
    "ok": true,
    "created": {
        "s3://my-bucket/my-file.txt": "017a515f-30a4-d0f1-d37a-53ffc38e90c7"
    }
}

What to do: Save the datum ID (017a515f-...) to use as datum://017a515f-... in your pipelines.
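
If you adopt files from a script, it can be handy to persist this URL-to-datum mapping for later use. A minimal sketch; adopted_datums.json is an arbitrary filename:

import json

# 'result' is the parsed JSON from a successful adopt call (see above)
with open("adopted_datums.json", "w") as f:
    json.dump(result["created"], f, indent=2)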


Already Exists

File was previously imported:

{
    "message": "s3://my-bucket/my-file.txt already exists in my-bucket",
    "code": "adopt_already_exists"
}

What to do:

  • The file is already tracked in Valohai

  • Find it in the Data → Browse tab

  • No action needed unless you want to update metadata


Not Found

File doesn't exist in cloud storage:

{
    "message": "Not found in my-bucket: 's3://my-bucket/my-file.txt'",
    "code": "adoptable_file_not_found"
}

What to do:

  • Verify the file URL is correct (check spelling, path, bucket name)

  • Ensure the file exists in your cloud storage

  • Check that Valohai has access to the bucket/container

  • Verify you selected the correct data store


Common Issues & Fixes

File Not Found Error

Symptom: adoptable_file_not_found error during import

Causes & Fixes:

  • Typo in file URL → Double-check bucket name, path, and filename (case-sensitive)

  • File doesn't exist → Verify file exists in your cloud storage console

  • Wrong data store selected → Ensure you selected the correct destination store

  • Wrong cloud region → Check that the data store is configured for the correct region

  • File in different bucket → Verify the bucket name matches your data store configuration


Permission Denied

Symptom: Import fails with access or permission error

Causes & Fixes:

  • Data store credentials invalid → Verify data store configuration and credentials

  • Bucket policy blocks access → Check cloud storage IAM/permissions allow Valohai to read files

  • File is private/encrypted → Ensure Valohai's service account has read access

  • Cross-region access issues → Verify data store configuration matches file location


Already Exists

Symptom: File shows as already existing in Valohai

Causes & Fixes:

  • File was previously imported → Find it in Data → Browse tab (not an error)

  • Trying to import duplicate → Use the existing datum ID instead of re-importing

To find existing datum:

  1. Go to Data → Browse

  2. Search by filename

  3. Copy the datum ID
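
You may also be able to locate existing datums from a script. The sketch below assumes a paginated /api/v0/data/ listing endpoint with a project filter; verify the exact endpoint and parameters against the Valohai API reference:

import os
import requests

# Assumed endpoint and filter; confirm against the API reference
response = requests.get(
    'https://app.valohai.com/api/v0/data/',
    params={'project': 'your-project-id-here'},
    headers={'Authorization': 'Token ' + os.getenv('VH_TOKEN')},
    timeout=30
)
response.raise_for_status()
for datum in response.json().get('results', []):
    print(datum.get('id'), datum.get('name'))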


Bulk Import Interrupted

Symptom: Web UI import stops partway through large file list

Causes & Fixes:

  • Browser timeout → Use API with retry logic for bulk imports (>50 files)

  • Network interruption → Import in smaller batches via UI

  • Too many files → Use the API script with error handling (see above)


Best Practices

Organize Before Importing

Plan your import strategy:

  • Group related files

  • Use consistent naming

  • Document file sources

  • Tag immediately after import

Use API for Bulk Operations

  • Web UI: Good for <50 files

  • API: Recommended for 50+ files

Add Metadata Immediately

After importing, add context:

metadata = {
    "valohai.tags": ["imported-2024-01", "legacy-data"],
    "valohai.alias": "training-data-v1",
    "source": "s3-legacy-bucket",
    "import_date": "2024-01-15",
    "original_owner": "data-team"
}

Create Aliases for Key Files

Make frequently-used imports easy to reference:

legacy-training-data → datum://abc123...
baseline-model → datum://def456...
validation-set-fixed → datum://ghi789...

Verify After Import

Check that files are accessible:

  1. Find imported file in Data → Browse

  2. Create a test execution using the file as input

  3. Verify file downloads and opens correctly


