Import Existing Cloud Files

Import files that already exist in your cloud storage into Valohai's data catalog. This creates datum:// links for external files without moving or re-uploading data.


When to Use This

Import existing cloud files when you need to:

  • Track legacy data — Bring pre-existing datasets into Valohai's lineage system

  • Use external datasets — Access data uploaded by other teams or processes

  • Avoid re-uploading — Create datum links for large files already in your cloud storage

  • Migrate to Valohai — Import historical data when adopting Valohai for existing projects

  • Share team data — Make files uploaded directly to cloud storage discoverable in Valohai


What This Does

"Adopting" files creates Valohai datum records pointing to your existing cloud storage files.

💡 Important: Files remain in your cloud storage. Nothing is copied or moved. Valohai creates tracking metadata so you can use these files like any other Valohai data.

After importing, you can:

  • Use files as inputs in executions (see the valohai.yaml sketch after this list)

  • View and search files in the Valohai UI

  • Add tags, aliases, and properties

  • Track lineage and usage

  • Create datasets from imported files
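
For example, once a file is adopted you can reference its datum ID as an execution input in valohai.yaml. A minimal sketch: the step name, image, command, and input name are illustrative placeholders, and the datum ID shown is the one from the success-response example later on this page:

- step:
    name: train-model
    image: python:3.10
    command: python train.py
    inputs:
      - name: training-data
        default: datum://017a515f-30a4-d0f1-d37a-53ffc38e90c7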


Requirements

Before importing files:

Files Must Be Accessible

The files must exist in a data store that's already configured in your Valohai project. Valohai uses the store's credentials to verify file existence.

Learn more: Configure Data Stores

You Need Project Permissions

You must have permission to add data to the project.

Files Must Use Correct URL Format

Use your cloud provider's native URL format:

  • AWS S3: s3://bucket-name/path/to/file.ext

  • Google Cloud Storage: gs://bucket-name/path/to/file.ext

  • Azure Blob Storage: azure://account/container/path/to/file.ext

  • OpenStack Swift: swift://project/container/path/to/file.ext
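
If you assemble URL lists programmatically, a quick client-side check of the scheme can catch malformed entries before you submit them. A minimal sketch:

VALID_PREFIXES = ("s3://", "gs://", "azure://", "swift://")

def invalid_urls(urls):
    # Return entries that don't use a supported cloud-native scheme
    return [u for u in urls if not u.startswith(VALID_PREFIXES)]

print(invalid_urls(["s3://my-bucket/data.csv", "https://example.com/data.csv"]))
# ['https://example.com/data.csv']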


Import via Web UI

  1. Open your project

  2. Navigate to the Data tab

  3. Click the Adopt tab

  4. Select the Destination store from the dropdown menu

  5. Enter file URLs to import (one per line)

  6. Click Adopt selected files

Valohai verifies each file exists and creates datum records.

⚠️ Performance warning: Importing many files (50+) via the web UI can be slow and may be interrupted by network issues or browser timeouts. For bulk imports, use the API with retry logic instead.

Example input:

s3://my-bucket/datasets/training-data-v1.csv
s3://my-bucket/datasets/training-data-v2.csv
s3://my-bucket/models/baseline-model.pkl

Import via API

Use the Valohai API for bulk imports or automated workflows.

Prerequisites

  1. Get your API token from Valohai account settings

  2. Get your project ID from Project → Settings in the Valohai UI

  3. Get your data store ID:

import os
import requests

response = requests.get(
    'https://app.valohai.com/api/v0/stores/',
    headers={'Authorization': 'Token ' + os.getenv('VH_TOKEN')},
    timeout=30
)
response.raise_for_status()

# The listing is paginated; each entry in 'results' includes the store's ID and name
for store in response.json()['results']:
    print(store['id'], store['name'])

Import Single File

import os
import requests

store_id = "your-store-id-here"
project_id = "your-project-id-here"

payload = {
    "url": "s3://my-bucket/datasets/training-data.csv",
    "root_path": "s3://my-bucket/",
    "project": project_id
}

response = requests.post(
    f'https://app.valohai.com/api/v0/stores/{store_id}/adopt/',
    json=payload,
    headers={
        'Authorization': 'Token ' + os.getenv('VH_TOKEN'),
        'Content-Type': 'application/json'
    },
    timeout=30
)

# Handle response
if response.ok:
    result = response.json()
    if result.get('ok'):
        # 'created' maps each adopted URL to its new datum ID
        for file_url, datum_id in result['created'].items():
            print(f"Success! {file_url} -> datum://{datum_id}")
    else:
        print(f"Error: {result.get('message')} (code: {result.get('code')})")
else:
    print(f"HTTP Error: {response.status_code}")
    print(response.text)

Import Multiple Files with Error Handling

import os
import time

import requests

token = os.getenv('VH_TOKEN')

store_id = "your-store-id-here"
project_id = "your-project-id-here"
files_to_import = [
    "s3://my-bucket/datasets/train.csv",
    "s3://my-bucket/datasets/validation.csv",
    "s3://my-bucket/models/baseline.pkl"
]
root_path = "s3://my-bucket/"

def adopt_files(urls, root_path, store_id, project_id, token: str, retries: int = 3):
    headers = {"Authorization": f"Token {token}"}

    payload = {
        "urls": urls,
        "root_path": root_path,
        "project": project_id
    }

    api_url = f"https://app.valohai.com/api/v0/stores/{store_id}/adopt/"

    for attempt in range(retries):
        try:
            res = requests.post(api_url, json=payload, headers=headers, timeout=30)

            result = res.json()
            if res.ok and result.get("ok"):
                # 'created' maps each adopted URL to its new datum ID
                return {"success": True, "datums": list(result["created"].values())}
            # The API answered but rejected the request; retrying won't help
            return {
                "success": False,
                "error": result.get("message", res.text),
                "code": result.get("code")
            }
        except requests.exceptions.RequestException as e:
            if attempt < retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s...
                print(f"Retry {attempt + 1}/{retries} after {wait_time}s...")
                time.sleep(wait_time)
            else:
                return {"success": False, "error": f"Request failed: {e}", "code": -1}

adopt_result = adopt_files(
    urls=files_to_import,
    root_path=root_path,
    store_id=store_id,
    project_id=project_id,
    token=token
)

if adopt_result['success']:
    print(f"  ✓ Success: {adopt_result['datums']}")
else:
    print(f"  ✗ Error: {adopt_result['error']}")
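
For very large imports, you can wrap adopt_files in a simple batching loop so one failed request doesn't abort the whole run. A minimal sketch; the batch size of 50 is an arbitrary choice, not an API limit:

def adopt_in_batches(urls, batch_size=50, **adopt_kwargs):
    # Adopt URLs in fixed-size batches and collect each batch's result
    results = []
    for i in range(0, len(urls), batch_size):
        results.append(adopt_files(urls=urls[i:i + batch_size], **adopt_kwargs))
    return results

batch_results = adopt_in_batches(
    files_to_import,
    root_path=root_path,
    store_id=store_id,
    project_id=project_id,
    token=token
)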

The root_path Parameter

Although root_path resembles the Upload path used for manual file uploads, it works in the opposite direction: it is an optional common prefix that is stripped from every URL to produce the path under which each file is imported.

Example:

urls = [
    "s3://some-bucket/dir_1/dir_2/file_1.txt",
    "s3://some-bucket/dir_1/dir_2/file_2.txt",
    "s3://some-bucket/dir_1/dir_3/file_2.txt"
]

┌─ root_path = s3://some-bucket
│  └─→ dir_1/dir_2/file_1.txt
│  └─→ dir_1/dir_2/file_2.txt
│  └─→ dir_1/dir_3/file_2.txt

┌─ root_path = s3://some-bucket/dir_1
│  └─→ dir_2/file_1.txt
│  └─→ dir_2/file_2.txt
│  └─→ dir_3/file_2.txt

┌─ root_path = s3://some-bucket/dir_1/dir_2
│  └─→ Error: {'message': 'All adopted URLs must start with the given root path', 'code': 'invalid_root_path'}
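
The behavior above amounts to stripping root_path from the front of each URL. This small helper mirrors that logic locally; it is an illustration, not the server's implementation:

def import_path(url: str, root_path: str) -> str:
    # Mirror the server-side root_path stripping for a single URL
    prefix = root_path.rstrip("/") + "/"
    if not url.startswith(prefix):
        raise ValueError("All adopted URLs must start with the given root path")
    return url[len(prefix):]

print(import_path("s3://some-bucket/dir_1/dir_2/file_1.txt", "s3://some-bucket/dir_1"))
# dir_2/file_1.txt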


API Responses

Success

File imported successfully:

{
    "ok": true,
    "created": {
        "s3://my-bucket/my-file.txt": "017a515f-30a4-d0f1-d37a-53ffc38e90c7"
    }
}

What to do: Save the datum ID (017a515f-...) to use as datum://017a515f-... in your pipelines.
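
If you adopt files from a script, it can be handy to persist this URL-to-datum mapping for later use. A minimal sketch; adopted_datums.json is an arbitrary filename:

import json

# 'result' is the parsed JSON from a successful adopt call (see above)
with open("adopted_datums.json", "w") as f:
    json.dump(result["created"], f, indent=2)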


Already Exists

File was previously imported:

{
    "message": "s3://my-bucket/my-file.txt already exists in my-bucket",
    "code": "adopt_already_exists"
}

What to do:

  • The file is already tracked in Valohai

  • Find it in the Data → Browse tab

  • No action needed unless you want to update metadata


Not Found

File doesn't exist in cloud storage:

{
    "message": "Not found in my-bucket: 's3://my-bucket/my-file.txt'",
    "code": "adoptable_file_not_found"
}

What to do:

  • Verify the file URL is correct (check spelling, path, bucket name)

  • Ensure the file exists in your cloud storage

  • Check that Valohai has access to the bucket/container

  • Verify you selected the correct data store


Common Issues & Fixes

File Not Found Error

Symptom: adoptable_file_not_found error during import

Causes & Fixes:

  • Typo in file URL → Double-check bucket name, path, and filename (case-sensitive)

  • File doesn't exist → Verify file exists in your cloud storage console

  • Wrong data store selected → Ensure you selected the correct destination store

  • Wrong cloud region → Check that the data store is configured for the correct region

  • File in different bucket → Verify the bucket name matches your data store configuration


Permission Denied

Symptom: Import fails with access or permission error

Causes & Fixes:

  • Data store credentials invalid → Verify data store configuration and credentials

  • Bucket policy blocks access → Check cloud storage IAM/permissions allow Valohai to read files

  • File is private/encrypted → Ensure Valohai's service account has read access

  • Cross-region access issues → Verify data store configuration matches file location


Already Exists

Symptom: File shows as already existing in Valohai

Causes & Fixes:

  • File was previously imported → Find it in Data → Browse tab (not an error)

  • Trying to import duplicate → Use the existing datum ID instead of re-importing

To find existing datum:

  1. Go to Data → Browse

  2. Search by filename

  3. Copy the datum ID
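
You may also be able to locate existing datums from a script. The sketch below assumes a paginated /api/v0/data/ listing endpoint with a project filter; verify the exact endpoint and parameters against the Valohai API reference:

import os
import requests

# Assumed endpoint and filter; confirm against the API reference
response = requests.get(
    'https://app.valohai.com/api/v0/data/',
    params={'project': 'your-project-id-here'},
    headers={'Authorization': 'Token ' + os.getenv('VH_TOKEN')},
    timeout=30
)
response.raise_for_status()
for datum in response.json().get('results', []):
    print(datum.get('id'), datum.get('name'))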


Bulk Import Interrupted

Symptom: Web UI import stops partway through large file list

Causes & Fixes:

  • Browser timeout → Use API with retry logic for bulk imports (>50 files)

  • Network interruption → Import in smaller batches via UI

  • Too many files → Use the API script with error handling (see above)


Best Practices

Organize Before Importing

Plan your import strategy:

  • Group related files

  • Use consistent naming

  • Document file sources

  • Tag immediately after import

Use API for Bulk Operations

  • Web UI: Good for <50 files

  • API: Recommended for 50+ files

Add Metadata Immediately

After importing, add context:

metadata = {
    "valohai.tags": ["imported-2024-01", "legacy-data"],
    "valohai.alias": "training-data-v1",
    "source": "s3-legacy-bucket",
    "import_date": "2024-01-15",
    "original_owner": "data-team"
}

Create Aliases for Key Files

Make frequently-used imports easy to reference:

legacy-training-data → datum://abc123...
baseline-model → datum://def456...
validation-set-fixed → datum://ghi789...

Verify After Import

Check that files are accessible:

  1. Find imported file in Data → Browse

  2. Create a test execution using the file as input

  3. Verify file downloads and opens correctly


