Migrate Your ML Jobs

Your existing ML code can run on Valohai with minimal changes. This guide walks you through migrating your workflows in under five hours, keeping your code intact while gaining versioning, reproducibility, and scalability. This page gives an overview of the steps; more detailed instructions for each step can be found in the other sections.

Migration Timeline

  • Step 1: Define dependencies

  • Step 2: Create valohai.yaml (30 minutes)

  • Step 3: Add parameters and metrics (1-2 hours)

  • Step 4: Configure outputs (30 minutes)

  • Step 5: Update data access (1-2 hours)


Step 1: Define Your Dependencies

Identify the Python packages your code needs. You have two options:

Option A: Install at runtime

pip install -r requirements.txt
conda install pandas=0.13.1

Option B: Use a Docker image with pre-installed dependencies

image: tensorflow/tensorflow:2.6.0

💡 Tip: Include version numbers to ensure reproducible environments across all executions.
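If you go with Option A, one way to get pinned versions is to freeze the environment where your code already works (a sketch; check the resulting file before committing it):

```shell
# Capture exact versions from the environment where your code already runs
pip freeze > requirements.txt

# Spot-check the pinned versions before committing the file
head requirements.txt
```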


Step 2: Write Your valohai.yaml (30 minutes)

Create a valohai.yaml file in your repository root. Start simple—your existing code runs as-is:

- step:
    name: train-model
    image: tensorflow/tensorflow:2.6.0
    command:
        - pip install -r requirements.txt
        - python train_model.py

That's it. Your job now runs on Valohai without touching your Python code.
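To illustrate the point, here is a minimal, hypothetical stand-in for `train_model.py`: note that it contains nothing Valohai-specific, so the same script runs locally and inside the step above.

```python
# train_model.py — a minimal stand-in for your existing entry point
# (hypothetical). No Valohai-specific code is required.

def train():
    # ... your existing training logic ...
    return {"status": "ok"}

if __name__ == "__main__":
    print(train())
```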


Step 3: Add Parameters and Metrics (1-2 hours)

Parameters

If your code uses argparse or similar, this takes minutes. Define parameters in valohai.yaml:

- step:
    name: train-model
    image: tensorflow/tensorflow:2.6.0
    command:
        - python train_model.py {parameters}
    parameters:
        - name: iterations
          type: integer
          default: 10
        - name: learningrate
          type: float
          default: 0.01
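At runtime, `{parameters}` expands into command-line flags (here, `--iterations=10 --learningrate=0.01`), so a standard argparse setup picks them up. A sketch matching the parameters above:

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Flag names match the parameter names defined in valohai.yaml
    parser.add_argument("--iterations", type=int, default=10)
    parser.add_argument("--learningrate", type=float, default=0.01)
    return parser.parse_args(argv)

# Simulate the flags Valohai would pass for non-default values
args = parse_args(["--iterations=20", "--learningrate=0.001"])
print(args.iterations, args.learningrate)
```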

Metrics

Log metrics by printing JSON from your Python code, for example:

import json

print(json.dumps({
    "precision": 0.8125,
    "recall": 0.8667,
    "f1_score": 0.8387
}))

Valohai automatically captures and visualizes these metrics.
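In a training loop, this typically means printing one JSON object per epoch, one per line. A sketch (the helper name `log_metrics` is illustrative, not a Valohai API):

```python
import json

def log_metrics(epoch, metrics):
    # One JSON object per line on stdout
    print(json.dumps({"epoch": epoch, **metrics}))

# Example: log a metric for each training epoch
for epoch in range(3):
    log_metrics(epoch, {"loss": round(1.0 / (epoch + 1), 4)})
```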


Step 4: Save Output Artifacts (30 minutes)

Save models, CSVs, or any other outputs to the /valohai/outputs/ directory:

# Before: local save
model.save('model.h5')

# After: Valohai versioned output
model.save('/valohai/outputs/model.h5')

Valohai automatically versions and uploads all outputs to your cloud storage.
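A portable pattern is to resolve the output directory once and fall back to a local folder when running outside Valohai. This sketch assumes the `VH_OUTPUTS_DIR` environment variable set in Valohai executions; verify the variable name against your setup:

```python
import os

# Use Valohai's outputs directory when available; fall back locally
out_dir = os.environ.get("VH_OUTPUTS_DIR", "./outputs")
os.makedirs(out_dir, exist_ok=True)

model_path = os.path.join(out_dir, "model.h5")
# model.save(model_path)  # same line works locally and on Valohai
print(model_path)
```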


Step 5: Update Data Access (1-2 hours)

Valohai handles all the complexity of cloud storage—authentication, access control, downloading, and caching. Your code just reads from local paths while Valohai manages everything behind the scenes.

Define Your Data Sources

Specify inputs in your YAML configuration:

- step:
    name: train-model
    image: tensorflow/tensorflow:2.6.0
    command:
        - python train_model.py
    inputs:
        - name: images
          keep-directories: suffix
          default:
          - s3://mybucket/factories/images/*.png
          - azure://myblobstorage/factories/images/*.png
          - gs://mybucket/factories/images/*.png

Simplify Your Code

Remove all cloud authentication and data management code:

# Before: Complex cloud operations
s3_client = boto3.client('s3', 
    aws_access_key_id=KEY,
    aws_secret_access_key=SECRET)
download_from_s3('mybucket/factories/images/')
handle_caching_logic()

# After: Just read local files
images = '/valohai/inputs/images/'
# All files are already there, downloaded and cached by Valohai

Valohai automatically:

  • Authenticates with your cloud storage

  • Downloads files to the execution environment

  • Caches the input data for faster access

  • Works identically across AWS, Azure, GCP, and on-premises storage
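Reading the input defined above then reduces to listing local files. A sketch with a local fallback for running outside Valohai (it assumes the `VH_INPUTS_DIR` environment variable set in Valohai executions; each input gets a subdirectory named after it):

```python
import os
from pathlib import Path

# Use Valohai's inputs directory when available; fall back locally
inputs_dir = Path(os.environ.get("VH_INPUTS_DIR", "./inputs")) / "images"
inputs_dir.mkdir(parents=True, exist_ok=True)  # only needed for the local fallback

# Every file declared under the 'images' input is already on disk here
image_files = sorted(inputs_dir.glob("**/*.png"))
print(f"found {len(image_files)} images")
```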

💡 Advanced data management: Use Valohai datasets and aliases to version your data without hardcoding storage paths. Reference data as dataset://my-training-data or models as model://cats-v2 for better tracking and reproducibility.


You're Done! 🎉

Your ML jobs now run on Valohai with:

  • Automatic versioning of code, data, and outputs

  • Experiment tracking and comparison

  • Scalability across cloud and on-premises infrastructure

  • No vendor lock-in—your code remains portable
