Google Cloud Filestore

Mount Google Cloud Filestore to access shared network storage directly from Valohai executions.


Overview

Google Cloud Filestore provides managed NFS storage that you can mount in Valohai executions. Use Filestore to:

  • Access large datasets without downloading

  • Share preprocessed data across multiple executions

  • Cache intermediate results on fast shared storage

  • Process data in place and save versioned outputs

⚠️ Important: Files on Filestore mounts are NOT versioned by Valohai. Always save final results to /valohai/outputs/ for reproducibility.


Prerequisites

Before mounting Filestore in Valohai:

  1. Existing Filestore instance — Use an existing Filestore or create a new one in GCP Console

  2. Same VPC or VPC peering — Filestore must be in the same VPC as your Valohai resources, or you must set up VPC peering between the VPCs (a gcloud sketch follows this list)

  3. Network access — Filestore automatically allows access from VMs in the same VPC (no additional firewall rules needed)
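
If peering is needed, note that VPC Network Peering must be configured on both networks. A gcloud sketch (the peering, network, and project names are placeholders for your own):

# Peer the Valohai VPC with the Filestore VPC
gcloud compute networks peerings create valohai-to-filestore \
    --network=VALOHAI_VPC \
    --peer-project=PROJECT_ID \
    --peer-network=FILESTORE_VPC

# Repeat in the other direction
gcloud compute networks peerings create filestore-to-valohai \
    --network=FILESTORE_VPC \
    --peer-project=PROJECT_ID \
    --peer-network=VALOHAI_VPC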


Setup: Configure Filestore Access

Step 1: Create or Find Filestore Instance

In GCP Console:

  1. Go to Filestore → Instances

  2. Find your instance or click "Create Instance" (or create one with gcloud, as sketched below)

  3. Note the IP address (e.g., 10.0.0.5)

  4. Note the File share name (e.g., share1)
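
If you prefer the CLI, an instance can also be created with gcloud. A minimal sketch (the instance name, zone, tier, capacity, and network are examples to adjust):

# Create a basic Filestore instance
gcloud filestore instances create my-filestore \
    --zone=us-central1-c \
    --tier=BASIC_HDD \
    --file-share=name=share1,capacity=1TB \
    --network=name=default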


Step 2: Get Filestore IP Address

Via GCP Console:

  1. Go to Filestore → Instances

  2. Click on your instance

  3. Copy the IP address from instance details

Via gcloud command:

# List all Filestore instances
gcloud filestore instances list

# Get specific instance details
gcloud filestore instances describe INSTANCE_NAME \
    --location=ZONE \
    --format="value(networks.ipAddresses[0])"

Example output:

10.0.0.5

Step 3: Verify VPC Configuration

Ensure Filestore and Valohai VMs are in the same VPC:

  1. In GCP Console, go to Filestore → Instances

  2. Check the Network column for your instance

  3. Verify it matches the VPC where Valohai resources are deployed

  4. If they are in different VPCs, set up VPC peering (a quick connectivity check is sketched below)
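
Before wiring up the mount, you can sanity-check connectivity by running a short execution (without the mount) that opens a TCP connection to the Filestore IP on the NFS port. A minimal sketch, assuming the 10.0.0.5 address from the earlier example:

import socket

FILESTORE_IP = '10.0.0.5'  # replace with your instance's IP
NFS_PORT = 2049            # NFS serves on TCP 2049

try:
    with socket.create_connection((FILESTORE_IP, NFS_PORT), timeout=5):
        print(f"OK: {FILESTORE_IP}:{NFS_PORT} is reachable from this network")
except OSError as e:
    print(f"FAILED: cannot reach {FILESTORE_IP}:{NFS_PORT} ({e})")
    print("Check that the VM and Filestore share a VPC, or that peering is in place")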


Mount Filestore in Execution

Basic Mount Configuration

valohai.yaml:

- step:
    name: process-with-filestore
    image: python:3.9
    command:
      - python process_data.py
    mounts:
      - destination: /mnt/filestore-data
        source: <ip-address>:/share1
        type: nfs
        readonly: true

Parameters:

  • destination — Mount point inside container (e.g., /mnt/filestore-data)

  • source — Filestore IP and share name (format: <ip-address>:/<share-name>)

  • type — Always nfs for Filestore

  • readonly — true (recommended) or false
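
Once mounted, the share behaves like any local directory inside the container. A minimal read sketch, assuming the /mnt/filestore-data destination from the example above:

from pathlib import Path

MOUNT = Path('/mnt/filestore-data')  # matches the mount destination above

# List the first few entries to confirm the share is visible
for entry in sorted(MOUNT.iterdir())[:10]:
    size = entry.stat().st_size if entry.is_file() else '(dir)'
    print(entry.name, size)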


Mount Specific Filestore Directory

mounts:
  - destination: /mnt/training-data
    source: <ip-address>:/share1/ml-datasets/training
    type: nfs
    readonly: true

This mounts only the /ml-datasets/training directory from the Filestore share.


Complete Workflow Example

Mount → Process → Save Pattern

Scenario: Process large video dataset stored on Filestore, extract features, save to Valohai outputs.

valohai.yaml:

- step:
    name: extract-video-features
    image: python:3.9
    command:
      - pip install opencv-python-headless numpy pandas  # headless build avoids missing-libGL errors in containers
      - python extract_features.py
    mounts:
      - destination: /mnt/raw-videos
        source: <ip-address>:/video-storage/raw
        type: nfs
        readonly: true
    environment-variables:
      - name: SAMPLE_RATE
        default: "1"  # Process every Nth frame

extract_features.py:

import os
import cv2
import numpy as np
import pandas as pd
import json

# Configuration
FILESTORE_PATH = '/mnt/raw-videos/'
OUTPUT_PATH = '/valohai/outputs/'
SAMPLE_RATE = int(os.getenv('SAMPLE_RATE', '1'))

# 1. Scan Filestore for videos (NOT versioned)
print(f"Scanning Filestore mount: {FILESTORE_PATH}")
video_files = sorted(f for f in os.listdir(FILESTORE_PATH) if f.endswith(('.mp4', '.avi')))  # sort for a deterministic order
print(f"Found {len(video_files)} videos on Filestore")

# 2. Extract features from videos
features_list = []

for i, filename in enumerate(video_files):
    video_path = os.path.join(FILESTORE_PATH, filename)
    
    # Read video from Filestore
    cap = cv2.VideoCapture(video_path)
    
    frame_count = 0
    video_features = []
    
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        
        # Sample frames based on SAMPLE_RATE
        if frame_count % SAMPLE_RATE == 0:
            # Extract features (e.g., histogram)
            hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256, 0, 256, 0, 256])
            hist = hist.flatten()
            video_features.append(hist)
        
        frame_count += 1
    
    cap.release()
    
    # Aggregate features for this video
    if video_features:
        avg_features = np.mean(video_features, axis=0)
        
        features_list.append({
            'video_filename': filename,
            'frame_count': frame_count,
            'sampled_frames': len(video_features),
            'features': avg_features.tolist()
        })
    
    if (i + 1) % 10 == 0:
        print(f"Processed {i + 1}/{len(video_files)} videos...")

# 3. Save features to Valohai outputs (VERSIONED ✅)
print(f"\nSaving {len(features_list)} video features...")

# Save as CSV for analysis
df = pd.DataFrame([
    {
        'video_filename': item['video_filename'],
        'frame_count': item['frame_count'],
        'sampled_frames': item['sampled_frames']
    }
    for item in features_list
])
df.to_csv(os.path.join(OUTPUT_PATH, 'video_metadata.csv'), index=False)

# Save features as numpy array
features_array = np.array([item['features'] for item in features_list])
np.save(os.path.join(OUTPUT_PATH, 'video_features.npy'), features_array)

# 4. Create dataset version
dataset_metadata = {
    "video_metadata.csv": {
        "valohai.dataset-versions": [{
            "uri": "dataset://video-features/batch-001"
        }],
        "valohai.tags": ["video-analysis", "features", "extracted"]
    },
    "video_features.npy": {
        "valohai.dataset-versions": [{
            "uri": "dataset://video-features/batch-001"
        }]
    }
}

metadata_path = os.path.join(OUTPUT_PATH, 'valohai.metadata.jsonl')
with open(metadata_path, 'w') as f:
    for filename, file_meta in dataset_metadata.items():
        json.dump({"file": filename, "metadata": file_meta}, f)
        f.write('\n')

print(f"\nFeature extraction complete:")
print(f"  - Processed {len(features_list)} videos")
print(f"  - Extracted features shape: {features_array.shape}")
print(f"  - Created dataset: dataset://video-features/batch-001")

Result:

  • ✅ Raw videos accessed from Filestore (no download time for 100GB+ videos)

  • ✅ Extracted features saved to /valohai/outputs/ (versioned)

  • ✅ Dataset created for reproducible model training

  • ✅ Can train on dataset://video-features/batch-001 anytime
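
To train on that dataset in a later step, reference it as a regular Valohai input. A sketch (the step and input names are illustrative):

- step:
    name: train-on-features
    image: python:3.9
    command:
      - python train.py
    inputs:
      - name: features
        default: dataset://video-features/batch-001

Inside the execution, the dataset files appear under /valohai/inputs/features/ regardless of how the Filestore contents change later.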


Best Practices

Use Readonly for Input Data

# ✅ Good: readonly prevents accidental modifications
mounts:
  - destination: /mnt/training-data
    source: <ip-address>:/share1/training-data
    type: nfs
    readonly: true

# ⚠️ Avoid: writable mounts unless the execution must write back
mounts:
  - destination: /mnt/training-data
    source: <ip-address>:/share1/training-data
    type: nfs
    readonly: false

Always Version Final Results

# ❌ Bad: Only use Filestore, nothing versioned
results = process_data('/mnt/filestore-data/')
results.save('/mnt/filestore-output/results.pkl')  # NOT versioned

# ✅ Good: Filestore for input, outputs for results
results = process_data('/mnt/filestore-data/')
results.save('/valohai/outputs/results.pkl')  # VERSIONED

Organize Your Filestore Structure

/share1/
├── raw-data/
│   ├── images/
│   ├── videos/
│   └── text/
├── cache/
│   └── preprocessing/
├── team-workspace/
│   ├── experiments/
│   └── notebooks/
└── temp/

Clear organization makes mounting and access control easier.
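
With a layout like this, each execution can mount exactly the subtree it needs. A sketch following the example tree above:

mounts:
  # Read-only mount for curated input data
  - destination: /mnt/images
    source: <ip-address>:/share1/raw-data/images
    type: nfs
    readonly: true
  # Writable mount for the shared preprocessing cache
  - destination: /mnt/cache
    source: <ip-address>:/share1/cache/preprocessing
    type: nfs
    readonly: false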


Handle Mount Errors

import os
import sys

FILESTORE_PATH = '/mnt/filestore-data/'

# Verify mount is accessible
if not os.path.exists(FILESTORE_PATH):
    print(f"ERROR: Filestore mount {FILESTORE_PATH} not accessible")
    print("Possible causes:")
    print("  - Wrong IP address in mount configuration")
    print("  - VPC connectivity issue")
    print("  - Filestore instance not running")
    sys.exit(1)

# Verify expected data exists
expected_dir = os.path.join(FILESTORE_PATH, 'datasets')
if not os.path.exists(expected_dir):
    print(f"WARNING: Expected directory not found: {expected_dir}")
    print(f"Available: {os.listdir(FILESTORE_PATH)}")

print(f"Filestore mount verified: {FILESTORE_PATH}")

Maintaining Reproducibility

⚠️ Critical: Filestore data can change between executions. Always save processed results to /valohai/outputs/ for versioning.

The problem:

# Today: Process data from Filestore
data = load_from_filestore('/mnt/filestore-data/')
model = train(data)

# Next week: Someone updates Filestore data
# Retraining gives different results
# Can't reproduce original model

The solution:

import json

# Load from Filestore (current state)
data = load_from_filestore('/mnt/filestore-data/')

# Save a snapshot to versioned outputs
data.to_csv('/valohai/outputs/training_snapshot.csv')

# Create a dataset version via the metadata sidecar file
metadata = {
    "training_snapshot.csv": {
        "valohai.dataset-versions": [{
            "uri": "dataset://training-data/2024-01-15"
        }]
    }
}

with open('/valohai/outputs/valohai.metadata.jsonl', 'w') as f:
    for filename, file_meta in metadata.items():
        json.dump({"file": filename, "metadata": file_meta}, f)
        f.write('\n')

# Train on the versioned snapshot in the next execution
# and reproduce it exactly at any time

See: Access Network Storage for complete patterns.



Next Steps

  • Set up Filestore in your GCP project (or use existing)

  • Get Filestore IP address using gcloud or Console

  • Create test execution mounting Filestore

  • Build pipeline: mount → process → save to outputs

  • Monitor Filestore performance in GCP Console
