As the data in your database evolves, the results of your queries change with it. Running a query today may yield different results than running the same query in the future.
To ensure reproducibility, it’s highly recommended to save query results or preprocessed datasets as files to /valohai/outputs/. This preserves a snapshot of the exact query results before conducting any training with the data.
Here’s an example pipeline:
- Fetch data from a database and preprocess it.
- Save the preprocessed data to /valohai/outputs/, which is stored in Amazon S3.
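The steps above can be sketched roughly as follows. This is a minimal illustration, not Valohai's own implementation: it assumes a database reachable through Python's built-in sqlite3 module, and the function name `snapshot_query`, the file name `preprocessed.csv`, and the "drop rows with NULL values" preprocessing step are all hypothetical placeholders for your own logic.

```python
import csv
import sqlite3
from pathlib import Path


def snapshot_query(conn, query, output_dir="/valohai/outputs/",
                   filename="preprocessed.csv"):
    """Run a query, preprocess the rows, and save them as a CSV snapshot.

    `conn` is an open sqlite3 connection (an assumption for this sketch;
    substitute the client for your own database).
    """
    cursor = conn.execute(query)
    columns = [col[0] for col in cursor.description]

    # Hypothetical preprocessing step: drop rows that contain NULL values.
    rows = [row for row in cursor.fetchall() if None not in row]

    # Write the snapshot into the outputs directory so the exact data
    # used for training is preserved alongside the execution.
    out_path = Path(output_dir) / filename
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        writer.writerows(rows)
    return out_path
```

Inside an execution you would call something like `snapshot_query(conn, "SELECT ...")` with your own connection and query, and leave `output_dir` at its default. Passing a different `output_dir` makes the same function easy to try locally.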
By saving the preprocessed data, you create a reference point for your training. If you need to reproduce the training or inspect the exact data it used, you can rerun the training on Valohai or download the saved dataset, rather than relying on what the query happened to return on a specific day.