Every step defines a separate type of execution such as feature extraction or training.

Executing a step creates an execution. Step executions are heavily version controlled, so any past workload can be re-executed as long as its Docker image and inputs are still available.
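The relationship between a step and its executions can be sketched in a few lines of Python. This is only an illustrative model, not the platform's actual API: the names `Step`, `Execution`, `execute` and `re_execute`, the image tag and the `s3://example-bucket/...` URL are all assumptions made up for this example, and nothing is actually launched in a container.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, Tuple


@dataclass(frozen=True)
class Step:
    """A type of execution, e.g. feature extraction or training (hypothetical model)."""
    name: str
    image: str    # Docker image the step runs in
    command: str  # command run inside the container


@dataclass(frozen=True)
class Execution:
    """Immutable record of one run of a step."""
    number: int
    step: Step
    inputs: Tuple[Tuple[str, str], ...]  # (input name, file URL) pairs


_counter = count(1)  # hands out increasing execution numbers


def execute(step: Step, inputs: Dict[str, str]) -> Execution:
    # Record everything needed to repeat the run; this sketch does not start a container.
    return Execution(next(_counter), step, tuple(sorted(inputs.items())))


def re_execute(past: Execution) -> Execution:
    # Same image, command and inputs; only the execution number differs.
    return execute(past.step, dict(past.inputs))


# Hypothetical step and input, for illustration only.
train = Step(name="train-model", image="python:3.11", command="python train.py")
first = execute(train, {"dataset": "s3://example-bucket/train.csv"})
again = re_execute(first)
```

Because each execution record stores the step definition and inputs it ran with, repeating a past workload only needs that record, provided the image and input files it points to still exist.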

Because machine learning projects differ so much from one another, we think users should have as much flexibility as possible in building their own data science pipeline steps.

Usually separate steps are defined for:

  • preprocessing files and uploading them for use by other steps
  • integrating with database services to create a version-controlled snapshot of the training data
  • executing a Python script or C code, e.g. to train a predictive model
  • validating whether a trained model can be used in production
  • deploying a trained model to staging or production
  • building application binaries to be used in other steps

You can run anything that works inside a Docker container, so the possibilities are endless ;)