Machine Learning Workflow

This document summarizes the machine learning workflow in three stages: data preparation, model training, and deployment and monitoring.

1. Machine Learning Workflow

Data preparation is the process of turning raw data into validated features that are ready for model training, validation, and testing.

Data transformation : Converting data into a form that is easier to work with, then reloading the transformed data.
Data cleaning : Removing or correcting inaccurate data.
Data normalization : Rescaling features to a common range such as 0–1 when their scales differ widely, so that a feature with large values does not dominate learning and the other features still contribute.
Data featurization : Extracting features from data for use in the model. Often existing fields are used as features; it also includes creating new features that are not present in the raw data but are derived from it.
Data validation : Final checks on featurized data before it is used, typically of type, range, and shape.
Data split : Splitting validated featurized data into training, validation, and test sets. A common split is roughly 60% training, 20% validation, and 20% test.
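
The normalization, validation, and split steps above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the function names (min_max_scale, validate_rows, split_rows), the sample ages and incomes, and the fixed shuffle seed are all assumptions made for the example, and only the Python standard library is used.

```python
import random

def min_max_scale(column):
    """Rescale a list of numbers to the 0-1 range (data normalization)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature has no spread; map every value to 0.0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def validate_rows(rows, n_features):
    """Check shape, type, and range of featurized rows (data validation)."""
    for row in rows:
        if len(row) != n_features:
            raise ValueError(f"expected {n_features} features, got {len(row)}")
        for value in row:
            if not isinstance(value, float):
                raise TypeError(f"expected float, got {type(value).__name__}")
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"value {value} is outside the 0-1 range")

def split_rows(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle, then split into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains (about 20%)
    return train, val, test

# Made-up raw data: two features on very different scales.
ages = [22, 35, 47, 51, 29, 63, 38, 44, 56, 31]
incomes = [21000, 54000, 72000, 88000, 33000, 97000, 61000, 65000, 91000, 40000]

rows = list(zip(min_max_scale(ages), min_max_scale(incomes)))
validate_rows(rows, n_features=2)
train_rows, val_rows, test_rows = split_rows(rows)
print(len(train_rows), len(val_rows), len(test_rows))  # prints: 6 2 2
```

Shuffling before splitting is a design choice for data with no time ordering; for time series, a chronological split is usually more appropriate.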

Model training is the process of building, training, and validating a model using the prepared data.

Algorithm selection : Choosing which machine learning algorithm to use.
Model hyperparameter tuning : Configuring the model based on the chosen algorithm and deciding hyperparameter values.
Model training : Training the configured model to learn its parameters, using the training split from the data split step.
Model validation : Evaluating the trained model on the validation split, reviewing metrics such as accuracy and other performance measures, and checking whether requirements are met.
Model testing : Evaluating how the validated model behaves on data not used for training or validation, using the test split.
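
As a rough sketch of how the training steps above fit together, the example below tunes one hyperparameter (the learning rate) on the validation split and touches the test split only once, at the end. The algorithm (a one-feature logistic model fitted by gradient descent), the candidate learning rates, the toy data, and the helper names train_logistic and accuracy are assumptions chosen for illustration; the source does not prescribe any of them.

```python
import math

def train_logistic(xs, ys, learning_rate, epochs=200):
    """Learn the parameters (weight w, bias b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted probability
            grad_w += (p - y) * x
            grad_b += p - y
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b

def accuracy(model, xs, ys):
    """Fraction of examples whose predicted class matches the label."""
    w, b = model
    correct = sum(1 for x, y in zip(xs, ys) if (w * x + b >= 0) == (y == 1))
    return correct / len(xs)

# Toy splits (assumed): one feature already scaled to 0-1, label 1 if x > 0.5.
train_x = [0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95]
train_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
val_x, val_y = [0.15, 0.35, 0.65, 0.85], [0, 0, 1, 1]
test_x, test_y = [0.25, 0.45, 0.55, 0.75], [0, 0, 1, 1]

# Hyperparameter tuning: try each candidate, keep the best on validation.
candidates = [0.01, 0.1, 1.0]
best_lr, best_model, best_val_acc = None, None, -1.0
for lr in candidates:
    model = train_logistic(train_x, train_y, lr)   # model training
    val_acc = accuracy(model, val_x, val_y)        # model validation
    if val_acc > best_val_acc:
        best_lr, best_model, best_val_acc = lr, model, val_acc

# Model testing: a single evaluation on data unseen during training and tuning.
test_acc = accuracy(best_model, test_x, test_y)
print(f"best learning rate={best_lr}, validation={best_val_acc:.2f}, test={test_acc:.2f}")
```

Keeping the test split out of the tuning loop is what makes the final number an honest estimate of behavior on new data.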

Deployment and monitoring is the process of putting the tested model into production and tracking how it behaves there.

Model deployment : Deploying the tested model into production.
Model monitoring : Tracking deployed model metrics such as accuracy and performance.
Model retraining : Retraining the model when monitoring indicates it is needed.
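
One simple way to connect monitoring to retraining is a rolling accuracy check, sketched below. The class name AccuracyMonitor, the window size, and the minimum-accuracy threshold are hypothetical choices for this example; real deployments often track additional signals and get the true labels only after a delay.

```python
from collections import deque

class AccuracyMonitor:
    """Tracks accuracy over the most recent predictions (assumed design)."""

    def __init__(self, window=100, min_accuracy=0.9):
        self.outcomes = deque(maxlen=window)  # True where prediction was right
        self.min_accuracy = min_accuracy

    def record(self, prediction, actual):
        """Store whether one deployed prediction matched the true label."""
        self.outcomes.append(prediction == actual)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """Signal retraining once a full window falls below the threshold."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        return self.accuracy() < self.min_accuracy

monitor = AccuracyMonitor(window=4, min_accuracy=0.75)
for prediction, actual in [(1, 1), (0, 0), (1, 1), (0, 0)]:
    monitor.record(prediction, actual)
print(monitor.needs_retraining())  # prints: False (4 of 4 correct)

for prediction, actual in [(1, 0), (0, 1)]:
    monitor.record(prediction, actual)
print(monitor.needs_retraining())  # prints: True (2 of the last 4 correct)
```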