Building an MLOps pipeline on AWS — Part 1: Model training

Neha Tomar
Sep 12, 2022


As companies adopt machine learning across their organizations, manually building, training, and deploying ML models becomes a bottleneck for innovation. Establishing MLOps patterns allows you to create repeatable workflows for all stages of the ML lifecycle and is key to transitioning from the manual experimentation phase to production. MLOps helps companies innovate faster by boosting the productivity of data science and ML teams in creating and deploying accurate models.

In this blog, I am going to explain the end-to-end steps for building an enterprise MLOps pipeline.

At a high level, I will create two pipelines using CloudFormation over this two-part series:

1. Part 1: Model training pipeline

2. Part 2: Model deployment pipeline

Part 1: Creating the CloudFormation templates for the ML training pipeline

In this section, we will create two CloudFormation templates that do the following:

· The first template creates an AWS Step Functions workflow for ML model training that performs data processing, model training, and model registration. This will be a component of the training pipeline.

· The second template creates the CodePipeline definition for the ML model training pipeline, with two stages:

- A source stage, which listens for changes in a CodeCommit repository

- A deployment stage, which kicks off the execution of the Step Functions ML model training workflow that we created (a minimal sketch of this pipeline definition follows this list)
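
The CodePipeline template itself is not reproduced in this post, so for orientation, here is a minimal, illustrative sketch of what such a pipeline definition might look like. The parameter names, repository name, and branch are placeholders rather than values from this post, and the Step Functions action reads its execution input from the sf_start_params.json file that we will commit to the repository in step 5:

# Illustrative sketch only; the actual pipeline template may differ.
AWSTemplateFormatVersion: '2010-09-09'
Description: ML model training pipeline (illustrative sketch)
Parameters:
  PipelineRoleArn:
    Type: String            # IAM role that CodePipeline assumes
  StateMachineArn:
    Type: String            # ARN of the Step Functions training workflow
  ArtifactBucket:
    Type: String            # S3 bucket for pipeline artifacts
Resources:
  ModelTrainingPipeline:
    Type: AWS::CodePipeline::Pipeline
    Properties:
      RoleArn: !Ref PipelineRoleArn
      ArtifactStore:
        Type: S3
        Location: !Ref ArtifactBucket
      Stages:
        - Name: Source
          Actions:
            - Name: CodeCommitSource
              ActionTypeId:
                Category: Source
                Owner: AWS
                Provider: CodeCommit
                Version: '1'
              Configuration:
                RepositoryName: ml-training-repo   # hypothetical repository name
                BranchName: main
              OutputArtifacts:
                - Name: SourceOutput
        - Name: Deploy
          Actions:
            - Name: StartTrainingWorkflow
              ActionTypeId:
                Category: Invoke
                Owner: AWS
                Provider: StepFunctions
                Version: '1'
              Configuration:
                StateMachineArn: !Ref StateMachineArn
                InputType: FilePath
                Input: sf_start_params.json        # execution input from the source repo
              InputArtifacts:
                - Name: SourceOutput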

Now, let’s get started with the CloudFormation template for the Step Functions workflow:

1. Create an IAM role that the Step Functions workflow will assume (the Step Functions execution role) and attach the following IAM policy to it. The policy grants the SageMaker, EventBridge, and Lambda permissions the workflow needs. Take note of the role’s ARN; you will need it in step 3:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateModel",
        "sagemaker:DeleteEndpointConfig",
        "sagemaker:DescribeTrainingJob",
        "sagemaker:CreateEndpoint",
        "sagemaker:StopTrainingJob",
        "sagemaker:CreateTrainingJob",
        "sagemaker:UpdateEndpoint",
        "sagemaker:CreateEndpointConfig",
        "sagemaker:DeleteEndpoint"
      ],
      "Resource": [
        "arn:aws:sagemaker:*:*:*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "events:DescribeRule",
        "events:PutRule",
        "events:PutTargets"
      ],
      "Resource": [
        "arn:aws:events:*:*:rule/StepFunctionsGetEventsForSageMakerTrainingJobsRule"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "lambda:InvokeFunction"
      ],
      "Resource": [
        "arn:aws:lambda:*:*:function:query-training-status*"
      ]
    }
  ]
}
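
If you would rather manage the role itself as infrastructure as code too, the following is a minimal sketch of how it could be created in CloudFormation. The resource, policy, and output names are placeholders, and the PolicyDocument is abbreviated; in practice you would include the full policy document shown above:

# Illustrative sketch only; you can equally create the role in the IAM console or CLI.
AWSTemplateFormatVersion: '2010-09-09'
Description: Execution role for the ML training Step Functions workflow (illustrative sketch)
Resources:
  StepFunctionExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: states.amazonaws.com   # lets Step Functions assume this role
            Action: sts:AssumeRole
      Policies:
        - PolicyName: ml-training-workflow-policy   # hypothetical name
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              # Abbreviated; use the full policy document shown above.
              - Effect: Allow
                Action:
                  - sagemaker:CreateTrainingJob
                  - sagemaker:DescribeTrainingJob
                  - sagemaker:StopTrainingJob
                  - sagemaker:CreateModel
                Resource: 'arn:aws:sagemaker:*:*:*'
Outputs:
  StepFunctionExecutionRoleArn:
    Value: !GetAtt StepFunctionExecutionRole.Arn   # note this ARN for step 3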

2. Copy and save the following code block to a file locally and name it training_workflow.yaml. The CloudFormation template below creates a Step Functions state machine with a training step and a model registration step. We are using CloudFormation here to demonstrate managing infrastructure as code (IaC); data scientists also have the option of using the Step Functions Data Science SDK to create the pipeline with a Python script:
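
The original training_workflow.yaml is not reproduced in this post, so what follows is a minimal, illustrative sketch of what such a template might look like: a state machine with a SageMaker training step followed by a model registration step. The state names, instance type, and the environment variables on the inference container are assumptions based on the input JSON shown in step 4; how the training script itself is passed to the training container (for example, as hyperparameters) is omitted, so adapt this to your own training setup:

# Illustrative sketch only; not the exact template from the original post.
AWSTemplateFormatVersion: '2010-09-09'
Description: Step Functions workflow that trains and registers a model (illustrative sketch)
Parameters:
  StepFunctionExecutionRoleArn:
    Type: String              # ARN of the execution role created in step 1
Resources:
  TrainingStateMachine:
    Type: AWS::StepFunctions::StateMachine
    Properties:
      RoleArn: !Ref StepFunctionExecutionRoleArn
      DefinitionString: |
        {
          "StartAt": "Train model",
          "States": {
            "Train model": {
              "Type": "Task",
              "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
              "Parameters": {
                "TrainingJobName.$": "$$.Execution.Name",
                "RoleArn.$": "$.SageMakerRoleArn",
                "AlgorithmSpecification": {
                  "TrainingImage.$": "$.TrainingImage",
                  "TrainingInputMode": "File"
                },
                "InputDataConfig": [
                  {
                    "ChannelName": "training",
                    "DataSource": {
                      "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri.$": "$.S3UriTraining"
                      }
                    }
                  },
                  {
                    "ChannelName": "testing",
                    "DataSource": {
                      "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri.$": "$.S3UriTesting"
                      }
                    }
                  }
                ],
                "OutputDataConfig": {
                  "S3OutputPath.$": "$.S3OutputPath"
                },
                "ResourceConfig": {
                  "InstanceCount": 1,
                  "InstanceType": "ml.p3.2xlarge",
                  "VolumeSizeInGB": 30
                },
                "StoppingCondition": {
                  "MaxRuntimeInSeconds": 86400
                }
              },
              "ResultPath": "$.TrainingJob",
              "Next": "Register model"
            },
            "Register model": {
              "Type": "Task",
              "Resource": "arn:aws:states:::sagemaker:createModel",
              "Parameters": {
                "ModelName.$": "$$.Execution.Name",
                "ExecutionRoleArn.$": "$.SageMakerRoleArn",
                "PrimaryContainer": {
                  "Image.$": "$.InferenceImage",
                  "ModelDataUrl.$": "$.TrainingJob.ModelArtifacts.S3ModelArtifacts",
                  "Environment": {
                    "SAGEMAKER_PROGRAM.$": "$.SAGEMAKER_PROGRAM",
                    "SAGEMAKER_SUBMIT_DIRECTORY.$": "$.SAGEMAKER_SUBMIT_DIRECTORY",
                    "SAGEMAKER_REGION.$": "$.SAGEMAKER_REGION"
                  }
                }
              },
              "End": true
            }
          }
        }
Outputs:
  TrainingStateMachineArn:
    Value: !Ref TrainingStateMachine   # Ref returns the state machine ARN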

3. Launch the newly created template in the CloudFormation console. Make sure that you provide a value for the StepFunctionExecutionRoleArn field when prompted; this is the ARN you noted in step 1. Once the CloudFormation stack has been created, go to the Step Functions console to test it.

4. Test the workflow in the Step Functions console to make sure it works. Navigate to the newly created state machine and click Start Execution. When you’re prompted for input, copy and paste the following JSON; these are the input values the Step Functions workflow will use. Make sure that you replace the placeholders with the values for your environment:

{
  "TrainingImage": "<aws hosting account>.dkr.ecr.<aws region>.amazonaws.com/pytorch-training:1.3.1-gpu-py3",
  "S3OutputPath": "s3://<your S3 bucket name>/sagemaker/pytorch-bert-financetext",
  "SageMakerRoleArn": "arn:aws:iam::<your aws account>:role/service-role/<your sagemaker execution role>",
  "S3UriTraining": "s3://<your S3 bucket name>/sagemaker/pytorch-bert-financetext/train.csv",
  "S3UriTesting": "s3://<your S3 bucket name>/sagemaker/pytorch-bert-financetext/test.csv",
  "InferenceImage": "<aws hosting account>.dkr.ecr.<aws region>.amazonaws.com/pytorch-inference:1.3.1-cpu-py3",
  "SAGEMAKER_PROGRAM": "train.py",
  "SAGEMAKER_SUBMIT_DIRECTORY": "s3://<your S3 bucket name>/berttraining/source/sourcedir.tar.gz",
  "SAGEMAKER_REGION": "<your aws region>"
}

5. Check the execution status in the Step Functions console and make sure that the model has been trained and registered correctly. Once everything has completed, save the input JSON in a file called sf_start_params.json. Launch the SageMaker Studio environment you created earlier (see Using AWS ML Services), navigate to the folder where you cloned the CodeCommit repository, and upload the sf_start_params.json file into it. Commit the change and verify that the file is in the repository; we will use it in the next section of the lab.

Summary

In this first part of the series, we discussed why MLOps patterns matter and built the training side of an enterprise MLOps pipeline on AWS: we created an IAM role and policy for Step Functions, defined a CloudFormation-managed Step Functions workflow that trains and registers a model with SageMaker, and tested the workflow end to end. In Part 2, we will build the model deployment pipeline.
