Reimagine MLOps Solutions: SageMaker vs. AIQ

Anu Ganesan
7 min read · Feb 11, 2021

What is MLOps? Why do we need an end-to-end MLOps solution for successful machine learning projects?

The primary responsibility of Data Scientists is to extract value from data by building and operationalizing Machine Learning models. As businesses embrace data science to improve business strategy, Data Scientists are struggling to manage the growing number of Machine Learning models.

Fintech, healthcare, and retail companies earmarked their Machine Learning budgets to grow by 25% in 2020, raising the scale and complexity of Machine Learning models.

With these growing complexities, Data Scientists find it laborious to manage the rising number of Machine Learning models in production. Depending on the budget allotted for ML projects, enterprises have either separate teams or data scientists responsible for engineering data and for building and managing ML models at scale.

"Time Taken to deploy a single model is 31 to 90 days"

There is a dire need for a seamless end-to-end Machine Learning platform to experiment with ML models under proper version management and to deploy them at scale with reproducible deployment pipelines. Cloud vendors have taken notice and hopped on the bandwagon, building ML platforms for managing Machine Learning projects end-to-end.

AWS SageMaker

Amazon SageMaker lets users train Machine Learning models by creating a notebook instance from the SageMaker console with the proper IAM role and S3 bucket access. One can use a built-in algorithm or buy and sell algorithms and models in the AWS Marketplace. SageMaker deploys the model on Amazon's model-hosting service with an HTTPS endpoint for model inference, and the compute resources scale automatically based on the data load. SageMaker Model Monitor watches the model for data drift or anomalies using Amazon CloudWatch Metrics. Amazon provides modular services for managing ML projects; with AWS Step Functions, one can automate retraining and the CI/CD process.
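The hosting flow described above boils down to three SageMaker API calls: CreateModel, CreateEndpointConfig, and CreateEndpoint. A minimal sketch of the request payloads those calls expect; the helper function and all names, ARNs, and URIs below are illustrative placeholders, not values from SageMaker's SDK:

```python
def make_hosting_requests(model_name, image_uri, model_data_url,
                          role_arn, instance_type="ml.m5.large"):
    """Build the three request payloads SageMaker hosting needs:
    CreateModel -> CreateEndpointConfig -> CreateEndpoint."""
    create_model = {
        "ModelName": model_name,
        # Container image that serves the model, plus the trained artifact in S3
        "PrimaryContainer": {"Image": image_uri, "ModelDataUrl": model_data_url},
        "ExecutionRoleArn": role_arn,
    }
    create_endpoint_config = {
        "EndpointConfigName": f"{model_name}-config",
        # A single variant taking all traffic; weights enable A/B splits later
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
            "InitialVariantWeight": 1.0,
        }],
    }
    create_endpoint = {
        "EndpointName": f"{model_name}-endpoint",
        "EndpointConfigName": f"{model_name}-config",
    }
    return create_model, create_endpoint_config, create_endpoint

# With boto3, these payloads would be passed to the sagemaker client:
# client.create_model(**m); client.create_endpoint_config(**c); client.create_endpoint(**e)
```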

AWS Step Functions interlinks AWS services such as AWS Lambda, AWS Fargate, and AWS SageMaker to build workflows of your choice, whether for an application or for continuous deployment of ML models.

AWS CodePipeline, a CI/CD service, combined with AWS Step Functions handling workflow-driven actions, provides a powerful automation pipeline.

SageMaker — sample Step Functions state machine

```json
{
  "StartAt": "Deploy",
  "States": {
    "Deploy": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME",
      "Next": "Second Step"
    },
    "Second Step": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME",
      "End": true
    }
  }
}
```
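Because StartAt and every Next must reference a defined state, a definition like this can be sanity-checked locally before the state machine is ever created. A minimal sketch; the validate_state_machine helper is illustrative, not part of any AWS SDK:

```python
import json

def validate_state_machine(definition: str) -> list:
    """Return a list of wiring problems in an Amazon States Language
    definition; an empty list means the basic structure is consistent."""
    doc = json.loads(definition)
    states = doc.get("States", {})
    problems = []
    # StartAt must name a defined state
    if doc.get("StartAt") not in states:
        problems.append(f"StartAt '{doc.get('StartAt')}' is not a defined state")
    for name, state in states.items():
        nxt = state.get("Next")
        # Every Next must point at a defined state
        if nxt is not None and nxt not in states:
            problems.append(f"State '{name}' points to undefined state '{nxt}'")
        # A state with no Next must be terminal
        if nxt is None and not state.get("End"):
            problems.append(f"State '{name}' has neither Next nor End")
    return problems
```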

Limitations using SageMaker

Cost Intensive

Even though SageMaker provides the flexibility to customize Machine Learning models, the lack of interoperability to mix and match Machine Learning services from other cloud vendors burdens enterprises that adopt a specific ML platform. Every ML vendor offers Basic and Enterprise tiers, and the cost and services vary with the selection. As enterprises advance in Machine Learning, the growing number of datasets drives up pricing for processing and storage capacity. The lack of relevant documentation and training, along with the increasing cost of managing Machine Learning projects, poses a threat to enterprises and leads to fewer ML models in production.

Vendor Lock-in

ML platforms from Google, Amazon, and Microsoft can run only on their own cloud or on-premises. Porting from one ML platform to another is a tedious task, as it involves rebuilding ML pipelines from the ground up along with a steep learning curve. Vendor lock-in is real: it restricts enterprises from adopting a multi-cloud strategy and deprives them of products and services from other vendors.

Steep Learning Curve

Different cloud vendors use their own tools and technologies to build ML platforms. Moreover, the understanding of "end-to-end" varies: some platforms have streamlined the build and deployment of ML models, while others concentrate on data engineering and automating the ML build process. The commonality among the cloud vendors' ML platforms is that they are built on top of Kubernetes clusters, but the commonality ends there. Google's Kubeflow Pipelines provides the flexibility to build ML pipelines either from a Jupyter Notebook or from an existing Kubernetes cluster using the Python SDK or CLI. Amazon's SageMaker uses Step Functions and CodePipeline to automate the CI/CD process. Microsoft's Azure provides two different pipelines, one for building ML workflows and another for building CI/CD pipelines. The absence of an integrated end-to-end ML platform and the variety of options from different cloud vendors not only demand a steep learning curve but also make porting from one platform to another nearly impossible.

Lack of Documentation

Every cloud vendor provides a detailed overview and steps for building ML pipelines, and there are articles from renowned data scientists on building end-to-end ML platforms. But as Artificial Intelligence evolves with constant improvements to these platforms, the documented steps are often error prone. Some ML platforms' Kubernetes engines still run older versions of Kubernetes, making the recent Kubernetes documentation irrelevant, and there is no documentation addressing the version mismatch. The heterogeneity of tools and technologies, with various cloud vendors and open-source communities building ML platforms, has exponentially increased the volume of documentation, which not only steepens the learning curve but also breeds confusion.

How does AIQ power Machine Learning projects?

Increase Visibility into Machine Learning Projects

Machine Learning challenges vary with perspective and approach. For instance, management would like better visibility into Machine Learning projects, faster onboarding of data science teams, and reduced cost.

Flexibility to choose any ML Libraries, Frameworks and AutoML

Data Scientists require automated deployment pipelines that can integrate with models implemented using any ML library, framework, or AutoML of their choice. Models should be deployed automatically with minimal effort, providing inference endpoints for applications to consume.
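One way to stay library-agnostic is to hide the framework behind a single predict() interface, so the deployment pipeline never cares which library produced the model. A minimal sketch; the InferenceService class is hypothetical, not AIQ's actual API:

```python
class InferenceService:
    """Wrap any model object behind one predict() interface, whether it
    exposes a scikit-learn style .predict() or is a plain callable."""

    def __init__(self, model):
        if hasattr(model, "predict"):
            # scikit-learn / XGBoost style estimators
            self._fn = model.predict
        elif callable(model):
            # plain functions, e.g. a wrapped TensorFlow or PyTorch forward pass
            self._fn = model
        else:
            raise TypeError("model must expose .predict() or be callable")

    def predict(self, inputs):
        return self._fn(inputs)
```

An HTTP endpoint would then only ever call service.predict(payload), regardless of the underlying framework.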

Deploy effortlessly with automated CI/CD pipelines

Data Scientists experiment with models repeatedly, trying different algorithms and tuning hyperparameters to continuously improve a model's accuracy. After the experimentation phase, trained models are deployed to a staging environment for evaluation before being pushed to production. Ever-changing data, along with the iterative nature of machine learning projects, mandates automated CI/CD pipelines in which any new environment, such as staging or production, is reproduced automatically with minimal effort. There are readily available CI/CD pipelines from different cloud and machine learning vendors, but changing from one vendor to another demands revamping your entire pipeline.
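The staging-to-production hand-off usually reduces to a promotion gate that compares staging metrics against the live model before the pipeline promotes anything. A minimal sketch, with illustrative thresholds and a hypothetical helper name:

```python
def should_promote(staging_metrics, production_metrics,
                   min_accuracy=0.85, max_regression=0.02):
    """Gate a staging model before it replaces the production one:
    it must clear an absolute accuracy bar and must not regress more
    than max_regression against the currently live model."""
    acc = staging_metrics["accuracy"]
    # Absolute quality bar, independent of what is live today
    if acc < min_accuracy:
        return False
    # Relative bar: allow a small tolerance against the production model
    return acc >= production_metrics["accuracy"] - max_regression
```

A CI/CD pipeline would run this gate after staging evaluation and trigger the production deployment step only when it returns True.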

Multi-Cloud Support

Enterprises need the ability to store models in any cloud or in-house registry and deploy them to any cloud-agnostic environment without re-engineering their pipelines. An integrated MLOps solution should deploy to any cloud, on-prem, or hybrid infrastructure of your choice, accounting for the cost of managing the computing resources and monitoring the performance of your machine learning models. Kubernetes-based deployment with reproducible CI/CD pipelines makes it easier not only to onboard any new environment but also to onboard a new team, with the machine learning models and the infrastructure needed to train them and run inference.
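Registry-agnostic storage can be expressed as one interface with swappable backends, so pipelines never change when the storage target does. A minimal sketch (the classes and names are hypothetical); an S3, GCS, or Azure Blob backend would implement the same two methods:

```python
import abc
import pickle

class ModelRegistry(abc.ABC):
    """One storage interface; cloud-specific backends implement save/load,
    keeping the deployment pipeline vendor-neutral."""

    @abc.abstractmethod
    def save(self, name: str, model) -> None: ...

    @abc.abstractmethod
    def load(self, name: str): ...

class InMemoryRegistry(ModelRegistry):
    """Toy in-process backend, standing in for S3/GCS/Azure Blob."""

    def __init__(self):
        self._blobs = {}

    def save(self, name, model):
        # Serialize the model the same way a remote backend would upload it
        self._blobs[name] = pickle.dumps(model)

    def load(self, name):
        return pickle.loads(self._blobs[name])
```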

Automatic Scaling & Complex Deployments

Deployment pipelines should be capable of provisioning different resource types (CPU, GPU, or TPU) with auto-scaling capabilities. They should also assist in complex model deployment scenarios such as A/B deployments, shadow deployments, and phased rollouts, which have now become common patterns in data-driven applications, all while providing real-time insights and recommendations.
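A/B and shadow deployments share one mechanism: a router that splits live traffic across variants by weight and mirrors requests to a shadow variant whose responses are discarded. A minimal sketch; the TrafficSplitter class is illustrative, not a real product API:

```python
import random

class TrafficSplitter:
    """Route each request to a live variant by weight; a shadow variant
    receives a copy of the request but its response is discarded."""

    def __init__(self, variants, shadow=None, seed=None):
        # variants: {name: (weight, handler)}; shadow: (name, handler)
        self._names = list(variants)
        self._weights = [variants[n][0] for n in self._names]
        self._handlers = {n: variants[n][1] for n in self._names}
        self._shadow = shadow
        self._rng = random.Random(seed)

    def handle(self, request):
        # Weighted random choice implements the A/B (or phased-rollout) split
        chosen = self._rng.choices(self._names, weights=self._weights, k=1)[0]
        if self._shadow is not None:
            self._shadow[1](request)  # fire the shadow copy, ignore its result
        return chosen, self._handlers[chosen](request)
```

Shifting a phased rollout from 5% to 50% is then just a change of weights, with no change to the serving code.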

Beyond Monitoring

An end-to-end MLOps solution necessitates an automated monitoring service that inspects the model's health score along with data drift, usage predictions, and resource utilization.

The performance of any machine learning model is affected by any change in data, code, or the model's behavior. For instance, consider a machine learning model approving credit applications. Previously the model required only a FICO score and income, but it was later enhanced to use the customer's digital footprint, expanding the pool of potential borrowers. This mandates a code change along with retraining the model on new training datasets with additional features. The CI/CD pipelines should automatically detect these changes, retrain, and deploy the trained model with minimal effort. Monitoring should not only capture data drift but also monitor and auto-scale computing resources for better cost management. Machine learning models without diversified datasets tend to be biased, and enterprises with biased models lose their reputation, increasing the customer churn rate.
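A simple form of the drift check described here compares the mean of a live serving batch against the training baseline and flags drift when the shift exceeds a few standard errors. A minimal sketch, with an assumed threshold of 3 standard errors and a hypothetical helper name:

```python
import statistics

def mean_shift_drift(baseline, live, threshold=3.0):
    """Flag drift when the live batch mean moves more than `threshold`
    standard errors away from the training baseline mean."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    # Standard error of the live batch mean under the baseline distribution
    se = sigma / (len(live) ** 0.5)
    z = abs(statistics.mean(live) - mu) / se
    return z > threshold, z
```

In practice a monitor would run this per feature on every serving window and raise a retraining trigger when any feature drifts.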

AIQ — Manage ML with Confidence

An MLOps solution should go beyond deployment and monitoring, with the ability to observe and act on insights and with self-explaining capabilities that justify why a model behaved in a certain manner.

To learn more about our AIQ tool:

AIQ Workbench

AIQ Deploy

AIQ Monitor

Follow us to learn more about how we increase productivity and reduce cost and effort with our end-to-end MLOps solution.
