MLOps in 2021: The Pillar for a Seamless Machine Learning Lifecycle

What is MLOps?

MLOps is a relatively new term for the operational work needed to move machine learning projects from research mode to production. While Software Engineering relies on DevOps to operationalize software applications, MLOps encompasses the processes and tools needed to manage the end-to-end Machine Learning lifecycle.

Software Development + DevOps => Software Application

Machine Learning + MLOps => Machine Learning Projects

Machine Learning Lifecycle

In Machine Learning, a model learns a hypothesis about the relationships among independent (input) variables in order to predict target (output) variables.

Machine Learning projects involve different roles and responsibilities: the Data Engineering team collects, processes, and transforms data; Data Scientists experiment with algorithms and datasets; and the MLOps team focuses on moving the trained models to production.

Machine Learning Lifecycle represents the complete end-to-end lifecycle of machine learning projects from research mode to production.

Scope of MLOps Solution in ML Lifecycle

The scope of an end-to-end MLOps solution varies with an organization's AI adoption maturity. Teams just venturing into Machine Learning need to onboard their Data Science team quickly, for example by automatically spinning up notebook servers on pre-built infrastructure.

Enterprises already experimenting with models look for an automated deployment solution that turns models into APIs, with reusable deployment templates for patterns like A/B testing. As the number of models in production increases, monitoring model performance, data drift, bias, and compliance becomes mandatory.

MLOps Challenges

Model Development Phase

During the development phase, Data Scientists spin up notebook servers to experiment with different algorithms and tune hyperparameters in search of the most accurate model.

Challenges

  • Collaborative Environment: Data Scientists should work closely with Data Engineers and business stakeholders once a project is strategized. But they often miss key details because no collaborative environment is available. A collaborative environment also speeds up onboarding for new Data Scientists by providing visibility into existing Machine Learning models.
  • Version Control Experiments: Data Scientists should be able to log metrics and parameters automatically and compare metrics across versions to choose the best-fit model.
  • Manage Model Artifacts: Even though repositories like GitHub hold model code, there is a need for centralized repositories to store model artifacts. Storing trained models, training and test datasets, and the libraries needed to run the model makes it easy to set up new environments, such as staging, in very little time.
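
To make the experiment versioning idea concrete, here is a minimal sketch in plain Python. The `Run` and `ExperimentLog` names are hypothetical and invented for illustration; they are not the API of any tool discussed in this article, and a real tracker would persist runs rather than keep them in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Run:
    """One experiment run: its version label, parameters, and metrics."""
    version: str
    params: Dict[str, float]
    metrics: Dict[str, float] = field(default_factory=dict)


class ExperimentLog:
    """Hypothetical in-memory tracker, for illustration only."""

    def __init__(self) -> None:
        self.runs: List[Run] = []

    def log_run(self, version: str, params: Dict[str, float],
                metrics: Dict[str, float]) -> Run:
        # Copy the dicts so later edits by the caller do not alter the log.
        run = Run(version, dict(params), dict(metrics))
        self.runs.append(run)
        return run

    def best_run(self, metric: str, higher_is_better: bool = True) -> Run:
        # Only runs that actually logged the metric can be compared.
        candidates = [r for r in self.runs if metric in r.metrics]
        if not candidates:
            raise ValueError(f"no run logged metric {metric!r}")
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda r: r.metrics[metric])


log = ExperimentLog()
log.log_run("v1", {"max_depth": 3}, {"accuracy": 0.81})
log.log_run("v2", {"max_depth": 5}, {"accuracy": 0.86})
log.log_run("v3", {"max_depth": 8}, {"accuracy": 0.84})

best = log.best_run("accuracy")
print(best.version, best.params)
```

Keeping parameters and metrics together per version is what makes the comparison step trivial; the same record could also carry pointers to stored artifacts.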

Model Deployment Phase

After the experimentation phase, models are deployed to their designated infrastructure, whether on-prem, cloud, or hybrid.

Challenges

  • Automated CI / CD Pipelines: Data is not static; it changes over time. Such changes cause data drift, which demands retraining of models. Manually retraining models whenever the data or the model drifts is laborious and time-consuming.
  • Complex Deployments: As AI adoption grows, so does the adoption of various deployment strategies like A/B testing, graph, and canary deployments. The ability to create deployment workflows via Web UI, CLI, or SDK becomes crucial for the success of machine learning projects.
  • Re-usable Deployment Templates: Deployment pipelines built without reusability in mind make models irrelevant over time.
  • Model Registry & Governance: Models deployed without proper governance over access rights and approvals make it harder to improve the deployment process and increase the likelihood of deployment failures.
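
As an illustration of the A/B and canary strategies mentioned above, the following sketch routes a fixed share of traffic to a candidate model. The function name and the "stable"/"candidate" labels are assumptions made for this example; production routers in the platforms discussed later work differently in detail.

```python
import hashlib


def assign_variant(request_id: str, canary_percent: int) -> str:
    """Route a stable share of traffic to the candidate model.

    Hashing the request (or user) id keeps the assignment deterministic,
    so the same caller always reaches the same model version.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in 0..99
    return "candidate" if bucket < canary_percent else "stable"


# Roughly 10% of callers should land on the candidate model.
ids = [f"user-{i}" for i in range(10_000)]
share = sum(assign_variant(i, 10) == "candidate" for i in ids) / len(ids)
print(f"candidate share: {share:.3f}")
```

Raising `canary_percent` step by step gives a phased rollout; setting it to 50 gives a simple A/B split.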

Model Management Phase

The Model Management phase kicks off with the monitoring of models once they are deployed to production.

Challenges

  • Black Box Models: Models without visibility into their quality and performance lose credibility, wasting the time and effort poured into building them.
  • Monitor & Alert: After deployment, models need to be continuously monitored for performance degradation. Monitoring without alerting the appropriate team members to fix the issue not only increases the mean time to resolution but also reduces the credibility and success of machine learning projects.
  • Privacy & Bias Preserved Inference: Privacy and bias governance should be the foundation for any machine learning model. Models built without an emphasis on privacy and bias lose their reliability and prominence.
  • Explainable AI: Data Scientists spend considerable time explaining why models behaved in a certain way. This should be automated by including explainability metrics while building the models.
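
One common way to combine monitoring with alerting is to compare live feature values against a training baseline. The sketch below uses the Population Stability Index (PSI) for that comparison. The 0.2 alert threshold is a widely quoted rule of thumb and is an assumption here, as are the function names; this is an illustration, not how any specific product computes drift.

```python
import math
from typing import Dict, List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a baseline and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def check_drift(baseline: Sequence[float], live: Sequence[float],
                threshold: float = 0.2) -> Dict[str, object]:
    """Return the PSI score and whether it should raise an alert."""
    score = psi(baseline, live)
    return {"psi": score, "alert": score > threshold}


baseline = [i / 100 for i in range(1000)]     # spread evenly over 0..10
same = [i / 100 for i in range(1000)]         # identical distribution
shifted = [5 + i / 200 for i in range(1000)]  # concentrated in 5..10

print(check_drift(baseline, same))
print(check_drift(baseline, shifted))
```

In practice the `alert` flag would feed a notification channel so that the right team member hears about the drift instead of discovering it on a dashboard later.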

MLOps Solutions

Multiple solutions have emerged to address these MLOps challenges and bring a seamless end-to-end Machine Learning Lifecycle closer to reality.

MLflow

Managed MLflow from Databricks is built on top of MLflow, an open-source platform for managing Machine Learning projects end-to-end. MLflow consists of components such as Experiment Tracking, Model Management, and Model Deployment.

Experiment Tracking enables Data Scientists to track model parameters and metrics along with version management. Model Management provides a way to collaborate on and share models, integrated with approval workflows. Model Deployment can perform batch inference on Apache Spark™ or serve the model as a REST API using Docker containers.

MLflow also includes MLflow Projects, which packages code so it can be run reproducibly from a Git repository or a Conda environment, and the MLflow Model Registry, added later for better model governance.

MLflow has seen wide adoption among Data Scientists for tracking and version-controlling experiments, but it is less popular when it comes to deployment.

Azure ML

Azure ML is an enterprise-ready product that enables Data Scientists to deploy models faster. While experiment tracking and hyperparameter tuning are easier with MLflow, deployment is more efficient with Azure ML.

Azure ML provides both an SDK and a web interface that Data Scientists can use to create deployment workflows.

Azure ML can deploy models to local compute, Azure Container Instances, or Azure Kubernetes Service using either the CLI or the Python SDK.

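from azureml.core.model import Model
from azureml.core.webservice import LocalWebservice

# Assumes `ws` (Workspace), `model` (a registered Model), and
# `inference_config` (InferenceConfig) were created earlier.
deployment_config = LocalWebservice.deploy_configuration(port=8890)

service = Model.deploy(ws, "myservice", [model], inference_config, deployment_config)

service.wait_for_deployment(show_output=True)

print(service.state)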

Every deployed model is exposed through a REST endpoint for inference.
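
Calling such an endpoint amounts to an HTTP POST with a JSON body. The sketch below only builds the request, using the Python standard library. The `{"data": ...}` payload shape, the `/score` path, and the bearer-token header are assumptions: the actual input format is defined by the model's entry script, and the actual URL is reported by the deployed service (for example through `service.scoring_uri`).

```python
import json
import urllib.request
from typing import List, Optional


def build_scoring_request(scoring_uri: str, rows: List[List[float]],
                          api_key: Optional[str] = None) -> urllib.request.Request:
    """Prepare (but do not send) a POST request for a model endpoint."""
    # Assumed payload shape; adjust to whatever the entry script expects.
    body = json.dumps({"data": rows}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when the endpoint has key-based authentication enabled.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(scoring_uri, data=body, headers=headers, method="POST")


# Port 8890 matches the local deployment above; the path is an assumption.
request = build_scoring_request("http://localhost:8890/score", [[5.1, 3.5, 1.4, 0.2]])
print(request.get_method(), request.full_url)

# To send it once the service is running:
#   with urllib.request.urlopen(request) as response:
#       print(json.loads(response.read()))
```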

Additionally, continuous deployment can be enabled simply by turning on the trigger flag.

Microsoft released Azure ML Studio around 2015 and Azure ML Services in 2018. While ML Studio enabled building models by drag-and-drop, ML Services offered a much richer experience with AutoML, GPU support, hyperparameter tuning, and a Kubernetes cluster that auto-scales based on load.

SageMaker

Amazon SageMaker lets users train models by creating a notebook instance from the SageMaker console, given a proper IAM role and S3 bucket access. One can use a built-in algorithm or sell algorithms and models in AWS Marketplace. SageMaker lets you deploy the model on its model hosting service with an HTTPS endpoint for inference. Compute resources are scaled automatically based on load. Amazon SageMaker Model Monitor watches models for data drift or anomalies using Amazon CloudWatch metrics. Amazon provides modular services for managing ML projects: using AWS Step Functions, one can automate retraining and the CI / CD process.

AWS Step Functions links AWS services such as AWS Lambda, AWS Fargate, and Amazon SageMaker to build workflows of your choice for continuous deployment of ML models.

AWS CodePipeline, a CI / CD service, combined with AWS Step Functions for workflow-driven actions, provides a powerful automation pipeline.
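
The shape of such a workflow can be sketched in a vendor-neutral way. The example below is plain Python, not Step Functions or CodePipeline syntax. The step functions, the state keys, and the hard-coded accuracy values are all hypothetical placeholders. It shows a retrain, evaluate, and deploy chain in which deployment is gated on the candidate beating the production model.

```python
from typing import Callable, Dict, List

Step = Callable[[Dict], Dict]


def train(state: Dict) -> Dict:
    # Placeholder for a real training job; returns a made-up candidate metric.
    return {**state, "candidate_accuracy": 0.91}


def evaluate(state: Dict) -> Dict:
    # Gate: approve only if the candidate beats the production model.
    approved = state["candidate_accuracy"] > state["production_accuracy"]
    return {**state, "approved": approved}


def deploy(state: Dict) -> Dict:
    if not state["approved"]:
        return {**state, "deployed": False}
    # Promotion: the candidate becomes the new production baseline.
    return {**state, "deployed": True,
            "production_accuracy": state["candidate_accuracy"]}


def run_workflow(steps: List[Step], state: Dict) -> Dict:
    """Run each step in order, passing the state dictionary along."""
    for step in steps:
        state = step(state)
    return state


result = run_workflow([train, evaluate, deploy], {"production_accuracy": 0.88})
print(result["deployed"], result["production_accuracy"])
```

A managed workflow service adds what this sketch leaves out: retries, timeouts, parallel branches, and an audit trail of each state transition.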

Kubeflow

Both Data Scientists and ML Engineers can build Kubeflow Pipelines using either a notebook or Python code, running on Kubernetes containers. There are different ways to build a Kubeflow pipeline:

1. Build a Kubeflow pipeline by creating a Kubernetes cluster via the Google Cloud CLI

2. Build a Kubeflow pipeline from Google Cloud Platform

3. Build a Kubeflow pipeline from an existing Kubernetes cluster and Python code

Once the Kubeflow pipeline is built, data scientists and ML engineers can run the ML workflow either directly from a Jupyter notebook or by uploading it from Google Cloud Storage. There are also tools on the market that automate Kubeflow pipelines, such as Kale, the Kubeflow Automated Pipelines Engine.

The Kubeflow pipeline automates the ML workflow: it supports Continuous Training (CT) whenever a change in the data calls for retraining, and it automates the CI / CD process whenever a code change triggers redeployment of ML models.
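
The distinction between the two triggers can be shown with a small sketch. It is plain Python rather than Kubeflow code, and the pipeline names and byte-string inputs are hypothetical stand-ins for a real dataset snapshot and source tree. The idea is to fingerprint data and code separately and compare against the last run.

```python
import hashlib
from typing import Dict, List


def fingerprint(content: bytes) -> str:
    """Stable content hash used to detect changes between runs."""
    return hashlib.sha256(content).hexdigest()


def pipelines_to_trigger(previous: Dict[str, str], data: bytes, code: bytes) -> List[str]:
    """Compare current fingerprints with those recorded at the last run."""
    triggered = []
    if fingerprint(data) != previous.get("data"):
        triggered.append("continuous-training")  # data changed: retrain
    if fingerprint(code) != previous.get("code"):
        triggered.append("ci-cd")                # code changed: rebuild and redeploy
    return triggered


last_run = {"data": fingerprint(b"rows-v1"), "code": fingerprint(b"model.py-v1")}

print(pipelines_to_trigger(last_run, b"rows-v2", b"model.py-v1"))
print(pipelines_to_trigger(last_run, b"rows-v1", b"model.py-v2"))
```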

Limitations of Enterprise MLOps Solutions

Cost Intensive

Even though SageMaker provides the flexibility to customize Machine Learning models, the lack of interoperability to mix and match any cloud vendor's Machine Learning services burdens enterprises that adopt a specific ML platform. Every ML vendor provides a Basic and an Enterprise version, and the cost and services offered vary with the selection. As enterprises advance in Machine Learning, the growing number of datasets drives up the cost of processing and storage capacity. The lack of relevant documentation and training, along with the increasing cost of managing Machine Learning projects, poses a threat to enterprises and leads to fewer ML models in production.

Vendor Lock-in

ML platforms from Google, Amazon, and Microsoft run only on their own clouds or on-premises. Porting from one ML platform to another is a tedious task, as it involves building the ML platform from the ground up and climbing a steep learning curve. Vendor lock-in is real: it restricts enterprises from adopting a multi-cloud strategy and deprives them of products and services from a mix of vendors.

Steep Learning Curve

Different cloud vendors use their own tools and technologies to build ML platforms. Moreover, the understanding of "end-to-end" varies: some platforms have streamlined the build and deployment of ML models, while others concentrate on data engineering and automating the ML build process. What the cloud vendors' ML platforms have in common is that they are built on top of Kubernetes clusters. But the commonality ends there. Google's Kubeflow Pipelines provides the flexibility to build ML pipelines either from a Jupyter notebook or from an existing Kubernetes cluster using the Python SDK or CLI. Amazon's SageMaker uses Step Functions and CodePipeline to automate the CI / CD process. Microsoft's Azure provides two different pipelines, one for building ML workflows and another for building CI / CD pipelines. The absence of an integrated end-to-end ML platform and the variety of options from different cloud vendors not only demand a steep learning curve but also make porting from one platform to another nearly impossible.

Lack of Documentation

Every cloud vendor provides a detailed overview and steps to follow to build ML pipelines. There are even articles from renowned data scientists on building end-to-end ML platforms. But because Artificial Intelligence is evolving, with constant changes to the ML platforms, the published steps for building ML pipelines are often error-prone. Some ML platforms' Kubernetes engines still use an older version of Kubernetes, so the recent Kubernetes documentation does not apply to them. To add to the confusion, there is no documentation addressing the version mismatch. The heterogeneity of tools and technologies, along with the many cloud vendors and open-source communities building ML platforms, has greatly increased the volume of documentation. That growth not only steepens the learning curve but also leads to confusion.

How Does Predera AIQ Solve MLOps Challenges?

Predera introduces AIQ, an automated end-to-end MLOps solution for machine learning teams to drastically cut down on the challenges faced today in building, deploying, and managing machine learning models.

AIQ provides a command center view of all your ML models in one place to improve the visibility and decision-making for leadership.

Build, Deploy, and Monitor Machine Learning Projects with minimum effort and low cost by automating end-to-end MLOps solutions.

AIQ WorkBench: ML Solution for Data Scientists

Faster Onboarding

Onboard any Data Science project in minutes with AIQ Workbench. Your data scientists can kick-start model building by spinning up notebook servers, connecting to Git, and tracking modeling activities within the team.

Version Control

AIQ Workbench provides the ability to version control experiments with no coding effort.

Automatic Logging

Most other ML platforms require additional code to log metadata around your features and models. This often clutters the model code base with too many log statements, making it less readable and manageable (‘code spaghetti’). With just 2 lines of code, AIQ empowers ML models to log all the required features. Never lose another AI/ML experiment, artifact, metric, or lineage record. We collect it all seamlessly, agnostic to the programming language.

Foster Collaboration

AIQ Workbench fosters team collaboration with visibility into Machine Learning projects not just for Data Scientists but also for business leaders to make affirmative decisions based on ML outcomes.

AIQ Deploy: Automated MLOps for ML Engineers

Single-click Deploy

Single-click deployment to turn Machine Learning Models into scalable APIs? Use AIQ Deploy.

Re-usable CI/CD Template

Deploy Machine Learning Models to Production within minutes with reusable CI/CD templates and automatic scaling of computing resources.

Automatic Scaling & Complex Deployments

Deployment pipelines should be capable of provisioning different resource types (CPU, GPU, or TPU) with auto-scaling capabilities. They should also assist in complex model deployment scenarios such as A/B deployments, shadow deployments, and phased rollouts, which have become a common pattern in data-driven applications, all while providing real-time insights and recommendations.
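
To give a sense of what load-based auto-scaling computes, the sketch below applies the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas equal the current replicas scaled by the ratio of the observed metric to its target, rounded up. The function name and the default minimum and maximum bounds are assumptions for illustration; this is not a description of how AIQ Deploy scales.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Proportional scaling rule, clamped to the allowed replica range."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 4 replicas at 90% average CPU with a 60% target: scale out.
print(desired_replicas(4, current_metric=90, target_metric=60))
# 4 replicas at 30% average CPU with a 60% target: scale in.
print(desired_replicas(4, current_metric=30, target_metric=60))
```

The same rule works for any metric that falls as replicas are added, such as requests per second per replica or queue depth per worker.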

Multi-Cloud Support

Enterprises need the ability to store models in any cloud or in-house registry and to deploy models to any environment in a cloud-agnostic way, without having to re-engineer their pipelines. An integrated MLOps solution should be able to deploy to any cloud, on-prem, or hybrid infrastructure of your choice, while determining the cost of managing the computing resources and monitoring the performance of your machine learning models. Kubernetes-based deployment with reproducible CI / CD pipelines makes it easier not only to onboard a new environment but also to onboard a new team, with the machine learning models and the infrastructure needed to train them and run inference.

And when it’s time to switch to your compute environment of choice, simply log into AIQ Deploy and redirect your models to it. No more vendor lock-in!

AIQ Monitor: Unified Monitoring Dashboard for ML Projects

Unified Dashboard

AIQ Monitor (in beta) provides a unified dashboard that fosters collaboration among Data Scientists, the MLOps team, and business stakeholders to collectively monitor model performance and resource consumption, reducing cost while improving the efficiency of machine learning models.

Monitor performance, bias, data drift

Monitor model performance and resource utilization along with bias and data drift in real-time in a reliable, scalable, and explainable way so your Data Scientists spend less time debugging.

Follow us to learn more about our MLOps journey building and managing Machine Learning models for different industries like Retail, Healthcare, Pharma, and Fintech.
