About the author: Varun Chitale is an experienced Machine Learning Engineering Manager with more than six years of experience. He has worked on ML applications of all kinds, with a particular focus on NLP and recommendation systems.
How We Explored And Selected MLOps Technologies
At Sennder, we're proud to be at the forefront of machine learning and AI technology. Our team of skilled data scientists and engineers works tirelessly to develop cutting-edge models that deliver value to our customers and partners.
But as any seasoned data science practitioner knows, building a machine learning model is only half the battle. The real challenge lies in productionizing it: deploying it at scale, ensuring it performs well, and keeping it up to date as data and business requirements evolve.
That's why at Sennder, we've invested heavily in the development of a robust machine learning engineering platform. Our platform includes several key components that are critical to productionizing an ML model, including an ML pipeline orchestrator, ML metadata and artefact tracking, a model registry, and tools for model training and serving.
To make sure we chose the right tool for each of these components, we followed the Markdown Any Decision Records (MADR) process. This helped us make informed decisions that were well documented and transparent.
MADR (Markdown Any Decision Records) is a lightweight process for documenting the decisions made during a project or initiative. It involves creating a simple Markdown file that outlines each decision, its rationale, and any relevant context or background. MADR files can be easily shared and reviewed, making it easy for team members to stay informed and aligned going forward. Overall, MADR helps teams make better decisions and promotes transparency and collaboration.
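For illustration, here is a minimal sketch of what one of these records might look like, using the orchestrator decision this post covers (the wording below is a simplified, hypothetical example, not one of our actual records):

```markdown
# Choose an ML pipeline orchestrator

## Context and Problem Statement

We need a tool to orchestrate our training and batch-inference pipelines on Kubernetes.

## Considered Options

* Airflow
* Flyte
* Kubeflow Pipelines

## Decision Outcome

Chosen option: "Flyte", because it offers native Kubernetes support, pipeline
versioning, and good dependency isolation (see the comparison table below).
```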
In this blog post, we'll dive deeper into each of these components, explore the decision-making process behind our tooling choices, and showcase some of the amazing work our team has done with this technology. We hope this post inspires other data science practitioners and potential team members to join us in our mission to push the boundaries of what's possible with machine learning.
The Components
ML Pipelines Orchestrator
An ML pipelines orchestrator is a tool or platform that manages the flow of data and computations during the machine learning process. Essentially, it helps you automate the steps involved in training and deploying an ML model, so that you can more easily manage the complexity of these processes. A common example of an ML pipelines orchestrator is Apache Airflow, which allows you to create DAGs (Directed Acyclic Graphs) to define the sequence of tasks that need to be executed.
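To make this concrete, here is a minimal sketch of an Airflow DAG with two tasks (the task names and bodies are illustrative placeholders, not our production code):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def preprocess():
    # Placeholder: load and clean the training data.
    print("preprocessing data")


def train():
    # Placeholder: fit a model on the preprocessed data.
    print("training model")


with DAG(
    dag_id="train_model",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    preprocess_task = PythonOperator(task_id="preprocess", python_callable=preprocess)
    train_task = PythonOperator(task_id="train", python_callable=train)

    # The DAG encodes the dependency: preprocessing must finish before training.
    preprocess_task >> train_task
```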
Thanks to MADR, we carefully evaluated and documented the candidate ML pipeline orchestrators, which made it possible to compile this comprehensive comparison for your reference.
Features | Airflow | Flyte | Metaflow | Kubeflow Pipelines | Prefect | Kedro | Databricks Workflow |
---|---|---|---|---|---|---|---|
Environment & Dependency Isolation | Poor | Good | Good | Good | Poor | Poor | Good |
Pipeline Versioning | No | Yes | Yes | Yes | Yes | No | No |
Data flow between tasks | No | Yes | Yes | No | Yes | No | No |
Task level resources | Yes (Kubernetes) | Yes | Yes | Yes | No | No | No |
Graphical User Interface (GUI) | Yes | Yes | Yes (Not mature) | Yes | Yes | Yes (Not mature) | Yes |
Credentials Management | Poor | Good | Good | Good | Poor | Poor | Good |
Ecosystem - Plugins | Great | Good | Poor | Good | Good | Poor | Good |
Kubernetes Support | Yes (Not Native) | Yes (Native) | Yes (Not Native) | Yes (Native) | Yes | Yes (Not Native) | Yes (Vendor Managed) |
Documentation* | Great | Good | Poor | Poor | Poor | Poor | Good |
Release Version | 2.4.3 | 1.2.1 | 2.7.14 | 1.6 | 2.6.9 | 0.18.3 | 11.3 |
GitHub Stars | 28.3k | 2.9k | 6.2k | 12.1k | 10.6k | 7.8k | - |
License | Apache | Apache | Apache | Apache | Apache | Apache | Proprietary |
Cloud/SaaS | Astronomer | UnionML | Outerbounds | - | PrefectHQ | - | Databricks |
Additional Notes | Rich ecosystem for data engineering | LF AI & Data (graduate project) | Not mature as a product offering | Complex deployment | Focus on data engineering | Not mature | Mature paid service; vendor lock-in |
Model Training
Model training is the process of training your machine learning models on data to improve their performance. This involves selecting the appropriate algorithms and hyperparameters, preparing the data, and running the training process itself. A common example of a tool for model training is scikit-learn, which provides a wide range of algorithms and tools for preprocessing data and running the training process.
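As a quick illustration, here is a toy scikit-learn training run; the dataset and hyperparameters are arbitrary, chosen only to show where algorithm selection and hyperparameter choices fit in:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Prepare the data: a toy dataset split into train and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Select an algorithm and its hyperparameters, then run the training process.
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)

print(f"Test accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```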
Features | Ray | Apache Spark | Dask | Horovod | BigDL | Distributed TensorFlow | Mesh TensorFlow | GPipe | PyTorch | Microsoft DeepSpeed |
---|---|---|---|---|---|---|---|---|---|---|
Type of training | Data + model parallelism | Data parallelism | Data + model parallelism | Data parallelism | Pipeline parallelism | Data parallelism | Data + model parallelism | Pipeline parallelism | Data + model parallelism | Data + model + pipeline parallelism |
Integrations | Yes | Yes | Yes | Yes | Yes | Yes | Unknown | None | Yes | Yes |
Scalable training | Yes, via Ray Train | Yes: standalone via MLlib, cluster via MapReduce | Yes, via parallelising NumPy, scikit-learn, pandas | Yes, using different backends: MPI, Gloo, NCCL, oneCCL | Yes | Yes | Yes | Yes | Yes (e.g. via FairScale) | Yes |
Scalable hyperparameter tuning | Yes, via Ray Tune | Yes, via MLlib | Yes: built in via HyperbandSearchCV | Yes, via Ray Tune | Yes, via Orca | Yes, via external libs | No | n/a | Yes, via Ray Tune | Yes, built in + LAMB |
Hardware/software support | CPU/GPU/TPU | CPU, GPU via RAPIDS Accelerator | CPU, GPU (no native support) | CPU/GPU | CPU | CPU/GPU/TPU | CPU/GPU/TPU | CPU/GPU/TPU | CPU/GPU/TPU | CPU/GPU |
License | Apache 2.0 | Apache 2.0 | BSD 3-Clause | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 | Open source | Modified BSD | MIT |
Documentation* | Great | Great | Great | Great | Good | OK | OK | Bad | Great | Good |
GitHub Stars | 23.9k | 34.9k | 10.7k | 13k | 4.1k | n/a | 1.4k | 2.7k (part of Lingvo) | 62.3k | 8.6k |
Table 2: Platforms For Distributed Training
Features | AWS Sagemaker | Google Cloud Platform | Microsoft Azure |
---|---|---|---|
Storage | S3, Redshift, RDS | GCS, BigQuery | Blob, Data Lake |
ETL | EMR, Glue, Sagemaker with PySpark | Cloud Dataflow, BigQuery | Azure Databricks, Synapse, Kusto |
Visualisation | QuickSight | Data Studio | Power BI, Cognitive Services |
Exploration | Athena, Sagemaker Autopilot | BigQuery, AutoML Tables | Azure ML Studio, Azure Databricks |
Distributed training | Yes | Yes | Yes |
Model versioning | Yes | Yes | Yes |
Experiment tracking | Yes | Yes | Yes |
Error analysis | Sagemaker Debugger | AutoML Tables with BigQuery | Azure ML |
1-click deployment | Yes | Yes | Yes |
Batch prediction | Yes | Yes | Yes |
Native pipelines for MLOps | No | No | Yes |
SSO admin | Yes | Yes | Yes |
Scaling options | Auto Scaling | Autoscaler | Azure Autoscale |
Analytics | Amazon Kinesis | Cloud Dataflow | Azure Stream Analytics |
ML Metadata And Artefacts Tracking
ML metadata and artefacts tracking is the process of capturing and storing information about the data and models used in the machine learning process. This helps ensure reproducibility and auditability, as well as facilitating collaboration and experimentation. A common example of an ML metadata and artefacts tracking tool is MLflow, which allows you to log information about the data, models, and experiments you run during the machine learning process.
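For example, a minimal MLflow tracking snippet might look like this (the parameter and metric names are illustrative):

```python
import mlflow

# Each run records the parameters, metrics and artefacts of one experiment,
# so results stay reproducible and auditable.
with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("test_accuracy", 0.93)
    # Any local file can be attached as an artefact (assumes the file exists).
    mlflow.log_artifact("confusion_matrix.png")
```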
You know the drill: here's the compiled data for the tools we evaluated for this task.
Features | MLflow | DVC | Neptune AI | Valohai | Amazon Sagemaker | TensorFlow ML Metadata | Weights & Biases | ClearML | Polyaxon | Comet | Microsoft Azure |
---|---|---|---|---|---|---|---|---|---|---|---|
Experiment tracking | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
Model registry | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No (uses MLflow) |
Data versioning | No (usually combined with DVC) | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | Yes |
Lineage tracking | No | No | Not sure | Yes | Yes | Yes | Yes | No | No | No | No (uses Microsoft Purview) |
CI/CD/CT integration | Yes | Via separate tool (CML) | Yes (GitHub Actions or Docker) | Yes (GitLab, Docker, GitHub) | Yes | Yes | Yes (early access) | Not sure | GitHub Actions, Jenkins, Argo, Airflow, Kafka, Zapier | Not sure | GitHub Actions |
Integration with Python | Python | Python | Python (most ML libraries), Jupyter notebooks | Python, Jupyter notebooks | Python (most ML libraries) | Python (most ML libraries) | Python (most ML libraries) | Python (most ML libraries) | Python (most ML libraries) | Python (most ML libraries) | Python |
Integrations (frameworks/tools) | Docker, Databricks | N/A | Kedro, ZenML, Sacred, Optuna, Arize, Sagemaker, Google Colab, Deepnote, etc. | Integrations | Docker, mostly with Sagemaker features | Kubeflow | Sagemaker, Kubeflow, Databricks | Sagemaker, Kubeflow, Kubernetes, Optuna, etc. | A lot of services | A lot of services | Other Azure services |
Organising and searching experiments, models and related metadata | Yes | Yes | Yes (via graphical interface) | Yes (via graphical interface) | Yes | Yes (via SQL queries) | Yes | Yes | Yes | Yes | Yes |
UI/Dashboard | Yes | Yes (via Iterative Studio) | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes |
API | Yes (Python, R, Java, REST) | Yes | Yes | Yes | Yes | No | No | Yes | Yes | Yes | Yes |
Part of larger ecosystem | Yes | No | No | No | Yes | ~ (Kubeflow uses it) | No | No | Yes | No | Yes |
GitHub Stars | 13.4k | 10.9k | N/A | N/A | N/A | 507 | N/A | 4k | 3.2k | N/A | N/A |
License | Apache | Apache | Proprietary | Proprietary | Proprietary | Apache | Proprietary | Apache | Apache | Proprietary | Proprietary |
Documentation* | Great | Good | Good | Great | Great | Ok | Good | Ok | Good | Good | Good |
Model Registry
A model registry is a tool or platform that allows you to store and manage different versions of your machine learning models. This helps you keep track of the changes made to your models over time, and makes it easier to deploy and manage them in production. A common example is the MLflow Model Registry, which provides a centralized store for models along with versioning, stage transitions, and annotations.
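Here is a minimal sketch of registering and promoting a model with the MLflow registry, assuming a model was logged in an earlier run (the run ID and model name are hypothetical):

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register the model artefact from a previous run under a named registry entry;
# "<run_id>" is a placeholder for a real MLflow run ID.
result = mlflow.register_model("runs:/<run_id>/model", "demand-forecaster")

# Promote that specific version through the registry's lifecycle stages.
client = MlflowClient()
client.transition_model_version_stage(
    name="demand-forecaster", version=result.version, stage="Production"
)
```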
Same drill: here's the comparison.
Features | MLflow | DVC | Valohai | Verta AI | Sagemaker | Dataiku | DataRobot | Azure ML | Comet | Weights & Biases | ModelDB | H2O MLOps | Neptune AI | Yatai |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Deployment type | Managed (self-hosted), fully managed (via Databricks) | Managed (self-hosted) | Fully managed | Fully managed, enterprise deployment (on-premise or VPC) | Fully managed | Self-hosted, fully managed (SaaS), on-premise/on-cloud deployments | Self-hosted, fully managed, on-premise/on-cloud deployments | Hosted on Azure ML Workspace, available through SDK, on-premise/on-cloud deployments | Fully managed, on-premise/on-cloud deployment | Fully managed, on-premise/on-cloud deployment | Self-hosted, fully managed | Fully managed | Managed (self-hosted), fully managed | Managed (self-hosted), on-cloud deployment (BentoML) |
Experiment tracking | Yes | Yes | Yes | Yes | Yes | Yes | n/a | Yes | n/a | Yes | n/a | n/a | n/a | n/a |
Part of larger ecosystem | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | No | Yes, part of BentoML |
Team collaboration | No ❓ | Yes | Yes | Yes | Yes | N/A | N/A | Yes | Proprietary | Yes | n/a | n/a | Yes | Yes |
Access management | No | Yes | Yes | Yes | Yes | N/A | N/A | Yes | Proprietary | Yes | n/a | n/a | Yes | Yes |
Code versioning | No | Yes | Yes | N/A | Yes | N/A | N/A | Yes | Yes | Yes | Yes | n/a | No (logging only) | No |
Data versioning | No | Yes | Yes | Yes | Yes | Yes | Yes | No | n/a | n/a | Yes, via dataset metadata logging | No | | |
API integration | Yes | Yes (Python only) ❓ | No | Yes | Yes | Yes | Yes | Yes | Yes | ❓ | n/a | Yes | Yes | |
UI/Dashboards | Yes | Yes (via Iterative Studio) | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Dashboards | n/a | Yes | Yes |
CI/CD workflow integration | Yes | Via separate tool (CML) | Yes | Yes (Jenkins, Chef, GitOps…) | Yes ❓ | Yes (Jenkins) | N/A | Yes | ❓ | ❓ | n/a | n/a | No ❓ | Yes (Jenkins, CircleCI, GitHub Actions, etc.) |
Model staging | Yes | Yes | Yes ❓ | Yes | Yes | N/A | N/A | Yes | Yes | Yes | n/a | n/a | Yes | No |
Model promotion | Yes | N/A | Yes ❓ | Yes | N/A | N/A | N/A | Yes | ❓ | Yes | n/a | n/a | Yes | No |
Tool integration | N/A | No ❓ (other tools have DVC integrations, e.g. Hydra, Hugging Face, DagsHub, VS Code) | N/A | Yes (Docker, Kubernetes, TensorFlow, PyTorch, etc.) | Yes | Yes | N/A | n/a | Yes | Yes (Kubeflow, Ray, ZenML, TensorBoard…) | Yes | n/a | Yes | Yes (TensorFlow, PyTorch, Keras, XGBoost, etc.) |
Model deployment integration | N/A | Yes | N/A | Yes | Yes | Yes | Yes | Yes | Yes | n/a | n/a | n/a | n/a | Yes |
Model training/experiments integration | N/A | Yes | N/A | No | Yes | Yes | Yes | Yes | Yes | Yes | n/a | n/a | n/a | Yes |
GitHub Stars | 13.3k | 10.9k | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | 1.5k | n/a | n/a | 493 |
License | Apache | Apache | Proprietary | Proprietary | Proprietary | Proprietary | Proprietary | Proprietary | Proprietary | Proprietary | Apache | Proprietary | Proprietary | Apache |
Documentation* | Good | Good | Good | Good | Good | Good | Ok | Good | Good | Ok | Ok | Bad | Good | Ok |
Model Serving
Model serving is the process of making your machine learning models available to users or applications for inference. This involves deploying the model to a production environment and providing an API or other interface for accessing it. A common example of a model serving tool is TensorFlow Serving, which allows you to deploy TensorFlow models to production environments and provides an API for making predictions.
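As an illustration, here is what a client request against a locally running TensorFlow Serving instance might look like (the model name and input row are hypothetical; 8501 is TensorFlow Serving's default REST port):

```python
import requests

# TensorFlow Serving's REST API expects a JSON body with an "instances" list;
# each entry is one input row for the hypothetical model.
payload = {"instances": [[5.1, 3.5, 1.4, 0.2]]}

response = requests.post(
    "http://localhost:8501/v1/models/my_model:predict", json=payload
)
print(response.json())  # e.g. {"predictions": [...]}
```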
Features | BentoML | Cortex | TensorFlow Serving | TorchServe | Seldon MLServer | KServe | Azure ML | Valohai | ForestFlow | Databricks | Sagemaker |
---|---|---|---|---|---|---|---|---|---|---|---|
Multiple model serving | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
RESTful endpoint monitoring | N/A | Yes | No | Yes | No | Yes | Yes | Yes | No | Yes | Yes (via CloudWatch) |
Control via UI | Yes (with Yatai) | No | No | Yes | No | No | Yes | Yes | No | Yes | Yes |
Input/output distribution shift monitoring | No | No | No | No | No | No | Yes | No | No | No | Yes (via CloudWatch) |
Scoring via UI | No | No | No | No | No | No | No | No | No | Yes | No |
Serving cluster customisation/autoscaling | Only with Clipper | Yes | Yes | Yes (via TorchX) | No | Yes | Yes | Yes | Yes | Yes | Yes |
Model agnostic | Yes | Yes | No | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
Dependency manager | Yes | N/A | Yes | Yes | Yes | No | Yes | Yes | N/A | Yes | Yes |
Part of a larger ecosystem | No | No | Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes |
A/B testing tools | No | No | Yes | Yes | No | Yes | Yes | Yes | Yes | No | Yes |
Support for gRPC | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | N/A | Yes | Yes |
Integrations | Airflow, MLflow, Spark | TensorFlow, Keras, PyTorch, scikit-learn, XGBoost | TensorFlow | PyTorch, MLflow, Kubeflow | N/A | Kubeflow, MLflow | N/A | N/A | TensorFlow, Spark ML | Spark ML, MLflow, scikit-learn | MLflow, Kubeflow, Spark ML |
GitHub Stars | 4.5k | 7.9k | 5.7k | 3.1k | 349 | 1.9k | N/A | N/A | 61 | N/A | N/A |
License | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 | Proprietary | Proprietary | Apache 2.0 | Proprietary | Proprietary |
Documentation* | Good | Ok | Good | Good | Ok | Ok | Good | Good | Ok | Good | Good |
The Final Stack
Taken together, that is a lot of information to digest. In practice, the evaluations were parallelised across the team so they could be completed simultaneously.
Next, we used the popular MoSCoW method to evaluate the tools and decide the winners. Here's a crisp explanation of MoSCoW prioritization:
- MoSCoW prioritization is a project management technique for prioritizing requirements or features based on their importance.
- It stands for Must have, Should have, Could have, and Won't have.
- Must-have requirements are critical and necessary for the project's success.
- Should-have requirements are important but not critical.
- Could-have requirements are desirable but not essential.
- Won't-have requirements are excluded from the project scope.
- MoSCoW prioritization helps teams focus on the most important features and manage stakeholder expectations.
- It also helps reduce scope creep and ensure the project delivers the most value with available resources.
Component | Tool |
---|---|
ML pipeline orchestrator | Flyte |
ML metadata and artefact tracking | Sagemaker |
Model Registry | Sagemaker |
Model Serving | Sagemaker (BentoML was a close runner-up) |
Model Training | Ray + AWS (as the platform) |
Azure was a close runner-up in some cases; however, Sagemaker beat it because of its wider applicability.
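To give a flavour of the winning orchestrator, here is a minimal sketch of a Flyte pipeline (the task bodies are illustrative placeholders, not our production code):

```python
from flytekit import task, workflow


@task
def preprocess(raw: int) -> int:
    # Placeholder for a real feature-engineering step.
    return raw * 2


@task
def train(features: int) -> float:
    # Placeholder for model fitting; returns a dummy score.
    return features / 10.0


@workflow
def training_pipeline(raw: int = 5) -> float:
    # Flyte infers the DAG from the data flow between tasks.
    return train(features=preprocess(raw=raw))
```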
In conclusion, choosing the right frameworks for building a robust and scalable machine learning architecture can be a challenging task. However, the MADR approach can be a valuable tool in streamlining the decision-making process. By evaluating various tools and documenting the decision-making criteria and rationale, teams can ensure transparency and accountability in the decision-making process. While extensive research may seem time-consuming, it is essential to make an informed decision that aligns with the organization's goals and requirements. In the end, a well-designed machine learning architecture can significantly impact the success of ML projects, and investing in the right tools and frameworks can make all the difference.