Inside Modern Machine Learning Platforms: A Survey Across the Industry
Machine Learning (ML) is no longer just a research field or a niche corner of data science teams. Today, it's deeply embedded in products we use daily — be it your Uber ETA, Instagram feed, Netflix recommendations, or fraud detection in fintech. And the real magic behind deploying these models reliably, at scale, lies in how companies build their internal ML platforms.
In this post, we dive deep into how different organisations — ranging from tech giants to startups — are architecting their ML systems. We've compiled insights from real-world case studies, open-source projects, and platform blueprints to show how companies are solving the same problem with wildly different approaches.
We'll cover:
- What a typical ML platform looks like
- How tech companies like Uber, Facebook, Airbnb, and Instacart architect theirs
- Trade-offs in design: scalability vs. flexibility vs. cost
- The rise of open-source ML infrastructure
- When to consider managed cloud offerings like AWS SageMaker, Vertex AI, or Azure ML
- A comparison table to help you decide what's right for your org
Let's get started.
What is a Machine Learning Platform, Really?
At a high level, a Machine Learning Platform is everything that sits between raw data and live predictions. Here's a common breakdown:
- Data Ingestion & Processing: Scheduled or real-time pipelines (Spark, Airflow, Kafka)
- Feature Store: A central system to define, compute, and serve features for ML
- Experimentation: Notebook environments, experiment tracking, model training
- Model Registry: Cataloging model artifacts, versions, metrics
- Deployment: Real-time serving or batch prediction, often via APIs or cron jobs
- Monitoring: Tracking model drift, latency, and performance degradation
- Orchestration: Pipeline DAGs tying all the above together
This sounds like a lot — and it is. That's why companies have taken vastly different paths depending on their scale, domain, and engineering maturity.
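To make the orchestration layer concrete, here's a minimal Airflow sketch that strings these stages into one DAG. The task names and bodies are placeholders rather than a real pipeline, and exact DAG arguments vary by Airflow version:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    print("pull raw events into the warehouse")

def build_features():
    print("compute and materialise features")

def train():
    print("fit the model on the latest features")

def evaluate():
    print("score the candidate against the current champion")

with DAG(
    dag_id="ml_training_pipeline",   # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="ingest", python_callable=ingest)
    t2 = PythonOperator(task_id="build_features", python_callable=build_features)
    t3 = PythonOperator(task_id="train", python_callable=train)
    t4 = PythonOperator(task_id="evaluate", python_callable=evaluate)

    t1 >> t2 >> t3 >> t4  # the DAG is what ties the platform stages together
```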
The Tech Giants: Heavyweight End-to-End Platforms
Uber: Michelangelo
Arguably one of the most referenced ML platforms in the industry.
Michelangelo isn't just a platform; it's a philosophy of consistency and scale. Designed to democratise ML across teams, it covers everything: data pipelines, a feature store, distributed training, a model registry, and serving infrastructure.
- Palette (Feature Store): Both offline and online features accessible via a common API, versioned and reusable.
- Training Stack: Uses Spark for distributed preprocessing and TensorFlow or XGBoost for model training. Includes auto-tuning and training across thousands of configurations.
- Model Deployment: Kubernetes-based serving with one-click deployments and autoscaling.
- Monitoring: Collects real-time performance metrics, model output distributions, and a proprietary model health score to trigger retraining.
Michelangelo powers ETA predictions, fraud models, driver allocation, and even supply-demand balancing models.
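Palette itself is proprietary, but its offline/online split is the same pattern OSS feature stores like Feast expose today. Here's a rough sketch of the "one API for training and serving" idea, with hypothetical feature and entity names, assuming an already-configured Feast repository:

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # assumes a configured Feast repo in this directory

# Hypothetical labels: which drivers we want training examples for, and when
labels_df = pd.DataFrame({
    "driver_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-02"]),
})

# Offline: point-in-time-correct features joined onto training examples
training_df = store.get_historical_features(
    entity_df=labels_df,
    features=["driver_stats:trips_today", "driver_stats:avg_rating"],
).to_df()

# Online: the same feature definitions, served at low latency for live predictions
online = store.get_online_features(
    features=["driver_stats:trips_today", "driver_stats:avg_rating"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()
```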
Why it works: Uber runs high-QPS, latency-sensitive products, and ML is core to the business (e.g. pricing, ETA), so strong guardrails were essential.
Trade-offs: High engineering investment. Not for small teams.
Facebook: FBLearner Flow
Facebook took a slightly different route. FBLearner was built to let any engineer build and train ML models with minimal boilerplate.
It's a workflows-as-code platform built for internal reusability and speed:
- Workflows-as-Code: Define training pipelines as DAGs using Python APIs.
- Pluggable Components: Standardised modules for preprocessing, model training, evaluation, and deployment.
- Experiment Management: Automatically tracks every experiment, dataset, and outcome.
- Scale: At its peak, powered 6M predictions/sec across Facebook apps.
FBLearner also focused heavily on reuse: train once, serve anywhere. The internal adoption was massive — 25%+ of engineers were using it within a few years.
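FBLearner's API is internal, but the workflows-as-code idea is easy to illustrate. Here's a toy sketch; the decorator and tracking logic are hypothetical stand-ins, not Facebook's actual interface:

```python
from functools import wraps

RUNS = []  # stand-in for the platform's experiment tracker

def operator(fn):
    """Hypothetical decorator: records each step's execution automatically."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        RUNS.append({"operator": fn.__name__, "output": result})
        return result
    return wrapper

@operator
def preprocess(rows):
    return [r / max(rows) for r in rows]

@operator
def train(features):
    return sum(features) / len(features)  # stand-in for a real model fit

model = train(preprocess([3.0, 6.0, 9.0]))
# Every step above was logged to RUNS with zero extra boilerplate in the pipeline code
```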
Why it works: At Facebook's scale, ML is everywhere. Democratisation and reuse matter more than custom infra per team.
Trade-offs: Deeply tied to Facebook infra. Not transferable.
Airbnb: Bighead
Airbnb opted for modularity with a suite of focused tools:
- Zipline: Feature engineering as code; supports time-windowed feature computation and training-serving parity.
- Bighead Lib: Pipelines-as-Python for rapid experimentation.
- Redspot: Hosted notebook environments with consistent containers.
- Deep Thought: Wraps models into Docker services, adds observability, and deploys to Kubernetes.
Each component was replaceable, making the platform easy to evolve. Most importantly, Bighead drastically reduced time-to-production for models.
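Zipline is internal to Airbnb, but its core idea, time-windowed features computed as of each event, translates directly to pandas. A hypothetical sketch of "bookings by this user in the trailing 7 days":

```python
import pandas as pd

# Hypothetical events table: one row per booking
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-09", "2024-01-02"]),
    "n": 1,  # helper column so the rolling window has something to sum
}).sort_values(["user_id", "ts"])

# Trailing-7-day booking count per user, computed as of each event.
# Because the frame is sorted by (user_id, ts), the grouped result lines up row-for-row.
events["bookings_7d"] = (
    events.groupby("user_id")
          .rolling("7D", on="ts")["n"]
          .sum()
          .to_numpy()
)
print(events)
```

That as-of discipline is what buys training-serving parity: the value attached to a training row is exactly what the online system would have seen at that moment.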
Why it works: Airbnb needed consistency and productivity, not one-size-fits-all.
Trade-offs: Building many micro-components adds integration overhead.
The Modular Approach: Open-Source Stacks
Companies without 100+ engineers can still build sophisticated ML platforms by stitching together open-source tools:
Common OSS Stack:
- Data Ingestion: Airflow or Dagster for pipelines
- Feature Engineering: Feast or custom ETL over Snowflake/BigQuery
- Training & Tracking: MLflow, PyTorch Lightning, or TensorFlow with Optuna/Ray Tune
- Model Registry & Deployment: MLflow Model Registry, Seldon Core, BentoML
- Serving: KServe for scalable real-time inference
- Monitoring: Evidently, WhyLogs, Prometheus + Grafana
This approach gives you full control and avoids vendor lock-in.
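As a taste of how little glue the tracking piece needs, here's a minimal MLflow sketch. The experiment name is made up, and by default this logs to a local ./mlruns directory rather than a shared tracking server:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
mlflow.set_experiment("churn-baseline")  # hypothetical experiment name

with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")  # versioned artifact the registry can pick up
```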
Real-World Example: Instacart's Griffin ML platform combines Feast, Databricks, Ray, MLflow, and a custom orchestrator — scaled to hundreds of models without building from scratch.
Trade-offs: You own integration, observability, and scaling.
Cloud-Managed ML Platforms
Cloud providers offer full-service ML stacks — great for startups, prototypes, or when internal DevOps resources are scarce.
AWS SageMaker:
- Feature Store, Model Registry, Pipelines, and Endpoints all managed
- Tight integration with S3, Glue, CloudWatch, IAM
- Supports Bring-Your-Own-Container (BYOC) for custom model logic
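For flavour, here's a rough sketch of training and deploying a scikit-learn model with the SageMaker Python SDK. The IAM role, S3 path, and entry-point script are placeholders, and exact arguments vary by SDK version:

```python
from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(
    entry_point="train.py",  # your training script (placeholder)
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder IAM role
    instance_type="ml.m5.xlarge",
    framework_version="1.2-1",
)
estimator.fit({"train": "s3://my-bucket/training-data/"})  # placeholder S3 path

# Managed real-time endpoint with autoscaling handled by the service
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
)
```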
Google Vertex AI:
- Managed pipelines built on Kubeflow
- Integrated with BigQuery, GCS, AutoML, and Explainable AI
- Feature Store with real-time and batch support
Azure ML:
- End-to-end model lifecycle with a drag-and-drop UI or code-based SDKs
- Strong MLOps integrations with GitHub Actions and Azure DevOps
Best for: Teams that want to skip infrastructure and focus purely on modeling.
Drawbacks: Opaque internals, higher long-term cost, and limited flexibility.
Databricks: The Lakehouse Approach to MLOps
Databricks deserves special mention as a platform that blends the benefits of a managed cloud platform with the openness and flexibility of OSS-based design.
If you don't want to self-manage an OSS platform but still want an OSS-centric stack, Databricks is, in my view, the strongest single platform for both data and ML. It's a little more expensive than the cloud providers' native offerings, but for many teams it's worth the money.
Why Databricks Stands Out:
- Unified Lakehouse Architecture: Combines the data warehouse and data lake into a single platform, making it easier to manage both structured and unstructured data in one place.
- Native MLflow Integration: As the creators of MLflow, Databricks offers best-in-class experiment tracking, model versioning, and a model registry baked right into the platform (see the sketch after this list).
- Auto-scaling Compute: Easily provision GPU/CPU clusters for ML workloads with zero DevOps involvement.
- Delta Lake: ACID-compliant, versioned data tables optimized for large-scale ML pipelines.
- Model Deployment: Supports batch inference via jobs or real-time endpoints with model serving. Integrates with Unity Catalog for governance and lineage.
- Deep Collaboration Features: Real-time collaborative notebooks with Git integration, commenting, and version history.
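Thanks to the MLflow integration, promoting a model into Unity Catalog is only a few lines when run on a Databricks workspace. The catalog/schema/model names below are hypothetical:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, random_state=0)
model = LinearRegression().fit(X, y)

mlflow.set_registry_uri("databricks-uc")  # use Unity Catalog as the model registry

with mlflow.start_run() as run:
    mlflow.sklearn.log_model(model, "model")

# Register the logged artifact under a three-level Unity Catalog name,
# which gives you governance and lineage on the model like any other asset
mlflow.register_model(
    f"runs:/{run.info.run_id}/model",
    "main.ml_models.eta_predictor",  # hypothetical catalog.schema.model name
)
```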
Ideal For:
- Enterprises that want a data-first ML platform
- Teams already using Spark, Delta Lake, or lakehouse patterns
- Organizations looking for strong governance and secure data access (via Unity Catalog)
Trade-offs:
- Cost: Running heavy ML workloads at scale on Databricks can be expensive.
- Real-time deployment flexibility: Some teams find KServe or BentoML more flexible for specialized serving use cases.
Still, for many, Databricks hits the sweet spot of managed infrastructure, OSS compatibility, and platform maturity.
Comparison Table
| Platform Type | Scale | Maintenance | Flexibility | Cost | Best For |
|---|---|---|---|---|---|
| In-House (Uber) | Very High | Very High | Very High | High | Large tech companies |
| Modular (Airbnb) | High | Medium | High | Medium | Mid-size product orgs |
| OSS Stack | High | Medium | Very High | Low to Med | Startups with ML infra teams |
| Cloud-Managed | Med-High | Low | Medium | Medium-High | Teams without infra engineers |
| Commercial AutoML | Medium | Low | Low | High | Traditional enterprises |
Final Thoughts
There's no one right ML platform architecture. What matters is choosing something that aligns with your team's size, experience, and goals.
- If you're Uber or Facebook, you'll likely need to build.
- If you're a 20-person startup? Don't.
- If you're somewhere in the middle? Open-source + cloud might be your sweet spot.
Regardless of what you pick, the trends are clear: feature stores are no longer optional, monitoring is essential, and modularity is king. And as the tooling ecosystem matures, the barrier to building solid ML infra is lower than ever.
Have thoughts or want to share how your company does it? I'd love to hear from you.
