Kubernetes in the Data Stack

Oct 12, 2021

The data analytics stack is changing. The proliferation of data sources and the complexity of pipelines have driven companies to look beyond traditional SaaS solutions. Here is a look at how Kubernetes is impacting data analytics workflows.

Workflow Orchestration

Airflow and similar orchestrators such as Prefect, Astronomer, or Dagster provide a workflow execution engine for data-intensive workloads. Usually, this means some combination of extracting, transforming, and moving data.

Kubernetes is a possible execution platform for many of these tools. It provides the API to launch and track the individual tasks in a workflow's DAG (directed acyclic graph). Combine this with autoscaling or a smart scheduler, and many of these tasks can be performed more cheaply and quickly.
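To make the idea of launching DAG tasks through the Kubernetes API more concrete, here is a minimal sketch in Python. It only builds Job manifests as plain dictionaries, in an order that respects task dependencies; it does not talk to a cluster. The task names, the image reference, and the helper functions job_manifest and plan are hypothetical and chosen for illustration. This is not how any particular orchestrator does it.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical DAG: each task maps to the set of tasks it depends on.
DAG = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def job_manifest(task, image="example.com/pipeline:latest"):
    """Build a batch/v1 Job manifest (as a plain dict) for one DAG task.

    The image name and container args are placeholders, not real artifacts.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"task-{task}"},
        "spec": {
            "backoffLimit": 2,  # retry a failed task up to two times
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": task, "image": image, "args": ["run", task]}
                    ],
                }
            },
        },
    }


def plan(dag):
    """Return Job manifests ordered so dependencies come before dependents."""
    order = TopologicalSorter(dag).static_order()
    return [job_manifest(task) for task in order]


manifests = plan(DAG)
for manifest in manifests:
    print(manifest["metadata"]["name"])
```

A real orchestrator would submit each manifest to the cluster, watch the Job's status, and only start a task once the tasks it depends on have succeeded. That submission and tracking loop is left out of this sketch.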

For more DevOps-savvy teams, Kubernetes can also serve as a common deployment target.

Data Ingestion

At the top of the analytics funnel is the collection and storage of real-time data. Startups like PostHog provide a simple Kubernetes deployment for self-hosted analytics. An open-source Segment alternative called RudderStack also deploys to Kubernetes. These kinds of products need a columnar database to query efficiently, so PostHog ships with a high-performance database called ClickHouse (whose maintainers have also recently started a company around it).

ClickHouse has a battle-tested Kubernetes operator, maintained by a different company, for scaling deployments up and down.

Extract & Load

Closely related to workflow orchestration is the process of extracting data from sources and loading it into a data warehouse like Snowflake. Think Zapier but more operational. The main problem here is ingesting data from many third-party APIs while maintaining data quality and surfacing API changes or breakages.

Historically, Fivetran has approached this by maintaining high-quality in-house connectors. Two startups have tried an open-source approach. Meltano, a GitLab spin-out, has focused on an open-source ecosystem of third-party connectors. Airbyte runs on Kubernetes and also leverages open-source connectors. Containers map one-to-one to connectors: each connector runs in its own container.
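The one-to-one mapping between connectors and containers can be sketched roughly as follows. This is an illustration under stated assumptions, not Airbyte's actual implementation: the connector names, the image references, and the connector_pod helper are all hypothetical. The point is only that each connector resolves to exactly one container image, and each run gets a Pod with exactly one container.

```python
# Hypothetical connector names and image references, for illustration only.
CONNECTOR_IMAGES = {
    "postgres-source": "example.com/connectors/postgres-source:1.0",
    "snowflake-destination": "example.com/connectors/snowflake-destination:1.0",
}


def connector_pod(connector, sync_id):
    """Build a v1 Pod manifest (as a plain dict) with one container per connector."""
    image = CONNECTOR_IMAGES[connector]  # raises KeyError for unknown connectors
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"{connector}-{sync_id}",
            "labels": {"connector": connector},
        },
        "spec": {
            "restartPolicy": "Never",
            "containers": [{"name": connector, "image": image}],
        },
    }


# One Pod per connector taking part in a (hypothetical) sync run.
pods = [connector_pod(name, "sync-1") for name in CONNECTOR_IMAGES]
```

Packaging connectors this way means a connector can be written in any language and versioned independently, since the platform only needs to know which image to run.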

Where Else?

I think that Kubernetes has the potential to change the data analytics stack beyond being a convenient deployment target, and that this is just the beginning of an interesting partnership between data engineers and DevOps engineers.