The data analytics stack is changing. The proliferation of data sources and the growing complexity of pipelines have driven companies to look beyond traditional SaaS solutions. Here is a look at how Kubernetes is affecting data analytics workflows.
Workflow Orchestration
Airflow and similar orchestrators such as Prefect or Dagster (along with managed Airflow offerings like Astronomer) provide a workflow execution engine for data-intensive workloads. Usually, this means some combination of extracting, transforming, and moving data.
Kubernetes is a possible execution platform for many of these tools. It provides the API to launch and track the individual tasks in the DAG. Combine this with autoscaling or a smart scheduler, and many of these tasks can run more cheaply and quickly.
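To make the "one task, one unit of work on the cluster" idea concrete, here is a minimal sketch that walks a DAG in dependency order and builds a Kubernetes Job manifest for each task. It is not how Airflow, Prefect, or Dagster actually do it. The helper names (job_manifest, plan_jobs), the task-<name> naming scheme, the image URL, and the container command are all assumptions made up for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def job_manifest(task_name: str, image: str, command: list[str]) -> dict:
    """Build a batch/v1 Job manifest (as a plain dict) for one DAG task.

    The "task-<name>" naming scheme and the label key are illustrative choices.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"task-{task_name}", "labels": {"dag-task": task_name}},
        "spec": {
            "backoffLimit": 2,  # allow up to two retries before the Job is marked failed
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": task_name, "image": image, "command": command}
                    ],
                }
            },
        },
    }


def plan_jobs(dag: dict[str, set[str]], images: dict[str, str]) -> list[dict]:
    """Return one Job manifest per task, ordered so upstream tasks come first.

    `dag` maps each task to the set of tasks it depends on.
    """
    order = TopologicalSorter(dag).static_order()
    # The ["run", <task>] command is a placeholder for whatever the image expects.
    return [job_manifest(task, images[task], ["run", task]) for task in order]


# Hypothetical three-step pipeline: extract -> transform -> load.
dag = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
images = {task: "example.com/etl:latest" for task in dag}  # placeholder image
jobs = plan_jobs(dag, images)
print([job["metadata"]["name"] for job in jobs])
```

A real orchestrator would submit each manifest to the cluster (for example with create_namespaced_job from the official Kubernetes Python client) and wait for the Job to report success before starting its downstream tasks. The sketch only covers the planning step.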
For more DevOps-savvy teams, Kubernetes can also serve as a common deployment target.
Data Ingestion
At the top of the analytics funnel is the collection and storage of real-time data. Startups like PostHog provide a simple Kubernetes deployment for self-hosted analytics. RudderStack, an open-source Segment alternative, also deploys to Kubernetes. These kinds of products need a columnar database to query efficiently, so PostHog ships with ClickHouse, a high-performance columnar database (whose maintainers have also recently started a company around it).
ClickHouse has a battle-tested Kubernetes operator for scaling deployments up and down, maintained by a different company.
Extract & Load
Closely related to workflow orchestration is the process of extracting data from sources and loading it into a data warehouse like Snowflake. Think Zapier, but more operational. The main problem here is ingesting data from many third-party APIs while maintaining data quality and surfacing API changes or breakages.
Historically, Fivetran has approached this by maintaining high-quality in-house connectors. Two startups have tried an open-source approach. Meltano, a GitLab spin-out, has focused on an open-source ecosystem of third-party connectors. Airbyte runs on Kubernetes and also leverages open-source connectors. Each connector maps one-to-one to a container.
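The one-connector-per-container idea can be sketched as a lookup from connector name to container image, with each sync running in its own single-container pod. This is an illustration, not Airbyte's actual implementation. The registry contents, the image URLs, the connector_pod helper, and the pod naming scheme are all hypothetical.

```python
# Hypothetical registry: each connector is packaged as exactly one container image.
CONNECTOR_IMAGES = {
    "source-postgres": "example.com/connectors/source-postgres:0.1.0",
    "destination-snowflake": "example.com/connectors/destination-snowflake:0.1.0",
}


def connector_pod(connector: str, sync_id: str) -> dict:
    """Build a v1 Pod manifest (as a plain dict) that runs one connector.

    Raises KeyError if the connector has no registered image.
    """
    image = CONNECTOR_IMAGES[connector]
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"{connector}-{sync_id}",  # illustrative naming scheme
            "labels": {"connector": connector},
        },
        "spec": {
            "restartPolicy": "Never",
            # Exactly one container per pod, mirroring the one-to-one mapping.
            "containers": [{"name": "connector", "image": image}],
        },
    }


pod = connector_pod("source-postgres", "42")
print(pod["metadata"]["name"], pod["spec"]["containers"][0]["image"])
```

Packaging connectors this way means a connector can be written in any language and versioned on its own, since the platform only needs to know which image to run.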
Where Else?
I think Kubernetes has the potential to change the data analytics stack beyond being a convenient deployment target. This is just the beginning of an interesting partnership between data engineers and DevOps engineers.