There's been a blog war between two cloud data giants over the last two weeks. Databricks claimed to be significantly faster than Snowflake in data warehousing performance. Snowflake countered that the benchmark was unfair and wrong. Databricks said it stands by its assessment. Here's how it escalated.
- Nov 2: Databricks Sets Official Data Warehousing Performance Record
- Nov 12: Industry Benchmarks and Competing with Integrity – Snowflake
- Nov 15: Snowflake Claims Similar Price/Performance to Databricks, but Not So Fast!
Why are both companies in such heated competition?
Databricks has historically been centered around the data science workflow, born out of the Apache Spark project at Berkeley. At the core of the data science workflow is unstructured data (images, videos, documents) and semi-structured data (XML, JSON, YAML).
Snowflake provided the bedrock of the modern analytics workflow with their cloud data warehouse product. This allowed massive amounts of structured data to be stored and queried with SQL.
Organizations will always need to analyze both structured and unstructured data. As a result, both companies are trying to offer a full data cloud platform. Databricks has wrapped their unstructured data lakes with a SQL layer, and Snowflake has added support for unstructured data.
Who has the upper hand? Some preliminary thoughts. It's not about the benchmarks. Performance matters, but it has become table stakes. It's about building enough solutions to become a data platform. I think it will come down to (1) who has the better wedge and (2) who has the faster product velocity.
Databricks is open source. Open source is the future of infrastructure. I imagine Snowflake has grown so big despite being closed source because it has a well-defined API: SQL. That lets it scope its platform and its job (reading and writing data). On the other hand, Databricks was built on open source and has successfully navigated the replatforming from cloud to cloud native and containers. A true data platform requires a large API surface, and open source provides the largest API surface possible.
There are more data analysts than data scientists. Data analysts are cheaper to hire, and every single company needs structured data. Analysts are responsible for writing queries to calculate metrics like ARR, MRR, and other KPIs. Nearly every business needs those metrics. On the other hand, data scientists are more expensive, and many companies don't have the expertise or the data to support interesting data science work.
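To make the analyst side concrete, here is a minimal sketch of the kind of query described above, run against an in-memory SQLite database from Python. The `subscriptions` table, its columns, and the `compute_mrr_arr` helper are all hypothetical, and ARR is assumed to be simply MRR times 12, which is a common simplification and not how every company defines it.

```python
import sqlite3


def compute_mrr_arr(subscriptions):
    """Return (mrr, arr) for a list of (customer, monthly_fee, active) rows.

    The schema is hypothetical; ARR is assumed to be MRR * 12.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE subscriptions ("
        "customer TEXT, monthly_fee REAL, active INTEGER)"
    )
    conn.executemany(
        "INSERT INTO subscriptions VALUES (?, ?, ?)", subscriptions
    )
    # The analyst's actual work is this one SQL statement:
    # sum the monthly fees of active subscriptions, then annualize.
    row = conn.execute(
        """
        SELECT COALESCE(SUM(monthly_fee), 0)      AS mrr,
               COALESCE(SUM(monthly_fee), 0) * 12 AS arr
        FROM subscriptions
        WHERE active = 1
        """
    ).fetchone()
    conn.close()
    return row


if __name__ == "__main__":
    rows = [
        ("acme", 100.0, 1),
        ("globex", 250.0, 1),
        ("initech", 50.0, 0),  # churned, excluded from MRR
    ]
    mrr, arr = compute_mrr_arr(rows)
    print(f"MRR: {mrr:.2f}, ARR: {arr:.2f}")
```

The point is less the Python wrapper than the SQL inside it: the metric is a single aggregate over structured rows, which is exactly the workload a cloud data warehouse is built for.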
But will the roles of data scientists and data analysts converge? Data science libraries are often written in Python and require working knowledge of statistics. Data analysts only need to know SQL to calculate and transform business data. (For more, read Unbundling of the Software Dev).