“Data is the new oil” was the slogan of the last decade. Companies were told how valuable their data was (or could be). They rushed to invest in a modern data stack and store terabytes of data in data warehouses. Data science teams crunched the numbers, and the analyses were supposed to be used to inform product decisions (or even, in some cases, customer-facing features like recommendation feeds).
There were success stories, but many organizations failed to execute: siloed data (and siloed data teams), expensive cloud data warehouses and rogue queries (now being downsized), and the absence of clean data pipelines (getting data into a refined state takes significant ops work).
Now, with generative AI, is data still a moat? Is data more or less valuable when synthetic datasets account for a non-zero part of training and inference pipelines?
On the one hand, quality data still matters. Much of the focus on improving LLMs is on model and dataset size, but there’s early evidence that LLMs are greatly influenced by the quality of the data they are trained on; WizardLM, TinyStories, and phi-1 are some examples. Likewise, RLHF datasets matter.
On the other hand, ~100 data points are enough for significant improvement when fine-tuning for output format and custom style. LLM researchers at Databricks, Meta, Spark, and Audible have done empirical analyses of how much data is needed to fine-tune. That amount of data is easy to create or curate by hand.
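To make the scale concrete, here is a minimal sketch of what a ~100-example fine-tuning set for output format and style could look like. Everything here is illustrative (the task, field names, and data are hypothetical, not taken from the studies mentioned above); the records use the chat-style JSONL schema common to hosted fine-tuning APIs.

```python
import json

def make_example(product: str, sentiment: str) -> dict:
    """One chat-format training record (OpenAI-style JSONL schema).

    The system prompt pins the output format; the assistant turn
    demonstrates both the format and the desired style.
    """
    return {
        "messages": [
            {"role": "system",
             "content": 'Reply only with JSON: {"sentiment": ...}'},
            {"role": "user",
             "content": f"Review: I love my new {product}!"},
            {"role": "assistant",
             "content": json.dumps({"sentiment": sentiment})},
        ]
    }

# ~100 hand-written or lightly templated examples is the scale in question.
products = [f"gadget-{i}" for i in range(100)]
dataset = [make_example(p, "positive") for p in products]

# Serialize to JSONL, the usual upload format for fine-tuning endpoints.
jsonl = "\n".join(json.dumps(record) for record in dataset)
```

A set this size is small enough to review line by line, which is exactly why format-and-style fine-tuning is cheap to bootstrap.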
Model distillation is real and simple to do: use an LLM to generate synthetic data to train or fine-tune your own LLM, and some of the teacher’s knowledge transfers over. This is only a risk if you expose the raw LLM to a counterparty (less so if it’s used internally), but it means that any data that isn’t truly unique can be copied easily.
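The distillation loop above can be sketched in a few lines. This is a toy illustration, not a production recipe: the `teacher_llm` function stands in for a call to a large model’s API, and the prompts and labels are made up.

```python
import json

def teacher_llm(prompt: str) -> str:
    """Stand-in for a call to a large teacher model.

    In practice this would be an API call; here it is a trivial rule
    so the sketch runs without network access.
    """
    return "positive" if "love" in prompt else "negative"

# Unlabeled prompts are cheap; the teacher's answers are the valuable part.
unlabeled_prompts = [
    "I love this keyboard",
    "The battery died in a day",
    "I love the screen",
]

# Query the teacher to build a synthetic (prompt, completion) training set,
# then serialize it as JSONL to fine-tune a smaller student model.
synthetic_data = [
    {"prompt": p, "completion": teacher_llm(p)} for p in unlabeled_prompts
]
train_file = "\n".join(json.dumps(record) for record in synthetic_data)
```

The asymmetry is the point: anyone with query access to the teacher can manufacture a training set like this, which is why exposing a raw LLM to counterparties erodes the moat.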