Google has been training AI tools using version control as a source of training data. The idea: take the snapshots of the software development process captured in code reviews, commits, and intermediate editing steps in their centralized monorepo, and use them to build AI-powered tools that debug, repair, review, and edit code following a process similar to how humans work. (Google’s DIDACT)
There’s a wealth of data in version control. I’ve long suggested building a “GitHub Copilot, but for merge conflicts”: the training data comes from the merge commits and manual conflict resolutions already stored in git.
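To make that concrete, here is a minimal sketch of one piece of such a pipeline: parsing a conflict-marked file into (ours, theirs) hunks, which could then be paired with the resolution recorded in the merge commit to form training examples. The repository walk itself would use real git commands (e.g. `git log --merges` to find merge commits); the parser below, its function name, and the sample conflict are all illustrative.

```python
def parse_conflicts(text):
    """Split a conflict-marked file into hunks of (ours, theirs).

    Recognizes the standard git conflict markers:
        <<<<<<< ours
        ...
        =======
        ...
        >>>>>>> theirs
    """
    hunks = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        if lines[i].startswith("<<<<<<<"):
            ours, theirs = [], []
            i += 1
            while not lines[i].startswith("======="):
                ours.append(lines[i])
                i += 1
            i += 1  # skip the ======= separator
            while not lines[i].startswith(">>>>>>>"):
                theirs.append(lines[i])
                i += 1
            hunks.append({"ours": "\n".join(ours), "theirs": "\n".join(theirs)})
        i += 1
    return hunks

# A made-up conflict, as git would leave it in the working tree:
conflicted = """\
def greet(name):
<<<<<<< HEAD
    return f"Hello, {name}!"
=======
    return "Hello, " + name
>>>>>>> feature
"""

print(parse_conflicts(conflicted))
```

Pairing each parsed hunk with the file’s contents at the merge commit (via `git show <merge-sha>:<path>`) would yield the input/output examples a conflict-resolution model could train on.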
Sometimes understanding the process of how a result came to be is more valuable than the end result without context. At Stanford, I analyzed GitLab’s Sales organization using the company handbook, which is stored publicly in git. I could see how key sales metrics and headcount grew over time, and spot inflection points at each round of funding and other pivotal events in the company’s history.
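A sketch of that kind of analysis: walk the git history of a handbook page, extract a number from each historical version, and you have a time series. The git side would use real commands (`git log --follow --format='%H %ad' --date=short -- <path>` to list commits, then `git show <sha>:<path>` for each version); the snapshot format, the `Headcount:` field, and the function name below are hypothetical.

```python
import re
from datetime import date

def extract_headcount(page_text):
    """Pull a 'Headcount: N' figure out of one version of the page.

    The field name is made up for illustration; a real handbook page
    would need its own extraction rule.
    """
    m = re.search(r"Headcount:\s*(\d+)", page_text)
    return int(m.group(1)) if m else None

# Hypothetical snapshots: (commit date, page contents at that commit),
# as produced by iterating `git show <sha>:<path>` over the file's history.
snapshots = [
    (date(2017, 1, 15), "## Sales\nHeadcount: 12\n"),
    (date(2018, 3, 2),  "## Sales\nHeadcount: 45\n"),
    (date(2019, 9, 30), "## Sales\nHeadcount: 130\n"),
]

series = [(d, extract_headcount(text)) for d, text in snapshots]
for d, n in series:
    print(d, n)
```

Plot the resulting series and the inflection points around funding rounds become visible without anyone ever publishing the numbers directly.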
There’s an ongoing conversation in the industry and research community about whether we have enough training-data tokens to keep reaching new milestones in LLMs. There’s plenty if we get a little more creative about what counts as data.