A simple framework based on my short taxonomy of open-source strategies, applied specifically to machine learning models.
Dimensions of what you can open-source (each tier includes the previous):
- A paper that describes the algorithm
- The code that implements the algorithm
- The model weights (and the code; see the sketch below)
- The training data, model weights, and code
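To make the "weights and code" tier concrete, here is a minimal sketch assuming the Hugging Face transformers library and GPT-2 (both chosen purely for illustration; neither is named above), whose weights and implementation are openly released:

```python
# Minimal sketch of the "weights + code" tier: with both released,
# anyone can load and run the model without the training data.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Open-sourcing model weights means", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Fine-tuning on your own data is equally possible at this tier; only the bottom tier (training data included) lets others reproduce the weights from scratch.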
Under varying levels of licensing:
- Copyleft (you can use it, but derived works must be open-sourced under the same terms)
- Permissive (the author may retain some rights, but the work can be used in commercial derivatives without open-sourcing them)
Some newer restrictions (see the sketch after this list):
- "No using it to build competing models" (OpenAI)
- "No offering it as a managed service" (e.g., Elastic)
- Academic/non-commercial use only (the original LLaMA release)
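These restrictions live in each model's license terms, not in the code itself, so it is worth checking the declared license before building on a hosted model. A small sketch, assuming the huggingface_hub client ("gpt2" is just an example repo id):

```python
# Sketch: read the license a hosted model declares before depending
# on it commercially. Assumes `huggingface_hub`; "gpt2" is arbitrary.
from huggingface_hub import model_info

info = model_info("gpt2")
license_tags = [tag for tag in info.tags if tag.startswith("license:")]
print(license_tags)  # e.g. ["license:mit"]
```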
What should you choose? A few non-exhaustive examples:
You have proprietary data, and the code alone is not your moat. Twitter's model features are probably not relevant to many other businesses. Even if you deployed Twitter's ranking algorithms, you would still need to overcome Twitter's network effects and a decade of proprietary data.
You want to recruit and retain top researchers. The best researchers want to publish their work. Combined with the resources and data of a large tech company, researchers may accomplish more in an industry lab than in an academic one. Even Apple has relaxed its insistence on secrecy to retain top talent.
You sell hardware or cloud resources. Commoditize your complement: if you sell GPUs or cloud capacity, you want as many organizations as possible running large custom training jobs, inference workloads, and everything in between.
You have no distribution but have a breakthrough insight. Open source provides a level of distribution when you have none; Stable Diffusion (Stability AI) grew on this principle. Distribution is often the hardest part.