How do you evaluate a general-purpose large language model?
Evaluating a model is essential: how well a model performs on your particular task is one of the top criteria for choosing it, alongside cost, latency, and other factors.
Today, it is more art than science.
Researchers have built frameworks to evaluate models, such as HELM (Holistic Evaluation of Language Models). Still, researchers and commercial products often have different north stars, not to mention how much innovation is happening behind closed doors (how much can you evaluate when the model isn't open-sourced?).
The industry also has its own evaluation frameworks (e.g., openai/evals), but these haven't proven very useful outside the companies that open-sourced them.
Companies are building their own QA tools to test for regressions on new prompts and to track performance across models, but very few go beyond human evaluation. SaaS startups have popped up to provide some of this infrastructure.
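The kind of regression-testing tool described above can be sketched in a few lines. This is a minimal illustration, not any specific company's implementation: every name here (`call_model`, `TEST_CASES`, `run_regression`) is hypothetical, and the model call is stubbed so the example runs standalone; in practice it would wrap a real API.

```python
def call_model(prompt: str) -> str:
    # Stub standing in for a real LLM call (hypothetical; replace with an
    # actual API client in practice).
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "What is 2 + 2?": "2 + 2 equals 4.",
    }
    return canned.get(prompt, "")


# Each test case pairs a prompt with a substring the answer must contain.
TEST_CASES = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]


def run_regression(cases):
    """Run all cases and return (pass_count, failures).

    Keeping structured results lets you compare pass rates across
    prompt revisions or across different models over time.
    """
    failures = []
    for prompt, expected in cases:
        answer = call_model(prompt)
        if expected not in answer:
            failures.append((prompt, expected, answer))
    return len(cases) - len(failures), failures


passed, failures = run_regression(TEST_CASES)
print(f"{passed}/{len(TEST_CASES)} cases passed")
```

Substring checks are crude; real tools layer on exact-match scoring, model-graded rubrics, or human review, which is exactly the part that remains hard to automate.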
What if we can’t evaluate model performance accurately? That puts a higher premium on everything else — UI/UX, brand, functionality, and more.