Model evaluation is still more art than science. New models claim superior performance every week. Practitioners have their own favorite models. Researchers continue to develop frameworks, only to have unique use cases break them. Some of the recurring challenges:
- Evaluation tests don’t reflect real-world usage. It’s difficult to build a high-quality test set that covers a seemingly endless number of natural-language use cases, and new use cases surface daily that aren’t reflected in the evaluation set.
- What metrics matter? How do you measure things like model “creativity”?
- Overfitting. A problem for every model (even ones that aren’t “machine learning”). LLMs are trained on trillions of tokens, some of which may include parts of the test set in some form, so benchmark scores can reflect memorization rather than capability.
- It’s expensive. Building and evaluating test datasets is costly, especially datasets graded by other LLMs.
Some more specific methods and where they fall short:
- Perplexity. Measures how well the model predicts a sample of text: the exponentiated average negative log-likelihood per token (a small sketch follows this list). Lower is better, but it doesn’t always correlate with human judgment and doesn’t work as well for comparing models across different tasks or tokenizers.
- GLUE (General Language Understanding Evaluation). A collection of NLP tasks (sentiment, entailment, similarity) used as a benchmark. Doesn’t cover open-ended generation, and modern models have largely saturated it and its successor, SuperGLUE.
- Human evaluation. The most direct signal of quality, but slow, expensive, and inconsistent across annotators.
- LLM evaluation. Using another LLM as a judge scales better than human review, but judge models carry their own biases (toward longer or more verbose answers, for example) and add cost of their own.
- BLEU (Bilingual Evaluation Understudy). Compares n-grams in the model’s output to reference outputs (see the BLEU sketch after this list). Sensitive to slight variations because only exact n-gram matches count. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and METEOR (Metric for Evaluation of Translation with Explicit ORdering) are variations that improve on BLEU.
- F1 Score/Precision/Recall. A classic way of measuring model quality on tasks with discrete labels. F1 is the harmonic mean of precision and recall, so it evaluates the balance between the two (see the sketch at the end of this list).
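To make the perplexity item concrete, here is a minimal sketch that computes it from per-token log-probabilities. How you obtain those log-probabilities depends on your model or API; the function name and the hard-coded example below are illustrative, not part of any particular library.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(average negative log-likelihood per token).

    `token_logprobs` holds the natural-log probabilities the model
    assigned to each observed token; lower perplexity means the model
    found the text less surprising.
    """
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 10))  # 4.0
```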
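The BLEU sketch below uses NLTK’s implementation (assuming NLTK is installed; the toy sentences are made up) to show how exact n-gram matching penalizes a paraphrase that preserves the meaning.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference  = "the cat sat on the mat".split()
exact      = "the cat sat on the mat".split()
paraphrase = "the cat was sitting on the mat".split()

# Smoothing avoids zero scores when higher-order n-grams have no matches.
smooth = SmoothingFunction().method1
print(sentence_bleu([reference], exact, smoothing_function=smooth))       # 1.0
print(sentence_bleu([reference], paraphrase, smoothing_function=smooth))  # far lower, despite the same meaning
```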
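For precision/recall/F1, the scores fall out directly once model outputs are mapped to discrete labels. The sketch below uses scikit-learn (an assumption about the stack) on toy binary labels, e.g. whether each model answer was judged correct.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # ground-truth labels (toy data)
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]  # model predictions (toy data)

print("precision:", precision_score(y_true, y_pred))  # correct positives / predicted positives
print("recall:   ", recall_score(y_true, y_pred))     # correct positives / actual positives
print("f1:       ", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```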