Most AI models are just (1) an architecture (how many layers, what equations, what optimizers, etc.) and (2) parameters (weights, biases, etc.).
What happens when you take two models and merge them? Sometimes, interesting things.
Model merges (sometimes called “frankenmerges”) are today primarily the domain of hackers, not researchers or big corporations. Merging is cheap and dirty, and it takes a lot of trial and error.
The goal of model merging: ideally, combine the knowledge of multiple models without an expensive retraining step.
There are too many merged models to count, but a few examples:
- Goliath 120B: two fine-tuned Llama 2 70B models (Xwin and Euryale) stacked into a single 120B model.
- MythoMax: a blend of the Hermes, Chronos, Airoboros, and Huginn models.
- Toppy: a merge of OpenChat, Nous Capybara, Zephyr, AshhLimaRP-Mistral, and more.
Modifying the parameters directly modifies the model. But with billions of parameters, we have little understanding of which parameters do what, and the interactions between parameters are highly complex. Fine-tuning modifies some or all of the parameters, but in a way we can make (a little more) sense of (it just looks like training).
The main problem: which parameters need to be merged, and how should they be merged? How do you preserve the “stuff” you don’t want to change (general knowledge) while combining the “stuff” you do want in a single model (niche knowledge)?
Simple average (all parameters). Average the weights of two or more models. This is fairly common in the Stable Diffusion community, where you might merge two models with varying weights (e.g., 30% of a photorealistic base model and 70% of a cartoon base model). It’s the most straightforward method.
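In PyTorch terms, a linear merge is just a weighted sum over state dicts. A minimal sketch, assuming both models share the exact same architecture (`linear_merge` is an illustrative helper, not a library function):

```python
import torch

def linear_merge(state_dicts, weights):
    """Weighted average of parameters across models with identical architectures."""
    merged = {}
    for name in state_dicts[0]:
        # Each parameter tensor is averaged independently, weighted per model.
        merged[name] = sum(w * sd[name].float() for w, sd in zip(weights, state_dicts))
    return merged

# e.g., 30% photorealistic model, 70% cartoon model:
# merged = linear_merge([model_a.state_dict(), model_b.state_dict()], [0.3, 0.7])
```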
The rest of the methods try to isolate the important parameters and merge them more carefully (“smoothly”), combining knowledge rather than just averaging everything.
TIES (TrIm, Elect Sign & Merge). TIES tries to identify the parameters that actually matter to each fine-tune and ignores the rest. It works in three steps: trim each model’s task vector (its delta from a shared base model) down to the largest-magnitude changes, elect a majority sign for each parameter, and merge by averaging only the changes that agree with the elected sign.
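A simplified, self-contained sketch of the TIES idea over PyTorch state dicts (illustrative only, not mergekit’s implementation; `density` here is the assumed fraction of each task vector kept in the trim step):

```python
import torch

def ties_merge(base, finetuned, density=0.2):
    """Simplified TIES-Merging: trim, elect sign, then disjoint merge."""
    merged = {}
    for name, base_param in base.items():
        base_param = base_param.float()
        # Task vectors: what each fine-tune changed relative to the shared base.
        deltas = [sd[name].float() - base_param for sd in finetuned]

        # 1. Trim: keep only the top-`density` fraction of each delta by magnitude.
        trimmed = []
        for d in deltas:
            k = max(1, int(density * d.numel()))
            threshold = d.abs().flatten().kthvalue(d.numel() - k + 1).values
            trimmed.append(torch.where(d.abs() >= threshold, d, torch.zeros_like(d)))
        stacked = torch.stack(trimmed)

        # 2. Elect sign: per parameter, pick the sign with the most total mass.
        elected_sign = torch.sign(stacked.sum(dim=0))

        # 3. Merge: average only the deltas that agree with the elected sign.
        agree = torch.sign(stacked) == elected_sign
        merged_delta = (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)

        merged[name] = base_param + merged_delta
    return merged
```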
SLERP (Spherical Linear Interpolation). SLERP merges exactly two models at a time: instead of averaging their weights along a straight line, it interpolates along the arc of the sphere they lie on, which preserves the geometric properties of the parameters better than a plain linear blend.
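A per-tensor sketch of SLERP, assuming two models with identical shapes (real implementations, mergekit included, handle normalization and edge cases more carefully):

```python
import torch

def slerp(t, v0, v1, eps=1e-8):
    """Spherically interpolate two parameter tensors; t=0 gives v0, t=1 gives v1."""
    a, b = v0.float().flatten(), v1.float().flatten()
    # Angle between the two parameter vectors on their (hyper)sphere.
    cos_omega = torch.dot(a, b) / (a.norm() * b.norm() + eps)
    omega = torch.acos(cos_omega.clamp(-1.0, 1.0))
    if omega.abs() < eps:
        # Nearly parallel vectors: fall back to plain linear interpolation.
        return (1 - t) * v0 + t * v1
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / so) * v0 + (torch.sin(t * omega) / so) * v1
```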
mergekit is a utility that many hackers use to merge their models; it implements TIES, SLERP, linear averaging, and more.
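A mergekit run is driven by a YAML config. A minimal SLERP example in the spirit of the project’s README (the model names and layer ranges are placeholders; check the repo for the exact schema):

```yaml
slices:
  - sources:
      - model: your-org/model-a      # placeholder model names
        layer_range: [0, 32]
      - model: your-org/model-b
        layer_range: [0, 32]
merge_method: slerp
base_model: your-org/model-a
parameters:
  t: 0.5          # interpolation factor: 0 = all model-a, 1 = all model-b
dtype: float16
```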
It will be interesting to see how model merging evolves, and whether it goes from just a hacker’s bag of tricks to being useful at the cutting edge.