What would a token-efficient language for LLMs look like?
YAML is pervasive as a configuration language, partly because it’s more human-readable than JSON. But which is more token-efficient if you’re feeding it to an LLM?
Let’s look at some configuration languages as an example, but the concept also generalizes to programming languages.
Here’s an example object that covers most of JSON’s features (shown minified here):
{"name":"John Doe","age":30,"isStudent":false,"hobbies":["reading","running","painting"],"address":{"street":"123 Main St","city":"New York","zip":"10001"},"friends":[{"name":"Jane Smith","age":28},{"name":"David Johnson","age":32}]}
How does it tokenize with OpenAI models? Unminified, with standard indentation, the object is 337 characters, which map to 162 tokens.
Now, what if we use the equivalent YAML representation of the object? (Remember, YAML is a superset of JSON.) YAML takes only 227 characters, which map to 85 tokens.
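For reference, the equivalent YAML looks something like this (the exact character count depends on indentation and quoting choices):

name: John Doe
age: 30
isStudent: false
hobbies:
  - reading
  - running
  - painting
address:
  street: 123 Main St
  city: New York
  zip: '10001'
friends:
  - name: Jane Smith
    age: 28
  - name: David Johnson
    age: 32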
Finally, what if you minify the JSON by removing unnecessary whitespace, indentation, and newlines? That further reduces the representation to 223 characters but only 64 tokens. So minified JSON is the most token-efficient of the three representations for this configuration object, and I’d assume that holds for most similar data.
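If you want to reproduce these counts, here’s a minimal Python sketch using OpenAI’s tiktoken library. The encoding name is an assumption: counts vary by encoding, and the numbers above came from OpenAI’s web tokenizer, so pick the encoding that matches the model you care about.

import json
import tiktoken  # pip install tiktoken
import yaml      # pip install pyyaml

obj = {
    "name": "John Doe", "age": 30, "isStudent": False,
    "hobbies": ["reading", "running", "painting"],
    "address": {"street": "123 Main St", "city": "New York", "zip": "10001"},
    "friends": [{"name": "Jane Smith", "age": 28},
                {"name": "David Johnson", "age": 32}],
}

enc = tiktoken.get_encoding("cl100k_base")  # assumption: swap in your model's encoding
pretty = json.dumps(obj, indent=2)                  # unminified JSON
minified = json.dumps(obj, separators=(",", ":"))   # strip all optional whitespace
as_yaml = yaml.safe_dump(obj, sort_keys=False)      # block-style YAML
for label, text in [("pretty JSON", pretty), ("YAML", as_yaml), ("minified JSON", minified)]:
    print(f"{label}: {len(text)} chars, {len(enc.encode(text))} tokens")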
But what if we had a more token-efficient language for configuration — or even programming? What if the constraints have now evolved to be both human-readable and token-efficient?
Some other languages:
TOML: 91 tokens (minified: 79)
XML: 201 tokens (minified: 121)
HCL: 79 tokens
INI: 84 tokens
What would it look like?
Here’s a quick attempt at Tok-son.
name John Doe
age 30
isStudent false
hobbies
reading
running
painting
address
...street 123 Main St
...city New York
...zip '10001'
friends
...name Jane Smith
... age 28
...name David Johnson
... age 32
It registers as 61 tokens and 207 characters.
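Serializing the example object into this format is mechanical. Here’s a rough Python sketch (to_tokson and fmt_scalar are made-up helpers; it mirrors the example above, minus the optional leading spaces before “age”):

def fmt_scalar(v):
    # Booleans and numbers print bare; digit-only strings get quoted
    # so they stay strings, like the '10001' zip above.
    if isinstance(v, bool):
        return "true" if v else "false"
    if isinstance(v, str) and v.isdigit():
        return f"'{v}'"
    return str(v)

def to_tokson(value, depth=0):
    # '...' repeated `depth` times marks nesting; bare scalars in a
    # list go on their own lines, as in the hobbies example.
    prefix = "..." * depth
    lines = []
    if isinstance(value, dict):
        for key, val in value.items():
            if isinstance(val, (dict, list)):
                lines.append(prefix + key)  # a key on its own introduces a block
                lines.extend(to_tokson(val, depth + 1))
            else:
                lines.append(f"{prefix}{key} {fmt_scalar(val)}")
    elif isinstance(value, list):
        for item in value:
            if isinstance(item, (dict, list)):
                lines.extend(to_tokson(item, depth))
            else:
                lines.append(fmt_scalar(item))
    return lines

print("\n".join(to_tokson(obj)))  # obj from the earlier snippet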
And the grammar:
config = entry*;
entry = identifier, [indent, value | value_sequence | subentry_sequence ];
indent = '...';
value_sequence = value, whitespace, value_sequence | value;
value = string | number | boolean;
subentry_sequence = indent, entry, subentry_sequence | indent, entry;
identifier = identifier_char+;
number = ['-'], digit, {digit};
boolean = 'true' | 'false';
string = "'", character, { whitespace | character }, "'" | character, { character };
identifier_char = ? any printable character excluding whitespace, "=", and ":" ?;
whitespace = ws_char+;
ws_char = ' ' | '\t';
Far from ideal. But this quick experiment taught me some interesting things: there are a few ways to take advantage of a BPE tokenizer.
- Whitespace is your friend. A leading space is often absorbed into the token for a common word: “age” and “ age” are 1 token each. But there are no duplicate-whitespace tokens (at least for OpenAI’s tokenizers): one space = one token.
- Many of the brackets and punctuation runs in minified JSON were already merged into single tokens, which made it hard to be much more efficient than that.
- One potential trick is to use longer strings that are still one token. For example, “...” is only one token, and so are “****” and “******”. Sometimes these can even make things more readable. (A quick check of these claims follows this list.)
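A quick way to verify these claims yourself with tiktoken (again, exact counts depend on the encoding you choose):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumption: swap in your model's encoding
for s in ["age", " age", "  age", "...", "****", "******"]:
    print(repr(s), "->", len(enc.encode(s)), "token(s)")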
Some questions for future exploration:
- Is this even a good thing? Is there a trade-off between model ability and language familiarity? That is, will models generate well-known JSON more reliably than a made-up language?
- What about a different tokenizer? Different tokenizers produce different results. For example, LLaMA’s tokenizer outputs 149 tokens for the JSON, 98 for the YAML, and 84 for the minified JSON. (A sketch for checking this follows below.)
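For instance, here’s a sketch of counting with a LLaMA-family tokenizer via Hugging Face transformers. The checkpoint name is just an example, and the official LLaMA checkpoints are gated, so this assumes you have access:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # gated; assumes access
text = '{"name":"John Doe","age":30}'  # swap in the full object to reproduce the counts
print(len(tok.encode(text, add_special_tokens=False)), "tokens")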