Markdown isn’t a perfect abstraction. In fact, when it was initially released in 2004, the spec (if it could even be called that) was so ambiguous that dozens of flavors of Markdown co-existed for the next decade. A formal specification, CommonMark, wasn’t introduced until 2014.
Markdown was designed to be read-optimized. It was based on conventions as old as the Internet (plain-text emphasis in email, on Usenet, etc.), and it shipped as a simple text-to-HTML converter. It’s not as rich as Rich Text Format (RTF), HTML, or even Wikipedia’s wikitext, but it’s good enough.
Markdown hasn’t endured because it’s the perfect abstraction. It has endured because it’s a good enough abstraction. And in the world of AI datasets, it matters more than ever, precisely because it’s good enough:
- Good enough as a universal data format. The lowest common denominator wins when the training data is everything (books, websites, raw text), and Markdown is plain text that still preserves enough semantic structure (headings, lists, emphasis) not to lose too much along the way.
- Good enough structure. Text alone isn’t always enough; we need some form of metadata, structure, or even a programming language to make the abstraction useful. But structure can be lossy, and most of the time, that’s good enough.
- Good enough to be transformed. What’s easy to read is easy to write, and what’s easy to write is easy to convert. Markdown converts cleanly to HTML, PDF, or custom formats, and the structure of Markdown itself is easy to work with programmatically. The same can be said about text for LLMs: it’s easy to convert, summarize, or analyze. If you can write a regex over it, you can transform it (see the sketch after this list).
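To make that last point concrete, here’s a minimal sketch in Python. It’s a toy, not a real converter (CommonMark parsers handle far more edge cases), but it shows that a single regex is enough to turn one slice of Markdown, ATX headings, into HTML:

```python
import re

def headings_to_html(markdown: str) -> str:
    """Convert ATX headings (# through ######) into <h1>..<h6> tags."""
    def replace(match: re.Match) -> str:
        level = len(match.group(1))    # number of leading '#' characters
        text = match.group(2).strip()  # the heading text itself
        return f"<h{level}>{text}</h{level}>"

    # One regex, applied line by line via MULTILINE, covers this case.
    return re.sub(r"^(#{1,6})\s+(.+)$", replace, markdown, flags=re.MULTILINE)

print(headings_to_html("# Title\n\nSome prose.\n\n## A Section"))
# -> <h1>Title</h1>, the prose untouched, then <h2>A Section</h2>
```

Real converters build a proper syntax tree, but the point stands: Markdown’s surface is regular enough that even crude text tools produce useful results.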
Enduring abstractions aren’t always the philosophically pure ones. They’re the ones that model the way we actually interact with the world. Sometimes they’re a bit messy or lossy. But the Lindy ones are good enough.