With advancements in machine learning, screenshots are quickly becoming a universal data format. It's now (relatively) easy to extract meaning (image-to-text), layout information (object recognition), text (optical character recognition, OCR), and other metadata (formatting, fonts, etc.).
Now, with diffusion-based models like Stable Diffusion and DALL-E, we also have the encoder: text-to-image.
Screenshots-as-API solves a few problems:
- Easier to parse than highly complex layout formats. When I wrote about Rethinking the PDF, I didn't consider images an alternative. But image models are generic and don't need to understand the PDF encoding. Screenshots-as-API could mean significant changes for brittle, unofficial interfaces like web crawlers. Now that websites are primarily dynamic, it isn't easy to fully hydrate a page, parse its layout, and extract the same experience an end-user would get (open-source browser automation tools like Google's Puppeteer make this easier, but there are many edge cases). What if it were easier to parse a screenshot of the page instead (see the sketch after this list)?
- Universally available, easily copyable. While images aren't the most efficient encoding for text, they can be the simplest for humans. Excel has had a screenshot-to-table feature for some time because some tables are notoriously tricky to copy (how do you solve that generically at the text level?), and with iOS 16 you can copy objects straight out of photos.
- Permissionless. Many applications won't let you export your data, but screenshots are always available (much like the early days of web crawling).
- More complex metadata. Look how effective image search is on mobile – you can search for people, places, things, and more. Some of this comes from the actual image metadata, but much of it is inferred by on-device models. Encoding the same metadata into traditional formats like PDF takes far longer.
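
As a rough sketch of the crawler idea above: render the page in a real browser, screenshot it, and run OCR over the image instead of parsing the DOM. This assumes Puppeteer and tesseract.js are installed; the URL is a placeholder, and any image-to-text or layout model could stand in for the OCR step.

```ts
import puppeteer from "puppeteer";
import Tesseract from "tesseract.js";

// Render a dynamic page the way an end-user would see it, capture it as an
// image, and extract the text from the screenshot rather than the DOM.
async function screenshotAsApi(url: string): Promise<string> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle0" }); // wait for hydration

  const imagePath = "page.png";
  await page.screenshot({ path: imagePath, fullPage: true });
  await browser.close();

  // OCR the screenshot; a multimodal model could also infer layout here.
  const { data } = await Tesseract.recognize(imagePath, "eng");
  return data.text;
}

screenshotAsApi("https://example.com").then((text) => console.log(text));
```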
An image is worth a thousand words.