Large language models are unique because you can get good results with in-context learning (i.e., prompting) at inference time. This is much cheaper and more flexible than fine-tuning a model.
But what happens when you have too much data to fit in a prompt but don’t want to fine-tune the model? How do you provide the proper context?
You have a few choices:
- Use the model with the largest context window. For example, most models have a limit of 4k tokens (prompt and completion included), but GPT-4 has a 32k-token window.
- Use a vector database to perform a similarity search to filter down the relevant context for the model. Only a subset (e.g., the three most similar sentences or paragraphs) is included in the prompt.
- Use a traditional search engine (e.g., Elasticsearch, Bing) to retrieve information. Compared to similarity search, there’s more semantic work to do here (but possibly more relevant results).
- Use an alternative architecture where the model itself does the routing to more specific models or performs the information retrieval (e.g., Google’s Pathways architecture).
What will be the dominant architecture in the future? Let’s do some napkin math on the cost of the different methods. It’s a bit of an apples-and-oranges comparison (some use cases only work with a specific method), but here we only look at the use case of augmenting in-context learning with the relevant data.
(Let’s assume 1 page ~= 500 words, 1 sentence ~= 15 words, and 1 word ~= 5 characters.)
Using the largest model. With large context lengths, let’s estimate there’s a 9:1 split between prompt tokens (currently $0.06/1k tokens) and sampled tokens ($0.12/1k tokens). This comes out to a blended $0.066 / 1k tokens.
Per OpenAI’s tokenizer rule of thumb, 1 token ~= 4 characters in English, or 100 tokens ~= 75 words.
At the full 32k-token capacity, that’s $2.112 per query, or about 24,000 words of context.
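The same arithmetic in a few lines of Python (the prices, the 9:1 split, and the words-per-token conversion are just the assumptions stated above):

```python
# Napkin math for the "largest context window" option.
PROMPT_PRICE = 0.06 / 1000       # $ per prompt token
COMPLETION_PRICE = 0.12 / 1000   # $ per sampled token
CONTEXT_TOKENS = 32_000          # GPT-4's 32k window
WORDS_PER_TOKEN = 0.75           # 100 tokens ~= 75 words

blended = 0.9 * PROMPT_PRICE + 0.1 * COMPLETION_PRICE  # $0.066 per 1k tokens
cost_per_query = CONTEXT_TOKENS * blended              # ~$2.112
words_per_query = CONTEXT_TOKENS * WORDS_PER_TOKEN     # 24,000 words

print(f"blended rate: ${blended * 1000:.3f}/1k tokens")
print(f"full-context query: ${cost_per_query:.3f} for {words_per_query:,.0f} words")
```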
Vector search. You can convert chunks of text to vectors; for simplicity, let’s say one sentence per vector. In practice, chunks might be larger (paragraphs) or shorter (single tokens).
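As a rough sketch of what that retrieval step looks like (a hypothetical `embed()` stands in for a real embedding model, and brute-force cosine similarity stands in for the vector database; nothing here is tied to a specific provider’s API):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for a real 1536-dim embedding model.
    Here it's just a pseudo-random unit vector derived from the text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(1536)
    return v / np.linalg.norm(v)

# One vector per sentence, as assumed above.
corpus = [
    "GPT-4 has a 32k-token context window.",
    "Blended token pricing comes out to $0.066 per 1k tokens.",
    "A standard Pinecone tier fits roughly 5mm 768-dimensional vectors.",
]
matrix = np.stack([embed(s) for s in corpus])

def top_k(query: str, k: int = 3) -> list[str]:
    """Return the k most similar sentences (cosine similarity is just a
    dot product on unit vectors); only these go into the prompt."""
    scores = matrix @ embed(query)
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

print(top_k("How big is GPT-4's context window?", k=2))
```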
Vector sizes. Let’s use 1536 dimensions since that’s the size of OpenAI’s embeddings. In practice, you would probably use a lower dimensionality to store in a vector database (768 or even 256). Pinecone, a vector database, has a standard tier that roughly fits up to 5mm 768-dimensional vectors, costing $0.096/hour or ~$70/mo. This includes compute. Let’s assume this equates to 2.5mm 1536-dimensional vectors.
A rough calculation of the storage size required for 2.5mm 1536-dimensional vectors (assuming float32).
2.5mm vectors * 1536 dimensions * 4 bytes per dimension ~= 15GB
Even at the most conservative chunking above (a single token per vector), that’s about 1.875mm words of coverage; at a sentence per vector, it’s closer to 37.5mm. Either way, it’s significantly larger than even the largest context window. Assuming 100 queries/day, that’s $0.023 per query.
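The same storage and per-query numbers, worked through in Python (the vector count, tier price, and 100 queries/day are the assumptions above; each vector is treated as one token of coverage, matching the conservative case):

```python
# Napkin math for the vector database option.
VECTORS = 2_500_000
DIMS = 1536
BYTES_PER_FLOAT32 = 4
MONTHLY_COST = 0.096 * 24 * 30   # ~$69/mo for the tier described above
QUERIES_PER_DAY = 100
WORDS_PER_TOKEN = 0.75           # one token (~0.75 words) of coverage per vector

storage_gb = VECTORS * DIMS * BYTES_PER_FLOAT32 / 1e9   # ~15.4 GB
words_covered = VECTORS * WORDS_PER_TOKEN                # ~1.875mm words
cost_per_query = MONTHLY_COST / (QUERIES_PER_DAY * 30)   # ~$0.023

print(f"storage: {storage_gb:.1f} GB")
print(f"words covered: {words_covered:,.0f}")
print(f"cost per query: ${cost_per_query:.3f}")
```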
Of course, you still need to put the retrieved documents in the prompt (and pay for those prompt tokens).
Essentially, as long as the similarity search trims at least ~1% of the prompt tokens ($0.023/$2.112 ≈ 1.1%), you should run the vector search first. Today, that seems like a no-brainer.
Both the numerator ($/vector-search query) and the denominator ($/token) are likely to decrease over time, but the $/token cost will probably fall much faster than the vector database cost. If token costs fall 10x faster, we’re looking at a ~10% break-even threshold. That might be a different story.
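A quick check of that break-even threshold, plus the hypothetical case where token prices fall 10x relative to the vector database:

```python
VECTOR_QUERY_COST = 0.023  # $ per vector-database query, from above
FULL_PROMPT_COST = 2.112   # $ per full 32k-token query, from above

# Fraction of prompt tokens the similarity search must trim to pay for itself.
break_even = VECTOR_QUERY_COST / FULL_PROMPT_COST
print(f"today: trim at least {break_even:.1%} of tokens")          # ~1.1%

# If token prices fall 10x faster than vector-database prices:
print(f"10x cheaper tokens: trim at least {break_even * 10:.1%}")  # ~10.9%
```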
Additional costs to weigh: maintaining the vector database infrastructure, the added latency of a database call before the model call (though who knows how slow the 32k-prompt models will be; that’s a different calculation), the margin of error (which approach finds the relevant information more often?), and the developer experience (a data pipeline vs. “put it all in the prompt”).