Retrieval Augmented Generation (RAG) addresses a few problems with LLMs:
- Adds contextual private information without fine-tuning.
- Effectively extends the amount of information an LLM can consider beyond its fixed context window.
- Combats hallucination by grounding responses in ground-truth documents.
- Can also “cite” those documents in the output, making the model more explainable.
But there’s no single RAG pipeline or strategy. Most pipelines today involve a vector database, and there are plenty of strategies developers use to improve RAG pipeline performance:
- Chunking data. Documents can be chunked into smaller pieces to make semantic search more precise. Chunking is also a natural constraint when the retrieved text will be added to the prompt and must fit inside the context window. Instead of matching a whole document to a query, you might match a page, section, or paragraph. There’s likely no one-size-fits-all approach, as different document types can be logically chunked in different ways (see the chunking sketch after this list).
- Multiple indices. Split the document corpus into multiple indices and route each query to one of them based on some criteria. The search then runs over a much smaller set of documents rather than the entire dataset. Again, it is not always useful, but it can help for certain datasets. The same routing approach works with the LLMs themselves (see the routing sketch after this list).
- Custom embedding model. Fine-tuning an embedding model can help with retrieval. This is useful when the notion of similarity in your document set differs from what a general-purpose embedding model captures (see the fine-tuning sketch after this list).
- Hybrid search. Vector search isn’t always (or usually) enough. You often need to combine it with keyword search, traditional relational databases, and other ways of filtering documents (see the rank-fusion sketch after this list).
- Re-ranking. First, the initial retrieval method collects an approximate list of candidates. Then a re-ranking algorithm orders the results by relevance to the query (see the cross-encoder sketch after this list).
- Upscaling or downscaling prompts. Optimize the query so that it works better in the search system. This could mean upscaling the query by adding more contextual information before doing a semantic search, or compressing it by removing potentially distracting and unnecessary portions (see the query-rewriting sketch after this list).
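To make chunking concrete, here is a minimal sketch that splits raw text into overlapping, fixed-size chunks. The chunk size and overlap are arbitrary illustrative values, and real pipelines often split on paragraph or section boundaries instead.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping, fixed-size character chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = end - overlap  # overlap preserves context that spans a boundary
    return chunks
```

Each chunk, rather than the whole document, is then embedded and indexed.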
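For routing across multiple indices, the sketch below picks an index by naive keyword matching. The index names, keywords, and placeholder search functions are made up for illustration; a real router might be a small classifier or an LLM call.

```python
from typing import Callable

# Hypothetical per-topic indices, each exposing a search callable.
INDICES: dict[str, Callable[[str], list[str]]] = {
    "billing": lambda query: [],   # placeholder search functions
    "product": lambda query: [],
    "general": lambda query: [],
}

ROUTING_KEYWORDS = {
    "billing": ["invoice", "refund", "charge"],
    "product": ["install", "configure", "api"],
}

def route_query(query: str) -> str:
    """Choose an index by keyword matching; fall back to 'general'."""
    lowered = query.lower()
    for index_name, keywords in ROUTING_KEYWORDS.items():
        if any(word in lowered for word in keywords):
            return index_name
    return "general"

def search(query: str) -> list[str]:
    """Route the query, then search only the chosen index."""
    return INDICES[route_query(query)](query)
```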
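Fine-tuning an embedding model is heavier, but a rough sketch with the sentence-transformers library looks like the following. The base model, the loss, and the (query, passage) pairs are all placeholder assumptions, and newer library versions offer a trainer-based API instead of `fit`.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder (query, relevant passage) pairs drawn from your own domain.
train_examples = [
    InputExample(texts=["reset 2FA token", "To reset two-factor authentication, ..."]),
    InputExample(texts=["invoice late fee", "A late fee is applied when ..."]),
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # base model is an arbitrary choice
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)  # pulls paired texts closer together

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("domain-embedding-model")
```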
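One simple way to combine a vector-search ranking with a keyword-search ranking is reciprocal rank fusion (RRF). The sketch below assumes you already have ranked lists of document IDs from each retriever; k = 60 is a conventional default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into a single ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns 1 / (k + rank) from every list it appears in.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a vector-search ranking with a keyword-search ranking.
fused = reciprocal_rank_fusion([
    ["doc3", "doc1", "doc7"],  # from vector search
    ["doc1", "doc9", "doc3"],  # from keyword search
])
```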
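For re-ranking, a common pattern is to score each (query, candidate) pair with a cross-encoder. This sketch uses the sentence-transformers CrossEncoder class; the model name is a publicly available MS MARCO cross-encoder, and the candidate list is assumed to come from the first-stage retriever.

```python
from sentence_transformers import CrossEncoder

# A publicly available relevance model; any cross-encoder could be substituted.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Score each (query, candidate) pair and keep the top_k most relevant."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```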
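Query rewriting is usually just another LLM call before retrieval. The sketch below uses the OpenAI Python client to expand a terse query; the model name and the prompt wording are placeholders, and the same pattern works for compressing an overly long query.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def expand_query(query: str) -> str:
    """Ask an LLM to add synonyms and context so semantic search matches better."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {
                "role": "system",
                "content": (
                    "Rewrite the user's search query with relevant synonyms and "
                    "context so it retrieves better documents. "
                    "Return only the rewritten query."
                ),
            },
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content
```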