Gemini Gives Us Retrieval Augmented Generation with Confidence

Image by Imagen2

Digging into the recently released API for Gemini in Google AI Studio, I discovered what I think might be a hidden gem. They have tools they group under "Semantic retrieval" for building Retrieval Augmented Generation (RAG) systems. This includes a tailored model and API that assist in response generation. But there are some distinctly different elements to it that piqued my interest. This is just a preliminary exploration, and I look forward to digging into these features further.

What is Retrieval Augmented Generation?

This will merit another article to go into depth, but RAG is a way of allowing users to ask questions through an AI model and ground the answer in an external authoritative resource. It usually involves three phases:

  • Turning the user question into a concrete, conventional digital representation. This could be something like an SQL query, a data structure containing matching keywords, or an embedding.

  • Using conventional (non-AI) methods to find results, such as documents, that match the request. This could be done by running the SQL query, for example, or by doing a nearest neighbor search with the embedding vector. The results are a set of data that, in some form, answer the question asked.

  • Prompting another language model with the original question and these sample answers to produce a result that gets sent to the user.
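The three phases above can be sketched end to end with toy components. Everything here is illustrative: the bag-of-words "embedding" and the generate() stub are hypothetical stand-ins for a real embedding model and a real LLM call.

```python
from collections import Counter
import math

DOCS = [
    "Gemini Pro is a generative AI model from Google.",
    "RAG grounds model answers in external documents.",
    "Embeddings map text to vectors for similarity search.",
]

def embed(text):
    # Phase 1: a toy bag-of-words "embedding" (a real system would
    # call an embedding model instead).
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, docs, k=2):
    # Phase 2: a conventional nearest-neighbor search over the corpus.
    q = embed(question)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def generate(question, passages):
    # Phase 3: a stand-in for prompting an LLM with the question
    # plus the retrieved passages.
    return f"Based on {len(passages)} passage(s): {passages[0]}"

answer = generate("What grounds RAG answers?",
                  retrieve("What grounds RAG answers?", DOCS))
```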

Our current tools can do this quite adequately, and there are several products (such as Google's Vertex AI Search and Conversation) that can do this out of the box and still provide API access. However, Gemini and Google AI Studio offer two interesting elements that set this approach apart. The first is the Attributed Question and Answering API (the AQA API), which helps with the "Generation" part, and the second is the Semantic Retriever API, which, as the name suggests, helps with the "Retrieval" part.

The Attributed Question and Answering API

In short, the AQA API allows access to a generative AI model based on Gemini Pro. This model is similar to other Gemini models since it takes a prompt, temperature and other model parameters, safety restrictions, etc. It differs, however, by also allowing you to provide a set of passages which you think adequately answer the question.

The results are what make this different from using typical LLMs to generate results with RAG. While it does give an answer, as you would expect, it also gives two additional pieces of information:

  • Which passages, provided through an attribute called inlinePassages, were used to generate the result (the "attribution" part of the name). This allows you and the user to verify the information and where it came from.

  • A confidence score called answerableProbability, indicating how well the answer is actually grounded in the question and the provided passages.

As developers, we can use this confidence score to determine how to present the results. For example, with a confidence score of 90% we would present the results clearly and confidently. However, with a score under 50%, we might include a warning that the results may be inaccurate, or we might not present the results at all and instead say that we didn't know.
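That decision logic is easy to sketch. The AqaResult class below is a hypothetical stand-in mirroring the fields described above (the real API's response object will be shaped differently), and the thresholds are the illustrative ones from the paragraph, not values prescribed by the API.

```python
from dataclasses import dataclass

@dataclass
class AqaResult:
    # Hypothetical container mirroring the fields described above:
    # the answer text, the passages it was attributed to, and the
    # answerableProbability confidence score.
    answer: str
    attributed_passage_ids: list
    answerable_probability: float

def present(result: AqaResult) -> str:
    # Map the confidence score to a presentation strategy.
    if result.answerable_probability >= 0.9:
        return result.answer
    if result.answerable_probability >= 0.5:
        return f"{result.answer}\n(Note: this answer may be inaccurate.)"
    return "Sorry, I don't know the answer to that."

print(present(AqaResult("Paris is the capital.", ["p1"], 0.95)))
```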

The Semantic Retriever API

The AQA API also goes a step further. In addition to accepting passages inline, you can create a corpus of documents and point to that corpus when asking it to generate a response to the question. These documents must be broken up into chunks of about 2k, and each document and chunk can carry additional metadata.
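As a rough illustration of that chunking step (interpreting "2k" as roughly 2,000 characters here; the actual API may define the limit differently, e.g. in tokens), a simple whitespace-aware splitter might look like:

```python
def chunk_document(text: str, max_len: int = 2000):
    # Split a document into chunks of at most max_len characters,
    # breaking on whitespace so words stay intact. Each chunk carries
    # simple positional metadata, since the corpus allows per-chunk metadata.
    chunks, start, index = [], 0, 0
    while start < len(text):
        end = min(start + max_len, len(text))
        if end < len(text):
            # Back up to the last space so we don't split a word.
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        chunks.append({"index": index, "start": start,
                       "text": text[start:end].strip()})
        start, index = end, index + 1
    return chunks
```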

Then, when you ask the model to answer a question, you can point it at this corpus. It will identify several of the "best" answers for the question, and provide them to the model as if they were inline passages.
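A toy version of that corpus-backed flow, with word-overlap ranking standing in for the real semantic retrieval and the top chunk standing in for a generated answer; the point is the shape of the output, which pairs the answer with the ids of the chunks it was grounded in:

```python
def answer_from_corpus(question, chunks, k=2):
    # Rank chunks by word overlap with the question (a stand-in for
    # semantic retrieval), then return an "answer" together with the
    # ids of the chunks it was grounded in (the attribution).
    q_words = set(question.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(q_words & set(c["text"].lower().split())),
        reverse=True,
    )
    best = scored[:k]
    return {
        "answer": best[0]["text"],  # a real model would synthesize a reply
        "attributed_chunk_ids": [c["id"] for c in best],
    }
```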

In this case, the attributed results include any identifiers you supplied, so you can fetch the metadata and present it to the user. And, as always, the response includes the answer probability, so you can decide how you wish to handle replying with these results.

But why?

If you're familiar with other RAG methods, you're probably asking yourself: "Self, why would I use this method over other solutions - either pre-made products or putting it together yourself via something like LangChain?" The answer depends on your goals and how you use the API.

Using it just as a generator gives you the confidence score, which allows you to avoid passing off replies that aren't actually backed by the grounding information.

Using it as a full RAG system gives you a bit more control over the LLM parameters than many pre-built systems. When compared to putting all the pieces together yourself, it should also be faster, since you're only making one API call rather than three or more.

As with most solutions, there are tradeoffs involved in every approach. Still, I think this solution fits a really nice middle ground that people should explore.

What's next?

There is a lot more to test here - from code that actually uploads and manages data in the corpus to effectively using the model. While all of this looks straightforward, I'm anticipating some "gotchas" that will make this trickier. (For example, how can we express other contextual information? Can we do that in the prompt? And many more questions!)

And then, of course, all the other new features that Gemini brings!

More on all of these in a later post.