
How to tame your GenAI model: RAG, fine-tuning and all that jazz

Updated: Feb 13

In my experience as a trainer and consultant, the most common need companies express with regard to generative AI is how to ground a model on reliable information and how to make sure that it responds in a predictable and consistent fashion. In a sense, this is a paradoxical situation, as the main leap forward offered by generative models over more traditional solutions, such as Q&A chatbots, lies precisely in the greater flexibility and "creativity", for lack of a better word, that these models offer. So, the first question to ask yourself is whether you really need to deploy a generative mountain, so to speak, only to give birth to a question-answering mouse.


The desire to deploy a generative model at all costs might be at least partially due to the hype surrounding this field, to some sort of technological fear of missing out. This is not necessarily a bad thing: exposure to new technologies can keep your skillset up to date, and a very limited implementation can be the gateway to more comprehensive solutions. We all prefer the new toys and work, after all, is a continuation of play by other means. This type of motivation, however, is a secret that we techies typically keep to ourselves, as it wouldn't really fly in a budget planning meeting. There are, however, more socially acceptable reasons for wanting to use a GenAI model for something like a technical support chatbot.


Indeed, as much as you can train your traditional Q&A bot to recognise synonyms, grammatical inflections and all sorts of variations, the process is very time consuming and it is really difficult to achieve the desired level of flexibility. On top of these advantages on the question-parsing side, there are also obvious advantages on the answer-production side: users can ask for answers in a specific format, in a specific style and even in a different language, far beyond what the bot designer can predict. There might be some nooks and crannies in that mountain which are still useful, even for a mouse.


So, we are back to the problem of how to tame the model, to make sure that all this flexibility does not come at the expense of reliability. The first tools that come to mind are the system prompt and the temperature parameter.


The system prompt (a hidden prompt which is sent before the user prompt, in each and every completion request) allows you to specify the style of communication and the aim of the bot ("I'm a technical assistant answering questions on Microsoft Surface", for instance), and even serves as a place to hide rule-like instructions.
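
To make this concrete, here is a minimal sketch of how such a hidden system prompt travels with every completion request, using the OpenAI Python SDK; the model name and the prompt text are placeholders, not a prescription.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The system prompt is prepended to every request, invisibly to the end user.
SYSTEM_PROMPT = (
    "I'm a technical assistant answering questions on Microsoft Surface. "
    "Answer concisely and politely; never discuss other products."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "My Surface won't turn on. What should I check first?"},
    ],
)
print(response.choices[0].message.content)
```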


The temperature, on the other hand, is a parameter sent with each completion request, which specifies the desired level of "creativity" the bot should display. More precisely, it's the desired level of unpredictability: the higher the value (typically a decimal number between 0 and 1, though some APIs accept values up to 2), the more likely it is that the model will give a different answer to the same question, and the greater the variation between those answers. The right choice depends on the aim of the bot: for technical, legal or medical applications you will probably want a low temperature; if you are looking for a new slogan for your marketing campaign, you might want to pass a higher one. The parameter is sent with each request (so, it's not a property of the model deployment as such), but you can make sure that the right temperature value is sent by the application that integrates the model, for instance the bot's web page. A little curiosity about the temperature parameter: someone recently told me that the name "temperature" refers to hot iron, which is the more malleable the hotter it is. You're welcome.
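
As a sketch, under the same assumptions as the snippet above, this is how the integrating application could pin the temperature on every request: low for predictable support answers, higher when some variation is welcome.

```python
from openai import OpenAI

client = OpenAI()

def ask(question: str, temperature: float) -> str:
    """Send one completion request with an explicit, per-request temperature."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": question}],
        temperature=temperature,
    )
    return response.choices[0].message.content

# Low temperature for a technical, repeatable answer...
print(ask("How do I reset a Surface Pro to factory settings?", temperature=0.1))
# ...and a higher one when creativity is the point.
print(ask("Suggest a slogan for our new tablet campaign.", temperature=0.9))
```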


System prompts and the temperature parameter can be part of the solution, but they are clearly not enough. The other essential part is grounding the bot on your own data, so that its responses can be informative, reliable and authoritative. Here two approaches come to mind: RAG (Retrieval Augmented Generation) and model fine-tuning. The latter is the equivalent of asking your model to complete a specialisation course after it got a foundational degree (the original training). Training is the most expensive phase in the model lifecycle, in terms of resources, time and money, and the same applies to that extra training which is fine-tuning. In some cases, you have no other choice: training an OCR model to recognise the handwriting of a particularly calligraphically challenged colleague, for instance, can only be done through fine-tuning. In the case of data grounding, however, RAG usually represents a better alternative.
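
For the record, this is roughly what enrolling the model in that "specialisation course" looks like with the OpenAI fine-tuning API; the file name, base model and example format are assumptions for illustration only.

```python
from openai import OpenAI

client = OpenAI()

# Training examples live in a JSONL file, one chat-formatted example per line, e.g.:
# {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
training_file = client.files.create(
    file=open("surface_support_examples.jsonl", "rb"),  # placeholder file name
    purpose="fine-tune",
)

# Launch the fine-tuning job; the result is a private, specialised copy of the base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # placeholder base model
)
print(job.id, job.status)
```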


With RAG, for each completion request, you tell the model to interrogate a specific data source and to use the results to form the completion. It's what Bing Copilot does with the Bing search engine, Gemini with the Google search engine and Microsoft 365 Copilot with Microsoft Graph, through which it accesses your OneDrive files, Exchange mailboxes and so on. To stay in the Microsoft field, you can use Azure AI Search as the preferred source to implement your grounding. In turn, AI Search can create a semantic (so, not only keyword-based) index from a very wide variety of document formats, such as PDF and Word, as well as from sources of structured data such as SQL Server. What's more, you can "enrich" these data by calling an entire pipeline of "skills", such as OCR, translation or image description.
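
Stripped to its essentials, the retrieve-then-generate loop can look like the sketch below, which pairs Azure AI Search with an Azure OpenAI chat deployment; the endpoints, keys, index name, field name and deployment name are all placeholders you would swap for your own.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

# Placeholder endpoints, keys, index and field names.
search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="support-docs",
    credential=AzureKeyCredential("<search-api-key>"),
)
openai_client = AzureOpenAI(
    azure_endpoint="https://<your-openai-resource>.openai.azure.com",
    api_key="<openai-api-key>",
    api_version="2024-06-01",
)

question = "How long is the warranty on a Surface Laptop?"

# 1. Retrieve: ask the index for the passages most relevant to the question.
results = search_client.search(search_text=question, top=3)
context = "\n\n".join(doc["content"] for doc in results)  # assumes a 'content' field

# 2. Augment and generate: hand the retrieved passages to the model as grounding context.
response = openai_client.chat.completions.create(
    model="<your-chat-deployment>",  # deployment name, not a model family name
    messages=[
        {"role": "system", "content": "You are a support assistant. Use the following context when answering.\n\n" + context},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```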


One advantage of this approach is better separation of concerns and better reusability: you can use the same index to ground your GenAI model, but also to power an internal search website, for example. Depending on the fields you decide to capture in the index, you can provide links back to the original documents. With fine-tuning, on the other hand, once the model has digested the information provided by the documents used during the (extra) training, it forgets where this information came from: it makes it its own, so to speak, which is great... unless you want a link back to the document. Moreover, this acquired knowledge resides in the model itself and, therefore, cannot be shared with other applications. If the documents change, fine-tuning would require re-fine-tuning the model; with RAG, you would only need to update the index.


The difference between RAG and fine-tuning, however, goes deeper than that: we should think of fine-tuning as a way of learning new capabilities, new ways of doing things, such as producing responses in a newly acquired style. It's more like learning to drive than learning the capital of France. Which brings us to the last and, in my opinion, decisive consideration in the choice between RAG and fine-tuning: with RAG, you can tell the model to generate responses if and only if they can be grounded in the data. This is the safest way to make sure that the responses it generates are reliable... or at least as reliable as your data.
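
In practice, that "only if grounded" rule usually boils down to a stricter system prompt wrapped around the retrieved context; the exact wording below is only an illustration, to be tuned to your own bot.

```python
# A stricter grounding instruction (wording is illustrative, not prescriptive).
GROUNDED_ONLY_PROMPT = (
    "Answer exclusively from the context provided below. "
    "If the context does not contain the answer, say you don't know "
    "and do not try to improvise one.\n\nContext:\n{context}"
)
```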


 
 
 
