Large language models (LLMs) can perform multiple tasks associated with language understanding, summarization, translation, extraction and question answering across various topics and domains. But while performing these tasks, they are prone to hallucinate: they produce plausible-sounding answers that are, in fact, false. These errors range from deviations from facts or contextual logic to contradictory or, worse, completely fabricated statements. All hallucinations are troublesome, but some can be verified against the source (intrinsic) while others are more challenging to verify (extrinsic). In our conversations on the topic of hallucinations, we often hear of methods to reduce them that simply do not work, while other methods have a reasonable shot. Here are a few commonly proposed solutions and our take on what is true, what is false and, as is often the case, what depends on other factors.
Loss measures how well the model generates the desired output compared to the ground truth: the loss function quantifies the distance between the predicted output and the actual target. In general, loss declines (barring overfitting) as these models grow in size, measured by the number of parameters and the amount of data used to train them.
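As a concrete illustration, the sketch below computes the cross-entropy loss between a model's predicted next-token distribution and the ground-truth tokens, which is the standard training loss for language models. The model outputs and targets here are random placeholders, not a real model.

```python
# Minimal sketch of the language-modeling loss: cross-entropy between the
# predicted next-token distribution and the ground-truth tokens.
import torch
import torch.nn.functional as F

vocab_size = 50_000
batch, seq_len = 2, 8

# Placeholder model outputs: one score (logit) per vocabulary token at each position.
logits = torch.randn(batch, seq_len, vocab_size)

# Placeholder ground-truth next tokens the model should have predicted.
targets = torch.randint(0, vocab_size, (batch, seq_len))

# Cross-entropy quantifies the distance between the predicted distribution
# and the actual target token; training drives this value down.
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
print(loss.item())
```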
Larger models will hallucinate less. It depends on the task. The number of parameters in LLMs has continued to increase (almost tenfold) year over year, but hallucinations remain an issue. Research conducted by the University of Oxford on GPT-3 introduces two concepts: truthfulness and informativeness. An answer is truthful if and only if it avoids asserting a false statement; an answer is informative if it is potentially relevant to the question. Truthfulness is the accuracy of the information and is thus more closely related to hallucination, whereas informativeness is the completeness of the information captured in the answer. The University of Oxford study reported that average truthfulness on practical questions drops from 38% to 21% as model size increases from 350M to 175B parameters. Even as the number of parameters increases, these models are still learning the most frequent relations, and their ability to capture the long tail remains suspect. One reason is that more parameters mean more possible paths the model can take when selecting the next word (token), which can lead to more hallucinations. Meanwhile, in the same study, average informativeness increases from 75% to 99% as model size increases from 350M to 175B parameters.
Using more data to train the model will reduce hallucination. False. The quality of the data is critical to reducing hallucination: using incomplete or biased data in LLM training leads to more hallucinations, and adding more such data is unlikely to help. For example, LIMA (Less Is More for Alignment) is a fine-tuned 65B-parameter version of LLaMa trained on 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling, to align the model to follow a specific response format and input style and produce high-quality output. It has been shown to perform comparably to other, bigger fine-tuned models. LIMA is less likely to hallucinate than the base LLaMa on specific tasks even though both have the same number of parameters, because the former is fine-tuned on very carefully curated information.
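To make the point about curation concrete, here is a minimal sketch of what a small, LIMA-style instruction data set looks like; the prompts and responses are invented placeholders, not drawn from the actual LIMA data.

```python
# Illustrative sketch of a small, carefully curated prompt-response data set.
# Quality and consistency of the pairs matter more than volume: roughly 1,000
# such examples were enough for LIMA-style alignment.
curated_examples = [
    {
        "prompt": "Summarize the refund policy for economy tickets in two sentences.",
        "response": "Economy tickets are refundable within 24 hours of purchase. "
                    "After that, only taxes and fees are returned.",
    },
    {
        "prompt": "Explain what a layover is to a first-time flyer.",
        "response": "A layover is a stop between flights where you wait for a "
                    "connecting plane, usually without leaving the airport.",
    },
]

# Basic quality gate: every pair must be non-empty and follow the same format.
for ex in curated_examples:
    assert ex["prompt"].strip() and ex["response"].strip()
```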
Higher diversity in the data is likely to reduce hallucination. True. Training these models on diverse data sets helps to improve their reasoning capability. As these models’ reasoning capability increases, overall hallucination decreases. However, diversity may still increase hallucination for specific tasks such as language translation.
Prompt engineering, the practice of providing clearer and more specific context to an LLM to improve its responses, is one of the most common ways to get the model to respond more accurately. Some approaches include:
Better context reduces hallucination. True. Providing clear and specific prompts helps LLMs generate the most relevant and accurate outputs, thereby reducing hallucination. For example, instead of asking, “What is a Jaguar?” when we could be referring to the car or the animal, we should ask, “What are the different models of a Jaguar?”
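A small sketch of the same idea in code: the two prompts below differ only in specificity, and `ask_llm` is a hypothetical stand-in for whatever chat-completion client you use.

```python
# Sketch of prompt specificity. `ask_llm` is a hypothetical placeholder for
# an actual LLM client call.
def ask_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your LLM of choice and return its reply."""
    raise NotImplementedError

# Ambiguous: "Jaguar" could refer to the car brand or the animal.
vague_prompt = "What is a Jaguar?"

# Specific: the wording pins down the intended sense and the expected output,
# leaving the model less room to fill gaps with plausible-sounding fabrication.
specific_prompt = (
    "List the current car models sold by the automaker Jaguar, "
    "with one sentence describing each."
)

# answer = ask_llm(specific_prompt)
```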
Longer context reduces hallucination. False. Longer context does not directly reduce hallucination; it is the clarity and specificity of the prompt that help. A longer prompt is useful only insofar as the extra length makes the request more specific.
Adding more examples as context helps reduce hallucination. True. Including multiple examples of the desired output, format or context sets clear expectations and helps reduce hallucination. For example, the agent-customer exchanges below improve LLM responses when supplied as examples in the prompt (a sketch of how to assemble them follows the exchanges):
[SHOT 1]
Customer: Hello, I have a flight booked with your airline, and I need some assistance with my reservation.
Agent: Hello! Could you please provide me with your booking reference number or the email address used to make the reservation?
[SHOT 2]
Customer: My booking reference number is REF123.
Agent: Thank you for providing the reference number. How can I assist you with your reservation today?
[SHOT 3]
Customer: I would like to request a seat upgrade for my upcoming flight. Is that possible?
Agent: Certainly! Let me check the availability for seat upgrades on your flight. May I know your flight details, such as the date and destination?
[SHOT 4]
Customer: My flight is on July 15 from New York to London.
Agent: Thank you for the information. Let me check the availability of seat upgrades for that specific flight.
[SHOT 5]
Agent: I apologize, but there are no seat upgrades available for your flight on July 15. However, I can add you to the waiting list in case any upgrades become available. Would you like me to proceed with that?
Customer: Yes, please add me to the waiting list. Thank you.
Agent: You’re welcome! I’ve added you to the waiting list for a seat upgrade. If any seats become available, we will notify you. Is there anything else I can assist you with?
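Here is a minimal sketch of how such exchanges can be assembled into a few-shot prompt; the `build_prompt` helper and the system instruction are illustrative assumptions, not a prescribed format.

```python
# Sketch of few-shot prompting: prior agent/customer exchanges are concatenated
# into the prompt so the model sees the desired format and tone before
# completing the live customer turn.
FEW_SHOT_EXAMPLES = """\
Customer: Hello, I have a flight booked with your airline, and I need some assistance with my reservation.
Agent: Hello! Could you please provide me with your booking reference number or the email address used to make the reservation?
Customer: I would like to request a seat upgrade for my upcoming flight. Is that possible?
Agent: Certainly! Let me check the availability for seat upgrades on your flight. May I know your flight details, such as the date and destination?
"""

def build_prompt(customer_message: str) -> str:
    # The examples set expectations for structure and tone; the new message is
    # appended as the turn the model must complete.
    return (
        "You are a polite airline support agent. Follow the style of these examples.\n\n"
        + FEW_SHOT_EXAMPLES
        + f"\nCustomer: {customer_message}\nAgent:"
    )

print(build_prompt("Can I change the date of my flight to July 20?"))
```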
Longer responses reduce hallucination. False. A “be concise” instruction is often used to restrict an LLM’s generation, but hallucination is not necessarily related to response length in either direction. Consider a factual contradiction such as: “The average distance between the Earth and the moon is about 93 million miles (149.6 million kilometers).” The response could be made even more concise and would still be false. (The actual average distance between the Earth and the moon is about 238,855 miles.)
Models fine-tuned in a very specific domain and context will have reduced hallucination. False. Fine-tuning is the process of taking a pre-trained model and adapting it to a specific task or domain by further training it on a smaller, task-specific data set. The pre-trained model is typically trained on a large-scale data set using a related task, such as language modeling or image classification. Fine-tuning leverages the knowledge and representations learned by the pre-trained model and applies them to a new, more specific task, helping an LLM respond to a specific task such as predicting the next event in a customer’s journey. Fine-tuning on a specific data set increases the expertise of these models, but their conversational capability, defined as the ability to engage in and sustain coherent and contextually appropriate conversations with users, may not improve. This can lead to more contradictory and nonsensical hallucinations from the smaller, specialized model. This is intuitive: a smaller, more context-specific data set leaves more for the model to figure out on its own. Remember that an LLM is a stochastic parrot; it does not learn, it merely repeats what it has been fed.
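For readers who want to see what fine-tuning looks like mechanically, here is a minimal sketch using Hugging Face Transformers. The base model (`gpt2`), the file `domain_examples.jsonl` and the hyperparameters are illustrative assumptions, and fine-tuning alone does not guarantee fewer hallucinations.

```python
# Minimal sketch of fine-tuning a pre-trained causal language model on a small,
# task-specific data set. Model name, data file and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # stand-in for a larger base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical JSON-lines file with {"text": ...} records from the target domain.
dataset = load_dataset("json", data_files="domain_examples.jsonl")["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finetuned-model",
        num_train_epochs=3,
        per_device_train_batch_size=4,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```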
The statement “Large language models do not hallucinate” is itself false. All LLMs can be expected to hallucinate, considering they are trained on a large corpus of varied data sets that may contain incompleteness, contradictions, inconsistencies and other biases. The best course of action is to minimize these hallucinations using techniques such as one-shot and few-shot prompting, context injection and grounding in use-case-specific sources.
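As a closing illustration, here is a minimal sketch of context injection and grounding: a passage is retrieved from a use-case-specific knowledge base and injected into the prompt so the model is asked to answer only from that text. The toy keyword-overlap retriever and the knowledge-base entries are illustrative placeholders, not a production retrieval pipeline.

```python
# Sketch of grounding via context injection: retrieve a relevant passage from a
# use-case-specific source and constrain the model to answer from it.
KNOWLEDGE_BASE = [
    "Seat upgrades can be requested up to 48 hours before departure.",
    "Checked baggage allowance for economy fares is one bag of up to 23 kg.",
    "Flight changes made within 24 hours of booking are free of charge.",
]

def retrieve(question: str, passages: list[str]) -> str:
    # Naive retrieval: pick the passage sharing the most words with the question.
    q_words = set(question.lower().split())
    return max(passages, key=lambda p: len(q_words & set(p.lower().split())))

def grounded_prompt(question: str) -> str:
    context = retrieve(question, KNOWLEDGE_BASE)
    return (
        "Answer using only the context below. If the context does not contain "
        "the answer, say you do not know.\n\n"
        f"Context: {context}\n\nQuestion: {question}\nAnswer:"
    )

print(grounded_prompt("How much checked baggage can I bring on an economy ticket?"))
```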