Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures, collectively referred to as "hallucinations." Recent studies have demonstrated that LLMs' internal states encode information regarding the truthfulness of their outputs, and that this information can be used to detect errors. In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized. We first discover that truthfulness information is concentrated in specific tokens, and that leveraging this property significantly improves error detection performance. However, we show that such error detectors fail to generalize across datasets, implying that, contrary to prior claims, truthfulness encoding is not universal but rather multifaceted. Next, we show that internal representations can also be used to predict the types of errors the model is likely to make, facilitating the development of tailored mitigation strategies. Finally, we reveal a discrepancy between LLMs' internal encoding and their external behavior: they may encode the correct answer, yet consistently generate an incorrect one. Taken together, these insights deepen our understanding of LLM errors from the model's internal perspective, which can guide future research on improving error analysis and mitigation.
† Work done in part during an internship at Apple
‡ Technion – Israel Institute of Technology
§ Google Research