This article was accepted at the Foundation Model Interventions (MINT) Workshop at NeurIPS 2024.
Following instructions is crucial for building AI agents with large language models (LLMs), as these models must strictly adhere to user-provided guidelines. However, LLMs often fail to follow even simple instructions. To improve instruction-following behavior and prevent undesirable outcomes, we need a deeper understanding of how the internal states of LLMs relate to these outcomes. Our analysis of LLM internal states reveals a dimension in the input embedding space linked to successful instruction following. We show that modifying representations along this dimension improves instruction-following success rates compared to random changes, without compromising response quality. This work provides insight into the inner workings of LLM instruction following, paving the way for more reliable LLM agents.
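To make the kind of intervention described above concrete, here is a minimal, illustrative sketch of steering hidden representations along a fixed direction with a PyTorch forward hook. It is not the paper's implementation; the layer index, the `direction` vector, and the scaling factor `alpha` are placeholders one would obtain from an analysis like the one described here.

```python
# Illustrative sketch (not the paper's code): shift a transformer layer's
# hidden states along a fixed direction during the forward pass.
import torch


def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Return a forward hook that adds alpha * (unit direction) to layer outputs."""
    unit = direction / direction.norm()  # unit vector for a controlled step size

    def hook(module, inputs, output):
        # Many decoder layers return a tuple whose first element holds the
        # hidden states of shape (batch, seq_len, hidden_dim).
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * unit.to(device=hidden.device, dtype=hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return hook


# Hypothetical usage with a Hugging Face causal LM (layer index, `direction`,
# and alpha are assumptions for illustration, not values from the paper):
# layer = model.model.layers[15]
# handle = layer.register_forward_hook(make_steering_hook(direction, alpha=4.0))
# outputs = model.generate(**inputs)
# handle.remove()
```

A baseline comparison, as the abstract suggests, would repeat the same procedure with a randomly sampled direction of equal norm and compare instruction-following success rates and response quality.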