This paper was accepted at the Efficient Natural Language and Speech Processing (ENLSP) Workshop at NeurIPS 2024.
Large language models (LLMs) typically generate outputs token by token using a fixed compute budget, leading to inefficient resource utilization. To address this shortcoming, recent advances in mixture-of-experts (MoE) models, speculative decoding, and early exit strategies leverage the insight that computational demands can vary significantly depending on the complexity and nature of the input. However, identifying optimal routing patterns for dynamic execution remains an open challenge, limiting the full potential of these adaptive methods. To address this need, we study adaptive computation in LLMs more systematically. We propose a novel framework that integrates smaller auxiliary modules within each Feed-Forward Network (FFN) layer of the LLM. This design enables dynamic routing of tokens based on task complexity: at each layer, a token can be processed by the small module, the large module, or skip the layer entirely. It also allows us to introduce a novel notion of a token's difficulty, defined by its potential to benefit from additional computational resources. Importantly, by employing oracles to identify optimal patterns of adaptive computation, we gain valuable insights into the inner workings of LLMs and into routing processes in a simplified heterogeneous MoE setup. We show that trained routers operate differently from oracles and often produce suboptimal solutions. Notably, activating a large module in just a single layer outperforms models that use large modules in all layers, underscoring the gap between practical implementations of routing in MoE models and theoretical optima for adaptive computation.
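As a rough illustration of the design described above, the sketch below shows one possible per-layer adaptive FFN in PyTorch, where each token is routed to a small module, a large module, or skips the layer. The module and attribute names (AdaptiveFFN, small_ffn, large_ffn, router) and the hard argmax routing are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class AdaptiveFFN(nn.Module):
    """Illustrative adaptive FFN layer (assumed structure, not the paper's code):
    each token is routed to a small module, a large module, or skips the layer."""

    def __init__(self, d_model: int, d_small: int, d_large: int):
        super().__init__()
        self.small_ffn = nn.Sequential(
            nn.Linear(d_model, d_small), nn.GELU(), nn.Linear(d_small, d_model)
        )
        self.large_ffn = nn.Sequential(
            nn.Linear(d_model, d_large), nn.GELU(), nn.Linear(d_large, d_model)
        )
        # Router scores three options per token: 0 = skip, 1 = small, 2 = large.
        self.router = nn.Linear(d_model, 3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        route = self.router(x).argmax(dim=-1)  # hard per-token decision
        out = x.clone()                        # option 0: skip (identity)
        small_mask = route == 1
        large_mask = route == 2
        if small_mask.any():
            out[small_mask] = x[small_mask] + self.small_ffn(x[small_mask])
        if large_mask.any():
            out[large_mask] = x[large_mask] + self.large_ffn(x[large_mask])
        return out
```

In practice, the hard argmax routing shown here is not differentiable; training such a router typically relies on a soft or supervised mechanism (for example, the oracle routing patterns studied in the paper), and the decisions could equally be supplied by a precomputed oracle instead of a learned router.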