Large language model (LLM) inference has two phases: the prompt (or prefill) phase to generate the first token and the extension (or decoding) phase to generate subsequent tokens. In this work, we propose an efficient parallelization scheme, KV-Runahead, to accelerate the prompt phase. The key observation is that the extension phase generates tokens faster than the prompt phase because of the key-value cache (KV-cache). Hence, KV-Runahead parallelizes the prompt phase by orchestrating multiple processes to populate the KV-cache, minimizing the time-to-first-token (TTFT). Dual-purposing the KV-cache scheme has two main benefits. First, since the KV-cache is designed to leverage the causal attention map, we minimize computation and communication automatically. Second, since the KV-cache already exists for the extension phase, KV-Runahead is straightforward to implement. We further propose context-level load balancing to handle the uneven KV-cache generation caused by causal attention and to optimize TTFT. Compared with existing parallelization schemes such as tensor or sequential parallelization, where keys and values are generated locally and exchanged via collectives, our experimental results demonstrate that KV-Runahead delivers speedups of more than 1.4× and 1.6× for Llama 7B and Falcon 7B, respectively.
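To make the load-balancing motivation concrete: under causal attention, a later chunk of the prompt must attend to all earlier tokens, so an even split of tokens across processes leaves the last process with the most work. The sketch below is a toy Python illustration of this idea only, not the paper's implementation; the function names and the quadratic-cost heuristic are assumptions made for the example.

```python
# Toy sketch of context-level load balancing for parallel prefill.
# Under causal attention, prefilling tokens [s, e) of an L-token prompt
# costs roughly sum_{t=s}^{e-1} (t + 1) attention work, so equal-size
# chunks leave later ranks with more work. Choosing boundaries so that the
# cumulative cost is equal per rank (b_i ~ L * sqrt(i / p)) evens this out.
# Illustrative heuristic only; not KV-Runahead's actual partitioning.

import math


def balanced_context_partition(num_tokens: int, num_procs: int) -> list[tuple[int, int]]:
    """Return (start, end) token ranges, one per process, with roughly
    equal causal-attention cost rather than equal token counts."""
    boundaries = [round(num_tokens * math.sqrt(i / num_procs))
                  for i in range(num_procs + 1)]
    boundaries[-1] = num_tokens  # guard against rounding drift
    return [(boundaries[i], boundaries[i + 1]) for i in range(num_procs)]


def chunk_cost(start: int, end: int) -> int:
    """Approximate causal-attention cost of prefilling tokens [start, end)."""
    return sum(t + 1 for t in range(start, end))


if __name__ == "__main__":
    ranges = balanced_context_partition(num_tokens=4096, num_procs=4)
    for rank, (s, e) in enumerate(ranges):
        print(f"rank {rank}: tokens [{s}, {e}), size {e - s}, cost ~{chunk_cost(s, e)}")
```

Running the sketch shows that the first rank receives far more tokens than the last, yet each rank's approximate attention cost is nearly identical, which is the effect context-level load balancing aims for when minimizing TTFT.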