Speculative decoding is a prominent technique for accelerating the inference of a large target language model based on predictions from an auxiliary draft model. While effective, in application-specific settings it often involves fine-tuning both draft and target models to achieve high acceptance rates. As the number of downstream tasks grows, these draft models add significant complexity to inference systems. We propose Speculative Streaming, a single-model speculative decoding method that fuses drafting into the target model by changing the fine-tuning objective from next token prediction to future n-gram prediction. Speculative Streaming speeds up decoding by 1.8 to 3.1 times on a diverse set of tasks, such as summarization, structured queries, and meaning representation, without sacrificing generation quality. Additionally, Speculative Streaming is parameter efficient: it achieves speedups equal to or greater than Medusa-style architectures while using ~10,000x fewer additional parameters, making it well suited for resource-constrained devices.
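To make the future n-gram objective concrete, the following is a minimal sketch, not the paper's implementation: a toy model with one output head per future offset, trained so that head k predicts the token k+1 positions ahead from the same hidden state. The tiny GRU trunk, head design, and hyperparameters (VOCAB, DIM, NGRAM) are illustrative assumptions, not the architecture described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, NGRAM = 1000, 64, 4  # assumed toy hyperparameters

class TinyNgramLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.trunk = nn.GRU(DIM, DIM, batch_first=True)  # toy stand-in for the target model
        # One head per future offset 1..NGRAM (offset 1 is the ordinary next-token head).
        self.heads = nn.ModuleList(nn.Linear(DIM, VOCAB) for _ in range(NGRAM))

    def forward(self, tokens):
        hidden, _ = self.trunk(self.embed(tokens))
        return [head(hidden) for head in self.heads]  # list of (B, T, VOCAB) logits

def ngram_loss(logits_per_offset, tokens):
    """Sum cross-entropy over offsets: head k predicts the token at position t + k + 1."""
    total = 0.0
    for k, logits in enumerate(logits_per_offset):
        shift = k + 1
        pred = logits[:, :-shift, :].reshape(-1, VOCAB)   # positions that have a target shift steps ahead
        target = tokens[:, shift:].reshape(-1)            # tokens shift steps in the future
        total = total + F.cross_entropy(pred, target)
    return total

if __name__ == "__main__":
    model = TinyNgramLM()
    batch = torch.randint(0, VOCAB, (2, 32))
    loss = ngram_loss(model(batch), batch)
    loss.backward()
    print(float(loss))
```

At inference time, the extra heads supply the speculative draft tokens that the target model then verifies, so no separate draft model is needed; the sketch above only illustrates the training-objective change, with the verification loop omitted.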