Meet Marlin: an FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups at medium batch sizes of 16 to 32 tokens
Speeding up inference for large language models is a long-standing challenge, ...
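To make the idea behind an FP16xINT4 kernel concrete, here is a minimal pure-Python sketch of weight-only 4-bit quantization: weights are stored as packed 4-bit codes and dequantized on the fly inside the matmul. The function names and the per-column scale/zero-point scheme are illustrative assumptions for this sketch; Marlin's actual CUDA kernel fuses these steps and is far more elaborate.

```python
def pack_int4(values):
    """Pack pairs of 4-bit unsigned integers (0..15) into bytes, low nibble first."""
    assert len(values) % 2 == 0 and all(0 <= v <= 15 for v in values)
    return bytes(lo | (hi << 4) for lo, hi in zip(values[0::2], values[1::2]))

def unpack_int4(packed):
    """Inverse of pack_int4: each byte yields two 4-bit values."""
    out = []
    for b in packed:
        out.append(b & 0x0F)         # low nibble
        out.append((b >> 4) & 0x0F)  # high nibble
    return out

def fp16xint4_matmul(x, packed_cols, scales, zeros):
    """Multiply float activations x (list of rows, M x K) by an INT4-quantized
    K x N weight matrix stored column-wise: packed_cols[j] holds column j's
    packed codes, dequantized as w = (q - zero) * scale."""
    out = []
    for row in x:
        out_row = []
        for packed, scale, zero in zip(packed_cols, scales, zeros):
            col = [(q - zero) * scale for q in unpack_int4(packed)]
            out_row.append(sum(a * w for a, w in zip(row, col)))
        out.append(out_row)
    return out
```

The point of keeping weights packed is memory bandwidth: 4-bit codes take a quarter of the space of FP16 weights, and since small-batch LLM inference is dominated by reading the weight matrix, shrinking those reads is what makes a near-4x speedup possible.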