In recent research, a team of Mistral AI researchers presented Mixtral 8x7B, a language model based on a Sparse Mixture of Experts (SMoE) architecture with open weights. Licensed under Apache 2.0, Mixtral is a decoder-only model whose feedforward layers form a sparse mixture-of-experts network.
The team explains that Mixtral's feedforward block picks from a set of eight distinct groups of parameters. At every layer and for every token, a router network selects two of these groups, called experts, to process the token and combines their outputs additively. Since only a fraction of the total parameters is used for each token, this approach enlarges the model's parameter space while keeping cost and latency under control.
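To make the routing idea concrete, below is a minimal PyTorch sketch of a top-2 sparse mixture-of-experts feedforward layer in the spirit of the description above. The class, dimensions, and layer choices are illustrative assumptions, not a reproduction of Mixtral's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoEFeedForward(nn.Module):
    """Illustrative top-2 sparse MoE feedforward block (not Mixtral's actual code)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router: one logit per expert for each token.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Eight independent feedforward "experts".
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model) — tokens flattened across batch and sequence.
        logits = self.router(x)                                   # (n_tokens, n_experts)
        weights, chosen = torch.topk(logits, self.top_k, dim=-1)  # top-2 experts per token
        weights = F.softmax(weights, dim=-1)                      # normalize the two gate values
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e
                if mask.any():
                    # Each selected expert processes its tokens; outputs are summed,
                    # weighted by the normalized router scores.
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Because only two of the eight experts run for any given token, only the selected experts' feedforward parameters are exercised per token, which is what keeps per-token compute well below the full parameter count.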
Mixtral has been pre-trained on multilingual data with a context window of 32k tokens. It performs on par with or better than Llama 2 70B and GPT-3.5 on a variety of benchmarks. One of its main advantages is its efficient use of parameters, which allows faster inference at small batch sizes and higher throughput at large batch sizes.
Mixtral substantially outperforms Llama 2 70B on tasks including multilingual comprehension, code generation, and mathematics. Experiments show that Mixtral can reliably retrieve information from its 32k-token context window, regardless of the length of the sequence and the position of the information within it.
To ensure a fair and accurate evaluation, the team reran the benchmarks with their own evaluation pipeline when comparing the Mixtral and Llama models in detail. The assessment covers a wide range of tasks grouped into categories such as mathematics, coding, reading comprehension, commonsense reasoning, world knowledge, and popular aggregated results.
Commonsense reasoning tasks such as ARC-Easy, ARC-Challenge, HellaSwag, Winogrande, PIQA, SIQA, OpenBookQA, and CommonsenseQA were evaluated in a 0-shot setting. World knowledge tasks, tested in a 5-shot format, include TriviaQA and NaturalQuestions. BoolQ and QuAC were the reading comprehension tasks, also assessed in a 0-shot setting. Mathematical tasks comprised GSM8K and MATH, while code-related tasks covered HumanEval and MBPP. Popular aggregated results for AGIEval, BBH, and MMLU were also included in the study.
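For readers unfamiliar with the 0-shot versus 5-shot distinction, the small helper below sketches how such prompts are typically assembled; the exemplar texts are placeholders, not items from any of the benchmarks named above, and the format is a generic assumption rather than the paper's exact prompt template.

```python
def build_prompt(exemplars: list[tuple[str, str]], question: str) -> str:
    """Assemble an n-shot prompt: n solved examples followed by the test question."""
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in exemplars]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

# 0-shot: no exemplars; 5-shot: five solved (question, answer) pairs precede the query.
zero_shot = build_prompt([], "Who wrote the novel Dune?")
five_shot = build_prompt([("What is 2 + 2?", "4")] * 5, "Who wrote the novel Dune?")
```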
The team has also introduced Mixtral 8x7B – Instruct, a chat model optimized to follow instructions. The procedure used supervised fine-tuning and direct preference optimization. In human evaluation benchmarks, Mixtral – Instruct outperformed GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and the Llama 2 70B chat model. Benchmarks such as BBQ and BOLD also show less bias and a more balanced sentiment profile.
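As a rough illustration of how an instruction-tuned checkpoint like this is typically used, the sketch below applies a chat template with the Hugging Face transformers library. The repository name, dtype, and generation settings are assumptions for illustration, not details from the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face repository name for the instruct checkpoint.
model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Explain sparse mixture-of-experts in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Short greedy generation; strip the prompt tokens before decoding.
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```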
To promote broad accessibility and a wide variety of applications, both Mixtral 8x7B and Mixtral 8x7B – Instruct are released under the Apache 2.0 license, allowing commercial and academic use. For efficient inference, the team also modified the vLLM project, integrating CUDA kernels from Megablocks.
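For readers who want to try the vLLM path, a minimal sketch follows. It assumes the checkpoint is available under the mistralai/Mixtral-8x7B-Instruct-v0.1 repository, that sufficient GPU memory is available, and that the [INST] prompt convention is used; none of these specifics come from the paper itself.

```python
from vllm import LLM, SamplingParams

# Assumed repository name; tensor_parallel_size depends on the GPUs available.
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["[INST] Summarize the Mixtral 8x7B architecture. [/INST]"], params)
print(outputs[0].outputs[0].text)
```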
In conclusion, the study highlights the strong performance of Mixtral 8x7B through a comprehensive comparison with Llama models across a wide range of benchmarks. Mixtral does exceptionally well on a variety of tasks, from math and coding problems to reading comprehension, reasoning, and general knowledge.
Review the Paper and Code. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.