LLMs have grown dramatically in recent years, driven largely by efforts to scale up both model size and training data. From roughly a billion parameters a few years ago, exemplified by GPT-2 with 1.5 billion parameters, LLMs now reach trillion-parameter architectures. This push stems from the benefits predicted by scaling laws for training ever-larger models. However, these laws traditionally assume a fixed data source, an assumption now challenged by frontier LLMs themselves, which make it possible to curate and generate training data in novel ways.
Previous research on phi models has shown that combining LLM-based web data filtering with LLM-generated synthetic data produces levels of performance typically associated with much larger models. For example, phi-2, with 2.7 billion parameters, matched the performance of models 25 times its size trained on conventional data.
Microsoft researchers presented phi-3-mini, a new model with 3.8 billion parameters, trained on a heavily curated data set of 3.3 trillion tokens. Despite its small size, phi-3-mini can run inference locally on a contemporary smartphone. The model is a transformer decoder with a default context length of 4K tokens, while its long-context variant, phi-3-mini-128K, extends this to 128K tokens using LongRope. Built on a block structure similar to Llama-2, it uses the same tokenizer with a vocabulary size of 32,064, so packages developed for the Llama-2 family can be adapted directly. With a hidden dimension of 3072, 32 heads, and 32 layers, the model is trained in bfloat16 on its 3.3 trillion-token corpus. Optimized for mobile devices, phi-3-mini can be quantized to 4 bits, occupying approximately 1.8GB of memory and generating more than 12 tokens per second on an iPhone 14 with the A16 Bionic chip.
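To give a sense of that mobile-friendly footprint, the sketch below loads a 4-bit quantized phi-3-mini checkpoint with Hugging Face transformers and bitsandbytes. The model id, generation settings, and hardware assumptions are ours, not details from the paper; treat it as a minimal illustration rather than the deployment used on the phone.

```python
# Minimal sketch: running phi-3-mini with 4-bit weight quantization.
# Assumes the "microsoft/Phi-3-mini-4k-instruct" Hugging Face checkpoint id
# and a CUDA-capable machine with `transformers` and `bitsandbytes` installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed checkpoint name

# 4-bit weights: ~3.8B params * 0.5 bytes/param ≈ 1.9 GB, roughly in line
# with the ~1.8GB memory figure reported for the on-device deployment.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,  # needed on older transformers versions
)

prompt = "Explain why smaller language models can run on phones."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```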
The training methodology builds on previous phi work and centers on high-quality training data to lift the performance of a small language model. Unlike approaches that target compute-optimal or over-trained regimes, it prioritizes data quality, filtering web data so that it supports the model's educational and reasoning goals. The model's performance is compared against Llama-2 models of various sizes, illustrating how it operates closer to a "data optimal regime". Additionally, a larger model, phi-3-medium with 14B parameters, is trained with a similar recipe but shows smaller relative gains, suggesting the data mixture still needs refinement. Post-training consists of supervised instruction fine-tuning and preference tuning with DPO, improving the model's chat capabilities, robustness, and safety.
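The paper does not publish its filtering pipeline, but the core idea of scoring web documents for educational value and keeping only the best can be sketched as follows. The classifier, labels, and threshold here are illustrative assumptions, not the researchers' actual setup.

```python
# Minimal sketch of LLM-based web data filtering in the spirit of the phi
# approach: keep only documents judged useful for reasoning and education.
# The zero-shot classifier, labels, and threshold are illustrative choices.
from transformers import pipeline

scorer = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def keep_document(text: str, threshold: float = 0.7) -> bool:
    """Return True if the document looks educational enough to keep."""
    result = scorer(text, candidate_labels=["educational", "low quality"])
    scores = dict(zip(result["labels"], result["scores"]))
    return scores["educational"] >= threshold

web_corpus = [
    "The derivative of x^2 is 2x, which follows from the limit definition.",
    "CLICK HERE for one weird trick!!! Buy now!!!",
]
filtered = [doc for doc in web_corpus if keep_document(doc)]
print(f"kept {len(filtered)} of {len(web_corpus)} documents")
```

In practice the phi recipe pairs this kind of filtered web data with LLM-generated synthetic data; the snippet only illustrates the filtering half.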
The researchers extended this work by training phi-3-medium, a model with 14B parameters that uses the same tokenizer and architecture as phi-3-mini. It is trained on the same data for slightly more epochs (4.8T tokens in total, the same as phi-3-small) and features 40 heads, 40 layers, and an embedding dimension of 5120. Interestingly, while some benchmarks improve significantly from 3.8B to 7B parameters, the gains are less pronounced from 7B to 14B. This observation suggests that the data mixture needs further refinement to reach the "data optimal regime" for a 14B-parameter model. Because some of these benchmarks are still under investigation (including a regression on HumanEval), the metrics reported for phi-3-medium should be viewed as a preliminary evaluation.
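For reference, the hyperparameters reported in the article can be collected in a small config sketch for side-by-side comparison. The field names are illustrative and do not correspond to keys in the released checkpoints' config files.

```python
# Reported phi-3 hyperparameters gathered for comparison; values come from
# the figures quoted above, field names are our own.
from dataclasses import dataclass

@dataclass
class PhiConfig:
    name: str
    n_params: str      # total parameter count
    hidden_dim: int    # embedding / hidden size
    n_heads: int       # attention heads
    n_layers: int      # transformer blocks
    train_tokens: str  # tokens seen during pre-training

PHI_3_MINI = PhiConfig("phi-3-mini", "3.8B", 3072, 32, 32, "3.3T")
PHI_3_MEDIUM = PhiConfig("phi-3-medium", "14B", 5120, 40, 40, "4.8T")

for cfg in (PHI_3_MINI, PHI_3_MEDIUM):
    print(cfg)
```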
While phi-3-mini achieves language understanding and reasoning comparable to much larger models, its size limits how much factual knowledge it can store, leading to lower performance on knowledge-heavy tasks such as TriviaQA. Augmenting the model with a search engine could mitigate this weakness. Its predominantly English training data also highlights the need to explore multilingual capabilities; adding more multilingual data to phi-3-small has shown initial promise.
In conclusion, this research presents phi-3-mini, demonstrating that smaller models can approach the performance of much larger counterparts, albeit with inherent limitations. Further exploration of multilingual data and search-engine augmentation could make small LLMs more effective across diverse linguistic tasks.
Check out the Paper and HF page. All credit for this research goes to the researchers of this project.
Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching applications of machine learning in healthcare.