This is a guest post by Arash Sadrieh, Tahir Azim, and Tengfei Xue of NinjaTech AI.
NinjaTech AI's mission is to make everyone more productive by taking care of complex, time-consuming tasks with fast and affordable artificial intelligence (AI) agents. We recently launched MyNinja.ai, one of the world's first multi-agent AI personal assistants, advancing our mission. MyNinja.ai was built from the ground up using specialized agents that are capable of completing tasks on your behalf, including scheduling meetings, conducting deep research from the web, generating code, and helping with writing. These agents can break down complicated, multi-step tasks into branching solutions, and they can evaluate the generated solutions dynamically while continually learning from past experiences. All of these tasks are performed completely autonomously and asynchronously, leaving you free to continue your day while Ninja works on them in the background and checks in when your input is required.
Because no single large language model (LLM) is perfect for every task, we knew that creating a personal AI assistant would require multiple LLMs, each optimized for a specific kind of work. To deliver the accuracy and capabilities needed to delight our users, we also knew we would need these models to work together in tandem. Finally, we needed scalable and cost-effective methods for training these various models, a task that has historically been costly for most startups. In this post, we describe how we built our productivity agent NinjaLLM, the backbone of MyNinja.ai, using AWS Trainium chips.
Building a data set
We recognized early on that to fulfill the mission of handling tasks on a user's behalf, we needed multiple models optimized for specific tasks. Examples include our Deep Researcher, Deep Coder, and Advisor models. After testing the available open source models, we felt that their out-of-the-box capabilities and responses were insufficient with prompt engineering alone to meet our needs. Specifically, in our testing with open source models, we wanted to make sure each model was optimized for a ReAct/chain-of-thought style of prompting. Additionally, we wanted the model, when deployed as part of a Retrieval Augmented Generation (RAG) system, to accurately cite each source, as well as exhibit a bias toward saying "I don't know" rather than generating false responses. To that end, we chose to fine-tune the models for the various downstream tasks.
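As a concrete illustration of the output behavior we optimized for, the following is a simplified, hypothetical prompt template in this ReAct/citation style. It is not our production prompt; the wording, structure, and function names are placeholders.

```python
# A hypothetical ReAct-style RAG prompt template with explicit citation and
# "I don't know" instructions; illustrative only, not our production prompt.
REACT_RAG_TEMPLATE = """You are a research assistant. Answer using only the passages below.
Cite every claim with the id of its supporting passage in square brackets, e.g. [2].
If the passages do not contain the answer, reply "I don't know."

Passages:
{passages}

Question: {question}

Thought: reason step by step about which passages are relevant.
Answer: the final answer, with citations."""

def build_prompt(question: str, passages: list[str]) -> str:
    """Number the retrieved passages and fill in the template."""
    numbered = "\n".join(f"[{i + 1}] {passage}" for i, passage in enumerate(passages))
    return REACT_RAG_TEMPLATE.format(passages=numbered, question=question)
```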
When building our training dataset, our goal was twofold: adapt each model for its appropriate downstream task and persona (Researcher, Advisor, Coder, and so on), and adapt the models to follow a specific output structure. To do this, we followed the LIMA (Less Is More for Alignment) approach to fine-tuning: we used a training sample size of roughly 20 million tokens, focusing on the format and tone of the output while keeping the sample diverse but relatively small. To build our supervised fine-tuning dataset, we began by creating initial seed tasks for each model. With these seed tasks, we generated an initial synthetic dataset using Meta's Llama 2 model, and used that synthetic dataset to perform a first round of fine-tuning. To evaluate the performance of this fine-tuned model, we collected user feedback to iteratively create more samples. We also used a series of benchmarks, internal and public, to assess model performance and continued to iterate.
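As a simplified sketch of this seed-task to synthetic-sample loop, the snippet below generates candidate supervised fine-tuning samples with an off-the-shelf Llama 2 chat checkpoint. The seed tasks, model ID, file name, and generation settings are placeholders for illustration, not our actual pipeline.

```python
# Generate candidate SFT samples from a handful of seed tasks using Llama 2.
# Seed tasks, model ID, and generation settings are illustrative assumptions.
import json
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-2-13b-chat-hf")

seed_tasks = [  # hypothetical seed tasks for the Researcher persona
    "Summarize the key findings of a given research paper and cite each source used.",
    "Compare two products based on provided reviews, citing every review you rely on.",
]

samples = []
for task in seed_tasks:
    prompt = (
        "Write one realistic user request for the following task, followed by an "
        f"ideal assistant response in a Thought/Answer (ReAct) format:\n{task}\n"
    )
    output = generator(prompt, max_new_tokens=512, do_sample=True, temperature=0.8)
    samples.append({"seed_task": task, "text": output[0]["generated_text"]})

# Write the candidates out for review and a first round of fine-tuning
with open("sft_samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```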
Fine-tuning on Trainium
We chose to start from the Llama models as our pre-trained base for several reasons, most notably their excellent out-of-the-box performance, strong ecosystem support across many libraries, and a truly open source, permissive license. At that point, we started with Llama 2, testing the various sizes (7B, 13B, and 70B). For training, we chose a cluster of trn1.32xlarge instances to take advantage of the Trainium chips. We used a cluster of 32 instances to parallelize training efficiently, and AWS ParallelCluster to manage cluster orchestration. With this Trainium cluster, each fine-tuning iteration took less than 3 hours at a cost of less than $1,000. This fast iteration time and low cost allowed us to quickly tune and test our models and improve their accuracy. To achieve the accuracies discussed in the following sections, we only had to spend about $30,000, saving hundreds of thousands, if not millions, of dollars compared to training on traditional accelerators.
The following diagram illustrates our training architecture.
Having established our fine-tuning processes based on Trainium, we were able to fine-tune and refine our models thanks to Neuron’s distributed training libraries. This was exceptionally helpful and timely, because prior to the launch of MyNinja.ai, Meta’s Llama 3 models were released. Llama 3 and Llama 2 share a similar architecture, so we were able to quickly upgrade to the newer model. This speed in switching allowed us to take advantage of the inherent gains in model accuracy and very quickly run another round of fine-tuning with the Llama 3 weights and get ready for launch.
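The following is a minimal sketch of what a fine-tuning run in this style can look like using Hugging Face's optimum-neuron package, which builds on the Neuron SDK. The model ID, dataset file, hyperparameters, and parallelism settings are illustrative assumptions, not our production configuration.

```python
# Sketch of supervised fine-tuning on Trainium via optimum-neuron's
# Trainer-style API; values below are assumptions for illustration.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling
from optimum.neuron import NeuronTrainer, NeuronTrainingArguments

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Assumed JSONL file of supervised fine-tuning samples with a "text" field
dataset = load_dataset("json", data_files="sft_samples.jsonl")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal-LM labels

args = NeuronTrainingArguments(
    output_dir="llama3-sft-trainium",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    bf16=True,                # Trainium trains efficiently in bfloat16
    tensor_parallel_size=8,   # shard the model across NeuronCores (assumed value)
)

trainer = NeuronTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
    tokenizer=tokenizer,
)
trainer.train()
```

In a multi-node setup like ours, a run like this is typically launched with torchrun on each trn1.32xlarge instance, with AWS ParallelCluster handling scheduling and orchestration across the nodes.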
Model evaluation
To evaluate the model, we had two objectives: evaluate the model's ability to answer users' questions, and evaluate the system's ability to answer questions using the provided sources, because this is the primary interface of our personal AI assistant. We selected the HotPotQA and Natural Questions (NQ) Open datasets, both of which are a good fit because they are open benchmark datasets with public leaderboards.
We calculated accuracy by matching the model's answer to the expected answer, using the top 10 passages retrieved from a Wikipedia corpus. We performed content filtering and ranking using ColBERTv2, a BERT-based retrieval model. We achieved accuracies of 62.22% on the NQ Open dataset and 58.84% on HotPotQA with our enhanced Llama 3 RAG model, demonstrating notable improvements over other benchmark models. The following figure summarizes our results.
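The sketch below shows this scoring loop at a high level: for each question, the assistant answers from the top 10 retrieved passages, and we count a hit when the normalized prediction matches any accepted gold answer. The function names and retrieval hook are illustrative assumptions, not our exact evaluation harness.

```python
# Simplified open-domain QA scoring: exact match of the normalized prediction
# against the accepted gold answers, given top-k retrieved passages.
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD/NQ-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    return normalize(prediction) in {normalize(answer) for answer in gold_answers}

def evaluate(examples, retrieve_fn, answer_fn, k: int = 10) -> float:
    """examples: dicts with 'question' and 'answers'; retrieve_fn(question, k) returns
    the top-k passages (for example, from a ColBERTv2 index over Wikipedia);
    answer_fn(question, passages) returns the model's answer string."""
    hits = 0
    for example in examples:
        passages = retrieve_fn(example["question"], k)
        prediction = answer_fn(example["question"], passages)
        hits += exact_match(prediction, example["answers"])
    return hits / len(examples)
```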
Future work
Looking ahead, we are working on several developments to continue improving our models' performance and user experience. First, we intend to use ORPO (Odds Ratio Preference Optimization) to fine-tune our models. ORPO combines traditional fine-tuning with preference alignment, using a single preference-alignment dataset for both. We believe this will allow us to better align models and achieve better results for users.
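As an illustration, the following sketch shows what an ORPO fine-tuning run can look like using the open source TRL library, one common implementation of the technique; it is not necessarily the tooling we will use, and the checkpoint, dataset file, and hyperparameters are assumptions.

```python
# Sketch of ORPO fine-tuning with TRL; values below are illustrative assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed base checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# A single dataset with "prompt", "chosen", and "rejected" columns drives both
# the supervised term and the preference-alignment term of the ORPO loss.
preference_data = load_dataset("json", data_files="preferences.jsonl")["train"]

config = ORPOConfig(
    output_dir="llama3-orpo",
    beta=0.1,                        # weight on the odds-ratio preference term (assumed)
    per_device_train_batch_size=1,
    num_train_epochs=1,
)

trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=preference_data,
    tokenizer=tokenizer,
)
trainer.train()
```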
Additionally, we plan to create a custom ensemble model from the various models we have fine-tuned so far. Inspired by Mixture of Experts (MoE) model architectures, we intend to introduce a routing layer across our various models. We believe this will radically simplify our model scaling and serving architecture, while maintaining the quality across the various tasks that our users expect from our personal AI assistant.
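To make the routing idea concrete, here is a deliberately simple sketch: a lightweight classifier decides which specialized model should handle each request, with a general-purpose model as the fallback. The task labels, classifier, and stand-in models are purely illustrative.

```python
# Minimal sketch of a routing layer over specialized models; all names and
# callables below are hypothetical stand-ins for illustration.
from typing import Callable, Dict

class TaskRouter:
    """Route a user request to one of several specialized models."""

    def __init__(self, classify: Callable[[str], str], models: Dict[str, Callable[[str], str]]):
        self.classify = classify   # e.g. a small classifier or a zero-shot LLM call
        self.models = models       # task label -> model callable

    def __call__(self, request: str) -> str:
        task = self.classify(request)
        model = self.models.get(task, self.models["advisor"])  # general-purpose fallback
        return model(request)

# Usage with stand-in callables in place of real fine-tuned models:
router = TaskRouter(
    classify=lambda text: "coder" if "code" in text.lower() else "researcher",
    models={
        "researcher": lambda text: f"[researcher model answers] {text}",
        "coder": lambda text: f"[coder model answers] {text}",
        "advisor": lambda text: f"[advisor model answers] {text}",
    },
)
print(router("Write code to parse a CSV file"))
```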
Conclusion
Creating next-generation AI agents to make everyone more productive is NinjaTech AI's path to achieving its mission. To democratize access to this transformative technology, it is essential to have access to high-powered compute, open source models, and an ecosystem of tools that make training each new agent affordable and fast. AWS's purpose-built AI chips, access to best-in-class open source models, and its training architecture make this possible.
To learn more about how we built NinjaTech AI's multi-agent personal AI assistant, you can read our white paper. You can also try these AI agents for free at MyNinja.ai.
About the authors
Arash Sadrieh is the co-founder and chief science officer of NinjaTech AI. Arash co-founded NinjaTech AI with the vision of making everyone more productive by taking care of time-consuming tasks with AI agents. This vision took shape during his tenure as a senior applied scientist at AWS, where over six years he drove key research initiatives that significantly improved infrastructure efficiency, earning him multiple patents for optimizing core infrastructure. His academic background includes a PhD in computer modeling and simulation, including collaborations with renowned institutions such as the University of Oxford, the University of Sydney, and CSIRO. Prior to his industry role, Arash conducted postdoctoral research marked by publications in high-impact journals, including Nature Communications.
Tahir Azim is a staff software engineer at NinjaTech. Tahir focuses on NinjaTech's Inf2- and Trn1-based training and inference platforms, its unified gateway for accessing these platforms, and its RAG-based research capability. He previously worked at Amazon as a senior software engineer, building data-driven systems for optimal utilization of Amazon's global Internet infrastructure and driving down costs, congestion, and latency. Before moving to industry, Tahir earned a master's degree and a PhD in Computer Science from Stanford University, taught for three years as an assistant professor at NUST (Pakistan), and did a postdoc in fast data analytics systems at EPFL. Tahir has authored several publications presented at top conferences such as VLDB, USENIX ATC, MobiCom, and MobiHoc.
Tengfei Xue is an applied scientist at NinjaTech AI. His current research interests include natural language processing and multimodal learning, particularly using large language models and large multimodal models. Tengfei completed his PhD studies at the University of Sydney's School of Computer Science, where he focused on deep learning for healthcare using various modalities. He was also a visiting PhD candidate at the Laboratory for Mathematics in Imaging (LMI) at Harvard University, where he worked on 3D computer vision for complex geometric data.