Gradient makes LLM benchmarking cost-effective and easy with AWS Inferentia

This is a guest post co-written with Michael Feil at Gradient.

Evaluating the performance of large language models (LLMs) is an important step in the pre-preparation and tuning process before deployment. The faster and more frequently you can validate performance, the greater the chances of improving model performance.

In ai” target=”_blank” rel=”noopener”>Degradedwe work on developing customized LLM and recently launched our ai/development-lab” target=”_blank” rel=”noopener”>ai development lab, which offers enterprise organizations an end-to-end custom development service to create private, custom LLMs and artificial intelligence (ai) co-pilots. As part of this process, we periodically evaluate the performance of our models (tuned, trained and open) against open and proprietary benchmarks. While working with the AWS team to train our models on AWS Trainium, we realized that we were restricted to both VRAM and GPU instance availability when it came to the primary tool for LLM evaluation. lm-evaluation-harness. This open source framework allows you to score different generative language models on various assessment tasks and benchmarks. It is used by leaderboards such as hugging face for public benchmarking.

To overcome these challenges, we decided to build and open source our solution, integrating AWS Neuronthe library behind AWS Inferentia and Trainium, in lm-evaluation-harness. This integration allowed us to compare ai/blog/alphatross-llm-now-available-on-hugging-face” target=”_blank” rel=”noopener”>v-alpha-tross, a first version of our Albatross modelcompared to other public models during and after the training process.

For context, this integration runs as a new model class within lm-evaluation-harness, abstracting token inference and log-likelihood estimation of sequences without affecting the actual evaluation task. The decision to move our internal testing process to amazon Elastic Compute Cloud (amazon EC2) Inf2 instances (powered by AWS Inferentia2) allowed us to access up to 384 GB of shared accelerator memory, effortlessly adapting to all of our current public architectures . By using AWS Spot Instances, we were able to take advantage of unused EC2 capacity in the AWS cloud, enabling cost savings of up to 90% off on-demand pricing. This minimized the time needed for testing and allowed us to test more frequently because we could test multiple instances that were available and release them when we were done.

In this post, we provide a detailed breakdown of our testing, the challenges we encountered, and an example of using the test harness in AWS Inferentia.

Benchmarking on AWS Inferentia2

The goal of this project was to generate scores identical to those shown in the LLM Open Leaderboard (for many CausalLM models available on Hugging Face), while maintaining the flexibility to run it with private benchmarks. To see more examples of available models, see AWS Inference and Trainium in Hug the face.

Code changes required to migrate a model from Hugging Face transformers to Hugging Face Optimal neuron The Python library was quite low. Because lm-evaluation-harness uses AutoModelForCausalLMthere is a drop in replacement using NeuronModelForCausalLM. Without a prebuilt model, the model is automatically built on the fly, which could add 15 to 60 minutes to a job. This gave us the flexibility to deploy tests for any AWS Inferentia2 instance and support the CausalLM model.

Results

Because of the way benchmarks and models work, we didn't expect scores to match exactly across different runs. However, they should be very close by standard deviation, and we have seen that consistently, as shown in the table below. The initial benchmarks we ran on AWS Inferentia2 were confirmed by the Hugging Face leaderboard.

In lm-evaluation-harnessThere are two main streams used by different tests: generate_until and loglikelihood. The gsm8k test mainly uses generate_until to generate responses as during inference. Loglikelihood It is primarily used in benchmarking and testing, and examines the probability of different outcomes occurring. Both work on Neuron, but the loglikelihood The SDK 2.16 method uses additional steps to determine probabilities and may take longer.

lm evaluation harness results
Hardware configuration	original system	AWS Inference inf2.48xlarge
Time with batch_size=1 to evaluate mistralai/Mistral-7B-Instruct-v0.1 in gsm8k	103 minutes	32 minutes
Score in gsm8k (get response – exact_match with std)	0.3813 – 0.3874 (± 0.0134)	0.3806 – 0.3844 (± 0.0134)

Get started with Neuron and lm-evaluation-harness

The code in this section can help you use lm-evaluation-harness and run it with supported models in Hugging Face. To see some available models, visit AWS Inference and Trainium in Hug the face.

If you are familiar with running models in AWS Inferentia2, you may notice that there are no num_cores past configuration. Our code detects how many cores are available and automatically passes that number as a parameter. This allows you to run the test using the same code regardless of the instance size you are using. You may also notice that we are referring to the original model, not a compiled version of Neuron. The harness automatically compiles the model as needed.

The following steps show you how to implement the gradient. gradientai/v-alpha-tross model we tested. If you want to test with a smaller example on a smaller instance, you can use the mistralai/Mistral-7B-v0.1 model.

The default quota for running On-Demand Inf instances is 0, so you must request an increase through Service Quotas. Add another request for all Inf Spot Instance requests so you can test with Spot Instances. You will need a quota of 192 vCPUs for this example using an inf2.48xlarge instance, or a quota of 4 vCPUs for a basic inf2.xlarge (if you are deploying the Mistral model). Quotas are AWS Region-specific, so be sure to request them at us-east-1 either us-west-2.
Decide your instance based on your model. Because v-alpha-tross is a 70B architecture, we decided to use an inf2.48xlarge instance. Implement an inf2.xlarge (for model 7B Mistral). If you are testing a different model, you may need to adjust your instance based on the size of your model.
Deploy the instance using Hugging Face DLAMI version 20240123, so that all necessary drivers are installed. (The price shown includes the cost of the instance and there is no additional software charge.)
Set the drive size to 600 GB (100 GB for Mistral 7B).
Clone and install lm-evaluation-harness in the instance. We specify a build so we know that any variations are due to model changes, not testing or code changes.

git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
# optional: pick specific revision from the main branch version to reproduce the exact results
git checkout 756eeb6f0aee59fc624c81dcb0e334c1263d80e3
# install the repository without overwriting the existing torch and torch-neuronx installation
pip install --no-deps -e . 
pip install peft evaluate jsonlines numexpr pybind11 pytablewriter rouge-score sacrebleu sqlitedict tqdm-multiprocess zstandard hf_transfer

Run lm_eval with the hf-neuron model type and make sure you have a link to the path back to the model in Hugging Face:

# e.g use mistralai/Mistral-7B-v0.1 if you are on inf2.xlarge
MODEL_ID=gradientai/v-alpha-tross

python -m lm_eval --model "neuronx" --model_args "pretrained=$MODEL_ID,dtype=bfloat16" --batch_size 1 --tasks gsm8k

If you run the above example with Mistral, you should receive the following output (on the smaller inf2.xlarge, the run could take 250 minutes):

███████████████████████| 1319/1319 (32:52<00:00,  1.50s/it)
neuronx (pretrained=mistralai/Mistral-7B-v0.1,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|  Filter  |n-shot|  Metric   |Value |   |Stderr|
|-----|------:|----------|-----:|-----------|-----:|---|-----:|
|gsm8k|      2|get-answer|     5|exact_match|0.3806|±  |0.0134|

Clean

When you're done, be sure to stop the EC2 instances through the amazon EC2 console.

Conclusion

The Gradient and Neuron teams are excited to see broader adoption of the LLM assessment with this release. Try it yourself and run the most popular evaluation framework on AWS Inferentia2 instances. You can now benefit from the on-demand availability of AWS Inferentia2 when you use ai/development-lab” target=”_blank” rel=”noopener”>Gradient Custom LLM Development. Start hosting models on AWS Inferentia with these tutorials.

About the authors

Michael Feil is an artificial intelligence engineer at Gradient and previously worked as a machine learning engineer at Rodhe & Schwarz and a researcher at the Max-Plank Institute for Intelligent Systems and Bosch Rexroth. Michael is a leading contributor to several open source inference libraries for LLM and open source projects such as StarCoder. Michael has a bachelor's degree in mechatronics and computer science from KIT and a master's degree in robotics from the Technical University of Munich.

Jim Burton He is a Senior Startup Solutions Architect at AWS and works directly with startups like Gradient. Jim is a CISSP, part of the AWS ai/ML technical field community, a Neuron ambassador, and works with the open source community to enable the use of Inferentia and Trainium. Jim has a bachelor's degree in mathematics from Carnegie Mellon University and a master's degree in economics from the University of Virginia.