The Enigma for ChatGPT: PUMA is an AI Approach That Proposes a Fast and Secure Way for LLM Inference

Large Language Models (LLMs) have started a revolution in artificial intelligence. The release of ChatGPT kicked off the LLM era, and the models have kept improving ever since. Trained on massive amounts of data, they have impressed us with capabilities that range from language understanding to simplifying complex tasks.

Numerous alternatives to ChatGPT have been proposed, and they keep getting better, even surpassing ChatGPT on certain tasks. LLaMa, Claude, Falcon, and more: the new LLMs are coming for ChatGPT’s throne.

However, ChatGPT is still by far the most popular LLM out there. There is a good chance that your favorite AI-powered app is just a ChatGPT wrapper, handling the connection for you. But from a security perspective, is it really private and secure? OpenAI says that protecting API data privacy is something it cares deeply about, yet it is facing numerous lawsuits at the same time. Even if the company works hard to protect the privacy and security of model usage, these models can be too powerful to be controlled.

So how can we use the power of LLMs without raising privacy and security concerns? How can we tap into these models’ prowess without compromising sensitive data? Meet PUMA.

PUMA is a framework designed to enable secure and efficient evaluation of Transformer models while keeping your data private. It combines secure multi-party computation (MPC) with efficient Transformer inference.

At its core, PUMA introduces a novel technique to approximate the complex non-linear functions within Transformer models, such as GeLU and Softmax. These approximations are tailored to retain accuracy while significantly boosting efficiency. Unlike previous methods, which might sacrifice performance or lead to convoluted deployment strategies, PUMA’s approach delivers both accurate results and the efficiency that real-world applications need.
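To make the idea of approximating GeLU concrete, here is a minimal sketch of a piecewise approximation: zero on the far left, the identity on the far right, and low-degree polynomials in between. The breakpoints, the polynomial degrees, and the use of Chebyshev-node interpolation are illustrative assumptions, not the coefficients or fitting procedure from the paper. The point is that evaluating the approximation needs only comparisons, additions, and multiplications, which are the operations MPC protocols handle cheaply.

```python
import math

def gelu(x):
    """Exact GeLU, used here only as the reference to fit against."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def chebyshev_nodes(lo, hi, n):
    """Return n Chebyshev nodes on the interval [lo, hi]."""
    mid, half = 0.5 * (lo + hi), 0.5 * (hi - lo)
    return [mid + half * math.cos((2 * k + 1) * math.pi / (2 * n)) for k in range(n)]

def fit_newton(lo, hi, degree):
    """Interpolate gelu on [lo, hi]; returns (nodes, Newton coefficients)."""
    xs = chebyshev_nodes(lo, hi, degree + 1)
    coef = [gelu(x) for x in xs]
    # In-place divided differences.
    for j in range(1, len(xs)):
        for i in range(len(xs) - 1, j - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - j])
    return xs, coef

def eval_newton(poly, x):
    """Horner-style evaluation: only additions and multiplications."""
    xs, coef = poly
    acc = coef[-1]
    for i in range(len(coef) - 2, -1, -1):
        acc = acc * (x - xs[i]) + coef[i]
    return acc

# Hypothetical breakpoints and degrees, chosen for illustration only.
LEFT, SPLIT, RIGHT = -4.0, -1.95, 3.0
POLY_LOW = fit_newton(LEFT, SPLIT, 3)
POLY_MID = fit_newton(SPLIT, RIGHT, 6)

def approx_gelu(x):
    """Piecewise GeLU: 0 on the left tail, x on the right tail."""
    if x < LEFT:
        return 0.0
    if x < SPLIT:
        return eval_newton(POLY_LOW, x)
    if x <= RIGHT:
        return eval_newton(POLY_MID, x)
    return x

grid = [i / 100.0 for i in range(-600, 601)]
max_err = max(abs(approx_gelu(x) - gelu(x)) for x in grid)
print(f"max abs error on [-6, 6]: {max_err:.5f}")
```

In a secure setting the segment selection would itself run as a secure comparison on secret-shared values; the plain `if` statements here only show the structure of the approximation.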

PUMA introduces three pivotal entities: the model owner, the client, and the computing parties. Each entity plays a crucial role in the secure inference process.

The model owner supplies the trained Transformer models, while the client contributes the input data and receives the inference results. The computing parties collectively execute secure computation protocols, ensuring that data and model weights remain protected throughout the process. The underlying principle of PUMA’s inference process is to keep the input data and the weights confidential, preserving the privacy of the entities involved.
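The division of roles can be illustrated with a toy secret-sharing example. The sketch below uses simple additive sharing over a 64-bit ring across three computing parties; this is an assumption made for illustration, since the article does not describe PUMA's actual sharing scheme. The client and the model owner each split a private value into random-looking shares, the parties add their shares locally, and only the reconstructed result is revealed.

```python
import secrets

MOD = 2 ** 64  # ring size, an illustrative choice

def share(value, n_parties=3):
    """Split value into n additive shares that sum to value mod MOD."""
    shares = [secrets.randbelow(MOD) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MOD)
    return shares

def reconstruct(shares):
    """Recombine all shares; any strict subset looks uniformly random."""
    return sum(shares) % MOD

def add_shared(a_shares, b_shares):
    """Each party adds its own two shares; no communication is needed."""
    return [(a + b) % MOD for a, b in zip(a_shares, b_shares)]

# Client input and model-owner weight (hypothetical values).
client_input = 42
model_bias = 7

input_shares = share(client_input)   # sent by the client, one share per party
bias_shares = share(model_bias)      # sent by the model owner

result_shares = add_shared(input_shares, bias_shares)
result = reconstruct(result_shares)  # only the client learns this
print(result)
```

Addition is free in this scheme, while multiplications and non-linear functions require interaction between the parties, which is why the cost of functions such as GeLU and Softmax dominates secure inference.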

Secure embedding, a fundamental aspect of the secure inference process, traditionally involves the generation of a one-hot vector using token identifiers. Instead, PUMA proposes a secure embedding design that adheres closely to the standard workflow of Transformer models. This streamlined approach ensures that the security measures do not interfere with the inherent architecture of the model, simplifying the deployment of secure models in practical applications.
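For context, the traditional one-hot approach mentioned above can be sketched in a few lines. A lookup by a secret token id cannot index the table directly without revealing the id, so the id is expanded into a one-hot vector and multiplied with the embedding table. The table values and helper names below are hypothetical, and the code runs in plaintext; it shows the baseline, not PUMA's own embedding design.

```python
def one_hot(token_id, vocab_size):
    """Return a 0/1 vector with a single 1 at position token_id."""
    return [1 if i == token_id else 0 for i in range(vocab_size)]

def embed_via_one_hot(token_id, table):
    """Select a row of table as a vector-matrix product."""
    selector = one_hot(token_id, len(table))
    dim = len(table[0])
    return [
        sum(selector[i] * table[i][j] for i in range(len(table)))
        for j in range(dim)
    ]

# Toy embedding table: 4 tokens, 3 dimensions (made-up numbers).
embedding_table = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

vector = embed_via_one_hot(2, embedding_table)
print(vector)
```

The product touches every row of the table, so the access pattern does not depend on the token id. The cost is that the client, or the secure protocol, has to produce a vector as long as the vocabulary.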

Moreover, a major challenge in secure inference lies in approximating complex functions, such as GeLU and Softmax, in a way that balances computational efficiency with accuracy. PUMA tackles this aspect by devising more accurate approximations tailored to the properties of these functions. By leveraging the specific characteristics of these functions, PUMA significantly enhances the precision of the approximation while optimizing runtime and communication costs.
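One way to exploit the properties of Softmax can be sketched as follows. After subtracting the maximum, every exponent input is non-positive, so the exponential only has to be accurate on the negative half-line, and very negative inputs can be clipped to zero. The sketch approximates exp(x) with the limit formula (1 + x/2^t)^(2^t), which needs only t squarings. The threshold and the value of t are illustrative assumptions, and the code is a plaintext sketch rather than PUMA's protocol.

```python
import math

T_EXP = -14.0  # inputs below this are treated as exp(x) = 0 (illustrative)
T_ITERS = 5    # number of squarings, so the exponent is 2**5 = 32 (illustrative)

def approx_neg_exp(x):
    """Approximate exp(x) for x <= 0 using repeated squaring."""
    if x < T_EXP:
        return 0.0  # clipping also avoids 1 + x/32 going negative
    y = 1.0 + x / (2 ** T_ITERS)
    for _ in range(T_ITERS):
        y = y * y
    return y

def approx_softmax(xs):
    """Softmax with max-subtraction and the approximate exponential."""
    m = max(xs)
    exps = [approx_neg_exp(x - m) for x in xs]
    total = sum(exps)  # at least 1.0, because the max entry maps to exp(0)
    return [e / total for e in exps]

def exact_softmax(xs):
    """Reference softmax for comparison."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 2.0, 3.0, -50.0]
approx = approx_softmax(logits)
exact = exact_softmax(logits)
worst = max(abs(a - e) for a, e in zip(approx, exact))
print(f"largest deviation from exact softmax: {worst:.5f}")
```

Because the largest entry always contributes exactly 1 to the denominator, the final division is well conditioned, which keeps the normalized outputs close to the exact values even though the exponential itself is only approximate.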

Finally, LayerNorm, a crucial operation within the Transformer model, presents unique challenges in secure inference due to the divide-square-root formula. PUMA addresses this by smartly redefining the operation using secure protocols, thus ensuring that the computation of LayerNorm remains both secure and efficient.
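The divide-square-root issue can be illustrated in plaintext. The textbook formula divides every element by a square root, and both division and square root are expensive under MPC. A common restructuring, shown below as an assumption about the general idea rather than PUMA's exact protocol, computes one reciprocal square root per vector and then uses only multiplications. The `rsqrt` helper is a plaintext stand-in for a secure reciprocal-square-root protocol.

```python
import math

def rsqrt(v):
    """Plaintext stand-in for a secure reciprocal-square-root protocol."""
    return 1.0 / math.sqrt(v)

def layernorm_divide_sqrt(x, gamma, beta, eps=1e-5):
    """Textbook LayerNorm: one square root and one division per element."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def layernorm_rsqrt(x, gamma, beta, eps=1e-5):
    """Same result with a single rsqrt call, then multiplications only."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = rsqrt(var + eps)  # computed once for the whole vector
    return [g * (v - mean) * inv_std + b
            for v, g, b in zip(x, gamma, beta)]

# Toy activations and parameters (made-up numbers).
x = [0.5, -1.2, 3.3, 0.0]
gamma = [1.0, 0.9, 1.1, 1.0]
beta = [0.0, 0.1, -0.1, 0.0]

reference = layernorm_divide_sqrt(x, gamma, beta)
restructured = layernorm_rsqrt(x, gamma, beta)
gap = max(abs(a - b) for a, b in zip(reference, restructured))
print(f"largest difference between the two forms: {gap:.2e}")
```

Both functions compute the same values up to floating-point rounding; the difference is only in which primitive operations a secure protocol would have to support and how often it would have to run them.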

One of the most important features of PUMA is its seamless integration. The framework facilitates end-to-end secure inference for Transformer models without necessitating major model architecture modifications. This means you can leverage pre-trained Transformer models with minimal effort. Whether it’s a language model downloaded from Hugging Face or another source, PUMA keeps things simple. It aligns with the original workflow and doesn’t demand complex retraining or modifications.

Ekrem Çetinkaya received his B.Sc. in 2018, and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He received his Ph.D. degree in 2023 from the University of Klagenfurt, Austria, with his dissertation titled “Video Coding Enhancements for HTTP Adaptive Streaming Using Machine Learning.” His research interests include deep learning, computer vision, video encoding, and multimedia networking.
