In recent years, generative AI research has evolved in ways that have changed how we work. From drafting content, planning our work, and finding answers to creating works of art, much of this is now possible with generative AI. However, each model usually serves certain use cases: for example, GPT for text-to-text, Stable Diffusion for text-to-image, and many others.
A model capable of handling several kinds of input and output is called a multimodal model. Much of the cutting-edge research is moving in the multimodal direction, as it has proven useful in many settings. That’s why one of the most interesting pieces of multimodal research you should know about is NExT-GPT.
NExT-GPT is a multimodal model that can transform anything into anything, meaning any supported input modality into any supported output modality. So how does it work? Let’s explore it further.
NExT-GPT is a multimodal LLM that can handle four types of input and output: text, images, videos, and audio. The research was initiated by the NExT++ research group at the National University of Singapore.
The general representation of the NExT-GPT model is shown in the following image.
The NExT-GPT model (Wu et al., 2023)
The NExT-GPT model works in three stages (see the sketch after this list):
- Establishing encoders for inputs in various modalities and projecting them into language-like representations that the LLM can accept.
- Using an open-source LLM as the core to process the projected inputs for both semantic understanding and reasoning, while also emitting modality signal tokens.
- Routing the multimodal signal tokens to the corresponding decoders and generating the results in the appropriate modalities.
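To make these three stages more concrete, here is a minimal, runnable Python sketch of the flow. Every name in it (encode_input, llm_core, decode_output, the toy embeddings, and the signal dictionary) is a hypothetical stand-in for illustration only, not the actual NExT-GPT code or API.

```python
from typing import Dict, List, Tuple

MODALITIES = ("text", "image", "video", "audio")

def encode_input(modality: str, payload: str) -> List[float]:
    """Stage 1: a modality-specific encoder plus a projection layer maps the
    input into a language-like embedding the LLM core can accept.
    Here we fake it with a small numeric vector."""
    assert modality in MODALITIES
    return [float(ord(c) % 7) for c in payload[:8]]

def llm_core(embeddings: List[List[float]], intent: str) -> Tuple[str, Dict[str, List[float]]]:
    """Stage 2: the open-source LLM core reasons over the projected inputs and
    emits text plus per-modality 'signal tokens' that tell Stage 3 which
    decoders to activate (toy intent matching here)."""
    text_answer = f"Understood request: {intent}"
    signals = {m: [sum(e) for e in embeddings] for m in MODALITIES if m in intent}
    return text_answer, signals

def decode_output(signals: Dict[str, List[float]]) -> Dict[str, str]:
    """Stage 3: output projections route each signal to the matching decoder
    (e.g. a diffusion model for images), producing the final result."""
    return {m: f"<{m} generated from signal {s[:2]}>" for m, s in signals.items()}

if __name__ == "__main__":
    embeddings = [encode_input("text", "draw a cat"), encode_input("audio", "meow.wav")]
    answer, signals = llm_core(embeddings, "produce an image of a cat")
    print(answer)
    print(decode_output(signals))
```

In the actual model, the first and third stages rely on pretrained multimodal encoders and diffusion decoders, with lightweight projection layers trained to align them with the LLM core.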
In the following image you can see an example of the NExT-GPT inference process.
NExT-GPT inference process (Wu et al., 2023)
We can see in the image above that, depending on the task at hand, the encoders and decoders switch to the appropriate modalities. This routing is possible because NExT-GPT uses a technique called modality-switching instruction tuning (MosIT), which trains the model to follow the user’s intent across modalities.
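As a rough illustration of what such switching could look like, the toy sketch below scans the LLM’s output for special signal tokens and activates only the decoders that the response requires. The token names and the dispatch table are hypothetical placeholders, not the actual tokens NExT-GPT emits.

```python
from typing import Dict, List

# Hypothetical signal tokens a tuned LLM might emit alongside its text.
SIGNAL_TO_DECODER: Dict[str, str] = {
    "[IMG]": "image_diffusion_decoder",
    "[AUD]": "audio_diffusion_decoder",
    "[VID]": "video_diffusion_decoder",
}

def route_decoders(llm_output_tokens: List[str]) -> List[str]:
    """Return which decoders to activate for this response; plain text
    needs no decoder and is passed through directly."""
    return [SIGNAL_TO_DECODER[t] for t in llm_output_tokens if t in SIGNAL_TO_DECODER]

# Example: a response that mixes text with an image signal and an audio signal.
tokens = ["Here", "is", "your", "scene", "[IMG]", "with", "sound", "[AUD]"]
print(route_decoders(tokens))  # ['image_diffusion_decoder', 'audio_diffusion_decoder']
```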
The researchers experimented with various combinations of modalities. Overall, the performance of NExT-GPT can be summarized in the following graph.
Overall performance results of NExT-GPT (Wu et al., 2023)
NExT-GPT performs best with text and audio input producing image output, followed by text, audio, and image input producing image output. The lowest-performing combination is text and video input producing video output.
An example of the NExT-GPT capability is shown in the image below.
Text-to-Text+Image+Audio from NExT-GPT (Source: NExT-GPT Web)
The result above shows that interacting with NExT-GPT can produce audio, text, and images that match the user’s intent, and that the model performs well and reliably.
Another example of NExT-GPT is shown in the image below.
Text+Image-to-Text+Audio from NExT-GPT (Source: NExT-GPT Web)
The image above shows that NExT-GPT can handle two input modalities at once and produce text and audio output, which demonstrates the model’s versatility.
If you want to test the model, you can set up the model and environment from its GitHub page. You can also try the demo on the project’s demo page.
NExT-GPT is a multimodal model that accepts input and produces output in text, image, audio, and video. The model works by using modality-specific encoders and decoders and switching to the appropriate modalities based on the user’s intent. The performance experiments show promising results, suggesting the model can be used in many applications.
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves sharing Python and data tips via social media and writing media.