Recent advances in generative AI have enabled the development of large multimodal models (LMMs) that can process and generate multiple types of data, such as text, images, audio, and video.
LMMs share with “standard” large language models (LLMs) the generalization and adaptation capacity typical of large foundation models; unlike LLMs, however, they can process data beyond text.
One of the most prominent examples of a large multimodal model is GPT-4V(ision), the latest version of the generative pre-trained transformer (GPT) family. GPT-4V can perform various tasks that require both natural language understanding and computer vision, such as image captioning, visual question answering, text-to-image synthesis, and image-to-text translation.
GPT-4V (along with its newer version, GPT-4 Turbo with Vision) has demonstrated extraordinary capabilities, including:
- Mathematical reasoning about numerical problems
- Generating code from sketches
- Describing artistic heritage

And many others.
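To make this concrete, the snippet below sketches one way to ask a GPT-4 vision model a question about an image through the OpenAI Python SDK. This is a minimal sketch, not the models' internal method: the model name (`gpt-4-vision-preview`), the example image URL, and the payload shape are assumptions based on the public API at the time of writing and may change.

```python
# Minimal sketch: visual question answering with a GPT-4 vision model
# via the OpenAI Python SDK (v1.x). Model name and payload shape are
# assumptions based on the public API and may change over time.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=[
        {
            "role": "user",
            "content": [
                # A single user turn can mix text and image inputs
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    # Hypothetical image URL, for illustration only
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)
```

The same request structure covers tasks like image captioning or reasoning about a sketch: only the text prompt and the image change.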
In this article, we will focus on the vision capabilities of LMMs and how they differ from standard computer vision algorithms.
What is computer vision?
Computer vision (CV) is a field of artificial intelligence (AI) that allows computers and systems to derive…