A picture speaks a million words! We come across countless images every day while scrolling through social media or browsing the web. Some of them make us think, some make us laugh, and some mesmerize us, making us wonder what’s the story behind them. Large language models (LLMs) can help us better understand images, explaining their visual setting, context, and meaning. When it comes to image analysis, Llama 3.2 90B Vision and GPT-4o are two powerful LLMs with the potential to understand and analyze complex visual information, delivering highly accurate outputs. In this blog, we’ll compare the outputs of Llama 3.2 90B vs GPT-4o to uncover which one comes out on top.
Image Analysis with Llama 3.2 90B vs GPT 4o
Get ready to explore an exciting head-to-head comparison of the Llama 3.2 90B vs GPT-4o models across diverse image types. From the raw beauty of wildlife photography to the intricate details of technical diagrams, we’ll see how these models interpret and analyze visuals.
We will be specifically analyzing outputs of these models on these 10 image categories:
- Wildlife Photography
- Medical Images
- Natural Landscape
- Technical Diagram
- Natural Phenomenon
- Food Photography
- Infographics
- Sports Photography
- Cartoons
- Architectural Designs
Let’s begin with Wildlife Photography and uncover how these models interpret the beauty and dynamics of the animal kingdom.
1. Wildlife Photography
Using the above image, let me show you how to provide a prompt for image analysis in Llama 3.2 90B and GPT-4.0. Let’s start with Llama 3.2 90B
Image Analysis Using Llama 3.2 90B
1. Go to https://build.nvidia.com/meta/llama-3.2-90b-vision-instruct. You can access it directly for image analysis, without having to log in.
2. At the bottom left, you will see an option to upload images. Here, you can upload your image in .png, .jpg, or .jpeg format.
3. Enter the prompt in the ‘Type text here’ section of the same box.
Now, for our example, we will use the following prompt:
“Describe the animal in the image, including details about its posture, movement, and expressions. Also, highlight the environment depicted in the image.”
- After typing in your prompt, simply click on Send and wait for the response.
You will see the response in the top-left box on the screen.
Now let’s try the same prompt on GPT 4o.
Image Analysis Using GPT 4o
- Visit https://openai.com/index/chatgpt/.
- Select try ChatGPT and log in using your Gmail account. You will receive 10 free questions for ChatGPT4o every day.
- Click on the paperclip icon and attach the same wildlife photo.
- Type in the same prompt used earlier in Llama 3.2 90B and click on Upward arrow.
You will get the following response.
Also Read: 3 Ways to Use GPT 4o Like a Pro with Canvas
Now let’s compare the responses we got from both these models.
Llama 3.2 90B | GPT 4o |
“The image depicts a tiger walking on a dirt path in a wooded area.
The tiger is walking towards the camera, its head turned to the left. It has orange fur with black stripes, and its body is facing the left side of the image. The tiger’s posture suggests that it is alert and aware of its surroundings. Its movement appears slow and deliberate as if it is stalking its prey or navigating through the forest. The environment depicted in the image is a wooded area with a dirt path running through it. The path is surrounded by trees and greenery, creating a sense of seclusion and isolation. In the background, there are several trees and rocks visible, which add to the natural and rustic atmosphere of the image. Overall, the image presents a serene and peaceful scene, with the tiger being the main focus of attention. The surroundings create a sense of harmony between the animal and its environment, highlighting the beauty and majesty of nature.” |
“The animal in the image is a tiger, identifiable by its distinctive orange coat with black stripes and white underside. Description: Posture and Movement: The tiger is standing on all four legs and appears to be in mid-stride, with its body slightly angled and its tail curved upward. This suggests it is actively walking or preparing to move further. Expression: The tiger’s facial expression is calm yet focused, with its eyes partially closed, possibly due to sunlight or a moment of reflection. Environment: The setting is a naturalistic outdoor area with patches of green grass, soil, and rocks. There are scattered trees and shrubs in the background, providing a semi-wild, forest-like ambience. The lighting is bright, indicating it might be daytime, with sunlight casting shadows around the tiger.” |
Comparative Analysis
Llama 3.2 offers a broad understanding of the image, focusing on the harmony between the tiger and its environment with a factual storytelling approach. In contrast, GPT-4.0 excels in capturing specific micro-details, such as the tiger’s tail position, facial expression, and the interplay of lighting, resulting in more engaging and well-structured responses aligned with the prompt. I would prefer GPT 4o’s output in this case.
Now let’s analyze some other types of images. With each image, we will be comparing the responses of both models.
2. Medical Images
Prompt: What injury does this medical image depict, and how can it be diagnosed?
Llama 3.2 90B | GPT 4o |
“Step 1: Identify the type of medical image This is an x-ray of the lumbar spine, showing the vertebrae in the lower back. Step 2: Identify the injury The x-ray shows a fracture of the L1 vertebra, which is located in the lower back. The fracture is a compression fracture, which means that the vertebra has been compressed and is no longer its normal height. Step 3: Diagnose the injury The diagnosis of this injury is a compression fracture of the L1 vertebra. This type of fracture is typically caused by osteoporosis, which is a condition that weakens the bones and makes them more susceptible to fractures. Step 4: Conclusion The medical image depicts a compression fracture of the L1 vertebra, which is a common injury in people with osteoporosis. The diagnosis of this injury can be made using an x-ray, which shows the fracture and the compression of the vertebra. Answer: Compression fracture of the L1 vertebra.” |
“The medical image shows x-rays of the lumbar spine, likely indicating an injury or abnormality in the vertebral column.
Possible Injuries Depicted: Spinal Fractures: Diagnosis: Clinical Assessment: Would you like a detailed explanation of a specific injury or condition visible in such x-rays?” |
Comparative Analysis
Llama 3.2 90B focuses on precision, clearly identifying the injury and its cause. Its response is ideal for readers seeking concise and direct answers. In contrast, GPT-4.0 offers an engaging and conversational response with a broader narrative style. It explores multiple possibilities and diagnostic methods, making it comprehensive but less precise. GPT-4.0’s approach suits those who want detailed insights and contextual explanations. For this image, I would prefer the response of Llama 3.2 90B.
3. Natural Landscape
Prompt: Describe the landscape image, focusing on the terrain, weather, and time of day or night. Highlight the interplay of colors and lighting to bring the scene to life.
Llama 3.2 90B | GPT 4o |
“The image features a sun-drenched mountain range adorned with vibrant green grass and trees. The transparent blue sky is dotted with a few wispy clouds, while the meadow boasts sprawling yellow and green grassland and almost endless trees. Several small wooden buildings are scattered in front of the mountains. The sun appears to be at its highest point for the day. The sky and mountain ranges blend seamlessly into a stunning backdrop. The image appears to be a high-quality photo rather than a painting.” | “The landscape in the image showcases a breathtaking alpine setting with a rich interplay of natural elements, vibrant colors, and serene lighting.
Terrain: The foreground features a lush, rolling meadow dotted with scattered wooden cottages, hinting at a peaceful, rural environment. Weather and Time of Day: The weather appears to be clear and sunny, with a bright blue sky punctuated by a few soft, white clouds. This suggests a warm summer day, perfect for outdoor exploration. The sunlight bathes the landscape, highlighting the greenery and casting gentle shadows that add dimension and texture to the scene. The lighting implies it is mid-morning or early afternoon when the sun is high but not directly overhead. Interplay of Colors: The vivid greens of the meadows and forests create a harmonious base, blending naturally with the muted browns and greys of the mountain rocks. Lighting: The sunlight enhances the colors of the scene, making the greens more vibrant and the mountains’ textures more pronounced. This idyllic alpine landscape perfectly balances tranquility and grandeur with its vivid colors, crisp lighting, and harmonious blend of natural and man-made elements. It evokes a sense of peace and wonder, inviting the viewer to immerse themselves in its beauty.” |
Comparative Analysis
GPT-4.0 showcases a stronger contextual understanding by capturing intricate details such as terrain, lighting, weather, and the interplay of colors between elements, offering a richer depiction of the scene. It also excels in creativity. In contrast, Llama 3.2 focuses more on the primary components, lacking the vividness that makes descriptions engaging. I would prefer the response of GPT4 o here.
4. Technical Diagrams
Prompt: Explain this circuit diagram and identify all the symbols of the electrical components used.
Llama 3.2 90B | GPT 4o |
“Circuit Diagram Explanation
The accompanying circuit diagram illustrates a fundamental electrical setup, comprising a power source, a switch, and a load. The diagram shows: A DC power source, denoted by the (+) and (-) symbols, which generates an electric potential difference to drive the circuit. Electrical Component Symbols DC Power Source: (+) and (-) What to Note Crucial information about a circuit includes the direction of the current, the type of current (AC or DC), and the potential difference (voltage) generated by the power source. The diagram also includes symbols for an ammeter, measuring current, and a ground, indicating where the circuit is earthed. This diagram and accompanying symbols provide a clear and concise representation of a simple electrical circuit, straddling fundamental electrical concepts such as potential difference, current flow, and circuit components.” |
“The circuit diagram shown represents a simple closed electrical circuit. Below is an explanation of the circuit and the identification of the electrical components:
Explanation of the Circuit Current Flow: The circuit demonstrates the flow of electric current (denoted as III) through the components in a loop. Key/Switch: The symbol “K” represents a switch that can open or close the circuit. In this diagram, it appears to be closed, allowing current to flow. Direction of Current: The arrows in the diagram indicate the direction of conventional current flow (from positive to negative terminal of the battery). Identification of Electrical Components Battery (Symbol: +H−+H-+H−): Summary This is a basic electrical circuit where the current flows from the battery through the resistor and ammeter, with the flow controlled by the switch. It is a fundamental representation of how electrical components are connected and work together in a circuit.” |
Comparative Analysis
Neither of the models has identified the bulb symbol ( electronic component shown in circuit diagram). Llama 3.2 90B explains concepts in detail but struggles to accurately identify some electronic components, such as the switch, battery, and load. In contrast, GPT-4o recognises most of the components apart from the bulb. It delivers clearer and more accurate responses making it my choice for this image.
5. Natural Phenomenon
Prompt: What natural phenomenon is depicted in this image, and what are the factors that cause it?
Llama 3.2 90B | GPT 4o |
“The image depicts the aurora borealis, also known as the northern lights, which is a natural phenomenon that occurs when charged particles from the sun interact with the Earth’s magnetic field and atmosphere. The factors that cause the aurora borealis include:
Solar winds: The sun emits solar winds, which are streams of charged particles that travel through space. The combination of these factors results in the spectacular display of the aurora borealis, which can be seen in the polar regions of the Northern Hemisphere.” |
“The image depicts the Aurora Borealis, commonly known as the Northern Lights, a stunning natural phenomenon.
What Causes the Aurora Borealis? Solar Wind: The phenomenon begins with the sun emitting charged particles (solar wind) during solar flares or coronal mass ejections. Why is it Special? The interplay of colors, patterns, and motion creates an ethereal display that varies depending on solar activity, geographic location, and atmospheric conditions. This makes auroras a breathtaking spectacle and a bucket-list experience for many.” |
Comparative Analysis
Llama 3.2 90B focuses more on scientific factors (e.g., density and emission) and is better suited for technical or academic audiences. Its response provides strong technical depth but lacks a detailed visual focus. On the other hand, GPT-4.0 delivers a more descriptive and engaging explanation, painting a vivid picture of the aurora’s beauty while explaining the reasons behind the different colors in the Aurora Borealis. GPT-4.0 strikes a balance by emphasizing both scientific accuracy and visual appeal.I would prefer the response of Llama 3.2 90B because of this scientific clarity.
6. Food Photography
Prompt: Identify the food in the image and list its main ingredients. Provide step-by-step instructions for preparing it, including details about techniques, ingredient measurements, and tips to enhance its flavor and presentation. Ensure the explanation is clear, engaging, and highlights what makes the dish special.
Llama 3.2 90B | GPT 4o |
“Pancakes with Strawberries and Caramel Sauce
Description Ingredients 1 cup all-purpose flour Instructions Whisk together flour, sugar, baking powder, and salt. Tips Use fresh strawberries for the best flavor and texture. |
“The dish in the image is a stack of pancakes topped with caramel sauce, fresh strawberries, nuts, powdered sugar, and a sprig of mint, served alongside whipped cream. Pancakes are a classic breakfast or brunch treat, loved for their fluffy texture and versatility.
Main Ingredients: For the Pancakes: 1 cup (120g) all-purpose flour For Toppings: Fresh strawberries (sliced) Preparation Instructions: Step 1: Prepare the Batter Mix Dry Ingredients: In a large bowl, whisk together the flour, sugar, baking powder, baking soda, and salt. Step 2: Cook the Pancakes Heat a non-stick skillet or griddle over medium heat. Step 3: Prepare the Toppings Warm the caramel sauce slightly to make it pourable. Step 4: Assemble and Serve Stack the pancakes on a plate (around 3-5 per stack). Tips for Enhanced Flavor and Presentation: Buttermilk Substitute: Replace regular milk with buttermilk for extra fluffy pancakes and a subtle tang. This dish stands out for its balance of warm, fluffy pancakes and the sweet, nutty, and fruity toppings, making it a visually stunning and delicious treat!” |
Comparative Analysis
Llama 3.2 focuses on practical recipe creation and preparation guidance, making it ideal for readers who want straightforward instructions to recreate the dish. In contrast, GPT-4o goes beyond functionality, offering visually rich descriptions that highlight the dish’s aesthetic and culinary artistry, creating a more engaging and immersive experience for readers. Additionally, GPT-4.0 provides separate, detailed instructions for each component of the dish, enabling greater customization and flexibility. So, i would take up the response of GPT 4o.
7. Infographics
For this example, I am using Canadian Natural Resources Ltd.’s stock data from the past year. I’ve sourced it from Google Finance, a website hosted by Google that provides real-time financial data.
Prompt: Explain the company’s stock graph for a beginner, highlighting key trends, patterns, and significant price changes, and provide insights for potential investors.
Llama 3.2 90B | GPT 4o |
“The stock graph for Canadian Natural Resources Ltd. shows a significant increase in the stock price over the past year, with some fluctuations. Here’s a breakdown of the key trends, patterns, and significant price changes: Overall Trend: The stock price has increased by approximately 33% over the past year, indicating a strong upward trend. Key Patterns: Uptrend: The stock price has been consistently increasing over the past year, with some minor corrections. Volatility: The stock price has experienced some volatility, with significant price swings in both directions. Support and Resistance: The stock price has formed a strong support level around $30 and a resistance level around $35. Significant Price Changes: Recent Rally: The stock price has experienced a significant rally over the past few months, with a gain of over 10%. Correction: The stock price experienced a correction in the middle of the year, with a decline of around 5%. Breakout: The stock price broke out above the resistance level of $35 in the recent rally, indicating a potential continuation of the uptrend. Insights for Potential Investors: Strong Fundamentals: The company’s strong financial performance and increasing demand for its products suggest a positive outlook for the stock. Volatility: The stock’s volatility may make it suitable for investors who are comfortable with taking on more risk. Support and Resistance: The strong support and resistance levels may provide opportunities for investors to buy or sell the stock at favorable prices. Overall, the stock graph for Canadian Natural Resources Ltd. suggests a strong upward trend with some volatility. Potential investors should consider the company’s fundamentals, volatility, and support and resistance levels when making investment decisions.” |
“Analysis of the Stock Graph (Canadian Natural Resources Ltd):
1. Key Trends and Patterns: Current Stock Price: The stock closed at $33.78, showing an increase of 0.33% (+$0.11) from the previous close of $33.67. 2. Significant Price Changes: Morning Movement: After the market opened, the stock showed initial volatility, with some quick upward spikes followed by brief declines. 3. Insights for Potential Investors: Consistency: The overall price movement suggests a relatively stable stock with low volatility, making it potentially appealing to risk-averse investors. Would you like a deeper analysis of this company’s historical trends or broader market indicators?” |
Comparative Analysis
This chart is an intra-day chart for the stock price of Canadian Natural Resources. Llama 3.2 90B fails to focus just on the image and instead gives us stock analysis for the past years. In contrast, GPT-4o provides precise day-specific analysis along with actionable insights for both short-term and long-term investors, making it ideal for beginners or casual investors. Therefore, I would prefer GPT 4o.
8. Sports Photography
Prompt: Identify the sport depicted in the image. Name five popular international players of this sport and the countries they represent
Llama 3.2 90B | GPT 4o |
“The sport depicted in the image is soccer. The following are five popular international soccer players along with the countries they represent:
Lionel Messi – Argentina |
“The sport depicted in the image is football (soccer).
Five Popular International Football Players and Their Countries: |
Comparative Analysis
Both Llama 3.2 90B and GPT 4o provide a very similar analysis in this case. Both of them are able to identify the sport’s name and list down the popular international players. One name differs in the output but that can be subjective. I will call it a draw as is often the case with football.
9. Cartoons
Prompt: Identify the character in the image and list its movies with their release years.
Llama 3.2 90B | GPT 4o |
“The character is Bob, a Minion from the Despicable Me franchise. The films that include the Minion characters are:
Despicable Me (2010) |
“The characters in the image are Minions, the iconic yellow creatures from the Despicable Me franchise.
Movies Featuring Minions and Their Release Years: Would you like more information about these movies?” |
Comparative Analysis
Llama 3.2 90B performs better in specificity (naming “Bob”) making it more precise for identification but did not recognize the other two characters in the image. On the other hand, GPT 4o gives a broader perception (mentions minions as a group) and adds creativity through a conversational tone and descriptive language. Since Llama 3.2 90B was able to find one character, I would go with it.
10. Architectural Designs
Prompt: Describe the architectural style depicted in the image. Mention key features, materials, and design elements that define it.
Llama 3.2 90B | GPT 4o |
“The architectural style depicted in the image is Islamic architecture, specifically Ottoman architecture. The key features of this style include:
Domed ceilings: The image shows a large dome with intricate designs and patterns, which is a characteristic feature of Islamic architecture. In terms of materials, Islamic architecture often employs: Design elements that define Islamic architecture include: Overall, the image depicts a beautiful example of Islamic architecture, specifically Ottoman architecture, with its characteristic domed ceilings, arches, minarets, calligraphy, geometric patterns, and tile work.” |
“The architectural style depicted in the image is Islamic architecture, characterized by its intricate and ornate design elements, spiritual ambience, and attention to symmetry and geometry. Below is a detailed description of its key features, materials, and design elements:
Key Features: 1. Geometric Patterns and Symmetry: 2. Arches and Windows: 3. Dome Design: 4. Calligraphy: 5. Color Palette: 6. Flooring and Carpeting: Materials: Stained Glass: Used in the windows to create colorful reflections and enhance the ambience. Design Elements: Spiritual Geometry: Reflecting divine order and infinite complexity. This design epitomizes the beauty of Islamic architecture, blending functionality, artistry, and spiritual symbolism to create an awe-inspiring space.” |
Comparative Analysis
Llama 3.2 90B precisely identifies the architectural style as Ottoman, while GPT-4.0 generalizes it as Islamic architecture. Llama demonstrates a deeper understanding of specific elements, focusing on key features, materials, and design details, making it ideal for readers seeking a straightforward and accurate explanation. On the other hand, GPT-4.0 provides an engaging and holistic description, blending factual details with insights into the spiritual and artistic significance of the features. This comparison highlights Llama’s precision in technical identification versus GPT’s ability to offer a broader contextual and interpretive narrative. I would prefer Llama 3.2 90B for this image.
Final Score
Llama 3.2 90B : 4
GPT 4o: 5
Draw : 1
Conclusion
Both Llama 3.2 90B and GPT-4o excel in distinct areas of image analysis. Llama 3.2 focuses on precision and functionality, offering factual and straightforward descriptions ideal for technical or academic contexts (Exception: Circuit diagram). It provides strong technical depth but lacks creative engagement. In contrast, GPT-4o excels in visually rich and imaginative descriptions. While Llama 3.2 90B suits readers seeking clarity and concise explanations, GPT-4o engages a broader audience with its conversational tone. Together, these models cater to diverse needs, showcasing the versatility of ai in image analysis and interpretation.
Frequently Asked Questions
A. Llama 3.2 90B vision model is an open-source model. With its 90 billion parameters, it excels in tasks such as image understanding, visual reasoning, and image-text retrieval, positioning it as a strong alternative to GPT-4o, whose exact parameter count remains unknown.
A. The maximum input image size is 20 MB in GPT-4. The <a target="_blank" href="https://www.jetson-ai-lab.com/llama_vlm.html?utm_source=chatgpt.com” target=”_blank” rel=”nofollow noopener”>Llama 3.2 90B Vision model supports high-resolution image inputs up to 1120×1120 pixels.
A. Llama 3.2 90B’s unique architecture integrates text and image processing, enabling advanced visual reasoning and image-text retrieval. It has 90 billion parameters, a vision encoder for high-resolution images, and supports up to 128,000 tokens.
A. GPT-4 supports PNG (.png), JPEG (.jpeg and .jpg), and non-animated GIF (.gif), while Llama 3.2 90B Vision only supports png, .jpg, or .jpeg format.
A. Llama 3.2 90B and GPT-4o offer precise, step-by-step explanations for identifying medical conditions. However, they may occasionally produce hallucinated outputs, so human evaluation is essential.