Computer Vision Analysis with Language Models Demo
Demonstrates how language models can analyze images using computer vision to identify objects, count people, and extract details.
Transcript:
When we think about language models a lot of the time, we think about how it’s dealing with text, but something that’s really useful with language models is also dealing with images. Both chat GPT and Clawed and other language models use a tool called computer vision, which enables it to interpret an image in some form. So in this case, I found an image on the Wander and we’re going to use computer vision to do an analysis of the image. So I’ve asked Clawed in this case to describe the key elements in the image, identify any text that appears, explain the relationships between objects and unusual, on notable aspects, and to specifically be aware of objects and their attributes, text and labels, spatial relationships and context clues. In this case, the analysis is quite detailed. It’s told us that it can see market stalls with green and teal awning labeled are a market. It’s identified that there’s vendors and shoppers and visitors and there’s historic architecture visible in the background. And then the market is under some metal beams. It’s identified relationships between crowds of people navigating the stalls, the historic buildings, the trees and the greenery and the merchandise that’s displayed up to. Now in this case, this isn’t amazingly useful information unless our job was analyzing images, but it may be useful because it allows us to get more detail about a specific aspect of the image. In this case, I’ve asked can you count the number of people in the image and it’s identified approximately 15 to 20 people and told me why it’s hard to do that. I’ve then pressed it for an exact count and you can see what it’s done is it’s identified 16 distinctly visible people in the image. In isolation, analyzing a single image on its own, not amazingly useful, but you can see the power available when it comes to analyzing dozens or hundreds of these sort of images. In this case, we looked at a photograph, but it could be a screenshot of a website to a diagram from a textbook, really anything that you wanted to analyze that has an image.