Object Recognition
The first problem in high-level vision that we’ll consider
is the problem of object recognition.
What I mean by object recognition is the ability to look at an image of
something and say what it is. Are you looking at a cat, a dog, a person, a car,
or something new that you’ve never seen before? Coming up with some kind of
label to assign to a part of an image based on what you think it is more or
less encompasses what object recognition is.
There are several different kinds of labels we might try to
assign to the things we look at, and it’s worth distinguishing between these
before we start to think about this problem. First, we might think about naming
objects using the sort of common labels you use all the time to refer to the
things around you: mug, pencil, shoe, etc. Identifying objects using these
kinds of labels is what we call basic-level
categorization. The definition of what counts as the basic level is honestly a bit
hand-wavy: We usually define it as the first label that comes to mind when you
see an object. For this reason, it’s sometimes referred to as entry-level categorization, too. An
alternative to basic-level categorization is recognizing objects at what we
call a subordinate level. This means
coming up with a label that is more specific than the entry-level, like a
person’s name, or recognizing that a car is your
car rather than just any old car. The other kind of recognition we might try to
carry out sometimes is superordinate
categorization. This refers to categorizing objects using labels that are
more general than the basic level: Recognizing a dog as an animal, for example,
or a shirt as a garment. Each kind of recognition requires a different type of
generalization of the label over images you might encounter. For example,
recognizing objects as animals means you have to be able to look at a giraffe,
a mouse, and a person, and assign them all the same label. Recognizing objects
as people means you have to assign the same label to all the humans you
encounter, while recognizing an object as your friend Dave means assigning the
same label to all the images of Dave you might encounter.
Figure 1 - We could label this object as a cat (basic-level category), as an animal (superordinate category), or as my (awful) cat Leila (subordinate category). Each label involves generalizing over a different class of images.
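If it helps to see how these levels nest, here’s a tiny sketch in Python. The particular hierarchy and names are just made-up examples for illustration, nothing more:

```python
# A made-up sketch of how the same object gets different labels at different
# levels of categorization. The hierarchy here is purely illustrative.
label_hierarchy = {
    "Leila": {"basic": "cat",    "superordinate": "animal"},
    "Dave":  {"basic": "person", "superordinate": "animal"},
}

def categorize(subordinate_label, level):
    """Return the label for an object at the requested level of categorization."""
    if level == "subordinate":
        return subordinate_label            # the most specific label we have
    return label_hierarchy[subordinate_label][level]

# The same cat picks up a different label depending on the level we use.
print(categorize("Leila", "subordinate"))    # Leila
print(categorize("Leila", "basic"))          # cat
print(categorize("Leila", "superordinate"))  # animal
```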
This last description of object recognition hints at the
computational problem at the heart of this aspect of high-level vision.
Recognizing objects is really all about learning how to generalize in the right
ways. A good recognizer can work out that different images of the same thing
should get the same label, while images of different things should get
different ones. This might sound easy, but it turns out to be quite hard. The
specific reason that it’s so difficult has to do with the way that some
concepts we’ve introduced earlier might be applied to this task, and how they
fail when we try to do so.
To be more specific, I want you to remember the way we
compared cones’ responses to different colors way back in our discussions of
low-level vision: We came up with LMS-numbers for each color we saw and then
used the distance between those colors to talk about how similar they would be
to each other. Being far apart in LMS-space meant that two colors would look very different; being close meant they’d look much the same. Object recognition could
be pretty easy if we knew how to use a trick like that to talk about how
similar images were to each other! For example, if we had two pictures that we
thought might be the same object, we could try to measure something like the
distance between them and use that to decide if we should assign the same label
to each one. If the images were close enough, maybe they should be called the
same thing.
Believe it or not, we can try to do this and see how it
goes: Remember that we had a language for describing images back when we were
talking about the LGN and V1. We can turn a pattern of light into an array of
numbers by assigning large numbers to places where there’s lots of light, and
small numbers to places where there’s not very much. This means that each image
we have turns into a big list of numbers, which we can also think about as a
point in a space with many, many dimensions. If this sounds hard to picture, it
is. We don’t really need to try, though, because we can just use this list of
numbers to calculate a distance between our images. Specifically, we can extend
the Pythagorean distance formula so that it has as many terms under the square
root sign as we have numbers in our list. The end result is that we can calculate
distances between images and see if they help us assign labels to objects at
all. So does it work?
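To make that concrete, here’s a minimal sketch of the calculation in Python. The toy 2x2 “images” are made up; any two same-sized grayscale images would do:

```python
import numpy as np

# A minimal sketch of the distance calculation described above: flatten each
# image into one long list of pixel values, then apply the Pythagorean
# (Euclidean) distance formula with one term per pixel.
def image_distance(image_a, image_b):
    a = np.asarray(image_a, dtype=float).ravel()   # image -> big list of numbers
    b = np.asarray(image_b, dtype=float).ravel()
    return np.sqrt(np.sum((a - b) ** 2))           # square root of summed squared differences

# Toy 2x2 "images": big numbers where there's lots of light, small where there isn't.
img1 = [[0.9, 0.1],
        [0.9, 0.1]]
img2 = [[0.8, 0.2],
        [0.8, 0.2]]
print(image_distance(img1, img2))   # a small distance: pixel-wise, these look alike
```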
Sadly, no, and here’s why: (1) Different images of the same object can be very dissimilar, and (2)
Images of different objects can be very similar. If you don’t believe me,
take a look at the three pictures below. Two of these images depict the same
person, and the third is someone different. Which one is which? I’m guessing
you worked this out pretty easily. The two pictures on the left depict the same
person, but he’s shown in profile in one image and in a frontal view in the
other. Here’s the thing: If you calculate distances between these images the
way I described above, you’ll find out that the two frontally-viewed faces are
far more similar to each other than any other pair. Looking at them, it’s kind
of easy to see why – they both have similar patterns of where light and dark
pixels are, and that pattern’s very different from the face that is viewed in
profile.
Figure 3 - We are not lucky. Images of the same object (left two images) can be very far apart when we calculate distances using pixel values. Images of different objects (right two images) can be very close when we do the same thing.
So here’s where we start. We wish we could use distances to measure
how similar object images are, but it just doesn’t lead to good answers
regarding what we’re looking at. Our visual system must be doing something else
to help us recognize the objects around us, but what? In the rest of this post,
I’m going to introduce two different ideas about what that something else might
be. In both cases, the big idea is that the visual system must use a different
description of what an image looks like to carry out recognition. Where the two theories differ is in the nature of that description. We’ll start by discussing
a class of models called structural
description models, and a particular
account of object recognition called Recognition
By Components.
Structural Description Models
Before I start describing what structural description models
are, let’s remind ourselves what problem we’re trying to solve. We want to
be able to recognize objects correctly even though images of the same object
can look very different, and images of different objects can look very similar.
To cast this in terms we’ve used earlier, we want to achieve perceptual constancy for different
images of the same object. That is, even though the images we see of an object
may be very different from one another, we’d like to be able to produce a
response that is stable.
Figure 4 - We would really like to be able to do this when we recognize objects: Maintain the same response (e.g. "Mug") to many different images of the same object.
One way to try and do this is to develop a description of an
object that is unlikely to change much when we see different images of the
object. If we could find a way to get a description like that, we’d pretty much
have the problem solved, right? Structural
Description Models are based on the idea that one class of description that should work this way is a description based on the 3D form of an object rather than its 2D appearance. That is, if we can describe an object image in terms of the 3D parts that we think the object is made of, we’ll have the problem of
object constancy solved. To be more
specific about how this should work, here’s a sort of recipe for trying to
recognize an object using an SDM:
1) Look at an image of an object.
2) Use that image to describe what 3D
shapes make up the object.
3) While you’re at it, describe how
those parts go together.
4) Now you have a structural description of the object. If you see another image, repeat steps 1-3 and see if the descriptions match. If they do, they’re the same thing! (There’s a toy sketch of this matching step below.)
Figure 5 - If we can look at images of objects and recognize parts of objects and how they go together, we could end up with a 3D description of each object we see that's based on its structure rather than its appearance. This might make it easier to recognize objects in an invariant way: If we can always get to the same 3D model of an object, it won't matter what view we see.
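To make the matching step (step 4) a little more concrete, here’s a toy sketch. Everything in it – the part names, the relations, the descriptions – is invented for illustration, and it quietly assumes the genuinely hard steps (2 and 3) have already been solved:

```python
# A toy sketch of the matching step (step 4). The parts, relations, and
# descriptions below are invented for illustration, and steps 2 and 3
# (actually extracting parts and relations from an image) are assumed solved.
mug_seen_from_the_side = {("cylinder", "SIDE-ATTACHED", "curved-handle")}
mug_seen_from_above    = {("cylinder", "SIDE-ATTACHED", "curved-handle")}
bucket                 = {("tapered-cylinder", "TOP-ATTACHED", "curved-handle")}

def same_object(description_a, description_b):
    """Two images get the same label if their structural descriptions match."""
    return description_a == description_b

print(same_object(mug_seen_from_the_side, mug_seen_from_above))  # True: same object
print(same_object(mug_seen_from_the_side, bucket))               # False: different objects
```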
This sounds good, but there are a lot of things for us to
think about. First of all, what kinds of 3D shapes make up objects? This recipe
assumes that we have some kind of list of parts we can use to make these
structural descriptions, but what should those be? Second, whatever those parts
are, they need to be easy to measure in an image. Measuring the 3D shape of a
whole object is hard, but maybe it’s easier to measure the 3D shapes of simple
parts. We’d better think about this as we choose what parts to use. Finally,
we’ll need to think about how to describe the way parts go together. What spatial
relationships should we use to talk about how to assemble a complex object out
of simpler parts?
One approach to answering some of these questions is a
theory called Recognition-by-Components, or
RBC for short. This theory is a
particular kind of structural description model that relies on a specific set
of 3D parts and spatial relationships between them. Specifically, it relies on
a set of 3D parts called geons, which is short for geometric ions.
Geons are really a class of objects called generalized
cylinders. When you hear the word “cylinder,” you may think of something
like a soda can. To make that kind of shape more general, we can change a
number of things about such a cylinder to make some different 3D shapes. For
example, what if the base of the can was a square instead of a circle? That
base could have any shape we like, which leads to different parts. A soda can
can be thought of as the shape you get by putting a circle on the table,
lifting it up, and keeping the volume swept out by the circle. What if we
didn’t pull that circle (or other shape) straight up, but let it curve around
as we went? That would lead to still more 3D shapes. Finally, what if we let
the circle (or other base shape) change size as we pulled it up off the table?
This would lead to still more shapes that tapered or expanded along their long
axis. The figure below depicts a bunch of different cylinders made by varying
these properties – these are a start at coming up with some geons to try and
build objects with.
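If you’d like to see the “sweeping” idea spelled out, here’s a rough sketch that generates surface points for a few of these shapes by dragging a cross-section along an axis. The parameters and their names are illustrative choices of mine, not anything from the theory itself:

```python
import numpy as np

# A rough sketch of a generalized cylinder: drag a base cross-section along an
# axis, optionally letting it shrink or grow (taper) and letting the axis bend (curve).
# Parameter names and values here are illustrative assumptions.
def generalized_cylinder(n_sides=24, height=1.0, taper=1.0, curve=0.0, steps=50):
    """Return an (N, 3) array of surface points for one swept 3D shape.

    n_sides: sides of the base polygon (4 gives a square base, many sides approximate a circle)
    taper:   size of the cross-section at the top relative to the bottom
    curve:   how far the axis bends sideways from bottom to top
    """
    angles = np.linspace(0, 2 * np.pi, n_sides, endpoint=False)
    points = []
    for t in np.linspace(0, 1, steps):            # t sweeps from the base to the top
        scale = 1.0 + (taper - 1.0) * t           # cross-section changes size as we go
        bend = curve * t ** 2                     # axis curves away from vertical
        for a in angles:
            points.append((scale * np.cos(a) + bend,
                           scale * np.sin(a),
                           height * t))
    return np.array(points)

soda_can   = generalized_cylinder()                      # straight, circular, no taper
square_rod = generalized_cylinder(n_sides=4)             # square cross-section
horn       = generalized_cylinder(taper=0.2, curve=0.5)  # tapers and curves: a new part
```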
Why should you use these to recognize objects? What makes
these good parts? One answer to this question is that they seem pretty flexible
– we can probably build a lot of different objects out of these things. See below for some examples:
Figure 7 - Examples of objects built from geons. A structural description of each object would have to include the list of geons as well as the way they go together (Geon 4 ON-TOP-OF Geon 3, e.g.).
Perhaps
more importantly, though, they turn out to have some important properties that
mean we can possibly use 2D images of objects to measure the presence of these
specific 3D parts. Specifically, geons have a number of non-accidental properties. This refers to arrangements of edges,
curves, and corners in the image that are consistent across different 2D views
of the same 3D object. Another way to think about this is to say that there are
properties of the 2D images of these parts that reliably signal specific things
about their 3D shape. For example, if you see parallel lines or curves in the
image of an object, that almost certainly means that there are really parallel
3D lines on that object. Similarly, if we see three lines meet at a corner,
that almost certainly means that there really is a 3D vertex there on an
object. These properties are called non-accidental
because they’re very unlikely to have happened “by accident.” Instead, they
probably happened because there’s a 3D object there that looks this way. This
means that when we see some of these things in an image, we can be pretty
certain that a specific 3D geon is there somewhere. In turn, this means that we
can start actually carrying out some of the steps of the recipe described
above.
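As a small illustration of how one of these cues could be checked in an image, here’s a sketch that flags roughly parallel 2D edges. The way edges are represented and the tolerance used are assumptions of mine, not part of RBC:

```python
import numpy as np

# A toy check for one non-accidental property: roughly parallel lines in an image.
# Edges are given as pairs of 2D endpoints; the 5-degree tolerance is an arbitrary choice.
def are_roughly_parallel(edge_a, edge_b, tolerance_degrees=5.0):
    def orientation(edge):
        (x1, y1), (x2, y2) = edge
        return np.degrees(np.arctan2(y2 - y1, x2 - x1)) % 180.0
    difference = abs(orientation(edge_a) - orientation(edge_b))
    difference = min(difference, 180.0 - difference)   # orientations wrap around at 180
    return difference < tolerance_degrees

# Two nearly parallel image edges are very unlikely to be an accident of viewpoint,
# so they are good evidence for genuinely parallel 3D edges on the object itself.
edge_1 = ((0.0, 0.0), (1.0, 0.02))
edge_2 = ((0.0, 0.5), (1.0, 0.51))
print(are_roughly_parallel(edge_1, edge_2))   # True
```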
Does it work? This is where we have to think about how we’d
know if a theory like this explains what people do. One good way to check if a
theory like this is right is to ask if it helps explain cases where object
recognition is easy and other cases where it is hard. In particular, RBC makes
a specific prediction about when object recognition should be difficult. If you can’t measure the geons, you can’t
recognize the object. When would we be unable to measure the geons? Recognizing the parts of an object depends on being able to see the non-accidental properties that signal the presence of a specific geon. Are
there some images of objects that make it hard to see these? As it turns out,
yes there are. So-called “Accidental” or non-canonical
views of a geon are images that make it hard or impossible to see the features
you need to guess what 3D shape the part is. One good example of this is to imagine what a pencil looks like if you point it straight at you – instead of a long, tapered cylinder, you see a small circle. Images of objects that include
lots of foreshortened object parts like this might therefore be hard to
recognize. This turns out to be true, which suggests that maybe RBC is on the
right track! We can also try masking
specific parts of an object, which refers to blocking them out somehow, to see
whether removing the non-accidental features hurts object recognition a great deal. It does, which again suggests that maybe RBC captures some aspects of how object recognition works.
Figure 9 - It's hard to recognize geons when they're oriented so that we can't see the features we need. Likewise, it's hard to recognize real objects when we see them from certain strange (or non-canonical) views. This might be because these views make it hard to measure the 3D parts of an object.
Figure 10 - It's also hard to recognize an object when we cover up some of the features you need to recognize the 3D parts that it's made of. The picture at left is missing these parts, while the picture at right has the same amount of stuff covered up, but leaves in the vertices, parallel lines, and other image-based clues to 3D shape. Which one looks more like a flashlight to you?
View-based models
Ah, but there are some problems with structural description
models, too. First of all, how do you pick geons to recognize a face? Or your car rather than just a car? It’s a little hard to imagine how this would work, which is worrying. The real problem, however, goes a little
deeper and strikes at the heart of the problem RBC and other structural models
are trying to solve.
Specifically, consider the objects below. These are little
chains of cubes strung together to make a sort of squared-off coiled object.
These little objects should be very easy to describe with geons. There are
loads of non-accidental features to see (corners, parallel lines, etc.) and
simple relationships between parts we can use to describe how they go together.
If I asked you to learn the difference between two different cube-chains like
this by showing you some images of them, RBC would predict that you could
measure the geon structural description in each image, and use it to recognize
the objects again using any image of them that makes it possible to measure the
parts. Y’know what? Let’s try it – let’s teach some people about these objects
by showing them some images, and then let’s test them by showing them new
images of the objects that they haven’t seen yet, but should make it easy to
measure geons. They’ll be good at this, right?
Figure 11 - The shapes on the left should be easy to describe with geons, but if I only let you see the images in the leftmost column, you may struggle to recognize the images in the next column over. RBC doesn't predict this, but a view-based model does.
As it turns out, they’re not good at this. To be more
precise, they’re good at the task when we show them images they’ve seen before,
but they get worse when we show them images they haven’t seen before. That is,
under some circumstances, real people don’t achieve object constancy! In a
sense, RBC is too good. As you can
see below, people do worse as images get more different from what they’ve
learned, but RBC should let you keep doing just fine with new images.
So what are you doing instead? An alternative to RBC in particular, and to structural description models in general, is to go back to our starting
point: What if you use something like the distances between images to recognize
objects? We already saw that this doesn’t work, but what if we change how this
works just a little? Specifically, what if you don’t just use one image of an
object to try and recognize it, but you use lots
of images instead? This key idea is the basis for view-based models of
object recognition.
The recipe for recognizing objects in a view-based framework
goes something like this:
1) Learn about objects by looking at
them a lot.
2) Store a lot of pictures of each
object in a sort of photo album dedicated to that object.
3) When you see a new image, check
your photo albums to see if there’s anything in there that’s similar (based on
distance!) to what you’re seeing now.
4) If you find a decent match, check
which album it was in. That’s what this object is!
Figure 12 - In a view-based framework, we imagine that you don't store 3D models of objects you want to recognize. Instead, you store a sort of 'photo album' for each object that contains many different pictures of that object. Now, you can try to recognize new object images by looking for a match in your various photo albums.
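Here’s a minimal sketch of that recipe, reusing the same pixel-distance idea from earlier. The “photo albums” below are tiny made-up 2x2 views; a real system would store many views of each object:

```python
import numpy as np

# A minimal sketch of the view-based recipe: store a "photo album" of views for
# each object, and give a new image the label of its nearest stored view.
# The albums below are tiny made-up 2x2 views; real ones would hold many images.
def image_distance(image_a, image_b):
    a = np.asarray(image_a, dtype=float).ravel()
    b = np.asarray(image_b, dtype=float).ravel()
    return np.sqrt(np.sum((a - b) ** 2))

def recognize(new_image, photo_albums):
    """photo_albums maps each object label to a list of stored views (step 2)."""
    best_label, best_distance = None, np.inf
    for label, views in photo_albums.items():
        for stored_view in views:                        # step 3: search the albums
            d = image_distance(new_image, stored_view)
            if d < best_distance:
                best_label, best_distance = label, d
    return best_label                                    # step 4: label of the best match

albums = {"mug": [[[0.9, 0.1], [0.9, 0.1]]],
          "cat": [[[0.1, 0.9], [0.1, 0.9]]]}
print(recognize([[0.8, 0.2], [0.8, 0.2]], albums))       # "mug"
```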
These kinds of models save us from some hard problems. We
don’t have to try and measure 3D things in 2D images, for example. We also have
a good explanation for why you have a hard time with images of those cube-y
objects you haven’t seen before: There are no photos in your photo albums that
are good matches! The trade-off is that we have some new hard problems to think
about. How many photo albums do you have to keep around, and how many images
have to be in each? Maybe more importantly, how on Earth do you look through
all of those albums to try and find a match?
Figure 13 - Both structural description models and view-based models have advantages and disadvantages. Understanding what the brain actually does will require much more work to examine the circumstances under which object recognition is easy and difficult. (Cartoon courtesy of Pawan Sinha).
A lot of current
research in object recognition is focused on understanding how other kinds of
object descriptions might help us get closer to the “object code” your visual system
really uses, but we’re still a long way off. For now, I hope I’ve given you a
sense of what makes object recognition a tough nut to crack, and what kinds of
ideas we’ve developed over the last thirty years or so to try and organize our
thoughts about how it might be carried out.