Object Recognition
The first problem in high-level vision that we’ll consider
is the problem of object recognition.
What I mean by object recognition is the ability to look at an image of
something and say what it is. Are you looking at a cat, a dog, a person, a car,
or something new that you’ve never seen before? Coming up with some kind of
label to assign to a part of an image based on what you think it is more or
less encompasses what object recognition is.
There are several different kinds of labels we might try to
assign to the things we look at, and it’s worth distinguishing between these
before we start to think about this problem. First, we might think about naming
objects using the sort of common labels you use all the time to refer to the
things around you: mug, pencil, shoe, etc. Identifying objects using these
kinds of labels is what we call basic-level
categorization. The definition of what counts as the basic level is honestly a bit
hand-wavy: We usually define it as the first label that comes to mind when you
see an object. For this reason, it’s sometimes referred to as entry-level categorization, too. An
alternative to basic-level categorization is recognizing objects at what we
call a subordinate level. This means
coming up with a label that is more specific than the entry-level, like a
person’s name, or recognizing that a car is your
car rather than just any old car. The other kind of recognition we might try to
carry out sometimes is superordinate
categorization. This refers to categorizing objects using labels that are
more general than the basic level: Recognizing a dog as an animal, for example,
or a shirt as a garment. Each kind of recognition requires a different type of
generalization of the label over images you might encounter. For example,
recognizing objects as animals means you have to be able to look at a giraffe,
a mouse, and a person, and assign them all the same label. Recognizing objects
as people means you have to assign the same label to all the humans you
encounter, while recognizing an object as your friend Dave means assigning the
same label to all the images of Dave you might encounter.
Figure 1 - We could label this object as a cat (basic-level category), as an animal (superordinate category), or as my (awful) cat Leila (subordinate category). Each label involves generalizing over a different class of images.
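If it helps to see how these levels nest, here’s a tiny sketch in Python. The particular hierarchy and names are just made-up examples for illustration, nothing more:

```python
# A made-up sketch of how the same object gets different labels at different
# levels of categorization. The hierarchy here is purely illustrative.
label_hierarchy = {
    "Leila": {"basic": "cat",    "superordinate": "animal"},
    "Dave":  {"basic": "person", "superordinate": "animal"},
}

def categorize(subordinate_label, level):
    """Return the label for an object at the requested level of categorization."""
    if level == "subordinate":
        return subordinate_label            # the most specific label we have
    return label_hierarchy[subordinate_label][level]

# The same cat picks up a different label depending on the level we use.
print(categorize("Leila", "subordinate"))    # Leila
print(categorize("Leila", "basic"))          # cat
print(categorize("Leila", "superordinate"))  # animal
```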
This last description of object recognition hints at the
computational problem at the heart of this aspect of high-level vision.
Recognizing objects is really all about learning how to generalize in the right
ways. A good recognizer can work out that different images of the same thing
should get the same label, while images of different things should get
different ones. This might sound easy, but it turns out to be quite hard. The
specific reason that it’s so difficult has to do with the way that some
concepts we’ve introduced earlier might be applied to this task, and how they
fail when we try to do so.
To be more specific, I want you to remember the way we
compared cones’ responses to different colors way back in our discussions of
low-level vision: We came up with LMS-numbers for each color we saw and then
used the distance between those colors to talk about how similar they would be
to each other. Being far apart in LMS-space meant that two colors would look very different; being close meant they’d look much the same. Object recognition could
be pretty easy if we knew how to use a trick like that to talk about how
similar images were to each other! For example, if we had two pictures that we
thought might be the same object, we could try to measure something like the
distance between them and use that to decide if we should assign the same label
to each one. If the images were close enough, maybe they should be called the
same thing.
Believe it or not, we can try to do this and see how it
goes: Remember that we had a language for describing images back when we were
talking about the LGN and V1. We can turn a pattern of light into an array of
numbers by assigning large numbers to places where there’s lots of light, and
small numbers to places where there’s not very much. This means that each image
we have turns into a big list of numbers, which we can also think about as a
point in a space with many, many dimensions. If this sounds hard to picture, it
is. We don’t really need to try, though, because we can just use this list of
numbers to calculate a distance between our images. Specifically, we can extend
the Pythagorean distance formula so that it has as many terms under the square
root sign as we have numbers in our list. The end result is that we can calculate
distances between images and see if they help us assign labels to objects at
all. So does it work?
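To make that concrete, here’s a minimal sketch of the calculation in Python. The toy 2x2 “images” are made up; any two same-sized grayscale images would do:

```python
import numpy as np

# A minimal sketch of the distance calculation described above: flatten each
# image into one long list of pixel values, then apply the Pythagorean
# (Euclidean) distance formula with one term per pixel.
def image_distance(image_a, image_b):
    a = np.asarray(image_a, dtype=float).ravel()   # image -> big list of numbers
    b = np.asarray(image_b, dtype=float).ravel()
    return np.sqrt(np.sum((a - b) ** 2))           # square root of summed squared differences

# Toy 2x2 "images": big numbers where there's lots of light, small where there isn't.
img1 = [[0.9, 0.1],
        [0.9, 0.1]]
img2 = [[0.8, 0.2],
        [0.8, 0.2]]
print(image_distance(img1, img2))   # a small distance: pixel-wise, these look alike
```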
Sadly, no, and here’s why: (1) Different images of the same object can be very dissimilar, and (2)
Images of different objects can be very similar. If you don’t believe me,
take a look at the three pictures below. Two of these images depict the same
person, and the third is someone different. Which one is which? I’m guessing
you worked this out pretty easily. The two pictures on the left depict the same
person, but he’s shown in profile in one image and in a frontal view in the
other. Here’s the thing: If you calculate distances between these images the
way I described above, you’ll find out that the two frontally-viewed faces are
far more similar to each other than any other pair. Looking at them, it’s kind
of easy to see why – they both have similar patterns of where light and dark
pixels are, and that pattern’s very different from the face that is viewed in
profile.
Figure 3 - We are not lucky. Images of the same object (left two images) can be very far apart when we calculate distances using pixel values. Images of different objects (right two images) can be very close when we do the same thing.
So here’s where we start. We wish we could use distances to measure
how similar object images are, but it just doesn’t lead to good answers
regarding what we’re looking at. Our visual system must be doing something else
to help us recognize the objects around us, but what? In the rest of this post,
I’m going to introduce two different ideas about what that something else might
be. In both cases, the big idea is that the visual system must use a different
description of what an image looks like to carry out recognition. Where the two theories differ is in the nature of that description. We’ll start by discussing
a class of models called structural
description models, and a particular
account of object recognition called Recognition
By Components.
Structural Description Models
Before I start describing what structural description models
are, let’s remind ourselves what problem we’re trying to solve. We want to
be able to recognize objects correctly even though images of the same object
can look very different, and images of different objects can look very similar.
To cast this in terms we’ve used earlier, we want to achieve perceptual constancy for different
images of the same object. That is, even though the images we see of an object
may be very different from one another, we’d like to be able to produce a
response that is stable.
Figure 4 - We would really like to be able to do this when we recognize objects: Maintain the same response (e.g. "Mug") to many different images of the same object.
One way to try and do this is to develop a description of an
object that is unlikely to change much when we see different images of the
object. If we could find a way to get a description like that, we’d pretty much
have the problem solved, right? Structural
Description Models are based on the idea that one class of description that should work this way is a description based on the 3D form of an object rather than its 2D appearance. That is, if we can describe an object image in terms of the 3D parts that we think the object is made of, we’ll have the problem of
object constancy solved. To be more
specific about how this should work, here’s a sort of recipe for trying to
recognize an object using an SDM:
1) Look at an image of an object.
2) Use that image to describe what 3D
shapes make up the object.
3) While you’re at it, describe how
those parts go together.
4) Now you have a structural description of the object. If you see another image, repeat steps 1-3 and see if the descriptions match. If they do, they’re the same thing! (There’s a toy sketch of this matching step below.)
Figure 5 - If we can look at images of objects and recognize parts of objects and how they go together, we could end up with a 3D description of each object we see that's based on its structure rather than its appearance. This might make it easier to recognize objects in an invariant way: If we can always get to the same 3D model of an object, it won't matter what view we see.
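To make the matching step (step 4) a little more concrete, here’s a toy sketch. Everything in it – the part names, the relations, the descriptions – is invented for illustration, and it quietly assumes the genuinely hard steps (2 and 3) have already been solved:

```python
# A toy sketch of the matching step (step 4). The parts, relations, and
# descriptions below are invented for illustration, and steps 2 and 3
# (actually extracting parts and relations from an image) are assumed solved.
mug_seen_from_the_side = {("cylinder", "SIDE-ATTACHED", "curved-handle")}
mug_seen_from_above    = {("cylinder", "SIDE-ATTACHED", "curved-handle")}
bucket                 = {("tapered-cylinder", "TOP-ATTACHED", "curved-handle")}

def same_object(description_a, description_b):
    """Two images get the same label if their structural descriptions match."""
    return description_a == description_b

print(same_object(mug_seen_from_the_side, mug_seen_from_above))  # True: same object
print(same_object(mug_seen_from_the_side, bucket))               # False: different objects
```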
This sounds good, but there are a lot of things for us to
think about. First of all, what kinds of 3D shapes make up objects? This recipe
assumes that we have some kind of list of parts we can use to make these
structural descriptions, but what should those be? Second, whatever those parts
are, they need to be easy to measure in an image. Measuring the 3D shape of a
whole object is hard, but maybe it’s easier to measure the 3D shapes of simple
parts. We’d better think about this as we choose what parts to use. Finally,
we’ll need to think about how to describe the way parts go together. What spatial
relationships should we use to talk about how to assemble a complex object out
of simpler parts?
One approach to answering some of these questions is a
theory called Recognition-by-Components, or
RBC for short. This theory is a
particular kind of structural description model that relies on a specific set
of 3D parts and spatial relationships between them. Specifically, it relies on
a set of 3D parts called geons, which is short for geometric ions.
Geons are really a class of objects called generalized
cylinders. When you hear the word “cylinder,” you may think of something
like a soda can. To make that kind of shape more general, we can change a
number of things about such a cylinder to make some different 3D shapes. For
example, what if the base of the can was a square instead of a circle? That
base could have any shape we like, which leads to different parts. A soda can
can be thought of as the shape you get by putting a circle on the table,
lifting it up, and keeping the volume swept out by the circle. What if we
didn’t pull that circle (or other shape) straight up, but let it curve around
as we went? That would lead to still more 3D shapes. Finally, what if we let
the circle (or other base shape) change size as we pulled it up off the table?
This would lead to still more shapes that tapered or expanded along their long
axis. The figure below depicts a bunch of different cylinders made by varying
these properties – these are a start at coming up with some geons to try and
build objects with.
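If you’d like to see the “sweeping” idea spelled out, here’s a rough sketch that generates surface points for a few of these shapes by dragging a cross-section along an axis. The parameters and their names are illustrative choices of mine, not anything from the theory itself:

```python
import numpy as np

# A rough sketch of a generalized cylinder: drag a base cross-section along an
# axis, optionally letting it shrink or grow (taper) and letting the axis bend (curve).
# Parameter names and values here are illustrative assumptions.
def generalized_cylinder(n_sides=24, height=1.0, taper=1.0, curve=0.0, steps=50):
    """Return an (N, 3) array of surface points for one swept 3D shape.

    n_sides: sides of the base polygon (4 gives a square base, many sides approximate a circle)
    taper:   size of the cross-section at the top relative to the bottom
    curve:   how far the axis bends sideways from bottom to top
    """
    angles = np.linspace(0, 2 * np.pi, n_sides, endpoint=False)
    points = []
    for t in np.linspace(0, 1, steps):            # t sweeps from the base to the top
        scale = 1.0 + (taper - 1.0) * t           # cross-section changes size as we go
        bend = curve * t ** 2                     # axis curves away from vertical
        for a in angles:
            points.append((scale * np.cos(a) + bend,
                           scale * np.sin(a),
                           height * t))
    return np.array(points)

soda_can   = generalized_cylinder()                      # straight, circular, no taper
square_rod = generalized_cylinder(n_sides=4)             # square cross-section
horn       = generalized_cylinder(taper=0.2, curve=0.5)  # tapers and curves: a new part
```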
Why should you use these to recognize objects? What makes
these good parts? One answer to this question is that they seem pretty flexible
– we can probably build a lot of different objects out of these things. See below for some examples:
Figure 7 - Examples of objects built from geons. A structural description of each object would have to include the list of geons as well as the way they go together (Geon 4 ON-TOP-OF Geon 3, e.g.).
Perhaps
more importantly, though, they turn out to have some important properties that
mean we can possibly use 2D images of objects to measure the presence of these
specific 3D parts. Specifically, geons have a number of non-accidental properties. This refers to arrangements of edges,
curves, and corners in the image that are consistent across different 2D views
of the same 3D object. Another way to think about this is to say that there are
properties of the 2D images of these parts that reliably signal specific things
about their 3D shape. For example, if you see parallel lines or curves in the
image of an object, that almost certainly means that there are really parallel
3D lines on that object. Similarly, if we see three lines meet at a corner,
that almost certainly means that there really is a 3D vertex there on an
object. These properties are called non-accidental
because they’re very unlikely to have happened “by accident.” Instead, they
probably happened because there’s a 3D object there that looks this way. This
means that when we see some of these things in an image, we can be pretty
certain that a specific 3D geon is there somewhere. In turn, this means that we
can start actually carrying out some of the steps of the recipe described
above.
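As a small illustration of how one of these cues could be checked in an image, here’s a sketch that flags roughly parallel 2D edges. The way edges are represented and the tolerance used are assumptions of mine, not part of RBC:

```python
import numpy as np

# A toy check for one non-accidental property: roughly parallel lines in an image.
# Edges are given as pairs of 2D endpoints; the 5-degree tolerance is an arbitrary choice.
def are_roughly_parallel(edge_a, edge_b, tolerance_degrees=5.0):
    def orientation(edge):
        (x1, y1), (x2, y2) = edge
        return np.degrees(np.arctan2(y2 - y1, x2 - x1)) % 180.0
    difference = abs(orientation(edge_a) - orientation(edge_b))
    difference = min(difference, 180.0 - difference)   # orientations wrap around at 180
    return difference < tolerance_degrees

# Two nearly parallel image edges are very unlikely to be an accident of viewpoint,
# so they are good evidence for genuinely parallel 3D edges on the object itself.
edge_1 = ((0.0, 0.0), (1.0, 0.02))
edge_2 = ((0.0, 0.5), (1.0, 0.51))
print(are_roughly_parallel(edge_1, edge_2))   # True
```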
Does it work? This is where we have to think about how we’d
know if a theory like this explains what people do. One good way to check if a
theory like this is right is to ask if it helps explain cases where object
recognition is easy and other cases where it is hard. In particular, RBC makes
a specific prediction about when object recognition should be difficult. If you can’t measure the geons, you can’t
recognize the object. When would we be unable to measure the geons? Recognizing the parts of an object depends on being able to see the non-accidental properties that signal the presence of a specific geon. Are
there some images of objects that make it hard to see these? As it turns out,
yes there are. So-called “Accidental” or non-canonical
views of a geon are images that make it hard or impossible to see the features
you need to guess what 3D shape the part is. One good example of this is to imagine what a pencil looks like if you point it straight at you – instead of a long, tapered cylinder, you see a small circle. Images of objects that include
lots of foreshortened object parts like this might therefore be hard to
recognize. This turns out to be true, which suggests that maybe RBC is on the
right track! We can also try masking
specific parts of an object, which refers to blocking them out somehow, to see
whether removing the non-accidental features hurts object recognition a great deal. It does, which again suggests that maybe RBC captures some aspects of how object recognition works.
Figure 9 - It's hard to recognize geons when they're oriented so that we can't see the features we need. Likewise, it's hard to recognize real objects when we see them from certain strange (or non-canonical) views. This might be because these views make it hard to measure the 3D parts of an object.
Figure 10 - It's also hard to recognize an object when we cover up some of the features you need to recognize the 3D parts that it's made of. The picture at left is missing these parts, while the picture at right has the same amount of stuff covered up, but leaves in the vertices, parallel lines, and other image-based clues to 3D shape. Which one looks more like a flashlight to you?
View-based models
Ah, but there are some problems with structural description
models, too. First of all, how do you pick geons to recognize a face? Or your car rather than just a car? It’s a little hard to imagine how this would work, which is worrying. The real problem, however, goes a little
deeper and strikes at the heart of the problem RBC and other structural models
are trying to solve.
Specifically, consider the objects below. These are little
chains of cubes strung together to make a sort of squared-off coiled object.
These little objects should be very easy to describe with geons. There are
loads of non-accidental features to see (corners, parallel lines, etc.) and
simple relationships between parts we can use to describe how they go together.
If I asked you to learn the difference between two different cube-chains like
this by showing you some images of them, RBC would predict that you could
measure the geon structural description in each image, and use it to recognize
the objects again using any image of them that makes it possible to measure the
parts. Y’know what? Let’s try it – let’s teach some people about these objects
by showing them some images, and then let’s test them by showing them new
images of the objects that they haven’t seen yet, but should make it easy to
measure geons. They’ll be good at this, right?
Figure 11 - The shapes on the left should be easy to describe with geons, but if I only let you see the images in the leftmost column, you may struggle to recognize the images in the next column over. RBC doesn't predict this, but a view-based model does.
As it turns out, they’re not good at this. To be more
precise, they’re good at the task when we show them images they’ve seen before,
but they get worse when we show them images they haven’t seen before. That is,
under some circumstances, real people don’t achieve object constancy! In a
sense, RBC is too good. As you can
see below, people do worse as images get more different from what they’ve
learned, but RBC should let you keep doing just fine with new images.
So what are you doing instead? An alternative to RBC in particular, and to structural description models in general, is to go back to our starting
point: What if you use something like the distances between images to recognize
objects? We already saw that this doesn’t work, but what if we change how this
works just a little? Specifically, what if you don’t just use one image of an
object to try and recognize it, but you use lots
of images instead? This key idea is the basis for view-based models of
object recognition.
The recipe for recognizing objects in a view-based framework
goes something like this:
1) Learn about objects by looking at
them a lot.
2) Store a lot of pictures of each
object in a sort of photo album dedicated to that object.
3) When you see a new image, check
your photo albums to see if there’s anything in there that’s similar (based on
distance!) to what you’re seeing now.
4) If you find a decent match, check
which album it was in. That’s what this object is!
Figure 12 - In a view-based framework, we imagine that you don't store 3D models of objects you want to recognize. Instead, you store a sort of 'photo album' for each object that contains many different pictures of that object. Now, you can try to recognize new object images by looking for a match in your various photo albums.
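Here’s a minimal sketch of that recipe, reusing the same pixel-distance idea from earlier. The “photo albums” below are tiny made-up 2x2 views; a real system would store many views of each object:

```python
import numpy as np

# A minimal sketch of the view-based recipe: store a "photo album" of views for
# each object, and give a new image the label of its nearest stored view.
# The albums below are tiny made-up 2x2 views; real ones would hold many images.
def image_distance(image_a, image_b):
    a = np.asarray(image_a, dtype=float).ravel()
    b = np.asarray(image_b, dtype=float).ravel()
    return np.sqrt(np.sum((a - b) ** 2))

def recognize(new_image, photo_albums):
    """photo_albums maps each object label to a list of stored views (step 2)."""
    best_label, best_distance = None, np.inf
    for label, views in photo_albums.items():
        for stored_view in views:                        # step 3: search the albums
            d = image_distance(new_image, stored_view)
            if d < best_distance:
                best_label, best_distance = label, d
    return best_label                                    # step 4: label of the best match

albums = {"mug": [[[0.9, 0.1], [0.9, 0.1]]],
          "cat": [[[0.1, 0.9], [0.1, 0.9]]]}
print(recognize([[0.8, 0.2], [0.8, 0.2]], albums))       # "mug"
```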
These kinds of models save us from some hard problems. We
don’t have to try and measure 3D things in 2D images, for example. We also have
a good explanation for why you have a hard time with images of those cube-y
objects you haven’t seen before: There are no photos in your photo albums that
are good matches! The trade-off is that we have some new hard problems to think
about. How many photo albums do you have to keep around, and how many images
have to be in each? Maybe more importantly, how on Earth do you look through
all of those albums to try and find a match?
Figure 13 - Both structural description models and view-based models have advantages and disadvantages. Understanding what the brain actually does will require much more work to examine the circumstances under which object recognition is easy and difficult. (Cartoon courtesy of Pawan Sinha).
A lot of current
research in object recognition is focused on understanding how other kinds of
object descriptions might help us get closer to the “object code” your visual system
really uses, but we’re still a long way off. For now, I hope I’ve given you a
sense of what makes object recognition a tough nut to crack, and what kinds of
ideas we’ve developed over the last thirty years or so to try and organize our
thoughts about how it might be carried out.