Relative Depth from Binocular Vision
Discussing the phenomenology of color constancy led us to
consider something new about visual perception. Besides simply measuring
patterns of light with the various parts of your visual system, you also use
those patterns of light to make inferences
about the light sources, objects and surfaces in the world that produced them.
In the case of color constancy, this meant thinking about how we might observe
the pattern of light coming from an object or surface and use that to make a
guess about its reflectance and the illumination shining on it. The former
property should be a stable property of the object, while the latter might
change under different conditions. We saw that there were techniques for
estimating the reflectance of an object, but these relied on making assumptions
about the world that might not always be true. In general, this is going to be
the case as we continue to think about how we can recover properties of the
world from patterns of light. The relationship between these two things tends
to be underconstrained, meaning we
don’t have enough information to come up with a unique solution. The only way
forward is to make some guesses about the problem before we even start.
Our next step is to think about a new inverse problem that
involves a different property of objects that we might like to try and infer
from an image: Namely, depth. What I
mean by depth is something very simple – how far away is an object that we are
looking at? Actually, we’re going to be thinking about a slightly different
problem instead. What we’ll be interested in is determining the relative depth of objects that we are
looking at. This means we’d like to be able to look at (or fixate) some object
in a scene and determine which other objects are closer to us than that one or
further away. When you consider your own visual experience, this probably (but
not necessarily!) seems like a fairly easy thing to make judgments about.
Still, this also turns out to be a tricky inverse
problem that is underconstrained. To understand some of the difficulties,
consider the image below and imagine that you’re looking at the green circle.
What is the relative depth of the red and yellow circles? Are they closer to
you than the green circle, or further away?
Figure 1 - Which circle is closest to you? It's probably tough to decide on an answer from just this image.
My guess is that you don’t feel very confident about
answering that question. From this image alone, you don’t have enough
information to say what’s going on! Another way to say this is to point out
that there are many different versions of these circles in the real world that
could have given rise to this image. What if the red circle is incredibly small
but very close to you? Alternatively, what if the yellow one is extremely large
but far away? Any of these things could be true for any of the circles, so you
can’t say what the relative position of any of them might be. The key problem
here is that the real 3-D world has been projected
onto the 2-D surface of your retina, which means that the information about the
distance between you and the objects (and the distances between the objects and
each other) along what we usually refer to as the z-axis is gone (Figure 2).
Figure 2 - The light from 3D objects in the world ends up projected onto the 2D plane of your retina. This means we lose information about the z-axis.
It’s easy to figure out how 3D objects get projected onto a
2D plane, but we want to work backwards: What 3D objects (in which relative
positions) produced the 2D image we're looking at? There are a few things we
can do with just one 2D image to help us make some guesses about this, but for
now we’ll focus on a fact about your visual system that we haven’t talked about
yet: You don’t just have one 2D image to work with, you have two of
them.
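To make that forward direction concrete, here's a minimal sketch in Python of pinhole (perspective) projection. The focal length and the example points are made-up numbers; the point is just that image position depends on x/z and y/z, so depth itself drops out of the picture.

```python
# A minimal sketch of the forward problem: perspective projection of 3D points
# onto a 2D image plane through a pinhole. Focal length and points are assumed.
def project(x, y, z, focal_length=1.0):
    """Image position scales with 1/z, so absolute depth is lost."""
    return (focal_length * x / z, focal_length * y / z)

# A small, nearby object and a large, faraway object can land in the same place:
print(project(x=1.0, y=0.5, z=2.0))    # (0.5, 0.25)
print(project(x=10.0, y=5.0, z=20.0))  # (0.5, 0.25) -- same image point, very different depth
```

Both points end up at the same image location, which is exactly the ambiguity the circles above were illustrating.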
Having two eyes is good for a number of reasons, but we’re
going to focus on two aspects of having two eyes (or binocular vision) that give us some traction on the problem we’ve
set ourselves: Working out the relative depth of objects. The first of these
has to do with a physiological phenomenon called vergence. The second has to do with an optical/geometrical property
of the images in your two eyes called binocular
disparity. In both cases, we’ll see that these two features of binocular
vision allow us to do some amount of work calculating the relative depth of
objects in a scene.
We’ll start with vergence,
which refers to the way your two eyes move in a coordinated fashion to ensure
that an object you’re fixating on ends up in the center (or fovea) of both retinas. To see vergence
in action, hold up a pencil or pen in front of a friend’s face and ask them to
look right at the tip of it as you move it nearer to their face and then
further away. You should notice that their eyes cross more and more as the
object moves closer to them, and then diverge as you move the object further
away. That coordinated turning of the eyes inward or outward is what we mean by
vergence. How does vergence give us a cue to depth? If we can measure the angle
that the eyes have to turn through to fixate the object we’re interested in, we
can use the distance between our two eyes (or a decent estimate of this) to
solve for the distance to the object with a little basic trigonometry (Figure 3). To keep it simple: turning your eyes inward a lot means an object must be close, while turning them only a little means it must be further away. We can be more precise if we want to,
but that’s the gist of how we can use this simple visual behavior to make a
guess about depth using your two eyes.
Figure 3 - Your eyes cross a little bit when you try to fixate on an object close to you. By measuring the angle your eyes turn inward to do so, we can estimate how far away something is.
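If you'd like to see that trigonometry spelled out, here's a small Python sketch. It treats the two lines of sight and the line between the eyes as an isosceles triangle and ignores the details of eye anatomy; the 6.3 cm interocular distance is just a typical assumed value.

```python
import math

ipd_cm = 6.3  # assumed distance between the two eyes (interpupillary distance), in cm

def distance_from_vergence(vergence_deg):
    """Estimate the distance to a fixated object from the total vergence angle.
    Half the baseline divided by the tangent of half the angle gives the distance
    to the fixation point (a deliberate simplification of the real geometry)."""
    half_angle = math.radians(vergence_deg / 2)
    return (ipd_cm / 2) / math.tan(half_angle)

print(distance_from_vergence(12.0))  # eyes turned a lot   -> object is close (~30 cm)
print(distance_from_vergence(1.8))   # eyes turned a little -> object is farther (~200 cm)
```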
To understand our second cue to depth, binocular disparity, try the following little demo. Hold one finger
upright in front of you, and hold a pen or pencil upright some distance behind
your finger. While keeping your eyes on the tip of your finger, alternately
close your left eye and then your right eye. What do you see? Chances are that
you see your finger holding still, but the pencil should jump around as you
alternate between your two eyes. Now bring the pencil forward so that it’s in
front of your finger and try the same thing. You should still see the pencil jumping around while your finger holds still. Why is this happening? The
most basic thing to observe about this is that your two eyes see different versions of the same scene. The object
that you’re fixating on stays in the same position on your retina as you
alternate which eye you’re using to see, but other objects end up at one place
on one retina and somewhere different on the other. To see why, take a look at
the diagram below where we’ve drawn a little sketch of this situation.
Figure 4 - Light from objects at different depths relative to a fixation point (F) ends up at different places in the two retinas depending on how near or far the objects were.
With your eyes turned inward to look at your finger, light
from the pencil can only get to each retina through your eye’s pinhole (or
pupil). This means we can draw a straight line from the pencil through the
pupil and onward to the retina for each eye. You’ll notice that when we do this
for a pencil that’s further away than your finger, the pencil’s light ends up
to the right of the fovea in the left eye, and to the left of the fovea in the
right eye. When we do the same thing for a pencil that’s closer than your
finger, the same procedure leads to the opposite outcome – the pencil’s light
is to the left of the fovea in the left eye, and to the right of the fovea in
the right eye. The difference in the pencil’s position on the left and right
retina is what we mean when we talk about binocular
disparity. Specifically, binocular disparity refers to the difference in
horizontal position for the same object in the two retinas.
Now here’s the neat part: With your eyes turned inward to
look at some point of fixation (which we’ll call F), we can easily draw lines from any point in 3D space into each
retina. This would be the forward problem
of relative depth from binocular disparity. The situation that the visual
system finds itself in is the opposite, however: It gets the two images on the
retina and has to try and work out the depth of an object relative to F. Luckily,
we can still draw the same lines to solve this inverse problem as we
drew to solve the forward problem. Starting with the left eye, we can draw a
line from the object’s location on the retina, out through the pupil and out
into 3D space. At this point, all we can say is that the object has to be
somewhere on this line, but we don’t know where! If we do the same trick with
our right eye image, however, we get a second line that must cross the first
one somewhere, which tells us where the object must be (Figure 5).
Figure 5 - The two views of an object in the left and right eye can be examined for their disparity, then used to estimate the relative depth of objects relative to a point of fixation.
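To make this line-drawing procedure concrete, here's a small Python sketch of both directions in a top-down, 2D version of the set-up. The pupil positions, retina distance, and object location are all assumed numbers; the point is just that two rays drawn from the two retinal images pin down one location in space.

```python
import numpy as np

f   = 1.7   # assumed pupil-to-retina distance, in cm
ipd = 6.3   # assumed distance between the pupils, in cm
pupil_L = np.array([-ipd / 2, 0.0])
pupil_R = np.array([ ipd / 2, 0.0])

def project(obj, pupil):
    """Forward problem: where light from `obj` lands on this eye's retina
    (the retina is modelled, very crudely, as the plane y = -f)."""
    t = -f / obj[1]
    return pupil + t * (obj - pupil)

def triangulate(img_L, img_R):
    """Inverse problem: draw a ray from each retinal point out through its pupil
    and find where the two rays cross."""
    d_L = pupil_L - img_L                       # direction of the left-eye ray
    d_R = pupil_R - img_R                       # direction of the right-eye ray
    # Solve pupil_L + s*d_L == pupil_R + u*d_R for s, then step along the left ray.
    s, _ = np.linalg.solve(np.column_stack([d_L, -d_R]), pupil_R - pupil_L)
    return pupil_L + s * d_L

obj = np.array([2.0, 50.0])                     # an object 50 cm away, a bit to the right
img_L, img_R = project(obj, pupil_L), project(obj, pupil_R)
print(triangulate(img_L, img_R))                # recovers approximately [ 2. 50.]
```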
This is neat, but we should be careful to be precise about what we can say here. What we're really learning is something about the position of this object relative to the point of fixation – is it closer to us than F, or further from us than F? We're going to make one more observation about the geometry of this set-up that will make this point clearer. We've
assumed so far that objects you’re not looking at (objects that aren’t F) will
end up at different places on your two retinas. But what would happen if there were a different object at the same position on both retinas? We can still
draw our two lines, and we can still identify a position for this object in 3D,
but because it doesn’t have disparity (there’s no difference in its position on
the two retinas), we must see it at the
same depth as F. All of the points in 3D space that have zero disparity lie
on a circular curve we call the horopter
(Figure 6). Again, everything on the horopter will look to you like it's at the same depth as F, so it looks neither closer nor further. The consequence of this for our earlier diagrams is this: what we really get to say when we find the position of a point in space from our two rays of light is whether it lies inside the horopter or outside of it. The former case we refer to as an
object having crossed disparity
(you’d have to cross your eyes to fixate on it), and the latter case we refer
to as having uncrossed disparity.
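Putting the pencil example and these new terms together, the qualitative rule is simple enough to write down as a little Python function. The sign convention here (positions measured relative to each fovea, rightward positive) is just the one we used for the pencil example in Figure 4.

```python
def classify_disparity(x_left, x_right):
    """x_left, x_right: horizontal position of an object's image in the left and
    right eye, measured relative to each fovea (negative = left of the fovea,
    positive = right of the fovea)."""
    if x_left == x_right:
        return "zero disparity: on the horopter, seen at the same depth as F"
    if x_left < x_right:
        return "crossed disparity: closer to you than F"
    return "uncrossed disparity: further from you than F"

print(classify_disparity(-0.2, 0.2))  # the nearer pencil from Figure 4
print(classify_disparity(0.2, -0.2))  # the farther pencil from Figure 4
```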
Remember: What this relationship between binocular disparity
on the retinas and relative depth in the world means is that your visual system
can estimate depth from just the two images in your eyes. The visual system
doesn’t actually draw lines the way we do, but it still uses the position of
the same object in the two eyes to make a guess about depth. Seeing the same
object at different positions in the two eyes leads to the perception of an
object being closer or further from us. Something that follows from this is
that if we can create 2D images that have the right disparity information in
them, we can make pictures that look 3D as long as we get the right
relationships between binocular disparity and real-world depth. That is, if we
could make an image of objects in the world that looks like what your left eye
would see AND an image that looks like what your right eye would see, all we
have to do is make sure each eye sees the right picture to make you see in 3D.
This is exactly what’s happening with 3D glasses, whether they’re red/cyan,
polarized, or some other type of technology. In all these cases, you’re
watching two images superimposed on each other in such a way that you only see
one of these images with your left eye and only see the other with your right
eye. The combination of the disparity information in the two images and the
separate presentation of each image to one of your eyes leads your visual
system to infer that there really must be depth in the scene you’re looking at.
Check out the figure below to see that this really works.
Figure 7 - 3D anaglyphs work by layering separate images (here in red and cyan) that have the right disparity information to signal real-world depth. By wearing glasses, you guarantee that each image is only delivered to one eye, after which your visual system uses the disparity cues to estimate depth.
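If you'd like to try building one of these yourself, here's a minimal sketch of how a red/cyan anaglyph can be assembled, assuming you already have two grayscale photos of the same scene taken from two eye-like viewpoints (the filenames are placeholders).

```python
import numpy as np
from PIL import Image

# Assumed inputs: two grayscale views of the same scene, one per eye.
left  = np.asarray(Image.open("left_view.png").convert("L"),  dtype=np.uint8)
right = np.asarray(Image.open("right_view.png").convert("L"), dtype=np.uint8)

# The red channel carries the left-eye view; green and blue (i.e., cyan) carry the
# right-eye view, so red/cyan glasses deliver each view to only one eye.
anaglyph = np.dstack([left, right, right])
Image.fromarray(anaglyph).save("anaglyph.png")
```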
Binocular disparity is pretty neat – it gives us a means of
understanding how your brain perceives relative depth and allows us to make
cool movies. That said, I’ve made it sound much easier than it is, so I’d like
to close by pointing out something that’s quite difficult about using binocular
disparity to compute depth in real scenes. Specifically, there’s an assumption
hidden behind everything we’ve said above about how to compute depth from disparity.
We’ve assumed throughout that we know how to match up the image of an object in
the left eye with the image of the same object in the right eye. This might
sound easy, but in general it’s quite hard. Formally, we call this the correspondence problem, and to see why
it’s hard, consider the figure below. If you see four dots in the left eye and
four in the right, surely they must match up in the obvious way, right? Not
necessarily! What if they don't all have a match in the other eye? Maybe the 3
rightmost dots in the left eye go with the 3 leftmost in the right eye, for
example. Or maybe it’s only the 2 right/leftmost dots that go together. Each of
these choices would lead to different solutions for the relative depth of the
dots, so which one do we pick? There isn't really a single good answer to this
question, but there are more assumptions we can make to help your visual system
along. We won’t talk about those now, but I want you to at least know that they
exist.
Figure 8 - Using disparity for depth estimates relies on knowing which features from the left eye match up with things on the right eye. In general, this isn't always easy to solve.
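Just to give you a sense of how bad this ambiguity gets, here's a short Python sketch that counts the possible interpretations under one simplifying assumption: matches preserve left-to-right order, and any dot is allowed to go unmatched. Even for four dots per eye, there are 70 candidate matchings.

```python
from itertools import combinations

def monotonic_matchings(n_left, n_right):
    """Enumerate every way of pairing dots across the two eyes that preserves
    their left-to-right order, allowing dots in either eye to go unmatched."""
    matchings = []
    for k in range(min(n_left, n_right) + 1):
        for left_dots in combinations(range(n_left), k):
            for right_dots in combinations(range(n_right), k):
                matchings.append(list(zip(left_dots, right_dots)))
    return matchings

print(len(monotonic_matchings(4, 4)))  # 70 different interpretations of 4 dots per eye
```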
Here’s something you might be thinking, though: Why should
it be hard to match objects up between the left eye and the right eye? Why
can’t you just recognize each object in one eye (here’s a cat) and find it in
the other (there’s a cat)? That is, maybe you recognize shapes and objects
first, and do all your binocular disparity work afterwards to take advantage of
knowing what everything is. Not a bad idea (assuming you know how to recognize
shapes, but we’ll deal with that later), but apparently not how your visual system
does it! To be convinced of that, take a look at the image below with some
red/cyan glasses – even though there aren't any objects there for you to
recognize, your visual system still uses the disparity information to solve for
depth. This means you don't need shape recognition to match objects up for
disparity computations. Instead, you can do something else to solve the
correspondence problem, but for our discussion we’re going to leave it at that.
Figure 9 - The fact that you can still see a floating square in the image at the top suggests that depth perception works more like the flow chart on the right than on the left. You don't need to solve for shapes and objects to solve the correspondence problem.
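You can build an image like this yourself. Here's a minimal sketch of the classic random-dot construction: fill both eyes' images with the same random dots, then shift a square region sideways in one of them so that it carries disparity but is invisible to either eye alone. The image size and shift are assumed values.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
h, w, shift = 200, 200, 6                  # image size and disparity, in pixels (assumed)

left = rng.integers(0, 2, size=(h, w)).astype(float)   # random black/white dots
right = left.copy()

# A central square region carries the disparity: shift its dots sideways in the
# right-eye image...
r0, r1, c0, c1 = 60, 140, 60, 140
right[r0:r1, c0 - shift:c1 - shift] = left[r0:r1, c0:c1]
# ...and fill the strip uncovered by the shift with fresh random dots, so neither
# eye's image contains a visible square on its own.
right[r0:r1, c1 - shift:c1] = rng.integers(0, 2, size=(r1 - r0, shift))

# Save the two views side by side for free-fusing (or deliver one view to each eye).
plt.imsave("random_dot_pair.png", np.hstack([left, np.ones((h, 10)), right]), cmap="gray")
```

When the two halves are fused, the square should pop out at a different depth than the background, even though neither half contains anything recognizable on its own.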
I can't leave you without pointing out the best application
of the above phenomenon (depth from images without recognizable structure) is
the autostereogram, or Magic Eye image. These abstract images are actually just
two images placed side by side that differ in terms of disparity – some
features are in slightly different places in the left and right image. If you
can either cross or uncross your eyes by the right amount, you can make those
two halves of the larger picture overlap, which will leave you with the disparity information you need to see the 3D object hiding in the autostereogram. This trick of crossing or uncrossing your eyes by just the right amount, so that each image is delivered to a different eye, is called free-fusing
and it’s not so hard to learn to do! If you can, practice it a bit until you
see a few autostereograms, and you’ll have a new appreciation of how much your
visual system gets out of binocular vision.
Figure 10 - Autostereograms are just two images, set side by side, with different disparity information baked into them. By fusing the two halves, you can see a hidden 3D object.
Now would be a great time to try out Lab #7, which involves
making some 3D images by introducing disparities into images you draw. Check it
out, and we’ll come back to talk about some more cues to depth in complex
images.