Visual Search - What makes it hard to find things?
For our last post (at least, I think it is), we’re going to
discuss another problem in high-level vision: visual search. By visual search, I mean more or less what
you probably think: The problem of searching for something in a cluttered
display. For example, where is “Waldo” in the image below?
Figure 1 - Finding an object in clutter can be challenging. "Where's Waldo?" books play with search difficulty by manipulating a number of properties of search displays.
Naively, you might think that a problem like this more or
less boils down to carrying out your procedures for object recognition a bunch
of times. To look for Waldo (or your keys, or a particular street corner on a
map), don’t you just have to look around a bunch within the scene and try to
recognize him as you go? To some extent, yes. However, there are several ways
in which visual search seems to have different properties than we’d expect if
we were really just using our object recognition over and over again. To dig into
this, we’ll do three things in this post: (1) Define ways to measure search performance carefully, with special
attention to describing when a search task is easy and when it is difficult,
(2) Point out some easy/difficult search tasks that make it hard to conclude
that search is “just” object recognition, (3) Propose a model of visual search
that allows us to make good guesses about when a search task will be easy or
difficult. Ideally, this model should allow us to talk about what it’s like to
search for Waldo, to look for information in a cluttered user interface (See
picture below), or to predict what people will do in lab settings.
Let’s begin by formalizing visual search tasks more
carefully and identifying ways to measure performance and evaluate different
ways that search tasks can be made more or less difficult. A typical visual
search task in a laboratory will probably look a lot like the figure below: On
any individual trial of an experiment, you would be shown some kind of array of
objects (in this case, dots of different colors) and asked to make a judgment
about the presence of a target within
that array. By a target we’re
referring to an object or objects that differ in some objective way from the
other objects in the array, which we will refer to as distractors. Your job in a typical search might be to locate the
target, perhaps by pointing to it or using a mouse to click on it.
Alternatively, you could also be asked to report the presence or absence of a
target without actually locating it. That is, the experimenter may choose to
present you with some arrays that have a target and other arrays that don’t
have a target. In this case, the question is how accurate you are at
distinguishing between these two scenarios. In either case, there are a number of
things the experimenter may choose to vary across different trials to present you
with arrays of objects that may be more or less difficult for you to evaluate:
(1) The experimenter may change how similar the target is to the distractors,
(2) The experimenter may change how many distractors there are, (3) The
experimenter may change how different the distractors are from one another.
There are lots of other factors they may also change, but these will be the
basis for a lot of important stuff for us to try and explain. 
Figure 3 - Laboratory search tasks typically look like this: Participants need to either locate a target object, or report its presence or absence in an array of non-targets (distractors).
Now that we know how to run experiments on visual search,
the next question is how we measure which tasks are easy and which tasks are
hard. Broadly speaking, there are two ways we can measure your skill at finding
targets in an array of distractors: (1) We can measure how accurate you are,
(2) We can measure how fast you are. While both of these are good ways to
measure search difficulty, easily the most widely used measure of search
performance is something called the set-size
slope, which is a way of relating your speed at finding targets to the
number of items in the array. Specifically, the set-size slope is defined as
the slope of the line we obtain if we plot your response time to correctly find
a target against the number of distractors in the array of objects. The logic
here is that in an easy search task, adding more distractors shouldn’t slow you
down very much – each new object incurs a small cost to your speed, so the line
described above rises slowly. On the other hand, a difficult search task has
the opposite property – each new object costs you a lot in terms of speed, so
that line rises much more quickly. The nice thing about this measure is that we
can easily compute it for any task in which we can vary the number of
distractors in an array, and it can vary continuously from very easy tasks to
very difficult ones.
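To make this concrete, here is a minimal Python sketch of how a set-size slope could be computed: an ordinary least-squares line fit of response time against the number of items. The function name and the response times are made up for illustration (they are not data from a real experiment), and fitting a straight line is the usual convention rather than the only option.

```python
def set_size_slope(set_sizes, mean_rts_ms):
    """Least-squares slope of response time (ms) against set size (items)."""
    n = len(set_sizes)
    mean_x = sum(set_sizes) / n
    mean_y = sum(mean_rts_ms) / n
    # Covariance-like and variance-like sums for the slope of the best-fit line
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(set_sizes, mean_rts_ms))
    sxx = sum((x - mean_x) ** 2 for x in set_sizes)
    return sxy / sxx


# Hypothetical mean response times (ms) at each set size
set_sizes = [4, 8, 16, 32]
easy_rts = [450, 452, 449, 451]    # barely changes as items are added
hard_rts = [500, 600, 800, 1200]   # grows steadily with each added item

easy_slope = set_size_slope(set_sizes, easy_rts)
hard_slope = set_size_slope(set_sizes, hard_rts)
print(f"easy: {easy_slope:.2f} ms/item, hard: {hard_slope:.2f} ms/item")
```

With these invented numbers, the first slope comes out very close to 0 ms per item and the second at 25 ms per item, which is the kind of contrast the set-size slope is meant to capture.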
Remember, our starting point (a null hypothesis of sorts) is
that maybe we can understand visual search just by understanding object
recognition. Maybe visual search is just the repeated application of object
recognition to tell a target apart from distractors. Almost immediately,
however, we run into some difficulty with this story once we do a little work
to establish how good people are at search in different settings. Consider the
picture below, which shows you the set-size slope for a simple search task:
finding a black target in an array of white distractors.
This search is so incredibly easy that the set-size slope is
essentially zero! This kind of search is often referred to as a pop-out or parallel
search to reflect the fact there is a sort of all-at-once quality to how you
see the target in the larger display. You don’t really have to look for it, it
just sort of appears immediately regardless of how many distractors there are. Contrast
this with the picture below, which shows the set-size slope for a more difficult
search task: finding the black-on-white X amid the white-on-black distractors.
This search is much harder – adding more distractors does
make you take longer and longer, indicating that you might be doing something
like looking at each item one at a time to see if it’s the target or not. This
kind of search is called serial search
to reflect that one-item-at-a-time quality.
Together, these two displays have already given us some
things to think about. First, whatever is happening in the first kind of task
suggests that you are able to do something to evaluate the entire array of
objects at once, which seems a little different than the way we thought about
object recognition working before. Here’s another thing that’s a little odd: If
search was just object recognition, we’d think that search tasks would be
easier or harder based on how similar the target is to the distractors – a
similar target should be harder to tell apart from the distractors around it.
But how similar is the target to the distractors in the easy search task above?
The only difference between them is the color of the two lines, right? OK, now
what’s different about the two kinds of X’s in the harder search display? The
color of the two lines, right? So why are these two tasks so different in terms
of how difficult they are? We either need to think more carefully about what
similarity is (and that’s possible) or we need to consider that search is
subject to different principles than object recognition.
Here’s another search task that turns out to be pretty
tough. Compare the array on the left to the array on the right. In both cases,
you’re looking for the same target (a vertically-oriented white rectangle), but
it tends to be a pop-out, parallel search on the left and a slower, serial
search on the right. Again, this is a little strange: If the issue is how
similar the target is to each of the distractors, each of the comparisons
between a single target and a distractor in the array on the right is just as
easy as the target/distractor comparisons on the left. What makes the display
on the right harder to work with then? This kind of search (called a conjunction search, to reflect the fact
that the target is defined by two attributes together) is a good hint that it’s
not just the similarity between the target and the distractors that matters:
there must also be some property of all the distractors together that
affects what we’re doing when we try to search for a target. Distractors that
are different from one another seem to be harder to deal with than distractors
that are all very similar.
Figure 6 - Feature search tasks, like the one on the left, are easy because there is only one piece of visual information you need to know to identify the target. Conjunction search tasks, like the one on the right, are harder because you need to know about the conjunction of two pieces of information (white and vertical) to find the target.
How can we use these
various observations to develop a technique for guessing how hard a search task
will be? We have a number of things to think about that we know affect search
difficulty, so let’s recap: (1) The number of distractors may matter, (2) The
similarity between the target and the distractors will almost certainly matter
(see below), and (3) How different the distractors are from one another may
also matter. What we’d like is a way to combine these three properties of an
array of objects into some kind of measure that tells us how different the
target is likely to look from the distractors – an answer of “very different”
should imply that the search task will be easy, and an answer of “very similar”
should imply that the search will be difficult.
The key insight we will use to introduce such a measure is
this: Using these properties to quantify how different a target is from a
set of distractors sounds an awful lot like doing statistics. In many statistical
tests, we’re asking questions very much like this: How different is a
distribution from a single value? How different are two distributions from each
other? In the context of statistics, these questions are typically answered by
calculating a particular test statistic
that encapsulates information about things like the difference between the
single value and the average value of a sample (like our target/distractor
similarity above), the number of samples (like our number of distractors), and
the variability within the sample (or our need to consider how different
distractors are from one another). What we’re going to do then, is adapt a
simple technique from statistics to make guesses about how difficult search
displays should be – specifically, we’re going to imagine that visual search is
really a single-sample t-test.
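For reference, the textbook single-sample t-statistic can be written with the symbols this post uses. This assumes m is the value assigned to the target, X-bar is the average distractor value, s is the sample standard deviation of the distractors, and n is the number of distractors:

```latex
t = \frac{\bar{X} - m}{s / \sqrt{n}}
```

For our purposes only the size of the difference in the numerator matters, so subtracting in the other order (target minus distractor average) just flips the sign without changing how much the target stands out.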
Consider the formula above, which is the expression for the
test statistic, ‘t’, that one uses to determine if a distribution of values is
different from a single value. First, note that this is more or less what we
want to do, except we’d say it a little differently: We want to know if a
single item (the target) ‘stands out’ from a sample (the distractors). Second,
recall from the statistics class that I hope you’ve taken (and no doubt done
very well in, too) that large values of ‘t’ imply just this – that there is a
significant difference between the single value and the distribution of values
in the sample. So how will we use this to evaluate how difficult a search
display is?
We have to begin by having some means of describing the target and the distractors
numerically, and this will depend on the search task at hand. If we’re looking
for a red target among green distractors, for example, we may want to use the
LMS values for each color. If we’re talking about looking for a vertical line
among tilted lines, we may use the orientation of each item instead.
Regardless, we need to be able to do two things: Say what value belongs to each
item (target or distractor) and calculate the difference between those values.
That difference is easy to compute when we use a single number per item – we
just subtract. If we’re using something like LMS values that assign 3 numbers
to each item, we’ll have to do something more complicated like compute the distance between elements in the array
rather than just a simple subtraction. Luckily, these are all things you know
how to do by now!
So what are all these terms? Let’s start with the easy ones in the numerator: The symbol m stands for the value
assigned to the target, so that’s straightforward. Now, what about the X with a
bar over it? This stands for the average
value of the distractors. To calculate this, you will add up all the distractor
values and divide by the total number of distractors. Again, for a
single-valued distractor this is easy, but for a multi-valued one it’s not much
harder: You’d just average each LMS number separately, for example. Now that
you’ve got an average value for your distractors and a single value for your
target, the numerator is just the difference (or distance) between those two.
In this way, the numerator captures how the similarity between the target and
the group of distractors contributes to the overall difficulty of the search
task. 
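Here is a small Python sketch of that numerator for multi-valued items. It averages each channel separately, as described above, and then takes the Euclidean distance between the target and that average. Euclidean distance is an assumption on my part (other distance measures are possible), the function names are hypothetical, and the three-number values below are invented for easy arithmetic rather than being real LMS measurements.

```python
import math


def channel_means(items):
    """Average each channel (e.g. L, M, S) separately across all items."""
    n = len(items)
    n_channels = len(items[0])
    return tuple(sum(item[i] for item in items) / n for i in range(n_channels))


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length tuples of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Invented three-channel values, chosen only to keep the arithmetic simple
target = (5.0, 7.0, 4.0)
distractors = [(1.0, 2.0, 3.0), (3.0, 4.0, 5.0)]

mean_distractor = channel_means(distractors)
numerator = euclidean_distance(target, mean_distractor)
print(mean_distractor, numerator)
```

For single-valued items the same idea collapses to an ordinary average followed by a subtraction.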
What about the denominator? This gets a little trickier, but not by much. First, the easy
part – that ‘n’ just stands for the number of distractors, so it should be no
big deal to take the square root of that. The ‘s’ stands for the sample
standard deviation, which is a measure that tells you how ‘spread out’ the
distractor values are around their own average. You calculate ‘s’ using the
following expression (see below) or you use something like Excel, Matlab, or R
to tell you what it is for a list of elements. Either way, it’s defined
for both single-valued and multi-valued distractors. Once you have
it in hand, you’ve now got a denominator that tells you how similar the distractors
are to each other, which also governs how hard or easy a search task is.
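For single-valued distractors, the standard definition of the sample standard deviation, written with the same symbols (each X_i is one distractor value, X-bar is their average, and n is the number of distractors), is:

```latex
s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2}
```

Note the n - 1 in the denominator: that is what makes this the sample standard deviation rather than the population version, so it is worth checking which one your software computes. How to extend this to multi-valued items is a choice the formula itself does not settle; one option, offered only as an assumption, is to replace each difference with the distance between a distractor and the average distractor.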
With the numerator and the denominator in hand, you can just
divide and come up with a value of ‘t’ – the larger it is, the easier the
search task will be. Ideally, this will let us predict something like the
set-size slope for a range of different tasks, including the ones we’ve already
seen and new ones like the one pictured below:
Figure 10 - How hard will it be to find the pink dot in the midst of all these other dots? If we knew the LMS values of each dot, we could use these to calculate a t-statistic that would predict how hard this task should be.
Let’s walk through a simple example of how we might calculate
the value of ‘t’ for two different search displays. In each case, we’re going
to be looking for a medium-white dot among 4 distractors that are different
brightnesses. In the first case, all 4 distractors will be similar shades of dark gray. In
the second case, the 4 distractors will have a range of different gray values.
Which task should be more difficult?
Figure 11 - We can use the t-statistic to determine which search task should be harder. In both cases, your target is the dot with a gray value of 192.
We need to start by describing each item numerically, and in
this case, the intensity of the light being reflected by the dot is probably
our best bet. I’ve written these gray-values next to each dot in the figure
above so you can see them, and these will be the numbers we’re using to
calculate the values in our t-statistic. First, the easy stuff: the target in
each case has a gray-value of 192, so we just plug that in. Second, the average
value of the distractors is also easy to calculate, so let’s just do that – in
the first case, we get an average value of 58.75, and in the second case, we
get an average value of 112.5, so we can plug those in as well and subtract to
get our numerators (133.25 in the first case, and 79.5 in the second). Now for
the denominators – we need to calculate the sample standard deviation for each
set of distractors. In the first case, this comes out to 6.29, and in the
second case, this comes out to 75.44. (I’m not spelling out how to use the
formula for this because you can either ask Excel to do it for you, or just
plug and chug based on the expression above.) Now these get plugged in as well,
along with the value of ‘n’ (being 4, this gives us a nice square root of 2).
My two denominators are thus 3.15 and 37.72. The final step is to just divide
each numerator by each denominator, yielding a value of ‘t’ – I get 42.30 in
the first case, and 2.1 in the second case. This suggests that the first search
should be much easier than the second, since the target is statistically more
different. Without actually measuring performance, we have a good guess
regarding how difficult each task might be.
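The arithmetic above can be checked with a short Python sketch. It uses only the summary numbers quoted in the walkthrough (target value, distractor averages, standard deviations, and n = 4), since the individual distractor gray values live in the figure. The function name is hypothetical, and taking the absolute value of the difference is my assumption so that a larger t always means a more distinct target.

```python
import math


def search_t(target, distractor_mean, distractor_sd, n_distractors):
    """Single-sample t-statistic treating the target as the comparison value."""
    numerator = abs(target - distractor_mean)
    denominator = distractor_sd / math.sqrt(n_distractors)
    return numerator / denominator


# Summary values taken from the worked example in the text
t_similar = search_t(192, 58.75, 6.29, 4)   # distractors all dark gray
t_varied = search_t(192, 112.5, 75.44, 4)   # distractors spread across grays

print(f"similar distractors: t = {t_similar:.2f}")
print(f"varied distractors:  t = {t_varied:.2f}")
```

Carrying the unrounded denominator through gives a first-case value a touch higher than the hand calculation (about 42.37), because the hand calculation rounds the denominator to 3.15 before dividing. The conclusion is the same either way: the first display yields a far larger t than the second.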
Here’s the thing, though: this model isn’t perfect, and like all of our algorithms, it
relies on a number of assumptions. Here at the end of all things, I’m not going
to point these out for you because I hope you’ve come far enough along that you
can look at something like this procedure and think through what might make it
do something a little strange that doesn’t match with your visual experience! If nothing else, however, this is one last example of how we can use some simple calculations to make estimates of how our vision will work: What it can do, what it can't, and how difficult it will be to make certain judgments. 










