Machine Learning
A strategy
for object recognition and visual search could
incorporate a model of the visual system for
those tasks involving attention. Some models
following this approach have already been tested
successfully (Grossberg et al., 1998; Deco and
Zihl, 2001). A good way to test such a system
would be with visual search tasks, whose results
could then be compared with human performance.
To be biologically
plausible, the model should have a hierarchy of
the different areas thought to participate in
visual search and object recognition, from bottom
to top: V1, V2, V3, V4 and IT. Also, the model
should incorporate a winner-take-all strategy of
attention to select candidate locations/objects.
Some basic low-level features, and the areas
thought to compute them, can be drawn from
different studies: edges and bars (V1), contrast
(V1, V2), colour differences (V1, V2) and colour
constancy (mainly V4), size and scale (V1-IT),
and motion (V1-MT).
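
As an illustration only, a minimal numpy sketch of
how such low-level feature maps might be computed
from an RGB image is given below; the operators
(finite-difference edges, local contrast, simple
colour opponency, frame differencing for motion)
are crude stand-ins chosen for brevity, not the
filters used by any particular model.

import numpy as np

def lowlevel_features(rgb, prev_gray=None):
    """Illustrative low-level feature maps (edges, contrast,
    colour opponency, motion) from an H x W x 3 image in [0, 1]."""
    gray = rgb.mean(axis=2)

    # Edges/bars: gradient magnitude as a crude stand-in for
    # oriented V1-like filtering (a Gabor bank would be closer).
    gy, gx = np.gradient(gray)
    edges = np.hypot(gx, gy)

    # Local contrast: deviation from a blurred local mean.
    k = 9
    pad = np.pad(gray, k // 2, mode="edge")
    local_mean = np.stack([
        pad[i:i + gray.shape[0], j:j + gray.shape[1]]
        for i in range(k) for j in range(k)
    ]).mean(axis=0)
    contrast = np.abs(gray - local_mean)

    # Colour opponency: red-green and blue-yellow differences.
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    rg = r - g
    by = b - (r + g) / 2

    # Motion: frame difference, if a previous frame is given.
    motion = np.abs(gray - prev_gray) if prev_gray is not None else None

    return {"edges": edges, "contrast": contrast,
            "rg": rg, "by": by, "motion": motion}
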
It is already known how V1 neurons
respond to different features, and they have been
modelled with difference-of-Gaussians or Gabor
filters. Less is known about the other areas, but
in all of them studies have found differences in
response between attended and unattended
conditions. Also, receptive fields increase in
size as we go up the hierarchy, and this increase
is not linear. Finally, the inputs and outputs of
each area are known, so some kind of integration
across the different areas must be performed.
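
The two V1 receptive-field models mentioned above
can be written down directly; the following sketch
defines a difference-of-Gaussians kernel and a
Gabor kernel in numpy, together with an assumed
(purely illustrative) set of receptive-field
widths that grow non-linearly up the hierarchy.

import numpy as np

def dog_kernel(size, sigma_c, sigma_s):
    """Difference-of-Gaussians: an isotropic centre-surround
    receptive field, as often used for retina/LGN/V1 models."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx**2 + yy**2
    centre = np.exp(-r2 / (2 * sigma_c**2)) / (2 * np.pi * sigma_c**2)
    surround = np.exp(-r2 / (2 * sigma_s**2)) / (2 * np.pi * sigma_s**2)
    return centre - surround

def gabor_kernel(size, sigma, wavelength, theta, phase=0.0):
    """Gabor: an oriented receptive field, the usual model of
    V1 simple cells tuned to edges and bars."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    xr = xx * np.cos(theta) + yy * np.sin(theta)
    yr = -xx * np.sin(theta) + yy * np.cos(theta)
    envelope = np.exp(-(xr**2 + yr**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * xr / wavelength + phase)
    return envelope * carrier

# Assumed (illustrative, not measured) receptive-field widths in
# pixels, growing non-linearly from V1 up to IT.
RF_SIZE = {"V1": 7, "V2": 15, "V4": 31, "TEO": 63, "TE": 127}
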
V4
has been modelled as concentrically organized
(Wilson and Wilkinson, 1998), based on results
from psychophysics. This organization explains
the response to simple contours such as curves.
TEO (PIT) neurons can encode intermediate
features as basic combinations of shape and
orientation, colour and orientation, etc.
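
In the spirit of Wilson and Wilkinson's concentric
units, a rough sketch of a V4-like detector is
given below: oriented responses are sampled at
points on a circle and pooled according to how
well they match the local tangent. The sampling
radius, the number of samples and the format of
the input orientation maps are all assumptions of
this sketch.

import numpy as np

def concentric_response(orient_maps, centre, radius, n_samples=16):
    """Crude stand-in for a concentric V4-like unit: pool
    oriented responses sampled on a circle, keeping at each
    point the orientation closest to the local tangent.

    orient_maps: dict {theta (radians, in [0, pi)): H x W map},
    e.g. Gabor energy at a few orientations. The centre and
    radius must keep every sample inside the maps."""
    thetas = np.array(sorted(orient_maps))
    cy, cx = centre
    total = 0.0
    for k in range(n_samples):
        phi = 2 * np.pi * k / n_samples
        y = int(round(cy + radius * np.sin(phi)))
        x = int(round(cx + radius * np.cos(phi)))
        tangent = (phi + np.pi / 2) % np.pi      # tangent orientation
        d = np.abs(thetas - tangent)
        nearest = float(thetas[np.argmin(np.minimum(d, np.pi - d))])
        total += orient_maps[nearest][y, x]
    return total / n_samples
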
Finally, TE (AIT) can encode complex features and
objects (Tanaka, 1996). Owing to combinatorial
explosion, it is not possible to have one neuron
per view of an object, or even one neuron per
object. It has been proposed that the encoding of
objects is distributed across the brain: different
combinations of neurons encode different objects.
One conclusion we can draw from Tanaka’s group’s
data is that TEO responds to differences in size,
whereas TE is quite invariant to this property;
it therefore seems that size invariance is
achieved in TE. Another conclusion is that TE’s
columnar organization can account for a continuum
of features and also for position invariance.
Finally, an object can be encoded by the
combination of several TE columns representing
different complex features.
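
A minimal sketch of such a distributed code,
assuming a fixed set of hypothetical "TE columns"
and random stored patterns purely for
illustration: an object corresponds to a pattern
of activity over the columns, and recognition
amounts to matching the current pattern against
the stored ones.

import numpy as np

# Hypothetical "TE columns": each column stands for one complex
# feature; an object is a combination of column activities, not
# a single dedicated neuron.
N_COLUMNS = 64
rng = np.random.default_rng(0)

# Stored codes are random here purely for illustration; in a real
# model they would come from the hierarchy's own responses.
object_codes = {name: rng.random(N_COLUMNS)
                for name in ("cup", "face", "car")}

def recognise(column_activity, codes=object_codes):
    """Match a distributed pattern of column activity against the
    stored object codes by cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max(codes, key=lambda name: cos(column_activity, codes[name]))
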
The organization of these layers
is hierarchical, from bottom to top:
V1-V2-V4-TEO-TE, with neurons' receptive fields
growing as we move up the hierarchy. For feature
extraction and modelling, the path would run from
bottom to top: first, basic features would be
extracted; these would then be combined at each
layer until a representation of the object is
formed at the top of the hierarchy. This
top-level representation could be stored in a
dynamic memory. For object recognition, this
memory could hold the representation of the
object in an object-centred frame. For visual
search, the memory would hold the features of the
target and of the items in the display, as well
as their spatial relations. When performing
visual search or object recognition, the dynamic
memory would sit at the top, interact with TE,
and processing would then proceed from top to
bottom through the hierarchy.
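
A minimal sketch of this bottom-up pass and of the
dynamic memory is given below; the per-area
operation is reduced to simple max-pooling and the
area names are used only as labels, so this is a
structural illustration rather than a faithful
model of any area.

import numpy as np

AREAS = ["V1", "V2", "V4", "TEO", "TE"]

def forward(image, pool=2):
    """Bottom-up pass: at each area, combine features from the
    area below (reduced here to 2 x 2 max-pooling) until a small
    top-level representation remains in TE."""
    x = image
    activity = {}
    for area in AREAS:
        h = x.shape[0] // pool * pool
        w = x.shape[1] // pool * pool
        x = x[:h, :w].reshape(h // pool, pool, w // pool, pool).max(axis=(1, 3))
        activity[area] = x
    return activity

class DynamicMemory:
    """Top-level store: object-centred templates for recognition,
    or target features (and spatial relations) for visual search."""
    def __init__(self):
        self.templates = {}

    def store(self, name, top_representation):
        self.templates[name] = np.ravel(top_representation)

    def best_match(self, top_representation):
        v = np.ravel(top_representation)
        return min(self.templates,
                   key=lambda n: np.linalg.norm(self.templates[n] - v))
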
In this hierarchy, we need a
strategy to analyze the scene. This strategy is
attention. Attention would prime those objects in
the scene most similar to the object we want to
find or recognize. Models of attention seem to
converge on a top-down process over the hierarchy
with some sort of winner-take-all strategy. The
Selective Tuning Model (Tsotsos et al., 1995)
would fit this task of selection.
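
This is not the full Selective Tuning Model, but a
minimal sketch of the hierarchical winner-take-all
idea it relies on: pick the winner at the top
layer, then restrict the competition at each lower
layer to the region that feeds that winner. The
fixed pooling factor and the layer format are
assumptions of the sketch.

import numpy as np

def top_down_wta(layers, rf=2):
    """Hierarchical winner-take-all sketch: find the winner in
    the top layer, then at each lower layer restrict the
    competition to the rf x rf region feeding that winner.

    layers: 2-D activation maps ordered bottom to top, each map
    built from the one below by rf x rf pooling."""
    top = layers[-1]
    y, x = np.unravel_index(np.argmax(top), top.shape)
    path = [(len(layers) - 1, int(y), int(x))]
    for level in range(len(layers) - 2, -1, -1):
        ys, xs = y * rf, x * rf                  # region projecting to winner
        window = layers[level][ys:ys + rf, xs:xs + rf]
        dy, dx = np.unravel_index(np.argmax(window), window.shape)
        y, x = ys + dy, xs + dx
        path.append((level, int(y), int(x)))
    return path  # attended location at each level, top to bottom
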