Computational Neuroscience / Computer Vision / Machine Learning

Computational models of visual processes are of interest in fields such as cybernetics, robotics and computer vision. My thesis provides an analysis of a model of attention and of intermediate representation layers in the visual cortex that have a direct impact on the next generation of object recognition strategies in computer vision. Following previous authors such as Steve Zucker and David Marr, I show that a deeper understanding of visual processes in humans and non-human primates can lead to important advancements in computational perception theories and systems. My research consists of the following main areas:

Computational Neuroscience

My shape representation model, 2DSIL (Rodríguez-Sánchez & Tsotsos, 2010, 2011, 2012), makes several contributions to research in computer vision and computational neuroscience. First, it provides a biologically plausible hypothesis on how to achieve shape representation in a hierarchy of layers of neurons. Second, it demonstrates the importance of endstopping for curvature and shape: we have shown how a hierarchy that starts from simple edge detectors, combines them into complex neurons, and continues through endstopped and local curvature neurons can yield neurons that are selective for shape stimuli. Third, it validates the design of the shape-selective neurons by matching their responses to those of real neurons in monkey area V4 with high accuracy.

The model local curvature neurons do not provide an exact value of curvature but can discriminate between degrees of curvature. The starting point is a V1 composed of neurons of different sizes. By using different neuronal sizes and integrating model simple neurons into model complex neurons, we obtained model endstopped neurons that act as bandpass filters over degrees of curvature, from very sharp to very broad.
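The way neurons of different sizes can produce a bandpass response can be illustrated with a toy sketch. It is not the 2DSIL implementation: it assumes that an endstopped response is the rectified difference between a small and a large complex cell, and that each complex cell simply saturates once a bar fills its receptive field. The function names, receptive field lengths and gain are hypothetical, chosen only for illustration.

```python
def complex_response(bar_length, rf_length):
    """Toy complex cell (assumption): the response grows linearly with the
    length of an optimally oriented bar and saturates at 1 once the bar
    fills the receptive field."""
    return min(bar_length, rf_length) / rf_length


def endstopped_response(bar_length, small_rf=4.0, large_rf=12.0, gain=1.0):
    """Toy endstopped cell (assumption): a small complex cell excites and a
    larger one inhibits; the difference is half-wave rectified."""
    excitation = complex_response(bar_length, small_rf)
    inhibition = complex_response(bar_length, large_rf)
    return max(0.0, excitation - gain * inhibition)


if __name__ == "__main__":
    # The response rises up to the small receptive field length and then
    # falls back to zero as the bar fills the large receptive field.
    for length in (2, 4, 8, 12, 20):
        print(length, round(endstopped_response(length), 3))
```

Under the further assumption that a sharply curved contour stays aligned with a neuron's preferred orientation over a shorter extent than a broad one, pairs of cells with different sizes would prefer different degrees of curvature, which is the intuition behind the bandpass behaviour described above.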

Given that it seems accepted that the visual system computes increasingly abstract quantities as a signal ascends the visual processing hierarchy, are those quantities computed by applying the same computation repeatedly, so that neural convergence alone suffices to achieve abstraction, or is it truly necessary to include more sophisticated computations layer by layer? This is not easy to answer in the general case. However, we can point to one important instance that supports the latter position. For shape representation, although our approach is also based on a hierarchical set of computations, we deploy different processes at each layer, not simply repetitions of the same process. Those different processes are intended to reflect the reality of the different neural computations in the visual cortex. Our approach is distinct in that we perform a direct computation of curvature and the sign of curvature. We develop that computation using well-documented neural computation types that include not only oriented simple cells and complex cells (as the pooling layer of others is intended to capture) but also endstopped cells, curvature cells, and curvature sign cells. These naturally provide a sufficient basis for the definition of shape cells, a basis that not only mirrors the neurophysiological reality of the visual cortex better, but also provides a richer substrate for shape definition than piecewise linear components. To the best of our knowledge, this is the first model of shape representation to include the aforementioned cells in intermediate layers, departing from the near-universal previous use of Fukushima's S and C types of cells.

Computer Vision

Many studies of the human visual system deal with visual search. A visual search task consists of finding an object (the target) among a set of distractors. The target usually differs from the distractors in one or more features.

Several theories have appeared to explain the results obtained from visual search tasks. The Feature Integration Theory proposed by Treisman and Gelade (1980) was the pioneer. Today it is considered obsolete because it cannot explain newer visual search results, but several conclusions important to later theories have been drawn from it. Other theories have followed, and all of them seem to agree on four main aspects:
− Basic features for visual search include colour, orientation, size, etc.
− Grouping of features and objects
− Parallelization of feature/objects processing inside these groups
− Efficiency goes down when the distractors become more similar to the target

Grouping of objects is important for visual search, and grouping of features is important for object recognition. There is evidence that the processes encoding the relations between objects differ from those involved in grouping features and are located in different hemispheres; moreover, they seem to work in parallel. These two processes could be primed differently depending on the task to be performed (task-based selection), and this priming would come from attention (Humphreys, 1999). The strategy applied to both kinds of grouping, however, would be the same.

I believe that visual attention is a requirement to perform non-detection object recognition tasks. Visual search is closely related to object recognition. For Besl and Jain (1985), the problem of object recognition comprises the following steps:

1. Given a set of objects, examine each object and label it.
2. Given an array of pixels from a sensor and a list of objects, the following questions arise:
(a) Is the object present in the scene?
(b) If so, how many times does it appear?
(c) For each occurrence find its location in the scene and determine its translation and rotation parameters referred to a known coordinate system.

It is important to note step 2(a), "Is the object present in the scene?": this is exactly what visual search experiments consist of. Computational models such as Selective Tuning (Tsotsos et al., 1995) present a framework for modeling attention. I showed how this model performs in covert visual search tasks. My experiments showed the relevance of a model of attention for object recognition and illustrated the biological plausibility of the Selective Tuning model (Rodriguez-Sanchez, Simine & Tsotsos, 2007).


Machine Learning

A strategy for object recognition and visual search could incorporate a model of the visual system for those tasks that involve attention. Some models have already been tested successfully using this approach (Grossberg et al., 1998; Deco and Zihl, 2001). A good way to test such a system would be with visual search tasks, so that its performance could then be compared with human performance.

To be biologically plausible, the model should have a hierarchy of the different areas thought to participate in visual search and object recognition, from bottom to top: V1, V2, V3, V4 and IT. Also, the model should incorporate a winner-take-all strategy of attention to select candidate locations/objects.

Some basic low level features can be extracted from different studies: edges and bars (thought to involve V1), contrast (V1, V2), colour differences (V1, V2) and constancy (mainly V4), size and scale (V1-IT) and motion (V1-MT).

It is already known how V1 neurons respond to different features, and they have been modeled with a difference of Gaussians or a Gabor filter. Less is known about the other areas, but in all of them studies have found differences in response between attended and unattended conditions. Also, receptive field size increases going up the hierarchy, and this increase is not linear. Finally, the inputs and outputs of each area are known, so some kind of integration across the different areas is performed.
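The two classical receptive field models mentioned above can be written down in a few lines. The following is a minimal sketch using the standard textbook forms of the Gabor function and the difference of Gaussians; the function names, parameter names and default values are assumptions made for illustration and are not taken from any of the cited models.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5, phase=0.0):
    """Return a size x size Gabor kernel (list of rows); size should be odd.

    theta is the orientation in radians, sigma the width of the Gaussian
    envelope, gamma its aspect ratio and phase the offset of the carrier.
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates into the preferred orientation.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength + phase)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def dog_kernel(size, sigma_center, sigma_surround):
    """Return a size x size difference-of-Gaussians kernel; size should be odd.

    With sigma_center < sigma_surround this is an on-centre, off-surround
    receptive field.
    """
    def gaussian(r2, sigma):
        return math.exp(-r2 / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)

    half = size // 2
    return [
        [gaussian(x * x + y * y, sigma_center) - gaussian(x * x + y * y, sigma_surround)
         for x in range(-half, half + 1)]
        for y in range(-half, half + 1)
    ]
```

Convolving an image with a bank of such Gabor kernels at several orientations and sizes gives a simple-cell-like response map for each combination, which is the kind of input the higher layers of a hierarchical model would build on.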

V4 has been modelled as concentrically organized (Wilson and Wilkinson, 1998), based on results from psychophysics. This organization explains the response to simple contours such as curves. TEO neurons (PIT) can encode intermediate features as basic combinations of shape and orientation, color and orientation, etc. Finally, TE (AIT) can encode complex features and objects (Tanaka, 1996). Because of combinatorial explosion, it is not possible to have one neuron per view of an object, or even one per object. It has been proposed that the encoding of objects is distributed across the brain, with different combinations of neurons encoding different objects. One conclusion we can draw from the data of Tanaka's group is that TEO responds to differences in size, whereas TE is largely invariant to this property, so TE seems to be where size invariance is achieved. Another conclusion is that TE's columnar organization can account for a continuum of features and also for position invariance. Finally, an object can be encoded by the combination of several TE columns representing different complex features.

The organization of these layers is hierarchical, from bottom to top: V1-V2-V4-TEO-TE. Neuron receptive fields increase in size going up the hierarchy. For feature extraction and modelling, the path would be from bottom to top: first, basic features would be extracted, and these features would then be combined at each layer until a representation of the object is formed at the top of the hierarchy. This representation could be stored in a dynamic memory. For object recognition, this memory could hold the representation of the object in an object-centred frame. For visual search, the memory would hold the spatial relations as well as the features of the target and of the items in the display. When performing visual search and object recognition tasks, the dynamic memory would sit at the top, interact with TE, and then drive processing from top to bottom in the hierarchy.

In this hierarchy, we need a strategy to analyze the scene. This strategy is attention. Attention would prime the objects in the scene that are most similar to the object we want to search for or recognize. Models of attention seem to converge on a top-down pass over the hierarchy with some sort of winner-take-all strategy. The Selective Tuning Model (Tsotsos et al., 1995) would fit this task of selection.
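The idea of a winner-take-all selection that descends the hierarchy can be sketched as follows. This is a deliberately simplified illustration and not the Selective Tuning implementation: it assumes that each unit pools a 2x2 block of the layer below by taking the maximum, and that selection picks the strongest unit at the top and then repeatedly picks the strongest unit inside the winner's receptive field one layer down. All function names are hypothetical.

```python
def build_pyramid(layer, levels):
    """Return [input, level 1, ..., top]; each unit takes the max over a
    2x2 block of the layer below (assumed pooling rule). The input's height
    and width must be divisible by 2 ** levels."""
    pyramid = [layer]
    for _ in range(levels):
        below = pyramid[-1]
        rows, cols = len(below) // 2, len(below[0]) // 2
        above = [
            [max(below[2 * r][2 * c], below[2 * r][2 * c + 1],
                 below[2 * r + 1][2 * c], below[2 * r + 1][2 * c + 1])
             for c in range(cols)]
            for r in range(rows)
        ]
        pyramid.append(above)
    return pyramid


def winner(layer, rows, cols):
    """Winner-take-all over the given rows and columns of a layer."""
    best = None
    for r in rows:
        for c in cols:
            if best is None or layer[r][c] > layer[best[0]][best[1]]:
                best = (r, c)
    return best


def top_down_select(pyramid):
    """Pick the winner at the top, then trace it down layer by layer,
    restricting each competition to the previous winner's receptive field."""
    top = pyramid[-1]
    r, c = winner(top, range(len(top)), range(len(top[0])))
    for layer in reversed(pyramid[:-1]):
        r, c = winner(layer, range(2 * r, 2 * r + 2), range(2 * c, 2 * c + 2))
    return r, c


if __name__ == "__main__":
    # Hypothetical 4x4 map of target similarity at the input layer.
    similarity = [
        [0.1, 0.2, 0.1, 0.0],
        [0.0, 0.3, 0.2, 0.1],
        [0.1, 0.0, 0.4, 0.9],
        [0.2, 0.1, 0.3, 0.5],
    ]
    print(top_down_select(build_pyramid(similarity, levels=2)))
```

In this sketch the competition at each layer only involves the units feeding the previous winner, so the search localizes the attended item without re-examining the whole input.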