
Artificial Intelligence and Simulation of Behaviour, April 1993, Birmingham, England.

Computer Vision: What Is The Object?


James V. Stone
Computer Science, University of Wales, Aberystwyth, Wales.

Abstract. Vision consists of a multiplicity of tasks, of which object identification is only one. We in the computer vision community have concentrated our efforts on object identification, and have thereby ensured that the formulation of the problem of vision provides methods which are not of general utility for vision. Ironically, one consequence of this is that computer vision may not even be of use for object identification.
An analysis of why computer vision has become synonymous with object identification is presented. The implications of this analysis for object identification and for interpreting neurophysiological evidence in terms of `feature detectors' are presented. A formulation of the problem of vision in terms of spatio-temporal characteristics is proposed.

1 Object Identification in Human and Computer Vision


The hardest part of any scientific investigation is not solving a particular problem, but formulating questions that focus attention on important aspects of the phenomena under investigation. This paper attempts to step back from conventional formulations of the `problem of vision' by making explicit the unspoken assumptions upon which conventional formulations of the problem are based.
Historically, the primary goal of computer vision has been to identify objects. One of the most influential books [1] on computer vision states in its preface:
"Computer vision is the construction of explicit, meaningful descriptions of physical objects from images."
Ballard and Brown, page xiii, 1982[1].
There is no hint here that computer vision may consist of more than object identification¹. And, from a psychophysiological perspective:
"The brain's task then, is to extract the constant, invariant features of objects from the perpetually changing flood of information it receives from them."
Zeki, page 43, 1992[2]. (Italics added.)
"The goal of vision is to inform us of the identity of objects in view and their spatial positions."
Cavanagh, page 261, 1989[3]. (Italics added.)
The cited texts are more general than these quotes might suggest, but these quotes demonstrate a prevalent view of the role of both human and computer vision. More recently there has been a move away from traditional formulations of the problem of vision [4, 5]. These approaches, whilst commendable in many respects, do not make explicit the central role of the notion of `object'. In Ballard's [4] excellent account of the advantages of animate vision, no reference is made to why objects might be a useful way to represent the visual world, nor why the particular representations of objects used (colour histograms) might be formed by an animate vision system.
Formulations of the problem of computer vision in terms of objects may have arisen because the most obvious concomitant of vision in humans is their ability to identify objects. However, most of human and animal vision has little to do with identifying objects. More typically, vision is used to guide limbs, to track motion, to detect changes in motion, lighting, colour and depth, and to estimate the relative depths of surfaces using parallax, stereo, motion and texture. Whilst some of these tasks may require the detection of objects, none of them requires that those objects be recognised or identified. Moreover, these tasks involve the computation of quantities (such as position, depth, velocity, colour and gradient) which are not necessarily required in order to compute the identity of an object.
¹ I use the term `object identification' instead of the more usual `object recognition' because I utilise a distinction between recognition (e.g. familiarity with an object) and identification (i.e. classifying or naming an object).
It is tempting to use easily quantifiable tasks, such as object identification, in order to compare the performance of a seeing machine to that of a human. A machine that can identify objects provides a tangible demonstration that it can do what humans do, and it is tempting to suppose that such a machine can `see' as humans do. However, the fact that humans can identify objects does not imply that this is the only task humans use vision for; and a demonstration that both humans and machines can identify objects is not a demonstration that a machine can `see' in the sense normally associated with seeing humans. It is far less easy to measure how well a human uses vision to aid walking, climbing, reaching and grasping, even though there is ample evidence that vision is essential for these tasks.
The conventional formulation of the problem of object recognition implies that objects consist of well defined features, and that objects can be identified by first extracting these features and then matching them to stored representations. If we accept that vision includes object identification, but that the visual mechanisms we possess evolved in order to perform many other visual tasks, then the conventional formulation of the problem of vision appears not only simplistic, but also peculiarly biased toward a task (object identification) that is an important, but relatively small, part of what vision is for.
It seems likely that our sophisticated object recognition ability is a relatively recent evolutionary development. Like most evolutionary innovations, the ability to recognise objects was probably synthesised from pre-existing computational mechanisms. Consequently, if a mechanism is useful for object identification only, then it is unlikely that it forms a part of the solution implemented by human visual systems. Conversely, if a mechanism subserves other forms of visually guided behaviour as well as object recognition, then it is likely that that mechanism forms part of the human visual system. Within computer vision we pride ourselves on the correspondence between our methods and the computational mechanisms observed in the human visual system. However, if we wish to achieve object recognition by modelling the computational properties of the human visual system, then we could do so by paying more attention to the types of tasks our visual systems evolved to deal with.

1.1 Object Identification: A Conceptual Analysis


The notion of `object' is so deeply ingrained in our language that its logical status is rarely questioned. If the question does arise then it is usually addressed by invoking sub-objects, or `features'. For example, a face may be defined in terms of `features' such as nose, eye and mouth. However, such sub-objects are logically indistinguishable from the objects of which they are a part. (A similar type of dilemma is yet to be recognised in the connectionist literature, where the notion of `micro-feature' is currently used as an explanatory concept.)
If we accept that objects can be defined in terms of sub-objects or features (and that this recursive process is finite) then it should be possible to differentiate two objects on the basis of their respective features. However, it is possible for two different types of object to be specified by a single feature-based description. An example of this is the letter `O' and the number `0' (zero). If features are clearly insufficient to distinguish between two objects then the notion of context and/or function is often invoked.
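To make the ambiguity concrete, the following minimal sketch (in Python, with entirely hypothetical feature vectors) shows that a purely feature-based matcher cannot separate the two characters, and that appealing to context merely relocates the problem to identifying the context:

# Two distinct object classes sharing one feature-based description.
# The feature vectors below are hypothetical, for illustration only.
letter_O = {"closed_curve": True, "aspect_ratio": 1.0, "strokes": 1}
digit_0  = {"closed_curve": True, "aspect_ratio": 1.0, "strokes": 1}

def same_features(a, b):
    # Identical features: a feature-based matcher must conflate the two.
    return a == b

print(same_features(letter_O, digit_0))   # True: features alone are ambiguous

def classify(features, context):
    # Invoking context relocates the problem: the context ('word' or
    # 'number') must now itself be identified, from similar primitives.
    if same_features(features, letter_O):
        return "letter O" if context == "word" else "digit 0"
    return "unknown"

print(classify(digit_0, "number"))        # 'digit 0', given the context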
Invoking context to disambiguate two physically similar objects has the effect of moving the nub of the problem from the structure of the object to the structure of its spatio-temporal neighbourhood. However, objects and contexts are usually defined using similar types of primitives. Consequently, invoking context usually resolves the immediate problem without addressing its underlying cause, whilst creating a set of similar problems (such as how to identify a given context).
As with `context', it might be supposed that two physically similar objects can be disambiguated by appealing to their functional attributes. In order to provide a functional method for distinguishing between a pillow and a cushion, the latter might be described as `used to support the back when sitting', whereas a pillow might be described as `used to support the head while sleeping'. Such descriptions clearly create more problems than they solve. What does it mean to support? How are back and head defined, in terms of still more features?
Of course, a list of features does not constitute an adequate description of an object. It is important to consider the relationships between different features. However, choosing a set of relations-between-features is analogous to the problem of choosing features, and is subject to the same types of pitfalls [6] (p. 376). Moreover, the type of problem described above with respect to classifying objects according to features arises again with respect to relations-between-features.
Several issues are raised by the fact that a well defined set of primitive descriptors for objects is not available, and that there does not seem to be a principled method for obtaining such a set. Does the problem of object identification have a robust set of descriptors which can be used to recognise objects? Or is object identification an inappropriate formulation of the problem represented by vision? Under a more general definition of the problem of vision, conventional formulations of the problem of object recognition may become irrelevant, not because object identification isn't required, but because it is addressed as part of a more general computational problem. The effect of this is to de-emphasise object identification as the primary objective in computer vision. Object identification is an integral, but subsidiary, part of visual behaviour, and can be realised as part of a solution to a more general formulation of the problem of computer vision.

1.2 Features and `Grandmother Cells'


The tendency to describe objects in terms of features exists in several related fields: psychology, artificial intelligence, computer vision and neurophysiology. The last is particularly interesting because, unlike the others, it aspires to making direct contact with the computational machinery responsible for perception. Yet even here, neurons quite close to the retinal input are described as feature detectors. Indeed, early attempts to account for these findings proposed that the function of these neurons is to signal the presence of these features [7, 8]. However, simulations using an artificial neural network (ANN) to perform a simple shape-from-shading task [9] have demonstrated that the types of feature detectors observed in the retina and in the primary visual cortex (V1) can arise (in an ANN) in the absence of corresponding `retinal' features. The `edge detectors' identified in this ANN developed in the absence of contrast edges in the shaded images used to train the ANN. Additionally, `feature detection' theories predict the existence of increasingly response-specific neurons. With few exceptions, such neurons have not been identified. The exceptions involve neurons that respond to ethologically relevant stimuli such as faces [10], but evidence for the existence of neurons responding to other types of stimuli is not compelling (see [11]). In particular, the reductionist approach adopted by Fujita et al. suggests that such complex stimuli may not be the optimal stimuli for neurons in the inferotemporal cortex (and even if they are, this might tell us little about the function of such neurons; see below).
Fujita et al. defined the optimal response properties of neurons in the anterior inferotemporal cortex ("the final station of the visual cortical stream crucial for object recognition", [11], p. 343) in terms of spatially defined features of simple geometric objects². By progressively simplifying `optimal' stimuli, Fujita et al. found that neurons responded selectively to line drawings of simple shapes, intensity or colour contrasts, or luminance gradations. Neurons in the same column shared the same optimal stimulus, and adjacent neurons usually shared similar response properties. Whilst the importance of these results cannot be over-emphasised, the interpretation placed upon them by the authors and by an accompanying review of the paper (in the same edition as [11]) is consistent with the assumption that these neurons are only used for object recognition. Indeed, this assumption appears to be implicit in the title, "Columns for visual features of objects in monkey inferotemporal cortex" (italics added). Whilst there seems little doubt that these neurons are involved in the perception of form, this more general characterisation of their function admits a larger set of visual tasks than is implied by a characterisation in terms of object recognition alone.
² Of the set of objects tested, these were found to elicit maximal responses.
Just as `edge detecting' units were observed in an ANN performing a shape-from-shading task [9], so it may be that the feature detecting neurons observed by Fujita et al. have much to do with form perception, but are not uniquely associated with object recognition. Indeed, the thesis of [9] is that the role of neurons which respond selectively to visual inputs cannot be deduced from the nature of those inputs. Results from psychophysical experiments on the role of oriented retinal receptive fields suggest that "oriented filters are not `orientation detectors', but are precursors to a more subtle stage that locates and represents spatial features" [12] (p. 235). Together, these data suggest that the function of Fujita et al.'s feature detecting neurons may not be to detect certain features. Instead, the function of those neurons may be similar in type to that of the units and neurons described in [9] and [12], respectively.
In addition to the objections raised above against labelling response-specific neurons as feature detectors, this class of theory creates several obvious, but fundamental, problems. First, if a single cell codes for a particular retinal feature then the death of that cell would eliminate the ability to recognise that feature. Second, there are not enough neurons to code for every possible combination of features. However, Hinton [13] and Ballard [14] have proposed the use of coarse coding ANN units and feature subspaces, respectively, to ameliorate this combinatorial problem.
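As an illustration of how coarse coding addresses both problems, the following minimal sketch (in Python; the number of units and the tuning width are illustrative, not taken from [13]) represents a scalar feature value by the joint activity of a few broadly tuned units. The value is read from the whole population rather than from a single dedicated cell, so losing any one unit degrades the representation only slightly:

import numpy as np

# Eight units with broad, overlapping Gaussian tuning curves.
centres = np.linspace(0.0, 1.0, 8)
width = 0.25

def encode(x):
    # Population activity: many units respond to any single value.
    return np.exp(-((x - centres) ** 2) / (2 * width ** 2))

def decode(activity):
    # The value is recovered from the whole population (activity-weighted
    # mean of preferred values), not from any one 'grandmother' unit.
    return float(np.sum(activity * centres) / np.sum(activity))

x = 0.37
a = encode(x)
print(decode(a))          # close to 0.37

a[3] = 0.0                # 'death' of one unit
print(decode(a))          # still approximately correct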
Although workers in neurophysiology state that `feature detector' is only a convenient term to describe their findings, the implication that the function of each neuron is to signal the presence of a single feature is pervasive. It is pervasive to the extent that researchers using ANNs confidently refer to `feature detecting units', but removing instances of such units usually has a minimal effect on the ability of the ANN to utilise information associated with the putative feature. Whereas most units contribute to the final output of an ANN, the functional significance of each unit in responding to the presence of certain features is not known. Whilst there seems little doubt that units in ANNs and neurons in the primary visual cortex respond to contrast edges, and that neurons in the inferotemporal cortex respond selectively to visual configurations, this functionally neutral description is often not the language used to describe the behaviour of such units and neurons.
I am not arguing that neurons do not code for visual features. I am proposing that the strong bias amongst vision workers from many fields to identify vision as being synonymous with object recognition results in an interpretation of data which assumes that the response characteristics of neurons can only be explained in terms of their role in object recognition. Given our evolutionary history, and our skill in performing different types of visually guided behaviours, it seems unparsimonious to propose that neurons which respond to features that are parts of objects are only involved in object recognition.

2 Linguistic Anthropomorphism in Computer Vision


Linguistic descriptions of the physical world tend to be expressed in terms of objects and their associated properties and processes. An example is: `The green stone is sinking'. Here the primary descriptor is the stone, with green (property) and sinking (process) being `attached' to the stone. However, this linguistic description betrays a bias which may bear little relation to the type of quantities computed by a visual system. Within the visual system the motion, colour and spatio-temporal integrity of the stone are not necessarily subsidiary to each other; they are simply computable attributes of a physical scenario.
I believe that this linguistic bias has influenced much work in computer vision, and has retarded progress by identifying computer vision with object identification, rather than with what vision is used for in biological vision systems. More recently, the apparent intractability of computer vision as object identification, and the consequent lack of practical use of much computer vision work, has led to a re-appraisal of what vision is for. This move away from conventional computer vision and toward animate vision [4] is motivated by the modest practical success of computer vision systems. It is now accepted by some [4] that computer vision was asking the wrong questions. Rather than asking "How can we get a machine to name objects?", perhaps it should have been asking "What is vision for?". The `new' computer vision recognises what went wrong with computer vision, but not why it went wrong, nor why the error was repeated by successive generations of researchers. By making the problem explicit we may be able to avoid mistakes of this type in the future.
As computer vision researchers we accept that the conventional formulation of the problem of object identification represents a tractable problem. We therefore implicitly accept the status of objects as the primary descriptors of a given visual scenario. Moreover, the compelling psychological importance of objects suggests that each object can be recognised on the basis of a purely spatial parameterisation of that object. The sentence `The green stone is sinking' makes sense to us because it is consistent with our own perceptions. It is tempting to model our own perceptual capabilities using primitives (object, property and process) such that the computational precedence of each matches our own linguistic precedence. Although linguistic descriptions are consistent with experience, they are not necessarily determined by such experience. Instead of modelling human capabilities in terms of objects with properties and processes, we could equally well describe the world in terms of processes with objects and properties; e.g. `The sinking is green stone'. Here, the process of `sinking' is the hook upon which the `stone' object and `green' property hang. In such a linguistic world, objects and properties would be subsidiary to processes. If our language were organised in this manner then computer vision would probably consist of recognising entities such as `sinking' and `blowing', and objects would be treated as attributes of these primary descriptors.
There are no logical reasons for partitioning the world into objects with subsidiary properties and
processes; though there may be sound computational reasons for doing so. Whorf[15] would argue that
such a partitioning determines how we perceive the physical world. However, I am less concerned here
with why we partition the world in this way than with the consequences of any particular partitioning.
The point is that partitioning the world along dimensions of process rather than objects is logically
indistinguishable from a conventional partitioning if each provides the same amount of information
about the physical world. From an ethological perspective there is little point in knowing that a
predator is present if it is not known in which direction the predator is moving. Conversely there is
little point in knowing in which direction an object is moving if it is not known what type of object
(predator/prey) it is. Both the object type and the processes associated with an object are required.
How we choose to describe such physical scenarios with language is immaterial provided that both
types of information are communicable within the language.
I am not proposing that scenes should be described with processes as primary descriptors. The point of the preceding discussion is that language imposes precedence, or hierarchy, on our descriptions of the visual world. How elements are ordered in this hierarchy matters less than which types of entities constitute the elements of the hierarchy. I believe that the current formulation of problems in computer vision (and AI) reflects much about the structure of the linguistic hierarchy of the English language, and little about the underlying computational processes which generate the elements of the hierarchy³.
³ Whilst there is no empirical evidence that the linguistic precedence used in languages is reflected in the computational organisation of the brain, there is evidence that `what' and `where' attributes are computed in different parts of the brain [16, 17]. It appears that the linguistic distinction between `what' and `where' (but not its precedence) has a corresponding functional distinction in the brain.

3 Spatial and Spatio-Temporal Characteristic Views


3.1 Spatial Characteristic Views
It is only by observing how a thing appears to change that its invariant properties can be gauged.
Rotating a cup does not alter the cup, but it does alter the cup's appearance. If it is known which
properties characterise a cup, as viewed from any angle, then the cup may be recognised from any
viewpoint.
Marr [18] suggested that objects can be recognised by making use of characteristic views of those objects. These are views which are relatively stable with respect to rotation. This approach has been developed and implemented in [19], and more recently in [20]. In support of Marr, there is evidence [21] that object identification in humans makes use of sets of characteristic views. This evidence suggests that recognition of an object presented at a particular orientation occurs by matching to a view which is interpolated across several stored views of that object. The questions of how it is decided which set of stored views (object) to use for interpolation, and how the interpolation process is executed, remain unanswered. A computer demonstration of the utility of this approach for objects defined as sets of 3D points is given in [22].
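A minimal sketch of this scheme, in the spirit of (but much simpler than) the network of [22], is given below in Python. The `object' is a hypothetical set of 3D points; a few of its 2D projections serve as stored characteristic views, and a novel view is scored by a superposition of Gaussian (radial basis function) units, one per stored view. The rotation axis, the number of stored views and the unit width are all illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
points3d = rng.uniform(-1, 1, (6, 3))         # 6 points defining a toy 'object'

def view(angle):
    # Orthographic 2D view of the object rotated about the y-axis.
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return (points3d @ R.T)[:, :2].ravel()    # flattened (x, y) coordinates

stored_angles = np.linspace(0, np.pi / 2, 5)  # characteristic views
stored_views = np.array([view(a) for a in stored_angles])
sigma = 1.0                                   # illustrative unit width

def match(novel_view):
    # Recognition score: superposition of Gaussian units, one per stored
    # view; interpolation across views yields a high score for novel
    # views of the same object.
    d2 = np.sum((stored_views - novel_view) ** 2, axis=1)
    return float(np.sum(np.exp(-d2 / (2 * sigma ** 2))))

print(match(view(0.6)))                       # novel view of the object: high
print(match(rng.uniform(-1, 1, 12)))          # unrelated configuration: low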

3.2 Spatio-Temporal Characteristic Changes of View


Note: Throughout this section the term `object' is used for the sake of brevity only. The following discussion is intended to apply to any configuration of points in 3D space (e.g. rock faces, the surface of a path, the surface of turbulent water) for which it is desirable to compute some attribute (e.g. motion, distance, orientation, hardness) of the set.
Whereas characteristic views can be used to recognise configurations of points by interpolating over those views [22], spatio-temporal characteristic views can be used for recognition by interpolating over a set of stored spatio-temporal characteristic views. For example, the set of retinal changes induced by the rotation of a set of 3D points is sufficient to specify not only the rotation, but also the relative positions of the points in 3D space. In short, whereas characteristic views can be used to specify particular 3D spatial relations between points which characterise a given object, spatio-temporal characteristic views can be used to specify spatio-temporal relations between points which characterise a given object and parameters associated with its motion. The relative retinal motions of those points not only specify the relations between the corresponding 3D points, but also their collective motion in 3-space. A simple example of recognition of a particular type of motion (not of an object) via spatio-temporal cues is that of the motion of wavelets on a river surface. To paraphrase the previous section: Rotating a cup does not alter the cup, but it does alter the changes over time in the cup's appearance. If it is known which changes characterise the rotating cup then the cup may be recognised if it is rotating.
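As a worked illustration of the claim that retinal changes specify both the rotation and the relative 3D positions of points, consider a rigid set of points rotating about a vertical axis with angular velocity $\omega$, viewed under orthographic projection (a standard idealisation, not an analysis drawn from the sources cited here). A point at $(X, Y, Z)$ moves as
\[ \dot{X} = \omega Z, \qquad \dot{Y} = 0, \qquad \dot{Z} = -\omega X, \]
so its image position $(x, y) = (X, Y)$ has retinal velocity
\[ \dot{x} = \omega Z, \qquad \dot{y} = 0. \]
The relative retinal velocities therefore specify the relative depths directly, $Z_i - Z_j = (\dot{x}_i - \dot{x}_j)/\omega$: a single snapshot says nothing about depth, whereas the change of view specifies both the motion ($\omega$) and the 3D configuration (up to the usual reflection ambiguity).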
It might be mistakenly thought that the above is no more than an interpretation of obtaining `structure from motion'. Using a set of spatio-temporal characteristic views for recognition is different from using `structure from motion'. The latter uses motion to infer the atemporal structure of a 3D scene, whereas the former uses the effects of motion as a cue for recognition. These cues are not necessarily interpreted in terms of the 3D structure of a scene. In support of this argument, three examples are described in which the response to visual stimuli cannot be explained only in terms of the spatial structure of the stimuli.
Example 1. In investigating the specificity of the reactions of young geese to birds of prey flying overhead, a model was constructed with symmetric anterior and posterior wing edges, and with a `head' at each end of the body [23]. One head had a short `neck', and the other head had a long `neck'. This model elicited an escape response in young birds only when it was moved in the direction of the short `neck'. When the model was moved in this direction it gave the impression of a hawk in flight; when it was moved in the opposite direction it gave the impression of a goose. Thus it was not the atemporal shape that determined the responses, because the spatial structure was common to both directions; instead, the response was determined by shape in relation to the direction of movement (i.e. the spatio-temporal structure of the stimulus).
Example 2. The ability of mosquito-hunting dragonflies to recognise their prey does not depend upon the shape of the prey [23]. Instead, dragonflies react specifically to the type of motion associated with flying mosquitoes.
Example 3. With regard to the case made for identification via dynamic cues, Johansson has provided ample evidence [24] that observers can identify moving objects (humans) for which the sequential process of extracting spatially defined features, followed by identification, appears to be impossible. Johansson's experiments consisted of showing observers a film of a person walking in the dark with a light attached to each major joint of each limb. Under these conditions a single frame is usually insufficient to evoke the perception of a person. However, only a few frames allow the observer to perceive a moving person. It is the relative motion of the points (lights), and not their static configuration in any single frame, that evokes the perception of a person. Certainly observers are familiar with the relative positions of major joints on a static human body, but this is insufficient to evoke the perception of a person. However, observers are also familiar with the sets of changes in the relative positions of major joints associated with walkers. Unlike purely positional information, derivatives of positional information with respect to time are sufficient to unambiguously specify the figure of a person. Johansson claims that such changes over time are sufficient to adequately specify a person, and even the identity of a person:
"I do know him by his gait;
He is a friend."
Julius Caesar, W. Shakespeare.
Mather et al. investigated the cues responsible for perceiving Johansson figures. In support of the proposal (above) that the effects of motion can be used directly as a cue for recognition, without first interpreting such motion in terms of the 3D structure of a scene, they conclude:
"The visual system may rely heavily on detecting such [wrist and ankle] characteristic movement patterns during recognition of moving images, rather than on constructing a full structured representation of the body."
Mather et al., page 155, 1992[25].
Evidence that neurons in the temporal cortex respond selectively to walking humans is provided in [26]. Each of these neurons responded only to the image of a human walking forward; images of a human walking backwards in the same direction (as the forward walker) did not evoke a response. About 40% of the neurons that responded to human walkers also responded to Johansson movies of human walkers.
Proposing that spatio-temporal characteristic views should be used for computer vision does not specify how this could be accomplished. For Johansson figures it could be implemented by modifying the ANN described in [22] so that each unit is associated with a particular set of contiguous views of a Johansson figure. Each unit specifies the extent to which the inputs match its own preferred input. The final output is obtained by interpolating over several sets of contiguous views (that is, by forming a superposition of unit outputs). As in [22], the `preferred' set of views of each unit can be adapted so as to optimise the performance of the ANN.
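The following self-contained sketch (in Python, with a synthetic `walker' standing in for real Johansson data; the stimulus function, number of units and tuning width are all hypothetical) illustrates the proposal: each unit stores a short contiguous sequence of views, and the output is a superposition of unit responses:

import numpy as np

rng = np.random.default_rng(1)

def walker_frames(phase, n_frames=4, n_points=5):
    # Toy 'Johansson' stimulus: point positions as a function of gait
    # phase; it is the change across frames that is characteristic.
    t = phase + 0.1 * np.arange(n_frames)[:, None]
    x = np.sin(t + np.arange(n_points))       # hypothetical joint motions
    y = np.cos(2 * t + np.arange(n_points))
    return np.stack([x, y], axis=-1).ravel()  # flattened spatio-temporal view

# Units tuned to contiguous view sequences sampled at different phases.
unit_phases = np.linspace(0, 2 * np.pi, 12, endpoint=False)
templates = np.array([walker_frames(p) for p in unit_phases])
sigma = 2.0                                   # illustrative tuning width

def response(stimulus):
    # Superposition of unit outputs: recognition by interpolating over
    # stored spatio-temporal characteristic views.
    d2 = np.sum((templates - stimulus) ** 2, axis=1)
    return float(np.sum(np.exp(-d2 / (2 * sigma ** 2))))

print(response(walker_frames(1.3)))                   # novel gait phase: high
print(response(rng.permutation(walker_frames(1.3))))  # same point values with
                                                      # scrambled dynamics: low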
As a researcher engaged in the construction of computational models of vision, I am acutely aware that such suggestions are easier to propose than to implement. In defence, I refer the reader to the first sentence of this paper. The purpose of this paper is to propose an approach which, whilst inevitably more nebulous than a computational theory, is a first step in the process of constructing computational theories that are consistent with it.

4 Conclusion
I have argued that object identification is not an adequate characterisation of the problem of vision, and that too much emphasis has been placed on object identification within computer vision.
The problem of vision consists of a multiplicity of tasks, of which object identification is only one. Others include visually guided behaviours such as walking, grasping and climbing. By continuing to concentrate efforts on object identification there is a danger that the formulation of the problem (of object identification) will provide methods which are not of general utility for vision, and which may not even be of use for object identification.
Object identification is an integral part of the general problem of vision. Consequently, it is likely that solutions to the problem of vision and to object identification share a common set of computational mechanisms. A broader definition of computer vision ensures that computer vision will be useful for general visually guided behaviours, as well as for the identification of objects.
I have suggested that one way to usefully broaden the definition of the problem of vision is to consider the use of spatio-temporal cues, not only as a means of estimating the atemporal structure of a scene, but directly as cues for accomplishing particular visual tasks which include (but are not necessarily uniquely associated with) object identification.
Acknowledgements: Thanks to Stephen Isard for comments on longer versions of this paper, and for suggesting Tinbergen's experiment as an example of the use of a spatio-temporal stimulus. Thanks to Raymond Lister and Helen Peddington for useful discussions on drafts of this paper. Thanks also to Mark Lee, Marcus Rodrigues, David Cliff and Inman Harvey for comments on an earlier draft of this paper.
This work was undertaken as part of an MRC/JCI grant awarded to Mark Lee at the Department of Computer Science, University of Wales, Aberystwyth.

References
[1] DH Ballard and CM Brown. Computer Vision. Prentice-Hall, Englewood Cliffs, NJ, 1982.
[2] S Zeki. The visual image in mind and brain. Scientific American, 267(3), Sept. 1992.
[3] P Cavanagh. Multiple analyses of orientation in the visual system. In D Lam and C Gilbert (Eds.), Neural Mechanisms of Visual Perception, pages 261–279, 1989.
[4] DH Ballard. Animate vision. Artificial Intelligence, 48:57–86, 1991.
[5] RA Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2:14–23, 1986.
[6] S Watanabe. Knowing and Guessing. J Wiley and Sons, 1969.
[7] DH Hubel and TN Wiesel. Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex. Journal of Physiology, 160:106–154, 1962.
[8] HB Barlow. Single units and sensation: A neuron doctrine for perceptual psychology? Perception, 1:371–394, 1972.
[9] SR Lehky and TJ Sejnowski. Neural network model of visual cortex for determining surface curvature from images of shaded surfaces. Proc. Roy. Soc. London (B), 240:251–278, 1990.
[10] DI Perrett, E Rolls, and W Caan. Visual neurones responsive to faces in the monkey temporal cortex. Exp. Brain Res., 47:329–342, 1982.
[11] I Fujita, K Tanaka, M Ito, and K Cheng. Columns for visual features of objects in monkey inferotemporal cortex. Nature, 360:343–346, 1992.
[12] MA Georgeson. Human vision combines oriented filters to compute edges. Proc. Roy. Soc. London (B), 249:235–245, 1992.
[13] GE Hinton. Shape representation in parallel systems. In Proc. 7th IJCAI, Vancouver BC, pages 1088–1096, 1981.
[14] DH Ballard. Parameter nets. Artificial Intelligence, 22:235–267, 1984.
[15] BL Whorf. Language, Thought, and Reality. MIT Press, Cambridge, MA, 1956.
[16] JHR Maunsell and WT Newsome. Visual processing in monkey extrastriate cortex. Ann. Rev. Neuroscience, 10:363–401, 1987.
[17] M Mishkin, LG Ungerleider, and KA Macko. Object vision and spatial vision: Two cortical pathways. Trends in Neurosciences, 6:414–417, 1983.
[18] D Marr. Vision. W. H. Freeman, San Francisco, 1982.
[19] D Lowe. Perceptual Organisation and Visual Recognition. Kluwer Academic Publishers, Boston MA, 1985.
[20] AJ Bray. Recognising and Tracking Polyhedral Objects. PhD thesis, School of Cognitive and Computing Sciences, University of Sussex, UK, 1990.
[21] S Edelman and HH Bülthoff. Viewpoint-specific representations in three-dimensional object recognition. MIT AI Memo No. 1239, 1990.
[22] T Poggio and S Edelman. A network that learns to recognize three-dimensional objects. Nature, 343:263–266, 1990.
[23] N Tinbergen. The Study of Instinct. Clarendon Press, Oxford, 1951.
[24] G Johansson. Visual perception of biological motion and a model for its analysis. Perception and Psychophysics, 14:201–211, 1973.
[25] G Mather, K Radford, and S West. Low-level visual processing of biological motion. Proc. Roy. Soc. London (B), 249:149–155, 1992.
[26] D Perrett, M Harries, AJ Mistlin, and AJ Chitty. Three stages in the classification of body movements by visual neurons. In H Barlow, C Blakemore, and M Weston-Smith (Eds.), Images and Understanding, pages 95–107, 1990.
