Perception, 2012, volume 41, pages 1061 – 1072

doi:10.1068/p7297

The vision of David Marr

Kent A Stevens
Department of Computer and Information Science, University of Oregon, Eugene, OR 97403,
USA; e-mail: kent@cs.uoregon.edu
Received 24 May 2012, in revised form 22 August 2012

Abstract. Marr proposed a computational paradigm for studying the visual system, wherein aspects of
vision would be amenable to study with what might be regarded as a computational–reductionist approach.
First, vision would be cleaved into separable ‘computational theories’, in which the visual system is
characterized in terms of its computational goals and the strategies by which they are carried out. Each
such computational theory could then be investigated in increasingly concrete terms, from symbols
and measurements, to representations and algorithms, to processes and neural implementations. This
paradigm rests on some general expectations of a symbolic information processing system, including
his stated principles of explicit naming, modular design, least commitment, and graceful degradation.
In retrospect, the computational framework also tacitly rests on additional assumptions about the nature
of biological information processing: (1) separability of computational strategies, (2) separability of
representations, (3) a pipeline nature of information processing, and (4) representational
primitives of low dimensionality. These assumptions are discussed in this retrospective.
Keywords: visual perception, representation, 3D perception, computational theory

1 Introduction
Computation—the concept of information processing as abstracted from its embodiment
and the specifics of its implementation—was key to the paradigm pioneered by David Marr.
He emphasized the distinction between what visual information processing occurs within the
visual system and how that processing is accomplished. A series of symbolic representations
was proposed, each storing information extracted by computations on measurements derived
earlier in the visual system, and each serving as the source of visual information for the
subsequent stage. Marr expected that the underlying computational strategies for computing
each stage would be understandable, without having to know the details of their neural
implementations. Proposed strategies for aspects of human vision could then be explored
by computational experiments that capture the gist, but not the specifics, of their neural
implementations. When a computer implementation successfully replicates aspects of
human perceptual behavior, a case would be made that they share aspects of the underlying
strategies. Hypothesizing that information processing strategies are independent of their
implementation, Marr suggested that we can understand what before how and, as a practical
paradigm for studying biological vision, should proceed in that order.
The expectation that information is the commodity, the stuff, of neural processing is today
uncontroversial. But in the early 1970s, as a new paradigm for studying vision, this computational
notion contrasted starkly with a precursor alternative approach, in which information was
regarded as latently available—delivered to the doorstep of vision, but not followed on inside.
Information had been described in terms of ‘affordances’ ie relationships between properties
of 2D images and the 3D world (Gibson 1966, 1979). Information seemed to be carried by
some abstraction or titration of the retinal image, as suggested by the immediacy with which
the contours in an outline drawing convey meaning. While images clearly contain information
that ultimately affords perceptions, Marr considered representational schemata that could make
concrete these otherwise abstract notions of information. “In its extreme form, this presumption
implies that what we call the ‘perception’ of an object … corresponds rather directly to the
making, in some central place, of one or more abstract assertions about that object; and to the
consequent availability of other knowledge related to that percept.” (Marr 1974, page 2).
While relatively new to the study of visual perception, symbolic computations were very
much part of the Zeitgeist in the early 1970s. Programming languages based on the lambda
calculus (Church 1932) facilitated the mechanization of symbolic reasoning. Largely due to the
facility with which the programming language Lisp permitted the expression and manipulation
of logical assertions (such as relationships between objects and attributes of objects), research
in ‘artificial intelligence’ enjoyed a period of rapid proliferation, with attempts to mechanize
various aspects of intelligent problem-solving such as qualitative reasoning about physics,
mathematical theorem discovery, linguistics, and block stacking by robot (McCarthy 1960;
Russell and Norvig 2010). Programs were written to manipulate symbolic descriptions, eg of
the arrangement of blocks in a scene (Winograd 1972), or of the luminance features in a “blocks
world” image, from which their 3D arrangement might hopefully be inferred (Waltz 1975).
Hence, to shift from measurement to assertion, from continuous to discrete, was not regarded
as difficult, and this served as the ‘enabling technology’ that permitted Marr to then consider the
implications of symbolic computation towards understanding biological vision.
To the extent these computer programs exhibited human-like capabilities, they could be
considered models for the corresponding biological strategies underlying visual perception,
planning, or other cognitive activities. And, even if not biological models, these computer
programs often showed promise for a variety of application areas in the then-new field of
artificial intelligence. Thus, while Marr focussed exclusively on understanding biological
information processing, the broader artificial intelligence community provided the symbolic
programming techniques to explore computational aspects of human vision.
This essay reflects on Marr’s paradigm at the time of its conception, from the perspective
of having been one of his students from 1975. The following is not a broad survey of the
current state of vision research, but rather a reflection on my research of some decades ago,
performed as a follower of his school. While the psychophysical results may stand, their
interpretation from the computational perspective requires revisiting.
In the mid-1970s David Marr had just joined the MIT Artificial Intelligence Laboratory
from University of Cambridge, and was writing a series of foundational essays (Marr 1974,
1976a, 1976b, 1976c) which, nearly 40 years later, likely provides the best record of the
origins of his paradigm. At the time, his essays were revolutionary in the clarity with which
a path appeared to extend before us. Marr proposed that we could describe vision at three
increasingly specific levels (Marr 1982, page 25, figures 1–4), namely:
(i) ‘computational theory’ (“what is the goal of the computation, and what is the logic of the
strategy by which it can be carried out?”),
(ii) ‘representation and algorithm’ (“How can this computational theory be implemented?
In particular, what is the representation for the input and output, and what is the algorithm for
the transformation?”),
(iii) ‘hardware implementation’ (“How can the representation and algorithm be realized
physically?”).
In retrospect, that initial clarity derived from David Marr’s expectation that the goals of
the computation are sufficiently well defined and functionally separable as to be enumerated
and studied as individual ‘computational theories’. If Marr had believed otherwise, for
instance that the strategies were intrinsically intertwined with the fundamentals of the neural
networks that implemented them, then the task of unraveling the ‘what’ from the ‘how’ would
be most daunting, for the neural mechanisms would somehow have to be understood in their
complexity before finally understanding what they were computing. Such was not the case;
the key to understanding vision appeared to be through studying its computations.

David Marr at MIT. Photo by the author.

2 Assumptions underlying the computational paradigm
Marr’s computational framework rested on some general principles (Marr 1976b, pages
485–486): the principle of explicit naming (central to the concept of symbolic computation),
the principle of modular design (an engineering principle which he regarded as necessary,
discussed further below), the principle of least commitment (“… that one should never do
something that may later have to be undone”), and the principle of graceful degradation
(wherein perceptual performance diminishes gradually with degradation of sensory input).
In retrospect, his framework also tacitly rested on the following additional assumptions about
the nature of biological information processing, discussed below:
(i) The assumption of separable computational strategies,
(ii) The assumption of separable representations,
(iii) The assumed pipeline nature of information processing, and
(iv) The assumption that representations employ primitives of low dimensionality.
2.1  Separable computational strategies
By Marr’s ‘Principle of Modular Design’, “any large computation should be split up and
implemented as a collection of small sub-parts that are as nearly independent of one another
as the overall task allows” (Marr 1976b, page 485). Note that modularity is expected of the
computation, and not just of the underlying neural architecture, ie that the visual system can
be partitioned into modules corresponding to computational strategies, each understandable in
terms of associated goal and “logic of the strategy”. There was no expectation, however, that
their implementations be physically modular (eg no ‘stereopsis module’ in some dedicated
cortical area). While modularity at this strategic level would not necessarily be reflected in
the large-scale neuroanatomical architecture, the various proposed computational strategies
are expected to be functionally distinct, as apparent in their ability to function in isolation,
as demonstrated experimentally.

Digital systems are composed of building blocks, functional modules that have minimal
dependencies and highly controlled interactions as a matter of practical necessity; computer
system architects only know how to make complex computational structures that adhere to
strict design practices that maximize regularity and symmetry of communication protocols,
localize function into modules, minimize module interdependence, and achieve scalability by
creating architectural layers that permit isolation by abstraction barriers (eg to isolate what
from how, separating implementation details from interaction protocols across aspects of
a complex system). While computer architectures exhibit much regularity and modularity,
that highly structured design actually permits generality in the tasks (processes) they can
perform. Superimposed upon this essentially static and orderly physical organization
are myriad patterns of activity, fleeting structures and patterns of information flow and
organization that would be extraordinarily difficult to detect and track, since they manifest
themselves primarily temporally. Hence processes associated with very different programs
(ie computational strategies working towards very different goals) would result in nearly
indistinguishable patterns of observable activity.
The observed physical modularity and regularity of neuroanatomical structures may
likewise reflect an architecture with broad computational flexibility, not narrow specialization
of function. A given neural architecture might thus support many different processes from
moment to moment, as demanded by the perceptual or cognitive tasks at hand, and give little
clue by their structure as to the specific computations in which they are engaged.
There are suggestions of modularity at the level of computational strategy, given the visual
ability to derive perceptions individually and independently from different sources of 3D
information (eg binocular disparities, motion, shading, contour), each placed in functional
isolation from the other sources by the design of the psychophysical experiments.
Furthermore, an extensive literature on acquired and genetic perceptual deficits also suggests
modularity at the level of computational strategy. Soon after Marr (1976b) described the
implications of subjective contours for symbolic vision, a visual-agnosia patient was found to
present with an inability to see ‘subjective contours’ in a variety of classical demonstrations
(wherein a phantom object is interpreted as partly occluding others), and yet normal
perception of occlusion boundaries in random-dot stereograms (Stevens 1983b). The patient’s
interpretations of pictorial depictions never involved seeing one object as occluding another.
In the Kanizsa triangle and other figures, where line terminations and edges align along a path
as if occluded by some interposed yet hidden form, the patient could appreciate their alignment,
replicate the illustration with geometric accuracy, yet not imagine an interposed phantom
shape, hence did not construct illusory contrast along its occluding edges. As a computational
strategy, subjective contours could be seen as constructions to represent a particular spatial
interpretation when the observer hypothesizes the presence of occluding surfaces. Their
construction is contingent on the interpretation, as the patient saw illusory contrast along the
depth edges in random-dot stereograms (and drew them in as lines, when asked to replicate
what he saw in the stereogram), but not in the monocular depictions. With no need to create
a contour, no illusory contrast is asserted, nor contrast seen. This fitted well with Marr’s
notions of edges as symbolic assertions. Reflecting on that study, it remains fascinating to
consider that the perception of a subjective edge co‑occurs with an interpretation involving
occlusion. Is the illusory edge tantamount to the corresponding visual interpretation? Does
it assert the boundaries of an hypothesized occluding object, giving it a silhouette as needed
for subsequent description of shape? Given the consistency with which the patient failed
to interpret monocular evidence of occlusion and yet presented normal ability to describe
object shapes when their occluding contours were objectively presented (by either outline
or contrast edge), might that suggest a selective deficit of a computational strategy?
Conceiving of vision as organized into modular, separable computational strategies seems
to some degree both useful and valid, but if surfaces can be perceived independently from
stereopsis and shading by separable strategies, for example, how many more might eventually
be identified? Would their total number be but few score, or might there be hundreds or more?
Note that for these questions to be well defined, and indeed for the notion of a computational
strategy as well, there needs to be some sense in which they are functionally separable and their
independent contributions observable. Over-zealous reductionism could result in proposing
separate computational modules based on some taxonomy of experimental stimuli—different
‘shape-from-X’ modules—where in fact those stimuli might be functionally equivalent input to a
common perceptual strategy that does not distinguish between them.
The expectation that visual perception is comprised of clearly separable, discrete
strategies might therefore be largely a reflection of reductionism. The expectation is that
the perceptual tasks are achieved by separate strategies, and that their apparent equivalence
reflects their contributing to a common representational scheme of surfaces and objects.
When we eventually understand that representation, however, the seeming multiplicity of
discrete 3D computational strategies might later be unified when they are found to be different
manifestations of fewer, but more complex, computational strategies which are currently
beyond our comprehension. In the meantime, conceiving of vision in terms of separable
computational strategies is engaging—just perhaps overly simplistic.
2.2  The presumption of distinguishable representations
Just as computational strategies are presumed separable, so too would be the various stages
of representation. Early on, Marr (1974, 1976a, 1976b) emphasized the discrete nature of
such representational schemes, each comprised of a relatively small symbol system (such as
the edge, bar, blob, terminations, and other assertions of the ‘primal sketch’). Contrast-based
measurements at the retina and primary visual cortex were being modeled as the input to
local descriptions of image measurements, with some illusions arising as direct consequences
of measurement errors (eg Lulich and Stevens 1989).
At some point vision becomes unquestionably symbolic, as when visually counting discrete
objects. But how early in the visual pathway might one usefully regard visual processing as
symbolic? Marr proposed that symbolic processing started with constructing a description of
luminance features in an image, preferring the term ‘construction’ over ‘detection’ to emphasize
the sense in which a description is created. The perceived edge-like quality of a subjective
contour is given as evidence: “One cannot argue that Fourier detection methods will produce
it, because it really is not there. This contour is not being detected, it is being constructed.”
(Marr 1976a, page 653). Likewise, dot patterns can be constructed for which the perception of
geometric structure cannot merely be attributed to detection based on their low-spatial-frequency
content; their local geometry contributes orientation information even when not detected
in the spatial frequency domain (Stevens 1978; Brookes and Stevens 1991). Line and edge
terminations, small blobs, cusp-like discontinuities in tangent (Stevens and Brookes 1988b), and
other distinguished 2D points seem to be symbolically marked as ‘place tokens’ (Marr 1976c),
and their local arrangement analyzed as part of a process of describing texture and figure.
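A small computation makes this concrete. The sketch below is merely illustrative, in the spirit of the locally parallel structure analysis (Stevens 1978) rather than a reconstruction of it; all names and parameter values are my own. Each dot is paired with its nearest neighbors, and the orientations of the resulting virtual lines are averaged (as axial quantities) to recover the locally dominant orientation, with no spatial-frequency analysis involved:

```python
import numpy as np

def dominant_orientation(dots, k=4):
    """Estimate the dominant orientation of a dot pattern from the
    'virtual lines' joining each dot to its k nearest neighbors.

    Orientations are axial (theta and theta + 180 deg are equivalent),
    so the usual angle-doubling trick is used to average them.
    """
    angles = []
    for i in range(len(dots)):
        d = np.linalg.norm(dots - dots[i], axis=1)
        for j in np.argsort(d)[1:k + 1]:          # skip the dot itself
            dx, dy = dots[j] - dots[i]
            angles.append(np.arctan2(dy, dx))
    doubled = 2.0 * np.asarray(angles)            # axial -> circular
    mean = 0.5 * np.arctan2(np.sin(doubled).mean(), np.cos(doubled).mean())
    return np.degrees(mean) % 180.0

# Jittered grid, spaced more tightly horizontally than vertically: the
# virtual-line statistics recover the horizontal structure (an angle near
# 0 or 180 degrees) from local geometry alone.
rng = np.random.default_rng(0)
xs, ys = np.meshgrid(np.arange(12) * 0.5, np.arange(6) * 1.5)
dots = np.stack([xs.ravel(), ys.ravel()], axis=1)
print(dominant_orientation(dots + 0.05 * rng.standard_normal(dots.shape)))
```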
The primal sketch and the subsequent ‘2½D sketch’ of visible surfaces described relative
to the observer were distinct in that their construction would involve very different strategies,
and their descriptions would employ very different symbols (Marr 1976c). The discrete nature
of the two symbol systems fairly leads to expecting functionally distinct representations for
images versus surfaces, and that the 2D and the 2½D representations are computed consecutively
(or preponderantly so—see section 2.3). The proposal that the surface representation is
derived from the image representation was central to the proposed computational theory of
stereopsis (Marr and Poggio 1976, 1979).
Perhaps in keeping with the relatively few cortical areas that had been identified in the
early 1970s, and the categorization of neurons into a few receptive field types in the retina
and primary visual cortex, the expectation was that the number of computational strategies,
the number of visual representations, and the number of symbols comprising each, would all
be countable and small.
Marr proposed a coarse categorization of visual representations (corresponding to 2D,
2½D, and 3D stages of information extraction), each with a vocabulary that was sufficiently
rich as to permit (in some sense) complete symbolic descriptions at each discrete stage.
Alternatively, a single representational scheme with enormous computational complexity
would not be conducive to analysis as a symbol system style of computation, nor would it likely
adhere to the principle of modular design. At the other extreme, neither would a very large
number (thousands or an effectively arbitrary number) of weakly coupled representations,
each with primitives for registering or abstracting specialized descriptors of corresponding
2D and 3D configurations (see section 2.4). Instead, conceiving of vision as a computation
achieved by discrete symbol systems organized into a few discrete stages of visual information
processing, it was hoped, would facilitate understanding of their corresponding neural embodiments.
If the actual correspondence between symbol and neural implementation is in fact less direct,
the explanatory utility of the purported visual representation would be questionable.
2.3  The presumed pipeline nature of information processing
There was a tacit expectation for a sufficient preponderance of directional flow of information
from earlier to later stages of information processing to justify distinguishing ‘earlier’ from ‘later’.
The presence of substantial bidirectional, concurrent streams of information and control would
greatly diminish the likelihood that vision could be understood one-stage-at-a-time. Computers
are often organized around pipeline architectures to achieve concurrency of processing. Data are
queued in buffers as they cascade between successive stages. System architects increase the
flexibility of the pipeline by allowing earlier stages to influence not only the data provided
to later stages, but also how those data are processed. The flow of information and control is quite
asymmetrical in computers, however, both in terms of the bandwidth of data flowing ‘upstream’
and the ability for later stages to influence the processing of earlier. Specialized computer
architectures (eg graphical processing units) achieve efficiency for certain computations with
a cascading of early (more sequential) to later (more parallel) processing stages, where each
computational step at the entry to the pipeline decomposes into many independent subtasks that
are distributed across many processors further down the pipeline. Computers have yet to be
devised, however, with a high degree of concurrency at all stages in a computational pipeline.
It is quite beyond current theoretical understanding to make effective use of a highly parallel
bidirectional-pipeline architecture, let alone an architecture comprising multiple interacting
pathways, each with bidirectional flow, connected topologically to create a connected,
cyclic graph of pathways—not merely a radiation or forking of a processing stream into
multiple pathways. Since the computational properties of such complex system architectures
were very poorly understood, biological systems were modeled using computational metaphors
that could be understood, namely pipelines with essentially unidirectional information flow.
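For concreteness, the kind of pipeline those models assumed is easy to state in code. This is a minimal sketch with placeholder stage names of my own choosing, not a rendering of any actual proposal: each stage consumes only the representation produced upstream, and there is no channel by which a later stage can influence an earlier one:

```python
from typing import Iterable, Iterator

# Each stage consumes the representation produced upstream and yields its
# own; information flows strictly forward, as in an image -> primal sketch
# -> 2.5D sketch cascade. Stage contents here are mere placeholders.

def measure_luminance(images: Iterable[str]) -> Iterator[str]:
    for im in images:
        yield f"measurements({im})"

def primal_sketch(measurements: Iterable[str]) -> Iterator[str]:
    for m in measurements:
        yield f"primal_sketch({m})"      # edges, bars, blobs, place tokens

def surface_sketch(sketches: Iterable[str]) -> Iterator[str]:
    for s in sketches:
        yield f"two_and_half_d({s})"     # viewer-centric surface description

# Composing the stages yields a pure pipeline: no stage can reach back
# upstream, which is exactly the assumption questioned in the text.
for out in surface_sketch(primal_sketch(measure_luminance(["im0", "im1"]))):
    print(out)
```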
While neurophysiologists in the early 1970s were beginning to study the interconnections
between visual areas and recurrent pathways (Allman and Kaas 1974; Zeki 1975), an
appreciation for the topological complexity, as it has later come to be understood (Van Essen
and Gallant 1994; Van Essen and Anderson 1995; Fox et al 2005), would have raised questions
regarding ‘early’ versus ‘later’ representations, ie the temporal order of construction of visual
representations, which has profound impact on the nature of the corresponding computational
theories. The potential that high-level knowledge might be brought to bear on the creation
of very ‘early’ representations was not fully recognized. This is not merely a matter of
implementation, for computational strategies could then capitalize on ‘later’ information
to assist in the bootstrapping of low-level (‘early’) visual representations. A primal sketch
would not be restricted to just describing the information as carried by an afferent pathway;
contours could be completed “subjectively”, forms isolated, geometric patterns organized, and
‘place tokens’ selected and organized, all with the assistance of (or imposed by) more global,
attentive, experience-driven constraints and knowledge. The distinction between ‘earlier’ and
‘later’ would be blurred, however, and the range of computational strategies far broader than if
vision were conceived as primarily a pipeline, as a cascading of visual representations.
2.4  Visual representations are presumed to have primitives of low dimensionality
Auditory pitch can be represented by a single perceptual quantity encoded in a tonotopic map
(Reale and Imig 1980). Range or distance information, eg for echolocation of a target object
(O’Neill and Suga 1982), is likewise a scalar. In vision, distance perception results in range,
relief (variations in distance across an extended smooth surface), and ordinal depth (either
between two points or across a step discontinuity or surface boundary). Depth seems to be
mapped across visual space, ie mediated by an underlying spatial representation indexed
by visual direction, in short, a ‘depth map’. Mathematically, this map is a function D : v → d
from visual direction v to distance d; computationally, a function d = D(v). Note that the map
does not necessarily require an underlying retinotopic map laid out in cortex; the perception
of d might be derived on demand (ie by ‘lazy evaluation’) for a specified direction v.
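That on-demand reading can be sketched as follows. The fragment is purely illustrative and every name in it is hypothetical: the map is realized not as a stored array of depths but as a procedure that derives d only when some client queries a direction v:

```python
from typing import Callable, Dict, Tuple

Direction = Tuple[float, float]   # (azimuth, elevation); illustrative

class LazyDepthMap:
    """A depth 'map' realized by lazy evaluation: depth is derived only
    for directions actually queried, and cached thereafter. No retinotopic
    array of depth values need ever exist."""

    def __init__(self, derive: Callable[[Direction], float]):
        self._derive = derive              # stand-in for some cue-combination process
        self._cache: Dict[Direction, float] = {}

    def depth(self, v: Direction) -> float:
        if v not in self._cache:           # evaluate lazily, once per query
            self._cache[v] = self._derive(v)
        return self._cache[v]

# Toy stand-in for whatever process actually estimates distance:
D = LazyDepthMap(lambda v: 2.0 + 0.1 * v[0] - 0.05 * v[1])
print(D.depth((1.0, 0.5)))   # derived on demand for this direction only
```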
Apparent depth plays a large role in our conscious appreciation of the 3D world.
Computationally, however, it is uncertain what role this scalar quantity plays in the perception
of surfaces. Is it a fundamental representational primitive for the description of local surface
shape in 3D, a precursor from which to analyze local surface topography, determine spatial
layout, and perform object recognition? It has been widely presumed so. Similarly, apparent
surface orientation is expected to be a primitive percept. Surface orientation o is vector-like,
equivalent to the projection of the surface normal onto the image plane. The two degrees
of freedom of orientation correspond to the direction (tilt) and magnitude (slant) of the
gradient of distance from viewer to surface.(1) Orientation is commonly expected to also be
mapped O : v → o. Thus distance and its gradient across smooth surfaces are experienced
as depth and slant, respectively, with sharp discontinuities in these measures seen as depth
edges and surface creases, respectively. Analysis of spatial variations across these primitive
depth and orientation maps would then reveal surface topography (much as edges are derived
from luminance measurements).
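For concreteness, these differential relations can be written out. The following is a standard formulation, assuming a depth map D(x, y) over image coordinates (a simplification of the directional map above) viewed along the z axis, with subscripts denoting partial derivatives:

```latex
% Slant sigma and tilt tau from a depth map D(x, y). tan(slant) equals the
% gradient magnitude because the line of sight is along z and the surface
% normal is proportional to (-D_x, -D_y, 1).
\tan\sigma = \lVert \nabla D \rVert = \sqrt{D_x^{2} + D_y^{2}},
\qquad
\tau = \operatorname{atan2}(D_y,\, D_x)

% Topography involves the second derivatives: where the surface is near
% frontoparallel, the principal curvatures approximate the eigenvalues of
% the Hessian of D; depth edges are discontinuities in D itself, and
% surface creases are discontinuities in \nabla D.
H(D) = \begin{pmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{pmatrix}
```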
The immediacy with which we can report which of two targets is the nearer (or brighter)
might suggest that those tasks are mediated by our accessing corresponding maps of depth or
brightness. But considering advances in information processing by computer, our intuitions are
gradually changing regarding what is computationally easy and immediate. For instance, search
engines are now capable of near-instant delivery of information regardless of query content,
because the internal representations are designed to serve such generality. Their computations
have become more tractable as the representations they are applied to have become richer. The
principles by which visual information is represented biologically to achieve both generality and
speed as yet elude us, and the impression that immediacy reflects simple (ie low-dimensional)
internal representations may be an illusion. Considerable representational complexity may
actually underlie what appear to be primitive perceptual competences.
(1) The directional component, tilt, corresponds to the image direction to which the surface normal
would project into the image plane, and varies from 0 to 2π. In orthographic projection, tilt might only
be determined from 0 to π, permitting reversals in direction as demonstrated, eg by the familiar Necker
cube. Slant, the magnitude component, corresponds to the angle between the normal and the line of
sight, and varies from 0 to π/2 (Stevens 1983a).

The distance, surface orientation, and curvature associated with points across a smooth
surface are related mathematically: orientation to distance by spatial differentiation (slant
corresponding to the magnitude, and tilt to the direction of the gradient of distance across
a small patch), and curvature to surface orientation also through spatial differentiation.
Binocular parallax, motion parallax, texture foreshortening, shading, and so forth provide
more-or-less specific 3D information. Despite the fact that some cues are more directly
related to orientation or curvature than they are to depth, they are collectively called ‘depth
cues’. Natural scenes present multiple such cues simultaneously. Since they are mutually
consistent in general, they provide redundant 3D information about the viewpoint and the
underlying surface.
Analogous to the computational benefits of processing luminance information in terms
of higher spatial derivatives (Marr 1976b), surface information is carried most reliably where
the depth cues vary nonlinearly. Surface boundaries are reflected by sharp discontinuities in
any of these measures; surface curvature by nonlinear variations, and slanted surface patches
by linear gradients.(2) Surface curvature is particularly salient, while constant (or zero)
gradients are less so (Stevens and Brookes 1988a; Stevens 1991; Stevens et al 1991).(3) The
familiar ‘depth’ cues presented by natural images generally correlate more with topography
(creases, curvature, and boundaries) than with slant, and least of all with distance per se.
While we see ‘in depth’, depth is unlikely the primitive perceptual quantity. The impression
of depth derived from binocular disparities (potentially one of the most direct sources of
distance information) exhibits simultaneous-contrast and induction effects, analogous to those
reported for brightness in the luminance domain (Brookes and Stevens 1988, 1989). Stereo
depth and brightness appear to be reconstructed by the spatial integration of higher-order
spatial derivatives of disparity and luminance, respectively.
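A one-dimensional caricature illustrates the arithmetic of such reconstruction; the sketch is mine, not a model of the visual process. A disparity profile is reduced to its second spatial derivative and then reintegrated, recovering the profile only up to an unknown offset and ramp, precisely the components (absolute distance and constant gradient) that are weakly perceived:

```python
import numpy as np

def integrate(f, x):
    """Cumulative trapezoidal integral of f over x, starting at zero."""
    steps = 0.5 * (f[1:] + f[:-1]) * np.diff(x)
    return np.concatenate(([0.0], np.cumsum(steps)))

x = np.linspace(0.0, 1.0, 400)
disparity = 0.5 + 0.8 * x + 0.2 * np.sin(6.0 * np.pi * x)  # offset + ramp + relief

# Keep only the second spatial derivative (the 'higher-order' signal) ...
second = np.gradient(np.gradient(disparity, x), x)

# ... and reconstruct by double spatial integration.
recon = integrate(integrate(second, x), x)

# The curved relief survives; what is lost is exactly an offset and a
# constant gradient, ie the residual is (very nearly) affine in x.
residual = disparity - recon
affine = np.polyval(np.polyfit(x, residual, 1), x)
print("max departure of residual from an affine ramp:",
      float(np.abs(residual - affine).max()))
```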
An alternative to low-dimensional representational primitives, therefore, would be
viewer-centric topographical descriptors (measures of extrinsic surface geometry—see
below and Stevens 1995). Spatiotemporally correlated measurements of this sort could serve
as keys into an associative memory for 3D shape recognition, which in turn could support
their particular interpretation in terms of depth and orientation, say. Depth or slant might then
be indirectly derived from a learned association with the so-called ‘depth cues’, but only after
they are integrated on the basis of the common extrinsic geometry the depth cues suggest.
(2) While the direction of the gradient indicates the surface tilt, an alternative to finding that direction
would be to determine the direction of least variation in the given variable (Stevens 1981b).
(3) This matches our everyday experience when viewing pictorial depictions, such as illustrations
presented on a flat display or the printed page. The various ‘depth cues’ are not necessarily consistent,
but it hardly matters. Motion parallax, binocular disparities, and perspective suggest the objectively
correct situation of viewing a flat plane, while the content of that illustration printed or displayed on
that plane suggests otherwise. Remarkably, the pictorial depiction usually wins out, and we readily
disregard the ample evidence that the illustrations are in fact planar.

3 What are percepts for?
We certainly experience the visual world in terms of simple qualities such as brightness, color,
and depth; we have immediate cognitive access to these quantities. Are these representations
constructed for our private appreciation of the world—essentially end percepts—or do they
correspond to essential computational stages that support subsequent computations involving
relative judgments (such as comparing the relative color of two surface patches, or comparing
a color with one previously experienced; comparing the relative distances of two surface
patches, or, again, comparing a shape with a remembered 3D shape)?
For purposes of visual object recognition, Marr and Nishihara (1978) proposed that objects
are described by viewer-independent 3D models. To create such a model would entail fully
compensating for the effects of foreshortening, perspective, and viewpoint—a very difficult
computational task indeed. Their proposal was to proceed from a viewer-centric representation
(eg the 2½D Sketch) to a viewer-independent model which would be matched against other
viewer-independent descriptors. A computationally much more tractable alternative, however,
would be for viewer-centric 3D primitives to index or key directly into a viewer-centric
associative memory (Stevens 1995). This approach stops short of reconstructing a viewer-
independent 3D model prior to recognition (which might well be computationally intractable).
Instead of attempting to describe intrinsic geometry, the extrinsic geometry of surfaces would
be described and matched against an associative memory of extrinsic shape (ie learned
associations of how 3D surface patches appear from different viewpoints). Moreover, the path
from viewer-centric ‘depth cues’ to extrinsic shape descriptors is also more direct, at least
conceptually (Stevens 1995). In essence, a visual system would learn to associate, for each
3D shape to be memorized, how that shape would appear from enough different perspectives
(or vantage points). The reconciliation of multiple depth cues would arise from their projecting
to the same associated shape, ie integration by association (Stevens 1995). Consistent cues
would increase the confidence in the match. Depth, then, would be neither the medium in which
cues are reconciled nor that by which shape recognition is achieved.
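The logic of integration by association admits a compact sketch. The fragment below is a toy of my own construction (quantized cue descriptors, a dictionary standing in for the associative memory): each cue independently keys into the shapes it was previously associated with, mutually consistent cues project to the same shape and raise its tally, and no depth values are ever reconciled explicitly:

```python
from collections import Counter, defaultdict
from typing import Dict, Tuple

# Associative memory: a (cue type, quantized descriptor) key maps to the
# shapes it was associated with during learning. Keys and shape names here
# are illustrative placeholders.
memory: Dict[Tuple[str, int], set] = defaultdict(set)

def learn(shape: str, observations: Dict[str, int]) -> None:
    """Associate each cue's descriptor with the shape, per viewpoint."""
    for cue, descriptor in observations.items():
        memory[(cue, descriptor)].add(shape)

def recognize(observations: Dict[str, int]) -> Counter:
    """Each cue votes for the shapes its descriptor keys into; mutually
    consistent cues project to the same shape and raise its tally."""
    votes = Counter()
    for cue, descriptor in observations.items():
        for shape in memory.get((cue, descriptor), ()):
            votes[shape] += 1
    return votes

# Learn two shapes from quantized extrinsic descriptors, one per cue:
learn("sphere_patch", {"disparity": 3, "texture": 7, "shading": 2})
learn("cylinder_patch", {"disparity": 5, "texture": 7, "shading": 9})

# At recognition, consistent disparity and shading outvote the ambiguous
# texture descriptor; no intervening depth map is constructed.
print(recognize({"disparity": 3, "texture": 7, "shading": 2}))
# -> Counter({'sphere_patch': 3, 'cylinder_patch': 1})
```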
It is ironic that slant and depth, quantities that we feel intuitively to be the core of
our perceptions (measurements accurately provided by range finders, photometers, etc), are
actually the hardest to perceive accurately, the least reliably provided in natural images, and
(in my opinion) the least likely to be the primitives of the representation of visual surfaces.
Depth is nonetheless a useful experimental variable, even if derived secondarily from some
other, more primitive, 3D representation. The challenge for theorists, however, is to propose
the representational primitives that capture the same information (and preserve the same ambiguity)
as that used to encode perceptions in actual visual systems. Introspection may not serve us
best in this pursuit: the form in which information appears to be accessed and delivered to
consciousness is not necessarily the form in which it is represented internally. The latter
form may be more complicated than we can easily visualize. This is a familiar problem to
other scientific disciplines that routinely manipulate models that involve entities with many
degrees of freedom or dimensions. The fact that biological vision is likely a mapping of
spatial measurements across a region of image space into topographical surface descriptors
containing many orthogonal components should encourage the search for computational
models in which 3D perception is achieved without attempting to reduce all those dimensions
to but a few terms (such as slant and depth).

4 Difficulties in inferring visual representation schemata
Marr (1982, page 337) states “… the computational theory of a process is rather independent
of the algorithm or implementation levels, since it is determined solely by the information-
processing task to be solved”, while “… the algorithm depends heavily on the computational
theory, of course, but it also depends on the characteristics of the hardware in which it is
to be implemented”. This gives the hopeful impression that the computational paradigm
allows successive refinement from what to how—and progressing from the theoretical to
the pragmatic—since the latter supposedly places little constraint on the former. But there is
potential circularity in this argument. What constitutes an “information-processing
task to be solved”? The goals of a biological information-processing system, after all, are not
explicitly specified as they are for engineered systems of our own design. In biology, the
computational tasks must be inferred, in part from externally observed behavior, and in part
by attempting to guess the function by studying its internal workings. This approach has
worked well in organisms with far simpler nervous systems and far more restricted behavioral
repertoires for which the processes seem more ‘hard-wired’. When Marr was setting out his
ideas about computational vision, some aspects of early vision, and of depth from binocular
stereopsis, seemed to correspond to unambiguous ‘information-processing tasks’. Without
identifying the ‘client’ processes served by stereopsis, it was presumed sufficient to model
the output as an array of depth values. But such presumptions are not appropriate for more
abstract acts of visual perception, such as appreciating and understanding our immediate
surrounds, detecting and reacting to the events occurring around us, and informing us as we
plan our next action within the world. The “tasks to be solved” that those visual abilities
entail share so many computational steps, so many common streams of information flow,
and so many practical limitations due to the legacy of our evolution, that a computational
theory would be difficult to bootstrap.
And specifically, when should a given visual competence be regarded as directly
reflecting a specific computational goal? If given two similar competences (such as shape
from shading and shape from texture), should their perceptual solutions be regarded as
achieved by separate computational theories, or might there be one computational theory
that unifies and explains both competences? While those questions regard the level of
computational theory, similar concerns arise with attempts to determine the specific nature
of the visual representations. If human subjects can competently report a given perceptual
quantity (such as the presence of a depth discontinuity, the orientation of a surface patch,
or some other measure of low dimensionality), it does not necessarily follow that such a
reported quantity is explicitly represented; rather, it could be derived indirectly from another
representational scheme for which the conversion could be performed ‘on demand’ for the
experimental task. This would appear to be the case where monocular depictions of surfaces
in purely orthographic projection can evoke a measurable impression of depth as well as
slant (Stevens 1981a; Stevens and Brookes 1987). Since distance covaries with surface
orientation, their perceptual counterparts would as well, and their correlation could be used to
infer depth from information about surface orientation and vice versa. One cannot conclude,
therefore, that experimental judgments of either depth or surface orientation derive directly
from corresponding explicit representations; both could be derived indirectly from some
other representation. The immediacy with which a seemingly open-ended set of perceptual
judgments might be performed poses a significant challenge to determining the nature of the
underlying specific visual representations.

5 Conclusions
The computational paradigm expects that (1) the visual system is comprised of separable
strategies, (2) organized into early-to-later stage computations that create descriptions, (3) using
distinguishable, symbolic representational schemes, (4) where the symbol systems underlying
these representations have discrete vocabularies of primitives of low dimensionality that
encode relatively direct assertions about the visual world (eg image-related assertions of
bars, edges, and blobs, and surface-related assertions of orientation, distance, and discontinuities
thereof ). In the thirty years since publication of Marr (1982), one can reflect back on
these assumptions and the computational paradigm they frame. Each of the expectations,
(1)–(4) above, may well reflect not so much the nature of biological information processing,
as the hope that it is understandable in our terms (ie simple enough for us to understand by
analogy to the sorts of computational architectures that we can create).
Those specific assumptions and caveats of the computational paradigm may all be invalid
for biological vision. For instance, regarding the expectation for separable computational
strategies [(1) above], such separability would be difficult to preserve as a system evolves
greater complexity through adaptation of existing architectures and sharing of pathways.
It is as if one expects both modularity and superposition of function simultaneously; they are
usually regarded as mutually exclusive. Assuming separability of strategies is beneficial to the
theoretician, but it also suggests that simpler-is-better for biological information processing,
which in reality may not draw benefit from highly decoupled system designs. Likewise, pipeline
architectures [(2) above] are easy to understand and to create, but rarely used in biological
information processing. Simplifying the modeling of a pathway to a unidirectional stream
may trivialize it to being little more than a conduit, a cable, while a bidirectional pathway (or
more-richly connected network) would eventually be understood to constitute a computational
unit, not separate units connected by wires. Conceiving of vision as symbolic [(3) above]
is to embrace one aspect that surely has its place in biological information processing, but
unfortunately, we don’t fully understand the place of symbol systems in biological vision.
And, finally, [(4) above], our tendency to equate what we experience (brightness, color, depth,
etc) with the representational primitives of our perceptions is perhaps the most self-limiting
assumption of the computational paradigm. To reduce perception to the creation of maps
(of depth, or color, or any other quantity) is to trivialize the act of perceiving. Vision cannot be
a matter of converting a bundle of rays to a bundle of values.
More generally, biological information processing seems amenable to a description in
terms of tasks. But we lack a methodology for discovering the tasks that underlie vision. The
phrase ‘visual system’ usually implies some system boundary, so that the inputs and outputs
may be identified. For engineered devices of our making, the system boundaries are clear,
for we specified them. We need to better understand the clients of the visual system, for their
computational requirements seemingly would dictate the tasks of the visual computations
that serve those clients.
Of Marr’s accomplishments, causing people to think differently about biological
computation was likely foremost. Vision as information processing is fundamentally different
from vision as signal processing. Encouraging the use of theoretical abstractions and levels
of explanation was a close second, taught by his example. As for the specifics, the caveats of
the computational paradigm were only beginning to be laid out when his contributions to the
effort were curtailed. Our expectations of which computational principles underlie vision have
changed, but the basic concept that there are such principles remains valid and valuable.
Acknowledgments. I wish to thank Philip Quinlan and an anonymous referee for the valuable
comments they provided. I thank David Marr for giving us so much to think about, and for the gifts of
his enthusiasm, and brilliance, and warmth.
References
Allman J M, Kaas J H, 1974 “A crescent-shaped cortical visual area surrounding the middle temporal
area (MT) in the owl monkey (Aotus trivirgatus)” Brain Research 81 199–213
Brookes A, Stevens K A, 1988 “Binocular depth from surfaces vs. volumes” Journal of Experimental
Psychology: Human Perception and Performance 15 479–484
Brookes A, Stevens K A, 1989 “The analogy between stereo depth and brightness” Perception 18
601–614
Brookes A, Stevens K A, 1991 “Symbolic grouping versus simple cell models” Biological Cybernetics
65 375–380
Church A, 1932 “A set of postulates for the foundation of logic” Annals of Mathematics, Series 2 33
346–366
Fox M D, Snyder A Z, Vincent J L, Corbetta M, Van Essen D C, Raichle M E, 2005 “The human brain
is intrinsically organized into dynamic, anticorrelated functional networks” Proceedings of the
National Academy of Sciences of the USA 102 9673–9678
Gibson J J, 1966 The Senses Considered as Perceptual Systems (Boston, MA: Houghton Mifflin)
Gibson J J, 1979 The Ecological Approach to Visual Perception (Boston, MA: Houghton Mifflin)
Lulich D P, Stevens K A, 1989 “Differential contributions of circular and elongated spatial filters to the
Café wall illusion” Biological Cybernetics 61 427–435
McCarthy J, 1960 “Recursive functions of symbolic expressions and their computation by machine,
part I” Communications of the ACM 3 184–195
Marr D, 1974 “On the purpose of low-level vision”, MIT AI Laboratory Memo 324
Marr D, 1976a “Analyzing natural images: A computational theory of texture vision” Cold Spring Harbor
Symposia on Quantitative Biology 40 647–662
Marr D, 1976b “Early processing of visual information” Philosophical Transactions of the Royal Society
of London B 275 483–524
Marr D, 1976c “Analysis of occluding contour”, MIT AI Laboratory Memo 372
Marr D, 1982 Vision. A Computational Investigation into the Human Representation and Processing
of Visual Information (New York: W H Freeman)
Marr D, Nishihara H K, 1978 “Representation and recognition of the spatial organization of three-
dimensional shapes” Proceedings of the Royal Society of London B 200 269–294
Marr D, Poggio T, 1976 “Cooperative computation of stereo disparity” Science 194 283–287
Marr D, Poggio T, 1979 “A computational theory of human stereo vision” Proceedings of the Royal
Society of London B 204 301–328
O’Neill W E, Suga N, 1982 “Encoding of target range and its representation in the auditory cortex of
the mustached bat” Journal of Neuroscience 2 17–31
Reale R A, Imig T J, 1980 “Tonotopic organization in auditory cortex of the cat” Journal of Comparative
Neurology 192 265–291
Russell S, Norvig P, 2010 Artificial Intelligence: A Modern Approach 3rd edition (Upper Saddle River,
NJ: Prentice Hall)
Stevens K A, 1978 “Computation of locally parallel structure” Biological Cybernetics 29 19–28
Stevens K A, 1981a “The visual interpretation of surface contours” Artificial Intelligence 17 47–73
Stevens K A, 1981b “The information content of texture gradients” Biological Cybernetics 42 95–105
Stevens K A, 1983a “Slant-tilt: The visual encoding of surface orientation” Biological Cybernetics
46 183–195
Stevens K A, 1983b “Evidence relating subjective contours and interpretations involving occlusion”
Perception 12 491–500
Stevens K A, 1991 “Constructing the perception of surfaces from multiple cues” Mind and Language
5 253–266
Stevens K A, 1995 “Integration by association: Combining three-dimensional cues to extrinsic surface
shape” Perception 24 199–214
Stevens K A, Brookes A, 1987 “Probing depth in monocular images” Biological Cybernetics
56 355–366
Stevens K A, Brookes A, 1988a “Integrating stereopsis with monocular interpretations of planar surfaces”
Vision Research 28 371–386
Stevens K A, Brookes A, 1988b “The convex cusp as a determiner of figure–ground” Perception
17 35–42
Stevens K A, Lees M, Brookes A, 1991 “Combining binocular and monocular curvature features”
Perception 20 425–440
Van Essen D, Anderson C H, 1995 “Information processing strategies and pathways in the primate
visual system”, in An Introduction to Neural and Electronic Networks 2nd edition, Eds S Zornetzer,
J L Davis, C Lau, T McKenna (New York: Academic Press) pp 45–76
Van Essen D, Gallant J L, 1994 “Neural mechanisms of form and motion processing in the primate
visual system” Neuron 13 1–10
Waltz D, 1975 “Understanding line drawings of scenes with shadows”, in The Psychology of Computer
Vision Ed. P H Winston (New York: McGraw-Hill) pp 19–91
Winograd T, 1972 Understanding Natural Language (New York: Academic Press)
Zeki S M, 1975 “The functional organization of projections from striate to prestriate visual cortex in
the rhesus monkey” Cold Spring Harbor Symposia on Quantitative Biology 40 591–600

© 2012 a Pion publication
