There are many conflicting views concerning
the nature of distributed representation, its compatibility or
otherwise with symbolic representation, and its importance in
characterizing the nature of connectionist models and their relationship
to more traditional symbolic approaches to understanding cognition.
Many have simply assumed that distribution is merely an implementation
issue, and that symbolic mechanisms can be designed to take advantage
of the virtues of distribution if so desired. Others, meanwhile,
see the use of distributed representation as marking a fundamental
difference between the two approaches. One reason for this diversity
of opinion is the fact that the relevant notions - especially
that of distribution - are rarely adequately characterized
before the issues are addressed. At this level of generality, an
adequate characterization is one that is sufficiently abstract
to subsume most paradigm cases of representation of a given type,
yet also sufficiently precise to give real theoretical bite when
addressing questions such as those raised above. This paper advances
a definition of distributed representation and shows that, understood
this way, distribution is in fact incompatible with the core notion
of symbolic representation found in the cognitive science literature.
For this reason, genuinely distributed connectionist models cannot
be, or implement, physical symbol systems (Newell & Simon
1976) or "classical" symbolic models (Fodor and Pylyshyn
1988). Thus, I am endorsing the view that distributed connectionist
models do indeed present a radical new approach to modeling cognitive
processes.
1. The Nature of Distribution

Despite the fact that distribution is a central feature of a very large proportion of connectionist models, almost no attention has been given to the problem of providing a comprehensive, systematic definition of the concept. Numerous brief characterizations have been offered - see van Gelder (1990b) for a survey of scores of attempts - but when examined closely they turn out to draw on a wide range of themes, ranging from relatively trivial notions of spatial or neural "spread-out-ness" at one extreme to complete functional equipotentiality at the other. As a consequence, distribution is currently one of the murkiest concepts in the whole of cognitive science.

Fortunately, one concept in particular both figures in a relatively large proportion of characterizations and turns out to describe a very high proportion of the paradigm cases of distribution: namely, the non-discreteness or non-localizability of representations. From this perspective a representation is distributed when multiple items are encoded over the same extent of the available resources, without any more fine-grained correspondence of items to particular locations. For an obvious example, consider the connection weights in a standard feed-forward connectionist network, encoding many different associations over the very same connections.

If the representings of a number of different items are in fact fully superimposed, every part of the representation R must be implicated in representing each item. If this is achieved in a non-trivial way there must be some encoding process that generates R given the various items to be stored, and which makes R vary, at every point, as a function of each item. This process will be implementing a certain kind of transformation from items to representations. This suggests thinking of distribution more generally in terms of mathematical transformations exhibiting a certain abstract structure of dependency of the output on the input.
More precisely, define any transformation from a function F to another function G as strongly distributing just in case the value of G at any point varies with the value of F at any point; the Fourier transform is a classic example. Similarly, a transformation from F to G is weakly distributing, relative to a division of the domain of F into a number of sub-domains, just in case the value of G at every point varies as a function of the value of F at one or more points in each sub-domain. The classic example here is the linear associator, in which a series of vector pairs are stored in a weight matrix by first forming, and then adding together, their respective outer products. Each element of the matrix varies with every stored vector, but only with one element of each of those vectors.

Clearly, a given distributing transformation yields a whole space of functions resulting from applying that transformation to different inputs. If we think of these output functions as descriptions of representations, and the input functions as descriptions of items to be represented, the distributing transformation is defining a whole space or scheme of distributed representations. To be a distributed representation, then, is to be a member of such a scheme; it is to be a representation R of a series of items C such that the encoding process which generates R on the basis of C implements a given distributing transformation.

Distributing transformations (and hence distributed representations) are ubiquitous in connectionist models. Consider for example the transition from input to hidden-layer representation in a fully connected feed-forward network. If we think of the represented items as the elements of the input vector, then the transition is implementing a simple case of a strongly distributing transformation, since the activation of any given hidden unit varies as a function of the activation of every input unit.
The precise form of this distributing transformation can easily be written down in an equation in terms of the connection weights and the activation function of the units. Because the transformation which generates the hidden unit pattern is strongly distributing, the hidden unit pattern itself is appropriately classified as a (strongly) distributed representation.

An excellent example of weakly distributed representation is the working memory (WM) in Touretzky and Hinton's (1988) Distributed Connectionist Production System model (DCPS). This model processes triples of basic elements, where each triple is encoded as a unique pattern of activity over 2000 binary units. Patterns are generated by means of a coarse-coding scheme which activates approximately 28 units for each triple. At any time a number of these triple patterns can be stored in a central 2000-unit WM by activating the relevant units for each pattern. The process of storing patterns in WM - essentially just vector addition - is one that superimposes the basic patterns. It implements a weakly distributing transformation because the activation level of any given unit in WM varies with every pattern to be stored, but at one and only one point in that pattern (i.e., whether a given unit is activated depends on whether there is a 1 or a 0 at one specific point in each of the patterns to be stored).
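The storage step just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the DCPS implementation: the coarse-coding scheme is replaced by a hypothetical `pattern_for` helper that picks 28 of 2000 units pseudo-randomly, and the hypothetical `store` function turns a WM unit on whenever that unit is on in any stored pattern.

```python
import random

N_UNITS = 2000   # size of working memory, as in the text
ACTIVE = 28      # approximate number of active units per triple

def pattern_for(triple, n_units=N_UNITS, active=ACTIVE):
    """Hypothetical stand-in for coarse coding: a repeatable pseudo-random
    binary pattern per triple (NOT the scheme actually used in DCPS)."""
    rng = random.Random(repr(triple))            # seed from the triple itself
    on = set(rng.sample(range(n_units), active))
    return [1 if i in on else 0 for i in range(n_units)]

def store(patterns):
    """Superimpose patterns: a WM unit is on iff it is on in some pattern."""
    return [max(bits) for bits in zip(*patterns)]

a = pattern_for(("F", "A", "B"))
b = pattern_for(("F", "C", "D"))
wm = store([a, b])

# Weak distribution: WM unit i depends on every stored pattern,
# but only on what each pattern has at index i.
i = next(k for k in range(N_UNITS) if wm[k] == 0)   # a unit off in both patterns
a_flipped = a[:]
a_flipped[i] = 1                                     # change pattern a at one point
wm_flipped = store([a_flipped, b])
changed = [k for k in range(N_UNITS) if wm[k] != wm_flipped[k]]
print(changed == [i])
```

Changing one pattern at a single point alters WM at the corresponding unit and nowhere else, which is the dependency structure the definition of a weakly distributing transformation requires.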
Note that, although this characterization of
distribution in terms of a core notion of superposition captures
most of the standard cases of distributed representation found
in connectionist work, some familiar examples are excluded. An
example is the "Wickelfeature"-based representation
of verb forms in Rumelhart & McClelland's (1986) well-known
model of past tense acquisition. (Care is needed here, for the
connections which represent the associations of present with past
tense forms do in fact constitute a genuinely distributed
representation). For this reason the present characterization
is sometimes accused of being too narrow. Now, it is true
that my analysis disagrees with previous usage to some extent,
as in this case. However, given the extraordinary state of disarray
of the concept, any decent analysis will inevitably have
to reject some previous usage as mistaken or misleading. The
most important thing is that the analysis itself carve up the
relevant phenomena at their true conceptual joints. Any taxonomy
of forms of representation which casually lumps together genuinely
superposed representations and merely feature-based representations
is failing to recognize deep differences and is therefore too
wide to be really useful. (Further argument that superposition
is in fact the really central feature of genuinely distributed
representations - and that this category does indeed deserve the
label "distributed" - is given in van Gelder 1990b.)
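The linear associator cited earlier in this section as the classic weakly distributing transformation can be made concrete as follows. This is a rough sketch with hypothetical function names and hand-picked vectors; the input vectors are chosen to be orthogonal so that recall is exact up to a scale factor.

```python
def outer(u, v):
    """Outer product as a list of rows: M[i][j] = u[i] * v[j]."""
    return [[ui * vj for vj in v] for ui in u]

def add(m1, m2):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def associate(pairs):
    """Store (input, output) pairs by summing the outer products y x^T."""
    rows, cols = len(pairs[0][1]), len(pairs[0][0])
    w = [[0] * cols for _ in range(rows)]
    for x, y in pairs:
        w = add(w, outer(y, x))
    return w

def recall(w, x):
    """Matrix-vector product W x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

# Hypothetical example data: two orthogonal inputs, each with squared length 2.
x1, y1 = [1, 1], [1, 2]
x2, y2 = [1, -1], [3, 4]
W = associate([(x1, y1), (x2, y2)])

# Each weight draws on exactly one element of every stored vector:
# W[i][j] = y1[i] * x1[j] + y2[i] * x2[j]
print(W)
print(recall(W, x1))   # 2 * y1, because x1 . x1 == 2 and x1 . x2 == 0
```

Every entry of `W` is a sum with one term per stored pair, so no entry belongs to any one association, yet each association can still be read back out.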
2. Symbolic Representation

There is already considerable consensus in the cognitive science literature on the nature of symbolic representation. The following definition is merely a synthesis of proposals advanced by Newell and Simon, Haugeland, Fodor, and Pylyshyn among others. A scheme of symbolic representation consists of:

(a) a primitive vocabulary consisting of a finite set of disjoint and digital symbol classes (or types); each class is made up of a potentially unbounded number of physical tokens known as symbols;
(b) a set of grammatical rules governing the combining of symbols;
(c) a concatenative mode of combination;
(d) an unbounded set of expression classes (or types), where expression tokens are constructed by concatenation of symbols in conformity with the grammatical rules;
(e) primitive semantic assignments to symbols; and
(f) principles for making semantic assignments to expressions on the basis of the primitive assignments and the syntactic structure of the expression.

For a particular representation to count as symbolic it must belong to such a scheme, and consequently must itself satisfy the above conditions.

Some details are worth noting. First, basic symbol classes are completely disjoint - i.e., no primitive symbol token belongs to more than one class. A consequence of this condition is that expression classes themselves are disjoint. Second, these disjoint classes are digital, which is to say that it is always possible to determine, positively and reliably, whether a given token falls into a particular symbol class (see Haugeland 1981). In practice, symbol tokens usually instantiate some characteristic physical shape or configuration, and it is this fact which underlies the digital separability of symbol classes. (Thus, it is because tokens of the word "cat" have a characteristic shape that we can reliably distinguish them from tokens of the word "bat".)
Third, condition (c) makes explicit the requirement that when primitive symbols are grammatically combined to generate compound expressions, actual tokens of the primitive symbol classes can be found physically instantiated in the expression itself. A concatenative mode of combination just is a way of combining symbols to obtain expressions such that each of the expression's primitive constituents is "tokened" every time the expression itself is tokened. As Newell & Simon (1976) put it, "...a symbol structure is composed of a number of instances (or tokens) of symbols related in some physical way (such as one token being next to another)." It is satisfying this requirement, more than any other, which justifies the description "symbolic".
This last point is worth stressing. Only when
defined in the above reasonably strong and precise terms does
symbolic representation form the basis of the computational theory
of mind as that theory has been articulated by Fodor, Pylyshyn
and others. Though it is not difficult to find weaker formulations
in the literature, on any such weaker account symbolic representation
is not appropriately connected with the notion of computation
construed as symbol manipulation. In particular, if we surrendered
the concatenation requirement, we would thereby be surrendering
the kind of rule-governed structure-sensitive algorithmic operations
that lie at the heart of the computational approach, since those
operations rely directly on the causal role of the constituents
of the expressions being transformed. (For elaboration and defense
of this point see van Gelder 1990a; Fodor & McLaughlin forthcoming;
Fodor & Pylyshyn 1988; Pylyshyn 1984, Ch. 3.)
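Conditions (a)-(f) of the definition in this section can be made concrete with a toy scheme. The sketch below is only an illustration: the vocabulary, the single grammatical rule, and all names (`Token`, `concatenate`, `meaning`, and so on) are hypothetical and chosen for brevity.

```python
from dataclasses import dataclass
from itertools import count

# (a) finite primitive vocabulary; each symbol type has one grammatical category
VOCABULARY = {"cat": "NOUN", "dog": "NOUN", "mat": "NOUN", "on": "PREP"}
_ids = count()

@dataclass(frozen=True)
class Token:
    """A physical token belonging to exactly one symbol type (disjointness)."""
    symbol_type: str
    ident: int

def token(symbol_type):
    """Produce a fresh token; type membership is decidable (digitality)."""
    if symbol_type not in VOCABULARY:
        raise ValueError(f"not a symbol of the scheme: {symbol_type!r}")
    return Token(symbol_type, next(_ids))

def well_formed(symbol_types):
    """(b) one hypothetical grammatical rule: NOUN PREP NOUN."""
    return [VOCABULARY[t] for t in symbol_types] == ["NOUN", "PREP", "NOUN"]

def concatenate(*tokens):
    """(c), (d) build an expression by placing symbol tokens side by side."""
    if not well_formed([t.symbol_type for t in tokens]):
        raise ValueError("ungrammatical expression")
    return tuple(tokens)

# (e) primitive semantic assignments
DENOTATION = {"cat": "the cat", "dog": "the dog", "mat": "the mat"}

def meaning(expression):
    """(f) meaning of the whole from the meanings of the parts and their order."""
    subject, relation, obj = expression
    return (relation.symbol_type,
            DENOTATION[subject.symbol_type],
            DENOTATION[obj.symbol_type])

expr = concatenate(token("cat"), token("on"), token("mat"))
print(meaning(expr))
```

The point of the sketch is the concatenation step: `expr` literally contains a token of each of its primitive constituents, and it is those tokens that `meaning` (or any structure-sensitive operation) inspects.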
3. Incompatibility

It should be obvious that distribution and symbolic representation have significantly different flavors. It is moreover hardly controversial that some cases of distributed representation are patently non-symbolic, and vice versa. But can there be, nevertheless, an overlap between the two categories? Can symbolic representation ever be genuinely distributed? The surprising answer is no. This can be demonstrated relatively easily with the above clarifications of the relevant concepts in hand. In a nutshell, the argument is this: as indicated above, there are quite precise formal and semantic conditions that representations have to satisfy in order to count as symbolic, and it is impossible to satisfy these while remaining genuinely distributed.

3.1 Formal Incompatibility

To count as symbolic a representation must satisfy at least three purely formal conditions: it must belong to a space of expression tokens that is digitally structured; the expression itself must be grammatically well-formed; and it must be concatenatively structured. Distribution is incompatible with symbolic representation because distribution, by its very nature, typically violates the first two conditions and always violates the third.

(a) Analog nature of distributed schemes

While symbolic representation is essentially digital, distributed schemes are typically analog in that they allow a smooth continuum of acceptable representation instances, and so fail to guarantee the possibility of unambiguous determination of a given representation's type identity. Interestingly, this particular difference is often touted as one of the virtues of distribution, giving rise to computational advantages such as the ability to handle very fine shades of meaning.
An explanation of the analog nature of most distributed schemes is to be found in the fact that nothing in the definition of distributing transformations, around which distributed schemes are constructed, requires that the output be digitally structured; indeed, the most natural mathematical form for distributing transformations to take is continuous.

(b) Distributed representations are standardly non-grammatical

To count as symbolic, a representation must be grammatically well-formed; it must be constructed in accordance with the rules of the scheme in question. Distributing transformations, however, are typically not grammatically constrained; they will happily output a single representation of any series of items they are presented with. Whenever generation of the distributed representation is not governed by grammatical rules, the representation cannot be properly regarded as symbolic; yet nothing in the nature of distribution provides for such grammatical constraints.

(c) Distributed representations are invariably non-concatenative

Although these first two considerations are generally sufficient in practice to differentiate distributed and symbolic representations, they cannot conclusively establish incompatibility, since there are ways in which the transformations generating distributed representations can be externally constrained to produce digital output in accordance with grammatical formation rules (for an example see below). However, it is not possible to design distributing transformations producing representations meeting the third requirement on symbolic representations, that of concatenative structure. It is in the very nature of distributing transformations that, when a number of items are superimposed to form one representation, the items themselves are lost, in the sense that there are no longer distinct tokens of the stored items to be found.
In short, it is impossible to combine symbol tokens to form grammatically well-formed structures in a way that both superimposes them and concatenates them.

3.2 Incompatibility of Semantic Structure

An important advantage of symbolic representation is that it is generally possible to determine the meaning of the whole on the basis of basic semantic assignments to its primitive constituents. This is because such representations are constructed out of tokens of their parts, and there is a localist correspondence between parts of the representation and basic features of the represented domain. Thus, when we have a symbolic representation of the situation where the cat is on the mat, there is a localist correspondence between the cat itself and the term "cat". A consequence is that a local change in the situation being represented only requires local, "modular" variation in the representation; for example, if the cat is now replaced by a dog, we need only change "cat" to "dog" to get an accurate representation of the new situation.
Contrast this with distributed representation,
where (by definition) the representation at any point varies as
a function of the content at every point. There is no localist
correspondence of features of the representation to features of
the world at all; rather, the whole representation corresponds
to the whole (i.e., to each part) of the represented situation. Consequently,
any change in the represented situation requires changes across
the whole representation. This fundamental difference is often
conveyed by describing distributed representations as "holistic,"
or by pointing out that storage is context dependent: i.e., any
given item is only stored in the context of other items
or content parts.
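The contrast between modular and holistic change can be illustrated with the Fourier transform, which Section 1 names as a classic strongly distributing transformation. The following is a sketch only: the numeric codes assigned to the words are hypothetical, and a small discrete Fourier transform stands in for whatever encoding process a real model would use.

```python
import cmath

def dft(signal):
    """Discrete Fourier transform: each output value depends on every input value."""
    n = len(signal)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal))
            for k in range(n)]

# Symbolic case: replacing the cat by a dog is a one-token edit.
symbolic = ["cat", "on", "mat"]
symbolic_new = ["dog"] + symbolic[1:]
symbolic_changed = [i for i, (old, new) in enumerate(zip(symbolic, symbolic_new))
                    if old != new]

# Distributed case: encode the same content through the transform.
CODES = {"cat": 1.0, "dog": 2.0, "on": 3.0, "mat": 4.0}   # hypothetical codes
before = dft([CODES[w] for w in symbolic])
after = dft([CODES[w] for w in symbolic_new])
distributed_changed = [k for k, (old, new) in enumerate(zip(before, after))
                       if abs(old - new) > 1e-9]

print(symbolic_changed)      # positions that differ in the symbolic expression
print(distributed_changed)   # positions that differ in the transformed version
```

Only the first position of the symbolic expression changes, whereas every coefficient of the transformed representation changes, since each coefficient is a weighted sum over all input positions.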
4. Case Study: BoltzCONS

One of the main sources of resistance to incompatibility arguments of the above kind is the existence of connectionist models utilizing representations that at least appear to be both distributed and symbolic. Strategically, at least, it is essential for me to show clearly why such models do not in fact constitute counterexamples. Basically, in such models, where the representations are genuinely distributed they turn out to be non-symbolic; and where symbolic, they are not genuinely distributed (though they may have some features on the basis of which current confused usage often classifies them as distributed).

Thus, consider Touretzky's BoltzCONS extension of the DCPS model. This "distributed symbol processing" model contains a WM of essentially the same type as DCPS, except that in this case the basic triple-patterns stored in WM are treated like LISP "cons" cells. By carefully storing the right combination of triples, the overall state of WM is able to function as a representation of a complex data structure. Now, there is no doubt that this memory is genuinely distributed, since the basic patterns were stored there by a weakly distributing transformation (see Sec. 1). A number of considerations seem to suggest that it is also symbolic. Symbols are stored in WM and can be retrieved; these basic symbols are sufficiently distinct that, under normal working conditions, WM states fall into a digitally structured space; and just which symbols are stored at any time is governed by what can be regarded as grammatical constraints.
The crucial difference, however, is that
the BoltzCONS WM is not concatenatively structured.
Recall that a concatenatively structured representation contains
actual tokens of its basic constituents. But when triple A and
triple B are stored in WM as part of the representation of a complex
expression, it is impossible to find that particular, distinctive
pattern of approximately 28 out of 2000 units which the coarse-coding
scheme assigned to triple A. That pattern was lost when
it was stored in memory with the pattern for triple B. So where
is the required token of triple A? We cannot say that the current
overall pattern is itself a token of triple A (i.e., including
the current pattern which is the state of WM among the class of
A-tokens), since - by the very same reasoning - that pattern would
also have to count as an instance of triple B. This would be
an egregious violation of the requirement that the basic symbol
classes be disjoint.
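The argument of this section can be sketched in code. The patterns below are hand-picked and purely hypothetical (they are not produced by the BoltzCONS coarse-coding scheme), and the two membership tests are illustrative readings of what it could mean for a WM state to be a token of a triple.

```python
N_UNITS = 2000
ACTIVE = 28

def pattern(offset, step):
    """Hypothetical triple pattern: the set of indices of its 28 active units."""
    return frozenset((offset + step * k) % N_UNITS for k in range(ACTIVE))

triple_a = pattern(0, 71)     # stands in for the pattern assigned to triple A
triple_b = pattern(5, 67)     # stands in for the pattern assigned to triple B
wm = triple_a | triple_b      # superimposed storage: union of active units

def is_token_of(state, triple):
    """Strict reading: a token of a triple is exactly that triple's pattern."""
    return state == triple

def counts_as_instance(state, triple):
    """Permissive reading: the state counts as an instance of the triple
    whenever all of the triple's units are active in it."""
    return triple <= state

print(is_token_of(wm, triple_a))          # the distinctive A pattern is not there
print(counts_as_instance(wm, triple_a))   # permissive reading admits A ...
print(counts_as_instance(wm, triple_b))   # ... and, by the same reasoning, B
```

On the strict reading the WM state contains no token of triple A at all; on the permissive reading the very same state falls under both A and B, so the symbol classes would no longer be disjoint.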
5. Discussion

The upshot of these arguments is that no representation can be both distributed (i.e., belonging to a scheme fixed by a given distributing transformation) and symbolic at the same time. It is important to see that this is not just terminological bickering; rather, it follows from the very nature of the forms of representation themselves. It has been shown that representations with certain properties do not have certain other properties. These properties are those which are central to the categories of distributed representation and symbolic representation respectively. The only matter of terminology is whether it is wisest to use the labels "distributed" for the first category and "symbolic" for the second; this latter relatively trivial matter has not been discussed here.

What consequences does this have for our understanding of connectionism and its relation to classical models of cognition? First, a crucial clarification. While no representation can be both distributed and symbolic, it is quite possible to represent a symbolic structure in distributed form - i.e., to have a distributed representation of a symbolic structure. (The WM of BoltzCONS is a pertinent example; alternatively, think of a hologram of a page of text.) In such a case the representation itself is distributed while its content is a symbolic entity. It is essential to distinguish between the form of a representation and the form of its content, whatever that content may happen to be.

Now, in a recent authoritative restatement of the classical, symbol-processing conception of cognitive processes, Fodor and Pylyshyn have argued that the use of symbolic representations and structure-sensitive processes lies at the very heart of that approach. From this and the general incompatibility thesis it follows that connectionist models based on distributed representations cannot be, or implement, any classical symbolic model.
(The flip side, of course, is this: if the brain turns out in fact to be a genuinely distributed connectionist-style machine, the Language of Thought hypothesis will have been proven false.)

Where does this leave connectionist modeling of cognitive processes? There are, broadly speaking, three basic strategies, each of which currently has its adherents:

(a) Reject distribution in favor of symbolic representations. This strategy directly implements classical symbol-processing models of cognition in purely localist connectionist networks. Note that such an approach may still have computational advantages over standard implementations even if distributed representations and processes are nowhere employed.
(b) Construct hybrid models which utilize various possible combinations of symbolic and distributed representations. DCPS/BoltzCONS is a good example: the WM is a distributed central store, while the real processing takes place on symbols in auxiliary networks.
(c) Reject symbolic representations in favor of a wholesale move to genuinely distributed representations and processes (e.g., Pollack 1988; Chalmers in press). In cases where symbol structures are themselves the target of processing - e.g., when modeling language processing capacities - this kind of connectionist model operates on the basis of distributed representations of the symbol structures in question.

Insofar as connectionist modeling takes this third option, it presents a truly radical and interesting new alternative to the classical approach.
In my view the main practical benefit of the
analysis sketched in this paper is that it clearly delineates
this third approach. It is now apparent that models of cognition
can be constructed on the basis of representations and processes
that are very different from standard symbolic paradigms, and
that this is true even when the domain being modeled itself includes
linguistic or symbolic structures. Constructing such models means
focusing attention on the distinctive properties of distributed
representations themselves, developing optimal distributed schemes,
and developing processes suited to dealing with information represented
in that form. This shift to the wholly distributed arena has
a liberating effect in that cognitive modeling need not be dominated
by the kind of algorithmic, rule-governed processes which are
only natural as long as information is represented in strictly
symbolic form. If this is correct, we should expect the continued
emergence of connectionist models in which cognitive functions
which previously seemed to require complex symbol-processing are
achieved on the basis of direct transformations of distributed
representations.
References

Chalmers D.J. (in press) Syntactic transformations on distributed representations. Connection Science.
Elman J.L. (1989) Representation and structure in connectionist models. CRL Technical Report 8903, Center for Research in Language, University of California San Diego, La Jolla CA 92093.
Fodor J. & McLaughlin B. (forthcoming) What is wrong with tensor product connectionism? In Horgan T. & Tienson J. (eds) Connectionism and the Philosophy of Mind.
Fodor J.A. & Pylyshyn Z.W. (1988) Connectionism and cognitive architecture: A critical analysis. Cognition; 28: 3-71.
Haugeland J. (1981) Analog and analog. Philosophical Topics; 12: 213-225.
Newell A. & Simon H. (1976) Computer science as an empirical inquiry. Communications of the Association for Computing Machinery; 19: 113-126.
Pollack J. (1988) Recursive auto-associative memory: Devising compositional distributed representations. Proceedings of the Tenth Annual Conference of the Cognitive Science Society. Montreal, Quebec, Canada.
Pylyshyn Z. (1984) Computation and Cognition: Toward a Foundation for Cognitive Science. Cambridge: MIT/Bradford.
Rumelhart D.E. & McClelland J.L. (1986) On learning the past tenses of English verbs. In McClelland J.L., Rumelhart D.E. and the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge MA: Bradford/MIT Press; 216-271.
Touretzky D.S. (1989) BoltzCONS: Dynamic Symbol Structures in a Connectionist Network. Technical Report CMU-CS-89-182, Department of Computer Science, Carnegie Mellon. (To appear in a special issue of Artificial Intelligence on connectionist symbol processing.)
Touretzky D.S. & Hinton G.E. (1988) A distributed connectionist production system. Cognitive Science; 12: 423-466.
van Gelder T.J. (1990a) Compositionality: A Connectionist Variation on a Classical Theme. Cognitive Science (forthcoming).
van Gelder T.J. (1990b) What is the 'D' in 'PDP'? An Overview of the Concept of Distribution. Forthcoming in Stich S., Rumelhart D. & Ramsey W. (eds) Philosophy and Connectionist Theory. Hillsdale N.J.: Lawrence Erlbaum Associates 1990.