Around 2017 I was thinking about the problem of automatic translation and how ambiguity and idioms could be dealt with. One idea was to tag every word $w$ with one of Roget's 1000 categories (represented by a set $C$). Thus we have a map $\kappa: W \rightarrow P(C)$. Roughly speaking an ambiguous word $w$ will have $\kappa(w)$ with cardinality at least 2. Given a context $T$ and a word $w$ occurring in $T$ my idea was to devise an algorithm which functioned a little like a Sudoku puzzle using a concept of 'semantic distance'. We find a word $w$ such that, based on the words $v$ in the context with singleton $\kappa$, we can determine which $c \in \kappa(w)$ is 'closest' to the set of $c$'s inhabiting the singletons of such words. We then make that choice, and this should lead to further words becoming resolvable, and so forth. Of course the problem is how to define such a semantic distance as well as how to guarantee that the process achieves its goal and does not get stuck (though we could introduce random choices). If we view Roget's 1000 categories as organized as leaves (or even nodes) of a binary tree then there is an obvious definition. For instance 'rotation' is semantically closer to 'motion' than it is to 'feeling'.
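As a rough sketch of how this Sudoku-like resolution could run, here is some Python; the binary-tree encoding of categories, the toy lexicon and the greedy selection rule are illustrative assumptions, not a worked-out proposal.

```python
# Sketch of the Sudoku-like disambiguation described above.
# Categories are leaves of a binary tree, encoded as 0/1 path strings;
# the toy kappa below is purely illustrative.

def tree_distance(c1, c2):
    """Edges from c1 up to the lowest common ancestor and down to c2."""
    common = 0
    for a, b in zip(c1, c2):
        if a != b:
            break
        common += 1
    return (len(c1) - common) + (len(c2) - common)

def avg_distance(c, resolved_cats):
    return sum(tree_distance(c, r) for r in resolved_cats) / len(resolved_cats)

def disambiguate(kappa):
    """Repeatedly fix the ambiguous word whose best category lies closest,
    on average, to the categories of the words already resolved."""
    resolved = {w: next(iter(cs)) for w, cs in kappa.items() if len(cs) == 1}
    pending = {w: cs for w, cs in kappa.items() if len(cs) > 1}
    while pending and resolved:
        cats = set(resolved.values())
        best = {w: min(cs, key=lambda c: avg_distance(c, cats))
                for w, cs in pending.items()}
        # A random choice could be substituted here to avoid getting stuck.
        w = min(best, key=lambda x: avg_distance(best[x], cats))
        resolved[w] = best[w]
        del pending[w]
    return resolved

# 'turn' resolves to its rotation-side sense because both unambiguous context
# words sit on nearby branches of the (toy) category tree.
kappa = {"wheel": {"0010"}, "spin": {"0000"}, "turn": {"0011", "110"}}
print(disambiguate(kappa))   # {'wheel': '0010', 'spin': '0000', 'turn': '0011'}
```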
There is a problem with compositionality for idioms like 'raining cats and dogs' or for a term like 'white rhinoceros'. Thus composition of meanings is in general multi-valued. We have uploaded a small text about this on ResearchGate and other platforms.
The vector representations used in LLMs suggest the following speculation. Could it be that meaning can be coherently constructed out of complex entities which themselves have no intrinsic (or easily assignable) meaning? Or, as in quantum mechanics, the wave function (proto-meaning) is essentially a superposition of eigenstates corresponding to actual observables (meanings).
In this note we present an abstract formal approach to the basic problems regarding texts and meaning.
We start with a non-empty set $W$ of word-forms, expressions which have no meaning-bearing parts. We consider a subset $T \subset W^\star$ of possible meaningful texts. $T$ must satisfy the following conditions:
\[ W \subset T\]
\[ \forall t \in T \; \bigl(t \notin W \rightarrow \exists s,u \in T \;\, t = su\bigr)\]
The last axiom means that every text has at least one syntactic decomposition - a binary tree expressing successive division into meaningful elements down to the level of $W$.
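As a minimal sketch, assuming membership in $T$ is decidable via a predicate (a placeholder called `is_meaningful` below) and texts are tuples of word-forms, one such decomposition tree can be searched for recursively:

```python
# Recover one binary decomposition tree of a text t, assuming membership in T
# is given by the placeholder predicate `is_meaningful` and t is a tuple of
# word-forms.  Returns nested pairs down to the level of W, or None.

def decompose(t, is_meaningful):
    if len(t) == 1:                     # an element of W: a leaf
        return t[0]
    for i in range(1, len(t)):          # try every split t = s u
        s, u = t[:i], t[i:]
        if is_meaningful(s) and is_meaningful(u):
            left = decompose(s, is_meaningful)
            right = decompose(u, is_meaningful)
            if left is not None and right is not None:
                return (left, right)
    return None
```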
We are given furthermore a non-empty set $M$ of possible meanings.
We postulate a map \[\sigma : T \rightarrow P(M)\] satisfying $\sigma(t) \neq \emptyset$ for all $t \in T$ and a map \[\lambda : M \rightarrow P(T)\] satisfying $\lambda(m) \neq \emptyset$ for all $m \in M$. The first map gives the possible meanings of a text $t$ and the second map gives the possible ways to express a given meaning $m$.
Recall that for any set $X$ we have a map $P(P(X)) \rightarrow P(X)$ obtained by taking unions. Using this map we can compose $\sigma$ and $\lambda$ to obtain maps \[\pi : T \rightarrow P(T)\] and \[\alpha : M \rightarrow P(M)\] which express the ways to paraphrase a given text and the possible linguistic ambiguities (semantic misreadings) conveyed by expressions of a given meaning.
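Concretely, with everything finite and $\sigma$, $\lambda$ stored as dictionaries of sets, the composition through the union map might look as follows (the sample entries are invented):

```python
# pi(t): paraphrases of t  = union of lambda(m) over m in sigma(t)
# alpha(m): ambiguities of m = union of sigma(s) over s in lambda(m)

def pi(t, sigma, lam):
    return set().union(*(lam[m] for m in sigma[t]))

def alpha(m, sigma, lam):
    return set().union(*(sigma[s] for s in lam[m]))

sigma = {"bank": {"river-edge", "finance"}, "shore": {"river-edge"}}
lam   = {"river-edge": {"bank", "shore"}, "finance": {"bank"}}
print(pi("bank", sigma, lam))           # {'bank', 'shore'}
print(alpha("river-edge", sigma, lam))  # {'river-edge', 'finance'}
```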
Given $s,t \in T$ we define $s \subset t$ to mean that there exist $s_1,\dots,s_n \in T$ and $u_2,\dots,u_n \in T$ such that $s_1 = s$, $s_n = t$ and, for each $i > 1$, either $s_i = s_{i-1}u_i$ or $s_i = u_i s_{i-1}$. For $n = 1$ we require simply that $s = t$.
We postulate that there is a family of maps $\rho_{ts} : \sigma(t) \rightarrow \sigma(s)$ for each $t,s \in T$ with $s \subset t$ which satisfies the properties $\rho_{tt} = id$ and $\rho_{su}\circ\rho_{ts} = \rho_{tu}$ whenever $u \subset s \subset t$.
Thus we boldly state that there is a restriction to 'white' of the standard meaning of 'white rhinoceros'. This means that the meaning of a text uniquely determines the meanings of all its meaningful components.
Furthermore $\rho$ must satisfy the fundamental property: Given $t = s_1s_2$ with $s_1,s_2 \in T$ and given $m_1 \in \sigma(s_1)$ and $m_2 \in \sigma(s_2)$ there exists at least one $m \in \sigma (t)$ such that $\rho_{ts_1}(m) = m_1$ and $\rho_{ts_2}(m) = m_2$.
An important corollary is that if $s \subset t$ then for every $m \in \sigma(s)$ there exists $n \in \sigma(t)$ such that $\rho_{ts}(n) = m$: it suffices to apply the fundamental property repeatedly along a chain $s = s_1, s_2, \dots, s_n = t$ witnessing $s \subset t$, lifting $m$ one step at a time.
We hold that a lawful syntactical combination of meaningful expressions has some meaning. Indeed 'green idea' is just as meaningful as Polonius' 'green girl'. Metaphor should be at the heart of any cogent linguistic philosophy and formal linguistics. But our principle might seem to fail here because the restriction of the meaning of this expression to 'green' has a different meaning from the ordinary perceptual meaning of 'green'. Metaphoric green is a valid meaning for the expression 'green' and thus the meaning of the combination 'green idea' is valid. However even the expression 'green idea' with 'green' in its perceptual sense is just as meaningful as Meinong's 'square circle'.
The new senses of 'green' and 'white' acquired by restriction from 'green girl' and 'white rhinoceros' recall some of the mechanisms of attention in LLMs.
We define the set of $t$-contexts for $t\in T$ to be $C(t) = \{(t_1,t_2) : t_1,t_2 \in T \;\&\; t_1tt_2 \in T\}$. For $c = (t_1,t_2) \in C(t)$ we write $c[t]$ for $t_1tt_2$.
Finally we postulate that, for $t \in T$, there is a map $\kappa : \sigma(t) \rightarrow C(t)$ satisfying the following property for all $m \in \sigma(t)$:
\[ \forall n \in \sigma (\kappa (m)[t])\, \rho_{\kappa(m)[t] t}(n) = m \]
This means that given an expression $t$ and a meaning $m \in \sigma(t)$, there is a context which, even if itself ambiguous, determines the meaning of $t$ uniquely.
We can think of contexts for 'white rhinoceros' involving 'stuffed animal' and involving 'species' which determine distinct meanings.
The above theory could be carried out using the set $W^\bullet$ of terms generated freely from $W$ by a non-associative binary operation and by postulating that each $t \in T$ has a unique syntactic decomposition tree in $W^\bullet$ - but this is perhaps a little too restrictive.
We can rewrite the above theory in probabilistic terms, especially $\kappa$.
We can replace contexts with co-occurrences, and turn $\kappa$ into a map which, from the data of certain $t_i$'s co-occurring with $t$ (at certain distances), assigns a probability distribution to the possible meanings of $t$.
But let us now consider another approach. We suppose that there is a 'metric' defined on $M$, $d: M \times M \rightarrow \mathbb{R}_0^+$. To simplify things we can also consider a finite disjoint decomposition of $M$ into categories $M = \bigcup_{c\in C} M_c$ (for example something based on the 1000 categories of Roget's thesaurus, though these of course only work for $W$ or small texts) and consider the distance only on $C$, extending it in a trivial way to $M$. An important question: what is the distance of $m \in \sigma(ts)$ to $\rho_{(ts)t}(m) \in \sigma(t)$? Note that each $t \in T$ determines a subset $\sigma(t) \subset M$. Suppose we have a context $c \in C(t)$ and consider $\sigma(c) \subset M$. How do we define the distance between an element $x \in M$ and a subset $X \subset M$? One possibility is $d(x,X) = \min \{d(x,z) : z \in X \}$, but this is not what we want. Rather we need an average over all the distances $d(x,z)$ for $z \in X$. For finite $M$ this is easy; otherwise we need to introduce a measure on $M$. Now we can consider a map
\[\kappa: C(t) \rightarrow \sigma(t) \]
which associates to each $c \in C(t)$ the element $m \in \sigma(t)$ such that $d(m, \sigma(c))$ is smallest. Or from a practical perspective we can use the distance induced by the finite decomposition into categories. This works best (for the case of Roget's categories) if instead of using $\sigma(c)$ we use the subset of $M$ determined by all of the $\sigma(w)$ for $w$ occurring in $c$. The problem with this approach is that the minimum may not be unique. But for a text $t$ we can decompose it into a context $c[s]$ for different $s \subset t$. Our algorithm would first seek the right $s$ so that $\kappa$ determines a unique minimum. This will allow us to refine the images in $M$ for the other contexts (strictly speaking this is no longer the union of the $\sigma(w)$ but a refined subset). It is to be hoped that this process could be continued until the entire $t$ is disambiguated. Is this not what the human mind does?
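A minimal sketch of this distance-based $\kappa$, assuming $M$ is finite, the candidate meanings and context meanings are available as finite collections, and $d$ is any metric on $M$ (the tie tolerance is an added detail):

```python
# Distance from a meaning m to a finite set of meanings X: the average of the
# pairwise distances, as discussed above.
def avg_dist(m, X, d):
    return sum(d(m, x) for x in X) / len(X)

# kappa: pick the candidate meaning of t closest (on average) to the meanings
# gathered from the context; return None when the minimum is not unique, in
# which case the algorithm should first disambiguate a different sub-text.
def kappa(candidates, context_meanings, d, tol=1e-9):
    scores = {m: avg_dist(m, context_meanings, d) for m in candidates}
    best = min(scores, key=scores.get)
    if any(abs(scores[m] - scores[best]) < tol for m in scores if m != best):
        return None
    return best
```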
It may be worth considering that $M$ has a more refined description in terms of subsets of a certain 'semantic space' $\Sigma$ in such a way that the $\sigma$ of elements of $W$ are like 'points', their concatenations like 'lines' and so forth. And we can define the metric on $\Sigma$ rather than directly on $M$.
Note that we must not confuse the mathematical with the philosophical aspect. If we postulate that meaning can be formally analyzed and studied via context this is not meant to imply that meaning is context, any more than studying a recursive set as a set means that we are identifying the associated algorithm with the set itself.
Given a $T \subset W^\star$ (for simplicity we assume it is finite) as in section 1, let us consider a map $c: T \rightarrow P(T)$ which associates to each $t \in T$ the set of all $s \in T$ such that $t$ occurs in $s$ (contexts). From a practical perspective it is better to restrict $c$ to contexts whose length does not exceed a certain bound, which we fix to be $n$ (the set of such texts is denoted by $T_n$), and denote the resulting map by $c_n$. Of course $c_n$ then becomes empty for $t$ precisely of length $n$. We can define a notion of distance $d(t_1,t_2)$ between $t_1,t_2 \in T$. This is given by
\[ d(t_1,t_2) = \frac{\sum_{t \in T_n :\, t_1,t_2 \in t} \operatorname{dist}(t_1,t_2,t)}{|T_n|} \]
where $\operatorname{dist}$ is the ordinary (positional) distance between $t_1$ and $t_2$ in $t$. There are various ways to define distances between sets. In our case we should take $d(A,B)$ to be the average distance. The idea is to decompose $c_n(t)$ into subsets which are far apart, reflecting distinct semantic categories. If for a given radius the elements of $T_n$ within that radius are random and dispersed then this is not possible. Ideally we wish to find 'clusters'. We can view the decomposition into clusters as writing a vector representing a general element as a linear combination of vectors representing the clusters or categories.
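As a sketch, taking the texts in $T_n$ to be tuples of words and reading $\operatorname{dist}(t_1,t_2,t)$ as the smallest positional gap between occurrences (one possible reading), the distance might be computed as follows:

```python
# Positional distance between w1 and w2 inside a single text (a tuple of
# words); we take the smallest gap between occurrences, one possible reading
# of 'the ordinary distance'.
def dist_in_text(w1, w2, text):
    p1 = [i for i, w in enumerate(text) if w == w1]
    p2 = [i for i, w in enumerate(text) if w == w2]
    return min(abs(i - j) for i in p1 for j in p2)

# The corpus-wide distance of the formula above: sum over the texts in T_n
# containing both words, divided by |T_n|.  (Pairs that never co-occur get
# distance 0 under the formula as stated.)
def d(w1, w2, T_n):
    total = sum(dist_in_text(w1, w2, t) for t in T_n if w1 in t and w2 in t)
    return total / len(T_n)
```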
Note that if $t_1t_2$ is known to be part of a given cluster then it is plausible that this cluster also determines one for $t_1$ and for $t_2$, thus vindicating the existence of the map $\rho$.
Let us look more closely at $T$. It can mean all possible texts in a given language.
But if we consider a person $p$ then we can take $T_p$ to mean the subset generated by the meaningful subsegments of the text $t_p$ consisting of all linguistic material either thought in inner verbal discourse, heard or spoken from the moment of birth to the moment of death. Or we can consider
in some sense all the possible such maximal onto-texts. We can also define the same kind of text for communities and their history. $T$ is then like a book whose pages correspond to a possible total text of linguistic material in a certain possible history of a community (cosmo-texts). The set of possible books of the world encompassing the total linguistic-consciousness of all people throughout history.
A text is mainly about certain topics. A biography is about a person. Can we give a purely textual definition of aboutness ?
Since a cosmo-text is finite there will be meaningful texts which escape it. Is there an English text of 1000 words which will never be read by mankind and yet is highly meaningful (or would be meaningful in its possible encompassing cosmo-text)? Can we give a purely textual, combinatorial definition of 'highly meaningful'? We give an abstract answer. Consider a set $\Sigma$ of symbols and suppose we have a metric $d$ on $\Sigma^\star$. Let $S \subset \Sigma^\star$. Let $S(w)$ denote all elements in $S$ with initial segment $w$. Then a word $w \in \Sigma^\star$ is more impactful (meaningful) than $w' \in \Sigma^\star$ for $s$ in $S$ if the rough 'distance' between $S(sw)$ and $S(sw')$ is significant (again we need a good notion of distance between sets). This would correspond to an 'influential' or 'seminal' work in the global cosmo-text.
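A crude sketch of this comparison, assuming a metric `d` on strings is given and using the average pairwise distance as the (admittedly rough) set distance:

```python
# S(w): the elements of S having w as an initial segment.
def prefix_set(S, w):
    return {x for x in S if x.startswith(w)}

# A rough distance between sets: the average pairwise distance (infinite when
# one of the continuation sets is empty).
def set_distance(A, B, d):
    if not A or not B:
        return float("inf")
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

# w is 'more impactful' than w_prime after s when this gap is significant.
def impact_gap(S, s, w, w_prime, d):
    return set_distance(prefix_set(S, s + w), prefix_set(S, s + w_prime), d)
```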
Large Language Models such as GPT-3 use high-dimensional ($\dim V = 12,288$) vector-space representations of the meanings of certain textual units ('tokens'). These are generated from context in large data sets. The idea of having certain semantic 'atoms' (sememes) from which possible meanings are combinatorially constructed can be found for instance in Greimas (cf. Osgood's semantic differential for studying the variation of connotation across different cultures). Some (such as René Thom) have claimed that the idea that meaning should have a continuous, geometric aspect is found in Aristotle. Leibniz' characteristica used 'primitive terms' but it is not clear if they are combined in a simple algebraic, combinatorial or mereological way, or if complex logical expressions must be involved (or associated semantic networks). But in embedding matrices we have what would seem to be a quantification of meaning: each 'sememe' is given a 'weight' which determines its geometric relation to other meaning-vectors in a crucial way (the weights cannot be dismissed as probabilistic or 'fuzzy' aspects). To us this would correspond to the 'more-or-less' aspect of species in Aristotle. A very interesting aspect of embedding matrices is how they capture analogy through simple vector operations. This suggests another possible formalization of Aristotelian 'difference', the same difference operating on two different genera. We get a notion of semantic distance and semantic relatedness. This also vindicates Thom's perception of geometry and dynamics in the spaces of genera.
Some questions to ask: are these token-meaning-vectors linearly independent? If not, can we work with a chosen basis? If the token is ambiguous, is the corresponding vector a kind of superposition of possible meanings, as in quantum theory? How are we to understand the idea of the meaning of complex expressions being linear combinations of the meaning representations of the tokens occurring in the expression? It would of course be interesting to analyze these questions relative to the other fundamental components of LLMs (attention in transformers, multi-layer perceptrons) - even if these are more practically oriented rather than reflecting actual linguistic and cognitive reality.
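The first and third questions at least can be phrased concretely; here is a small numpy sketch (the embedding matrix, the sense vectors and the weights are all invented for illustration):

```python
import numpy as np

# Linear independence of the token vectors: with an embedding matrix E of
# shape (vocab_size, dim), the rows are independent exactly when the rank of
# E equals vocab_size -- impossible as soon as the vocabulary is larger than
# the embedding dimension.
def rows_independent(E, tol=1e-8):
    return np.linalg.matrix_rank(E, tol=tol) == E.shape[0]

E = np.random.randn(50, 12288)          # a toy 'vocabulary' of 50 tokens
print(rows_independent(E))              # True here, since 50 <= 12288

# A toy version of the 'superposition' reading of ambiguity: an ambiguous
# token as a weighted sum of hypothetical sense vectors.
sense_river, sense_money = np.random.randn(2, 12288)
bank = 0.6 * sense_river + 0.4 * sense_money   # illustrative weights only
```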
Suppose we are given a large text $T$ generated by a set of words $W$ and a context window $S$ of size $n$. Suppose we wished to represent the elements of $W$ as vectors of some vector space $V$ in such a way that given $v,w \in W$ the modulus of the inner product $|\langle v,w\rangle|$ gives the probability of the two words co-occurring in contexts $S$. Consider the situation: it is very rare for words $s_1$ and $s_2$ to co-occur, but words $s_1$ and $s_3$ co-occur sometimes, as do $s_2$ and $s_3$. But there is also a word $s_4$ which never co-occurs with $s_3$ but has the same co-occurrence frequencies with $s_1$ and $s_2$ as does $s_3$. Then it is easy to see that there is no way to represent $s_1$, $s_2$, $s_3$, $s_4$ in the same plane in such a way that these properties are expressed by the inner product. Thus the dimension must go up by one. We can define the geometric $n$-co-occurrence dimension as the minimal dimension of a vector space adequate to represent co-occurrence frequencies by an inner product. We can ask what happens as $n$ increases: does the geometric dimension also increase (and in what manner) or does it stabilize after a certain value?
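One way to explore this numerically is to fit vectors in increasing dimensions to a target matrix of co-occurrence probabilities and watch the residual. The sketch below builds its target from hidden three-dimensional vectors purely as a sanity check (with real data $P$ would hold estimated probabilities), and since the fit is non-convex it only certifies a dimension when the residual is (near) zero.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Sanity-check target: |<u_i, u_j>| for hidden vectors in R^3, so some
# 3-dimensional representation certainly exists.
U = rng.standard_normal((4, 3))
P = np.abs(U @ U.T)

def residual(flat, k):
    V = flat.reshape(-1, k)
    return np.sum((np.abs(V @ V.T) - P) ** 2)

# The geometric co-occurrence dimension is estimated as the smallest k whose
# best fit has (near) zero residual; several random restarts guard (weakly)
# against local minima of this non-convex problem.
for k in (1, 2, 3, 4):
    best = min(minimize(residual, rng.standard_normal(P.shape[0] * k),
                        args=(k,)).fun
               for _ in range(10))
    print(f"k = {k}: residual {best:.6f}")
```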