Around 2017 I was thinking about the problem of automatic translation and how ambiguity and idioms could be dealt with. One idea was to tag every word $w$ with one or more of Roget's 1000 categories (drawn from a set $C$). Thus we have a map $\kappa: W \rightarrow P(C)$. Roughly speaking an ambiguous word $w$ will have $\kappa(w)$ with cardinality at least 2. Given a context $T$ and a word $w$ occurring in $T$ my idea was to devise an algorithm which functioned a little bit like a Sudoku puzzle using a concept of 'semantic distance'. We find a word $w$ such that, based on the words $v$ in the context for which $\kappa(v)$ is a singleton, we can determine which $c \in \kappa(w)$ is 'closest' to the set of categories inhabiting those singletons. We then make that choice, which should lead to further words being resolved, and so forth. Of course the problem is how to define such a semantic distance, and how to guarantee that the process achieves its goal and does not get stuck (though we could introduce random choices). If we view Roget's 1000 categories as organized as leaves (or even nodes) of a binary tree then there is an obvious definition. For instance 'rotation' is semantically closer to 'motion' than it is to 'feeling'.
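As a purely illustrative sketch (the category codes, the words and the greedy resolution order below are invented, not taken from Roget), the tree distance and the 'Sudoku' step could look as follows in Python:

```python
# Sketch of the 'Sudoku-like' disambiguation idea (illustrative, not Roget's actual tree).
# Categories are leaves of a binary tree, encoded as bit-strings; the semantic distance
# between two categories is the tree distance between their leaves.

def tree_distance(c1: str, c2: str) -> int:
    """Distance between two leaves of a binary tree encoded as paths like '0110'."""
    k = 0
    while k < min(len(c1), len(c2)) and c1[k] == c2[k]:
        k += 1
    return (len(c1) - k) + (len(c2) - k)

# kappa maps each word to its set of candidate categories (hypothetical codes).
kappa = {
    "rotation": {"0010"},            # unambiguous
    "motion":   {"0011"},
    "feeling":  {"1100"},
    "bank":     {"0101", "1001"},    # ambiguous word to be resolved
}

def resolve(context_words):
    """Greedy pass: fix ambiguous words by closeness to the already-unambiguous ones."""
    resolved = {w: next(iter(cs)) for w, cs in kappa.items()
                if w in context_words and len(cs) == 1}
    pending = [w for w in context_words if len(kappa[w]) > 1]
    for w in pending:
        # choose the candidate category closest (in total) to the resolved ones
        best = min(kappa[w], key=lambda c: sum(tree_distance(c, r) for r in resolved.values()))
        resolved[w] = best   # this choice may in turn help resolve further words
    return resolved

print(resolve(["rotation", "motion", "bank"]))
```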
There is a problem with compositionality for idioms like 'raining cats and dogs' or for a term like 'white rhinoceros'. Thus composition of meanings is in general multi-valued. We have uploaded a small text about this on ResearchGate and other platforms.
The vector representations used in LLMs suggest the following speculation. Could it be that meaning can be coherently constructed out of complex entities which themselves have no intrinsic (or easily assignable) meaning ? Or it may be that, as in quantum mechanics, the wave function (proto-meaning) is essentially a superposition of eigenstates corresponding to actual observables (meanings).
In this note we present an abstract formal approach to the basic problems regarding texts and meaning.
We start with a non-empty set $W$ of word-forms,
expressions which have no meaning-bearing parts. We consider a subset $T \subset W^\star$ of possible meaningful texts. $T$ must satisfy the following conditions: \[ W \subset T\] \[ \forall t \in T,\ t\notin W \rightarrow \exists s,u \in T,\ t = su\] The last axiom means that every text has at least one syntactic decomposition - a binary tree expressing successive division into meaningful elements down to the level of $W$.
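As an illustration, here is a small sketch in Python, over an invented finite fragment of $W$ and $T$ (texts represented as tuples of word-forms), of checking this decomposition axiom by recursively searching for splits $t = su$:

```python
from functools import lru_cache

# Hypothetical finite fragment: word-forms W and meaningful texts T (as tuples over W).
W = {("white",), ("rhinoceros",), ("green",), ("idea",)}
T = W | {("white", "rhinoceros"), ("green", "idea")}

@lru_cache(maxsize=None)
def decomposes(t: tuple) -> bool:
    """Does t admit a binary decomposition tree bottoming out in W?"""
    if t in W:
        return True
    # the axiom requires some split t = s u with s, u in T, each again decomposable
    return any(t[:i] in T and t[i:] in T and decomposes(t[:i]) and decomposes(t[i:])
               for i in range(1, len(t)))

assert all(decomposes(t) for t in T)   # T satisfies the axiom on this fragment
```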
We are given furthermore a non-empty set $M$ of possible
meanings. We postulate a map \[\sigma : T \rightarrow P(M)\] satisfying
$\sigma(t) \neq \emptyset$ for all $t \in T$ and a map \[\lambda : M
\rightarrow P(T)\] satisfying $\lambda(m) \neq \emptyset$ for all $m \in
M$. The first map gives the possible meanings of a text $t$ and the
second map gives the possible ways to express a given meaning $m$.
Recall that for any set $X$ we have a map $P(P(X)) \rightarrow P(X)$ obtained by
taking unions. Using this map we can compose $\sigma$ and $\lambda$ to
obtain maps \[\pi : T \rightarrow P(T)\] and \[\alpha : M \rightarrow
P(M)\] which express the ways to paraphrase a given text and the
possible linguistic ambiguities (semantic misreadings) conveyed by
expressions of a given meaning.
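A toy sketch of these compositions (the texts and meaning labels below are invented):

```python
# Toy sigma and lambda over a tiny hypothetical fragment; composing them through the
# union map P(P(X)) -> P(X) gives the paraphrase map pi and the ambiguity map alpha.

sigma = {  # text -> possible meanings
    "bank":  {"RIVERBANK", "FINANCIAL_BANK"},
    "shore": {"RIVERBANK"},
}
lam = {    # meaning -> possible expressions ('lam' since 'lambda' is a Python keyword)
    "RIVERBANK":      {"bank", "shore"},
    "FINANCIAL_BANK": {"bank"},
}

def pi(t):      # paraphrases of t: union of lam(m) over m in sigma(t)
    return set().union(*(lam[m] for m in sigma[t]))

def alpha(m):   # possible misreadings of m: union of sigma(t) over t in lam(m)
    return set().union(*(sigma[t] for t in lam[m]))

print(pi("shore"))          # {'bank', 'shore'}
print(alpha("RIVERBANK"))   # {'RIVERBANK', 'FINANCIAL_BANK'}
```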
Given $s,t \in T$ we define $s \subset t$ to mean that there exist $s_1,\ldots,s_n \in T$ and $u_2,\ldots,u_n \in T$ such that, for $n > 1$, $s_1 = s$, $s_n = t$ and, for each $i > 1$, either $s_i = s_{i-1}u_i$ or $s_i = u_i s_{i-1}$. For $n = 1$ we require that $s = t$. We postulate that there is a family of
maps $\rho_{ts} : \sigma(t) \rightarrow \sigma(s)$ for each $t,s \in T$
with $s \subset t$ which satisfies the properties $\rho_{tt} = id$ and
$\rho_{su}\circ\rho_{ts} = \rho_{tu}$. Thus we boldly state that there
is a restriction to 'white' of the standard meaning of 'white
rhinoceros'. This means that the meaning of a text uniquely determines
the meanings of all its meaningful components.
Furthermore
$\rho$ must satisfy the fundamental property: Given $t = s_1s_2$ with
$s_1,s_2 \in T$ and given $m_1 \in \sigma(s_1)$ and $m_2 \in
\sigma(s_2)$ there exists at least one $m \in \sigma (t)$ such that
$\rho_{ts_1}(m) = m_1$ and $\rho_{ts_2}(m) = m_2$. An important
corollary is that if $s \subset t$ then for every $m \in \sigma(s)$ there exists $n \in \sigma(t)$ such that $\rho_{ts}(n) = m$ (this follows by applying the fundamental property repeatedly along a chain witnessing $s \subset t$ and using the composition law for $\rho$). We hold
that a lawful syntactical combination of meaningful expressions has some
meaning. Indeed 'green idea' is just as meaningful as Polonius' 'green
girl'. Metaphor should be at the heart of any cogent linguistic
philosophy and formal linguistics. But our principle might seem to fail
here because the restriction of the meaning of this expression to
'green' has a different meaning from the ordinary perceptive meaning of
'green'. Metaphoric green is a valid meaning for the expression 'green'
and thus the meaning of the combination 'green idea' is valid. However
even the expression 'green idea' with 'green' in its perceptual sense is
just as meaningful as Meinong's 'square circle'.
The new senses
of 'green' and 'white' acquired by restriction from 'green girl' and 'white rhinoceros' recall some of the mechanisms for attention in LLMs. We define the set of $t$-contexts for $t\in T$ to be $C(t) = \{(t_1,t_2) : t_1,t_2 \in T \ \& \ t_1tt_2 \in T\}$. For $c = (t_1,t_2)
\in C(t)$ we write $c[t]$ for $t_1tt_2$.
Finally we postulate
that, for $t \in T$, there is a map $\kappa : \sigma(t) \rightarrow
C(t)$ satisfying the following property for all $m \in \sigma(t)$: \[
\forall n \in \sigma (\kappa (m)[t])\, \rho_{\kappa(m)[t] t}(n) = m \]
This means that given an expression $t$ and a meaning $m$ of it there is a context which, even if itself ambiguous, determines the meaning of $t$ uniquely as $m$. We can think of contexts for 'white rhinoceros' involving 'stuffed animal' and
involving 'species' which determine distinct meanings.
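Here is a toy illustration of $c[t]$ and of $\kappa$ sending each meaning to a context that forces it; the texts, meanings, and the cue-word stand-in for $\rho$ are all invented:

```python
# Hypothetical illustration: c[t] = t1 t t2, and kappa assigns to each meaning of
# t = 'white rhinoceros' a context whose insertion pins that meaning down.

t = "white rhinoceros"
meanings_of_t = {"SPECIES", "STUFFED_EXHIBIT"}

kappa = {   # meaning -> a context (t1, t2) forcing that meaning
    "SPECIES":         ("the northern ", " is a critically endangered species"),
    "STUFFED_EXHIBIT": ("the museum's stuffed ", " was restored last year"),
}

cues = {"SPECIES": {"species", "endangered"},        # toy cue words standing in for rho
        "STUFFED_EXHIBIT": {"stuffed", "museum's"}}

def c_apply(ctx, t):                 # c[t] = t1 t t2
    t1, t2 = ctx
    return t1 + t + t2

def restrict(big_text, meanings):
    """Toy stand-in for the restriction: meanings of t compatible with big_text's cue words."""
    words = set(big_text.split())
    return {m for m in meanings if cues[m] & words}

for m, ctx in kappa.items():
    # the postulate: every reading of kappa(m)[t] restricts to m on t
    assert restrict(c_apply(ctx, t), meanings_of_t) == {m}
```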
The
above theory could be carried out using the set $W^\bullet$ of terms generated freely from $W$ by a non-associative binary operation and by postulating
that each $t \in T$ has a unique syntactic decomposition tree in
$W^\bullet$ - but this is a little too restrictive perhaps. We can
rewrite the above theory in probabilistic terms, especially $\kappa$. We can replace contexts with co-occurrences and turn $\kappa$ into a map which, from the data of certain $t_i$'s co-occurring with $t$ (at certain distances), assigns a probability distribution over the possible meanings of $t$.
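A rough sketch of such a probabilistic $\kappa$, with invented association counts standing in for corpus statistics:

```python
from collections import Counter

# A rough probabilistic variant of kappa (all counts and associations invented): given
# the words co-occurring with t in a window, assign a distribution over the meanings of t.

assoc = {   # hypothetical association counts: meaning -> co-occurring word -> count
    "RIVERBANK":      Counter({"river": 40, "shore": 25, "water": 30}),
    "FINANCIAL_BANK": Counter({"money": 50, "loan": 35, "account": 45}),
}

def kappa_prob(cooccurring_words):
    """Distribution over meanings of t, proportional to summed counts (+1 smoothing)."""
    scores = {m: 1 + sum(c[w] for w in cooccurring_words) for m, c in assoc.items()}
    total = sum(scores.values())
    return {m: s / total for m, s in scores.items()}

print(kappa_prob(["river", "water"]))   # mass concentrates on RIVERBANK
print(kappa_prob(["loan"]))             # mass concentrates on FINANCIAL_BANK
```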
But let us now consider another approach. We
suppose that there is a 'metric' defined on $M$, $d: M \times M
\rightarrow \mathbb{R}_0^+$. To simplify things we can also consider a
finite disjoint decomposition of $M$ into categories $M = \bigcup_{c\in
C} M_c$ (for example something based on the 1000 categories of Roget's
thesaurus, but these of course only work for $W$ or small texts) and
consider the distance only on $C$ and extend it in a trivial way to $M$.
An important question: what is the distance of $m \in \sigma(ts)$ to
$\rho_{(ts)t}(m) \in \sigma(t)$ ? Note that each $t \in T$ determines a
subset $\sigma(t) \subset M$. Suppose we have a context $c \in C(t)$ and
consider $\sigma(c) \subset M$. How do we define the distance between an $x \in M$ and a subset $X \subset M$ ? One possibility is $d(x,X) = \min \{d(x,z) : z \in X \}$ but this is not what we want. Rather we need
an average over all the distances $d(x,z)$ for $z \in X$. For finite $M$
this is easy. Otherwise we need to introduce a measure on $M$. Now we
can consider a map \[\kappa: C(t) \rightarrow \sigma(t) \] which
associates to each $c \in C(t)$ the element $m \in \sigma(t)$ such that
$d(m, \sigma(c))$ is smallest. Or from a practical perspective we can
use the distance induced by the finite decomposition into categories.
This works best (for the case of Roget's categories) if instead of using
$\sigma(c)$ we use the subset of $M$ determined by all of the
$\sigma(w)$ for $w$ occurring in $c$. The problem with this approach is
that the minimum may not be unique. But for a text $t$ we can decompose
it into a context $c[s]$ for different $s \subset t$. Our algorithm
would first seek the right $s$ so that $\kappa$ determines a unique
minimum. This will allow us to refine the images in $M$ for the other contexts (strictly speaking these are no longer the unions of the $\sigma(w)$ but refined subsets). It is to be hoped that this process could be continued until the entire $t$ is disambiguated. Is this not
what the human mind does ? It may be worth considering that $M$ has a
more refined description in terms of subsets of a certain 'semantic
space' $\Sigma$ in such a way that the $\sigma$ of elements of $W$ are
like 'points', their concatenations like 'lines' and so forth. And we
can define the metric on $\Sigma$ rather than directly on $M$. Note
that we must not confuse the mathematical with the philosophical aspect.
If we postulate that meaning can be formally analyzed and studied via context, this is not meant to imply that meaning is context, any more than
studying a recursive set as a set means that we are identifying the
associated algorithm with the set itself.
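Returning to the distance-based $\kappa$ above, a minimal sketch under invented data (a trivial category metric on $M$ and the average point-to-set distance) could be:

```python
# Minimal sketch: meanings carry a category, the metric on M is the trivial extension of
# a distance on categories, and the chosen meaning of t is the one with smallest average
# distance to the meanings contributed by the context. All names and distances invented.

cat_dist = {("MOTION", "MOTION"): 0, ("MOTION", "FEELING"): 3,
            ("FEELING", "MOTION"): 3, ("FEELING", "FEELING"): 0}

category = {                       # hypothetical meaning -> Roget-style category
    "SPIN": "MOTION", "EMOTION_STIR": "FEELING",
    "WHEEL": "MOTION", "JOY": "FEELING",
}

def d(m1, m2):                     # metric on M induced by the category distance
    return cat_dist[(category[m1], category[m2])]

def d_point_set(m, X):             # average distance from m to a finite set X
    return sum(d(m, x) for x in X) / len(X)

def kappa_metric(candidates, context_meanings):
    """Pick the candidate meaning closest on average to the meanings seen in the context."""
    return min(candidates, key=lambda m: d_point_set(m, context_meanings))

# 'turn' is ambiguous between SPIN and EMOTION_STIR; the context resolves it either way.
print(kappa_metric({"SPIN", "EMOTION_STIR"}, {"WHEEL"}))   # SPIN
print(kappa_metric({"SPIN", "EMOTION_STIR"}, {"JOY"}))     # EMOTION_STIR
```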
Given a $T \subset
W^\star$ (for simplicity we assume it is finite) as in section 1, let us
consider a map $c: T \rightarrow P(T)$ which associates to each $t \in
T$ the set of all $s \in T$ such that $t$ occurs in $s$ (contexts).
From a practical perspective it is better to restrict $c$ to contexts
having a certain limit of length, which we fix to be $n$ (and denote by
$T_n$) and denote the resulting map by $c_n$. Of course $c_n$ then becomes empty for $t$ of length $n$. We can define a notion of
distance $d(t_1,t_2)$ between $t_1,t_2 \in T$. This is given by
\[ d(t_1,t_2) = \frac{\sum_{t \in T_n \,:\, t_1,t_2 \in t} \mathrm{dist}(t_1,t_2,t)}{|T_n|}\] where $\mathrm{dist}(t_1,t_2,t)$ is the ordinary (positional) distance between $t_1$ and $t_2$ in $t$. There
are various ways to define distances between sets; in our case we take $d(A,B)$ to be the average distance. The idea is to decompose
$c_n(t)$ into subsets which are far apart, reflecting distinct semantic
categories. If for a given radius the elements of $T_n$ in that radius
are random and dispersed then this is not possible. Ideally we wish to
find 'clusters'. We can view the decomposition into clusters as writing a
vector representing a general element as a linear combination of vectors representing the clusters or categories.
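A rough sketch of this co-occurrence distance on an invented toy corpus $T_n$:

```python
# Rough sketch of the co-occurrence distance d(t1, t2) on an invented toy corpus T_n of
# short texts; as in the formula above, the positional distances are summed over the
# texts containing both words and divided by |T_n|.

T_n = [
    "the river bank was muddy",
    "they walked along the river bank",
    "the bank raised its interest rate",
    "she opened an account at the bank",
]

def dist_in(t1, t2, text):
    """Positional distance of t1 and t2 inside text, or None if one of them is absent."""
    words = text.split()
    if t1 not in words or t2 not in words:
        return None
    return abs(words.index(t1) - words.index(t2))

def d(t1, t2, corpus):
    ds = [x for x in (dist_in(t1, t2, t) for t in corpus) if x is not None]
    return sum(ds) / len(corpus) if ds else float("inf")

# the contexts of 'bank' split into two far-apart groups, suggesting two semantic clusters
print(d("river", "muddy", T_n))      # small: the words co-occur closely
print(d("river", "interest", T_n))   # inf: the words never co-occur
```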
Note that if $t_1t_2$
is known to be part of a defined cluster then it is plausible that this cluster also determines one for $t_1$ and $t_2$, thus vindicating the existence of the map $\rho$. Let us look more closely at $T$. It can
mean all possible texts in a given language. But if we consider a
person $p$ then we can take $T_p$ to mean the subset generated by the
meaningful subsegments of the text $t_p$ consisting of all linguistic
material either thought in inner verbal discourse, heard or spoken from
the moment of birth to the moment of death. Or we can consider in some
sense all the possible such maximal onto-texts. We can also define the
same kind of text for communities and their history. $T$ is then like a
book whose pages correspond to a possible total text of linguistic
material in a certain possible history of a community (cosmo-texts). The
set of possible books of the world encompassing the total
linguistic-consciousness of all people throughout history. A text is
mainly about certain topics. A biography is about a person. Can we give a
purely textual definition of aboutness ?
Since a cosmo-text is
finite there will be meaningful texts which will escape it. Is there an
English text of 1000 words which will never be read by mankind and yet
is highly meaningful (or would be meaningful in its possible
encompassing cosmo-text) ? Can we give a purely textual combinatorial
definition of 'highly meaningful' ? We give an abstract answer. Consider
a set $\Sigma$ of symbols and suppose we have a metric $d$ on
$\Sigma^\star$. Let $S \subset \Sigma^\star $. Let $S(w)$ denote all
elements in $S$ with initial segment $w$. Then a word $w \in
\Sigma^\star$ is more impactful (meaningful) than $w' \in \Sigma^\star$ for
$s$ in $S$ if the rough 'distance' between $S(sw)$ and $S(sw')$ is
significant (again we need a good notion of distance between sets). This
would correspond to an 'influential' or 'seminal' work in the global
cosmo-text.
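An illustrative sketch of this comparison, with an invented set $S$ and an invented string metric (the precise metric is exactly what remains to be chosen):

```python
# Illustrative sketch: S(w) is the set of strings in S with prefix w, and w is more
# impactful than w' after s when S(sw) and S(sw') are far apart. The per-position
# mismatch metric and the averaging between sets below are invented placeholders.

S = {"abxx", "abxy", "abyy", "abyz", "acaa"}

def S_prefix(S, w):
    return {s for s in S if s.startswith(w)}

def d_str(u, v):                     # crude mismatch count on padded strings
    n = max(len(u), len(v))
    return sum(a != b for a, b in zip(u.ljust(n), v.ljust(n)))

def d_sets(A, B):                    # average distance between two finite sets of strings
    if not A or not B:
        return float("inf")
    return sum(d_str(a, b) for a in A for b in B) / (len(A) * len(B))

def impact(S, s, w, w_prime):
    """How far the continuations after s+w lie from those after s+w_prime."""
    return d_sets(S_prefix(S, s + w), S_prefix(S, s + w_prime))

print(impact(S, "a", "b", "c"))      # appending 'c' after 'a' leads somewhere quite different
```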
Large Language Models such as GPT-3 use high-dimensional ($\dim V = 12{,}288$) vector-space representations of meanings
of certain textual units ('tokens'). These are generated from context in
large data sets. The idea of having certain semantic 'atoms' (sememes)
from which possible meanings are combinatorially constructed can be
found for instance in Greimas (cf. Osgood's semantic differential for
studying the variation of connotation across different cultures). Some
(such as René Thom) have claimed that the idea that meaning should have a
continuous, geometric aspect is found in Aristotle. Leibniz'
characteristica used 'primitive terms' but it is not clear if they are
combined in a simple algebraic, combinatorial or mereological way, or if complex logical expressions must be involved (or associated semantic
networks). But in embedding matrices we have what would seem to be a
quantification of meaning: each 'sememe' is given a 'weight' which
determines its geometric relation to other meaning-vectors in a crucial
way (the weights cannot be dismissed as probabilistic or 'fuzzy'
aspects). To us this would correspond to the 'more-or-less' aspect of
species in Aristotle. A very interesting aspect of embedding matrices is
how they capture analogy through simple vector operations. This
suggests another possible formalization of Aristotelian 'difference',
the same difference operating on two different genera. We get a notion
of semantic distance and semantic relatedness. This also vindicates
Thom's perception of geometry and dynamics in the spaces of genera. Some
questions to ask: are these token-meaning-vectors linearly independent ?
If not can we work with a chosen basis ? If the token is ambiguous is
the corresponding vector a kind of superposition of possible meanings,
as in quantum theory ? How are we to understand the idea of the meaning
of complex expressions being linear combinations of the meaning
representations of the tokens occurring in the expression ? It would of
course be interesting to analyze these questions relative to the other
fundamental components of LLMs (attention in transformers, multi-layer
perceptrons) - even if these are more practically oriented rather than
reflecting actual linguistic and cognitive reality. Suppose we are
given a large text $T$ generated by a set of words $W$ and a context
window $S$ of size $n$. Suppose we wished to represent the elements of
$W$ as vectors of some vector space $V$ in such a way that, given $v,w \in W$, the modulus of the inner product $|\langle v,w\rangle|$ gives the probability of the two words being co-occurrent in contexts $S$. Consider
the situation: it is very rare for words $s_1$ and $s_2$ to co-occur but
words $s_1$ and $s_3$ co-occur sometimes as do $s_2$ and $s_3$. But
there is also a word $s_4$ which never co-occurs with $s_3$ but has the
same co-occurrence frequencies with $s_1$ and $s_2$ as does $s_3$. Then
it is easy to see that there is no way to represent
$s_1$,$s_2$,$s_3$,$s_4$ in the same plane in such a way that these
properties are expressed by the inner product. Thus the dimension must
go up by one. We can define the geometric $n$-co-occurrence
dimension as the minimal dimension of a vector space adequate to
represent co-occurrence frequencies by an inner product. We can ask what
happens as $n$ increases: does the geometric dimension also increase
(and in what manner) or does it stabilize after a certain value ?
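A small sketch of the dimension argument with invented frequencies $p \neq q$: the required pattern for $s_1,\dots,s_4$ is realizable by an explicit configuration in dimension 3, while (as argued above) no planar configuration works:

```python
import numpy as np

# Sketch of the dimension argument with invented frequencies p != q: in three dimensions
# the required co-occurrence pattern for s1, s2, s3, s4 is realized by an explicit
# configuration, whereas no configuration in the plane satisfies all the constraints.

p, q = 0.6, 0.3                       # hypothetical co-occurrence probabilities
c = np.sqrt(p**2 + q**2)              # chosen so that s3 and s4 become orthogonal

s1 = np.array([1.0, 0.0, 0.0])
s2 = np.array([0.0, 1.0, 0.0])
s3 = np.array([p, q,  c])
s4 = np.array([p, q, -c])

targets = {("s1", "s2"): 0.0, ("s1", "s3"): p, ("s2", "s3"): q,
           ("s1", "s4"): p, ("s2", "s4"): q, ("s3", "s4"): 0.0}
vecs = {"s1": s1, "s2": s2, "s3": s3, "s4": s4}

for (a, b), target in targets.items():
    assert np.isclose(abs(vecs[a] @ vecs[b]), target)   # all constraints hold in dimension 3
print("3-dimensional representation found; the plane is not enough for p != q")
```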