Around 2017 I was thinking about the problem of automatic translation and how ambiguity and idioms could be dealt with. One idea was to tag every word $w$ with one or more of Roget's 1000 categories (represented by a set $C$). This gives a map $\kappa: W \rightarrow P(C)$, where $W$ is the set of words and $P(C)$ is the power set of $C$. Roughly speaking, an ambiguous word $w$ will have $\kappa(w)$ with cardinality at least 2. Given a context $T$ and a word $w$ occurring in $T$, my idea was to devise an algorithm that works a little like solving a Sudoku puzzle, using a concept of 'semantic distance'. We look for an ambiguous word $w$ for which the words $v$ in the context with singleton $\kappa(v)$ allow us to determine which $c \in \kappa(w)$ is 'closest' to the set of categories belonging to those singletons. We then make that choice, which should in turn allow further words to be resolved, and so forth. Of course the problem is how to define such a semantic distance, as well as how to guarantee that the process achieves its goal and does not get stuck (although we could introduce random choices when it does). If we view Roget's 1000 categories as organized as leaves (or even nodes) of a binary tree, then there is an obvious definition: the distance between two categories within the tree. For instance, 'rotation' is semantically closer to 'motion' than it is to 'feeling'.
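To make the idea concrete, here is a minimal sketch in Python, not a worked-out implementation. Everything in it is an assumption made for illustration: the category names and their positions in the tree are toy stand-ins rather than Roget's actual classification, each category is encoded as a bit string giving its path from the root of a binary tree, the distance between two categories is the number of edges between them in that tree, and the distance from a category to a set of categories is taken to be the minimum over the set. The names `tree_distance`, `resolve`, `paths` and `kappa` are hypothetical.

```python
import random


def tree_distance(a: str, b: str) -> int:
    """Number of edges between two tree nodes given as root-to-node bit strings."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Walk up from a to the lowest common ancestor, then down to b.
    return len(a) + len(b) - 2 * common


def resolve(context, kappa, paths, rng=None):
    """Greedy, Sudoku-like resolution: settle the most certain word first."""
    rng = rng or random.Random(0)
    # Words with a singleton kappa are already resolved.
    chosen = {w: next(iter(kappa[w])) for w in context if len(kappa[w]) == 1}
    pending = {w for w in context if len(kappa[w]) > 1}
    while pending:
        anchors = set(chosen.values())
        if not anchors:
            # Stuck with nothing resolved yet: fall back to a random choice.
            w = rng.choice(sorted(pending))
            c = rng.choice(sorted(kappa[w]))
        else:
            # Pick the (word, category) pair closest to the resolved categories.
            _, w, c = min(
                (min(tree_distance(paths[c], paths[a]) for a in anchors), w, c)
                for w in pending
                for c in kappa[w]
            )
        chosen[w] = c
        pending.remove(w)
    return chosen


# Toy data (assumed, not taken from Roget): category -> path in a binary tree.
paths = {"money": "000", "credit": "001", "river": "100", "land": "101", "feeling": "11"}
# Toy kappa: word -> set of candidate categories.
kappa = {"bank": {"credit", "land"}, "loan": {"credit"}, "interest": {"money", "feeling"}}

result = resolve(["bank", "loan", "interest"], kappa, paths)
# result == {"loan": "credit", "bank": "credit", "interest": "money"}
```

In this toy run 'loan' is the only unambiguous word, so it fixes 'bank' first (distance 0 to 'credit'), and 'interest' is then resolved to 'money' (distance 2) rather than 'feeling' (distance 5). Because every pass resolves exactly one word, the loop always terminates, but nothing here guarantees that the choices are correct.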