Word exchange algorithm

[Kneser and Ney 1993] describes an algorithm to build a class map by starting from some initial guess at a solution and then iteratively searching for changes which improve the existing class map. This is repeated until some minimum change threshold has been reached or a chosen number of iterations has been performed. The initial class map is typically chosen by a simple method, such as randomly distributing words amongst classes, or placing all words in the first class except for the most frequent words, which are put into singleton classes. Potential moves are then evaluated, and those which most increase the likelihood of the training text are applied to the class map. The algorithm is described in detail below, and is implemented in the HTK tool CLUSTER.

Let $\mathcal{W}$ be the training text, viewed as an ordered list of words $(w_1, w_2, w_3,
\ldots)$, and let $\mathbb{W}$ be the set of all words occurring in $\mathcal{W}$. From equation 14.1 it follows that:

$\displaystyle P_\mathrm{class}(\mathcal{W}) \;=\; \prod_{x, y \in \mathbb{W}} P_\mathrm{class}(x \;\vert\; y)^{C(x,y)}$ (14.9)

where $ (x, y)$ is some word pair `$ x$' preceded by `$ y$' and $ C(x, y)$ is the number of times that the word pair `$ y$ $ x$' occurs in the list $ \mathcal{W}$.

In general, evaluating equation 14.9 directly will lead to problematically small values, so logarithms are used:

$\displaystyle \log P_\mathrm{class}(\mathcal{W}) \;=\; \sum_{x, y \in \mathbb{W}} C(x, y) . \log P_\mathrm{class}(x \;\vert\; y)$ (14.10)
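As a concrete illustration, the pair counts $C(x,y)$ and the sum in equation 14.10 can be computed directly from the word list. The following Python sketch is purely illustrative and is not part of HTK; the `prob` argument stands for any function that returns $P_\mathrm{class}(x \;\vert\; y)$.

```python
from collections import Counter
from math import log

def log_prob_class(words, prob):
    """Equation 14.10: sum of C(x, y) * log P_class(x | y) over observed pairs.

    `words` is the training text as a list of words (the list W);
    `prob(x, y)` is assumed to return P_class(x | y) for the word pair `y x`.
    """
    # C(x, y): number of times word x is preceded by word y in the list.
    pair_counts = Counter(zip(words[1:], words[:-1]))
    return sum(c * log(prob(x, y)) for (x, y), c in pair_counts.items())
```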

Given the definition of a class $ n$-gram model in equation 14.8, the maximum likelihood bigram probability estimate of a word is:

$\displaystyle P_\mathrm{class}(w_i \;\vert\; w_{i-1}) \;=\; \frac{C(w_i)}{C(G(w_i))} \times \frac{C\left(G(w_i), G(w_{i-1})\right)}{C(G(w_{i-1}))}$ (14.11)

where $C(w)$ is the number of times that the word `$w$' occurs in the list $\mathcal{W}$ and $C(G(w))$ is the number of times that the class $G(w)$ occurs in the list resulting from applying $G(.)$ to each entry of $\mathcal{W}$; similarly, $C(G(w_x), G(w_y))$ is the count of the class pair `$G(w_y)$ $G(w_x)$' in that resultant list.
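To make these definitions concrete, the counts and the estimate of equation 14.11 can be computed as in the sketch below, where the class map $G(.)$ is represented as an ordinary dictionary from word to class; this representation is an illustrative assumption, not HTK's internal one.

```python
from collections import Counter

def class_bigram_prob(words, class_of, w_prev, w):
    """Maximum likelihood estimate of equation 14.11 for P_class(w | w_prev).

    `class_of` plays the role of G(.), mapping each word to its class.
    Assumes both `w` and `w_prev` occur in `words`.
    """
    classes = [class_of[x] for x in words]                  # apply G(.) to the list
    word_count  = Counter(words)                            # C(w)
    class_count = Counter(classes)                          # C(G(w))
    class_pairs = Counter(zip(classes[1:], classes[:-1]))   # C(G(w), G(w_prev))

    g, h = class_of[w], class_of[w_prev]
    return (word_count[w] / class_count[g]) * (class_pairs[(g, h)] / class_count[h])
```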

Substituting equation 14.11 into equation 14.10 and then rearranging gives:

$\displaystyle \log P_\mathrm{class}(\mathcal{W})$ $\displaystyle \;=\;$ $\displaystyle \sum_{x,y \in \mathbb{W}} C(x,y) . \log\left( \frac{C(x)}{C(G(x))} \times \frac{C(G(x),G(y))}{C(G(y))} \right)$

  $\displaystyle \;=\;$ $\displaystyle \sum_{x,y \in \mathbb{W}} C(x,y) . \log\left(\frac{C(x)}{C(G(x))}\right) \;+\; \sum_{x,y \in \mathbb{W}} C(x,y) . \log\left(\frac{C(G(x),G(y))}{C(G(y))}\right)$

  $\displaystyle \;=\;$ $\displaystyle \sum_{x \in \mathbb{W}} C(x) . \log\left(\frac{C(x)}{C(G(x))}\right) \;+\; \sum_{g,h \in \mathbb{G}} C(g,h) . \log\left(\frac{C(g,h)}{C(h)}\right)$

  $\displaystyle \;=\;$ $\displaystyle \sum_{x \in \mathbb{W}} C(x) . \log C(x) \;-\; \sum_{x \in \mathbb{W}} C(x) . \log C(G(x)) \;+\; \sum_{g,h \in \mathbb{G}} C(g,h) . \log C(g,h) \;-\; \sum_{g \in \mathbb{G}} C(g) . \log C(g)$

  $\displaystyle \;=\;$ $\displaystyle \sum_{x \in \mathbb{W}} C(x) . \log C(x) \;+\; \sum_{g,h \in \mathbb{G}} C(g,h) . \log C(g,h) \;-\; 2 \sum_{g \in \mathbb{G}} C(g) . \log C(g)$ (14.12)

where $(g,h)$ is some class sequence `$h$ $g$', and $C(g,h)$ and $C(g)$ are the corresponding class pair and class counts. The third line follows because summing $C(x,y)$ over all predecessor words $y$ gives $C(x)$, and summing $C(x,y)$ over all word pairs whose classes are $(g,h)$ gives $C(g,h)$ (neglecting edge effects at the very start and end of the list); the final line follows because $\sum_{x \in \mathbb{W}} C(x) . \log C(G(x)) = \sum_{g \in \mathbb{G}} C(g) . \log C(g)$ and $\sum_{g,h \in \mathbb{G}} C(g,h) . \log C(h) = \sum_{h \in \mathbb{G}} C(h) . \log C(h)$, which together give the factor of two.
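The rearrangement can be verified numerically on a toy corpus by evaluating equation 14.10 (with the estimates of equation 14.11) and the final line of equation 14.12 and checking that the two agree. The sketch below, which is illustrative only, pads the text with hypothetical start and end markers held in their own singleton classes so that the identity holds exactly rather than only up to boundary terms.

```python
from collections import Counter
from math import log, isclose

def check_rearrangement(words, class_of):
    """Confirm that eq. 14.10 (using eq. 14.11) equals the last line of eq. 14.12."""
    # Hypothetical sentence markers, each placed in a private singleton class,
    # so every real word has both a predecessor and a successor.
    words = ["<s>"] + list(words) + ["</s>"]
    class_of = dict(class_of, **{"<s>": "<s>", "</s>": "</s>"})

    classes = [class_of[w] for w in words]
    cw  = Counter(words)                              # C(x)
    cg  = Counter(classes)                            # C(g)
    cwp = Counter(zip(words[1:], words[:-1]))         # C(x, y)
    cgp = Counter(zip(classes[1:], classes[:-1]))     # C(g, h)

    # Equation 14.10 evaluated directly with the estimates of equation 14.11.
    direct = sum(c * log(cw[x] / cg[class_of[x]]
                         * cgp[(class_of[x], class_of[y])] / cg[class_of[y]])
                 for (x, y), c in cwp.items())

    # The rearranged form on the final line of equation 14.12.
    rearranged = (sum(c * log(c) for c in cw.values())
                  + sum(c * log(c) for c in cgp.values())
                  - 2.0 * sum(c * log(c) for c in cg.values()))

    return isclose(direct, rearranged)

# Example: a toy word list and an arbitrary two-class map.
words = "the cat sat on the mat the cat ate".split()
assert check_rearrangement(words, {"the": 0, "on": 0,
                                   "cat": 1, "sat": 1, "mat": 1, "ate": 1})
```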

Note that the first of the three terms in the final line of equation 14.12, $\sum_{x \in \mathbb{W}} C(x) . \log C(x)$, is independent of the class map function $G(.)$, so it need not be considered when optimising $G(.)$. The quantity which the class map must be chosen to maximise, $F_{\mathrm{M}_\mathrm{C}}$, can now be defined:

$\displaystyle F_{\mathrm{M}_\mathrm{C}} \;=\; \sum_{g,h \in \mathbb{G}} C(g,h) . \log C(g,h) \;-\; 2 \sum_{g \in \mathbb{G}} C(g) . \log C(g)$ (14.13)
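Expressed in code, $F_{\mathrm{M}_\mathrm{C}}$ depends only on the class counts and the class pair counts. The sketch below recomputes both from scratch for clarity; an efficient implementation would instead update them incrementally as words are moved between classes.

```python
from collections import Counter
from math import log

def f_mc(words, class_of):
    """Evaluate equation 14.13 for a class map given as a dict from word to class."""
    classes = [class_of[w] for w in words]
    class_count = Counter(classes)                           # C(g)
    class_pairs = Counter(zip(classes[1:], classes[:-1]))    # C(g, h)
    return (sum(c * log(c) for c in class_pairs.values())
            - 2.0 * sum(c * log(c) for c in class_count.values()))
```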

A fixed number of classes must be chosen before running the algorithm, which can now be formally defined:

1. Initialise: place every word of $\mathbb{W}$ into the first class, leaving the remaining classes empty.
2. Iterate until the increase in $F_{\mathrm{M}_\mathrm{C}}$ falls below some minimum change threshold or a chosen number of iterations has been performed:
    (a) For each word $w$ in $\mathbb{W}$:
        (i) For each class $g$ in $\mathbb{G}$, calculate the change in $F_{\mathrm{M}_\mathrm{C}}$ that would result from moving $w$ from its current class to class $g$.
        (ii) Move $w$ to the class which gives the largest increase in $F_{\mathrm{M}_\mathrm{C}}$, leaving it in its current class if no move gives an increase.

The initialisation scheme given here in step 1 represents a word unigram language model, making no assumptions about which words should belong in which class. The algorithm is greedy, so it can become stuck in a local maximum and is therefore not guaranteed to find the optimal class map for the training text. In practice, however, the algorithm is rarely run until total convergence, and it is found that an extra iteration can compensate for even a deliberately poor choice of initialisation.

The above algorithm requires the number of classes to be fixed before running. Note that as the number of classes increases, the overall likelihood of the training text will tend towards that of the word model. This is why the algorithm does not itself modify the number of classes; otherwise it would naïvely converge on $\vert\mathbb{W}\vert$ classes, one word per class.
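Putting the pieces together, a minimal and deliberately slow sketch of the exchange procedure is given below. It uses the all-words-in-one-class initialisation of step 1 and recomputes $F_{\mathrm{M}_\mathrm{C}}$ in full for every candidate move; it illustrates the algorithm above but is not the implementation used by the HTK CLUSTER tool.

```python
from collections import Counter
from math import log

def exchange(words, num_classes, max_iterations=10):
    """A toy version of the word exchange algorithm described above.

    All words start in class 0 (step 1), and each iteration tries moving every
    word to every class, keeping the move that most increases F_MC.  F_MC is
    recomputed from scratch for each candidate move, which is hopelessly slow
    for real corpora but keeps the sketch close to the equations.
    """
    vocabulary = set(words)
    class_of = {w: 0 for w in vocabulary}

    def f_mc():
        # Equation 14.13 evaluated for the current class map.
        classes = [class_of[w] for w in words]
        pair_counts = Counter(zip(classes[1:], classes[:-1]))
        class_counts = Counter(classes)
        return (sum(c * log(c) for c in pair_counts.values())
                - 2.0 * sum(c * log(c) for c in class_counts.values()))

    for _ in range(max_iterations):
        any_moved = False
        for w in vocabulary:
            original = class_of[w]
            best_class, best_score = original, f_mc()
            for g in range(num_classes):
                if g == original:
                    continue
                class_of[w] = g                 # tentatively move w to class g
                score = f_mc()
                if score > best_score:
                    best_class, best_score = g, score
            class_of[w] = best_class            # keep the best move found
            if best_class != original:
                any_moved = True
        if not any_moved:                       # no word changed class: converged
            break
    return class_of
```

In a practical implementation only the counts involving the two classes concerned and the classes adjacent to occurrences of $w$ change when $w$ is moved, so the difference in $F_{\mathrm{M}_\mathrm{C}}$ for a candidate move can be computed locally rather than from scratch; this is what makes the algorithm usable on realistic vocabularies.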

