Let $\mathcal{W} = w_1, w_2, \ldots, w_N$ be the training text, a list of $N$ words, and let $\mathbb{W}$ be the set of all distinct words in $\mathcal{W}$. From equation 14.1 it follows that:
\begin{equation}
P(\mathcal{W}) = \prod_{i=1}^{N} P(w_i \mid w_{i-1}) \tag{14.9}
\end{equation}
In general, evaluating equation 14.9 directly will lead to problematically small values, so logarithms are used:
\begin{equation}
\log P(\mathcal{W}) = \sum_{i=1}^{N} \log P(w_i \mid w_{i-1}) \tag{14.10}
\end{equation}
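To make the numerical issue concrete, here is a small Python illustration (not from the text) of why the raw product of probabilities underflows while the sum of logarithms remains usable:
\begin{verbatim}
import math

# 100 word probabilities of 1e-5 each: the true product is 1e-500,
# far below the smallest representable IEEE-754 double.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p
print(product)                              # 0.0 -- underflowed

log_prob = sum(math.log(p) for p in probs)
print(log_prob)                             # -1151.29... -- usable
\end{verbatim}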
Given the definition of a class $n$-gram model in equation 14.8, the maximum likelihood bigram probability estimate of a word is:
\begin{equation}
P(w_i \mid w_{i-1}) = \frac{C(w_i)}{C(G(w_i))} \times \frac{C(G(w_{i-1})\,G(w_i))}{C(G(w_{i-1}))} \tag{14.11}
\end{equation}
where $G$ is the class map, $C(w)$ is the number of times the word $w$ occurs in the training text, $C(g)$ is the number of times the class $g$ occurs, and $C(g\,h)$ is the number of times the class bigram $g\,h$ occurs.
Substituting equation 14.11 into equation 14.10 and then rearranging gives:
\begin{align}
\log P(\mathcal{W}) &= \sum_{i=1}^{N} \left( \log \frac{C(w_i)}{C(G(w_i))} + \log \frac{C(G(w_{i-1})\,G(w_i))}{C(G(w_{i-1}))} \right) \nonumber \\
&= \sum_{w \in \mathbb{W}} C(w) \log C(w) - \sum_{w \in \mathbb{W}} C(w) \log C(G(w)) \nonumber \\
&\qquad + \sum_{g,h} C(g\,h) \log C(g\,h) - \sum_{g} C(g) \log C(g) \nonumber \\
&= \sum_{w \in \mathbb{W}} C(w) \log C(w) + \sum_{g,h} C(g\,h) \log C(g\,h) - 2\sum_{g} C(g) \log C(g) \tag{14.12}
\end{align}
where the last step uses $\sum_{w \in \mathbb{W}} C(w) \log C(G(w)) = \sum_{g} C(g) \log C(g)$.
Note that the first of these three terms in the final stage of equation 14.12, ``$\sum_{w \in \mathbb{W}} C(w) \log C(w)$'', is independent of the class map function $G$, therefore it is not necessary to consider it when optimising $G$. The value a class map must seek to maximise, $F_{MC}(G)$, can now be defined:
\begin{equation}
F_{MC}(G) = \sum_{g,h} C(g\,h) \log C(g\,h) - 2\sum_{g} C(g) \log C(g) \tag{14.13}
\end{equation}
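As an illustration, the following is a minimal Python sketch of this objective; the function and argument names are assumptions of the sketch, not from the text:
\begin{verbatim}
import math
from collections import Counter

def f_mc(bigram_counts, class_of):
    """Evaluate F_MC(G) from equation 14.13.

    bigram_counts: dict mapping word pairs (v, w) to C(v w)
    class_of:      dict mapping each word to its class (the map G)
    """
    cls_bigram = Counter()   # C(g h): class bigram counts
    cls_unigram = Counter()  # C(g):   class unigram counts
    for (v, w), c in bigram_counts.items():
        cls_bigram[(class_of[v], class_of[w])] += c
        cls_unigram[class_of[w]] += c   # each word token counted once
    term1 = sum(c * math.log(c) for c in cls_bigram.values())
    term2 = sum(c * math.log(c) for c in cls_unigram.values())
    return term1 - 2.0 * term2
\end{verbatim}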
A fixed number of classes must be decided before running the algorithm, which can now be formally defined:
The initialisation scheme given here in step 1 represents a word unigram language model, making no assumptions about which words should belong in which class.14.9 The algorithm is greedy and so can get stuck in a local maximum and is therefore not guaranteed to find the optimal class map for the training text. The algorithm is rarely run until total convergence, however, and it is found in practice that an extra iteration can compensate for even a deliberately poor choice of initialisation.
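For concreteness, a greedy exchange step of this kind might be sketched as follows, reusing the hypothetical \texttt{f\_mc} above. The round-robin initialisation and the brute-force re-evaluation of $F_{MC}$ for every candidate move are simplifying assumptions of this sketch; an efficient implementation would instead update the class counts incrementally when a word moves:
\begin{verbatim}
def exchange_cluster(bigram_counts, words, n_classes, iterations=2):
    """Greedy exchange clustering sketch: repeatedly move each word
    to whichever class maximises F_MC.  Requires f_mc() from the
    previous sketch."""
    # Spread words across the fixed number of classes round-robin.
    class_of = {w: i % n_classes for i, w in enumerate(words)}
    for _ in range(iterations):
        for w in words:
            best_class, best_score = class_of[w], None
            for g in range(n_classes):
                class_of[w] = g            # tentatively move w to g
                score = f_mc(bigram_counts, class_of)
                if best_score is None or score > best_score:
                    best_class, best_score = g, score
            class_of[w] = best_class       # keep the best move
    return class_of
\end{verbatim}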
The above algorithm requires the number of classes to be fixed before running. It should be noted that as the number of classes increases, the overall likelihood of the training text will tend towards that of the word model. This is why the algorithm does not itself modify the number of classes: otherwise it would naïvely converge on $|\mathbb{W}|$ classes, one per word.
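To see why, note that if $G$ is the identity map, placing each word in its own class, then the class bigram estimate in equation 14.11 reduces to the ordinary word bigram estimate:
\begin{equation*}
P(w_i \mid w_{i-1}) = \frac{C(w_i)}{C(w_i)} \times \frac{C(w_{i-1}\,w_i)}{C(w_{i-1})} = \frac{C(w_{i-1}\,w_i)}{C(w_{i-1})}
\end{equation*}
so with $|\mathbb{W}|$ classes the class model assigns the training text the same likelihood as the word model.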