Understanding Beam Search

This post is mainly for me to understand how Beam Search works in detail and why we should use it. These are my notes from deeplearning.ai's Deep Learning course on Coursera.

What is Beam Search?

Beam Search is basically a pruned breadth-first search. It is often used in the decoding step of sequence-to-sequence models (refer to somewhere).

We want to find the most probable translation $\hat{y} = (\hat{y}^{<1>}, \ldots, \hat{y}^{<T_y>})$ of a given input sentence $x$, i.e. the sequence that maximizes $P(\hat{y} \mid x)$.
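Written out with the chain rule, the sequence model scores a candidate translation as a product of per-word conditional probabilities, one factor per decoding step:

$$P(\hat{y} \mid x) = \prod_{t=1}^{T_y} P(\hat{y}^{<t>} \mid x, \hat{y}^{<1>}, \ldots, \hat{y}^{<t-1>})$$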

BS has a parameter $B$, the beam width. It determines how many candidate partial translations are kept at each decoding step. For starters, we have an encoded input sentence $x$. We now want to find a good candidate word $\hat{y}^{<1>}$ with which to start the translated sentence. In Greedy Search, we would use the single most likely candidate, i.e. $\arg\max_{y^{<1>}} P(y^{<1>} \mid x)$. In Beam Search, we instead keep the $B$ most likely candidates $y_1^{<1>}$ to $y_B^{<1>}$, i.e. those with the $B$ highest probabilities $P(y_k^{<1>} \mid x)$.
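As a quick illustration of the difference (not part of the course material), here is a minimal NumPy sketch of this first step; the vocabulary and the probability vector are made up and merely stand in for the decoder's softmax output over the first word:

```python
import numpy as np

B = 3
vocab = ["a", "in", "jane", "september", "is", "visiting", "going"]
# Made-up values standing in for P(y^<1> | x) from the decoder's softmax.
p_first = np.array([0.05, 0.08, 0.40, 0.04, 0.18, 0.10, 0.15])

greedy_pick = vocab[int(np.argmax(p_first))]                     # single best word
beam_picks = [vocab[i] for i in np.argsort(p_first)[::-1][:B]]   # B best words

print(greedy_pick)   # 'jane'
print(beam_picks)    # ['jane', 'is', 'going']
```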

In the second step, i.e. when trying to find $y^{<2>}$, we feed each of the $B$ kept candidates $y_k^{<1>}$ into the decoder and consider, for every possible second word, the probabilities $P(y_k^{<1>}, y^{<2>} \mid x) = P(y_k^{<1>} \mid x) \, P(y^{<2>} \mid x, y_k^{<1>})$.

We then again take the $B$ most likely combinations, i.e. the $B$ pairs $(y_k^{<1>}, y_m^{<2>})$ with the highest probabilities $P(y_k^{<1>}, y_m^{<2>} \mid x)$. So after every pruning step we are left with only $B$ partial translations, and the same expand-and-prune step is repeated for $y^{<3>}$ and so on, until the sentence ends.
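Putting the steps together, here is a minimal, self-contained sketch of the expand-and-prune loop. The toy vocabulary and the `cond_probs` function are made up (here just random numbers); in a real system `cond_probs` would be the decoder conditioned on the encoded sentence $x$ and the prefix chosen so far:

```python
import numpy as np

VOCAB = ["<eos>", "jane", "is", "visiting", "going", "africa", "in", "september"]
rng = np.random.default_rng(0)

def cond_probs(prefix):
    # Hypothetical stand-in for the decoder: returns P(next word | x, prefix).
    # A real model would compute this from the encoded source sentence x.
    logits = rng.standard_normal(len(VOCAB))
    e = np.exp(logits - logits.max())
    return e / e.sum()

def beam_search(B=3, max_len=8):
    # Each hypothesis is (probability of the prefix, prefix as a list of words).
    beams = [(1.0, [])]
    for _ in range(max_len):
        candidates = []
        for prob, prefix in beams:
            if prefix and prefix[-1] == "<eos>":
                candidates.append((prob, prefix))      # finished, keep as-is
                continue
            for word, p in zip(VOCAB, cond_probs(prefix)):
                # P(prefix, word | x) = P(prefix | x) * P(word | x, prefix)
                candidates.append((prob * p, prefix + [word]))
        # Prune: keep only the B most probable partial translations.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:B]
    return beams

for prob, words in beam_search():
    print(f"{prob:.6f}  {' '.join(words)}")
```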

Why is it "better" than greedy search?

Because Greedy Search (GS) always commits to the single most probable word as the next word.

For example, given the prefix "Jane is", GS would choose "going" instead of "visiting", since "going" is a more frequent word in English and thus has a higher conditional probability, even though "Jane is visiting Africa in September" would be the better overall translation.

BS instead keeps several candidate sentences in parallel, so a locally less probable word like "visiting" can still win if it leads to a more probable sentence overall.

Refinements

Avoid float underflow

Instead of computing the product of many small probabilities, which quickly underflows, we can compute the sum of the log probabilities. Since log is strictly monotonically increasing, this results in the same argmax while being numerically stable.
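Written as a formula, the objective Beam Search maximizes becomes:

$$\arg\max_{y} \prod_{t=1}^{T_y} P(y^{<t>} \mid x, y^{<1>}, \ldots, y^{<t-1>}) = \arg\max_{y} \sum_{t=1}^{T_y} \log P(y^{<t>} \mid x, y^{<1>}, \ldots, y^{<t-1>})$$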

Length normalization

Beam Search in general prefers rather short sentences. The reason is that every additional word multiplies the product by another probability smaller than 1 (equivalently, adds another negative log term), so longer translations automatically end up with smaller scores.

To counteract this, we normalize the objective (the sum of log probabilities) by multiplying it with the factor $\frac{1}{T_y^\alpha}$, where $T_y$ is the length of the translation. $\alpha \approx 0.7$ is a heuristic choice somewhere between no normalization ($\alpha = 0$) and full averaging ($\alpha = 1$); the course gives no deeper justification.
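The length-normalized objective then reads:

$$\frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log P(y^{<t>} \mid x, y^{<1>}, \ldots, y^{<t-1>})$$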

Influence of Beam Width

Increasing the beam width $B$ lets the search consider more candidate translations, which generally yields better (more probable) results, but also costs more memory and computation time. Conversely, a small $B$ is fast and cheap but more likely to miss the best translation, and the gains from increasing $B$ diminish at some point.

Disadvantages of BS in contrast to BFS or DFS

BS is not guaranteed to find the most probable sentence, since it prunes away partial hypotheses with lower probabilities. This is in contrast to exact strategies such as BFS or DFS, which explore the whole search space and would find the true maximum, but would use up far more memory and run much slower, because the number of possible sentences grows exponentially with their length.

Error analysis

$y^*$ is the reference translation from a human, $\hat{y}$ is the output of the algorithm. We let the RNN compute $P(y^* \mid x)$ and $P(\hat{y} \mid x)$ and compare the two values. There are two cases:

$P(y^* \mid x) > P(\hat{y} \mid x)$

Beam Search chose $\hat{y}$, although $y^*$ would have achieved a higher probability. This means Beam Search is at fault, since it pruned away a good path. Here we could increase the beam width $B$ to maybe find that path.

$P(y^* \mid x) < P(\hat{y} \mid x)$

According to the RNN, the actually correct translation $y^*$ achieves a lower probability than the solution generated by Beam Search. In this case the RNN itself is at fault, so increasing the beam width will not help; we have to improve the model instead.
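This comparison can be turned into simple bookkeeping over the mistranslated dev-set examples. The sketch below assumes a hypothetical helper `model_logprob(x, y)` that returns $\log P(y \mid x)$ under the RNN:

```python
def attribute_errors(examples, model_logprob):
    """examples: list of (x, y_star, y_hat) tuples for mistranslated dev examples."""
    beam_search_at_fault = 0
    rnn_at_fault = 0
    for x, y_star, y_hat in examples:
        if model_logprob(x, y_star) > model_logprob(x, y_hat):
            # The model prefers the reference, but Beam Search did not find it.
            beam_search_at_fault += 1
        else:
            # The model itself scores the wrong output higher.
            rnn_at_fault += 1
    n = len(examples)
    return beam_search_at_fault / n, rnn_at_fault / n
```

The resulting fractions tell us whether it is more worthwhile to spend time on the search (e.g. increase $B$) or on the model itself.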