This post is mainly for me to understand how Beam Search works in detail and why we should use it. These are my notes from deeplearning.ai's Deep Learning course on Coursera.

Beam Search is basically a pruned breadth-first search. It is often used in sequence-to-sequence models during the decoding step (refer to somewhere).

We want to find the most probable translation $y = (\hat{y}^{<1>}, \ldots, \hat{y}^{<T_y>})$ of a given input sentence $x$.

BS has a parameter $B$, the *beam width*. It determines how many of the most likely partial sequences are kept at each decoding step.
For starters, we have an input sentence $x$ that has been encoded by the encoder network.
We now want to find a good candidate word $y^{<1>}$ with which we can start the translated sentence.
In Greedy Search, we would use the most likely candidate, i.e. $\arg\max_{y^{<1>}} P(y^{<1>} | x)$.
In Beam Search, we consider the $B$ most likely candidates $y_1^{<1>}$ to $y_B^{<1>}$, i.e. those that yield the $B$ highest
probabilities $P(y_k^{<1>} | x)$.

In the second step, i.e. when trying to find $y^{<2>}$, we consider the $B$ probabilities

- $P(y_1^{<1>} | x)$
- ...
- $P(y_B^{<1>} | x)$

and determine $P(y^{<2>}, y_k^{<1>} | x) = P(y^{<2>} | x, y_k^{<1>}) \cdot P(y_k^{<1>} | x)$.

We then again take the $B$ most likely combinations, i.e. the $B$ pairs $(y_m^{<2>}, y_k^{<1>})$ with the highest
probabilities $P(y_m^{<2>}, y_k^{<1>} | x)$.
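The pruning loop above can be sketched in a few lines. This is a toy illustration, not the course's implementation: `step_probs` is a hypothetical stand-in for the decoder that returns $P(\text{word} \mid x, \text{prefix})$ for every word in the vocabulary, and the decode length is fixed for simplicity.

```python
def beam_search(step_probs, vocab, B, steps=3):
    """Keep the B most probable partial sequences at every decoding step."""
    beams = [((), 1.0)]  # (partial sequence, P(sequence | x))
    for _ in range(steps):
        candidates = []
        for seq, p in beams:
            next_probs = step_probs(seq)  # P(word | x, seq) for each word
            for word in vocab:
                # chain rule: P(seq, word | x) = P(word | x, seq) * P(seq | x)
                candidates.append((seq + (word,), next_probs[word] * p))
        # prune: keep only the B highest-probability candidates
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
    return beams

# Toy stand-in for the decoder: "a" is always slightly more likely than "b".
vocab = ["a", "b"]
def step_probs(prefix):
    return {"a": 0.6, "b": 0.4}

beams = beam_search(step_probs, vocab, B=2)
print(beams[0])  # the single most probable sequence found, with its probability
```

With $B = 1$ this collapses to Greedy Search; with $B = |V|^{\text{steps}}$ it would be exhaustive search.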

Why not Greedy Search? Because Greedy Search (GS) only uses the single most probable word as the next word, it can commit to a locally likely but globally suboptimal sentence.

For example, instead of "Jane is visiting Africa in September", GS would choose "going" instead of "visiting", since "going" is a more frequent word in English and thus has a higher probability of occurring.

BS instead considers several candidates in parallel, so a word that is locally less likely can still win if it leads to a more probable sentence overall.

Instead of computing the product of many small probabilities (which quickly underflows numerically), we can simply compute the sum of the log probabilities. Due to the monotonicity of log, this results in the same argmax.

Beam Search in general prefers rather short sentences: every additional word multiplies in another factor smaller than 1, so longer sentences automatically receive lower probabilities.

So we can simply normalize the sum of log probabilities by multiplying with the factor $\frac{1}{T_y^\alpha}$, where $T_y$ is the length of the output. $\alpha$ is something around 0.7, which just happens to be a reasonable choice; no theoretical reason is given.
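A minimal sketch of this length normalization, assuming we already have the per-word log probabilities of a candidate (the function name `normalized_score` is my own):

```python
import math

def normalized_score(log_probs, alpha=0.7):
    # (1 / T_y**alpha) * sum_t log P(y^<t> | x, y^<1>, ..., y^<t-1>)
    T_y = len(log_probs)
    return sum(log_probs) / (T_y ** alpha)

# Two candidates with identical per-word probability but different lengths:
short = [math.log(0.5)] * 4
long_ = [math.log(0.5)] * 8

# The raw sum always favors the shorter sentence ...
print(sum(short) - sum(long_))
# ... length normalization shrinks that bias considerably.
print(normalized_score(short) - normalized_score(long_))
```

With $\alpha = 1$ this becomes a plain per-word average; with $\alpha = 0$ it is the unnormalized objective, so $\alpha$ interpolates between the two.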

Increasing the beam width results in

- better results, i.e. a better job of maximizing $P(y | x)$, but
- slower decoding (logically)
- and much higher memory usage (logically)

Obviously, BS is not guaranteed to find the exact maximum, as we prune away lower-probability paths. This is in contrast to BFS or DFS, which consider the whole search space but would use far more memory and run far slower.

For error analysis, let $y^*$ be the human reference translation and $\hat{y}$ the algorithm's output. We let the RNN compute $P(y^* | x)$ and $P(\hat{y} | x)$.

Case 1: $P(y^* | x) > P(\hat{y} | x)$. Beam Search chose $\hat{y}$ although $y^*$ has a higher probability. This means that Beam Search is at fault, since it pruned away a good path. Here we could increase the beam width $B$ to maybe find that path.

Case 2: $P(y^* | x) \le P(\hat{y} | x)$. According to the RNN, the actual correct translation achieves a lower probability than the BS-generated solution. This means that the RNN itself assigns wrong probabilities and is at fault.
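The two cases boil down to a one-line decision rule. A sketch, with a hypothetical helper name of my own choosing:

```python
def attribute_error(p_y_star, p_y_hat):
    """Decide whom to blame for a bad translation, given the RNN's
    probabilities for the human reference y* and the beam output yhat."""
    if p_y_star > p_y_hat:
        # Beam Search pruned away a sequence the model actually prefers.
        return "beam search at fault"
    # The model ranks the wrong translation at least as high as the correct one.
    return "RNN at fault"

print(attribute_error(0.9, 0.5))  # case 1: blame the search
print(attribute_error(0.2, 0.5))  # case 2: blame the model
```

Running this rule over many errors from a dev set tells us whether our time is better spent increasing $B$ or improving the model.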