* '''4. Decoding:''' Begin with the start-of-sequence token. The model produces scores for the next token in each candidate sequence and passes them through an output layer with a softmax function that normalizes them into probabilities. Beam Search then chooses which paths to continue following and prunes the rest.
* '''5. Finalize Output:''' Select the beam with the highest probability as the final translation, then convert the tokenized output back into a readable sentence with punctuation, capitalization, etc.
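A minimal sketch of steps 4 and 5, assuming a hypothetical model interface: <code>next_token_log_probs(seq)</code> is a placeholder that returns a mapping from each candidate next token to its log probability, and <code>detokenize()</code> stands in for whatever post-processing restores punctuation and capitalization.

<syntaxhighlight lang="python">
def beam_search_decode(model, detokenize, beam_width=3, max_len=20,
                       sos="<sos>", eos="<eos>"):
    # Each beam is a (token list, cumulative log-probability) pair,
    # starting from the start-of-sequence token (step 4).
    beams = [([sos], 0.0)]

    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:            # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            # Score every possible next token for this beam (hypothetical API).
            for tok, logp in model.next_token_log_probs(seq).items():
                candidates.append((seq + [tok], score + logp))
        # Keep only the top-k paths; everything else is pruned.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]

    # Step 5: take the highest-scoring beam and turn it back into readable text.
    best_seq, _ = max(beams, key=lambda b: b[1])
    return detokenize(best_seq)
</syntaxhighlight>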
Now, to break down the steps of Beam Search:
Conditional Probability Notes:
* After the algorithm is finished, we take the best option from the candidate sequences as our output.
If we have at most <math>m</math> words in our final sentence, each state has <math>n</math> word options, and we maintain a maximum beam width of <math>k</math>, this algorithm runs with a time complexity of <math>O(nmk\log(k))</math>.
[[File:Screenshot 2024-05-21 at 7.50.48 PM.png|thumb]]
* A lower beam width <math>k</math> means a faster runtime but less diversity and less optimal results.
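For a rough sense of scale (the numbers here are illustrative assumptions, not values from this page): with a vocabulary of <math>n = 10000</math> tokens, a maximum length of <math>m = 20</math>, and a beam width of <math>k = 5</math>, each of the <math>m</math> steps scores <math>nk = 50000</math> candidate extensions and keeps the best <math>k</math> of them (roughly <math>nk\log k</math> comparisons with a size-<math>k</math> heap), for a total on the order of

<math>n \cdot m \cdot k \cdot \log_2 k \approx 10000 \cdot 20 \cdot 5 \cdot 2.32 \approx 2.3 \times 10^6</math> operations.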
=== Other Notes ===
* '''Softmax Function:'''
[[File:Screenshot 2024-05-21 at 6.56.45 PM.png|thumb]]
* ensures that all probabilities at each state are in the range (0,1) and sum to 1
* '''Log semiring trick:'''
* Using the [https://en.wikipedia.org/wiki/Log_semiring log semiring] to deal with very small probability values: given <math>P</math> as the cumulative probability score of all words in the sequence so far, <math>P=p_0 \cdot p_1 \cdot p_2 \cdots p_n</math>
* Since the value of <math>P</math> can become very small, we can run into computational rounding errors. One strategy around this is to work with the sum of the natural logs of the individual probabilities and use that to compare values of <math>P</math> (see the sketch after this list). This works because:
* <math>P=p_0 \cdot p_1 \cdot p_2 \cdots p_n</math>
* <math>\log(P)=\log(p_0)+\log(p_1)+\log(p_2)+\cdots+\log(p_n)</math>
* since <math>\log</math> is monotonically increasing, <math>P_1>P_2 \Leftrightarrow \log(P_1)>\log(P_2)</math>, so values can still be compared in this form
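A short sketch of both points, softmax normalization and the log-probability comparison, using only the Python standard library; the score and probability lists below are made-up illustrative values:

<syntaxhighlight lang="python">
import math

def softmax(scores):
    # Subtract the max score before exponentiating for numerical stability;
    # each output lies in (0, 1) and the outputs sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.1]))   # ~[0.659, 0.242, 0.099], sums to 1

# Two hypothetical candidate sequences as lists of per-token probabilities.
seq_1 = [0.4, 0.3, 0.2, 0.1, 0.05]
seq_2 = [0.5, 0.1, 0.2, 0.2, 0.05]

# Multiplying many per-token probabilities drives P toward 0 and can
# underflow for long sequences, so we sum logs instead.
P_1 = math.prod(seq_1)
P_2 = math.prod(seq_2)
log_P_1 = sum(math.log(p) for p in seq_1)
log_P_2 = sum(math.log(p) for p in seq_2)

# The ordering is preserved because log is monotonically increasing.
assert (P_1 > P_2) == (log_P_1 > log_P_2)
</syntaxhighlight>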