Attention in the age of ADHD
A short explanation of attention and self-attention.
Simon Li
When I first got into machine learning, I kept hearing Transformer this and attention that. Unfortunately and frustratingly, I had absolutely ZERO idea what attention was, even after some brief searching on the internet. Therefore, I will try to explain the concept of attention in a way that is hopefully straightforward even for people who are new to machine learning or computer science.
Why attention?
Before the arrival of Transformers, the Recurrent Neural Network (RNN) was one of the most popular architectures for sequential data. RNNs are known for their ability to capture local temporal dependencies. However, they struggle with long-range dependencies, meaning that they favor recent information over information from further back in the sequence. To combat this weakness in leveraging information from the hidden states of an RNN, the attention mechanism was proposed.
What’s Attention?
In short, attention’s output sequence is a weighted average of the input sequence. More specifically, attention is a function that transforms an input sequence into an output sequence, which does not necessarily have the same length, using a learned, input-dependent weighted average.
Math behind attention
In this section, we are going to formalize how the output sequence is computed as a weighted average of the input sequence.
Suppose we have $n$ and $m$ input and output tokens, respectively. Then, we have an input sequence $V = (v_1, \dots, v_n)$ and an output sequence $Z = (z_1, \dots, z_m)$.
Let $w_{ij}$ be the weight of input token $v_j$ in the output token $z_i$. Then, we have that the output tokens are
$$z_i = \sum_{j=1}^{n} w_{ij}\, v_j.$$
Note that we’d require the weighting coefficients $w_{ij} \geq 0$ and that $\sum_{j=1}^{n} w_{ij} = 1$.
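To make the formula concrete, here is a minimal NumPy sketch (with hypothetical toy shapes) of the output tokens as row-wise weighted averages of the input tokens:

```python
import numpy as np

# Hypothetical toy sizes: n = 4 input tokens of dimension 3, m = 2 output tokens.
V = np.random.randn(4, 3)             # input sequence V, shape (n, d)

# Weighting coefficients w_ij: each row is non-negative and sums to 1.
W = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.7, 0.1, 0.1, 0.1]])  # shape (m, n)

Z = W @ V  # each output z_i = sum_j w_ij * v_j, shape (m, d)
```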
More math behind attention
We now know how the weighted-average output is calculated. But where do the weighting coefficients $w_{ij}$ come from?
Suppose that we are given query tokens $Q = (q_1, \dots, q_m)$ and key tokens $K = (k_1, \dots, k_n)$. Then, we can determine the weight coefficient $w_{ij}$ by calculating how similar $q_i$ and $k_j$ are. We would normally use cosine similarity to measure this; however, using just the numerator of the cosine similarity, which is the dot product $q_i \cdot k_j$, not only works well but also saves a hefty amount of computation.
After we obtain the raw similarity from the dot product, we scale the result by dividing it by $\sqrt{d_k}$, where $d_k$ is the dimension of the query and key tokens. This step is necessary because, due to random initialization, we could end up with a sharp distribution of weight coefficients $w_{ij}$, and the model would need much more time to adjust those initial peaks. The scaling factor keeps the distribution closer to uniform at the start, which encourages faster convergence.
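A quick numerical check (using a hypothetical $d_k = 64$ and unit-variance random vectors) shows why this helps: the raw dot products grow like $\sqrt{d_k}$, while the scaled version stays at unit scale.

```python
import numpy as np

# For random q, k with unit-variance entries, q . k has standard deviation ~ sqrt(d_k).
d_k = 64
q = np.random.randn(10000, d_k)
k = np.random.randn(10000, d_k)

dots = np.sum(q * k, axis=1)
print(dots.std())                   # roughly sqrt(64) = 8 -> sharp softmax
print((dots / np.sqrt(d_k)).std())  # roughly 1 -> flatter, easier-to-train softmax
```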
Lastly, we normalize the scaled values with a softmax to obtain a probability distribution over the input tokens:
$$w_{ij} = \frac{\exp\!\left(q_i \cdot k_j / \sqrt{d_k}\right)}{\sum_{j'=1}^{n} \exp\!\left(q_i \cdot k_{j'} / \sqrt{d_k}\right)}.$$
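Putting the three steps together, a minimal NumPy sketch (with hypothetical shapes and a hypothetical helper name) of the weight computation and the resulting weighted average might look like this:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Sketch of the three steps: dot-product similarity, scaling by sqrt(d_k),
    then a row-wise softmax to get the weights, and finally the weighted average."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # scaled similarities, shape (m, n)
    scores -= scores.max(axis=-1, keepdims=True)   # for numerical stability
    W = np.exp(scores)
    W /= W.sum(axis=-1, keepdims=True)             # softmax: each row sums to 1
    return W @ V                                   # weighted average of the values

# Hypothetical shapes: m = 2 queries, n = 4 keys/values, dimensions 8 and 3.
Q = np.random.randn(2, 8)
K = np.random.randn(4, 8)
V = np.random.randn(4, 3)
Z = scaled_dot_product_attention(Q, K, V)          # shape (2, 3)
```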
Self-Attention
Self-attention is a special case of attention where $m = n$ and all of $Q$, $K$, $V$ are derived from the same input token sequence $X = (x_1, \dots, x_n)$. This means that to calculate $Q$, $K$, and $V$, we have learnable parameters $W_Q$, $W_K$, $W_V$ such that
$$Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V,$$
where $d_k$ is the dimension of the key and query tokens.
Therefore, we have that the output of the self-attention would be
$$Z = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V.$$
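As a sketch (with hypothetical sizes, and randomly initialized matrices standing in for learned parameters), the whole self-attention computation is only a few lines of NumPy:

```python
import numpy as np

# Hypothetical sizes: n = 5 tokens of dimension d = 16, projected down to d_k = 8.
n, d, d_k = 5, 16, 8
X = np.random.randn(n, d)

# Learnable parameters (random here, purely for illustration).
W_Q = np.random.randn(d, d_k)
W_K = np.random.randn(d, d_k)
W_V = np.random.randn(d, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
Z = weights @ V                                  # self-attention output, shape (n, d_k)
```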
Multi-Head Self-Attention
Just like a convolutional layer, where we can run multiple convolutions, we can run multiple attention heads per layer! Suppose the output of each head $h_i$ is given by the $Z_i$ described above. Then, the final output is obtained by concatenating all the individual heads’ outputs and applying a linear transformation $W_O$, where $W_O$ is a learnable parameter:
$$Z = \operatorname{Concat}(Z_1, \dots, Z_h)\, W_O.$$
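Here is a minimal multi-head sketch in NumPy (hypothetical head count and dimensions; real implementations batch the heads into one tensor operation instead of looping):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: n = 5 tokens, model dimension d = 16, h = 4 heads of size d_k = 4.
n, d, h, d_k = 5, 16, 4, 4
X = np.random.randn(n, d)

# One set of learnable projections per head, plus the output projection W_O.
W_Q = np.random.randn(h, d, d_k)
W_K = np.random.randn(h, d, d_k)
W_V = np.random.randn(h, d, d_k)
W_O = np.random.randn(h * d_k, d)

heads = []
for i in range(h):
    Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]
    Z_i = softmax(Q @ K.T / np.sqrt(d_k)) @ V    # each head is ordinary self-attention
    heads.append(Z_i)

Z = np.concatenate(heads, axis=-1) @ W_O         # concat the heads, then project; shape (n, d)
```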