What Is the Attention Mechanism?

Table Of Contents:

  1. Problem With the Encoder & Decoder Architecture.
  2. Solution For the Encoder & Decoder Architecture.
  3. The Math Behind the Attention Mechanism.
  4. Improvements Due To the Attention Mechanism.
  5. Problems With the Above Architecture.

(1) Problem With the Encoder & Decoder Architecture.

Problem With the Encoder:

  • The main idea behind the Encoder & Decoder architecture is that the Encoder summarizes the entire input text into a single vector, and from that vector the Decoder must produce the translation in a different language.
  • Consider the following exercise: read an entire sentence once, keep all the words in mind, and translate it into Hindi without looking at the sentence again.
  • The answer is that you can't do this by reading the paragraph only once.
  • It is quite difficult for anyone to read it once, summarize it, hold it in memory, and translate it all at once.
  • The same problem is faced by the Encoder & Decoder network.
  • Our Encoder & Decoder network works by reading the entire paragraph, summarizing it into a Context Vector, and passing that vector to the Decoder layer for the Hindi translation.
  • The problem with the Encoder & Decoder architecture is that for longer sentences (say, more than 25 words) the entire load falls on the Context Vector, which must squeeze the summary of every word into one fixed-size vector.

Problem With the Decoder:

  • In our example, the input sentence is "turn off the lights". You can see that the first output is "लाइटें" ("lights"); to produce "लाइटें", we pass all the words [turn, off, the, lights] to the Decoder layer as a single Context Vector.
  • But we only need the word 'lights' to produce that Hindi word.
  • Likewise, to produce "बंद" ("off") we only need the words [turn, off], not every word.
  • Now we understand the problem: at any point in time, the Decoder does not need the entire sentence to produce its output; it needs a specific set of words relevant to the translation at that moment.
  • The problem with the Decoder architecture is that we pass the entire sentence to the Decoder layer at every time step, which is unnecessary.
  • Because of this static input Context Vector, the Decoder struggles with the translation.
  • It would be better to pass a dynamic input to the Decoder layer, giving attention to the words that matter for the current output.

(2) Solution For the Encoder & Decoder Architecture.

  • The solution mirrors what a human being does while reading a sentence.
  • While reading an entire paragraph, we focus on a few words at a time; we don't keep all the earlier and later words in mind while reading.
  • Likewise, while translating a particular point in a sentence, we don't need all the words of the sentence.
  • We focus on one particular word (or a small group of words) at a time.
  • This solution is called the Attention Mechanism: giving attention to particular words at each step.
  • You can see in the above image that while reading a text we create an attention span where our focus lies.
  • We blur the surrounding words while reading.
  • Somehow we need to introduce this concept into our architecture.
  • While producing "लाइटें", we need to tell our Decoder layer to focus on time step 4 of the Encoder layer.
  • While producing "बंद", we need to tell our Decoder layer to focus on time steps 1 & 2 of the Encoder layer.
  • Hence, at every time step of the Decoder layer, we have to dynamically generate this information: which time steps of the Encoder layer are important right now.
  • This mechanism is called the Attention Mechanism.

(3) The Math Behind the Attention Mechanism.

Mathematical Notation:

  • We denote the hidden states of the Encoder layer as (hi).
  • In our example ("turn off the lights"), we have four hidden states, [h1, h2, h3, h4], one per input word.
  • All of these are n-dimensional vectors.
  • We denote the hidden states of the Decoder layer as (Si), and the input to the Decoder layer at each step as (Yi).
  • In our example, the Decoder states are [S0, S1, S2, S3, S4], where S0 is the initial state.
  • All of these are also n-dimensional vectors (a small setup sketch follows this list).
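To make this notation concrete, here is a minimal NumPy sketch; the hidden size n = 8 and the random values are purely illustrative assumptions, not part of the original architecture:

    import numpy as np

    n = 8       # assumed hidden-state dimension (illustrative)
    T_enc = 4   # encoder time steps: ["turn", "off", "the", "lights"]

    # Encoder hidden states h1..h4, one n-dimensional vector per input word.
    h = np.random.randn(T_enc, n)

    # Decoder hidden states S0..S4; S0 is the initial state, commonly
    # initialized from the final Encoder state.
    S = np.zeros((T_enc + 1, n))
    S[0] = h[-1]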

Implementing the Attention Mechanism:

  • At every time step of the Decoder layer, we must pass in which time steps of the Encoder layer are useful.
  • Without attention, at time step 2 the Decoder needs [S1, Y1] as input; at every time step it needs these two pieces of information.
  • But in the Attention Mechanism, we need to pass one more piece of information: which time steps of the Encoder layer are important right now.
  • For example, while producing "लाइटें" we need to tell our Decoder layer to focus on time step 4 of the Encoder layer.
  • This third input, passed to the Decoder layer at every time step, is called the 'Attention Input'.
  • We denote the Attention Inputs as (Ci). In our example, we have 4 Attention Inputs, [C1, C2, C3, C4].
  • Hence, at every time step of the Decoder layer, we need to give 3 inputs (a sketch of one such step follows this list):
  • Inputs = [Yi-1, Si-1, Ci]
  • Where Yi-1 = Teacher Forcing Input (the previous target word during training),
  • Si-1 = Hidden State Input (the previous Decoder state),
  • Ci = Attention Input.
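As a rough sketch of such a decoder step (a simplified tanh recurrence, not the exact cell used in any particular paper; the weight matrices W, U, b are hypothetical placeholders for learned parameters):

    import numpy as np

    def decoder_step(y_prev, s_prev, c_i, W, U, b):
        # The new Decoder state S_i depends on all three inputs:
        # Y_{i-1} (teacher forcing), S_{i-1} (previous state), C_i (attention).
        x = np.concatenate([y_prev, c_i])       # combine Y_{i-1} and C_i
        return np.tanh(W @ x + U @ s_prev + b)  # next hidden state S_i

Concatenating the attention input with the word input before the recurrent update is one common way of feeding Ci into the Decoder.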

Understanding the Attention Input Ci:

What Is Ci?

  • Is Ci a scalar, a vector, or a matrix? And if it is a vector, what is its dimension?
  • While producing "लाइटें", we know that the Encoder hidden state (h4) is the important one.
  • Our task is to find out which hidden state(s) of the Encoder layer are important for producing "लाइटें".
  • We can then pass that hidden state of the Encoder layer as the Attention Input to the Decoder layer.
  • So Ci is a vector, because what we pass as Ci is built from the Encoder's hidden states.
  • Now the question is: what is the dimension of Ci?

What Is the Dimension of Ci?

  • While producing "लाइटें", we tell the Decoder layer to focus on time step 4 of the Encoder layer.
  • While producing "बंद", we tell the Decoder layer to focus on time steps 1 & 2 of the Encoder layer.
  • From this, it looks like there is no fixed size for the "Ci" vector.
  • In the first case Ci = [h4], but in the second case Ci = [h1, h2].
  • To keep the architecture consistent, the dimension of "Ci" should be the same as the dimension of a single "hi" vector.
  • So what do we do when more than one hi must contribute to the Ci vector, as in our second case?
  • The answer: we take a weighted sum of all the hidden states of the Encoder layer, which always yields a single n-dimensional vector.

How To Calculate Ci?

  • If you are producing "लाइटें", you need to know which of the Encoder hidden states [h1, h2, h3, h4] is most useful.
  • The Attention Mechanism works by assigning a weight to each hidden state of the Encoder layer.
  • Let's denote these weights as (αij): the weight given to Encoder state hj when producing Decoder output i.
  • Alpha is called an Alignment Score.
  • Ci is the weighted sum of all the hidden states of the Encoder.
  • The importance of the hidden state vector (hj) at any Decoder time step is captured by the corresponding weight (αij).
  • The formula for Ci is (a numerical sketch follows this list):
  • Ci = Σj αij · hj, summing over all Encoder time steps j.
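Here is a minimal NumPy sketch of this weighted sum; the weight values are made up for illustration, imagining the step that produces "बंद", where 'turn' and 'off' should dominate:

    import numpy as np

    h = np.random.randn(4, 8)                    # h1..h4, hidden size 8 (assumed)
    alpha = np.array([0.45, 0.45, 0.05, 0.05])   # α2j: 'turn' and 'off' dominate

    # C2 = sum_j alpha_2j * h_j — the same dimension as any single h_j.
    C2 = (alpha[:, None] * h).sum(axis=0)
    print(C2.shape)                              # (8,)

Because the weights sum to 1, Ci always stays a single n-dimensional vector, no matter how many hidden states contribute to it.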

How To Calculate α?

  • Let us take the example of α21 and understand how to calculate it.
  • We call α a similarity score or alignment score.
  • First, we need to know which factors α depends on.
  • Since α21 is a similarity score, it measures the similarity between the word the Decoder is about to produce ("बंद") and the vector "h1", which represents 'turn'.
  • Hence α depends on the Encoder hidden state vector 'hj'.
  • α also depends on the previous hidden state of the Decoder layer, the vector Si-1.
  • Let's understand why α depends on Si-1.
  • What the Attention Mechanism asks is: "Given whatever translation the Decoder layer has produced so far, how important is each Encoder hidden state vector for producing the next word?"
  • In our example: given "लाइटें" (what has been produced so far), what is the importance of the hidden state vector "h1", which represents the word "turn"?
  • Hence α depends on the hidden state vector 'hj' and on Si-1, so we can write αij = f(Si-1, hj) for some function f.
  • In other words, α21 here depends on the two words "लाइटें" and "turn".
  • Now the question is: what is this function 'f'?
  • How do we derive the function 'f'?
  • One approach is to try out different mathematical functions, compare the results of each, and keep whichever works best.
  • But that would be extremely time-consuming.
  • The researchers found a smarter way.
  • We already have the power of the Artificial Neural Network (ANN), which is a universal function approximator.
  • If we provide enough data to an ANN, we can use it to approximate almost any function.
  • The ANN adjusts its weights so that it approximates the required function.
  • But we never see that function explicitly; it is learned inside the architecture.
  • How will this neural network be trained?
  • The answer: it is trained jointly with the Encoder & Decoder layers, so the alignment scores are learned along with everything else (see the sketch below).
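One well-known concrete choice for this small network is the additive (Bahdanau-style) scoring function, eij = vᵀ tanh(W·Si-1 + U·hj), followed by a softmax over the Encoder steps j so the weights sum to 1. A minimal NumPy sketch, where the weight names and sizes are illustrative assumptions:

    import numpy as np

    def alignment_weights(s_prev, h, W, U, v):
        # e_j = v^T tanh(W @ s_prev + U @ h_j) for every encoder step j,
        # then a softmax turns the scores into weights alpha that sum to 1.
        e = np.tanh(s_prev @ W.T + h @ U.T) @ v
        e = e - e.max()                     # for numerical stability
        return np.exp(e) / np.exp(e).sum()

    n, d = 8, 16                            # hidden / score sizes (assumed)
    W, U = np.random.randn(d, n), np.random.randn(d, n)
    v = np.random.randn(d)
    h = np.random.randn(4, n)               # h1..h4
    s_prev = np.random.randn(n)             # S_{i-1}
    alpha = alignment_weights(s_prev, h, W, U, v)   # one weight per h_j

During training, W, U and v receive gradients through the attention weights, so this scoring function is learned jointly with the rest of the network.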

(4) Improvements Due To the Attention Mechanism.

  • In the above diagram, the X-axis represents the sentence length.
  • The Y-axis represents the BLEU score, which measures the quality of the translation.
  • As you can see in the graph, for the 2nd, 3rd and 4th models the BLEU score falls as the sentence length increases.
  • But in the 1st model, the BLEU score holds steady even beyond 30 words.
  • That is because the first model is based on the Attention Mechanism.
  • The next image shows the attention weights for an English-to-French translation.
  • A white box represents a strong correlation (a high attention weight).
  • A black box represents a weak correlation.
  • You can see that related word pairs have the higher attention weights (a plotting sketch follows this list).
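To draw such an alignment plot yourself, a short matplotlib sketch would look like this; the 2×4 attention matrix is invented for our running example, not real model output:

    import numpy as np
    import matplotlib.pyplot as plt

    # Rows: Decoder outputs; columns: Encoder words. Entries are alpha_ij.
    attn = np.array([[0.05, 0.05, 0.05, 0.85],   # "लाइटें" attends to "lights"
                     [0.45, 0.45, 0.05, 0.05]])  # "बंद" attends to "turn", "off"

    plt.imshow(attn, cmap="gray", vmin=0, vmax=1)  # white = high weight
    plt.xticks(range(4), ["turn", "off", "the", "lights"])
    plt.yticks(range(2), ["लाइटें", "बंद"])
    plt.show()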

(5) Problems With the Above Architecture.

  • You can see that the training is done sequentially, meaning we must process one word at a time, which increases the training time.
  • The model can still fail on sentences of more than about 30 words, because a single context vector cannot encode long sentences.
