What Is Self Attention ?

(1) What Is The Most Important Thing In NLP Applications?

Before understanding the self-attention mechanism, we must understand the most important thing in any NLP application.
The answer is how you convert any words into numbers.
Our computers don’t understand words they only understand numbers.
Hence the researchers first worked in this direction to convert any words into vectors.
We got some basic techniques like,
1. One Hot Encoding.
2. Bag Of Words’
3. N-Grams
4. TF/IDF
5. Word2Vec
We have already studied all these topics previously.
The most advanced technique we use today is ‘Word2Vec’.
But there are some basic problems involved in the ‘Word2Vec’ model.

In word embedding, we have the power to capture the semantic meaning of the word.
Capturing the semantic meaning means capturing the meaning of the word based on the context it’s being used.
If two words are similar, then their vector representations will also be similar.

The word embedding is done based on other words in that sentence or corpus.
If the word is being used in a different sentence or corpus its vector form will be different.
Each number in the vector format represents some characteristics.

Suppose we have the above corpus with different sentences.
Let’s focus on the word ‘Apple’ and its word embedding vector.
We have decided to keep our word embeddings into a 2-dimensional vector.
The first dimension represents the test and the second dimension represents the technology part.

The problem with Word Embedding is that it does not capture the meaning of the word, it captures the Average Meaning of the word.
On average that particular word in your corpus used in which meaning.
You can see in the above example ‘Apple’ being used as a fruit most of the time.
Hence the vector representation of the ‘Apple’ will inclined towards fruit, not as a phone.
In this 2-dimensional vector [test, technology] the test component will be more compared to the technology component.
Let us assume in our corpus out of 10000 sentences, 9000 sentences contain an Apple as a fruit, and the rest of 1000 sentences contain an Apple as a phone.
The overall vectors would look something like this [0.9, 0.3]. Test component = 0.9 and the Technology component is = 0.3.

The problem is that the ‘Word Embedding’ of that word is done once but used many times in different contexts.
This means Word Embedding is static.
Apple = [0.9, 0.3] We are going to use it every time, but it will fail when we use Apple as the phone in another sentence.
Suppose we are making an NLP application which performs translation between English to Hindi.

The first thing you will do is to convert each words into its respective embedding format.
Embedding of Apple = [text, technology] = [0.9, 0.3].
This is the static embedding, and it is considering Apple as fruit wherever it’s been used.
Hence it’s the demerit of the Word2Vec embedding.
But in our above example, Apple is used as a Phone.

Ideally what should happen is that instead of static embedding we should have contextual embedding.
This means the embedding value should dynamically change based on the context in which the word is being used.
We need a “Smart Contextual Embedding” which will do embedding based on the context it’s been used.
Hence the word embedding value will change in different contexts for the same word.
To solve this problem we have a ‘Self Attention’ mechanism.

Self Attention is a mechanism where it takes static embedding as an input and generate good contextual embedding.
Which are much better to use in any kind of application.