Word2Vec Word Vectorization

Table Of Contents:

  1. What Is Word Embedding?
  2. Types Of Word Embedding.
  3. What Is Word2Vec?
  4. Why Are Word Embeddings Needed?
  5. How Does The Word2Vec Model Work?
  6. Pretrained Word2Vec Model.
  7. What Do The 300 Dimensions Signify?
  8. Intuition Behind Word2Vec.
  9. Assumption Behind Word2Vec Model.
  10. Architecture Of Word2Vec Model.
  11. Continuous Bag Of Words(CBOW).
  12. Skip-Gram Word2Vec.
  13. When To Use CBOW & Skip-Gram?
  14. How To Increase The Performance Of The Word2Vec Model?
  15. Train Word2Vec Model With Game Of Thrones Dataset.

(1) What Is Word Embedding?

  • Word embedding is a fundamental technique in natural language processing (NLP) that represents words as dense, low-dimensional vectors of real numbers.
  • The key purpose of word embeddings is to capture the semantic and syntactic relationships between words in a way that machine learning models can effectively leverage.
  • Word embedding is the term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that the words closer in the vector space are expected to be similar in meaning.
  • The main characteristics of word embeddings are:
  1. Dense Representation: Instead of representing words as sparse one-hot encoded vectors, word embeddings map each word to a dense vector of typically 100-300 dimensions.

  2. Semantic And Syntactic Relationships: The vector representation of words in the embedding space encodes the semantic and syntactic similarities between words. Words with similar meanings or grammatical roles tend to have nearby vector representations.

  3. Continuous Values: The vector values in a word embedding are continuous, unlike the discrete values in one-hot encoding. This allows the model to learn smooth relationships between words.

  4. Learned From Data: Word embeddings are typically learned from large text corpora using unsupervised learning techniques, such as word2vec, GloVe, or FastText. The goal is to learn the vector representations that best capture the relationships between words in the training data.

  • The benefits of using word embeddings in NLP models include:
  • Improved Generalization: The learned semantic relationships allow the model to better generalize to unseen words or contexts.
  • Reduced Model Complexity: The dense, low-dimensional vectors require fewer parameters in the model compared to one-hot encoding.
  • Ability To Capture Analogies: Word embeddings can capture analogical relationships between words, such as “king” is to “man” as “queen” is to “woman”.
  • Transfer Learning: Pre-trained word embeddings can be used as input to downstream NLP models, allowing the model to leverage the learned representations without having to train from scratch.

(2) Types Of Word Embedding.

  • There are two types of word embeddings.
    1. Frequency Based.
    2. Prediction Based.

Frequency Based:

  • Frequency-based word embeddings learn the vector representations of words from the frequency of word co-occurrences (a short sketch follows the list of examples below).
  • Examples Are:
    1. Bag Of Words.
    2. N-Grams.
    3. TF-IDF.
    4. GloVe.
    5. Matrix Factorization.
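  • To make the frequency-based idea concrete, here is a minimal sketch using scikit-learn's CountVectorizer and TfidfVectorizer (assuming scikit-learn is installed; the two example sentences are illustrative):

# A minimal sketch of frequency-based representations using scikit-learn.
# The two sentences are illustrative; any small corpus works the same way.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["You can scale your business", "You can grow your business"]

# Bag of Words: each sentence becomes a vector of raw word counts
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())

# TF-IDF: counts are re-weighted by how rare each word is across the corpus
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray())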

Prediction Based:

  • Prediction-based word embeddings are a type of word embedding technique that learns the vector representations of words by predicting the surrounding context of a word in a given text corpus. 
  • Examples Are:
    1. Word2Vec

(3) What Is Word2Vec?

  • The Word2Vec algorithm is a widely used neural network-based approach for learning distributed representations of words, also known as word embeddings.
  • It was developed by researchers at Google, including Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean.
  • Word embeddings help establish associations between words with similar meanings through the created vectors.
  • When word embeddings are plotted in the vector space, similar-meaning words appear closer together, indicating their semantic similarity.

(4) Why Are Word Embeddings Needed?

  • Consider the two sentences –
    • “You can scale your business.” and
    • “You can grow your business.”.
  • These two sentences have the same meaning. If we build a vocabulary from these two sentences, it will consist of the following words:
  • {You, can, scale, grow, your, business}.
  • A one-hot encoding of these words would create a vector of length 6.
  • The encodings for each of the words would look like this:
  • You: [1,0,0,0,0,0], Can: [0,1,0,0,0,0], Scale: [0,0,1,0,0,0], Grow: [0,0,0,1,0,0], Your: [0,0,0,0,1,0], Business: [0,0,0,0,0,1]

  • In a 6-dimensional space, each word occupies its own dimension, meaning that none of these words has any similarity with any other, irrespective of their literal meanings (the short sketch below demonstrates this).
  • Word2Vec, a word embedding methodology, solves this issue and enables similar words to have similar dimensions and, consequently, helps bring context.
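  • Here is a minimal NumPy sketch that builds exactly these one-hot vectors and shows why no two of them look similar:

# A minimal one-hot encoding sketch for the six-word vocabulary above.
import numpy as np

vocab = ["you", "can", "scale", "grow", "your", "business"]
word2idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[word2idx[word]] = 1.0
    return vec

# The dot product (and hence cosine similarity) between any two different
# one-hot vectors is 0, so 'scale' and 'grow' look completely unrelated.
print(one_hot("scale") @ one_hot("grow"))   # 0.0
print(one_hot("scale") @ one_hot("scale"))  # 1.0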

(5) How Does The Word2Vec Model Work?

  • You can work with the Word2Vec model in two different ways.
    1. Using A Pretrained Model.
    2. Using A Self-Trained Model.
  • Pretrained models are models that have already been trained on a publicly available dataset.
  • Self-trained models are models that we train on our own dataset.

(6) Pretrained Word2Vec Model.

  • We will use pre-trained weights of word2vec that was trained on Google News corpus containing 3 billion words.
  • This model consists of 300-dimensional vectors for 3 million words and phrases.
import gensim
from gensim.models import Word2Vec, KeyedVectors
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
  • Here ‘GoogleNews-vectors-negative300.bin’ is the pre-trained model that we will use to convert words to vectors.
# Look up the 300-dimensional vector for a single word
model['man']
model['man'].shape   # (300,)
model['woman']
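  • If the .bin file is not available locally, the same vectors can also be fetched through gensim's downloader API (assuming the 'word2vec-google-news-300' entry in gensim-data; note this is a large download on first use):

# Alternative: fetch the same Google News vectors via gensim's downloader API.
import gensim.downloader as api

model = api.load('word2vec-google-news-300')   # returns a KeyedVectors object
print(model['man'].shape)                      # (300,)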

(7) What Do The 300 Dimensions Signify?

  • These 300 numbers represent 300 learned features of that word.
  • In other words, they are the 300 dimensions of the word's vector.
  • Together, these numbers encode the semantic meaning of the word.
  • Find out the most similar words for the word ‘man’.
model.most_similar('man')
  • First, the word ‘man’ is looked up as a 300-dimensional vector.
  • Then, the vectors most similar to the ‘man’ vector are found.
  • The similarity is calculated using cosine similarity.
model.most_similar('happy')
  • Finding similarity between two words.
model.similarity('man', 'woman')
model.similarity('police', 'thief')
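  • Since the similarity score is just the cosine of the angle between the two word vectors, we can verify it by hand; a quick NumPy sketch (the result should match model.similarity up to floating-point error):

# Verify that model.similarity is plain cosine similarity between the vectors.
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(model['man'], model['woman']))   # should match model.similarity('man', 'woman')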
  • Finding The Odd One Out Among A Set Of Words.
model.doesnt_match(['PHP','Java','Monkey'])
  • Doing Vector Arithmetic Operations.
# king - man + woman should land near 'queen'
vec = model['king'] - model['man'] + model['woman']
vec
model.most_similar(vec)
# currency analogy: INR is to India as ? is to England
vec = model['INR'] - model['India'] + model['England']
model.most_similar(vec)

(8) Intuition Behind Word2Vec

  • The algorithm assumes that if two words appear in similar contexts (i.e., have similar surrounding words), then they are likely to be semantically similar.
  • By learning to predict the surrounding context words for a given word, the algorithm can capture the semantic relationships between words.
  • Suppose we have 5 words in our vocabulary: {‘King’, ‘Queen’, ‘Man’, ‘Woman’, ‘Monkey’}
  • If we use a deep learning model, it will create some features to represent the words.
  • Suppose it has created the features {‘Gender’, ‘Wealth’, ‘Power’, ‘Weight’, ‘Speak’}.
  • These features are created based on the words in the vocabulary.
  • Based on the features, the model will assign values, say from 0 to 1 (an illustrative sketch follows this list).
  • For example, for ‘King’ the Gender feature might be 1, while for ‘Queen’ it is 0.
  • In the real world, the vocabulary contains lakhs (hundreds of thousands) of words.
  • In our Google News example, we have a vocabulary of 30 lakh (3 million) words.
  • In the Google News example, our Word2Vec model has created 300 features to represent a word.
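  • To make this concrete, here is a purely illustrative sketch: the feature values below are invented for the example and are not real Word2Vec outputs.

# Purely illustrative feature values for the toy vocabulary above.
# Rows: words; columns: hypothetical features (Gender, Wealth, Power, Weight, Speak).
import numpy as np

words = ['King', 'Queen', 'Man', 'Woman', 'Monkey']
features = ['Gender', 'Wealth', 'Power', 'Weight', 'Speak']
values = np.array([
    [1.0, 0.9, 0.9, 0.7, 1.0],   # King
    [0.0, 0.9, 0.8, 0.5, 1.0],   # Queen
    [1.0, 0.3, 0.4, 0.6, 1.0],   # Man
    [0.0, 0.3, 0.4, 0.5, 1.0],   # Woman
    [1.0, 0.0, 0.1, 0.3, 0.0],   # Monkey
])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Words with similar feature values end up close in this space.
king, queen, monkey = values[0], values[1], values[4]
print(cosine(king, queen))    # relatively high
print(cosine(king, monkey))   # lower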

Note:

  • Manually creating features for 30 lakh words would be far too difficult.
  • We need a technique that does this automatically.
  • To solve this, we take the help of neural networks.
  • One downside of using a neural network is that you will not know what the features it creates actually represent.
  • The features will simply be labelled F1, F2, F3, and so on.
  • In the image above, the most similar feature values are shown towards the red end of the colour scale, and dissimilar values towards the blue end.
  • In the second image, the marked feature for the word ‘Water’ has a different colour from the other words.
  • That is because the word ‘water’ is different from the other words; this feature could be something like ‘alive/dead’.
  • For the second marked feature, all the values are the same, because that feature must be common to all the words.

(9) Assumption Behind Word2Vec Model.

  • The underlying assumption of the Word2Vec model is that two words sharing similar contexts also share a similar meaning and, consequently, a similar vector representation.
  • Consider the sentences below:
    • The “football” player took a shot.
    • The “hockey” player took a shot.
  • Here you can notice that “football” and “hockey” are used in the same context.
  • Hence the Word2Vec model will create similar feature values for these two words.

(10) Architecture Of Word2Vec Model.

  • There are two architectures of the Word2Vec model.
    1. Continuous Bag-Of-Words (CBOW).
    2. Skip-Grams

(11) Continuous Bag-Of-Words (CBOW).

  • We cannot directly convert a word into a vector with Word2Vec; instead, we set up a fake (auxiliary) prediction problem and try to solve it.
  • In the process of solving it, we get the vector representation of each word as a byproduct.
  • Suppose we have a vocabulary of 5 words.
  • {‘Watch’, ‘YouTube’, ‘For’, ‘Data’, ‘Science’}.
  • Our goal is to convert all 5 of these words into vectors.
  • To do that, we create a dummy problem, because per the Word2Vec assumption the vector representation of a word is built from its context.
  • Let’s take an example.
  • Here we want to convert ‘YouTube’ into a vector, so we call it the ‘target word’.
  • The words before and after it are called ‘context words’.
  • The target word always appears together with its context words.
  • The dummy problem: given the context words, predict the target word.
  • This becomes our prediction problem.
  • Let’s take a 3-word window: the word in the middle is the target word and the words on either side are the context words ( __ target __ ).
  • With a 5-word window, the middle word is the target and there are two context words on each side ( __ __ target __ __ ).
  • Let us make the training data out of it:
  • Step-1: First we convert all the input words into one-hot encoded vectors.
  • Step-2: Create The Neural Network For The Input Data.
  • Input Nodes: We have two context words to feed into the input layer, and each one-hot vector has length 5, so the input length is 2 * 5 = 10.
  • Hidden Layer: We have decided on a 3-dimensional vector representation for each word, so the hidden layer has 3 nodes.
  • Output Layer: The output layer has 5 nodes and represents a probability for each word in the vocabulary.
  • Step-3: Now Train Our Neural Network With Input Data.
  • Suppose we pass the two context words ‘Watch’ and ‘For’ and get the predicted probabilities shown above.
  • Now we compare these predicted probabilities [0, 0.3, 0.2, 0.3, 0.2] with the actual output.
  • The actual output should be ‘YouTube’, i.e. [0,1,0,0,0].
  • Here you can see that the predicted output does not match the actual output.
  • Hence we calculate the loss and update the weights of the neural network to minimize it.
  • We repeat these steps for all the training pairs.
  • Now the question is: where are the word embeddings?
  • We focus on the 3 × 5 hidden-to-output weight matrix, where we find a word embedding for each word in the vocabulary.
  • The word embedding for ‘Watch’ will be the weights on the ‘red’ coloured connections.
  • The word embedding for ‘YouTube’ will be the weights on the ‘yellow’ coloured connections.
  • The word embedding for ‘For’ will be the weights on the ‘green’ coloured connections (a minimal training sketch follows this list).
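  • Below is a minimal NumPy sketch of this CBOW setup, assuming the toy 5-word vocabulary, a 3-node hidden layer, and a single (context, target) pair; the initial weights and learning rate are arbitrary, and this illustrates the idea rather than gensim's actual implementation:

# Minimal CBOW sketch: 5-word vocabulary, 3-dimensional hidden layer,
# trained on a single (context -> target) pair for illustration.
import numpy as np

vocab = ['Watch', 'YouTube', 'For', 'Data', 'Science']
V, N = len(vocab), 3
word2idx = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, N))    # input -> hidden weights (V x N)
W_out = rng.normal(scale=0.1, size=(N, V))   # hidden -> output weights (N x V)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

context, target = ['Watch', 'For'], 'YouTube'
lr = 0.1
for _ in range(200):
    h = W_in[[word2idx[w] for w in context]].mean(axis=0)  # hidden layer = average of context vectors
    y = softmax(h @ W_out)                                 # predicted probabilities over the vocabulary
    e = y.copy()
    e[word2idx[target]] -= 1.0                             # gradient of cross-entropy w.r.t. output scores
    grad_h = W_out @ e                                     # backpropagate into the hidden layer
    W_out -= lr * np.outer(h, e)                           # update hidden -> output weights
    for w in context:                                      # update the context rows of the input weights
        W_in[word2idx[w]] -= lr * grad_h / len(context)

# The walkthrough above reads the embedding for 'YouTube' from the
# hidden -> output weights (one column per word); the input-side rows
# of W_in are the other common choice (what gensim exposes as model.wv).
print(W_out[:, word2idx['YouTube']])
print(W_in[word2idx['YouTube']])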

How It Works:

  • The intuition is that after training our neural network with the given inputs and targets, the model learns to predict the target word.
  • When we pass the input words [‘Watch’, ‘For’], our model will predict ‘YouTube’.
  • The output will be [0,1,0,0,0], and the weights activated at this time are the ‘yellow’ coloured ones.
  • Hence we consider these weights as the word embedding of the word ‘YouTube’.
  • ‘YouTube’ = [0.6, 0.5, 0.3].
  • In the same way, we take the ‘green’ connections for the word ‘For’.

Note:

  • If we want to increase the dimensions of the word embedding we can increase the nodes in the hidden layer.
  • You can increase it from 3 to 5 etc.

(12) Skip-Gram Word2Vec

  • The Skip-Gram model does the opposite of CBOW: it predicts the surrounding context words given the current word.
  • The input to the neural network is the embedding vector of the current word, and the output is the predicted context words.
  • The goal is to learn word embeddings that are useful for predicting the context words given the current word.

Example:

  • In this case, given a target word as input, we need to predict the surrounding words.
  • Our neural network model will look like the image below.
  • Now we train our neural network model with the given input and output.
  • After training, the weights will be adjusted to give the correct prediction.
  • We then use those weights as our word embeddings.
  • In the Skip-Gram technique, we take the weights on the input side (left-hand side) of the hidden layer.
  • If you pass ‘YouTube’ as input, the model will produce ‘Watch’ and ‘For’ as output.
  • At this point, the word embedding for the word ‘YouTube’ will be the weights on the ‘green’ coloured connections.

(13) When To Use CBOW & Skip-Gram?

  • Research suggests that Skip-Gram works better with smaller datasets and represents rare words well.
  • CBOW is faster to train and works well with larger datasets and frequent words (the snippet below shows how to choose between them in gensim).
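  • In gensim, this choice is a single constructor flag, sg (0 = CBOW, the default; 1 = Skip-Gram). A quick sketch using the two football/hockey sentences from earlier:

# Choosing between CBOW and Skip-Gram in gensim via the `sg` flag.
from gensim.models import Word2Vec

sentences = [['the', 'football', 'player', 'took', 'a', 'shot'],
             ['the', 'hockey', 'player', 'took', 'a', 'shot']]

cbow_model = Word2Vec(sentences, sg=0, min_count=1, vector_size=50)       # CBOW (default)
skipgram_model = Word2Vec(sentences, sg=1, min_count=1, vector_size=50)   # Skip-Gram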

(14) How To Increase The Performance Of The Word2Vec Model?

  1. Increase the amount of training data.
  2. Increase the dimensions of the vectors. In our case, it was three.
  3. Increase the window size. In our case, it was 3 (the snippet below shows how these map to gensim parameters).
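  • In gensim, these knobs map directly to Word2Vec constructor parameters; the values below are examples, not recommendations:

# The three performance knobs above map to gensim's Word2Vec parameters.
from gensim.models import Word2Vec

model = Word2Vec(
    vector_size=300,   # knob 2: dimensions of the word vectors
    window=10,         # knob 3: context window size
    epochs=10,         # knob 1: more passes over (more) training data
    min_count=2,
    workers=4,
)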

(15) Train Word2Vec Model With Game Of Thrones Dataset.

import numpy as np
import pandas as pd
!pip install gensim
import gensim
import os
import nltk
from nltk import sent_tokenize
from gensim.utils import simple_preprocess

nltk.download('punkt')  # sentence tokenizer models required by sent_tokenize

# Read every book file in the 'data' folder, split it into sentences,
# and tokenize each sentence into a list of lowercase words.
story = []
for filename in os.listdir('data'):
    with open(os.path.join('data', filename), encoding='utf-8', errors='ignore') as f:
        corpus = f.read()
    raw_sent = sent_tokenize(corpus)
    for sent in raw_sent:
        story.append(simple_preprocess(sent))
# Build the model: context window of 10, ignore words appearing fewer than 2 times
model = gensim.models.Word2Vec(
    window=10,
    min_count=2
)
model.build_vocab(story)
model.train(story, total_examples=model.corpus_count, epochs=model.epochs)

# Explore the trained vectors
model.wv.most_similar('daenerys')
model.wv.doesnt_match(['jon', 'rikon', 'robb', 'arya', 'sansa', 'bran'])
model.wv.doesnt_match(['cersei', 'jaime', 'bronn', 'tyrion'])
model.wv['king']
model.wv.get_normed_vectors()
y = model.wv.index_to_key  # list of all words in the learned vocabulary

# Reduce the vectors to 3 dimensions with PCA so we can plot them
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
X = pca.fit_transform(model.wv.get_normed_vectors())
X.shape

# Plot a slice of 100 words in 3-D, coloured by word
import plotly.express as px
fig = px.scatter_3d(X[200:300], x=0, y=1, z=2, color=y[200:300])
fig.show()
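  • Once trained, the model and its vectors can be saved and reloaded later; a quick sketch (the file names are arbitrary):

# Persist the trained model / vectors and load them back later.
from gensim.models import Word2Vec, KeyedVectors

model.save('got_word2vec.model')    # full model (can continue training later)
model.wv.save('got_vectors.kv')     # just the word vectors (lighter)

reloaded = Word2Vec.load('got_word2vec.model')
vectors = KeyedVectors.load('got_vectors.kv')
print(vectors.most_similar('daenerys'))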
